Message ID | 20240405115815.3226315-2-pbonzini@redhat.com (mailing list archive) |
---|---|
State | Handled Elsewhere |
Headers | show |
Series | KVM, mm: remove the .change_pte() MMU notifier and set_pte_at_notify() | expand |
On Fri, Apr 5, 2024 at 5:28 PM Paolo Bonzini <pbonzini@redhat.com> wrote: > > The .change_pte() MMU notifier callback was intended as an > optimization. The original point of it was that KSM could tell KVM to flip > its secondary PTE to a new location without having to first zap it. At > the time there was also an .invalidate_page() callback; both of them were > *not* bracketed by calls to mmu_notifier_invalidate_range_{start,end}(), > and .invalidate_page() also doubled as a fallback implementation of > .change_pte(). > > Later on, however, both callbacks were changed to occur within an > invalidate_range_start/end() block. > > In the case of .change_pte(), commit 6bdb913f0a70 ("mm: wrap calls to > set_pte_at_notify with invalidate_range_start and invalidate_range_end", > 2012-10-09) did so to remove the fallback from .invalidate_page() to > .change_pte() and allow sleepable .invalidate_page() hooks. > > This however made KVM's usage of the .change_pte() callback completely > moot, because KVM unmaps the sPTEs during .invalidate_range_start() > and therefore .change_pte() has no hope of finding a sPTE to change. > Drop the generic KVM code that dispatches to kvm_set_spte_gfn(), as > well as all the architecture specific implementations. > > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> For KVM RISC-V: Acked-by: Anup Patel <anup@brainfault.org> Regards, Anup > --- > arch/arm64/kvm/mmu.c | 34 ----------------- > arch/loongarch/include/asm/kvm_host.h | 1 - > arch/loongarch/kvm/mmu.c | 32 ---------------- > arch/mips/kvm/mmu.c | 30 --------------- > arch/powerpc/include/asm/kvm_ppc.h | 1 - > arch/powerpc/kvm/book3s.c | 5 --- > arch/powerpc/kvm/book3s.h | 1 - > arch/powerpc/kvm/book3s_64_mmu_hv.c | 12 ------ > arch/powerpc/kvm/book3s_hv.c | 1 - > arch/powerpc/kvm/book3s_pr.c | 7 ---- > arch/powerpc/kvm/e500_mmu_host.c | 6 --- > arch/riscv/kvm/mmu.c | 20 ---------- > arch/x86/kvm/mmu/mmu.c | 54 +-------------------------- > arch/x86/kvm/mmu/spte.c | 16 -------- > arch/x86/kvm/mmu/spte.h | 2 - > arch/x86/kvm/mmu/tdp_mmu.c | 46 ----------------------- > arch/x86/kvm/mmu/tdp_mmu.h | 1 - > include/linux/kvm_host.h | 2 - > include/trace/events/kvm.h | 15 -------- > virt/kvm/kvm_main.c | 43 --------------------- > 20 files changed, 2 insertions(+), 327 deletions(-) > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c > index dc04bc767865..ff17849be9f4 100644 > --- a/arch/arm64/kvm/mmu.c > +++ b/arch/arm64/kvm/mmu.c > @@ -1768,40 +1768,6 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range) > return false; > } > > -bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range) > -{ > - kvm_pfn_t pfn = pte_pfn(range->arg.pte); > - > - if (!kvm->arch.mmu.pgt) > - return false; > - > - WARN_ON(range->end - range->start != 1); > - > - /* > - * If the page isn't tagged, defer to user_mem_abort() for sanitising > - * the MTE tags. The S2 pte should have been unmapped by > - * mmu_notifier_invalidate_range_end(). > - */ > - if (kvm_has_mte(kvm) && !page_mte_tagged(pfn_to_page(pfn))) > - return false; > - > - /* > - * We've moved a page around, probably through CoW, so let's treat > - * it just like a translation fault and the map handler will clean > - * the cache to the PoC. > - * > - * The MMU notifiers will have unmapped a huge PMD before calling > - * ->change_pte() (which in turn calls kvm_set_spte_gfn()) and > - * therefore we never need to clear out a huge PMD through this > - * calling path and a memcache is not required. > - */ > - kvm_pgtable_stage2_map(kvm->arch.mmu.pgt, range->start << PAGE_SHIFT, > - PAGE_SIZE, __pfn_to_phys(pfn), > - KVM_PGTABLE_PROT_R, NULL, 0); > - > - return false; > -} > - > bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) > { > u64 size = (range->end - range->start) << PAGE_SHIFT; > diff --git a/arch/loongarch/include/asm/kvm_host.h b/arch/loongarch/include/asm/kvm_host.h > index 2d62f7b0d377..69305441f40d 100644 > --- a/arch/loongarch/include/asm/kvm_host.h > +++ b/arch/loongarch/include/asm/kvm_host.h > @@ -203,7 +203,6 @@ void kvm_flush_tlb_all(void); > void kvm_flush_tlb_gpa(struct kvm_vcpu *vcpu, unsigned long gpa); > int kvm_handle_mm_fault(struct kvm_vcpu *vcpu, unsigned long badv, bool write); > > -void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte); > int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end, bool blockable); > int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end); > int kvm_test_age_hva(struct kvm *kvm, unsigned long hva); > diff --git a/arch/loongarch/kvm/mmu.c b/arch/loongarch/kvm/mmu.c > index a556cff35740..98883aa23ab8 100644 > --- a/arch/loongarch/kvm/mmu.c > +++ b/arch/loongarch/kvm/mmu.c > @@ -494,38 +494,6 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range) > range->end << PAGE_SHIFT, &ctx); > } > > -bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range) > -{ > - unsigned long prot_bits; > - kvm_pte_t *ptep; > - kvm_pfn_t pfn = pte_pfn(range->arg.pte); > - gpa_t gpa = range->start << PAGE_SHIFT; > - > - ptep = kvm_populate_gpa(kvm, NULL, gpa, 0); > - if (!ptep) > - return false; > - > - /* Replacing an absent or old page doesn't need flushes */ > - if (!kvm_pte_present(NULL, ptep) || !kvm_pte_young(*ptep)) { > - kvm_set_pte(ptep, 0); > - return false; > - } > - > - /* Fill new pte if write protected or page migrated */ > - prot_bits = _PAGE_PRESENT | __READABLE; > - prot_bits |= _CACHE_MASK & pte_val(range->arg.pte); > - > - /* > - * Set _PAGE_WRITE or _PAGE_DIRTY iff old and new pte both support > - * _PAGE_WRITE for map_page_fast if next page write fault > - * _PAGE_DIRTY since gpa has already recorded as dirty page > - */ > - prot_bits |= __WRITEABLE & *ptep & pte_val(range->arg.pte); > - kvm_set_pte(ptep, kvm_pfn_pte(pfn, __pgprot(prot_bits))); > - > - return true; > -} > - > bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) > { > kvm_ptw_ctx ctx; > diff --git a/arch/mips/kvm/mmu.c b/arch/mips/kvm/mmu.c > index 467ee6b95ae1..c17157e700c0 100644 > --- a/arch/mips/kvm/mmu.c > +++ b/arch/mips/kvm/mmu.c > @@ -444,36 +444,6 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range) > return true; > } > > -bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range) > -{ > - gpa_t gpa = range->start << PAGE_SHIFT; > - pte_t hva_pte = range->arg.pte; > - pte_t *gpa_pte = kvm_mips_pte_for_gpa(kvm, NULL, gpa); > - pte_t old_pte; > - > - if (!gpa_pte) > - return false; > - > - /* Mapping may need adjusting depending on memslot flags */ > - old_pte = *gpa_pte; > - if (range->slot->flags & KVM_MEM_LOG_DIRTY_PAGES && !pte_dirty(old_pte)) > - hva_pte = pte_mkclean(hva_pte); > - else if (range->slot->flags & KVM_MEM_READONLY) > - hva_pte = pte_wrprotect(hva_pte); > - > - set_pte(gpa_pte, hva_pte); > - > - /* Replacing an absent or old page doesn't need flushes */ > - if (!pte_present(old_pte) || !pte_young(old_pte)) > - return false; > - > - /* Pages swapped, aged, moved, or cleaned require flushes */ > - return !pte_present(hva_pte) || > - !pte_young(hva_pte) || > - pte_pfn(old_pte) != pte_pfn(hva_pte) || > - (pte_dirty(old_pte) && !pte_dirty(hva_pte)); > -} > - > bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) > { > return kvm_mips_mkold_gpa_pt(kvm, range->start, range->end); > diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h > index 3281215097cc..ca3829d47ab7 100644 > --- a/arch/powerpc/include/asm/kvm_ppc.h > +++ b/arch/powerpc/include/asm/kvm_ppc.h > @@ -287,7 +287,6 @@ struct kvmppc_ops { > bool (*unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range); > bool (*age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range); > bool (*test_age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range); > - bool (*set_spte_gfn)(struct kvm *kvm, struct kvm_gfn_range *range); > void (*free_memslot)(struct kvm_memory_slot *slot); > int (*init_vm)(struct kvm *kvm); > void (*destroy_vm)(struct kvm *kvm); > diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c > index 8acec144120e..0d0624088e6b 100644 > --- a/arch/powerpc/kvm/book3s.c > +++ b/arch/powerpc/kvm/book3s.c > @@ -899,11 +899,6 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) > return kvm->arch.kvm_ops->test_age_gfn(kvm, range); > } > > -bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range) > -{ > - return kvm->arch.kvm_ops->set_spte_gfn(kvm, range); > -} > - > int kvmppc_core_init_vm(struct kvm *kvm) > { > > diff --git a/arch/powerpc/kvm/book3s.h b/arch/powerpc/kvm/book3s.h > index 58391b4b32ed..4aa2ab89afbc 100644 > --- a/arch/powerpc/kvm/book3s.h > +++ b/arch/powerpc/kvm/book3s.h > @@ -12,7 +12,6 @@ extern void kvmppc_core_flush_memslot_hv(struct kvm *kvm, > extern bool kvm_unmap_gfn_range_hv(struct kvm *kvm, struct kvm_gfn_range *range); > extern bool kvm_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range); > extern bool kvm_test_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range); > -extern bool kvm_set_spte_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range); > > extern int kvmppc_mmu_init_pr(struct kvm_vcpu *vcpu); > extern void kvmppc_mmu_destroy_pr(struct kvm_vcpu *vcpu); > diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c > index 2b1f0cdd8c18..1b51b1c4713b 100644 > --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c > +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c > @@ -1010,18 +1010,6 @@ bool kvm_test_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range) > return kvm_test_age_rmapp(kvm, range->slot, range->start); > } > > -bool kvm_set_spte_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range) > -{ > - WARN_ON(range->start + 1 != range->end); > - > - if (kvm_is_radix(kvm)) > - kvm_unmap_radix(kvm, range->slot, range->start); > - else > - kvm_unmap_rmapp(kvm, range->slot, range->start); > - > - return false; > -} > - > static int vcpus_running(struct kvm *kvm) > { > return atomic_read(&kvm->arch.vcpus_running) != 0; > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c > index 8e86eb577eb8..35cb014a0c51 100644 > --- a/arch/powerpc/kvm/book3s_hv.c > +++ b/arch/powerpc/kvm/book3s_hv.c > @@ -6364,7 +6364,6 @@ static struct kvmppc_ops kvm_ops_hv = { > .unmap_gfn_range = kvm_unmap_gfn_range_hv, > .age_gfn = kvm_age_gfn_hv, > .test_age_gfn = kvm_test_age_gfn_hv, > - .set_spte_gfn = kvm_set_spte_gfn_hv, > .free_memslot = kvmppc_core_free_memslot_hv, > .init_vm = kvmppc_core_init_vm_hv, > .destroy_vm = kvmppc_core_destroy_vm_hv, > diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c > index 5b92619a05fd..a7d7137ea0c8 100644 > --- a/arch/powerpc/kvm/book3s_pr.c > +++ b/arch/powerpc/kvm/book3s_pr.c > @@ -461,12 +461,6 @@ static bool kvm_test_age_gfn_pr(struct kvm *kvm, struct kvm_gfn_range *range) > return false; > } > > -static bool kvm_set_spte_gfn_pr(struct kvm *kvm, struct kvm_gfn_range *range) > -{ > - /* The page will get remapped properly on its next fault */ > - return do_kvm_unmap_gfn(kvm, range); > -} > - > /*****************************************/ > > static void kvmppc_set_msr_pr(struct kvm_vcpu *vcpu, u64 msr) > @@ -2071,7 +2065,6 @@ static struct kvmppc_ops kvm_ops_pr = { > .unmap_gfn_range = kvm_unmap_gfn_range_pr, > .age_gfn = kvm_age_gfn_pr, > .test_age_gfn = kvm_test_age_gfn_pr, > - .set_spte_gfn = kvm_set_spte_gfn_pr, > .free_memslot = kvmppc_core_free_memslot_pr, > .init_vm = kvmppc_core_init_vm_pr, > .destroy_vm = kvmppc_core_destroy_vm_pr, > diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c > index ccb8f16ffe41..c664fdec75b1 100644 > --- a/arch/powerpc/kvm/e500_mmu_host.c > +++ b/arch/powerpc/kvm/e500_mmu_host.c > @@ -747,12 +747,6 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) > return false; > } > > -bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range) > -{ > - /* The page will get remapped properly on its next fault */ > - return kvm_e500_mmu_unmap_gfn(kvm, range); > -} > - > /*****************************************/ > > int e500_mmu_host_init(struct kvmppc_vcpu_e500 *vcpu_e500) > diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c > index a9e2fd7245e1..b63650f9b966 100644 > --- a/arch/riscv/kvm/mmu.c > +++ b/arch/riscv/kvm/mmu.c > @@ -550,26 +550,6 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range) > return false; > } > > -bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range) > -{ > - int ret; > - kvm_pfn_t pfn = pte_pfn(range->arg.pte); > - > - if (!kvm->arch.pgd) > - return false; > - > - WARN_ON(range->end - range->start != 1); > - > - ret = gstage_map_page(kvm, NULL, range->start << PAGE_SHIFT, > - __pfn_to_phys(pfn), PAGE_SIZE, true, true); > - if (ret) { > - kvm_debug("Failed to map G-stage page (error %d)\n", ret); > - return true; > - } > - > - return false; > -} > - > bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) > { > pte_t *ptep; > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index 0049d49aa913..87ba2a9da196 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -432,8 +432,8 @@ static u64 __update_clear_spte_slow(u64 *sptep, u64 spte) > * The idea using the light way get the spte on x86_32 guest is from > * gup_get_pte (mm/gup.c). > * > - * An spte tlb flush may be pending, because kvm_set_pte_rmap > - * coalesces them and we are running out of the MMU lock. Therefore > + * An spte tlb flush may be pending, because they are coalesced and > + * we are running out of the MMU lock. Therefore > * we need to protect against in-progress updates of the spte. > * > * Reading the spte while an update is in progress may get the old value > @@ -1454,43 +1454,6 @@ static bool kvm_zap_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head, > return __kvm_zap_rmap(kvm, rmap_head, slot); > } > > -static bool kvm_set_pte_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head, > - struct kvm_memory_slot *slot, gfn_t gfn, int level, > - pte_t pte) > -{ > - u64 *sptep; > - struct rmap_iterator iter; > - bool need_flush = false; > - u64 new_spte; > - kvm_pfn_t new_pfn; > - > - WARN_ON_ONCE(pte_huge(pte)); > - new_pfn = pte_pfn(pte); > - > -restart: > - for_each_rmap_spte(rmap_head, &iter, sptep) { > - need_flush = true; > - > - if (pte_write(pte)) { > - kvm_zap_one_rmap_spte(kvm, rmap_head, sptep); > - goto restart; > - } else { > - new_spte = kvm_mmu_changed_pte_notifier_make_spte( > - *sptep, new_pfn); > - > - mmu_spte_clear_track_bits(kvm, sptep); > - mmu_spte_set(sptep, new_spte); > - } > - } > - > - if (need_flush && kvm_available_flush_remote_tlbs_range()) { > - kvm_flush_remote_tlbs_gfn(kvm, gfn, level); > - return false; > - } > - > - return need_flush; > -} > - > struct slot_rmap_walk_iterator { > /* input fields. */ > const struct kvm_memory_slot *slot; > @@ -1596,19 +1559,6 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range) > return flush; > } > > -bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range) > -{ > - bool flush = false; > - > - if (kvm_memslots_have_rmaps(kvm)) > - flush = kvm_handle_gfn_range(kvm, range, kvm_set_pte_rmap); > - > - if (tdp_mmu_enabled) > - flush |= kvm_tdp_mmu_set_spte_gfn(kvm, range); > - > - return flush; > -} > - > static bool kvm_age_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head, > struct kvm_memory_slot *slot, gfn_t gfn, int level, > pte_t unused) > diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c > index 318135daf685..283af5b90016 100644 > --- a/arch/x86/kvm/mmu/spte.c > +++ b/arch/x86/kvm/mmu/spte.c > @@ -322,22 +322,6 @@ u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled) > return spte; > } > > -u64 kvm_mmu_changed_pte_notifier_make_spte(u64 old_spte, kvm_pfn_t new_pfn) > -{ > - u64 new_spte; > - > - new_spte = old_spte & ~SPTE_BASE_ADDR_MASK; > - new_spte |= (u64)new_pfn << PAGE_SHIFT; > - > - new_spte &= ~PT_WRITABLE_MASK; > - new_spte &= ~shadow_host_writable_mask; > - new_spte &= ~shadow_mmu_writable_mask; > - > - new_spte = mark_spte_for_access_track(new_spte); > - > - return new_spte; > -} > - > u64 mark_spte_for_access_track(u64 spte) > { > if (spte_ad_enabled(spte)) > diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h > index 1a163aee9ec6..92da4c419171 100644 > --- a/arch/x86/kvm/mmu/spte.h > +++ b/arch/x86/kvm/mmu/spte.h > @@ -511,8 +511,6 @@ static inline u64 restore_acc_track_spte(u64 spte) > return spte; > } > > -u64 kvm_mmu_changed_pte_notifier_make_spte(u64 old_spte, kvm_pfn_t new_pfn); > - > void __init kvm_mmu_spte_module_init(void); > void kvm_mmu_reset_all_pte_masks(void); > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c > index 3627744fcab6..fbb86932b766 100644 > --- a/arch/x86/kvm/mmu/tdp_mmu.c > +++ b/arch/x86/kvm/mmu/tdp_mmu.c > @@ -1250,52 +1250,6 @@ bool kvm_tdp_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) > return kvm_tdp_mmu_handle_gfn(kvm, range, test_age_gfn); > } > > -static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter, > - struct kvm_gfn_range *range) > -{ > - u64 new_spte; > - > - /* Huge pages aren't expected to be modified without first being zapped. */ > - WARN_ON_ONCE(pte_huge(range->arg.pte) || range->start + 1 != range->end); > - > - if (iter->level != PG_LEVEL_4K || > - !is_shadow_present_pte(iter->old_spte)) > - return false; > - > - /* > - * Note, when changing a read-only SPTE, it's not strictly necessary to > - * zero the SPTE before setting the new PFN, but doing so preserves the > - * invariant that the PFN of a present * leaf SPTE can never change. > - * See handle_changed_spte(). > - */ > - tdp_mmu_iter_set_spte(kvm, iter, SHADOW_NONPRESENT_VALUE); > - > - if (!pte_write(range->arg.pte)) { > - new_spte = kvm_mmu_changed_pte_notifier_make_spte(iter->old_spte, > - pte_pfn(range->arg.pte)); > - > - tdp_mmu_iter_set_spte(kvm, iter, new_spte); > - } > - > - return true; > -} > - > -/* > - * Handle the changed_pte MMU notifier for the TDP MMU. > - * data is a pointer to the new pte_t mapping the HVA specified by the MMU > - * notifier. > - * Returns non-zero if a flush is needed before releasing the MMU lock. > - */ > -bool kvm_tdp_mmu_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range) > -{ > - /* > - * No need to handle the remote TLB flush under RCU protection, the > - * target SPTE _must_ be a leaf SPTE, i.e. cannot result in freeing a > - * shadow page. See the WARN on pfn_changed in handle_changed_spte(). > - */ > - return kvm_tdp_mmu_handle_gfn(kvm, range, set_spte_gfn); > -} > - > /* > * Remove write access from all SPTEs at or above min_level that map GFNs > * [start, end). Returns true if an SPTE has been changed and the TLBs need to > diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h > index 6e1ea04ca885..58b55e61bd33 100644 > --- a/arch/x86/kvm/mmu/tdp_mmu.h > +++ b/arch/x86/kvm/mmu/tdp_mmu.h > @@ -31,7 +31,6 @@ bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range, > bool flush); > bool kvm_tdp_mmu_age_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range); > bool kvm_tdp_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range); > -bool kvm_tdp_mmu_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range); > > bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm, > const struct kvm_memory_slot *slot, int min_level); > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > index ab439706ea2f..8dea11701ab2 100644 > --- a/include/linux/kvm_host.h > +++ b/include/linux/kvm_host.h > @@ -259,7 +259,6 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu); > > #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER > union kvm_mmu_notifier_arg { > - pte_t pte; > unsigned long attributes; > }; > > @@ -273,7 +272,6 @@ struct kvm_gfn_range { > bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range); > bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range); > bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range); > -bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range); > #endif > > enum { > diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h > index 011fba6b5552..74e40d5d4af4 100644 > --- a/include/trace/events/kvm.h > +++ b/include/trace/events/kvm.h > @@ -456,21 +456,6 @@ TRACE_EVENT(kvm_unmap_hva_range, > __entry->start, __entry->end) > ); > > -TRACE_EVENT(kvm_set_spte_hva, > - TP_PROTO(unsigned long hva), > - TP_ARGS(hva), > - > - TP_STRUCT__entry( > - __field( unsigned long, hva ) > - ), > - > - TP_fast_assign( > - __entry->hva = hva; > - ), > - > - TP_printk("mmu notifier set pte hva: %#016lx", __entry->hva) > -); > - > TRACE_EVENT(kvm_age_hva, > TP_PROTO(unsigned long start, unsigned long end), > TP_ARGS(start, end), > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index 4eb8afd0b961..2fcd9979752a 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -717,48 +717,6 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn > return __kvm_handle_hva_range(kvm, &range).ret; > } > > -static bool kvm_change_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range) > -{ > - /* > - * Skipping invalid memslots is correct if and only change_pte() is > - * surrounded by invalidate_range_{start,end}(), which is currently > - * guaranteed by the primary MMU. If that ever changes, KVM needs to > - * unmap the memslot instead of skipping the memslot to ensure that KVM > - * doesn't hold references to the old PFN. > - */ > - WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count)); > - > - if (range->slot->flags & KVM_MEMSLOT_INVALID) > - return false; > - > - return kvm_set_spte_gfn(kvm, range); > -} > - > -static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn, > - struct mm_struct *mm, > - unsigned long address, > - pte_t pte) > -{ > - struct kvm *kvm = mmu_notifier_to_kvm(mn); > - const union kvm_mmu_notifier_arg arg = { .pte = pte }; > - > - trace_kvm_set_spte_hva(address); > - > - /* > - * .change_pte() must be surrounded by .invalidate_range_{start,end}(). > - * If mmu_invalidate_in_progress is zero, then no in-progress > - * invalidations, including this one, found a relevant memslot at > - * start(); rechecking memslots here is unnecessary. Note, a false > - * positive (count elevated by a different invalidation) is sub-optimal > - * but functionally ok. > - */ > - WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count)); > - if (!READ_ONCE(kvm->mmu_invalidate_in_progress)) > - return; > - > - kvm_handle_hva_range(mn, address, address + 1, arg, kvm_change_spte_gfn); > -} > - > void kvm_mmu_invalidate_begin(struct kvm *kvm) > { > lockdep_assert_held_write(&kvm->mmu_lock); > @@ -976,7 +934,6 @@ static const struct mmu_notifier_ops kvm_mmu_notifier_ops = { > .clear_flush_young = kvm_mmu_notifier_clear_flush_young, > .clear_young = kvm_mmu_notifier_clear_young, > .test_young = kvm_mmu_notifier_test_young, > - .change_pte = kvm_mmu_notifier_change_pte, > .release = kvm_mmu_notifier_release, > }; > > -- > 2.43.0 > >
On 2024/4/5 下午7:58, Paolo Bonzini wrote: > The .change_pte() MMU notifier callback was intended as an > optimization. The original point of it was that KSM could tell KVM to flip > its secondary PTE to a new location without having to first zap it. At > the time there was also an .invalidate_page() callback; both of them were > *not* bracketed by calls to mmu_notifier_invalidate_range_{start,end}(), > and .invalidate_page() also doubled as a fallback implementation of > change_pte(). > > Later on, however, both callbacks were changed to occur within an > invalidate_range_start/end() block. > > In the case of .change_pte(), commit 6bdb913f0a70 ("mm: wrap calls to > set_pte_at_notify with invalidate_range_start and invalidate_range_end", > 2012-10-09) did so to remove the fallback from .invalidate_page() to > change_pte() and allow sleepable .invalidate_page() hooks. > > This however made KVM's usage of the .change_pte() callback completely > moot, because KVM unmaps the sPTEs during .invalidate_range_start() > and therefore .change_pte() has no hope of finding a sPTE to change. > Drop the generic KVM code that dispatches to kvm_set_spte_gfn(), as > well as all the architecture specific implementations. > > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> > --- > arch/arm64/kvm/mmu.c | 34 ----------------- > arch/loongarch/include/asm/kvm_host.h | 1 - > arch/loongarch/kvm/mmu.c | 32 ---------------- > arch/mips/kvm/mmu.c | 30 --------------- > arch/powerpc/include/asm/kvm_ppc.h | 1 - > arch/powerpc/kvm/book3s.c | 5 --- > arch/powerpc/kvm/book3s.h | 1 - > arch/powerpc/kvm/book3s_64_mmu_hv.c | 12 ------ > arch/powerpc/kvm/book3s_hv.c | 1 - > arch/powerpc/kvm/book3s_pr.c | 7 ---- > arch/powerpc/kvm/e500_mmu_host.c | 6 --- > arch/riscv/kvm/mmu.c | 20 ---------- > arch/x86/kvm/mmu/mmu.c | 54 +-------------------------- > arch/x86/kvm/mmu/spte.c | 16 -------- > arch/x86/kvm/mmu/spte.h | 2 - > arch/x86/kvm/mmu/tdp_mmu.c | 46 ----------------------- > arch/x86/kvm/mmu/tdp_mmu.h | 1 - > include/linux/kvm_host.h | 2 - > include/trace/events/kvm.h | 15 -------- > virt/kvm/kvm_main.c | 43 --------------------- > 20 files changed, 2 insertions(+), 327 deletions(-) > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c > index dc04bc767865..ff17849be9f4 100644 > --- a/arch/arm64/kvm/mmu.c > +++ b/arch/arm64/kvm/mmu.c > @@ -1768,40 +1768,6 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range) > return false; > } > > -bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range) > -{ > - kvm_pfn_t pfn = pte_pfn(range->arg.pte); > - > - if (!kvm->arch.mmu.pgt) > - return false; > - > - WARN_ON(range->end - range->start != 1); > - > - /* > - * If the page isn't tagged, defer to user_mem_abort() for sanitising > - * the MTE tags. The S2 pte should have been unmapped by > - * mmu_notifier_invalidate_range_end(). > - */ > - if (kvm_has_mte(kvm) && !page_mte_tagged(pfn_to_page(pfn))) > - return false; > - > - /* > - * We've moved a page around, probably through CoW, so let's treat > - * it just like a translation fault and the map handler will clean > - * the cache to the PoC. > - * > - * The MMU notifiers will have unmapped a huge PMD before calling > - * ->change_pte() (which in turn calls kvm_set_spte_gfn()) and > - * therefore we never need to clear out a huge PMD through this > - * calling path and a memcache is not required. > - */ > - kvm_pgtable_stage2_map(kvm->arch.mmu.pgt, range->start << PAGE_SHIFT, > - PAGE_SIZE, __pfn_to_phys(pfn), > - KVM_PGTABLE_PROT_R, NULL, 0); > - > - return false; > -} > - > bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) > { > u64 size = (range->end - range->start) << PAGE_SHIFT; > diff --git a/arch/loongarch/include/asm/kvm_host.h b/arch/loongarch/include/asm/kvm_host.h > index 2d62f7b0d377..69305441f40d 100644 > --- a/arch/loongarch/include/asm/kvm_host.h > +++ b/arch/loongarch/include/asm/kvm_host.h > @@ -203,7 +203,6 @@ void kvm_flush_tlb_all(void); > void kvm_flush_tlb_gpa(struct kvm_vcpu *vcpu, unsigned long gpa); > int kvm_handle_mm_fault(struct kvm_vcpu *vcpu, unsigned long badv, bool write); > > -void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte); > int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end, bool blockable); > int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end); > int kvm_test_age_hva(struct kvm *kvm, unsigned long hva); > diff --git a/arch/loongarch/kvm/mmu.c b/arch/loongarch/kvm/mmu.c > index a556cff35740..98883aa23ab8 100644 > --- a/arch/loongarch/kvm/mmu.c > +++ b/arch/loongarch/kvm/mmu.c > @@ -494,38 +494,6 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range) > range->end << PAGE_SHIFT, &ctx); > } > > -bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range) > -{ > - unsigned long prot_bits; > - kvm_pte_t *ptep; > - kvm_pfn_t pfn = pte_pfn(range->arg.pte); > - gpa_t gpa = range->start << PAGE_SHIFT; > - > - ptep = kvm_populate_gpa(kvm, NULL, gpa, 0); > - if (!ptep) > - return false; > - > - /* Replacing an absent or old page doesn't need flushes */ > - if (!kvm_pte_present(NULL, ptep) || !kvm_pte_young(*ptep)) { > - kvm_set_pte(ptep, 0); > - return false; > - } > - > - /* Fill new pte if write protected or page migrated */ > - prot_bits = _PAGE_PRESENT | __READABLE; > - prot_bits |= _CACHE_MASK & pte_val(range->arg.pte); > - > - /* > - * Set _PAGE_WRITE or _PAGE_DIRTY iff old and new pte both support > - * _PAGE_WRITE for map_page_fast if next page write fault > - * _PAGE_DIRTY since gpa has already recorded as dirty page > - */ > - prot_bits |= __WRITEABLE & *ptep & pte_val(range->arg.pte); > - kvm_set_pte(ptep, kvm_pfn_pte(pfn, __pgprot(prot_bits))); > - > - return true; > -} > - Thanks for cleanup. And for kvm loongarch: Reviewed-by: Bibo Mao <maobibo@loongson.cn> > bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) > { > kvm_ptw_ctx ctx; > diff --git a/arch/mips/kvm/mmu.c b/arch/mips/kvm/mmu.c > index 467ee6b95ae1..c17157e700c0 100644 > --- a/arch/mips/kvm/mmu.c > +++ b/arch/mips/kvm/mmu.c > @@ -444,36 +444,6 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range) > return true; > } > > -bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range) > -{ > - gpa_t gpa = range->start << PAGE_SHIFT; > - pte_t hva_pte = range->arg.pte; > - pte_t *gpa_pte = kvm_mips_pte_for_gpa(kvm, NULL, gpa); > - pte_t old_pte; > - > - if (!gpa_pte) > - return false; > - > - /* Mapping may need adjusting depending on memslot flags */ > - old_pte = *gpa_pte; > - if (range->slot->flags & KVM_MEM_LOG_DIRTY_PAGES && !pte_dirty(old_pte)) > - hva_pte = pte_mkclean(hva_pte); > - else if (range->slot->flags & KVM_MEM_READONLY) > - hva_pte = pte_wrprotect(hva_pte); > - > - set_pte(gpa_pte, hva_pte); > - > - /* Replacing an absent or old page doesn't need flushes */ > - if (!pte_present(old_pte) || !pte_young(old_pte)) > - return false; > - > - /* Pages swapped, aged, moved, or cleaned require flushes */ > - return !pte_present(hva_pte) || > - !pte_young(hva_pte) || > - pte_pfn(old_pte) != pte_pfn(hva_pte) || > - (pte_dirty(old_pte) && !pte_dirty(hva_pte)); > -} > - > bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) > { > return kvm_mips_mkold_gpa_pt(kvm, range->start, range->end); > diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h > index 3281215097cc..ca3829d47ab7 100644 > --- a/arch/powerpc/include/asm/kvm_ppc.h > +++ b/arch/powerpc/include/asm/kvm_ppc.h > @@ -287,7 +287,6 @@ struct kvmppc_ops { > bool (*unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range); > bool (*age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range); > bool (*test_age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range); > - bool (*set_spte_gfn)(struct kvm *kvm, struct kvm_gfn_range *range); > void (*free_memslot)(struct kvm_memory_slot *slot); > int (*init_vm)(struct kvm *kvm); > void (*destroy_vm)(struct kvm *kvm); > diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c > index 8acec144120e..0d0624088e6b 100644 > --- a/arch/powerpc/kvm/book3s.c > +++ b/arch/powerpc/kvm/book3s.c > @@ -899,11 +899,6 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) > return kvm->arch.kvm_ops->test_age_gfn(kvm, range); > } > > -bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range) > -{ > - return kvm->arch.kvm_ops->set_spte_gfn(kvm, range); > -} > - > int kvmppc_core_init_vm(struct kvm *kvm) > { > > diff --git a/arch/powerpc/kvm/book3s.h b/arch/powerpc/kvm/book3s.h > index 58391b4b32ed..4aa2ab89afbc 100644 > --- a/arch/powerpc/kvm/book3s.h > +++ b/arch/powerpc/kvm/book3s.h > @@ -12,7 +12,6 @@ extern void kvmppc_core_flush_memslot_hv(struct kvm *kvm, > extern bool kvm_unmap_gfn_range_hv(struct kvm *kvm, struct kvm_gfn_range *range); > extern bool kvm_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range); > extern bool kvm_test_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range); > -extern bool kvm_set_spte_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range); > > extern int kvmppc_mmu_init_pr(struct kvm_vcpu *vcpu); > extern void kvmppc_mmu_destroy_pr(struct kvm_vcpu *vcpu); > diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c > index 2b1f0cdd8c18..1b51b1c4713b 100644 > --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c > +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c > @@ -1010,18 +1010,6 @@ bool kvm_test_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range) > return kvm_test_age_rmapp(kvm, range->slot, range->start); > } > > -bool kvm_set_spte_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range) > -{ > - WARN_ON(range->start + 1 != range->end); > - > - if (kvm_is_radix(kvm)) > - kvm_unmap_radix(kvm, range->slot, range->start); > - else > - kvm_unmap_rmapp(kvm, range->slot, range->start); > - > - return false; > -} > - > static int vcpus_running(struct kvm *kvm) > { > return atomic_read(&kvm->arch.vcpus_running) != 0; > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c > index 8e86eb577eb8..35cb014a0c51 100644 > --- a/arch/powerpc/kvm/book3s_hv.c > +++ b/arch/powerpc/kvm/book3s_hv.c > @@ -6364,7 +6364,6 @@ static struct kvmppc_ops kvm_ops_hv = { > .unmap_gfn_range = kvm_unmap_gfn_range_hv, > .age_gfn = kvm_age_gfn_hv, > .test_age_gfn = kvm_test_age_gfn_hv, > - .set_spte_gfn = kvm_set_spte_gfn_hv, > .free_memslot = kvmppc_core_free_memslot_hv, > .init_vm = kvmppc_core_init_vm_hv, > .destroy_vm = kvmppc_core_destroy_vm_hv, > diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c > index 5b92619a05fd..a7d7137ea0c8 100644 > --- a/arch/powerpc/kvm/book3s_pr.c > +++ b/arch/powerpc/kvm/book3s_pr.c > @@ -461,12 +461,6 @@ static bool kvm_test_age_gfn_pr(struct kvm *kvm, struct kvm_gfn_range *range) > return false; > } > > -static bool kvm_set_spte_gfn_pr(struct kvm *kvm, struct kvm_gfn_range *range) > -{ > - /* The page will get remapped properly on its next fault */ > - return do_kvm_unmap_gfn(kvm, range); > -} > - > /*****************************************/ > > static void kvmppc_set_msr_pr(struct kvm_vcpu *vcpu, u64 msr) > @@ -2071,7 +2065,6 @@ static struct kvmppc_ops kvm_ops_pr = { > .unmap_gfn_range = kvm_unmap_gfn_range_pr, > .age_gfn = kvm_age_gfn_pr, > .test_age_gfn = kvm_test_age_gfn_pr, > - .set_spte_gfn = kvm_set_spte_gfn_pr, > .free_memslot = kvmppc_core_free_memslot_pr, > .init_vm = kvmppc_core_init_vm_pr, > .destroy_vm = kvmppc_core_destroy_vm_pr, > diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c > index ccb8f16ffe41..c664fdec75b1 100644 > --- a/arch/powerpc/kvm/e500_mmu_host.c > +++ b/arch/powerpc/kvm/e500_mmu_host.c > @@ -747,12 +747,6 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) > return false; > } > > -bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range) > -{ > - /* The page will get remapped properly on its next fault */ > - return kvm_e500_mmu_unmap_gfn(kvm, range); > -} > - > /*****************************************/ > > int e500_mmu_host_init(struct kvmppc_vcpu_e500 *vcpu_e500) > diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c > index a9e2fd7245e1..b63650f9b966 100644 > --- a/arch/riscv/kvm/mmu.c > +++ b/arch/riscv/kvm/mmu.c > @@ -550,26 +550,6 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range) > return false; > } > > -bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range) > -{ > - int ret; > - kvm_pfn_t pfn = pte_pfn(range->arg.pte); > - > - if (!kvm->arch.pgd) > - return false; > - > - WARN_ON(range->end - range->start != 1); > - > - ret = gstage_map_page(kvm, NULL, range->start << PAGE_SHIFT, > - __pfn_to_phys(pfn), PAGE_SIZE, true, true); > - if (ret) { > - kvm_debug("Failed to map G-stage page (error %d)\n", ret); > - return true; > - } > - > - return false; > -} > - > bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) > { > pte_t *ptep; > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index 0049d49aa913..87ba2a9da196 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -432,8 +432,8 @@ static u64 __update_clear_spte_slow(u64 *sptep, u64 spte) > * The idea using the light way get the spte on x86_32 guest is from > * gup_get_pte (mm/gup.c). > * > - * An spte tlb flush may be pending, because kvm_set_pte_rmap > - * coalesces them and we are running out of the MMU lock. Therefore > + * An spte tlb flush may be pending, because they are coalesced and > + * we are running out of the MMU lock. Therefore > * we need to protect against in-progress updates of the spte. > * > * Reading the spte while an update is in progress may get the old value > @@ -1454,43 +1454,6 @@ static bool kvm_zap_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head, > return __kvm_zap_rmap(kvm, rmap_head, slot); > } > > -static bool kvm_set_pte_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head, > - struct kvm_memory_slot *slot, gfn_t gfn, int level, > - pte_t pte) > -{ > - u64 *sptep; > - struct rmap_iterator iter; > - bool need_flush = false; > - u64 new_spte; > - kvm_pfn_t new_pfn; > - > - WARN_ON_ONCE(pte_huge(pte)); > - new_pfn = pte_pfn(pte); > - > -restart: > - for_each_rmap_spte(rmap_head, &iter, sptep) { > - need_flush = true; > - > - if (pte_write(pte)) { > - kvm_zap_one_rmap_spte(kvm, rmap_head, sptep); > - goto restart; > - } else { > - new_spte = kvm_mmu_changed_pte_notifier_make_spte( > - *sptep, new_pfn); > - > - mmu_spte_clear_track_bits(kvm, sptep); > - mmu_spte_set(sptep, new_spte); > - } > - } > - > - if (need_flush && kvm_available_flush_remote_tlbs_range()) { > - kvm_flush_remote_tlbs_gfn(kvm, gfn, level); > - return false; > - } > - > - return need_flush; > -} > - > struct slot_rmap_walk_iterator { > /* input fields. */ > const struct kvm_memory_slot *slot; > @@ -1596,19 +1559,6 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range) > return flush; > } > > -bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range) > -{ > - bool flush = false; > - > - if (kvm_memslots_have_rmaps(kvm)) > - flush = kvm_handle_gfn_range(kvm, range, kvm_set_pte_rmap); > - > - if (tdp_mmu_enabled) > - flush |= kvm_tdp_mmu_set_spte_gfn(kvm, range); > - > - return flush; > -} > - > static bool kvm_age_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head, > struct kvm_memory_slot *slot, gfn_t gfn, int level, > pte_t unused) > diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c > index 318135daf685..283af5b90016 100644 > --- a/arch/x86/kvm/mmu/spte.c > +++ b/arch/x86/kvm/mmu/spte.c > @@ -322,22 +322,6 @@ u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled) > return spte; > } > > -u64 kvm_mmu_changed_pte_notifier_make_spte(u64 old_spte, kvm_pfn_t new_pfn) > -{ > - u64 new_spte; > - > - new_spte = old_spte & ~SPTE_BASE_ADDR_MASK; > - new_spte |= (u64)new_pfn << PAGE_SHIFT; > - > - new_spte &= ~PT_WRITABLE_MASK; > - new_spte &= ~shadow_host_writable_mask; > - new_spte &= ~shadow_mmu_writable_mask; > - > - new_spte = mark_spte_for_access_track(new_spte); > - > - return new_spte; > -} > - > u64 mark_spte_for_access_track(u64 spte) > { > if (spte_ad_enabled(spte)) > diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h > index 1a163aee9ec6..92da4c419171 100644 > --- a/arch/x86/kvm/mmu/spte.h > +++ b/arch/x86/kvm/mmu/spte.h > @@ -511,8 +511,6 @@ static inline u64 restore_acc_track_spte(u64 spte) > return spte; > } > > -u64 kvm_mmu_changed_pte_notifier_make_spte(u64 old_spte, kvm_pfn_t new_pfn); > - > void __init kvm_mmu_spte_module_init(void); > void kvm_mmu_reset_all_pte_masks(void); > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c > index 3627744fcab6..fbb86932b766 100644 > --- a/arch/x86/kvm/mmu/tdp_mmu.c > +++ b/arch/x86/kvm/mmu/tdp_mmu.c > @@ -1250,52 +1250,6 @@ bool kvm_tdp_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) > return kvm_tdp_mmu_handle_gfn(kvm, range, test_age_gfn); > } > > -static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter, > - struct kvm_gfn_range *range) > -{ > - u64 new_spte; > - > - /* Huge pages aren't expected to be modified without first being zapped. */ > - WARN_ON_ONCE(pte_huge(range->arg.pte) || range->start + 1 != range->end); > - > - if (iter->level != PG_LEVEL_4K || > - !is_shadow_present_pte(iter->old_spte)) > - return false; > - > - /* > - * Note, when changing a read-only SPTE, it's not strictly necessary to > - * zero the SPTE before setting the new PFN, but doing so preserves the > - * invariant that the PFN of a present * leaf SPTE can never change. > - * See handle_changed_spte(). > - */ > - tdp_mmu_iter_set_spte(kvm, iter, SHADOW_NONPRESENT_VALUE); > - > - if (!pte_write(range->arg.pte)) { > - new_spte = kvm_mmu_changed_pte_notifier_make_spte(iter->old_spte, > - pte_pfn(range->arg.pte)); > - > - tdp_mmu_iter_set_spte(kvm, iter, new_spte); > - } > - > - return true; > -} > - > -/* > - * Handle the changed_pte MMU notifier for the TDP MMU. > - * data is a pointer to the new pte_t mapping the HVA specified by the MMU > - * notifier. > - * Returns non-zero if a flush is needed before releasing the MMU lock. > - */ > -bool kvm_tdp_mmu_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range) > -{ > - /* > - * No need to handle the remote TLB flush under RCU protection, the > - * target SPTE _must_ be a leaf SPTE, i.e. cannot result in freeing a > - * shadow page. See the WARN on pfn_changed in handle_changed_spte(). > - */ > - return kvm_tdp_mmu_handle_gfn(kvm, range, set_spte_gfn); > -} > - > /* > * Remove write access from all SPTEs at or above min_level that map GFNs > * [start, end). Returns true if an SPTE has been changed and the TLBs need to > diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h > index 6e1ea04ca885..58b55e61bd33 100644 > --- a/arch/x86/kvm/mmu/tdp_mmu.h > +++ b/arch/x86/kvm/mmu/tdp_mmu.h > @@ -31,7 +31,6 @@ bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range, > bool flush); > bool kvm_tdp_mmu_age_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range); > bool kvm_tdp_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range); > -bool kvm_tdp_mmu_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range); > > bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm, > const struct kvm_memory_slot *slot, int min_level); > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > index ab439706ea2f..8dea11701ab2 100644 > --- a/include/linux/kvm_host.h > +++ b/include/linux/kvm_host.h > @@ -259,7 +259,6 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu); > > #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER > union kvm_mmu_notifier_arg { > - pte_t pte; > unsigned long attributes; > }; > > @@ -273,7 +272,6 @@ struct kvm_gfn_range { > bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range); > bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range); > bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range); > -bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range); > #endif > > enum { > diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h > index 011fba6b5552..74e40d5d4af4 100644 > --- a/include/trace/events/kvm.h > +++ b/include/trace/events/kvm.h > @@ -456,21 +456,6 @@ TRACE_EVENT(kvm_unmap_hva_range, > __entry->start, __entry->end) > ); > > -TRACE_EVENT(kvm_set_spte_hva, > - TP_PROTO(unsigned long hva), > - TP_ARGS(hva), > - > - TP_STRUCT__entry( > - __field( unsigned long, hva ) > - ), > - > - TP_fast_assign( > - __entry->hva = hva; > - ), > - > - TP_printk("mmu notifier set pte hva: %#016lx", __entry->hva) > -); > - > TRACE_EVENT(kvm_age_hva, > TP_PROTO(unsigned long start, unsigned long end), > TP_ARGS(start, end), > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index 4eb8afd0b961..2fcd9979752a 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -717,48 +717,6 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn > return __kvm_handle_hva_range(kvm, &range).ret; > } > > -static bool kvm_change_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range) > -{ > - /* > - * Skipping invalid memslots is correct if and only change_pte() is > - * surrounded by invalidate_range_{start,end}(), which is currently > - * guaranteed by the primary MMU. If that ever changes, KVM needs to > - * unmap the memslot instead of skipping the memslot to ensure that KVM > - * doesn't hold references to the old PFN. > - */ > - WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count)); > - > - if (range->slot->flags & KVM_MEMSLOT_INVALID) > - return false; > - > - return kvm_set_spte_gfn(kvm, range); > -} > - > -static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn, > - struct mm_struct *mm, > - unsigned long address, > - pte_t pte) > -{ > - struct kvm *kvm = mmu_notifier_to_kvm(mn); > - const union kvm_mmu_notifier_arg arg = { .pte = pte }; > - > - trace_kvm_set_spte_hva(address); > - > - /* > - * .change_pte() must be surrounded by .invalidate_range_{start,end}(). > - * If mmu_invalidate_in_progress is zero, then no in-progress > - * invalidations, including this one, found a relevant memslot at > - * start(); rechecking memslots here is unnecessary. Note, a false > - * positive (count elevated by a different invalidation) is sub-optimal > - * but functionally ok. > - */ > - WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count)); > - if (!READ_ONCE(kvm->mmu_invalidate_in_progress)) > - return; > - > - kvm_handle_hva_range(mn, address, address + 1, arg, kvm_change_spte_gfn); > -} > - > void kvm_mmu_invalidate_begin(struct kvm *kvm) > { > lockdep_assert_held_write(&kvm->mmu_lock); > @@ -976,7 +934,6 @@ static const struct mmu_notifier_ops kvm_mmu_notifier_ops = { > .clear_flush_young = kvm_mmu_notifier_clear_flush_young, > .clear_young = kvm_mmu_notifier_clear_young, > .test_young = kvm_mmu_notifier_test_young, > - .change_pte = kvm_mmu_notifier_change_pte, > .release = kvm_mmu_notifier_release, > }; > >
Paolo Bonzini <pbonzini@redhat.com> writes: > The .change_pte() MMU notifier callback was intended as an > optimization. The original point of it was that KSM could tell KVM to flip > its secondary PTE to a new location without having to first zap it. At > the time there was also an .invalidate_page() callback; both of them were > *not* bracketed by calls to mmu_notifier_invalidate_range_{start,end}(), > and .invalidate_page() also doubled as a fallback implementation of > .change_pte(). > > Later on, however, both callbacks were changed to occur within an > invalidate_range_start/end() block. > > In the case of .change_pte(), commit 6bdb913f0a70 ("mm: wrap calls to > set_pte_at_notify with invalidate_range_start and invalidate_range_end", > 2012-10-09) did so to remove the fallback from .invalidate_page() to > .change_pte() and allow sleepable .invalidate_page() hooks. > > This however made KVM's usage of the .change_pte() callback completely > moot, because KVM unmaps the sPTEs during .invalidate_range_start() > and therefore .change_pte() has no hope of finding a sPTE to change. > Drop the generic KVM code that dispatches to kvm_set_spte_gfn(), as > well as all the architecture specific implementations. > > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> > --- > arch/arm64/kvm/mmu.c | 34 ----------------- > arch/loongarch/include/asm/kvm_host.h | 1 - > arch/loongarch/kvm/mmu.c | 32 ---------------- > arch/mips/kvm/mmu.c | 30 --------------- > arch/powerpc/include/asm/kvm_ppc.h | 1 - > arch/powerpc/kvm/book3s.c | 5 --- > arch/powerpc/kvm/book3s.h | 1 - > arch/powerpc/kvm/book3s_64_mmu_hv.c | 12 ------ > arch/powerpc/kvm/book3s_hv.c | 1 - > arch/powerpc/kvm/book3s_pr.c | 7 ---- > arch/powerpc/kvm/e500_mmu_host.c | 6 --- LGTM. Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) cheers
On Fri, Apr 05, 2024 at 07:58:12AM -0400, Paolo Bonzini wrote: > The .change_pte() MMU notifier callback was intended as an > optimization. The original point of it was that KSM could tell KVM to flip > its secondary PTE to a new location without having to first zap it. At > the time there was also an .invalidate_page() callback; both of them were > *not* bracketed by calls to mmu_notifier_invalidate_range_{start,end}(), > and .invalidate_page() also doubled as a fallback implementation of > .change_pte(). > > Later on, however, both callbacks were changed to occur within an > invalidate_range_start/end() block. > > In the case of .change_pte(), commit 6bdb913f0a70 ("mm: wrap calls to > set_pte_at_notify with invalidate_range_start and invalidate_range_end", > 2012-10-09) did so to remove the fallback from .invalidate_page() to > .change_pte() and allow sleepable .invalidate_page() hooks. > > This however made KVM's usage of the .change_pte() callback completely > moot, because KVM unmaps the sPTEs during .invalidate_range_start() > and therefore .change_pte() has no hope of finding a sPTE to change. > Drop the generic KVM code that dispatches to kvm_set_spte_gfn(), as > well as all the architecture specific implementations. Paolo, I may miss a bunch of details here (as I still remember some change_pte patches previously on the list..), however not sure whether we considered enable it? Asked because I remember Andrea used to have a custom tree maintaining that part: https://github.com/aagit/aa/commit/c761078df7a77d13ddfaeebe56a0f4bc128b1968 Maybe it can't be enabled for some reason that I overlooked in the current tree, or we just decided to not to? Thanks,
On Mon, Apr 8, 2024 at 3:56 PM Peter Xu <peterx@redhat.com> wrote: > Paolo, > > I may miss a bunch of details here (as I still remember some change_pte > patches previously on the list..), however not sure whether we considered > enable it? Asked because I remember Andrea used to have a custom tree > maintaining that part: > > https://github.com/aagit/aa/commit/c761078df7a77d13ddfaeebe56a0f4bc128b1968 The patch enables it only for KSM, so it would still require a bunch of cleanups, for example I also would still use set_pte_at() in all the places that are not KSM. This would at least fix the issue with the poor documentation of where to use set_pte_at_notify() vs set_pte_at(). With regard to the implementation, I like the idea of disabling the invalidation on the MMU notifier side, but I would rather have MMU_NOTIFIER_CHANGE_PTE as a separate field in the range instead of overloading the event field. > Maybe it can't be enabled for some reason that I overlooked in the current > tree, or we just decided to not to? I have just learnt about the patch, nobody had ever mentioned it even though it's almost 2 years old... It's a lot of code though and no one has ever reported an issue for over 10 years, so I think it's easiest to just rip the code out. Paolo > Thanks, > > -- > Peter Xu >
On Thu, Apr 11, 2024 at 06:55:44PM +0200, Paolo Bonzini wrote: > On Mon, Apr 8, 2024 at 3:56 PM Peter Xu <peterx@redhat.com> wrote: > > Paolo, > > > > I may miss a bunch of details here (as I still remember some change_pte > > patches previously on the list..), however not sure whether we considered > > enable it? Asked because I remember Andrea used to have a custom tree > > maintaining that part: > > > > https://github.com/aagit/aa/commit/c761078df7a77d13ddfaeebe56a0f4bc128b1968 > > The patch enables it only for KSM, so it would still require a bunch > of cleanups, for example I also would still use set_pte_at() in all > the places that are not KSM. This would at least fix the issue with > the poor documentation of where to use set_pte_at_notify() vs > set_pte_at(). > > With regard to the implementation, I like the idea of disabling the > invalidation on the MMU notifier side, but I would rather have > MMU_NOTIFIER_CHANGE_PTE as a separate field in the range instead of > overloading the event field. > > > Maybe it can't be enabled for some reason that I overlooked in the current > > tree, or we just decided to not to? > > I have just learnt about the patch, nobody had ever mentioned it even > though it's almost 2 years old... It's a lot of code though and no one > has ever reported an issue for over 10 years, so I think it's easiest > to just rip the code out. Right, it was pretty old and I have no idea if that was discussed or published before.. It would be better to have discussed this earlier. As long as we have a decision with that being aware and in mind, then it looks fine to me to take either way to go, and I also agree either way is better than keep the status quo. I also have Andrea copied anyway when I replied, so I guess he should be aware of this and he can chim in anytime. Thanks!
On Fri, Apr 05, 2024 at 07:58:12AM -0400, Paolo Bonzini wrote: > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c > index dc04bc767865..ff17849be9f4 100644 > --- a/arch/arm64/kvm/mmu.c > +++ b/arch/arm64/kvm/mmu.c > @@ -1768,40 +1768,6 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range) > return false; > } > > -bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range) > -{ > - kvm_pfn_t pfn = pte_pfn(range->arg.pte); > - > - if (!kvm->arch.mmu.pgt) > - return false; > - > - WARN_ON(range->end - range->start != 1); > - > - /* > - * If the page isn't tagged, defer to user_mem_abort() for sanitising > - * the MTE tags. The S2 pte should have been unmapped by > - * mmu_notifier_invalidate_range_end(). > - */ > - if (kvm_has_mte(kvm) && !page_mte_tagged(pfn_to_page(pfn))) > - return false; > - > - /* > - * We've moved a page around, probably through CoW, so let's treat > - * it just like a translation fault and the map handler will clean > - * the cache to the PoC. > - * > - * The MMU notifiers will have unmapped a huge PMD before calling > - * ->change_pte() (which in turn calls kvm_set_spte_gfn()) and > - * therefore we never need to clear out a huge PMD through this > - * calling path and a memcache is not required. > - */ > - kvm_pgtable_stage2_map(kvm->arch.mmu.pgt, range->start << PAGE_SHIFT, > - PAGE_SIZE, __pfn_to_phys(pfn), > - KVM_PGTABLE_PROT_R, NULL, 0); > - > - return false; > -} > - > bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) > { > u64 size = (range->end - range->start) << PAGE_SHIFT; Thanks. It's nice to see this code retire: Acked-by: Will Deacon <will@kernel.org> Also, if you're in the business of hacking the MMU notifier code, it would be really great to change the .clear_flush_young() callback so that the architecture could handle the TLB invalidation. At the moment, the core KVM code invalidates the whole VMID courtesy of 'flush_on_ret' being set by kvm_handle_hva_range(), whereas we could do a much lighter-weight and targetted TLBI in the architecture page-table code when we actually update the ptes for small ranges. Will
On Fri, 12 Apr 2024 11:44:09 +0100, Will Deacon <will@kernel.org> wrote: > > On Fri, Apr 05, 2024 at 07:58:12AM -0400, Paolo Bonzini wrote: > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c > > index dc04bc767865..ff17849be9f4 100644 > > --- a/arch/arm64/kvm/mmu.c > > +++ b/arch/arm64/kvm/mmu.c > > @@ -1768,40 +1768,6 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range) > > return false; > > } > > > > -bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range) > > -{ > > - kvm_pfn_t pfn = pte_pfn(range->arg.pte); > > - > > - if (!kvm->arch.mmu.pgt) > > - return false; > > - > > - WARN_ON(range->end - range->start != 1); > > - > > - /* > > - * If the page isn't tagged, defer to user_mem_abort() for sanitising > > - * the MTE tags. The S2 pte should have been unmapped by > > - * mmu_notifier_invalidate_range_end(). > > - */ > > - if (kvm_has_mte(kvm) && !page_mte_tagged(pfn_to_page(pfn))) > > - return false; > > - > > - /* > > - * We've moved a page around, probably through CoW, so let's treat > > - * it just like a translation fault and the map handler will clean > > - * the cache to the PoC. > > - * > > - * The MMU notifiers will have unmapped a huge PMD before calling > > - * ->change_pte() (which in turn calls kvm_set_spte_gfn()) and > > - * therefore we never need to clear out a huge PMD through this > > - * calling path and a memcache is not required. > > - */ > > - kvm_pgtable_stage2_map(kvm->arch.mmu.pgt, range->start << PAGE_SHIFT, > > - PAGE_SIZE, __pfn_to_phys(pfn), > > - KVM_PGTABLE_PROT_R, NULL, 0); > > - > > - return false; > > -} > > - > > bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) > > { > > u64 size = (range->end - range->start) << PAGE_SHIFT; > > Thanks. It's nice to see this code retire: > > Acked-by: Will Deacon <will@kernel.org> > > Also, if you're in the business of hacking the MMU notifier code, it > would be really great to change the .clear_flush_young() callback so > that the architecture could handle the TLB invalidation. At the moment, > the core KVM code invalidates the whole VMID courtesy of 'flush_on_ret' > being set by kvm_handle_hva_range(), whereas we could do a much > lighter-weight and targetted TLBI in the architecture page-table code > when we actually update the ptes for small ranges. Indeed, and I was looking at this earlier this week as it has a pretty devastating effect with NV (it blows the shadow S2 for that VMID, with costly consequences). In general, it feels like the TLB invalidation should stay with the code that deals with the page tables, as it has a pretty good idea of what needs to be invalidated and how -- specially on architectures that have a HW-broadcast facility like arm64. Thanks, M.
On Fri, Apr 12, 2024, Marc Zyngier wrote: > On Fri, 12 Apr 2024 11:44:09 +0100, Will Deacon <will@kernel.org> wrote: > > On Fri, Apr 05, 2024 at 07:58:12AM -0400, Paolo Bonzini wrote: > > Also, if you're in the business of hacking the MMU notifier code, it > > would be really great to change the .clear_flush_young() callback so > > that the architecture could handle the TLB invalidation. At the moment, > > the core KVM code invalidates the whole VMID courtesy of 'flush_on_ret' > > being set by kvm_handle_hva_range(), whereas we could do a much > > lighter-weight and targetted TLBI in the architecture page-table code > > when we actually update the ptes for small ranges. > > Indeed, and I was looking at this earlier this week as it has a pretty > devastating effect with NV (it blows the shadow S2 for that VMID, with > costly consequences). > > In general, it feels like the TLB invalidation should stay with the > code that deals with the page tables, as it has a pretty good idea of > what needs to be invalidated and how -- specially on architectures > that have a HW-broadcast facility like arm64. Would this be roughly on par with an in-line flush on arm64? The simpler, more straightforward solution would be to let architectures override flush_on_ret, but I would prefer something like the below as x86 can also utilize a range-based flush when running as a nested hypervisor. diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index ff0a20565f90..b65116294efe 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -601,6 +601,7 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm, struct kvm_gfn_range gfn_range; struct kvm_memory_slot *slot; struct kvm_memslots *slots; + bool need_flush = false; int i, idx; if (WARN_ON_ONCE(range->end <= range->start)) @@ -653,10 +654,22 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm, break; } r.ret |= range->handler(kvm, &gfn_range); + + /* + * Use a precise gfn-based TLB flush when possible, as + * most mmu_notifier events affect a small-ish range. + * Fall back to a full TLB flush if the gfn-based flush + * fails, and don't bother trying the gfn-based flush + * if a full flush is already pending. + */ + if (range->flush_on_ret && !need_flush && r.ret && + kvm_arch_flush_remote_tlbs_range(kvm, gfn_range.start + gfn_range.end - gfn_range.start + 1)) + need_flush = true; } } - if (range->flush_on_ret && r.ret) + if (need_flush) kvm_flush_remote_tlbs(kvm); if (r.found_memslot)
On 11.04.24 18:55, Paolo Bonzini wrote: > On Mon, Apr 8, 2024 at 3:56 PM Peter Xu <peterx@redhat.com> wrote: >> Paolo, >> >> I may miss a bunch of details here (as I still remember some change_pte >> patches previously on the list..), however not sure whether we considered >> enable it? Asked because I remember Andrea used to have a custom tree >> maintaining that part: >> >> https://github.com/aagit/aa/commit/c761078df7a77d13ddfaeebe56a0f4bc128b1968 > > The patch enables it only for KSM, so it would still require a bunch > of cleanups, for example I also would still use set_pte_at() in all > the places that are not KSM. This would at least fix the issue with > the poor documentation of where to use set_pte_at_notify() vs > set_pte_at(). > > With regard to the implementation, I like the idea of disabling the > invalidation on the MMU notifier side, but I would rather have > MMU_NOTIFIER_CHANGE_PTE as a separate field in the range instead of > overloading the event field. > >> Maybe it can't be enabled for some reason that I overlooked in the current >> tree, or we just decided to not to? > > I have just learnt about the patch, nobody had ever mentioned it even > though it's almost 2 years old... It's a lot of code though and no one I assume Andrea used it on his tree where he also has a version of "randprotect" (even included in that commit subject) to mitigate a KSM security issue that was reported by some security researchers [1] a while ago. From what I recall, the industry did not end up caring about that security issue that much. IIUC, with "randprotect" we get a lot more R/O protection even when not de-duplicating a page -- thus the name. Likely, the reporter mentioned in the commit is a researcher that played with Andreas fix for the security issue. But I'm just speculating at this point :) > has ever reported an issue for over 10 years, so I think it's easiest > to just rip the code out. Yes. Can always be readded in a possibly cleaner fashion (like you note above), when deemed necessary and we are willing to support it. [1] https://gruss.cc/files/remote_dedup.pdf
On Fri, 12 Apr 2024 15:54:22 +0100, Sean Christopherson <seanjc@google.com> wrote: > > On Fri, Apr 12, 2024, Marc Zyngier wrote: > > On Fri, 12 Apr 2024 11:44:09 +0100, Will Deacon <will@kernel.org> wrote: > > > On Fri, Apr 05, 2024 at 07:58:12AM -0400, Paolo Bonzini wrote: > > > Also, if you're in the business of hacking the MMU notifier code, it > > > would be really great to change the .clear_flush_young() callback so > > > that the architecture could handle the TLB invalidation. At the moment, > > > the core KVM code invalidates the whole VMID courtesy of 'flush_on_ret' > > > being set by kvm_handle_hva_range(), whereas we could do a much > > > lighter-weight and targetted TLBI in the architecture page-table code > > > when we actually update the ptes for small ranges. > > > > Indeed, and I was looking at this earlier this week as it has a pretty > > devastating effect with NV (it blows the shadow S2 for that VMID, with > > costly consequences). > > > > In general, it feels like the TLB invalidation should stay with the > > code that deals with the page tables, as it has a pretty good idea of > > what needs to be invalidated and how -- specially on architectures > > that have a HW-broadcast facility like arm64. > > Would this be roughly on par with an in-line flush on arm64? The simpler, more > straightforward solution would be to let architectures override flush_on_ret, > but I would prefer something like the below as x86 can also utilize a range-based > flush when running as a nested hypervisor. > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index ff0a20565f90..b65116294efe 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -601,6 +601,7 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm, > struct kvm_gfn_range gfn_range; > struct kvm_memory_slot *slot; > struct kvm_memslots *slots; > + bool need_flush = false; > int i, idx; > > if (WARN_ON_ONCE(range->end <= range->start)) > @@ -653,10 +654,22 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm, > break; > } > r.ret |= range->handler(kvm, &gfn_range); > + > + /* > + * Use a precise gfn-based TLB flush when possible, as > + * most mmu_notifier events affect a small-ish range. > + * Fall back to a full TLB flush if the gfn-based flush > + * fails, and don't bother trying the gfn-based flush > + * if a full flush is already pending. > + */ > + if (range->flush_on_ret && !need_flush && r.ret && > + kvm_arch_flush_remote_tlbs_range(kvm, gfn_range.start > + gfn_range.end - gfn_range.start + 1)) > + need_flush = true; > } > } > > - if (range->flush_on_ret && r.ret) > + if (need_flush) > kvm_flush_remote_tlbs(kvm); > > if (r.found_memslot) I think this works for us on HW that has range invalidation, which would already be a positive move. For the lesser HW that isn't range capable, it also gives the opportunity to perform the iteration ourselves or go for the nuclear option if the range is larger than some arbitrary constant (though this is additional work). But this still considers the whole range as being affected by range->handler(). It'd be interesting to try and see whether more precise tracking is (or isn't) generally beneficial. Thanks, M.
On Sat, Apr 13, 2024, Marc Zyngier wrote: > On Fri, 12 Apr 2024 15:54:22 +0100, Sean Christopherson <seanjc@google.com> wrote: > > > > On Fri, Apr 12, 2024, Marc Zyngier wrote: > > > On Fri, 12 Apr 2024 11:44:09 +0100, Will Deacon <will@kernel.org> wrote: > > > > On Fri, Apr 05, 2024 at 07:58:12AM -0400, Paolo Bonzini wrote: > > > > Also, if you're in the business of hacking the MMU notifier code, it > > > > would be really great to change the .clear_flush_young() callback so > > > > that the architecture could handle the TLB invalidation. At the moment, > > > > the core KVM code invalidates the whole VMID courtesy of 'flush_on_ret' > > > > being set by kvm_handle_hva_range(), whereas we could do a much > > > > lighter-weight and targetted TLBI in the architecture page-table code > > > > when we actually update the ptes for small ranges. > > > > > > Indeed, and I was looking at this earlier this week as it has a pretty > > > devastating effect with NV (it blows the shadow S2 for that VMID, with > > > costly consequences). > > > > > > In general, it feels like the TLB invalidation should stay with the > > > code that deals with the page tables, as it has a pretty good idea of > > > what needs to be invalidated and how -- specially on architectures > > > that have a HW-broadcast facility like arm64. > > > > Would this be roughly on par with an in-line flush on arm64? The simpler, more > > straightforward solution would be to let architectures override flush_on_ret, > > but I would prefer something like the below as x86 can also utilize a range-based > > flush when running as a nested hypervisor. ... > I think this works for us on HW that has range invalidation, which > would already be a positive move. > > For the lesser HW that isn't range capable, it also gives the > opportunity to perform the iteration ourselves or go for the nuclear > option if the range is larger than some arbitrary constant (though > this is additional work). > > But this still considers the whole range as being affected by > range->handler(). It'd be interesting to try and see whether more > precise tracking is (or isn't) generally beneficial. I assume the idea would be to let arch code do single-page invalidations of stage-2 entries for each gfn? Unless I'm having a brain fart, x86 can't make use of that functionality. Intel doesn't provide any way to do targeted invalidation of stage-2 mappings. AMD provides an instruction to do broadcast invalidations, but it takes a virtual address, i.e. a stage-1 address. I can't tell if it's a host virtual address or a guest virtual address, but it's a moot point because KVM doen't have the guest virtual address, and if it's a host virtual address, there would need to be valid mappings in the host page tables for it to work, which KVM can't guarantee.
On Mon, Apr 15, 2024 at 10:03:51AM -0700, Sean Christopherson wrote: > On Sat, Apr 13, 2024, Marc Zyngier wrote: > > On Fri, 12 Apr 2024 15:54:22 +0100, Sean Christopherson <seanjc@google.com> wrote: > > > > > > On Fri, Apr 12, 2024, Marc Zyngier wrote: > > > > On Fri, 12 Apr 2024 11:44:09 +0100, Will Deacon <will@kernel.org> wrote: > > > > > On Fri, Apr 05, 2024 at 07:58:12AM -0400, Paolo Bonzini wrote: > > > > > Also, if you're in the business of hacking the MMU notifier code, it > > > > > would be really great to change the .clear_flush_young() callback so > > > > > that the architecture could handle the TLB invalidation. At the moment, > > > > > the core KVM code invalidates the whole VMID courtesy of 'flush_on_ret' > > > > > being set by kvm_handle_hva_range(), whereas we could do a much > > > > > lighter-weight and targetted TLBI in the architecture page-table code > > > > > when we actually update the ptes for small ranges. > > > > > > > > Indeed, and I was looking at this earlier this week as it has a pretty > > > > devastating effect with NV (it blows the shadow S2 for that VMID, with > > > > costly consequences). > > > > > > > > In general, it feels like the TLB invalidation should stay with the > > > > code that deals with the page tables, as it has a pretty good idea of > > > > what needs to be invalidated and how -- specially on architectures > > > > that have a HW-broadcast facility like arm64. > > > > > > Would this be roughly on par with an in-line flush on arm64? The simpler, more > > > straightforward solution would be to let architectures override flush_on_ret, > > > but I would prefer something like the below as x86 can also utilize a range-based > > > flush when running as a nested hypervisor. > > ... > > > I think this works for us on HW that has range invalidation, which > > would already be a positive move. > > > > For the lesser HW that isn't range capable, it also gives the > > opportunity to perform the iteration ourselves or go for the nuclear > > option if the range is larger than some arbitrary constant (though > > this is additional work). > > > > But this still considers the whole range as being affected by > > range->handler(). It'd be interesting to try and see whether more > > precise tracking is (or isn't) generally beneficial. > > I assume the idea would be to let arch code do single-page invalidations of > stage-2 entries for each gfn? Right, as it's the only code which knows which ptes actually ended up being aged. > Unless I'm having a brain fart, x86 can't make use of that functionality. Intel > doesn't provide any way to do targeted invalidation of stage-2 mappings. AMD > provides an instruction to do broadcast invalidations, but it takes a virtual > address, i.e. a stage-1 address. I can't tell if it's a host virtual address or > a guest virtual address, but it's a moot point because KVM doen't have the guest > virtual address, and if it's a host virtual address, there would need to be valid > mappings in the host page tables for it to work, which KVM can't guarantee. Ah, so it sounds like it would need to be an arch opt-in then. Will
On Thu, Apr 18, 2024, Will Deacon wrote: > On Mon, Apr 15, 2024 at 10:03:51AM -0700, Sean Christopherson wrote: > > On Sat, Apr 13, 2024, Marc Zyngier wrote: > > > On Fri, 12 Apr 2024 15:54:22 +0100, Sean Christopherson <seanjc@google.com> wrote: > > > > > > > > On Fri, Apr 12, 2024, Marc Zyngier wrote: > > > > > On Fri, 12 Apr 2024 11:44:09 +0100, Will Deacon <will@kernel.org> wrote: > > > > > > On Fri, Apr 05, 2024 at 07:58:12AM -0400, Paolo Bonzini wrote: > > > > > > Also, if you're in the business of hacking the MMU notifier code, it > > > > > > would be really great to change the .clear_flush_young() callback so > > > > > > that the architecture could handle the TLB invalidation. At the moment, > > > > > > the core KVM code invalidates the whole VMID courtesy of 'flush_on_ret' > > > > > > being set by kvm_handle_hva_range(), whereas we could do a much > > > > > > lighter-weight and targetted TLBI in the architecture page-table code > > > > > > when we actually update the ptes for small ranges. > > > > > > > > > > Indeed, and I was looking at this earlier this week as it has a pretty > > > > > devastating effect with NV (it blows the shadow S2 for that VMID, with > > > > > costly consequences). > > > > > > > > > > In general, it feels like the TLB invalidation should stay with the > > > > > code that deals with the page tables, as it has a pretty good idea of > > > > > what needs to be invalidated and how -- specially on architectures > > > > > that have a HW-broadcast facility like arm64. > > > > > > > > Would this be roughly on par with an in-line flush on arm64? The simpler, more > > > > straightforward solution would be to let architectures override flush_on_ret, > > > > but I would prefer something like the below as x86 can also utilize a range-based > > > > flush when running as a nested hypervisor. > > > > ... > > > > > I think this works for us on HW that has range invalidation, which > > > would already be a positive move. > > > > > > For the lesser HW that isn't range capable, it also gives the > > > opportunity to perform the iteration ourselves or go for the nuclear > > > option if the range is larger than some arbitrary constant (though > > > this is additional work). > > > > > > But this still considers the whole range as being affected by > > > range->handler(). It'd be interesting to try and see whether more > > > precise tracking is (or isn't) generally beneficial. > > > > I assume the idea would be to let arch code do single-page invalidations of > > stage-2 entries for each gfn? > > Right, as it's the only code which knows which ptes actually ended up > being aged. > > > Unless I'm having a brain fart, x86 can't make use of that functionality. Intel > > doesn't provide any way to do targeted invalidation of stage-2 mappings. AMD > > provides an instruction to do broadcast invalidations, but it takes a virtual > > address, i.e. a stage-1 address. I can't tell if it's a host virtual address or > > a guest virtual address, but it's a moot point because KVM doen't have the guest > > virtual address, and if it's a host virtual address, there would need to be valid > > mappings in the host page tables for it to work, which KVM can't guarantee. > > Ah, so it sounds like it would need to be an arch opt-in then. Even if x86 (or some other arch code) could use the precise tracking, I think it would make sense to have the behavior be arch specific. Adding infrastructure to get information from arch code, only to turn around and give it back to arch code would be odd. Unless arm64 can't do the invalidation immediately after aging the stage-2 PTE, the best/easiest solution would be to let arm64 opt out of the common TLB flush when a SPTE is made young. With the range-based flushing bundled in, this? --- include/linux/kvm_host.h | 2 ++ virt/kvm/kvm_main.c | 40 +++++++++++++++++++++++++--------------- 2 files changed, 27 insertions(+), 15 deletions(-) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index afbc99264ffa..8fe5f5e16919 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -2010,6 +2010,8 @@ extern const struct kvm_stats_header kvm_vcpu_stats_header; extern const struct _kvm_stats_desc kvm_vcpu_stats_desc[]; #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER +int kvm_arch_flush_tlb_if_young(void); + static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq) { if (unlikely(kvm->mmu_invalidate_in_progress)) diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 38b498669ef9..5ebef8ef239c 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -595,6 +595,11 @@ static void kvm_null_fn(void) } #define IS_KVM_NULL_FN(fn) ((fn) == (void *)kvm_null_fn) +int __weak kvm_arch_flush_tlb_if_young(void) +{ + return true; +} + /* Iterate over each memslot intersecting [start, last] (inclusive) range */ #define kvm_for_each_memslot_in_hva_range(node, slots, start, last) \ for (node = interval_tree_iter_first(&slots->hva_tree, start, last); \ @@ -611,6 +616,7 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm, struct kvm_gfn_range gfn_range; struct kvm_memory_slot *slot; struct kvm_memslots *slots; + bool need_flush = false; int i, idx; if (WARN_ON_ONCE(range->end <= range->start)) @@ -663,10 +669,22 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm, break; } r.ret |= range->handler(kvm, &gfn_range); + + /* + * Use a precise gfn-based TLB flush when possible, as + * most mmu_notifier events affect a small-ish range. + * Fall back to a full TLB flush if the gfn-based flush + * fails, and don't bother trying the gfn-based flush + * if a full flush is already pending. + */ + if (range->flush_on_ret && !need_flush && r.ret && + kvm_arch_flush_remote_tlbs_range(kvm, gfn_range.start, + gfn_range.end - gfn_range.start + 1)) + need_flush = true; } } - if (range->flush_on_ret && r.ret) + if (need_flush) kvm_flush_remote_tlbs(kvm); if (r.found_memslot) @@ -680,7 +698,8 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm, static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn, unsigned long start, unsigned long end, - gfn_handler_t handler) + gfn_handler_t handler, + bool flush_on_ret) { struct kvm *kvm = mmu_notifier_to_kvm(mn); const struct kvm_mmu_notifier_range range = { @@ -688,7 +707,7 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn, .end = end, .handler = handler, .on_lock = (void *)kvm_null_fn, - .flush_on_ret = true, + .flush_on_ret = flush_on_ret, .may_block = false, }; @@ -700,17 +719,7 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn unsigned long end, gfn_handler_t handler) { - struct kvm *kvm = mmu_notifier_to_kvm(mn); - const struct kvm_mmu_notifier_range range = { - .start = start, - .end = end, - .handler = handler, - .on_lock = (void *)kvm_null_fn, - .flush_on_ret = false, - .may_block = false, - }; - - return __kvm_handle_hva_range(kvm, &range).ret; + return kvm_handle_hva_range(mn, start, end, handler, false); } void kvm_mmu_invalidate_begin(struct kvm *kvm) @@ -876,7 +885,8 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn, { trace_kvm_age_hva(start, end); - return kvm_handle_hva_range(mn, start, end, kvm_age_gfn); + return kvm_handle_hva_range(mn, start, end, kvm_age_gfn, + kvm_arch_flush_tlb_if_young()); } static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn, base-commit: eae53272c8ad4e7ed2bbb11bd0456eb5b0484f0c --
On Thu, Apr 18, 2024 at 12:53:26PM -0700, Sean Christopherson wrote: > On Thu, Apr 18, 2024, Will Deacon wrote: > > On Mon, Apr 15, 2024 at 10:03:51AM -0700, Sean Christopherson wrote: > > > On Sat, Apr 13, 2024, Marc Zyngier wrote: > > > > On Fri, 12 Apr 2024 15:54:22 +0100, Sean Christopherson <seanjc@google.com> wrote: > > > > > > > > > > On Fri, Apr 12, 2024, Marc Zyngier wrote: > > > > > > On Fri, 12 Apr 2024 11:44:09 +0100, Will Deacon <will@kernel.org> wrote: > > > > > > > On Fri, Apr 05, 2024 at 07:58:12AM -0400, Paolo Bonzini wrote: > > > > > > > Also, if you're in the business of hacking the MMU notifier code, it > > > > > > > would be really great to change the .clear_flush_young() callback so > > > > > > > that the architecture could handle the TLB invalidation. At the moment, > > > > > > > the core KVM code invalidates the whole VMID courtesy of 'flush_on_ret' > > > > > > > being set by kvm_handle_hva_range(), whereas we could do a much > > > > > > > lighter-weight and targetted TLBI in the architecture page-table code > > > > > > > when we actually update the ptes for small ranges. > > > > > > > > > > > > Indeed, and I was looking at this earlier this week as it has a pretty > > > > > > devastating effect with NV (it blows the shadow S2 for that VMID, with > > > > > > costly consequences). > > > > > > > > > > > > In general, it feels like the TLB invalidation should stay with the > > > > > > code that deals with the page tables, as it has a pretty good idea of > > > > > > what needs to be invalidated and how -- specially on architectures > > > > > > that have a HW-broadcast facility like arm64. > > > > > > > > > > Would this be roughly on par with an in-line flush on arm64? The simpler, more > > > > > straightforward solution would be to let architectures override flush_on_ret, > > > > > but I would prefer something like the below as x86 can also utilize a range-based > > > > > flush when running as a nested hypervisor. > > > > > > ... > > > > > > > I think this works for us on HW that has range invalidation, which > > > > would already be a positive move. > > > > > > > > For the lesser HW that isn't range capable, it also gives the > > > > opportunity to perform the iteration ourselves or go for the nuclear > > > > option if the range is larger than some arbitrary constant (though > > > > this is additional work). > > > > > > > > But this still considers the whole range as being affected by > > > > range->handler(). It'd be interesting to try and see whether more > > > > precise tracking is (or isn't) generally beneficial. > > > > > > I assume the idea would be to let arch code do single-page invalidations of > > > stage-2 entries for each gfn? > > > > Right, as it's the only code which knows which ptes actually ended up > > being aged. > > > > > Unless I'm having a brain fart, x86 can't make use of that functionality. Intel > > > doesn't provide any way to do targeted invalidation of stage-2 mappings. AMD > > > provides an instruction to do broadcast invalidations, but it takes a virtual > > > address, i.e. a stage-1 address. I can't tell if it's a host virtual address or > > > a guest virtual address, but it's a moot point because KVM doen't have the guest > > > virtual address, and if it's a host virtual address, there would need to be valid > > > mappings in the host page tables for it to work, which KVM can't guarantee. > > > > Ah, so it sounds like it would need to be an arch opt-in then. > > Even if x86 (or some other arch code) could use the precise tracking, I think it > would make sense to have the behavior be arch specific. Adding infrastructure > to get information from arch code, only to turn around and give it back to arch > code would be odd. Sorry, yes, that's what I had in mind. Basically, a way for the arch code to say "I've handled the TLBI, don't worry about it." > Unless arm64 can't do the invalidation immediately after aging the stage-2 PTE, > the best/easiest solution would be to let arm64 opt out of the common TLB flush > when a SPTE is made young. > > With the range-based flushing bundled in, this? > > --- > include/linux/kvm_host.h | 2 ++ > virt/kvm/kvm_main.c | 40 +++++++++++++++++++++++++--------------- > 2 files changed, 27 insertions(+), 15 deletions(-) > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > index afbc99264ffa..8fe5f5e16919 100644 > --- a/include/linux/kvm_host.h > +++ b/include/linux/kvm_host.h > @@ -2010,6 +2010,8 @@ extern const struct kvm_stats_header kvm_vcpu_stats_header; > extern const struct _kvm_stats_desc kvm_vcpu_stats_desc[]; > > #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER > +int kvm_arch_flush_tlb_if_young(void); > + > static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq) > { > if (unlikely(kvm->mmu_invalidate_in_progress)) > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index 38b498669ef9..5ebef8ef239c 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -595,6 +595,11 @@ static void kvm_null_fn(void) > } > #define IS_KVM_NULL_FN(fn) ((fn) == (void *)kvm_null_fn) > > +int __weak kvm_arch_flush_tlb_if_young(void) > +{ > + return true; > +} I tend to find __weak functions a little ugly, but I think the gist of the diff looks good to me. Thanks for putting it together! > + > /* Iterate over each memslot intersecting [start, last] (inclusive) range */ > #define kvm_for_each_memslot_in_hva_range(node, slots, start, last) \ > for (node = interval_tree_iter_first(&slots->hva_tree, start, last); \ > @@ -611,6 +616,7 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm, > struct kvm_gfn_range gfn_range; > struct kvm_memory_slot *slot; > struct kvm_memslots *slots; > + bool need_flush = false; > int i, idx; > > if (WARN_ON_ONCE(range->end <= range->start)) > @@ -663,10 +669,22 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm, > break; > } > r.ret |= range->handler(kvm, &gfn_range); > + > + /* > + * Use a precise gfn-based TLB flush when possible, as > + * most mmu_notifier events affect a small-ish range. > + * Fall back to a full TLB flush if the gfn-based flush > + * fails, and don't bother trying the gfn-based flush > + * if a full flush is already pending. > + */ > + if (range->flush_on_ret && !need_flush && r.ret && > + kvm_arch_flush_remote_tlbs_range(kvm, gfn_range.start, > + gfn_range.end - gfn_range.start + 1)) What's that '+ 1' needed for here? Will
On Fri, Apr 19, 2024, Will Deacon wrote: > > @@ -663,10 +669,22 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm, > > break; > > } > > r.ret |= range->handler(kvm, &gfn_range); > > + > > + /* > > + * Use a precise gfn-based TLB flush when possible, as > > + * most mmu_notifier events affect a small-ish range. > > + * Fall back to a full TLB flush if the gfn-based flush > > + * fails, and don't bother trying the gfn-based flush > > + * if a full flush is already pending. > > + */ > > + if (range->flush_on_ret && !need_flush && r.ret && > > + kvm_arch_flush_remote_tlbs_range(kvm, gfn_range.start, > > + gfn_range.end - gfn_range.start + 1)) > > What's that '+ 1' needed for here? (a) To see if you're paying attention. (b) Because more is always better. (c) Because math is hard. (d) Because I haven't tested this. (e) Both (c) and (d).
On Fri, Apr 19, 2024, Will Deacon wrote: > On Thu, Apr 18, 2024 at 12:53:26PM -0700, Sean Christopherson wrote: > > On Thu, Apr 18, 2024, Will Deacon wrote: > > > > I assume the idea would be to let arch code do single-page invalidations of > > > > stage-2 entries for each gfn? > > > > > > Right, as it's the only code which knows which ptes actually ended up > > > being aged. > > > > > > > Unless I'm having a brain fart, x86 can't make use of that functionality. Intel > > > > doesn't provide any way to do targeted invalidation of stage-2 mappings. AMD > > > > provides an instruction to do broadcast invalidations, but it takes a virtual > > > > address, i.e. a stage-1 address. I can't tell if it's a host virtual address or > > > > a guest virtual address, but it's a moot point because KVM doen't have the guest > > > > virtual address, and if it's a host virtual address, there would need to be valid > > > > mappings in the host page tables for it to work, which KVM can't guarantee. > > > > > > Ah, so it sounds like it would need to be an arch opt-in then. > > > > Even if x86 (or some other arch code) could use the precise tracking, I think it > > would make sense to have the behavior be arch specific. Adding infrastructure > > to get information from arch code, only to turn around and give it back to arch > > code would be odd. > > Sorry, yes, that's what I had in mind. Basically, a way for the arch code > to say "I've handled the TLBI, don't worry about it." > > > Unless arm64 can't do the invalidation immediately after aging the stage-2 PTE, > > the best/easiest solution would be to let arm64 opt out of the common TLB flush > > when a SPTE is made young. > > > > With the range-based flushing bundled in, this? > > > > --- > > include/linux/kvm_host.h | 2 ++ > > virt/kvm/kvm_main.c | 40 +++++++++++++++++++++++++--------------- > > 2 files changed, 27 insertions(+), 15 deletions(-) > > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > > index afbc99264ffa..8fe5f5e16919 100644 > > --- a/include/linux/kvm_host.h > > +++ b/include/linux/kvm_host.h > > @@ -2010,6 +2010,8 @@ extern const struct kvm_stats_header kvm_vcpu_stats_header; > > extern const struct _kvm_stats_desc kvm_vcpu_stats_desc[]; > > > > #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER > > +int kvm_arch_flush_tlb_if_young(void); > > + > > static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq) > > { > > if (unlikely(kvm->mmu_invalidate_in_progress)) > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > > index 38b498669ef9..5ebef8ef239c 100644 > > --- a/virt/kvm/kvm_main.c > > +++ b/virt/kvm/kvm_main.c > > @@ -595,6 +595,11 @@ static void kvm_null_fn(void) > > } > > #define IS_KVM_NULL_FN(fn) ((fn) == (void *)kvm_null_fn) > > > > +int __weak kvm_arch_flush_tlb_if_young(void) > > +{ > > + return true; > > +} > > I tend to find __weak functions a little ugly, but I think the gist of the > diff looks good to me. Thanks for putting it together! Circling back to this, I don't think we should pursue this specific tweak, at least not without hard data for a concrete use case. The clear_flush_young() hook is the only callback that overloads the return value, e.g. for invalidate_range_start(), arch code can simply return false if the flush has already been performed. And clear_flush_young() _always_ operates on a single page, i.e. the range will only ever cover a single page in the primary MMU. It's obviously possible that KVM's MMU has mapped a transparent hugepage using multiple smaller pages, but that should be relatively uncommon, and probably not worth optimizing for.
On Wed, Jun 12, 2024, Sean Christopherson wrote: > On Fri, Apr 19, 2024, Will Deacon wrote: > > On Thu, Apr 18, 2024 at 12:53:26PM -0700, Sean Christopherson wrote: > > > On Thu, Apr 18, 2024, Will Deacon wrote: > > > > > I assume the idea would be to let arch code do single-page invalidations of > > > > > stage-2 entries for each gfn? > > > > > > > > Right, as it's the only code which knows which ptes actually ended up > > > > being aged. > > > > > > > > > Unless I'm having a brain fart, x86 can't make use of that functionality. Intel > > > > > doesn't provide any way to do targeted invalidation of stage-2 mappings. AMD > > > > > provides an instruction to do broadcast invalidations, but it takes a virtual > > > > > address, i.e. a stage-1 address. I can't tell if it's a host virtual address or > > > > > a guest virtual address, but it's a moot point because KVM doen't have the guest > > > > > virtual address, and if it's a host virtual address, there would need to be valid > > > > > mappings in the host page tables for it to work, which KVM can't guarantee. > > > > > > > > Ah, so it sounds like it would need to be an arch opt-in then. > > > > > > Even if x86 (or some other arch code) could use the precise tracking, I think it > > > would make sense to have the behavior be arch specific. Adding infrastructure > > > to get information from arch code, only to turn around and give it back to arch > > > code would be odd. > > > > Sorry, yes, that's what I had in mind. Basically, a way for the arch code > > to say "I've handled the TLBI, don't worry about it." > > > > > Unless arm64 can't do the invalidation immediately after aging the stage-2 PTE, > > > the best/easiest solution would be to let arm64 opt out of the common TLB flush > > > when a SPTE is made young. > > > > > > With the range-based flushing bundled in, this? > > > > > > --- > > > include/linux/kvm_host.h | 2 ++ > > > virt/kvm/kvm_main.c | 40 +++++++++++++++++++++++++--------------- > > > 2 files changed, 27 insertions(+), 15 deletions(-) > > > > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > > > index afbc99264ffa..8fe5f5e16919 100644 > > > --- a/include/linux/kvm_host.h > > > +++ b/include/linux/kvm_host.h > > > @@ -2010,6 +2010,8 @@ extern const struct kvm_stats_header kvm_vcpu_stats_header; > > > extern const struct _kvm_stats_desc kvm_vcpu_stats_desc[]; > > > > > > #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER > > > +int kvm_arch_flush_tlb_if_young(void); > > > + > > > static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq) > > > { > > > if (unlikely(kvm->mmu_invalidate_in_progress)) > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > > > index 38b498669ef9..5ebef8ef239c 100644 > > > --- a/virt/kvm/kvm_main.c > > > +++ b/virt/kvm/kvm_main.c > > > @@ -595,6 +595,11 @@ static void kvm_null_fn(void) > > > } > > > #define IS_KVM_NULL_FN(fn) ((fn) == (void *)kvm_null_fn) > > > > > > +int __weak kvm_arch_flush_tlb_if_young(void) > > > +{ > > > + return true; > > > +} > > > > I tend to find __weak functions a little ugly, but I think the gist of the > > diff looks good to me. Thanks for putting it together! > > Circling back to this, I don't think we should pursue this specific tweak, at > least not without hard data for a concrete use case. Ha, I spoke too soon. Based on the learning from the KVM+MGLRU thread[*], it looks like KVM should omit the TLB flush when aging pages whenever possible. If that's not doable on all architectures for whatever reason, then something like this is probably the way to go. [*] https://lore.kernel.org/all/CAOUHufYCmYNngmS=rOSAQRB0N9ai+mA0aDrB9RopBvPHEK42Ng@mail.gmail.com
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index dc04bc767865..ff17849be9f4 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -1768,40 +1768,6 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range) return false; } -bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range) -{ - kvm_pfn_t pfn = pte_pfn(range->arg.pte); - - if (!kvm->arch.mmu.pgt) - return false; - - WARN_ON(range->end - range->start != 1); - - /* - * If the page isn't tagged, defer to user_mem_abort() for sanitising - * the MTE tags. The S2 pte should have been unmapped by - * mmu_notifier_invalidate_range_end(). - */ - if (kvm_has_mte(kvm) && !page_mte_tagged(pfn_to_page(pfn))) - return false; - - /* - * We've moved a page around, probably through CoW, so let's treat - * it just like a translation fault and the map handler will clean - * the cache to the PoC. - * - * The MMU notifiers will have unmapped a huge PMD before calling - * ->change_pte() (which in turn calls kvm_set_spte_gfn()) and - * therefore we never need to clear out a huge PMD through this - * calling path and a memcache is not required. - */ - kvm_pgtable_stage2_map(kvm->arch.mmu.pgt, range->start << PAGE_SHIFT, - PAGE_SIZE, __pfn_to_phys(pfn), - KVM_PGTABLE_PROT_R, NULL, 0); - - return false; -} - bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) { u64 size = (range->end - range->start) << PAGE_SHIFT; diff --git a/arch/loongarch/include/asm/kvm_host.h b/arch/loongarch/include/asm/kvm_host.h index 2d62f7b0d377..69305441f40d 100644 --- a/arch/loongarch/include/asm/kvm_host.h +++ b/arch/loongarch/include/asm/kvm_host.h @@ -203,7 +203,6 @@ void kvm_flush_tlb_all(void); void kvm_flush_tlb_gpa(struct kvm_vcpu *vcpu, unsigned long gpa); int kvm_handle_mm_fault(struct kvm_vcpu *vcpu, unsigned long badv, bool write); -void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte); int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end, bool blockable); int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end); int kvm_test_age_hva(struct kvm *kvm, unsigned long hva); diff --git a/arch/loongarch/kvm/mmu.c b/arch/loongarch/kvm/mmu.c index a556cff35740..98883aa23ab8 100644 --- a/arch/loongarch/kvm/mmu.c +++ b/arch/loongarch/kvm/mmu.c @@ -494,38 +494,6 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range) range->end << PAGE_SHIFT, &ctx); } -bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range) -{ - unsigned long prot_bits; - kvm_pte_t *ptep; - kvm_pfn_t pfn = pte_pfn(range->arg.pte); - gpa_t gpa = range->start << PAGE_SHIFT; - - ptep = kvm_populate_gpa(kvm, NULL, gpa, 0); - if (!ptep) - return false; - - /* Replacing an absent or old page doesn't need flushes */ - if (!kvm_pte_present(NULL, ptep) || !kvm_pte_young(*ptep)) { - kvm_set_pte(ptep, 0); - return false; - } - - /* Fill new pte if write protected or page migrated */ - prot_bits = _PAGE_PRESENT | __READABLE; - prot_bits |= _CACHE_MASK & pte_val(range->arg.pte); - - /* - * Set _PAGE_WRITE or _PAGE_DIRTY iff old and new pte both support - * _PAGE_WRITE for map_page_fast if next page write fault - * _PAGE_DIRTY since gpa has already recorded as dirty page - */ - prot_bits |= __WRITEABLE & *ptep & pte_val(range->arg.pte); - kvm_set_pte(ptep, kvm_pfn_pte(pfn, __pgprot(prot_bits))); - - return true; -} - bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) { kvm_ptw_ctx ctx; diff --git a/arch/mips/kvm/mmu.c b/arch/mips/kvm/mmu.c index 467ee6b95ae1..c17157e700c0 100644 --- a/arch/mips/kvm/mmu.c +++ b/arch/mips/kvm/mmu.c @@ -444,36 +444,6 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range) return true; } -bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range) -{ - gpa_t gpa = range->start << PAGE_SHIFT; - pte_t hva_pte = range->arg.pte; - pte_t *gpa_pte = kvm_mips_pte_for_gpa(kvm, NULL, gpa); - pte_t old_pte; - - if (!gpa_pte) - return false; - - /* Mapping may need adjusting depending on memslot flags */ - old_pte = *gpa_pte; - if (range->slot->flags & KVM_MEM_LOG_DIRTY_PAGES && !pte_dirty(old_pte)) - hva_pte = pte_mkclean(hva_pte); - else if (range->slot->flags & KVM_MEM_READONLY) - hva_pte = pte_wrprotect(hva_pte); - - set_pte(gpa_pte, hva_pte); - - /* Replacing an absent or old page doesn't need flushes */ - if (!pte_present(old_pte) || !pte_young(old_pte)) - return false; - - /* Pages swapped, aged, moved, or cleaned require flushes */ - return !pte_present(hva_pte) || - !pte_young(hva_pte) || - pte_pfn(old_pte) != pte_pfn(hva_pte) || - (pte_dirty(old_pte) && !pte_dirty(hva_pte)); -} - bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) { return kvm_mips_mkold_gpa_pt(kvm, range->start, range->end); diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index 3281215097cc..ca3829d47ab7 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -287,7 +287,6 @@ struct kvmppc_ops { bool (*unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range); bool (*age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range); bool (*test_age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range); - bool (*set_spte_gfn)(struct kvm *kvm, struct kvm_gfn_range *range); void (*free_memslot)(struct kvm_memory_slot *slot); int (*init_vm)(struct kvm *kvm); void (*destroy_vm)(struct kvm *kvm); diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c index 8acec144120e..0d0624088e6b 100644 --- a/arch/powerpc/kvm/book3s.c +++ b/arch/powerpc/kvm/book3s.c @@ -899,11 +899,6 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) return kvm->arch.kvm_ops->test_age_gfn(kvm, range); } -bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range) -{ - return kvm->arch.kvm_ops->set_spte_gfn(kvm, range); -} - int kvmppc_core_init_vm(struct kvm *kvm) { diff --git a/arch/powerpc/kvm/book3s.h b/arch/powerpc/kvm/book3s.h index 58391b4b32ed..4aa2ab89afbc 100644 --- a/arch/powerpc/kvm/book3s.h +++ b/arch/powerpc/kvm/book3s.h @@ -12,7 +12,6 @@ extern void kvmppc_core_flush_memslot_hv(struct kvm *kvm, extern bool kvm_unmap_gfn_range_hv(struct kvm *kvm, struct kvm_gfn_range *range); extern bool kvm_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range); extern bool kvm_test_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range); -extern bool kvm_set_spte_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range); extern int kvmppc_mmu_init_pr(struct kvm_vcpu *vcpu); extern void kvmppc_mmu_destroy_pr(struct kvm_vcpu *vcpu); diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c index 2b1f0cdd8c18..1b51b1c4713b 100644 --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c @@ -1010,18 +1010,6 @@ bool kvm_test_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range) return kvm_test_age_rmapp(kvm, range->slot, range->start); } -bool kvm_set_spte_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range) -{ - WARN_ON(range->start + 1 != range->end); - - if (kvm_is_radix(kvm)) - kvm_unmap_radix(kvm, range->slot, range->start); - else - kvm_unmap_rmapp(kvm, range->slot, range->start); - - return false; -} - static int vcpus_running(struct kvm *kvm) { return atomic_read(&kvm->arch.vcpus_running) != 0; diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c index 8e86eb577eb8..35cb014a0c51 100644 --- a/arch/powerpc/kvm/book3s_hv.c +++ b/arch/powerpc/kvm/book3s_hv.c @@ -6364,7 +6364,6 @@ static struct kvmppc_ops kvm_ops_hv = { .unmap_gfn_range = kvm_unmap_gfn_range_hv, .age_gfn = kvm_age_gfn_hv, .test_age_gfn = kvm_test_age_gfn_hv, - .set_spte_gfn = kvm_set_spte_gfn_hv, .free_memslot = kvmppc_core_free_memslot_hv, .init_vm = kvmppc_core_init_vm_hv, .destroy_vm = kvmppc_core_destroy_vm_hv, diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c index 5b92619a05fd..a7d7137ea0c8 100644 --- a/arch/powerpc/kvm/book3s_pr.c +++ b/arch/powerpc/kvm/book3s_pr.c @@ -461,12 +461,6 @@ static bool kvm_test_age_gfn_pr(struct kvm *kvm, struct kvm_gfn_range *range) return false; } -static bool kvm_set_spte_gfn_pr(struct kvm *kvm, struct kvm_gfn_range *range) -{ - /* The page will get remapped properly on its next fault */ - return do_kvm_unmap_gfn(kvm, range); -} - /*****************************************/ static void kvmppc_set_msr_pr(struct kvm_vcpu *vcpu, u64 msr) @@ -2071,7 +2065,6 @@ static struct kvmppc_ops kvm_ops_pr = { .unmap_gfn_range = kvm_unmap_gfn_range_pr, .age_gfn = kvm_age_gfn_pr, .test_age_gfn = kvm_test_age_gfn_pr, - .set_spte_gfn = kvm_set_spte_gfn_pr, .free_memslot = kvmppc_core_free_memslot_pr, .init_vm = kvmppc_core_init_vm_pr, .destroy_vm = kvmppc_core_destroy_vm_pr, diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c index ccb8f16ffe41..c664fdec75b1 100644 --- a/arch/powerpc/kvm/e500_mmu_host.c +++ b/arch/powerpc/kvm/e500_mmu_host.c @@ -747,12 +747,6 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) return false; } -bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range) -{ - /* The page will get remapped properly on its next fault */ - return kvm_e500_mmu_unmap_gfn(kvm, range); -} - /*****************************************/ int e500_mmu_host_init(struct kvmppc_vcpu_e500 *vcpu_e500) diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c index a9e2fd7245e1..b63650f9b966 100644 --- a/arch/riscv/kvm/mmu.c +++ b/arch/riscv/kvm/mmu.c @@ -550,26 +550,6 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range) return false; } -bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range) -{ - int ret; - kvm_pfn_t pfn = pte_pfn(range->arg.pte); - - if (!kvm->arch.pgd) - return false; - - WARN_ON(range->end - range->start != 1); - - ret = gstage_map_page(kvm, NULL, range->start << PAGE_SHIFT, - __pfn_to_phys(pfn), PAGE_SIZE, true, true); - if (ret) { - kvm_debug("Failed to map G-stage page (error %d)\n", ret); - return true; - } - - return false; -} - bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) { pte_t *ptep; diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 0049d49aa913..87ba2a9da196 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -432,8 +432,8 @@ static u64 __update_clear_spte_slow(u64 *sptep, u64 spte) * The idea using the light way get the spte on x86_32 guest is from * gup_get_pte (mm/gup.c). * - * An spte tlb flush may be pending, because kvm_set_pte_rmap - * coalesces them and we are running out of the MMU lock. Therefore + * An spte tlb flush may be pending, because they are coalesced and + * we are running out of the MMU lock. Therefore * we need to protect against in-progress updates of the spte. * * Reading the spte while an update is in progress may get the old value @@ -1454,43 +1454,6 @@ static bool kvm_zap_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head, return __kvm_zap_rmap(kvm, rmap_head, slot); } -static bool kvm_set_pte_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head, - struct kvm_memory_slot *slot, gfn_t gfn, int level, - pte_t pte) -{ - u64 *sptep; - struct rmap_iterator iter; - bool need_flush = false; - u64 new_spte; - kvm_pfn_t new_pfn; - - WARN_ON_ONCE(pte_huge(pte)); - new_pfn = pte_pfn(pte); - -restart: - for_each_rmap_spte(rmap_head, &iter, sptep) { - need_flush = true; - - if (pte_write(pte)) { - kvm_zap_one_rmap_spte(kvm, rmap_head, sptep); - goto restart; - } else { - new_spte = kvm_mmu_changed_pte_notifier_make_spte( - *sptep, new_pfn); - - mmu_spte_clear_track_bits(kvm, sptep); - mmu_spte_set(sptep, new_spte); - } - } - - if (need_flush && kvm_available_flush_remote_tlbs_range()) { - kvm_flush_remote_tlbs_gfn(kvm, gfn, level); - return false; - } - - return need_flush; -} - struct slot_rmap_walk_iterator { /* input fields. */ const struct kvm_memory_slot *slot; @@ -1596,19 +1559,6 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range) return flush; } -bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range) -{ - bool flush = false; - - if (kvm_memslots_have_rmaps(kvm)) - flush = kvm_handle_gfn_range(kvm, range, kvm_set_pte_rmap); - - if (tdp_mmu_enabled) - flush |= kvm_tdp_mmu_set_spte_gfn(kvm, range); - - return flush; -} - static bool kvm_age_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head, struct kvm_memory_slot *slot, gfn_t gfn, int level, pte_t unused) diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c index 318135daf685..283af5b90016 100644 --- a/arch/x86/kvm/mmu/spte.c +++ b/arch/x86/kvm/mmu/spte.c @@ -322,22 +322,6 @@ u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled) return spte; } -u64 kvm_mmu_changed_pte_notifier_make_spte(u64 old_spte, kvm_pfn_t new_pfn) -{ - u64 new_spte; - - new_spte = old_spte & ~SPTE_BASE_ADDR_MASK; - new_spte |= (u64)new_pfn << PAGE_SHIFT; - - new_spte &= ~PT_WRITABLE_MASK; - new_spte &= ~shadow_host_writable_mask; - new_spte &= ~shadow_mmu_writable_mask; - - new_spte = mark_spte_for_access_track(new_spte); - - return new_spte; -} - u64 mark_spte_for_access_track(u64 spte) { if (spte_ad_enabled(spte)) diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h index 1a163aee9ec6..92da4c419171 100644 --- a/arch/x86/kvm/mmu/spte.h +++ b/arch/x86/kvm/mmu/spte.h @@ -511,8 +511,6 @@ static inline u64 restore_acc_track_spte(u64 spte) return spte; } -u64 kvm_mmu_changed_pte_notifier_make_spte(u64 old_spte, kvm_pfn_t new_pfn); - void __init kvm_mmu_spte_module_init(void); void kvm_mmu_reset_all_pte_masks(void); diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 3627744fcab6..fbb86932b766 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -1250,52 +1250,6 @@ bool kvm_tdp_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) return kvm_tdp_mmu_handle_gfn(kvm, range, test_age_gfn); } -static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter, - struct kvm_gfn_range *range) -{ - u64 new_spte; - - /* Huge pages aren't expected to be modified without first being zapped. */ - WARN_ON_ONCE(pte_huge(range->arg.pte) || range->start + 1 != range->end); - - if (iter->level != PG_LEVEL_4K || - !is_shadow_present_pte(iter->old_spte)) - return false; - - /* - * Note, when changing a read-only SPTE, it's not strictly necessary to - * zero the SPTE before setting the new PFN, but doing so preserves the - * invariant that the PFN of a present * leaf SPTE can never change. - * See handle_changed_spte(). - */ - tdp_mmu_iter_set_spte(kvm, iter, SHADOW_NONPRESENT_VALUE); - - if (!pte_write(range->arg.pte)) { - new_spte = kvm_mmu_changed_pte_notifier_make_spte(iter->old_spte, - pte_pfn(range->arg.pte)); - - tdp_mmu_iter_set_spte(kvm, iter, new_spte); - } - - return true; -} - -/* - * Handle the changed_pte MMU notifier for the TDP MMU. - * data is a pointer to the new pte_t mapping the HVA specified by the MMU - * notifier. - * Returns non-zero if a flush is needed before releasing the MMU lock. - */ -bool kvm_tdp_mmu_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range) -{ - /* - * No need to handle the remote TLB flush under RCU protection, the - * target SPTE _must_ be a leaf SPTE, i.e. cannot result in freeing a - * shadow page. See the WARN on pfn_changed in handle_changed_spte(). - */ - return kvm_tdp_mmu_handle_gfn(kvm, range, set_spte_gfn); -} - /* * Remove write access from all SPTEs at or above min_level that map GFNs * [start, end). Returns true if an SPTE has been changed and the TLBs need to diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h index 6e1ea04ca885..58b55e61bd33 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.h +++ b/arch/x86/kvm/mmu/tdp_mmu.h @@ -31,7 +31,6 @@ bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range, bool flush); bool kvm_tdp_mmu_age_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range); bool kvm_tdp_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range); -bool kvm_tdp_mmu_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range); bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm, const struct kvm_memory_slot *slot, int min_level); diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index ab439706ea2f..8dea11701ab2 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -259,7 +259,6 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu); #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER union kvm_mmu_notifier_arg { - pte_t pte; unsigned long attributes; }; @@ -273,7 +272,6 @@ struct kvm_gfn_range { bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range); bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range); bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range); -bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range); #endif enum { diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h index 011fba6b5552..74e40d5d4af4 100644 --- a/include/trace/events/kvm.h +++ b/include/trace/events/kvm.h @@ -456,21 +456,6 @@ TRACE_EVENT(kvm_unmap_hva_range, __entry->start, __entry->end) ); -TRACE_EVENT(kvm_set_spte_hva, - TP_PROTO(unsigned long hva), - TP_ARGS(hva), - - TP_STRUCT__entry( - __field( unsigned long, hva ) - ), - - TP_fast_assign( - __entry->hva = hva; - ), - - TP_printk("mmu notifier set pte hva: %#016lx", __entry->hva) -); - TRACE_EVENT(kvm_age_hva, TP_PROTO(unsigned long start, unsigned long end), TP_ARGS(start, end), diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 4eb8afd0b961..2fcd9979752a 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -717,48 +717,6 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn return __kvm_handle_hva_range(kvm, &range).ret; } -static bool kvm_change_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range) -{ - /* - * Skipping invalid memslots is correct if and only change_pte() is - * surrounded by invalidate_range_{start,end}(), which is currently - * guaranteed by the primary MMU. If that ever changes, KVM needs to - * unmap the memslot instead of skipping the memslot to ensure that KVM - * doesn't hold references to the old PFN. - */ - WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count)); - - if (range->slot->flags & KVM_MEMSLOT_INVALID) - return false; - - return kvm_set_spte_gfn(kvm, range); -} - -static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn, - struct mm_struct *mm, - unsigned long address, - pte_t pte) -{ - struct kvm *kvm = mmu_notifier_to_kvm(mn); - const union kvm_mmu_notifier_arg arg = { .pte = pte }; - - trace_kvm_set_spte_hva(address); - - /* - * .change_pte() must be surrounded by .invalidate_range_{start,end}(). - * If mmu_invalidate_in_progress is zero, then no in-progress - * invalidations, including this one, found a relevant memslot at - * start(); rechecking memslots here is unnecessary. Note, a false - * positive (count elevated by a different invalidation) is sub-optimal - * but functionally ok. - */ - WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count)); - if (!READ_ONCE(kvm->mmu_invalidate_in_progress)) - return; - - kvm_handle_hva_range(mn, address, address + 1, arg, kvm_change_spte_gfn); -} - void kvm_mmu_invalidate_begin(struct kvm *kvm) { lockdep_assert_held_write(&kvm->mmu_lock); @@ -976,7 +934,6 @@ static const struct mmu_notifier_ops kvm_mmu_notifier_ops = { .clear_flush_young = kvm_mmu_notifier_clear_flush_young, .clear_young = kvm_mmu_notifier_clear_young, .test_young = kvm_mmu_notifier_test_young, - .change_pte = kvm_mmu_notifier_change_pte, .release = kvm_mmu_notifier_release, };
The .change_pte() MMU notifier callback was intended as an optimization. The original point of it was that KSM could tell KVM to flip its secondary PTE to a new location without having to first zap it. At the time there was also an .invalidate_page() callback; both of them were *not* bracketed by calls to mmu_notifier_invalidate_range_{start,end}(), and .invalidate_page() also doubled as a fallback implementation of .change_pte(). Later on, however, both callbacks were changed to occur within an invalidate_range_start/end() block. In the case of .change_pte(), commit 6bdb913f0a70 ("mm: wrap calls to set_pte_at_notify with invalidate_range_start and invalidate_range_end", 2012-10-09) did so to remove the fallback from .invalidate_page() to .change_pte() and allow sleepable .invalidate_page() hooks. This however made KVM's usage of the .change_pte() callback completely moot, because KVM unmaps the sPTEs during .invalidate_range_start() and therefore .change_pte() has no hope of finding a sPTE to change. Drop the generic KVM code that dispatches to kvm_set_spte_gfn(), as well as all the architecture specific implementations. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> --- arch/arm64/kvm/mmu.c | 34 ----------------- arch/loongarch/include/asm/kvm_host.h | 1 - arch/loongarch/kvm/mmu.c | 32 ---------------- arch/mips/kvm/mmu.c | 30 --------------- arch/powerpc/include/asm/kvm_ppc.h | 1 - arch/powerpc/kvm/book3s.c | 5 --- arch/powerpc/kvm/book3s.h | 1 - arch/powerpc/kvm/book3s_64_mmu_hv.c | 12 ------ arch/powerpc/kvm/book3s_hv.c | 1 - arch/powerpc/kvm/book3s_pr.c | 7 ---- arch/powerpc/kvm/e500_mmu_host.c | 6 --- arch/riscv/kvm/mmu.c | 20 ---------- arch/x86/kvm/mmu/mmu.c | 54 +-------------------------- arch/x86/kvm/mmu/spte.c | 16 -------- arch/x86/kvm/mmu/spte.h | 2 - arch/x86/kvm/mmu/tdp_mmu.c | 46 ----------------------- arch/x86/kvm/mmu/tdp_mmu.h | 1 - include/linux/kvm_host.h | 2 - include/trace/events/kvm.h | 15 -------- virt/kvm/kvm_main.c | 43 --------------------- 20 files changed, 2 insertions(+), 327 deletions(-)