Message ID | 20210402005658.3024832-2-seanjc@google.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | KVM: Consolidate and optimize MMU notifiers | expand |
On 02/04/21 02:56, Sean Christopherson wrote: > In KVM's .change_pte() notification callback, replace the notifier > sequence bump with a WARN_ON assertion that the notifier count is > elevated. An elevated count provides stricter protections than bumping > the sequence, and the sequence is guarnateed to be bumped before the > count hits zero. > > When .change_pte() was added by commit 828502d30073 ("ksm: add > mmu_notifier set_pte_at_notify()"), bumping the sequence was necessary > as .change_pte() would be invoked without any surrounding notifications. > > However, since commit 6bdb913f0a70 ("mm: wrap calls to set_pte_at_notify > with invalidate_range_start and invalidate_range_end"), all calls to > .change_pte() are guaranteed to be bookended by start() and end(), and > so are guaranteed to run with an elevated notifier count. > > Note, wrapping .change_pte() with .invalidate_range_{start,end}() is a > bug of sorts, as invalidating the secondary MMU's (KVM's) PTE defeats > the purpose of .change_pte(). Every arch's kvm_set_spte_hva() assumes > .change_pte() is called when the relevant SPTE is present in KVM's MMU, > as the original goal was to accelerate Kernel Samepage Merging (KSM) by > updating KVM's SPTEs without requiring a VM-Exit (due to invalidating > the SPTE). I.e. it means that .change_pte() is effectively dead code > on _all_ architectures. > > x86 and MIPS are clearcut nops if the old SPTE is not-present, and that > is guaranteed due to the prior invalidation. PPC simply unmaps the SPTE, > which again should be a nop due to the invalidation. arm64 is a bit > murky, but it's also likely a nop because kvm_pgtable_stage2_map() is > called without a cache pointer, which means it will map an entry if and > only if an existing PTE was found. > > For now, take advantage of the bug to simplify future consolidation of > KVMs's MMU notifier code. Doing so will not greatly complicate fixing > .change_pte(), assuming it's even worth fixing. .change_pte() has been > broken for 8+ years and no one has complained. Even if there are > KSM+KVM users that care deeply about its performance, the benefits of > avoiding VM-Exits via .change_pte() need to be reevaluated to justify > the added complexity and testing burden. Ripping out .change_pte() > entirely would be a lot easier. > > Signed-off-by: Sean Christopherson <seanjc@google.com> > --- > virt/kvm/kvm_main.c | 9 +++++++-- > 1 file changed, 7 insertions(+), 2 deletions(-) > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index d1de843b7618..8df091950161 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -461,12 +461,17 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn, > > trace_kvm_set_spte_hva(address); > > + /* > + * .change_pte() must be bookended by .invalidate_range_{start,end}(), Changed to "surrounded" for the benefit of non-native speakers. :) Paolo > + * and so always runs with an elevated notifier count. This obviates > + * the need to bump the sequence count. > + */ > + WARN_ON_ONCE(!kvm->mmu_notifier_count); > + > idx = srcu_read_lock(&kvm->srcu); > > KVM_MMU_LOCK(kvm); > > - kvm->mmu_notifier_seq++; > - > if (kvm_set_spte_hva(kvm, address, pte)) > kvm_flush_remote_tlbs(kvm); > >
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index d1de843b7618..8df091950161 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -461,12 +461,17 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn, trace_kvm_set_spte_hva(address); + /* + * .change_pte() must be bookended by .invalidate_range_{start,end}(), + * and so always runs with an elevated notifier count. This obviates + * the need to bump the sequence count. + */ + WARN_ON_ONCE(!kvm->mmu_notifier_count); + idx = srcu_read_lock(&kvm->srcu); KVM_MMU_LOCK(kvm); - kvm->mmu_notifier_seq++; - if (kvm_set_spte_hva(kvm, address, pte)) kvm_flush_remote_tlbs(kvm);
In KVM's .change_pte() notification callback, replace the notifier sequence bump with a WARN_ON assertion that the notifier count is elevated. An elevated count provides stricter protections than bumping the sequence, and the sequence is guarnateed to be bumped before the count hits zero. When .change_pte() was added by commit 828502d30073 ("ksm: add mmu_notifier set_pte_at_notify()"), bumping the sequence was necessary as .change_pte() would be invoked without any surrounding notifications. However, since commit 6bdb913f0a70 ("mm: wrap calls to set_pte_at_notify with invalidate_range_start and invalidate_range_end"), all calls to .change_pte() are guaranteed to be bookended by start() and end(), and so are guaranteed to run with an elevated notifier count. Note, wrapping .change_pte() with .invalidate_range_{start,end}() is a bug of sorts, as invalidating the secondary MMU's (KVM's) PTE defeats the purpose of .change_pte(). Every arch's kvm_set_spte_hva() assumes .change_pte() is called when the relevant SPTE is present in KVM's MMU, as the original goal was to accelerate Kernel Samepage Merging (KSM) by updating KVM's SPTEs without requiring a VM-Exit (due to invalidating the SPTE). I.e. it means that .change_pte() is effectively dead code on _all_ architectures. x86 and MIPS are clearcut nops if the old SPTE is not-present, and that is guaranteed due to the prior invalidation. PPC simply unmaps the SPTE, which again should be a nop due to the invalidation. arm64 is a bit murky, but it's also likely a nop because kvm_pgtable_stage2_map() is called without a cache pointer, which means it will map an entry if and only if an existing PTE was found. For now, take advantage of the bug to simplify future consolidation of KVMs's MMU notifier code. Doing so will not greatly complicate fixing .change_pte(), assuming it's even worth fixing. .change_pte() has been broken for 8+ years and no one has complained. Even if there are KSM+KVM users that care deeply about its performance, the benefits of avoiding VM-Exits via .change_pte() need to be reevaluated to justify the added complexity and testing burden. Ripping out .change_pte() entirely would be a lot easier. Signed-off-by: Sean Christopherson <seanjc@google.com> --- virt/kvm/kvm_main.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-)