Message ID | 20240222012640.2820927-1-seanjc@google.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v5] KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing | expand |
On Wed, 2024-02-21 at 17:26 -0800, Sean Christopherson wrote: > Retry page faults without acquiring mmu_lock, and without even faulting > the page into the primary MMU, if the resolved gfn is covered by an active > invalidation. Contending for mmu_lock is especially problematic on > preemptible kernels as the mmu_notifier invalidation task will yield > mmu_lock (see rwlock_needbreak()), delay the in-progress invalidation, and > ultimately increase the latency of resolving the page fault. And in the > worst case scenario, yielding will be accompanied by a remote TLB flush, > e.g. if the invalidation covers a large range of memory and vCPUs are > accessing addresses that were already zapped. > > Faulting the page into the primary MMU is similarly problematic, as doing > so may acquire locks that need to be taken for the invalidation to > complete (the primary MMU has finer grained locks than KVM's MMU), and/or > may cause unnecessary churn (getting/putting pages, marking them accessed, > etc). > > Alternatively, the yielding issue could be mitigated by teaching KVM's MMU > iterators to perform more work before yielding, but that wouldn't solve > the lock contention and would negatively affect scenarios where a vCPU is > trying to fault in an address that is NOT covered by the in-progress > invalidation. > > Add a dedicated lockess version of the range-based retry check to avoid > false positives on the sanity check on start+end WARN, and so that it's > super obvious that checking for a racing invalidation without holding > mmu_lock is unsafe (though obviously useful). > > Wrap mmu_invalidate_in_progress in READ_ONCE() to ensure that pre-checking > invalidation in a loop won't put KVM into an infinite loop, e.g. due to > caching the in-progress flag and never seeing it go to '0'. > > Force a load of mmu_invalidate_seq as well, even though it isn't strictly > necessary to avoid an infinite loop, as doing so improves the probability > that KVM will detect an invalidation that already completed before > acquiring mmu_lock and bailing anyways. > > Do the pre-check even for non-preemptible kernels, as waiting to detect > the invalidation until mmu_lock is held guarantees the vCPU will observe > the worst case latency in terms of handling the fault, and can generate > even more mmu_lock contention. E.g. the vCPU will acquire mmu_lock, > detect retry, drop mmu_lock, re-enter the guest, retake the fault, and > eventually re-acquire mmu_lock. This behavior is also why there are no > new starvation issues due to losing the fairness guarantees provided by > rwlocks: if the vCPU needs to retry, it _must_ drop mmu_lock, i.e. waiting > on mmu_lock doesn't guarantee forward progress in the face of _another_ > mmu_notifier invalidation event. > > Note, adding READ_ONCE() isn't entirely free, e.g. on x86, the READ_ONCE() > may generate a load into a register instead of doing a direct comparison > (MOV+TEST+Jcc instead of CMP+Jcc), but practically speaking the added cost > is a few bytes of code and maaaaybe a cycle or three. > > Reported-by: Yan Zhao <yan.y.zhao@intel.com> > Closes: https://lore.kernel.org/all/ZNnPF4W26ZbAyGto@yzhao56-desk.sh.intel.com > Reported-by: Friedrich Weber <f.weber@proxmox.com> > Cc: Kai Huang <kai.huang@intel.com> > Cc: Yan Zhao <yan.y.zhao@intel.com> > Cc: Yuan Yao <yuan.yao@linux.intel.com> > Cc: Xu Yilun <yilun.xu@linux.intel.com> > Signed-off-by: Sean Christopherson <seanjc@google.com> > --- > Acked-by: Kai Huang <kai.huang@intel.com>
On 22/02/2024 02:26, Sean Christopherson wrote: > Retry page faults without acquiring mmu_lock, and without even faulting > the page into the primary MMU, if the resolved gfn is covered by an active > invalidation. Contending for mmu_lock is especially problematic on > preemptible kernels as the mmu_notifier invalidation task will yield > mmu_lock (see rwlock_needbreak()), delay the in-progress invalidation, and > ultimately increase the latency of resolving the page fault. And in the > worst case scenario, yielding will be accompanied by a remote TLB flush, > e.g. if the invalidation covers a large range of memory and vCPUs are > accessing addresses that were already zapped. > > Faulting the page into the primary MMU is similarly problematic, as doing > so may acquire locks that need to be taken for the invalidation to > complete (the primary MMU has finer grained locks than KVM's MMU), and/or > may cause unnecessary churn (getting/putting pages, marking them accessed, > etc). > > Alternatively, the yielding issue could be mitigated by teaching KVM's MMU > iterators to perform more work before yielding, but that wouldn't solve > the lock contention and would negatively affect scenarios where a vCPU is > trying to fault in an address that is NOT covered by the in-progress > invalidation. > > Add a dedicated lockess version of the range-based retry check to avoid > false positives on the sanity check on start+end WARN, and so that it's > super obvious that checking for a racing invalidation without holding > mmu_lock is unsafe (though obviously useful). > > Wrap mmu_invalidate_in_progress in READ_ONCE() to ensure that pre-checking > invalidation in a loop won't put KVM into an infinite loop, e.g. due to > caching the in-progress flag and never seeing it go to '0'. > > Force a load of mmu_invalidate_seq as well, even though it isn't strictly > necessary to avoid an infinite loop, as doing so improves the probability > that KVM will detect an invalidation that already completed before > acquiring mmu_lock and bailing anyways. > > Do the pre-check even for non-preemptible kernels, as waiting to detect > the invalidation until mmu_lock is held guarantees the vCPU will observe > the worst case latency in terms of handling the fault, and can generate > even more mmu_lock contention. E.g. the vCPU will acquire mmu_lock, > detect retry, drop mmu_lock, re-enter the guest, retake the fault, and > eventually re-acquire mmu_lock. This behavior is also why there are no > new starvation issues due to losing the fairness guarantees provided by > rwlocks: if the vCPU needs to retry, it _must_ drop mmu_lock, i.e. waiting > on mmu_lock doesn't guarantee forward progress in the face of _another_ > mmu_notifier invalidation event. > > Note, adding READ_ONCE() isn't entirely free, e.g. on x86, the READ_ONCE() > may generate a load into a register instead of doing a direct comparison > (MOV+TEST+Jcc instead of CMP+Jcc), but practically speaking the added cost > is a few bytes of code and maaaaybe a cycle or three. > > Reported-by: Yan Zhao <yan.y.zhao@intel.com> > Closes: https://lore.kernel.org/all/ZNnPF4W26ZbAyGto@yzhao56-desk.sh.intel.com > Reported-by: Friedrich Weber <f.weber@proxmox.com> > Cc: Kai Huang <kai.huang@intel.com> > Cc: Yan Zhao <yan.y.zhao@intel.com> > Cc: Yuan Yao <yuan.yao@linux.intel.com> > Cc: Xu Yilun <yilun.xu@linux.intel.com> > Signed-off-by: Sean Christopherson <seanjc@google.com> > --- Couldn't find the base-commit 21dbc438 (might not have looked at the right place though), so I applied this patch on top of c48617fb ("Merge tag 'kvmarm-fixes-6.8-3' ...") from kvm/kvm.git. Can confirm the patch fixes the temporary guest hangs in combination with KSM and NUMA balancing [1]. Thanks! [1] https://lore.kernel.org/kvm/832697b9-3652-422d-a019-8c0574a188ac@proxmox.com/
On Wed, 21 Feb 2024 17:26:40 -0800, Sean Christopherson wrote: > Retry page faults without acquiring mmu_lock, and without even faulting > the page into the primary MMU, if the resolved gfn is covered by an active > invalidation. Contending for mmu_lock is especially problematic on > preemptible kernels as the mmu_notifier invalidation task will yield > mmu_lock (see rwlock_needbreak()), delay the in-progress invalidation, and > ultimately increase the latency of resolving the page fault. And in the > worst case scenario, yielding will be accompanied by a remote TLB flush, > e.g. if the invalidation covers a large range of memory and vCPUs are > accessing addresses that were already zapped. > > [...] Applied (quickly) to kvm-x86 fixes, as I want to get this into -next for at least a day or two before sending it to Paolo for 6.8. But I'm more than happy to squash in reviews/acks, especially since many people gave very helpful feedback on earlier versions. [1/1] KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing https://github.com/kvm-x86/linux/commit/67e4022ffad6 -- https://github.com/kvm-x86/linux/tree/next
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> On Wed, Feb 21, 2024 at 05:26:40PM -0800, Sean Christopherson wrote: > Retry page faults without acquiring mmu_lock, and without even faulting > the page into the primary MMU, if the resolved gfn is covered by an active > invalidation. Contending for mmu_lock is especially problematic on > preemptible kernels as the mmu_notifier invalidation task will yield > mmu_lock (see rwlock_needbreak()), delay the in-progress invalidation, and > ultimately increase the latency of resolving the page fault. And in the > worst case scenario, yielding will be accompanied by a remote TLB flush, > e.g. if the invalidation covers a large range of memory and vCPUs are > accessing addresses that were already zapped. > > Faulting the page into the primary MMU is similarly problematic, as doing > so may acquire locks that need to be taken for the invalidation to > complete (the primary MMU has finer grained locks than KVM's MMU), and/or > may cause unnecessary churn (getting/putting pages, marking them accessed, > etc). > > Alternatively, the yielding issue could be mitigated by teaching KVM's MMU > iterators to perform more work before yielding, but that wouldn't solve > the lock contention and would negatively affect scenarios where a vCPU is > trying to fault in an address that is NOT covered by the in-progress > invalidation. > > Add a dedicated lockess version of the range-based retry check to avoid > false positives on the sanity check on start+end WARN, and so that it's > super obvious that checking for a racing invalidation without holding > mmu_lock is unsafe (though obviously useful). > > Wrap mmu_invalidate_in_progress in READ_ONCE() to ensure that pre-checking > invalidation in a loop won't put KVM into an infinite loop, e.g. due to > caching the in-progress flag and never seeing it go to '0'. > > Force a load of mmu_invalidate_seq as well, even though it isn't strictly > necessary to avoid an infinite loop, as doing so improves the probability > that KVM will detect an invalidation that already completed before > acquiring mmu_lock and bailing anyways. > > Do the pre-check even for non-preemptible kernels, as waiting to detect > the invalidation until mmu_lock is held guarantees the vCPU will observe > the worst case latency in terms of handling the fault, and can generate > even more mmu_lock contention. E.g. the vCPU will acquire mmu_lock, > detect retry, drop mmu_lock, re-enter the guest, retake the fault, and > eventually re-acquire mmu_lock. This behavior is also why there are no > new starvation issues due to losing the fairness guarantees provided by > rwlocks: if the vCPU needs to retry, it _must_ drop mmu_lock, i.e. waiting > on mmu_lock doesn't guarantee forward progress in the face of _another_ > mmu_notifier invalidation event. > > Note, adding READ_ONCE() isn't entirely free, e.g. on x86, the READ_ONCE() > may generate a load into a register instead of doing a direct comparison > (MOV+TEST+Jcc instead of CMP+Jcc), but practically speaking the added cost > is a few bytes of code and maaaaybe a cycle or three. > > Reported-by: Yan Zhao <yan.y.zhao@intel.com> > Closes: https://lore.kernel.org/all/ZNnPF4W26ZbAyGto@yzhao56-desk.sh.intel.com > Reported-by: Friedrich Weber <f.weber@proxmox.com> > Cc: Kai Huang <kai.huang@intel.com> > Cc: Yan Zhao <yan.y.zhao@intel.com> > Cc: Yuan Yao <yuan.yao@linux.intel.com> > Cc: Xu Yilun <yilun.xu@linux.intel.com> > Signed-off-by: Sean Christopherson <seanjc@google.com> > --- > > v5: > - Fix the inverted slot check. [Xu] > - Drop all the other patches (will post separately). > > arch/x86/kvm/mmu/mmu.c | 42 ++++++++++++++++++++++++++++++++++++++++ > include/linux/kvm_host.h | 26 +++++++++++++++++++++++++ > 2 files changed, 68 insertions(+) > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index 3c193b096b45..274acc53f0e9 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -4405,6 +4405,31 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, > fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq; > smp_rmb(); > > + /* > + * Check for a relevant mmu_notifier invalidation event before getting > + * the pfn from the primary MMU, and before acquiring mmu_lock. > + * > + * For mmu_lock, if there is an in-progress invalidation and the kernel > + * allows preemption, the invalidation task may drop mmu_lock and yield > + * in response to mmu_lock being contended, which is *very* counter- > + * productive as this vCPU can't actually make forward progress until > + * the invalidation completes. > + * > + * Retrying now can also avoid unnessary lock contention in the primary > + * MMU, as the primary MMU doesn't necessarily hold a single lock for > + * the duration of the invalidation, i.e. faulting in a conflicting pfn > + * can cause the invalidation to take longer by holding locks that are > + * needed to complete the invalidation. > + * > + * Do the pre-check even for non-preemtible kernels, i.e. even if KVM > + * will never yield mmu_lock in response to contention, as this vCPU is > + * *guaranteed* to need to retry, i.e. waiting until mmu_lock is held > + * to detect retry guarantees the worst case latency for the vCPU. > + */ > + if (fault->slot && > + mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn)) > + return RET_PF_RETRY; > + > ret = __kvm_faultin_pfn(vcpu, fault); > if (ret != RET_PF_CONTINUE) > return ret; > @@ -4415,6 +4440,18 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, > if (unlikely(!fault->slot)) > return kvm_handle_noslot_fault(vcpu, fault, access); > > + /* > + * Check again for a relevant mmu_notifier invalidation event purely to > + * avoid contending mmu_lock. Most invalidations will be detected by > + * the previous check, but checking is extremely cheap relative to the > + * overall cost of failing to detect the invalidation until after > + * mmu_lock is acquired. > + */ > + if (mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn)) { > + kvm_release_pfn_clean(fault->pfn); > + return RET_PF_RETRY; > + } > + > return RET_PF_CONTINUE; > } > > @@ -4442,6 +4479,11 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu, > if (!sp && kvm_test_request(KVM_REQ_MMU_FREE_OBSOLETE_ROOTS, vcpu)) > return true; > > + /* > + * Check for a relevant mmu_notifier invalidation event one last time > + * now that mmu_lock is held, as the "unsafe" checks performed without > + * holding mmu_lock can get false negatives. > + */ > return fault->slot && > mmu_invalidate_retry_gfn(vcpu->kvm, fault->mmu_seq, fault->gfn); > } > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > index 18e28610749e..97afe4519772 100644 > --- a/include/linux/kvm_host.h > +++ b/include/linux/kvm_host.h > @@ -2062,6 +2062,32 @@ static inline int mmu_invalidate_retry_gfn(struct kvm *kvm, > return 1; > return 0; > } > + > +/* > + * This lockless version of the range-based retry check *must* be paired with a > + * call to the locked version after acquiring mmu_lock, i.e. this is safe to > + * use only as a pre-check to avoid contending mmu_lock. This version *will* > + * get false negatives and false positives. > + */ > +static inline bool mmu_invalidate_retry_gfn_unsafe(struct kvm *kvm, > + unsigned long mmu_seq, > + gfn_t gfn) > +{ > + /* > + * Use READ_ONCE() to ensure the in-progress flag and sequence counter > + * are always read from memory, e.g. so that checking for retry in a > + * loop won't result in an infinite retry loop. Don't force loads for > + * start+end, as the key to avoiding infinite retry loops is observing > + * the 1=>0 transition of in-progress, i.e. getting false negatives > + * due to stale start+end values is acceptable. > + */ > + if (unlikely(READ_ONCE(kvm->mmu_invalidate_in_progress)) && > + gfn >= kvm->mmu_invalidate_range_start && > + gfn < kvm->mmu_invalidate_range_end) > + return true; > + > + return READ_ONCE(kvm->mmu_invalidate_seq) != mmu_seq; > +} > #endif > > #ifdef CONFIG_HAVE_KVM_IRQ_ROUTING > > base-commit: 21dbc438dde69ff630b3264c54b94923ee9fcdcf > -- > 2.44.0.rc0.258.g7320e95886-goog >
On Fri, Feb 23, 2024, Yan Zhao wrote:
> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Squashed, new hash below. Thanks!
[1/1] KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing
https://github.com/kvm-x86/linux/commit/d02c357e5bfa
On Wed, Feb 21, 2024 at 05:26:40PM -0800, Sean Christopherson wrote: > Retry page faults without acquiring mmu_lock, and without even faulting > the page into the primary MMU, if the resolved gfn is covered by an active > invalidation. Contending for mmu_lock is especially problematic on > preemptible kernels as the mmu_notifier invalidation task will yield > mmu_lock (see rwlock_needbreak()), delay the in-progress invalidation, and > ultimately increase the latency of resolving the page fault. And in the > worst case scenario, yielding will be accompanied by a remote TLB flush, > e.g. if the invalidation covers a large range of memory and vCPUs are > accessing addresses that were already zapped. > > Faulting the page into the primary MMU is similarly problematic, as doing > so may acquire locks that need to be taken for the invalidation to > complete (the primary MMU has finer grained locks than KVM's MMU), and/or > may cause unnecessary churn (getting/putting pages, marking them accessed, > etc). > > Alternatively, the yielding issue could be mitigated by teaching KVM's MMU > iterators to perform more work before yielding, but that wouldn't solve > the lock contention and would negatively affect scenarios where a vCPU is > trying to fault in an address that is NOT covered by the in-progress > invalidation. > > Add a dedicated lockess version of the range-based retry check to avoid > false positives on the sanity check on start+end WARN, and so that it's > super obvious that checking for a racing invalidation without holding > mmu_lock is unsafe (though obviously useful). > > Wrap mmu_invalidate_in_progress in READ_ONCE() to ensure that pre-checking > invalidation in a loop won't put KVM into an infinite loop, e.g. due to > caching the in-progress flag and never seeing it go to '0'. > > Force a load of mmu_invalidate_seq as well, even though it isn't strictly > necessary to avoid an infinite loop, as doing so improves the probability > that KVM will detect an invalidation that already completed before > acquiring mmu_lock and bailing anyways. > > Do the pre-check even for non-preemptible kernels, as waiting to detect > the invalidation until mmu_lock is held guarantees the vCPU will observe > the worst case latency in terms of handling the fault, and can generate > even more mmu_lock contention. E.g. the vCPU will acquire mmu_lock, > detect retry, drop mmu_lock, re-enter the guest, retake the fault, and > eventually re-acquire mmu_lock. This behavior is also why there are no > new starvation issues due to losing the fairness guarantees provided by > rwlocks: if the vCPU needs to retry, it _must_ drop mmu_lock, i.e. waiting > on mmu_lock doesn't guarantee forward progress in the face of _another_ > mmu_notifier invalidation event. > > Note, adding READ_ONCE() isn't entirely free, e.g. on x86, the READ_ONCE() > may generate a load into a register instead of doing a direct comparison > (MOV+TEST+Jcc instead of CMP+Jcc), but practically speaking the added cost > is a few bytes of code and maaaaybe a cycle or three. > > Reported-by: Yan Zhao <yan.y.zhao@intel.com> > Closes: https://lore.kernel.org/all/ZNnPF4W26ZbAyGto@yzhao56-desk.sh.intel.com > Reported-by: Friedrich Weber <f.weber@proxmox.com> > Cc: Kai Huang <kai.huang@intel.com> > Cc: Yan Zhao <yan.y.zhao@intel.com> > Cc: Yuan Yao <yuan.yao@linux.intel.com> > Cc: Xu Yilun <yilun.xu@linux.intel.com> > Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Xu Yilun <yilun.xu@linux.intel.com>
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 3c193b096b45..274acc53f0e9 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -4405,6 +4405,31 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq; smp_rmb(); + /* + * Check for a relevant mmu_notifier invalidation event before getting + * the pfn from the primary MMU, and before acquiring mmu_lock. + * + * For mmu_lock, if there is an in-progress invalidation and the kernel + * allows preemption, the invalidation task may drop mmu_lock and yield + * in response to mmu_lock being contended, which is *very* counter- + * productive as this vCPU can't actually make forward progress until + * the invalidation completes. + * + * Retrying now can also avoid unnessary lock contention in the primary + * MMU, as the primary MMU doesn't necessarily hold a single lock for + * the duration of the invalidation, i.e. faulting in a conflicting pfn + * can cause the invalidation to take longer by holding locks that are + * needed to complete the invalidation. + * + * Do the pre-check even for non-preemtible kernels, i.e. even if KVM + * will never yield mmu_lock in response to contention, as this vCPU is + * *guaranteed* to need to retry, i.e. waiting until mmu_lock is held + * to detect retry guarantees the worst case latency for the vCPU. + */ + if (fault->slot && + mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn)) + return RET_PF_RETRY; + ret = __kvm_faultin_pfn(vcpu, fault); if (ret != RET_PF_CONTINUE) return ret; @@ -4415,6 +4440,18 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, if (unlikely(!fault->slot)) return kvm_handle_noslot_fault(vcpu, fault, access); + /* + * Check again for a relevant mmu_notifier invalidation event purely to + * avoid contending mmu_lock. Most invalidations will be detected by + * the previous check, but checking is extremely cheap relative to the + * overall cost of failing to detect the invalidation until after + * mmu_lock is acquired. + */ + if (mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn)) { + kvm_release_pfn_clean(fault->pfn); + return RET_PF_RETRY; + } + return RET_PF_CONTINUE; } @@ -4442,6 +4479,11 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu, if (!sp && kvm_test_request(KVM_REQ_MMU_FREE_OBSOLETE_ROOTS, vcpu)) return true; + /* + * Check for a relevant mmu_notifier invalidation event one last time + * now that mmu_lock is held, as the "unsafe" checks performed without + * holding mmu_lock can get false negatives. + */ return fault->slot && mmu_invalidate_retry_gfn(vcpu->kvm, fault->mmu_seq, fault->gfn); } diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 18e28610749e..97afe4519772 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -2062,6 +2062,32 @@ static inline int mmu_invalidate_retry_gfn(struct kvm *kvm, return 1; return 0; } + +/* + * This lockless version of the range-based retry check *must* be paired with a + * call to the locked version after acquiring mmu_lock, i.e. this is safe to + * use only as a pre-check to avoid contending mmu_lock. This version *will* + * get false negatives and false positives. + */ +static inline bool mmu_invalidate_retry_gfn_unsafe(struct kvm *kvm, + unsigned long mmu_seq, + gfn_t gfn) +{ + /* + * Use READ_ONCE() to ensure the in-progress flag and sequence counter + * are always read from memory, e.g. so that checking for retry in a + * loop won't result in an infinite retry loop. Don't force loads for + * start+end, as the key to avoiding infinite retry loops is observing + * the 1=>0 transition of in-progress, i.e. getting false negatives + * due to stale start+end values is acceptable. + */ + if (unlikely(READ_ONCE(kvm->mmu_invalidate_in_progress)) && + gfn >= kvm->mmu_invalidate_range_start && + gfn < kvm->mmu_invalidate_range_end) + return true; + + return READ_ONCE(kvm->mmu_invalidate_seq) != mmu_seq; +} #endif #ifdef CONFIG_HAVE_KVM_IRQ_ROUTING
Retry page faults without acquiring mmu_lock, and without even faulting the page into the primary MMU, if the resolved gfn is covered by an active invalidation. Contending for mmu_lock is especially problematic on preemptible kernels as the mmu_notifier invalidation task will yield mmu_lock (see rwlock_needbreak()), delay the in-progress invalidation, and ultimately increase the latency of resolving the page fault. And in the worst case scenario, yielding will be accompanied by a remote TLB flush, e.g. if the invalidation covers a large range of memory and vCPUs are accessing addresses that were already zapped. Faulting the page into the primary MMU is similarly problematic, as doing so may acquire locks that need to be taken for the invalidation to complete (the primary MMU has finer grained locks than KVM's MMU), and/or may cause unnecessary churn (getting/putting pages, marking them accessed, etc). Alternatively, the yielding issue could be mitigated by teaching KVM's MMU iterators to perform more work before yielding, but that wouldn't solve the lock contention and would negatively affect scenarios where a vCPU is trying to fault in an address that is NOT covered by the in-progress invalidation. Add a dedicated lockess version of the range-based retry check to avoid false positives on the sanity check on start+end WARN, and so that it's super obvious that checking for a racing invalidation without holding mmu_lock is unsafe (though obviously useful). Wrap mmu_invalidate_in_progress in READ_ONCE() to ensure that pre-checking invalidation in a loop won't put KVM into an infinite loop, e.g. due to caching the in-progress flag and never seeing it go to '0'. Force a load of mmu_invalidate_seq as well, even though it isn't strictly necessary to avoid an infinite loop, as doing so improves the probability that KVM will detect an invalidation that already completed before acquiring mmu_lock and bailing anyways. Do the pre-check even for non-preemptible kernels, as waiting to detect the invalidation until mmu_lock is held guarantees the vCPU will observe the worst case latency in terms of handling the fault, and can generate even more mmu_lock contention. E.g. the vCPU will acquire mmu_lock, detect retry, drop mmu_lock, re-enter the guest, retake the fault, and eventually re-acquire mmu_lock. This behavior is also why there are no new starvation issues due to losing the fairness guarantees provided by rwlocks: if the vCPU needs to retry, it _must_ drop mmu_lock, i.e. waiting on mmu_lock doesn't guarantee forward progress in the face of _another_ mmu_notifier invalidation event. Note, adding READ_ONCE() isn't entirely free, e.g. on x86, the READ_ONCE() may generate a load into a register instead of doing a direct comparison (MOV+TEST+Jcc instead of CMP+Jcc), but practically speaking the added cost is a few bytes of code and maaaaybe a cycle or three. Reported-by: Yan Zhao <yan.y.zhao@intel.com> Closes: https://lore.kernel.org/all/ZNnPF4W26ZbAyGto@yzhao56-desk.sh.intel.com Reported-by: Friedrich Weber <f.weber@proxmox.com> Cc: Kai Huang <kai.huang@intel.com> Cc: Yan Zhao <yan.y.zhao@intel.com> Cc: Yuan Yao <yuan.yao@linux.intel.com> Cc: Xu Yilun <yilun.xu@linux.intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> --- v5: - Fix the inverted slot check. [Xu] - Drop all the other patches (will post separately). arch/x86/kvm/mmu/mmu.c | 42 ++++++++++++++++++++++++++++++++++++++++ include/linux/kvm_host.h | 26 +++++++++++++++++++++++++ 2 files changed, 68 insertions(+) base-commit: 21dbc438dde69ff630b3264c54b94923ee9fcdcf