
[v3] KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing

Message ID 20240203003518.387220-1-seanjc@google.com (mailing list archive)
State New, archived
Series [v3] KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing

Commit Message

Sean Christopherson Feb. 3, 2024, 12:35 a.m. UTC
Retry page faults without acquiring mmu_lock if the resolved gfn is covered
by an active invalidation.  Contending for mmu_lock is especially
problematic on preemptible kernels as the mmu_notifier invalidation task
will yield mmu_lock (see rwlock_needbreak()), delay the in-progress
invalidation, and ultimately increase the latency of resolving the page
fault.  And in the worst case scenario, yielding will be accompanied by a
remote TLB flush, e.g. if the invalidation covers a large range of memory
and vCPUs are accessing addresses that were already zapped.

Alternatively, the yielding issue could be mitigated by teaching KVM's MMU
iterators to perform more work before yielding, but that wouldn't solve
the lock contention and would negatively affect scenarios where a vCPU is
trying to fault in an address that is NOT covered by the in-progress
invalidation.

Add a dedicated lockless version of the range-based retry check to avoid
false positives on the start+end sanity-check WARN, and so that it's
super obvious that checking for a racing invalidation without holding
mmu_lock is unsafe (though obviously useful).

Wrap mmu_invalidate_in_progress in READ_ONCE() to ensure that pre-checking
invalidation in a loop won't put KVM into an infinite loop, e.g. due to
caching the in-progress flag and never seeing it go to '0'.

Force a load of mmu_invalidate_seq as well, even though it isn't strictly
necessary to avoid an infinite loop, as doing so improves the probability
that KVM will detect an invalidation that already completed before
acquiring mmu_lock and bailing anyways.

Do the pre-check even for non-preemptible kernels, as waiting to detect
the invalidation until mmu_lock is held guarantees the vCPU will observe
the worst case latency in terms of handling the fault, and can generate
even more mmu_lock contention.  E.g. the vCPU will acquire mmu_lock,
detect retry, drop mmu_lock, re-enter the guest, retake the fault, and
eventually re-acquire mmu_lock.  This behavior is also why there are no
new starvation issues due to losing the fairness guarantees provided by
rwlocks: if the vCPU needs to retry, it _must_ drop mmu_lock, i.e. waiting
on mmu_lock doesn't guarantee forward progress in the face of _another_
mmu_notifier invalidation event.

Note, adding READ_ONCE() isn't entirely free, e.g. on x86, the READ_ONCE()
may generate a load into a register instead of doing a direct comparison
(MOV+TEST+Jcc instead of CMP+Jcc), but practically speaking the added cost
is a few bytes of code and maaaaybe a cycle or three.
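
To make the forward-progress concern concrete, consider a hypothetical caller
that pre-checks in a loop before taking mmu_lock (an illustrative sketch only,
not code from this patch):

	/*
	 * Without READ_ONCE() inside the helper, the compiler could cache
	 * kvm->mmu_invalidate_in_progress in a register across the inlined
	 * iterations and never observe the 1=>0 transition; forcing a fresh
	 * load on every call guarantees the loop eventually terminates.
	 */
	while (mmu_invalidate_retry_gfn_unsafe(kvm, mmu_seq, gfn))
		cpu_relax();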

Reported-by: Yan Zhao <yan.y.zhao@intel.com>
Closes: https://lore.kernel.org/all/ZNnPF4W26ZbAyGto@yzhao56-desk.sh.intel.com
Cc: Kai Huang <kai.huang@intel.com>
Cc: Yan Zhao <yan.y.zhao@intel.com>
Cc: Yuan Yao <yuan.yao@linux.intel.com>
Cc: Xu Yilun <yilun.xu@linux.intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---

Kai and Yan, I dropped your reviews as this changed just enough to make me
uncomfortable carrying reviews over from the previous version.

v3:
 - Release the pfn, i.e. put the struct page reference if one was held,
   as the caller doesn't expect to get a reference on "failure". [Yuan]
 - Fix a typo in the comment.

v2:
 - Introduce a dedicated helper and collapse to a single patch (because
   adding an unused helper would be quite silly).
 - Add a comment to explain the "unsafe" check in kvm_faultin_pfn(). [Kai]
 - Add Kai's Ack.

v1: https://lore.kernel.org/all/20230825020733.2849862-1-seanjc@google.com

 arch/x86/kvm/mmu/mmu.c   | 19 +++++++++++++++++++
 include/linux/kvm_host.h | 26 ++++++++++++++++++++++++++
 2 files changed, 45 insertions(+)


base-commit: 60eedcfceda9db46f1b333e5e1aa9359793f04fb

Comments

Yan Zhao Feb. 5, 2024, 12:14 a.m. UTC | #1
On Fri, Feb 02, 2024 at 04:35:18PM -0800, Sean Christopherson wrote:
> Retry page faults without acquiring mmu_lock if the resolved gfn is covered
> by an active invalidation.  Contending for mmu_lock is especially
> problematic on preemptible kernels as the mmu_notifier invalidation task
> will yield mmu_lock (see rwlock_needbreak()), delay the in-progress
> invalidation, and ultimately increase the latency of resolving the page
> fault.  And in the worst case scenario, yielding will be accompanied by a
> remote TLB flush, e.g. if the invalidation covers a large range of memory
> and vCPUs are accessing addresses that were already zapped.
> 
> Alternatively, the yielding issue could be mitigated by teaching KVM's MMU
> iterators to perform more work before yielding, but that wouldn't solve
> the lock contention and would negatively affect scenarios where a vCPU is
> trying to fault in an address that is NOT covered by the in-progress
> invalidation.
> 
> Add a dedicated lockless version of the range-based retry check to avoid
> false positives on the sanity check on start+end WARN, and so that it's
> super obvious that checking for a racing invalidation without holding
> mmu_lock is unsafe (though obviously useful).
> 
> Wrap mmu_invalidate_in_progress in READ_ONCE() to ensure that pre-checking
> invalidation in a loop won't put KVM into an infinite loop, e.g. due to
> caching the in-progress flag and never seeing it go to '0'.
> 
> Force a load of mmu_invalidate_seq as well, even though it isn't strictly
> necessary to avoid an infinite loop, as doing so improves the probability
> that KVM will detect an invalidation that already completed before
> acquiring mmu_lock and bailing anyways.
> 
> Do the pre-check even for non-preemptible kernels, as waiting to detect
> the invalidation until mmu_lock is held guarantees the vCPU will observe
> the worst case latency in terms of handling the fault, and can generate
> even more mmu_lock contention.  E.g. the vCPU will acquire mmu_lock,
> detect retry, drop mmu_lock, re-enter the guest, retake the fault, and
> eventually re-acquire mmu_lock.  This behavior is also why there are no
> new starvation issues due to losing the fairness guarantees provided by
> rwlocks: if the vCPU needs to retry, it _must_ drop mmu_lock, i.e. waiting
> on mmu_lock doesn't guarantee forward progress in the face of _another_
> mmu_notifier invalidation event.
> 
> Note, adding READ_ONCE() isn't entirely free, e.g. on x86, the READ_ONCE()
> may generate a load into a register instead of doing a direct comparison
> (MOV+TEST+Jcc instead of CMP+Jcc), but practically speaking the added cost
> is a few bytes of code and maaaaybe a cycle or three.
> 
> Reported-by: Yan Zhao <yan.y.zhao@intel.com>
> Closes: https://lore.kernel.org/all/ZNnPF4W26ZbAyGto@yzhao56-desk.sh.intel.com
> Cc: Kai Huang <kai.huang@intel.com>
> Cc: Yan Zhao <yan.y.zhao@intel.com>
> Cc: Yuan Yao <yuan.yao@linux.intel.com>
> Cc: Xu Yilun <yilun.xu@linux.intel.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
> 
> Kai and Yan, I dropped your reviews as this changed just enough to make me
> uncomfortable carrying reviews over from the previous version.
> 
> v3:
>  - Release the pfn, i.e. put the struct page reference if one was held,
>    as the caller doesn't expect to get a reference on "failure". [Yuan]
>  - Fix a typo in the comment.
> 
> v2:
>  - Introduce a dedicated helper and collapse to a single patch (because
>    adding an unused helper would be quite silly).
>  - Add a comment to explain the "unsafe" check in kvm_faultin_pfn(). [Kai]
>  - Add Kai's Ack.
> 
> v1: https://lore.kernel.org/all/20230825020733.2849862-1-seanjc@google.com
> 
>  arch/x86/kvm/mmu/mmu.c   | 19 +++++++++++++++++++
>  include/linux/kvm_host.h | 26 ++++++++++++++++++++++++++
>  2 files changed, 45 insertions(+)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 3c193b096b45..8ce9898914f1 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4415,6 +4415,25 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
>  	if (unlikely(!fault->slot))
>  		return kvm_handle_noslot_fault(vcpu, fault, access);
>  
> +	/*
> +	 * Pre-check for a relevant mmu_notifier invalidation event prior to
> +	 * acquiring mmu_lock.  If there is an in-progress invalidation and the
> +	 * kernel allows preemption, the invalidation task may drop mmu_lock
> +	 * and yield in response to mmu_lock being contended, which is *very*
> +	 * counter-productive as this vCPU can't actually make forward progress
> +	 * until the invalidation completes.  This "unsafe" check can get false
> +	 * negatives, i.e. KVM needs to re-check after acquiring mmu_lock.
> +	 *
> +	 * Do the pre-check even for non-preemptible kernels, i.e. even if KVM
> +	 * will never yield mmu_lock in response to contention, as this vCPU is
> +	 * *guaranteed* to need to retry, i.e. waiting until mmu_lock is held
> +	 * to detect retry guarantees the worst case latency for the vCPU.
> +	 */
> +	if (mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn)) {
> +		kvm_release_pfn_clean(fault->pfn);
> +		return RET_PF_RETRY;
> +	}
> +
Could we also add this pre-check before fault in the pfn? like
@@ -4404,6 +4404,8 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,

        fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
        smp_rmb();
+       if (mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn))
+               return RET_PF_CONTINUE;

        ret = __kvm_faultin_pfn(vcpu, fault);
        if (ret != RET_PF_CONTINUE)

Though the mmu_seq would always be equal in this check, it can avoid repeatedly
faulting in and releasing the pfn in vain during a long zap cycle.


>  	return RET_PF_CONTINUE;
>  }
>  
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 7e7fd25b09b3..179df96b20f8 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2031,6 +2031,32 @@ static inline int mmu_invalidate_retry_gfn(struct kvm *kvm,
>  		return 1;
>  	return 0;
>  }
> +
> +/*
> + * This lockless version of the range-based retry check *must* be paired with a
> + * call to the locked version after acquiring mmu_lock, i.e. this is safe to
> + * use only as a pre-check to avoid contending mmu_lock.  This version *will*
> + * get false negatives and false positives.
> + */
> +static inline bool mmu_invalidate_retry_gfn_unsafe(struct kvm *kvm,
> +						   unsigned long mmu_seq,
> +						   gfn_t gfn)
> +{
> +	/*
> +	 * Use READ_ONCE() to ensure the in-progress flag and sequence counter
> +	 * are always read from memory, e.g. so that checking for retry in a
> +	 * loop won't result in an infinite retry loop.  Don't force loads for
> +	 * start+end, as the key to avoiding infinite retry loops is observing
> +	 * the 1=>0 transition of in-progress, i.e. getting false negatives
> +	 * due to stale start+end values is acceptable.
> +	 */
> +	if (unlikely(READ_ONCE(kvm->mmu_invalidate_in_progress)) &&
> +	    gfn >= kvm->mmu_invalidate_range_start &&
> +	    gfn < kvm->mmu_invalidate_range_end)
> +		return true;
> +
Is a smp_rmb() before the check below better, given this retry helper is
defined in a header for all platforms?
It should be just a compiler barrier, and thus free, on x86?

> +	return READ_ONCE(kvm->mmu_invalidate_seq) != mmu_seq;
> +}
>  #endif
>  
>  #ifdef CONFIG_HAVE_KVM_IRQ_ROUTING
> 
> base-commit: 60eedcfceda9db46f1b333e5e1aa9359793f04fb
> -- 
> 2.43.0.594.gd9cf4e227d-goog
>
Sean Christopherson Feb. 5, 2024, 6:27 p.m. UTC | #2
On Mon, Feb 05, 2024, Yan Zhao wrote:
> On Fri, Feb 02, 2024 at 04:35:18PM -0800, Sean Christopherson wrote:
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 3c193b096b45..8ce9898914f1 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4415,6 +4415,25 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> >  	if (unlikely(!fault->slot))
> >  		return kvm_handle_noslot_fault(vcpu, fault, access);
> >  
> > +	/*
> > +	 * Pre-check for a relevant mmu_notifier invalidation event prior to
> > +	 * acquiring mmu_lock.  If there is an in-progress invalidation and the
> > +	 * kernel allows preemption, the invalidation task may drop mmu_lock
> > +	 * and yield in response to mmu_lock being contended, which is *very*
> > +	 * counter-productive as this vCPU can't actually make forward progress
> > +	 * until the invalidation completes.  This "unsafe" check can get false
> > +	 * negatives, i.e. KVM needs to re-check after acquiring mmu_lock.
> > +	 *
> > +	 * Do the pre-check even for non-preemptible kernels, i.e. even if KVM
> > +	 * will never yield mmu_lock in response to contention, as this vCPU is
> > +	 * *guaranteed* to need to retry, i.e. waiting until mmu_lock is held
> > +	 * to detect retry guarantees the worst case latency for the vCPU.
> > +	 */
> > +	if (mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn)) {
> > +		kvm_release_pfn_clean(fault->pfn);
> > +		return RET_PF_RETRY;
> > +	}
> > +
> Could we also add this pre-check before fault in the pfn? like
> @@ -4404,6 +4404,8 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> 
>         fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
>         smp_rmb();
> +       if (mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn))
> +               return RET_PF_CONTINUE;
> 
>         ret = __kvm_faultin_pfn(vcpu, fault);
>         if (ret != RET_PF_CONTINUE)
> 
> Though the mmu_seq would always be equal in this check, it can avoid repeatedly
> faulting in and releasing the pfn in vain during a long zap cycle.

Just to be super clear, by "repeated faulting", you mean repeated faulting in the
primary MMU, correct?

Yeah, I agree, that makes sense.  The only question is whether to check before
and after, or only before.  When abusing KSM, I see ~99.5% of all invalidations
being detected before (21.5M out of just over 21.6M).

I think it makes sense to also check after getting the pfn?  The check is super
cheap, especially when there isn't an invalidation event as the checks should all
be quite predictable and cache hot.
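
Roughly, checking both before and after getting the pfn could look something
like this (just a sketch of the idea, not a tested patch):

	fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
	smp_rmb();

	/*
	 * Bail before faulting the page into the primary MMU if a relevant
	 * invalidation is already in flight, as its result would be discarded
	 * anyways.
	 */
	if (mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn))
		return RET_PF_RETRY;

	ret = __kvm_faultin_pfn(vcpu, fault);
	if (ret != RET_PF_CONTINUE)
		return ret;

	...

	/* Re-check after getting the pfn, before contending mmu_lock. */
	if (mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn)) {
		kvm_release_pfn_clean(fault->pfn);
		return RET_PF_RETRY;
	}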

> > +/*
> > + * This lockless version of the range-based retry check *must* be paired with a
> > + * call to the locked version after acquiring mmu_lock, i.e. this is safe to
> > + * use only as a pre-check to avoid contending mmu_lock.  This version *will*
> > + * get false negatives and false positives.
> > + */
> > +static inline bool mmu_invalidate_retry_gfn_unsafe(struct kvm *kvm,
> > +						   unsigned long mmu_seq,
> > +						   gfn_t gfn)
> > +{
> > +	/*
> > +	 * Use READ_ONCE() to ensure the in-progress flag and sequence counter
> > +	 * are always read from memory, e.g. so that checking for retry in a
> > +	 * loop won't result in an infinite retry loop.  Don't force loads for
> > +	 * start+end, as the key to avoiding infinite retry loops is observing
> > +	 * the 1=>0 transition of in-progress, i.e. getting false negatives
> > +	 * due to stale start+end values is acceptable.
> > +	 */
> > +	if (unlikely(READ_ONCE(kvm->mmu_invalidate_in_progress)) &&
> > +	    gfn >= kvm->mmu_invalidate_range_start &&
> > +	    gfn < kvm->mmu_invalidate_range_end)
> > +		return true;
> > +
> Is a smp_rmb() before the check below better, given this retry helper is
> defined in a header for all platforms?

No, because the unsafe check very deliberately doesn't provide any ordering
guarantees.  The READ_ONCE() is purely to ensure forward progress.  And trying to
provide ordering for mmu_invalidate_in_progress is rather nonsensical.  The smp_rmb()
in mmu_invalidate_retry() ensures the caller will see a different mmu_invalidate_seq
or an elevated mmu_invalidate_in_progress.

For this code, the order in which mmu_invalidate_seq and mmu_invalidate_in_progress
are checked truly doesn't matter, as the range-based checks will get false
negatives when performed outside of mmu_lock.  And from a correctness perspective,
there are zero constraints on when the checks are performed (beyond the existing
constraints from the earlier smp_rmb() and acquiring mmu_lock).

If an arch wants ordering guarantees, it can simply use mmu_invalidate_retry()
without a gfn, which has the advantage of being safe outside of mmu_lock (the
obvious disadvantage is that it's very imprecise).
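
For reference, the gfn-less mmu_invalidate_retry() being referred to looks
roughly like this (paraphrased and simplified; see include/linux/kvm_host.h
for the authoritative version and full comments):

	static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq)
	{
		if (unlikely(kvm->mmu_invalidate_in_progress))
			return 1;
		/*
		 * The smp_rmb() ensures the caller sees either an elevated
		 * mmu_invalidate_in_progress or a changed mmu_invalidate_seq.
		 */
		smp_rmb();
		if (kvm->mmu_invalidate_seq != mmu_seq)
			return 1;
		return 0;
	}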
Yan Zhao Feb. 6, 2024, 2:38 a.m. UTC | #3
On Mon, Feb 05, 2024 at 10:27:32AM -0800, Sean Christopherson wrote:
> On Mon, Feb 05, 2024, Yan Zhao wrote:
> > On Fri, Feb 02, 2024 at 04:35:18PM -0800, Sean Christopherson wrote:
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 3c193b096b45..8ce9898914f1 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -4415,6 +4415,25 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> > >  	if (unlikely(!fault->slot))
> > >  		return kvm_handle_noslot_fault(vcpu, fault, access);
> > >  
> > > +	/*
> > > +	 * Pre-check for a relevant mmu_notifier invalidation event prior to
> > > +	 * acquiring mmu_lock.  If there is an in-progress invalidation and the
> > > +	 * kernel allows preemption, the invalidation task may drop mmu_lock
> > > +	 * and yield in response to mmu_lock being contended, which is *very*
> > > +	 * counter-productive as this vCPU can't actually make forward progress
> > > +	 * until the invalidation completes.  This "unsafe" check can get false
> > > +	 * negatives, i.e. KVM needs to re-check after acquiring mmu_lock.
> > > +	 *
> > > +	 * Do the pre-check even for non-preemptible kernels, i.e. even if KVM
> > > +	 * will never yield mmu_lock in response to contention, as this vCPU is
> > > +	 * *guaranteed* to need to retry, i.e. waiting until mmu_lock is held
> > > +	 * to detect retry guarantees the worst case latency for the vCPU.
> > > +	 */
> > > +	if (mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn)) {
> > > +		kvm_release_pfn_clean(fault->pfn);
> > > +		return RET_PF_RETRY;
> > > +	}
> > > +
> > Could we also add this pre-check before fault in the pfn? like
> > @@ -4404,6 +4404,8 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> > 
> >         fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
> >         smp_rmb();
> > +       if (mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn))
> > +               return RET_PF_CONTINUE;
> > 
> >         ret = __kvm_faultin_pfn(vcpu, fault);
> >         if (ret != RET_PF_CONTINUE)
> > 
> > Though the mmu_seq would always be equal in this check, it can avoid repeatedly
> > faulting in and releasing the pfn in vain during a long zap cycle.
> 
> Just to be super clear, by "repeated faulting", you mean repeated faulting in the
> primary MMU, correct?
>
Yes. Faulting in the primary MMU.
(Please ignore my typo in return type above :))

BTW, will you also send the optimization from v1, as below?
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1e340098d034..c7617991e290 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5725,11 +5725,13 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
        }
 
        if (r == RET_PF_INVALID) {
-               r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa,
-                                         lower_32_bits(error_code), false,
-                                         &emulation_type);
-               if (KVM_BUG_ON(r == RET_PF_INVALID, vcpu->kvm))
-                       return -EIO;
+               do {
+                       r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa,
+                                                 lower_32_bits(error_code),
+                                                 false, &emulation_type);
+                       if (KVM_BUG_ON(r == RET_PF_INVALID, vcpu->kvm))
+                               return -EIO;
+               } while (r == RET_PF_RETRY);
        }
 
        if (r < 0)

> Yeah, I agree, that makes sense.  The only question is whether to check before
> and after, or only before.  When abusing KSM, I see ~99.5% of all invalidations
> being detected before (21.5M out of just over 21.6M).
> 
> I think it makes sense to also check after getting the pfn?  The check is super
> cheap, especially when there isn't an invalidation event as the checks should all
> be quite predictable and cache hot.
I think checking both before and after is reasonable if the cost of the check
is not a concern.


> > > +/*
> > > + * This lockless version of the range-based retry check *must* be paired with a
> > > + * call to the locked version after acquiring mmu_lock, i.e. this is safe to
> > > + * use only as a pre-check to avoid contending mmu_lock.  This version *will*
> > > + * get false negatives and false positives.
> > > + */
> > > +static inline bool mmu_invalidate_retry_gfn_unsafe(struct kvm *kvm,
> > > +						   unsigned long mmu_seq,
> > > +						   gfn_t gfn)
> > > +{
> > > +	/*
> > > +	 * Use READ_ONCE() to ensure the in-progress flag and sequence counter
> > > +	 * are always read from memory, e.g. so that checking for retry in a
> > > +	 * loop won't result in an infinite retry loop.  Don't force loads for
> > > +	 * start+end, as the key to avoiding infinite retry loops is observing
> > > +	 * the 1=>0 transition of in-progress, i.e. getting false negatives
> > > +	 * due to stale start+end values is acceptable.
> > > +	 */
> > > +	if (unlikely(READ_ONCE(kvm->mmu_invalidate_in_progress)) &&
> > > +	    gfn >= kvm->mmu_invalidate_range_start &&
> > > +	    gfn < kvm->mmu_invalidate_range_end)
> > > +		return true;
> > > +
> > Is a smp_rmb() before the check below better, given this retry helper is
> > defined in a header for all platforms?
> 
> No, because the unsafe check very deliberately doesn't provide any ordering
> guarantees.  The READ_ONCE() is purely to ensure forward progress.  And trying to
> provide ordering for mmu_invalidate_in_progress is rather nonsensical.  The smp_rmb()
> in mmu_invalidate_retry() ensures the caller will see a different mmu_invalidate_seq
> or an elevated mmu_invalidate_in_progress.
> 
> For this code, the order in which mmu_invalidate_seq and mmu_invalidate_in_progress
> are checked truly doesn't matter, as the range-based checks will get false
> negatives when performed outside of mmu_lock.  And from a correctness perspective,
> there are zero constraints on when the checks are performed (beyond the existing
> constraints from the earlier smp_rmb() and acquiring mmu_lock).
> 
> If an arch wants ordering guarantees, it can simply use mmu_invalidate_retry()
> without a gfn, which has the advantage of being safe outside of mmu_lock (the
> obvious disadvantage is that it's very imprecise).
Ok. It makes sense!
Sean Christopherson Feb. 6, 2024, 3:35 a.m. UTC | #4
On Tue, Feb 06, 2024, Yan Zhao wrote:
> > Just to be super clear, by "repeated faulting", you mean repeated faulting in the
> > primary MMU, correct?
> >
> Yes. Faulting in the primary MMU.
> (Please ignore my typo in return type above :))
> 
> BTW, will you also send the optimization from v1, as below?

No, mainly because I'm not entirely confident that it's safe/correct to loop
there, at least not that "tightly".  At the very least, there would need to be
resched checks, and then probably signal checks, etc.

I'm not opposed to something of this nature if it provides a measurable benefit
to the guest, but it's firmly an "on top" sort of change.

> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 1e340098d034..c7617991e290 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5725,11 +5725,13 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
>         }
>  
>         if (r == RET_PF_INVALID) {
> -               r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa,
> -                                         lower_32_bits(error_code), false,
> -                                         &emulation_type);
> -               if (KVM_BUG_ON(r == RET_PF_INVALID, vcpu->kvm))
> -                       return -EIO;
> +               do {
> +                       r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa,
> +                                                 lower_32_bits(error_code),
> +                                                 false, &emulation_type);
> +                       if (KVM_BUG_ON(r == RET_PF_INVALID, vcpu->kvm))
> +                               return -EIO;
> +               } while (r == RET_PF_RETRY);
>         }
>  
>         if (r < 0)
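
Purely for illustration, a loop like that would probably need to look more
like the following, with explicit resched and signal checks (a hypothetical
sketch, not a proposal or a tested patch):

	if (r == RET_PF_INVALID) {
		do {
			r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa,
						  lower_32_bits(error_code),
						  false, &emulation_type);
			if (KVM_BUG_ON(r == RET_PF_INVALID, vcpu->kvm))
				return -EIO;

			/*
			 * Don't spin on mmu_lock-free retries forever: give
			 * other tasks a chance to run, and bail out to the
			 * main run loop if a signal is pending.
			 */
			if (r == RET_PF_RETRY) {
				if (signal_pending(current))
					break;
				cond_resched();
			}
		} while (r == RET_PF_RETRY);
	}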
Friedrich Weber Feb. 6, 2024, 1:52 p.m. UTC | #5
On 03/02/2024 01:35, Sean Christopherson wrote:
> Retry page faults without acquiring mmu_lock if the resolved gfn is covered
> by an active invalidation.  Contending for mmu_lock is especially
> problematic on preemptible kernels as the mmu_notifier invalidation task
> will yield mmu_lock (see rwlock_needbreak()), delay the in-progress
> invalidation, and ultimately increase the latency of resolving the page
> fault.  And in the worst case scenario, yielding will be accompanied by a
> remote TLB flush, e.g. if the invalidation covers a large range of memory
> and vCPUs are accessing addresses that were already zapped.
[...]

Can confirm this patch fixes the temporary guest hangs I'm seeing in
combination with KSM and NUMA balancing [1], which is likely the same
issue as described in [2]:

* On this patch's base-commit 60eedcfc from
git.kernel.org/pub/scm/virt/kvm/kvm.git, I can reproduce the hangs (see
[1] for reproducer)
* With this patch applied on top, I cannot reproduce the hangs anymore.

Thanks!

[1]
https://lore.kernel.org/kvm/832697b9-3652-422d-a019-8c0574a188ac@proxmox.com/
[2] https://bugzilla.kernel.org/show_bug.cgi?id=218259

Patch

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3c193b096b45..8ce9898914f1 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4415,6 +4415,25 @@  static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 	if (unlikely(!fault->slot))
 		return kvm_handle_noslot_fault(vcpu, fault, access);
 
+	/*
+	 * Pre-check for a relevant mmu_notifier invalidation event prior to
+	 * acquiring mmu_lock.  If there is an in-progress invalidation and the
+	 * kernel allows preemption, the invalidation task may drop mmu_lock
+	 * and yield in response to mmu_lock being contended, which is *very*
+	 * counter-productive as this vCPU can't actually make forward progress
+	 * until the invalidation completes.  This "unsafe" check can get false
+	 * negatives, i.e. KVM needs to re-check after acquiring mmu_lock.
+	 *
+	 * Do the pre-check even for non-preemptible kernels, i.e. even if KVM
+	 * will never yield mmu_lock in response to contention, as this vCPU is
+	 * *guaranteed* to need to retry, i.e. waiting until mmu_lock is held
+	 * to detect retry guarantees the worst case latency for the vCPU.
+	 */
+	if (mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn)) {
+		kvm_release_pfn_clean(fault->pfn);
+		return RET_PF_RETRY;
+	}
+
 	return RET_PF_CONTINUE;
 }
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 7e7fd25b09b3..179df96b20f8 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2031,6 +2031,32 @@  static inline int mmu_invalidate_retry_gfn(struct kvm *kvm,
 		return 1;
 	return 0;
 }
+
+/*
+ * This lockless version of the range-based retry check *must* be paired with a
+ * call to the locked version after acquiring mmu_lock, i.e. this is safe to
+ * use only as a pre-check to avoid contending mmu_lock.  This version *will*
+ * get false negatives and false positives.
+ */
+static inline bool mmu_invalidate_retry_gfn_unsafe(struct kvm *kvm,
+						   unsigned long mmu_seq,
+						   gfn_t gfn)
+{
+	/*
+	 * Use READ_ONCE() to ensure the in-progress flag and sequence counter
+	 * are always read from memory, e.g. so that checking for retry in a
+	 * loop won't result in an infinite retry loop.  Don't force loads for
+	 * start+end, as the key to avoiding infinite retry loops is observing
+	 * the 1=>0 transition of in-progress, i.e. getting false negatives
+	 * due to stale start+end values is acceptable.
+	 */
+	if (unlikely(READ_ONCE(kvm->mmu_invalidate_in_progress)) &&
+	    gfn >= kvm->mmu_invalidate_range_start &&
+	    gfn < kvm->mmu_invalidate_range_end)
+		return true;
+
+	return READ_ONCE(kvm->mmu_invalidate_seq) != mmu_seq;
+}
 #endif
 
 #ifdef CONFIG_HAVE_KVM_IRQ_ROUTING
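
For context, the pairing required by the comment above looks roughly like the
following in the fault path (a simplified sketch, not the literal mmu.c code):
the lockless helper only filters out obviously-doomed faults, while the check
under mmu_lock remains the authoritative one.

	/* Lockless pre-check, outside of mmu_lock; can get false negatives. */
	if (mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn)) {
		kvm_release_pfn_clean(fault->pfn);
		return RET_PF_RETRY;
	}

	...

	read_lock(&vcpu->kvm->mmu_lock);	/* write_lock() for the shadow MMU */

	/* Authoritative re-check now that mmu_lock is held. */
	if (mmu_invalidate_retry_gfn(vcpu->kvm, fault->mmu_seq, fault->gfn)) {
		r = RET_PF_RETRY;
		goto out_unlock;
	}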