
KVM: x86/mmu: Treat TDP MMU faults as spurious if access is already allowed

Message ID 20241218213611.3181643-1-seanjc@google.com (mailing list archive)
State New
Series KVM: x86/mmu: Treat TDP MMU faults as spurious if access is already allowed

Commit Message

Sean Christopherson Dec. 18, 2024, 9:36 p.m. UTC
Treat slow-path TDP MMU faults as spurious if the access is allowed given
the existing SPTE to fix a benign warning (other than the WARN itself)
due to replacing a writable SPTE with a read-only SPTE, and to avoid the
unnecessary LOCK CMPXCHG and subsequent TLB flush.

If a read fault races with a write fault, fast GUP fails for any reason
when trying to "promote" the read fault to a writable mapping, and KVM
resolves the write fault first, then KVM will end up trying to install a
read-only SPTE (for a !map_writable fault) overtop a writable SPTE.
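
To make the failure mode concrete, below is a minimal, standalone C model of
the access check (the bit positions are made up for illustration; the real
masks and the kernel's is_access_allowed() live in arch/x86/kvm/mmu/spte.h):

  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  /* Illustrative stand-ins for the kernel's SPTE permission bits. */
  #define SPTE_PRESENT	(1ull << 0)
  #define SPTE_WRITABLE	(1ull << 1)
  #define SPTE_EXEC	(1ull << 2)

  struct fault { bool exec, write; };

  /* Mirrors the logic of the kernel's is_access_allowed(). */
  static bool access_allowed(const struct fault *f, uint64_t spte)
  {
  	if (f->exec)
  		return spte & SPTE_EXEC;
  	if (f->write)
  		return spte & SPTE_WRITABLE;
  	/* Fault was on a read access; presence suffices. */
  	return spte & SPTE_PRESENT;
  }

  int main(void)
  {
  	/* The race above: the read fault's slow path finds a writable SPTE. */
  	uint64_t old_spte = SPTE_PRESENT | SPTE_WRITABLE;
  	struct fault read_fault = { .exec = false, .write = false };

  	/* Prints 1: the read is already permitted, so the fault is spurious
  	 * and there is no reason to overwrite the existing SPTE. */
  	printf("%d\n", access_allowed(&read_fault, old_spte));
  	return 0;
  }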

Note, it's not entirely clear why fast GUP fails, or if that's even how
KVM ends up with a !map_writable fault with a writable SPTE.  If something
else is going awry, e.g. due to a bug in mmu_notifiers, then treating read
faults as spurious in this scenario could effectively mask the underlying
problem.

However, retrying the faulting access instead of overwriting an existing
SPTE is functionally correct and desirable irrespective of the WARN, and
fast GUP _can_ legitimately fail with a writable VMA, e.g. if the Accessed
bit in the primary MMU's PTE is toggled and causes a PTE value mismatch.  The
WARN was also recently added, specifically to track down scenarios where
KVM unnecessarily overwrites SPTEs, i.e. treating the fault as spurious
doesn't regress KVM's bug-finding capabilities in any way.  In short,
letting the WARN linger because there's a tiny chance it's due to a bug
elsewhere would be excessively paranoid.

Fixes: 1a175082b190 ("KVM: x86/mmu: WARN and flush if resolving a TDP MMU fault clears MMU-writable")
Reported-by: leiyang@redhat.com
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219588
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c     | 12 ------------
 arch/x86/kvm/mmu/spte.h    | 17 +++++++++++++++++
 arch/x86/kvm/mmu/tdp_mmu.c |  5 +++++
 3 files changed, 22 insertions(+), 12 deletions(-)


base-commit: 3522c419758ee8dca5a0e8753ee0070a22157bc1

Comments

Sean Christopherson Dec. 19, 2024, 2:40 a.m. UTC | #1
On Wed, 18 Dec 2024 13:36:11 -0800, Sean Christopherson wrote:
> Treat slow-path TDP MMU faults as spurious if the access is allowed given
> the existing SPTE to fix a benign warning (other than the WARN itself)
> due to replacing a writable SPTE with a read-only SPTE, and to avoid the
> unnecessary LOCK CMPXCHG and subsequent TLB flush.
> 
> If a read fault races with a write fault, fast GUP fails for any reason
> when trying to "promote" the read fault to a writable mapping, and KVM
> resolves the write fault first, then KVM will end up trying to install a
> read-only SPTE (for a !map_writable fault) overtop a writable SPTE.
> 
> [...]

Applied very quickly to kvm-x86 fixes, so that it can get at least one day in
-next before I send it to Paolo.

[1/1] KVM: x86/mmu: Treat TDP MMU faults as spurious if access is already allowed
      https://github.com/kvm-x86/linux/commit/55f60a6498e7

--
https://github.com/kvm-x86/linux/tree/next
Lei Yang Dec. 19, 2024, 2:55 p.m. UTC | #2
I tested this patch with the bug's reproducer, and the problem is gone.

Tested-by: Lei Yang <leiyang@redhat.com>

On Thu, Dec 19, 2024 at 10:41 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, 18 Dec 2024 13:36:11 -0800, Sean Christopherson wrote:
> > Treat slow-path TDP MMU faults as spurious if the access is allowed given
> > the existing SPTE to fix a benign warning (other than the WARN itself)
> > due to replacing a writable SPTE with a read-only SPTE, and to avoid the
> > unnecessary LOCK CMPXCHG and subsequent TLB flush.
> >
> > If a read fault races with a write fault, fast GUP fails for any reason
> > when trying to "promote" the read fault to a writable mapping, and KVM
> > resolves the write fault first, then KVM will end up trying to install a
> > read-only SPTE (for a !map_writable fault) overtop a writable SPTE.
> >
> > [...]
>
> Applied very quickly to kvm-x86 fixes, so that it can get at least one day in
> -next before I send it to Paolo.
>
> [1/1] KVM: x86/mmu: Treat TDP MMU faults as spurious if access is already allowed
>       https://github.com/kvm-x86/linux/commit/55f60a6498e7
>
> --
> https://github.com/kvm-x86/linux/tree/next
>
Yan Zhao Dec. 20, 2024, 3:17 a.m. UTC | #3
This also fixed an Accessed-bit-related bug in TDX, where
"KVM_BUG_ON(was_present, kvm)" can be hit in the scenario below:


     CPU 0                 CPU 1
                           1. TD accepts GPA A
                           2. EPT violation in KVM
3. Prefault GPA A
4. Install an SPTE with
   Accessed bit unset
                           5. Install an SPTE with
                              Accessed bit set
                           6. KVM_BUG_ON() in set_external_spte_present()


Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>

On Wed, Dec 18, 2024 at 01:36:11PM -0800, Sean Christopherson wrote:
> Treat slow-path TDP MMU faults as spurious if the access is allowed given
> the existing SPTE to fix a benign warning (other than the WARN itself)
> due to replacing a writable SPTE with a read-only SPTE, and to avoid the
> unnecessary LOCK CMPXCHG and subsequent TLB flush.
> 
> If a read fault races with a write fault, fast GUP fails for any reason
> when trying to "promote" the read fault to a writable mapping, and KVM
> resolves the write fault first, then KVM will end up trying to install a
> read-only SPTE (for a !map_writable fault) overtop a writable SPTE.
> 
> Note, it's not entirely clear why fast GUP fails, or if that's even how
> KVM ends up with a !map_writable fault with a writable SPTE.  If something
> else is going awry, e.g. due to a bug in mmu_notifiers, then treating read
> faults as spurious in this scenario could effectively mask the underlying
> problem.
> 
> However, retrying the faulting access instead of overwriting an existing
> SPTE is functionally correct and desirable irrespective of the WARN, and
> fast GUP _can_ legitimately fail with a writable VMA, e.g. if the Accessed
> bit in the primary MMU's PTE is toggled and causes a PTE value mismatch.  The
> WARN was also recently added, specifically to track down scenarios where
> KVM unnecessarily overwrites SPTEs, i.e. treating the fault as spurious
> doesn't regress KVM's bug-finding capabilities in any way.  In short,
> letting the WARN linger because there's a tiny chance it's due to a bug
> elsewhere would be excessively paranoid.
> 
> Fixes: 1a175082b190 ("KVM: x86/mmu: WARN and flush if resolving a TDP MMU fault clears MMU-writable")
> Reported-by: leiyang@redhat.com
> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219588
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c     | 12 ------------
>  arch/x86/kvm/mmu/spte.h    | 17 +++++++++++++++++
>  arch/x86/kvm/mmu/tdp_mmu.c |  5 +++++
>  3 files changed, 22 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 22e7ad235123..2401606db260 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3364,18 +3364,6 @@ static bool fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu,
>  	return true;
>  }
>  
> -static bool is_access_allowed(struct kvm_page_fault *fault, u64 spte)
> -{
> -	if (fault->exec)
> -		return is_executable_pte(spte);
> -
> -	if (fault->write)
> -		return is_writable_pte(spte);
> -
> -	/* Fault was on Read access */
> -	return spte & PT_PRESENT_MASK;
> -}
> -
>  /*
>   * Returns the last level spte pointer of the shadow page walk for the given
>   * gpa, and sets *spte to the spte value. This spte may be non-preset. If no
> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index f332b33bc817..af10bc0380a3 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -461,6 +461,23 @@ static inline bool is_mmu_writable_spte(u64 spte)
>  	return spte & shadow_mmu_writable_mask;
>  }
>  
> +/*
> + * Returns true if the access indicated by @fault is allowed by the existing
> + * SPTE protections.  Note, the caller is responsible for checking that the
> + * SPTE is a shadow-present, leaf SPTE (either before or after).
> + */
> +static inline bool is_access_allowed(struct kvm_page_fault *fault, u64 spte)
> +{
> +	if (fault->exec)
> +		return is_executable_pte(spte);
> +
> +	if (fault->write)
> +		return is_writable_pte(spte);
> +
> +	/* Fault was on Read access */
> +	return spte & PT_PRESENT_MASK;
> +}
> +
>  /*
>   * If the MMU-writable flag is cleared, i.e. the SPTE is write-protected for
>   * write-tracking, remote TLBs must be flushed, even if the SPTE was read-only,
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 4508d868f1cd..2f15e0e33903 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -985,6 +985,11 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
>  	if (fault->prefetch && is_shadow_present_pte(iter->old_spte))
>  		return RET_PF_SPURIOUS;
>  
> +	if (is_shadow_present_pte(iter->old_spte) &&
> +	    is_access_allowed(fault, iter->old_spte) &&
> +	    is_last_spte(iter->old_spte, iter->level))
One nit:
Do we need to warn on pfn_changed?

> +		return RET_PF_SPURIOUS;
> +
>  	if (unlikely(!fault->slot))
>  		new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
>  	else
> 
> base-commit: 3522c419758ee8dca5a0e8753ee0070a22157bc1
> -- 
> 2.47.1.613.gc27f4b7a9f-goog
> 
>
Sean Christopherson Dec. 20, 2024, 3:55 p.m. UTC | #4
On Fri, Dec 20, 2024, Yan Zhao wrote:
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 4508d868f1cd..2f15e0e33903 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -985,6 +985,11 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
> >  	if (fault->prefetch && is_shadow_present_pte(iter->old_spte))
> >  		return RET_PF_SPURIOUS;
> >  
> > +	if (is_shadow_present_pte(iter->old_spte) &&
> > +	    is_access_allowed(fault, iter->old_spte) &&
> > +	    is_last_spte(iter->old_spte, iter->level))
> One nit:
> Do we need to warn on pfn_changed?

Hmm, I definitely don't think we "need" to, but it's not a bad idea.  The shadow
MMU kinda sorta WARNs on this scenario:

	if (!was_rmapped) {
		WARN_ON_ONCE(ret == RET_PF_SPURIOUS);
		rmap_add(vcpu, slot, sptep, gfn, pte_access);
	}

My only hesitation in adding a WARN is that the fast page fault path has similar
logic and doesn't WARN.  That's rather silly on my part, though: it ideally
would WARN, but grabbing the PFN to WARN would make it not-fast :-)

Want to post a patch?  I don't really want to squeeze the WARN into 6.13, just
in case there's some weird edge case we're forgetting.
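
For reference, a minimal sketch of what the suggested check-plus-WARN might
look like (the three-condition check is from this patch; the WARN line, built
on the kernel's existing fault->pfn field and spte_to_pfn() helper, is a
hypothetical addition that was not posted in this thread):

	if (is_shadow_present_pte(iter->old_spte) &&
	    is_access_allowed(fault, iter->old_spte) &&
	    is_last_spte(iter->old_spte, iter->level)) {
		/*
		 * A spurious fault should resolve to the already-mapped PFN;
		 * anything else suggests a missed invalidation elsewhere.
		 */
		WARN_ON_ONCE(fault->pfn != spte_to_pfn(iter->old_spte));
		return RET_PF_SPURIOUS;
	}

As noted above, any such WARN would be follow-up material rather than part of
this fix.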

Patch

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 22e7ad235123..2401606db260 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3364,18 +3364,6 @@ static bool fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu,
 	return true;
 }
 
-static bool is_access_allowed(struct kvm_page_fault *fault, u64 spte)
-{
-	if (fault->exec)
-		return is_executable_pte(spte);
-
-	if (fault->write)
-		return is_writable_pte(spte);
-
-	/* Fault was on Read access */
-	return spte & PT_PRESENT_MASK;
-}
-
 /*
  * Returns the last level spte pointer of the shadow page walk for the given
  * gpa, and sets *spte to the spte value. This spte may be non-preset. If no
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index f332b33bc817..af10bc0380a3 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -461,6 +461,23 @@ static inline bool is_mmu_writable_spte(u64 spte)
 	return spte & shadow_mmu_writable_mask;
 }
 
+/*
+ * Returns true if the access indicated by @fault is allowed by the existing
+ * SPTE protections.  Note, the caller is responsible for checking that the
+ * SPTE is a shadow-present, leaf SPTE (either before or after).
+ */
+static inline bool is_access_allowed(struct kvm_page_fault *fault, u64 spte)
+{
+	if (fault->exec)
+		return is_executable_pte(spte);
+
+	if (fault->write)
+		return is_writable_pte(spte);
+
+	/* Fault was on Read access */
+	return spte & PT_PRESENT_MASK;
+}
+
 /*
  * If the MMU-writable flag is cleared, i.e. the SPTE is write-protected for
  * write-tracking, remote TLBs must be flushed, even if the SPTE was read-only,
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 4508d868f1cd..2f15e0e33903 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -985,6 +985,11 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
 	if (fault->prefetch && is_shadow_present_pte(iter->old_spte))
 		return RET_PF_SPURIOUS;
 
+	if (is_shadow_present_pte(iter->old_spte) &&
+	    is_access_allowed(fault, iter->old_spte) &&
+	    is_last_spte(iter->old_spte, iter->level))
+		return RET_PF_SPURIOUS;
+
 	if (unlikely(!fault->slot))
 		new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
 	else