Message ID | 20241218213611.3181643-1-seanjc@google.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | KVM: x86/mmu: Treat TDP MMU faults as spurious if access is already allowed | expand |
On Wed, 18 Dec 2024 13:36:11 -0800, Sean Christopherson wrote: > Treat slow-path TDP MMU faults as spurious if the access is allowed given > the existing SPTE to fix a benign warning (other than the WARN itself) > due to replacing a writable SPTE with a read-only SPTE, and to avoid the > unnecessary LOCK CMPXCHG and subsequent TLB flush. > > If a read fault races with a write fault, fast GUP fails for any reason > when trying to "promote" the read fault to a writable mapping, and KVM > resolves the write fault first, then KVM will end up trying to install a > read-only SPTE (for a !map_writable fault) overtop a writable SPTE. > > [...] Applied very quickly to kvm-x86 fixes, so that it can get at least one day in -next before I send it to Paolo. [1/1] KVM: x86/mmu: Treat TDP MMU faults as spurious if access is already allowed https://github.com/kvm-x86/linux/commit/55f60a6498e7 -- https://github.com/kvm-x86/linux/tree/next
I tested this patch with the bug's reproducer, the problem has gone. Tested-by: Lei Yang <leiyang@redhat.com> On Thu, Dec 19, 2024 at 10:41 AM Sean Christopherson <seanjc@google.com> wrote: > > On Wed, 18 Dec 2024 13:36:11 -0800, Sean Christopherson wrote: > > Treat slow-path TDP MMU faults as spurious if the access is allowed given > > the existing SPTE to fix a benign warning (other than the WARN itself) > > due to replacing a writable SPTE with a read-only SPTE, and to avoid the > > unnecessary LOCK CMPXCHG and subsequent TLB flush. > > > > If a read fault races with a write fault, fast GUP fails for any reason > > when trying to "promote" the read fault to a writable mapping, and KVM > > resolves the write fault first, then KVM will end up trying to install a > > read-only SPTE (for a !map_writable fault) overtop a writable SPTE. > > > > [...] > > Applied very quickly to kvm-x86 fixes, so that it can get at least one day in > -next before I send it to Paolo. > > [1/1] KVM: x86/mmu: Treat TDP MMU faults as spurious if access is already allowed > https://github.com/kvm-x86/linux/commit/55f60a6498e7 > > -- > https://github.com/kvm-x86/linux/tree/next >
This also fixed an Accessed bit related bug in TDX, where "KVM_BUG_ON(was_present, kvm)" can be hit in below scenario: CPU 0 CPU 1 1. TD accepts GPA A 2. EPT violation in KVM 3. Prefault GPA A 4. Install an SPTE with Accessed bit unset 5. Install an SPTE with Accessed bit set 6. KVM_BUG_ON() in set_external_spte_present() Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> On Wed, Dec 18, 2024 at 01:36:11PM -0800, Sean Christopherson wrote: > Treat slow-path TDP MMU faults as spurious if the access is allowed given > the existing SPTE to fix a benign warning (other than the WARN itself) > due to replacing a writable SPTE with a read-only SPTE, and to avoid the > unnecessary LOCK CMPXCHG and subsequent TLB flush. > > If a read fault races with a write fault, fast GUP fails for any reason > when trying to "promote" the read fault to a writable mapping, and KVM > resolves the write fault first, then KVM will end up trying to install a > read-only SPTE (for a !map_writable fault) overtop a writable SPTE. > > Note, it's not entirely clear why fast GUP fails, or if that's even how > KVM ends up with a !map_writable fault with a writable SPTE. If something > else is going awry, e.g. due to a bug in mmu_notifiers, then treating read > faults as spurious in this scenario could effectively mask the underlying > problem. > > However, retrying the faulting access instead of overwriting an existing > SPTE is functionally correct and desirable irrespective of the WARN, and > fast GUP _can_ legitimately fail with a writable VMA, e.g. if the Accessed > bit in primary MMU's PTE is toggled and causes a PTE value mismatch. The > WARN was also recently added, specifically to track down scenarios where > KVM is unnecessarily overwrites SPTEs, i.e. treating the fault as spurious > doesn't regress KVM's bug-finding capabilities in any way. In short, > letting the WARN linger because there's a tiny chance it's due to a bug > elsewhere would be excessively paranoid. > > Fixes: 1a175082b190 ("KVM: x86/mmu: WARN and flush if resolving a TDP MMU fault clears MMU-writable") > Reported-by: leiyang@redhat.com > Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219588 > Signed-off-by: Sean Christopherson <seanjc@google.com> > --- > arch/x86/kvm/mmu/mmu.c | 12 ------------ > arch/x86/kvm/mmu/spte.h | 17 +++++++++++++++++ > arch/x86/kvm/mmu/tdp_mmu.c | 5 +++++ > 3 files changed, 22 insertions(+), 12 deletions(-) > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index 22e7ad235123..2401606db260 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -3364,18 +3364,6 @@ static bool fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, > return true; > } > > -static bool is_access_allowed(struct kvm_page_fault *fault, u64 spte) > -{ > - if (fault->exec) > - return is_executable_pte(spte); > - > - if (fault->write) > - return is_writable_pte(spte); > - > - /* Fault was on Read access */ > - return spte & PT_PRESENT_MASK; > -} > - > /* > * Returns the last level spte pointer of the shadow page walk for the given > * gpa, and sets *spte to the spte value. This spte may be non-preset. If no > diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h > index f332b33bc817..af10bc0380a3 100644 > --- a/arch/x86/kvm/mmu/spte.h > +++ b/arch/x86/kvm/mmu/spte.h > @@ -461,6 +461,23 @@ static inline bool is_mmu_writable_spte(u64 spte) > return spte & shadow_mmu_writable_mask; > } > > +/* > + * Returns true if the access indicated by @fault is allowed by the existing > + * SPTE protections. Note, the caller is responsible for checking that the > + * SPTE is a shadow-present, leaf SPTE (either before or after). > + */ > +static inline bool is_access_allowed(struct kvm_page_fault *fault, u64 spte) > +{ > + if (fault->exec) > + return is_executable_pte(spte); > + > + if (fault->write) > + return is_writable_pte(spte); > + > + /* Fault was on Read access */ > + return spte & PT_PRESENT_MASK; > +} > + > /* > * If the MMU-writable flag is cleared, i.e. the SPTE is write-protected for > * write-tracking, remote TLBs must be flushed, even if the SPTE was read-only, > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c > index 4508d868f1cd..2f15e0e33903 100644 > --- a/arch/x86/kvm/mmu/tdp_mmu.c > +++ b/arch/x86/kvm/mmu/tdp_mmu.c > @@ -985,6 +985,11 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu, > if (fault->prefetch && is_shadow_present_pte(iter->old_spte)) > return RET_PF_SPURIOUS; > > + if (is_shadow_present_pte(iter->old_spte) && > + is_access_allowed(fault, iter->old_spte) && > + is_last_spte(iter->old_spte, iter->level)) One nit: Do we need to warn on pfn_changed? > + return RET_PF_SPURIOUS; > + > if (unlikely(!fault->slot)) > new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL); > else > > base-commit: 3522c419758ee8dca5a0e8753ee0070a22157bc1 > -- > 2.47.1.613.gc27f4b7a9f-goog > >
On Fri, Dec 20, 2024, Yan Zhao wrote: > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c > > index 4508d868f1cd..2f15e0e33903 100644 > > --- a/arch/x86/kvm/mmu/tdp_mmu.c > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c > > @@ -985,6 +985,11 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu, > > if (fault->prefetch && is_shadow_present_pte(iter->old_spte)) > > return RET_PF_SPURIOUS; > > > > + if (is_shadow_present_pte(iter->old_spte) && > > + is_access_allowed(fault, iter->old_spte) && > > + is_last_spte(iter->old_spte, iter->level)) > One nit: > Do we need to warn on pfn_changed? Hmm, I definitely don't think we "need" to, but it's not a bad idea. The shadow MMU kinda sorta WARNs on this scenario: if (!was_rmapped) { WARN_ON_ONCE(ret == RET_PF_SPURIOUS); rmap_add(vcpu, slot, sptep, gfn, pte_access); } My only hesitation in adding a WARN is that the fast page fault path has similar logic and doesn't WARN, but that's rather silly on my part because it ideally would WARN, but grabbing the PFN to WARN would make it not-fast :-) Want to post a patch? I don't really want to squeeze the WARN into 6.13, just in case there's some weird edge case we're forgetting.
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 22e7ad235123..2401606db260 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -3364,18 +3364,6 @@ static bool fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, return true; } -static bool is_access_allowed(struct kvm_page_fault *fault, u64 spte) -{ - if (fault->exec) - return is_executable_pte(spte); - - if (fault->write) - return is_writable_pte(spte); - - /* Fault was on Read access */ - return spte & PT_PRESENT_MASK; -} - /* * Returns the last level spte pointer of the shadow page walk for the given * gpa, and sets *spte to the spte value. This spte may be non-preset. If no diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h index f332b33bc817..af10bc0380a3 100644 --- a/arch/x86/kvm/mmu/spte.h +++ b/arch/x86/kvm/mmu/spte.h @@ -461,6 +461,23 @@ static inline bool is_mmu_writable_spte(u64 spte) return spte & shadow_mmu_writable_mask; } +/* + * Returns true if the access indicated by @fault is allowed by the existing + * SPTE protections. Note, the caller is responsible for checking that the + * SPTE is a shadow-present, leaf SPTE (either before or after). + */ +static inline bool is_access_allowed(struct kvm_page_fault *fault, u64 spte) +{ + if (fault->exec) + return is_executable_pte(spte); + + if (fault->write) + return is_writable_pte(spte); + + /* Fault was on Read access */ + return spte & PT_PRESENT_MASK; +} + /* * If the MMU-writable flag is cleared, i.e. the SPTE is write-protected for * write-tracking, remote TLBs must be flushed, even if the SPTE was read-only, diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 4508d868f1cd..2f15e0e33903 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -985,6 +985,11 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu, if (fault->prefetch && is_shadow_present_pte(iter->old_spte)) return RET_PF_SPURIOUS; + if (is_shadow_present_pte(iter->old_spte) && + is_access_allowed(fault, iter->old_spte) && + is_last_spte(iter->old_spte, iter->level)) + return RET_PF_SPURIOUS; + if (unlikely(!fault->slot)) new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL); else
Treat slow-path TDP MMU faults as spurious if the access is allowed given the existing SPTE to fix a benign warning (other than the WARN itself) due to replacing a writable SPTE with a read-only SPTE, and to avoid the unnecessary LOCK CMPXCHG and subsequent TLB flush. If a read fault races with a write fault, fast GUP fails for any reason when trying to "promote" the read fault to a writable mapping, and KVM resolves the write fault first, then KVM will end up trying to install a read-only SPTE (for a !map_writable fault) overtop a writable SPTE. Note, it's not entirely clear why fast GUP fails, or if that's even how KVM ends up with a !map_writable fault with a writable SPTE. If something else is going awry, e.g. due to a bug in mmu_notifiers, then treating read faults as spurious in this scenario could effectively mask the underlying problem. However, retrying the faulting access instead of overwriting an existing SPTE is functionally correct and desirable irrespective of the WARN, and fast GUP _can_ legitimately fail with a writable VMA, e.g. if the Accessed bit in primary MMU's PTE is toggled and causes a PTE value mismatch. The WARN was also recently added, specifically to track down scenarios where KVM is unnecessarily overwrites SPTEs, i.e. treating the fault as spurious doesn't regress KVM's bug-finding capabilities in any way. In short, letting the WARN linger because there's a tiny chance it's due to a bug elsewhere would be excessively paranoid. Fixes: 1a175082b190 ("KVM: x86/mmu: WARN and flush if resolving a TDP MMU fault clears MMU-writable") Reported-by: leiyang@redhat.com Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219588 Signed-off-by: Sean Christopherson <seanjc@google.com> --- arch/x86/kvm/mmu/mmu.c | 12 ------------ arch/x86/kvm/mmu/spte.h | 17 +++++++++++++++++ arch/x86/kvm/mmu/tdp_mmu.c | 5 +++++ 3 files changed, 22 insertions(+), 12 deletions(-) base-commit: 3522c419758ee8dca5a0e8753ee0070a22157bc1