diff mbox series

KVM: x86/tdp_mmu: Trigger the callback only when an interesting change

Message ID 6eecc450d0326c9bedfbb34096a0279410923c8d.1726182754.git.isaku.yamahata@intel.com (mailing list archive)
State New, archived
Headers show
Series KVM: x86/tdp_mmu: Trigger the callback only when an interesting change | expand

Commit Message

Isaku Yamahata Sept. 12, 2024, 11:36 p.m. UTC
This patch is for kvm-coco-queue branch.  Please Feel free to squash this
patch to the patch that this fixes.

Call the link_external_spt() hook only when the present bit is changed
from non-present to present. And add debug print of SPTE values.

TDH.MEM.PAGE.AUG() of the TDX backend via set_external_spt() hook fails
unexpectedly with TDX_EPT_ENTRY_STATE_INCORRECT + SPTE PENDING or PRESENT.
When the multiple vCPUs fault in the same GFN simultaneously, the hook is
called many times with some bits changed while both old/new SPTEs have the
present bits.  The leaf SPTE is a complex state machine, and the value can
change with software available bits and hardware bits.  However, the
callback assumes that it's called only when non-present => present leaf
SPTE change.

There are several options:
- Tame the SPTE state machine so that SPTE change won't happen for mirrored
  page table.
  PRO: It's conceptually natural to disable D bit support because mirrored
       page table doesn't support AD bits.
  CON: Not only D bit but also other bits can be modified.
       setting strict kvm_page_table.role.ad_disabled = true doesn't work.
       It requires to change the SPTE more deeply.

- Add a check to the TDP MMU so that it won't call the hook unnecessarily
  PRO: Direct fix.
  CON: Hard code in the TDP MMU when the hook is needed.

- Pass old and new SPTE values to the hooks, add a check to the backend
  PRO: The backend can determine whether the callback is needed.
  CON: The KVM MMU implementation details leak to the backend because
       The SPTE encoding is specific to the KVM MMU implementation.
       And it's subject to change in the future.
       For example, is_shadow_present_pte() needs to be exported from the
       KVM MMU to the backend.

The first choice is out because it doesn't easily handle the problem.  The
second option was chosen over the third choice because the current
interesting change is only non-present => present because we don't support
dirty page tracking by write protect and large page yet by TDX KVM for
now. (TDX supports them.) They're future work for TDX KVM.

Note that we care only about the leaf SPTE because non-leaf doesn't have a
state machine.  See make_nonleaf_spte().

The following is a summary of how the KVM MMU and the TDX behave to help
understanding.

KVM MMU behavior
================
The leaf SPTE state machine is coded in make_spte().  Consider AD bits and
the present bits for simplicity.  The two functionalities and AD bits
support are related in this context.  unsync (manipulate D bit and W bit,
and handle write protect fault) and access tracking (manipulate A bit and
present bit, and hand handle page fault).  (We don't consider dirty page
tracking for now as it's future work of TDX KVM.)

* If AD bit is enabled:
D bit state change for dirty page tracking
On the first EPT violation without prefetch,
- D bits are set.
- Make SPTE writable as TDX supports only RXW (or if write fault).
  (TDX KVM doesn't support write protection at this state. It's future work.)

On the second EPT violation.
- clear D bits if write fault.

A bit change for access tracking
If prefetch:
- clear A bit and clear present bit.

If !prefetch:
- Set A bit and set present bit.

* If AD bit is disabled:
- AD bits aren't set.
- Access tracking is still enabled.  A bit isn't set with only the present
  bits change.

TDX behavior
============
On the first EPT violation:
The SPTE of the mirrored page table is changed from non-present to present.
i.e. is_shadow_present_pte() false => true

If guest memory access causes an EPT violation, the SPTE state changes as
follows.
  FREE => EPT violation: TDH.MEM.PAGE.AUG()
       => PENDING and inject #VE
       => #VE handler issues TDH.MEM.PAGE.ACCEPT()
       => MAPPED

If TD.ATTRIBUTES.SEPT_VE_DISABLE=1, #VE isn't injected. The page remains in
PENDING and it falls in the EPT violation loop until another vCPU accepts
the page.

If TDG.MEM.PAGE.ACCEPT() causes EPT violation, and the SPTE state changes as
follows.
  FREE => EPT violation: TDH.MEM.PAGE.AUG()
       => MAPPED
       => TDG.MEM.PAGE.ACCEPT() success

On the racy second EPT violation, the SPTE was made present.  The page
state can be PENDING or MAPPED.
If PENDING, error = TDX_EPT_ENTRY_STATE_INCORRECT and TDX_SEPT_PENDING
If MAPPED, error = TDX_EPT_ENTRY_STATE_INCORRECT and TDX_SEPT_PRESENT

By checking was_present = false, is_present = true, we can avoid both
cases.

Reported-by: Sagi Shahar <sagis@google.com>
Fixes: 161d4f7c6d80 ("KVM: x86/tdp_mmu: Propagate building mirror page tables")
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/mmu/spte.h    |  6 ++++++
 arch/x86/kvm/mmu/tdp_mmu.c | 28 +++++++++++++++++++++++++---
 2 files changed, 31 insertions(+), 3 deletions(-)


base-commit: d2c7662a6ea1c325a9ae878b3f1a265264bcd18b

Comments

Sean Christopherson Sept. 13, 2024, 12:07 a.m. UTC | #1
On Thu, Sep 12, 2024, Isaku Yamahata wrote:
> KVM MMU behavior
> ================
> The leaf SPTE state machine is coded in make_spte().  Consider AD bits and
> the present bits for simplicity.  The two functionalities and AD bits
> support are related in this context.  unsync (manipulate D bit and W bit,
> and handle write protect fault) and access tracking (manipulate A bit and
> present bit, and hand handle page fault).  (We don't consider dirty page
> tracking for now as it's future work of TDX KVM.)
> 
> * If AD bit is enabled:
> D bit state change for dirty page tracking
> On the first EPT violation without prefetch,
> - D bits are set.
> - Make SPTE writable as TDX supports only RXW (or if write fault).
>   (TDX KVM doesn't support write protection at this state. It's future work.)
> 
> On the second EPT violation.
> - clear D bits if write fault.

Heh, I was literally just writing changelogs for patches to address this (I told
Sagi I would do it "this week"; that was four weeks ago).

This is a bug in make_spte().  Replacing a W=1,D=1 SPTE with a W=1,D=0 SPTE is
nonsensical.  And I'm pretty sure it's an outright but for the TDP MMU (see below).

Right now, the fixes for make_spte() are sitting toward the end of the massive
kvm_follow_pfn() rework (80+ patches and counting), but despite the size, I am
fairly confident that series can land in 6.13 (lots and lots of small patches).

---
Author:     Sean Christopherson <seanjc@google.com>
AuthorDate: Thu Sep 12 16:23:21 2024 -0700
Commit:     Sean Christopherson <seanjc@google.com>
CommitDate: Thu Sep 12 16:35:06 2024 -0700

    KVM: x86/mmu: Flush TLBs if resolving a TDP MMU fault clears W or D bits
    
    Do a remote TLB flush if installing a leaf SPTE overwrites an existing
    leaf SPTE (with the same target pfn) and clears the Writable bit or the
    Dirty bit.  KVM isn't _supposed_ to clear Writable or Dirty bits in such
    a scenario, but make_spte() has a flaw where it will fail to set the Dirty
    if the existing SPTE is writable.
    
    E.g. if two vCPUs race to handle faults, the KVM will install a W=1,D=1
    SPTE for the first vCPU, and then overwrite it with a W=1,D=0 SPTE for the
    second vCPU.  If the first vCPU (or another vCPU) accesses memory using
    the W=1,D=1 SPTE, i.e. creates a writable, dirty TLB entry, and that is
    the only SPTE that is dirty at the time of the next relevant clearing of
    the dirty logs, then clear_dirty_gfn_range() will not modify any SPTEs
    because it sees the D=0 SPTE, and thus will complete the clearing of the
    dirty logs without performing a TLB flush.
    
    Opportunistically harden the TDP MMU against clearing the Writable bit as
    well, both to prevent similar bugs for write-protection, but also so that
    the logic can eventually be deduplicated into spte.c (mmu_spte_update() in
    the shadow MMU has similar logic).
    
    Fix the bug in the TDP MMU's page fault handler even though make_spte() is
    clearly doing odd things, e.g. it marks the page dirty in its slot but
    doesn't set the Dirty bit.  Precisely because replacing a leaf SPTE with
    another leaf SPTE is so rare, the cost of hardening KVM against such bugs
    is negligible.  The make_spte() will be addressed in a future commit.
    
    Fixes: bb18842e2111 ("kvm: x86/mmu: Add TDP MMU PF handler")
    Cc: David Matlack <dmatlack@google.com>
    Cc: stable@vger.kernel.org
    Signed-off-by: Sean Christopherson <seanjc@google.com>

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 3b996c1fdaab..7c6d1c610f0e 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1038,7 +1038,9 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
        else if (tdp_mmu_set_spte_atomic(vcpu->kvm, iter, new_spte))
                return RET_PF_RETRY;
        else if (is_shadow_present_pte(iter->old_spte) &&
-                !is_last_spte(iter->old_spte, iter->level))
+                (!is_last_spte(iter->old_spte, iter->level) ||
+                 (is_mmu_writable_spte(old_spte) && !is_writable_pte(new_spte)) ||
+                 (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte))))
                kvm_flush_remote_tlbs_gfn(vcpu->kvm, iter->gfn, iter->level);
 
        /*
---

>  arch/x86/kvm/mmu/spte.h    |  6 ++++++
>  arch/x86/kvm/mmu/tdp_mmu.c | 28 +++++++++++++++++++++++++---
>  2 files changed, 31 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index a72f0e3bde17..1726f8ec5a50 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -214,6 +214,12 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
>   */
>  #define FROZEN_SPTE	(SHADOW_NONPRESENT_VALUE | 0x5a0ULL)
>  
> +#define EXTERNAL_SPTE_IGNORE_CHANGE_MASK		\
> +	(shadow_acc_track_mask |			\
> +	 (SHADOW_ACC_TRACK_SAVED_BITS_MASK <<		\
> +	  SHADOW_ACC_TRACK_SAVED_BITS_SHIFT) |		\

Just make TDX require A/D bits, there's no reason to care about access tracking.

> +	 shadow_dirty_mask | shadow_accessed_mask)
> +
>  /* Removed SPTEs must not be misconstrued as shadow present PTEs. */
>  static_assert(!(FROZEN_SPTE & SPTE_MMU_PRESENT_MASK));
>  
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 019b43723d90..cfb82ede8982 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -503,8 +503,6 @@ static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sp
>  	kvm_pfn_t new_pfn = spte_to_pfn(new_spte);
>  	int ret = 0;
>  
> -	KVM_BUG_ON(was_present, kvm);
> -
>  	lockdep_assert_held(&kvm->mmu_lock);
>  	/*
>  	 * We need to lock out other updates to the SPTE until the external
> @@ -519,10 +517,34 @@ static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sp
>  	 * external page table, or leaf.
>  	 */
>  	if (is_leaf) {
> -		ret = static_call(kvm_x86_set_external_spte)(kvm, gfn, level, new_pfn);
> +		/*
> +		 * SPTE is state machine with software available bits used.
> +		 * Check if the change is interesting to the backend.
> +		 */
> +		if (!was_present)
> +			ret = static_call(kvm_x86_set_external_spte)(kvm, gfn, level, new_pfn);
> +		else {
> +			/*
> +			 * The external PTEs don't need updates for some bits,
> +			 * but if others are changed, bug the VM.
> +			 */
> +			if (KVM_BUG_ON(~EXTERNAL_SPTE_IGNORE_CHANGE_MASK &

There's no reason to bug the VM.  WARN, yes (maybe), but not bug the VM.  And I
think this should be short-circuited somewhere further up the chain, i.e. not
just for TDX.

One idea would be to WARN and skip setting the SPTE in tdp_mmu_map_handle_target_level().
I.e. WARN and ignore 1=>0 transitions for Writable and Dirty bits, and then drop
the TLB flush (assuming the above patch lands).

> +				       (old_spte ^ new_spte), kvm)) {

Curly braces are unnecessary.

> +				ret = -EIO;
> +			}
> +		}
> +
> +		/*
> +		 * The backend shouldn't return an error except EAGAIN.
> +		 * It's hard to debug without those info.
> +		 */
> +		if (ret && ret != EAGAIN)
> +			pr_debug("gfn 0x%llx old_spte 0x%llx new_spte 0x%llx level %d\n",
> +				 gfn, old_spte, new_spte, level);

Please no.  Not in upstream.  Yeah, for development it's fine, but sprinkling
printks all over the MMU eventually just results in stale printks, e.g. see all
of the pgprintk crud that got ripped out a while back.

>  	} else {
>  		void *external_spt = get_external_spt(gfn, new_spte, level);
>  
> +		KVM_BUG_ON(was_present, kvm);
>  		KVM_BUG_ON(!external_spt, kvm);
>  		ret = static_call(kvm_x86_link_external_spt)(kvm, gfn, level, external_spt);
>  	}
> 
> base-commit: d2c7662a6ea1c325a9ae878b3f1a265264bcd18b
> -- 
> 2.45.2
>
Isaku Yamahata Sept. 13, 2024, 1:22 a.m. UTC | #2
On Thu, Sep 12, 2024 at 05:07:57PM -0700,
Sean Christopherson <seanjc@google.com> wrote:

> On Thu, Sep 12, 2024, Isaku Yamahata wrote:
> > KVM MMU behavior
> > ================
> > The leaf SPTE state machine is coded in make_spte().  Consider AD bits and
> > the present bits for simplicity.  The two functionalities and AD bits
> > support are related in this context.  unsync (manipulate D bit and W bit,
> > and handle write protect fault) and access tracking (manipulate A bit and
> > present bit, and hand handle page fault).  (We don't consider dirty page
> > tracking for now as it's future work of TDX KVM.)
> > 
> > * If AD bit is enabled:
> > D bit state change for dirty page tracking
> > On the first EPT violation without prefetch,
> > - D bits are set.
> > - Make SPTE writable as TDX supports only RXW (or if write fault).
> >   (TDX KVM doesn't support write protection at this state. It's future work.)
> > 
> > On the second EPT violation.
> > - clear D bits if write fault.
> 
> Heh, I was literally just writing changelogs for patches to address this (I told
> Sagi I would do it "this week"; that was four weeks ago).
> 
> This is a bug in make_spte().  Replacing a W=1,D=1 SPTE with a W=1,D=0 SPTE is
> nonsensical.  And I'm pretty sure it's an outright but for the TDP MMU (see below).
> 
> Right now, the fixes for make_spte() are sitting toward the end of the massive
> kvm_follow_pfn() rework (80+ patches and counting), but despite the size, I am
> fairly confident that series can land in 6.13 (lots and lots of small patches).

Thanks for addressing this.


> ---
> Author:     Sean Christopherson <seanjc@google.com>
> AuthorDate: Thu Sep 12 16:23:21 2024 -0700
> Commit:     Sean Christopherson <seanjc@google.com>
> CommitDate: Thu Sep 12 16:35:06 2024 -0700
> 
>     KVM: x86/mmu: Flush TLBs if resolving a TDP MMU fault clears W or D bits
>     
>     Do a remote TLB flush if installing a leaf SPTE overwrites an existing
>     leaf SPTE (with the same target pfn) and clears the Writable bit or the
>     Dirty bit.  KVM isn't _supposed_ to clear Writable or Dirty bits in such
>     a scenario, but make_spte() has a flaw where it will fail to set the Dirty
>     if the existing SPTE is writable.
>     
>     E.g. if two vCPUs race to handle faults, the KVM will install a W=1,D=1
>     SPTE for the first vCPU, and then overwrite it with a W=1,D=0 SPTE for the
>     second vCPU.  If the first vCPU (or another vCPU) accesses memory using
>     the W=1,D=1 SPTE, i.e. creates a writable, dirty TLB entry, and that is
>     the only SPTE that is dirty at the time of the next relevant clearing of
>     the dirty logs, then clear_dirty_gfn_range() will not modify any SPTEs
>     because it sees the D=0 SPTE, and thus will complete the clearing of the
>     dirty logs without performing a TLB flush.
>     
>     Opportunistically harden the TDP MMU against clearing the Writable bit as
>     well, both to prevent similar bugs for write-protection, but also so that
>     the logic can eventually be deduplicated into spte.c (mmu_spte_update() in
>     the shadow MMU has similar logic).
>     
>     Fix the bug in the TDP MMU's page fault handler even though make_spte() is
>     clearly doing odd things, e.g. it marks the page dirty in its slot but
>     doesn't set the Dirty bit.  Precisely because replacing a leaf SPTE with
>     another leaf SPTE is so rare, the cost of hardening KVM against such bugs
>     is negligible.  The make_spte() will be addressed in a future commit.
>     
>     Fixes: bb18842e2111 ("kvm: x86/mmu: Add TDP MMU PF handler")
>     Cc: David Matlack <dmatlack@google.com>
>     Cc: stable@vger.kernel.org
>     Signed-off-by: Sean Christopherson <seanjc@google.com>
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 3b996c1fdaab..7c6d1c610f0e 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1038,7 +1038,9 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
>         else if (tdp_mmu_set_spte_atomic(vcpu->kvm, iter, new_spte))
>                 return RET_PF_RETRY;
>         else if (is_shadow_present_pte(iter->old_spte) &&
> -                !is_last_spte(iter->old_spte, iter->level))
> +                (!is_last_spte(iter->old_spte, iter->level) ||
> +                 (is_mmu_writable_spte(old_spte) && !is_writable_pte(new_spte)) ||
> +                 (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte))))
>                 kvm_flush_remote_tlbs_gfn(vcpu->kvm, iter->gfn, iter->level);
>  
>         /*
> ---
> 
> >  arch/x86/kvm/mmu/spte.h    |  6 ++++++
> >  arch/x86/kvm/mmu/tdp_mmu.c | 28 +++++++++++++++++++++++++---
> >  2 files changed, 31 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> > index a72f0e3bde17..1726f8ec5a50 100644
> > --- a/arch/x86/kvm/mmu/spte.h
> > +++ b/arch/x86/kvm/mmu/spte.h
> > @@ -214,6 +214,12 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
> >   */
> >  #define FROZEN_SPTE	(SHADOW_NONPRESENT_VALUE | 0x5a0ULL)
> >  
> > +#define EXTERNAL_SPTE_IGNORE_CHANGE_MASK		\
> > +	(shadow_acc_track_mask |			\
> > +	 (SHADOW_ACC_TRACK_SAVED_BITS_MASK <<		\
> > +	  SHADOW_ACC_TRACK_SAVED_BITS_SHIFT) |		\
> 
> Just make TDX require A/D bits, there's no reason to care about access tracking.

Ok.


> > +	 shadow_dirty_mask | shadow_accessed_mask)
> > +
> >  /* Removed SPTEs must not be misconstrued as shadow present PTEs. */
> >  static_assert(!(FROZEN_SPTE & SPTE_MMU_PRESENT_MASK));
> >  
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 019b43723d90..cfb82ede8982 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -503,8 +503,6 @@ static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sp
> >  	kvm_pfn_t new_pfn = spte_to_pfn(new_spte);
> >  	int ret = 0;
> >  
> > -	KVM_BUG_ON(was_present, kvm);
> > -
> >  	lockdep_assert_held(&kvm->mmu_lock);
> >  	/*
> >  	 * We need to lock out other updates to the SPTE until the external
> > @@ -519,10 +517,34 @@ static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sp
> >  	 * external page table, or leaf.
> >  	 */
> >  	if (is_leaf) {
> > -		ret = static_call(kvm_x86_set_external_spte)(kvm, gfn, level, new_pfn);
> > +		/*
> > +		 * SPTE is state machine with software available bits used.
> > +		 * Check if the change is interesting to the backend.
> > +		 */
> > +		if (!was_present)
> > +			ret = static_call(kvm_x86_set_external_spte)(kvm, gfn, level, new_pfn);
> > +		else {
> > +			/*
> > +			 * The external PTEs don't need updates for some bits,
> > +			 * but if others are changed, bug the VM.
> > +			 */
> > +			if (KVM_BUG_ON(~EXTERNAL_SPTE_IGNORE_CHANGE_MASK &
> 
> There's no reason to bug the VM.  WARN, yes (maybe), but not bug the VM.  And I
> think this should be short-circuited somewhere further up the chain, i.e. not
> just for TDX.
> 
> One idea would be to WARN and skip setting the SPTE in tdp_mmu_map_handle_target_level().
> I.e. WARN and ignore 1=>0 transitions for Writable and Dirty bits, and then drop
> the TLB flush (assuming the above patch lands).

Ok, let me give it a try.


> > +				       (old_spte ^ new_spte), kvm)) {
> 
> Curly braces are unnecessary.

Will fix.


> > +				ret = -EIO;
> > +			}
> > +		}
> > +
> > +		/*
> > +		 * The backend shouldn't return an error except EAGAIN.
> > +		 * It's hard to debug without those info.
> > +		 */
> > +		if (ret && ret != EAGAIN)
> > +			pr_debug("gfn 0x%llx old_spte 0x%llx new_spte 0x%llx level %d\n",
> > +				 gfn, old_spte, new_spte, level);
> 
> Please no.  Not in upstream.  Yeah, for development it's fine, but sprinkling
> printks all over the MMU eventually just results in stale printks, e.g. see all
> of the pgprintk crud that got ripped out a while back.

Ok, will remove it.


> >  	} else {
> >  		void *external_spt = get_external_spt(gfn, new_spte, level);
> >  
> > +		KVM_BUG_ON(was_present, kvm);
> >  		KVM_BUG_ON(!external_spt, kvm);
> >  		ret = static_call(kvm_x86_link_external_spt)(kvm, gfn, level, external_spt);
> >  	}
> > 
> > base-commit: d2c7662a6ea1c325a9ae878b3f1a265264bcd18b
> > -- 
> > 2.45.2
> > 
>
Yan Zhao Sept. 26, 2024, 7:53 a.m. UTC | #3
On Thu, Sep 12, 2024 at 05:07:57PM -0700, Sean Christopherson wrote:
> On Thu, Sep 12, 2024, Isaku Yamahata wrote:
> > KVM MMU behavior
> > ================
> > The leaf SPTE state machine is coded in make_spte().  Consider AD bits and
> > the present bits for simplicity.  The two functionalities and AD bits
> > support are related in this context.  unsync (manipulate D bit and W bit,
> > and handle write protect fault) and access tracking (manipulate A bit and
> > present bit, and hand handle page fault).  (We don't consider dirty page
> > tracking for now as it's future work of TDX KVM.)
> > 
> > * If AD bit is enabled:
> > D bit state change for dirty page tracking
> > On the first EPT violation without prefetch,
> > - D bits are set.
> > - Make SPTE writable as TDX supports only RXW (or if write fault).
> >   (TDX KVM doesn't support write protection at this state. It's future work.)
> > 
> > On the second EPT violation.
> > - clear D bits if write fault.
> 
> Heh, I was literally just writing changelogs for patches to address this (I told
> Sagi I would do it "this week"; that was four weeks ago).
> 
> This is a bug in make_spte().  Replacing a W=1,D=1 SPTE with a W=1,D=0 SPTE is
> nonsensical.  And I'm pretty sure it's an outright but for the TDP MMU (see below).
> 
> Right now, the fixes for make_spte() are sitting toward the end of the massive
> kvm_follow_pfn() rework (80+ patches and counting), but despite the size, I am
> fairly confident that series can land in 6.13 (lots and lots of small patches).
> 
> ---
> Author:     Sean Christopherson <seanjc@google.com>
> AuthorDate: Thu Sep 12 16:23:21 2024 -0700
> Commit:     Sean Christopherson <seanjc@google.com>
> CommitDate: Thu Sep 12 16:35:06 2024 -0700
> 
>     KVM: x86/mmu: Flush TLBs if resolving a TDP MMU fault clears W or D bits
>     
>     Do a remote TLB flush if installing a leaf SPTE overwrites an existing
>     leaf SPTE (with the same target pfn) and clears the Writable bit or the
>     Dirty bit.  KVM isn't _supposed_ to clear Writable or Dirty bits in such
>     a scenario, but make_spte() has a flaw where it will fail to set the Dirty
>     if the existing SPTE is writable.
>     
>     E.g. if two vCPUs race to handle faults, the KVM will install a W=1,D=1
>     SPTE for the first vCPU, and then overwrite it with a W=1,D=0 SPTE for the
>     second vCPU.  If the first vCPU (or another vCPU) accesses memory using
>     the W=1,D=1 SPTE, i.e. creates a writable, dirty TLB entry, and that is
>     the only SPTE that is dirty at the time of the next relevant clearing of
>     the dirty logs, then clear_dirty_gfn_range() will not modify any SPTEs
>     because it sees the D=0 SPTE, and thus will complete the clearing of the
>     dirty logs without performing a TLB flush.
But it looks that kvm_flush_remote_tlbs_memslot() will always be invoked no
matter clear_dirty_gfn_range() finds a D bit or not.

kvm_mmu_slot_apply_flags
  |kvm_mmu_slot_leaf_clear_dirty
  |  |kvm_tdp_mmu_clear_dirty_slot
  |    |clear_dirty_gfn_range
  |kvm_flush_remote_tlbs_memslot
     
>     Opportunistically harden the TDP MMU against clearing the Writable bit as
>     well, both to prevent similar bugs for write-protection, but also so that
>     the logic can eventually be deduplicated into spte.c (mmu_spte_update() in
>     the shadow MMU has similar logic).
>     
>     Fix the bug in the TDP MMU's page fault handler even though make_spte() is
>     clearly doing odd things, e.g. it marks the page dirty in its slot but
>     doesn't set the Dirty bit.  Precisely because replacing a leaf SPTE with
>     another leaf SPTE is so rare, the cost of hardening KVM against such bugs
>     is negligible.  The make_spte() will be addressed in a future commit.
>     
>     Fixes: bb18842e2111 ("kvm: x86/mmu: Add TDP MMU PF handler")
>     Cc: David Matlack <dmatlack@google.com>
>     Cc: stable@vger.kernel.org
>     Signed-off-by: Sean Christopherson <seanjc@google.com>
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 3b996c1fdaab..7c6d1c610f0e 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1038,7 +1038,9 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
>         else if (tdp_mmu_set_spte_atomic(vcpu->kvm, iter, new_spte))
>                 return RET_PF_RETRY;
>         else if (is_shadow_present_pte(iter->old_spte) &&
> -                !is_last_spte(iter->old_spte, iter->level))
> +                (!is_last_spte(iter->old_spte, iter->level) ||
> +                 (is_mmu_writable_spte(old_spte) && !is_writable_pte(new_spte)) ||
> +                 (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte))))
Should we also check is_accessed_spte() as what's done in mmu_spte_update()?

It is possible for make_spte() to make the second spte !is_dirty_spte(), e.g.
the second one is caused by a KVM_PRE_FAULT_MEMORY ioctl.

>                 kvm_flush_remote_tlbs_gfn(vcpu->kvm, iter->gfn, iter->level);
>  
>         /*
> ---
> 
> >  arch/x86/kvm/mmu/spte.h    |  6 ++++++
> >  arch/x86/kvm/mmu/tdp_mmu.c | 28 +++++++++++++++++++++++++---
> >  2 files changed, 31 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> > index a72f0e3bde17..1726f8ec5a50 100644
> > --- a/arch/x86/kvm/mmu/spte.h
> > +++ b/arch/x86/kvm/mmu/spte.h
> > @@ -214,6 +214,12 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
> >   */
> >  #define FROZEN_SPTE	(SHADOW_NONPRESENT_VALUE | 0x5a0ULL)
> >  
> > +#define EXTERNAL_SPTE_IGNORE_CHANGE_MASK		\
> > +	(shadow_acc_track_mask |			\
> > +	 (SHADOW_ACC_TRACK_SAVED_BITS_MASK <<		\
> > +	  SHADOW_ACC_TRACK_SAVED_BITS_SHIFT) |		\
> 
> Just make TDX require A/D bits, there's no reason to care about access tracking.
If KVM_PRE_FAULT_MEMORY is allowed for TDX and if
cpu_has_vmx_ept_ad_bits() is false in TDX's hardware (not sure if it's possible),
access tracking bit is possbile to be changed, as in below scenario:

1. KVM_PRE_FAULT_MEMORY ioctl calls kvm_arch_vcpu_pre_fault_memory() to map
   a GFN, and make_spte() will call mark_spte_for_access_track() to
   remove shadow_acc_track_mask (i.e. RWX bits) and move R+X left to
   SHADOW_ACC_TRACK_SAVED_BITS_SHIFT.
2. If a concurrent page fault occurs on the same GFN on another vCPU, then
   make_spte() in that vCPU will not see prefetch and the new_spte is
   with RWX bits and with no bits set in SHADOW_ACC_TRACK_SAVED_MASK.

> 
> > +	 shadow_dirty_mask | shadow_accessed_mask)
> > +
> >  /* Removed SPTEs must not be misconstrued as shadow present PTEs. */
> >  static_assert(!(FROZEN_SPTE & SPTE_MMU_PRESENT_MASK));
> >  
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 019b43723d90..cfb82ede8982 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -503,8 +503,6 @@ static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sp
> >  	kvm_pfn_t new_pfn = spte_to_pfn(new_spte);
> >  	int ret = 0;
> >  
> > -	KVM_BUG_ON(was_present, kvm);
> > -
> >  	lockdep_assert_held(&kvm->mmu_lock);
> >  	/*
> >  	 * We need to lock out other updates to the SPTE until the external
> > @@ -519,10 +517,34 @@ static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sp
> >  	 * external page table, or leaf.
> >  	 */
> >  	if (is_leaf) {
> > -		ret = static_call(kvm_x86_set_external_spte)(kvm, gfn, level, new_pfn);
> > +		/*
> > +		 * SPTE is state machine with software available bits used.
> > +		 * Check if the change is interesting to the backend.
> > +		 */
> > +		if (!was_present)
> > +			ret = static_call(kvm_x86_set_external_spte)(kvm, gfn, level, new_pfn);
> > +		else {
> > +			/*
> > +			 * The external PTEs don't need updates for some bits,
> > +			 * but if others are changed, bug the VM.
> > +			 */
> > +			if (KVM_BUG_ON(~EXTERNAL_SPTE_IGNORE_CHANGE_MASK &
> 
> There's no reason to bug the VM.  WARN, yes (maybe), but not bug the VM.  And I
> think this should be short-circuited somewhere further up the chain, i.e. not
> just for TDX.
> 
> One idea would be to WARN and skip setting the SPTE in tdp_mmu_map_handle_target_level().
> I.e. WARN and ignore 1=>0 transitions for Writable and Dirty bits, and then drop
> the TLB flush (assuming the above patch lands).
> 
> > +				       (old_spte ^ new_spte), kvm)) {
> 
> Curly braces are unnecessary.
> 
> > +				ret = -EIO;
> > +			}
> > +		}
> > +
> > +		/*
> > +		 * The backend shouldn't return an error except EAGAIN.
> > +		 * It's hard to debug without those info.
> > +		 */
> > +		if (ret && ret != EAGAIN)
> > +			pr_debug("gfn 0x%llx old_spte 0x%llx new_spte 0x%llx level %d\n",
> > +				 gfn, old_spte, new_spte, level);
> 
> Please no.  Not in upstream.  Yeah, for development it's fine, but sprinkling
> printks all over the MMU eventually just results in stale printks, e.g. see all
> of the pgprintk crud that got ripped out a while back.
> 
> >  	} else {
> >  		void *external_spt = get_external_spt(gfn, new_spte, level);
> >  
> > +		KVM_BUG_ON(was_present, kvm);
> >  		KVM_BUG_ON(!external_spt, kvm);
> >  		ret = static_call(kvm_x86_link_external_spt)(kvm, gfn, level, external_spt);
> >  	}
> > 
> > base-commit: d2c7662a6ea1c325a9ae878b3f1a265264bcd18b
> > -- 
> > 2.45.2
> > 
>
Sean Christopherson Sept. 27, 2024, 1:51 p.m. UTC | #4
On Thu, Sep 26, 2024, Yan Zhao wrote:
> On Thu, Sep 12, 2024 at 05:07:57PM -0700, Sean Christopherson wrote:
> > On Thu, Sep 12, 2024, Isaku Yamahata wrote:
> > > KVM MMU behavior
> > > ================
> > > The leaf SPTE state machine is coded in make_spte().  Consider AD bits and
> > > the present bits for simplicity.  The two functionalities and AD bits
> > > support are related in this context.  unsync (manipulate D bit and W bit,
> > > and handle write protect fault) and access tracking (manipulate A bit and
> > > present bit, and hand handle page fault).  (We don't consider dirty page
> > > tracking for now as it's future work of TDX KVM.)
> > > 
> > > * If AD bit is enabled:
> > > D bit state change for dirty page tracking
> > > On the first EPT violation without prefetch,
> > > - D bits are set.
> > > - Make SPTE writable as TDX supports only RXW (or if write fault).
> > >   (TDX KVM doesn't support write protection at this state. It's future work.)
> > > 
> > > On the second EPT violation.
> > > - clear D bits if write fault.
> > 
> > Heh, I was literally just writing changelogs for patches to address this (I told
> > Sagi I would do it "this week"; that was four weeks ago).
> > 
> > This is a bug in make_spte().  Replacing a W=1,D=1 SPTE with a W=1,D=0 SPTE is
> > nonsensical.  And I'm pretty sure it's an outright but for the TDP MMU (see below).
> > 
> > Right now, the fixes for make_spte() are sitting toward the end of the massive
> > kvm_follow_pfn() rework (80+ patches and counting), but despite the size, I am
> > fairly confident that series can land in 6.13 (lots and lots of small patches).
> > 
> > ---
> > Author:     Sean Christopherson <seanjc@google.com>
> > AuthorDate: Thu Sep 12 16:23:21 2024 -0700
> > Commit:     Sean Christopherson <seanjc@google.com>
> > CommitDate: Thu Sep 12 16:35:06 2024 -0700
> > 
> >     KVM: x86/mmu: Flush TLBs if resolving a TDP MMU fault clears W or D bits
> >     
> >     Do a remote TLB flush if installing a leaf SPTE overwrites an existing
> >     leaf SPTE (with the same target pfn) and clears the Writable bit or the
> >     Dirty bit.  KVM isn't _supposed_ to clear Writable or Dirty bits in such
> >     a scenario, but make_spte() has a flaw where it will fail to set the Dirty
> >     if the existing SPTE is writable.
> >     
> >     E.g. if two vCPUs race to handle faults, the KVM will install a W=1,D=1
> >     SPTE for the first vCPU, and then overwrite it with a W=1,D=0 SPTE for the
> >     second vCPU.  If the first vCPU (or another vCPU) accesses memory using
> >     the W=1,D=1 SPTE, i.e. creates a writable, dirty TLB entry, and that is
> >     the only SPTE that is dirty at the time of the next relevant clearing of
> >     the dirty logs, then clear_dirty_gfn_range() will not modify any SPTEs
> >     because it sees the D=0 SPTE, and thus will complete the clearing of the
> >     dirty logs without performing a TLB flush.
> But it looks that kvm_flush_remote_tlbs_memslot() will always be invoked no
> matter clear_dirty_gfn_range() finds a D bit or not.

Oh, right, I forgot about that.  I'll tweak the changelog to call that out before
posting.  Hmm, and I'll drop the Cc: stable@ too, as commit b64d740ea7dd ("kvm:
x86: mmu: Always flush TLBs when enabling dirty logging") was a bug fix, i.e. if
anything should be backported it's that commit.

> kvm_mmu_slot_apply_flags
>   |kvm_mmu_slot_leaf_clear_dirty
>   |  |kvm_tdp_mmu_clear_dirty_slot
>   |    |clear_dirty_gfn_range
>   |kvm_flush_remote_tlbs_memslot
>      
> >     Opportunistically harden the TDP MMU against clearing the Writable bit as
> >     well, both to prevent similar bugs for write-protection, but also so that
> >     the logic can eventually be deduplicated into spte.c (mmu_spte_update() in
> >     the shadow MMU has similar logic).
> >     
> >     Fix the bug in the TDP MMU's page fault handler even though make_spte() is
> >     clearly doing odd things, e.g. it marks the page dirty in its slot but
> >     doesn't set the Dirty bit.  Precisely because replacing a leaf SPTE with
> >     another leaf SPTE is so rare, the cost of hardening KVM against such bugs
> >     is negligible.  The make_spte() will be addressed in a future commit.
> >     
> >     Fixes: bb18842e2111 ("kvm: x86/mmu: Add TDP MMU PF handler")
> >     Cc: David Matlack <dmatlack@google.com>
> >     Cc: stable@vger.kernel.org
> >     Signed-off-by: Sean Christopherson <seanjc@google.com>
> > 
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 3b996c1fdaab..7c6d1c610f0e 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -1038,7 +1038,9 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
> >         else if (tdp_mmu_set_spte_atomic(vcpu->kvm, iter, new_spte))
> >                 return RET_PF_RETRY;
> >         else if (is_shadow_present_pte(iter->old_spte) &&
> > -                !is_last_spte(iter->old_spte, iter->level))
> > +                (!is_last_spte(iter->old_spte, iter->level) ||
> > +                 (is_mmu_writable_spte(old_spte) && !is_writable_pte(new_spte)) ||
> > +                 (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte))))
> Should we also check is_accessed_spte() as what's done in mmu_spte_update()?

Heh, it's impossible to see in this standalone "posting", but earlier in the
massive series is a patch that removes that code.  Aptly titled "KVM: x86/mmu:
Don't force flush if SPTE update clears Accessed bit".

> It is possible for make_spte() to make the second spte !is_dirty_spte(), e.g.
> the second one is caused by a KVM_PRE_FAULT_MEMORY ioctl.

Ya, the mega-series also has a fix for that: "KVM: x86/mmu: Always set SPTE's
dirty bit if it's created as writable".

> >                 kvm_flush_remote_tlbs_gfn(vcpu->kvm, iter->gfn, iter->level);
> >  
> >         /*
> > ---
> > 
> > >  arch/x86/kvm/mmu/spte.h    |  6 ++++++
> > >  arch/x86/kvm/mmu/tdp_mmu.c | 28 +++++++++++++++++++++++++---
> > >  2 files changed, 31 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> > > index a72f0e3bde17..1726f8ec5a50 100644
> > > --- a/arch/x86/kvm/mmu/spte.h
> > > +++ b/arch/x86/kvm/mmu/spte.h
> > > @@ -214,6 +214,12 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
> > >   */
> > >  #define FROZEN_SPTE	(SHADOW_NONPRESENT_VALUE | 0x5a0ULL)
> > >  
> > > +#define EXTERNAL_SPTE_IGNORE_CHANGE_MASK		\
> > > +	(shadow_acc_track_mask |			\
> > > +	 (SHADOW_ACC_TRACK_SAVED_BITS_MASK <<		\
> > > +	  SHADOW_ACC_TRACK_SAVED_BITS_SHIFT) |		\
> > 
> > Just make TDX require A/D bits, there's no reason to care about access tracking.
> If KVM_PRE_FAULT_MEMORY is allowed for TDX and if
> cpu_has_vmx_ept_ad_bits() is false in TDX's hardware (not sure if it's possible),

Make it a requirement in KVM that TDX hardware supports A/D bits and that KVM's
module param is enabled.  EPT A/D bits have been supported in all CPUs since
Haswell, I don't expect them to ever go away.

> access tracking bit is possbile to be changed, as in below scenario:
> 
> 1. KVM_PRE_FAULT_MEMORY ioctl calls kvm_arch_vcpu_pre_fault_memory() to map
>    a GFN, and make_spte() will call mark_spte_for_access_track() to
>    remove shadow_acc_track_mask (i.e. RWX bits) and move R+X left to
>    SHADOW_ACC_TRACK_SAVED_BITS_SHIFT.
> 2. If a concurrent page fault occurs on the same GFN on another vCPU, then
>    make_spte() in that vCPU will not see prefetch and the new_spte is
>    with RWX bits and with no bits set in SHADOW_ACC_TRACK_SAVED_MASK.

This should be fixed by the mega-series.  I'll make sure to Cc you on that series.

Thanks much for the input and feedback!
Sean Christopherson Sept. 27, 2024, 2:32 p.m. UTC | #5
On Fri, Sep 27, 2024, Sean Christopherson wrote:
> On Thu, Sep 26, 2024, Yan Zhao wrote:
> > On Thu, Sep 12, 2024 at 05:07:57PM -0700, Sean Christopherson wrote:
> > > On Thu, Sep 12, 2024, Isaku Yamahata wrote:
> > > Right now, the fixes for make_spte() are sitting toward the end of the massive
> > > kvm_follow_pfn() rework (80+ patches and counting), but despite the size, I am
> > > fairly confident that series can land in 6.13 (lots and lots of small patches).
> > > 
> > > ---
> > > Author:     Sean Christopherson <seanjc@google.com>
> > > AuthorDate: Thu Sep 12 16:23:21 2024 -0700
> > > Commit:     Sean Christopherson <seanjc@google.com>
> > > CommitDate: Thu Sep 12 16:35:06 2024 -0700
> > > 
> > >     KVM: x86/mmu: Flush TLBs if resolving a TDP MMU fault clears W or D bits
> > >     
> > >     Do a remote TLB flush if installing a leaf SPTE overwrites an existing
> > >     leaf SPTE (with the same target pfn) and clears the Writable bit or the
> > >     Dirty bit.  KVM isn't _supposed_ to clear Writable or Dirty bits in such
> > >     a scenario, but make_spte() has a flaw where it will fail to set the Dirty
> > >     if the existing SPTE is writable.
> > >     
> > >     E.g. if two vCPUs race to handle faults, the KVM will install a W=1,D=1
> > >     SPTE for the first vCPU, and then overwrite it with a W=1,D=0 SPTE for the
> > >     second vCPU.  If the first vCPU (or another vCPU) accesses memory using
> > >     the W=1,D=1 SPTE, i.e. creates a writable, dirty TLB entry, and that is
> > >     the only SPTE that is dirty at the time of the next relevant clearing of
> > >     the dirty logs, then clear_dirty_gfn_range() will not modify any SPTEs
> > >     because it sees the D=0 SPTE, and thus will complete the clearing of the
> > >     dirty logs without performing a TLB flush.
> > But it looks that kvm_flush_remote_tlbs_memslot() will always be invoked no
> > matter clear_dirty_gfn_range() finds a D bit or not.
> 
> Oh, right, I forgot about that.  I'll tweak the changelog to call that out before
> posting.  Hmm, and I'll drop the Cc: stable@ too, as commit b64d740ea7dd ("kvm:
> x86: mmu: Always flush TLBs when enabling dirty logging") was a bug fix, i.e. if
> anything should be backported it's that commit.

Actually, a better idea.  I think it makes sense to fully commit to not flushing
when overwriting SPTEs, and instead rely on the dirty logging logic to do a remote
TLB flush.

E.g. on top of this change in the mega-series is a cleanup to unify the TDP MMU
and shadow MMU logic for clearing Writable and Dirty bits, with this comment
(which is a massaged version of an existing comment for mmu_spte_update()):

/*
 * Whenever an MMU-writable SPTE is overwritten with a read-only SPTE, remote
 * TLBs must be flushed.  Otherwise write-protecting the gfn may find a read-
 * only SPTE, even though the writable SPTE might be cached in a CPU's TLB.
 *
 * Remote TLBs also need to be flushed if the Dirty bit is cleared, as false
 * negatives are not acceptable, e.g. if KVM is using D-bit based PML on VMX.
 *
 * Don't flush if the Accessed bit is cleared, as access tracking tolerates
 * false negatives, and the one path that does care about TLB flushes,
 * kvm_mmu_notifier_clear_flush_young(), uses mmu_spte_update_no_track().
 *
 * Note, this logic only applies to leaf SPTEs.  The caller is responsible for
 * determining whether or not a TLB flush is required when modifying a shadow-
 * present non-leaf SPTE.
 */

But that comment is was made stale by commit b64d740ea7dd.  And looking through
the dirty logging logic, KVM (luckily? thankfully?) flushes based on whether or
not dirty bitmap/ring entries are reaped, not based on whether or not SPTEs were
modified.
Sean Christopherson Sept. 27, 2024, 3:06 p.m. UTC | #6
On Fri, Sep 27, 2024, Sean Christopherson wrote:
> On Fri, Sep 27, 2024, Sean Christopherson wrote:
> > On Thu, Sep 26, 2024, Yan Zhao wrote:
> > > On Thu, Sep 12, 2024 at 05:07:57PM -0700, Sean Christopherson wrote:
> > > > On Thu, Sep 12, 2024, Isaku Yamahata wrote:
> > > > Right now, the fixes for make_spte() are sitting toward the end of the massive
> > > > kvm_follow_pfn() rework (80+ patches and counting), but despite the size, I am
> > > > fairly confident that series can land in 6.13 (lots and lots of small patches).
> > > > 
> > > > ---
> > > > Author:     Sean Christopherson <seanjc@google.com>
> > > > AuthorDate: Thu Sep 12 16:23:21 2024 -0700
> > > > Commit:     Sean Christopherson <seanjc@google.com>
> > > > CommitDate: Thu Sep 12 16:35:06 2024 -0700
> > > > 
> > > >     KVM: x86/mmu: Flush TLBs if resolving a TDP MMU fault clears W or D bits
> > > >     
> > > >     Do a remote TLB flush if installing a leaf SPTE overwrites an existing
> > > >     leaf SPTE (with the same target pfn) and clears the Writable bit or the
> > > >     Dirty bit.  KVM isn't _supposed_ to clear Writable or Dirty bits in such
> > > >     a scenario, but make_spte() has a flaw where it will fail to set the Dirty
> > > >     if the existing SPTE is writable.
> > > >     
> > > >     E.g. if two vCPUs race to handle faults, the KVM will install a W=1,D=1
> > > >     SPTE for the first vCPU, and then overwrite it with a W=1,D=0 SPTE for the
> > > >     second vCPU.  If the first vCPU (or another vCPU) accesses memory using
> > > >     the W=1,D=1 SPTE, i.e. creates a writable, dirty TLB entry, and that is
> > > >     the only SPTE that is dirty at the time of the next relevant clearing of
> > > >     the dirty logs, then clear_dirty_gfn_range() will not modify any SPTEs
> > > >     because it sees the D=0 SPTE, and thus will complete the clearing of the
> > > >     dirty logs without performing a TLB flush.
> > > But it looks that kvm_flush_remote_tlbs_memslot() will always be invoked no
> > > matter clear_dirty_gfn_range() finds a D bit or not.
> > 
> > Oh, right, I forgot about that.  I'll tweak the changelog to call that out before
> > posting.  Hmm, and I'll drop the Cc: stable@ too, as commit b64d740ea7dd ("kvm:
> > x86: mmu: Always flush TLBs when enabling dirty logging") was a bug fix, i.e. if
> > anything should be backported it's that commit.
> 
> Actually, a better idea.  I think it makes sense to fully commit to not flushing
> when overwriting SPTEs, and instead rely on the dirty logging logic to do a remote
> TLB flush.

Oooh, but there's a bug.  KVM can tolerate/handle stale Dirty/Writable TLB entries
when dirty logging, but KVM cannot tolerate stale Writable TLB entries when write-
protecting for shadow paging.  The TDP MMU always flushes when clearing the MMU-
writable flag (modulo a bug that would cause KVM to make the SPTE !MMU-writable
in the page fault path), but the shadow MMU does not.

So I'm pretty sure we need the below, and then it may or may not make sense to have
a common "flush needed" helper (outside of the write-protecting flows, KVM probably
should WARN if MMU-writable is cleared).

---
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index ce8323354d2d..7bd9c296f70e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -514,9 +514,12 @@ static u64 mmu_spte_update_no_track(u64 *sptep, u64 new_spte)
 /* Rules for using mmu_spte_update:
  * Update the state bits, it means the mapped pfn is not changed.
  *
- * Whenever an MMU-writable SPTE is overwritten with a read-only SPTE, remote
- * TLBs must be flushed. Otherwise rmap_write_protect will find a read-only
- * spte, even though the writable spte might be cached on a CPU's TLB.
+ * If the MMU-writable flag is cleared, i.e. the SPTE is write-protected for
+ * write-tracking, remote TLBs must be flushed, even if the SPTE was read-only,
+ * as KVM allows stale Writable TLB entries to exist.  When dirty logging, KVM
+ * flushes TLBs based on whether or not dirty bitmap/ring entries were reaped,
+ * not whether or not SPTEs were modified, i.e. only the write-protected case
+ * needs to precisely flush when modifying SPTEs.
  *
  * Returns true if the TLB needs to be flushed
  */
@@ -533,8 +536,7 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
         * we always atomically update it, see the comments in
         * spte_has_volatile_bits().
         */
-       if (is_mmu_writable_spte(old_spte) &&
-             !is_writable_pte(new_spte))
+       if (is_mmu_writable_spte(old_spte) && !is_mmu_writable_spte(new_spte))
                flush = true;
 
        /*
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 345c7115b632..aa1ca24d1168 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -133,12 +133,6 @@ static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
  */
 bool spte_has_volatile_bits(u64 spte)
 {
-       /*
-        * Always atomically update spte if it can be updated
-        * out of mmu-lock, it can ensure dirty bit is not lost,
-        * also, it can help us to get a stable is_writable_pte()
-        * to ensure tlb flush is not missed.
-        */
        if (!is_writable_pte(spte) && is_mmu_writable_spte(spte))
                return true;
Yan Zhao Sept. 29, 2024, 9:27 a.m. UTC | #7
On Fri, Sep 27, 2024 at 07:32:10AM -0700, Sean Christopherson wrote:
> On Fri, Sep 27, 2024, Sean Christopherson wrote:
> > On Thu, Sep 26, 2024, Yan Zhao wrote:
> > > On Thu, Sep 12, 2024 at 05:07:57PM -0700, Sean Christopherson wrote:
> > > > On Thu, Sep 12, 2024, Isaku Yamahata wrote:
> > > > Right now, the fixes for make_spte() are sitting toward the end of the massive
> > > > kvm_follow_pfn() rework (80+ patches and counting), but despite the size, I am
> > > > fairly confident that series can land in 6.13 (lots and lots of small patches).
> > > > 
> > > > ---
> > > > Author:     Sean Christopherson <seanjc@google.com>
> > > > AuthorDate: Thu Sep 12 16:23:21 2024 -0700
> > > > Commit:     Sean Christopherson <seanjc@google.com>
> > > > CommitDate: Thu Sep 12 16:35:06 2024 -0700
> > > > 
> > > >     KVM: x86/mmu: Flush TLBs if resolving a TDP MMU fault clears W or D bits
> > > >     
> > > >     Do a remote TLB flush if installing a leaf SPTE overwrites an existing
> > > >     leaf SPTE (with the same target pfn) and clears the Writable bit or the
> > > >     Dirty bit.  KVM isn't _supposed_ to clear Writable or Dirty bits in such
> > > >     a scenario, but make_spte() has a flaw where it will fail to set the Dirty
> > > >     if the existing SPTE is writable.
> > > >     
> > > >     E.g. if two vCPUs race to handle faults, the KVM will install a W=1,D=1
> > > >     SPTE for the first vCPU, and then overwrite it with a W=1,D=0 SPTE for the
> > > >     second vCPU.  If the first vCPU (or another vCPU) accesses memory using
> > > >     the W=1,D=1 SPTE, i.e. creates a writable, dirty TLB entry, and that is
> > > >     the only SPTE that is dirty at the time of the next relevant clearing of
> > > >     the dirty logs, then clear_dirty_gfn_range() will not modify any SPTEs
> > > >     because it sees the D=0 SPTE, and thus will complete the clearing of the
> > > >     dirty logs without performing a TLB flush.
> > > But it looks that kvm_flush_remote_tlbs_memslot() will always be invoked no
> > > matter clear_dirty_gfn_range() finds a D bit or not.
> > 
> > Oh, right, I forgot about that.  I'll tweak the changelog to call that out before
> > posting.  Hmm, and I'll drop the Cc: stable@ too, as commit b64d740ea7dd ("kvm:
> > x86: mmu: Always flush TLBs when enabling dirty logging") was a bug fix, i.e. if
> > anything should be backported it's that commit.
> 
> Actually, a better idea.  I think it makes sense to fully commit to not flushing
> when overwriting SPTEs, and instead rely on the dirty logging logic to do a remote
> TLB flush.
> 
> E.g. on top of this change in the mega-series is a cleanup to unify the TDP MMU
> and shadow MMU logic for clearing Writable and Dirty bits, with this comment
> (which is a massaged version of an existing comment for mmu_spte_update()):
> 
> /*
>  * Whenever an MMU-writable SPTE is overwritten with a read-only SPTE, remote
>  * TLBs must be flushed.  Otherwise write-protecting the gfn may find a read-
>  * only SPTE, even though the writable SPTE might be cached in a CPU's TLB.
>  *
>  * Remote TLBs also need to be flushed if the Dirty bit is cleared, as false
>  * negatives are not acceptable, e.g. if KVM is using D-bit based PML on VMX.
>  *
>  * Don't flush if the Accessed bit is cleared, as access tracking tolerates
>  * false negatives, and the one path that does care about TLB flushes,
>  * kvm_mmu_notifier_clear_flush_young(), uses mmu_spte_update_no_track().
I have a question about why access tracking tolerates false negatives on the
path kvm_mmu_notifier_clear_flush_young().

kvm_mmu_notifier_clear_flush_young() invokes kvm_flush_remote_tlbs()
only when kvm_age_gfn() returns true. But age_gfn_range()/kvm_age_rmap() will
return false if the old spte is !is_accessed_spte().

So, if the Access bit is cleared in make_spte(), is a TLB flush required to
avoid that it's not done in kvm_mmu_notifier_clear_flush_young()?

>  *
>  * Note, this logic only applies to leaf SPTEs.  The caller is responsible for
>  * determining whether or not a TLB flush is required when modifying a shadow-
>  * present non-leaf SPTE.
>  */
> 
> But that comment is was made stale by commit b64d740ea7dd.  And looking through
> the dirty logging logic, KVM (luckily? thankfully?) flushes based on whether or
> not dirty bitmap/ring entries are reaped, not based on whether or not SPTEs were
> modified.
>
Yan Zhao Sept. 29, 2024, 9:29 a.m. UTC | #8
On Fri, Sep 27, 2024 at 06:51:58AM -0700, Sean Christopherson wrote:
> On Thu, Sep 26, 2024, Yan Zhao wrote:
> > On Thu, Sep 12, 2024 at 05:07:57PM -0700, Sean Christopherson wrote:
> > > > diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> > > > index a72f0e3bde17..1726f8ec5a50 100644
> > > > --- a/arch/x86/kvm/mmu/spte.h
> > > > +++ b/arch/x86/kvm/mmu/spte.h
> > > > @@ -214,6 +214,12 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
> > > >   */
> > > >  #define FROZEN_SPTE	(SHADOW_NONPRESENT_VALUE | 0x5a0ULL)
> > > >  
> > > > +#define EXTERNAL_SPTE_IGNORE_CHANGE_MASK		\
> > > > +	(shadow_acc_track_mask |			\
> > > > +	 (SHADOW_ACC_TRACK_SAVED_BITS_MASK <<		\
> > > > +	  SHADOW_ACC_TRACK_SAVED_BITS_SHIFT) |		\
> > > 
> > > Just make TDX require A/D bits, there's no reason to care about access tracking.
> > If KVM_PRE_FAULT_MEMORY is allowed for TDX and if
> > cpu_has_vmx_ept_ad_bits() is false in TDX's hardware (not sure if it's possible),
> 
> Make it a requirement in KVM that TDX hardware supports A/D bits and that KVM's
> module param is enabled.  EPT A/D bits have been supported in all CPUs since
> Haswell, I don't expect them to ever go away.
Got it!

> 
> > access tracking bit is possbile to be changed, as in below scenario:
> > 
> > 1. KVM_PRE_FAULT_MEMORY ioctl calls kvm_arch_vcpu_pre_fault_memory() to map
> >    a GFN, and make_spte() will call mark_spte_for_access_track() to
> >    remove shadow_acc_track_mask (i.e. RWX bits) and move R+X left to
> >    SHADOW_ACC_TRACK_SAVED_BITS_SHIFT.
> > 2. If a concurrent page fault occurs on the same GFN on another vCPU, then
> >    make_spte() in that vCPU will not see prefetch and the new_spte is
> >    with RWX bits and with no bits set in SHADOW_ACC_TRACK_SAVED_MASK.
> 
> This should be fixed by the mega-series.  I'll make sure to Cc you on that series.
Thanks!

> Thanks much for the input and feedback!
:)
Sean Christopherson Sept. 30, 2024, 3:54 p.m. UTC | #9
On Sun, Sep 29, 2024, Yan Zhao wrote:
> On Fri, Sep 27, 2024 at 07:32:10AM -0700, Sean Christopherson wrote:
> > On Fri, Sep 27, 2024, Sean Christopherson wrote:
> > > On Thu, Sep 26, 2024, Yan Zhao wrote:
> > > > On Thu, Sep 12, 2024 at 05:07:57PM -0700, Sean Christopherson wrote:
> > > > > On Thu, Sep 12, 2024, Isaku Yamahata wrote:
> > > > > Right now, the fixes for make_spte() are sitting toward the end of the massive
> > > > > kvm_follow_pfn() rework (80+ patches and counting), but despite the size, I am
> > > > > fairly confident that series can land in 6.13 (lots and lots of small patches).
> > > > > 
> > > > > ---
> > > > > Author:     Sean Christopherson <seanjc@google.com>
> > > > > AuthorDate: Thu Sep 12 16:23:21 2024 -0700
> > > > > Commit:     Sean Christopherson <seanjc@google.com>
> > > > > CommitDate: Thu Sep 12 16:35:06 2024 -0700
> > > > > 
> > > > >     KVM: x86/mmu: Flush TLBs if resolving a TDP MMU fault clears W or D bits
> > > > >     
> > > > >     Do a remote TLB flush if installing a leaf SPTE overwrites an existing
> > > > >     leaf SPTE (with the same target pfn) and clears the Writable bit or the
> > > > >     Dirty bit.  KVM isn't _supposed_ to clear Writable or Dirty bits in such
> > > > >     a scenario, but make_spte() has a flaw where it will fail to set the Dirty
> > > > >     if the existing SPTE is writable.
> > > > >     
> > > > >     E.g. if two vCPUs race to handle faults, the KVM will install a W=1,D=1
> > > > >     SPTE for the first vCPU, and then overwrite it with a W=1,D=0 SPTE for the
> > > > >     second vCPU.  If the first vCPU (or another vCPU) accesses memory using
> > > > >     the W=1,D=1 SPTE, i.e. creates a writable, dirty TLB entry, and that is
> > > > >     the only SPTE that is dirty at the time of the next relevant clearing of
> > > > >     the dirty logs, then clear_dirty_gfn_range() will not modify any SPTEs
> > > > >     because it sees the D=0 SPTE, and thus will complete the clearing of the
> > > > >     dirty logs without performing a TLB flush.
> > > > But it looks that kvm_flush_remote_tlbs_memslot() will always be invoked no
> > > > matter clear_dirty_gfn_range() finds a D bit or not.
> > > 
> > > Oh, right, I forgot about that.  I'll tweak the changelog to call that out before
> > > posting.  Hmm, and I'll drop the Cc: stable@ too, as commit b64d740ea7dd ("kvm:
> > > x86: mmu: Always flush TLBs when enabling dirty logging") was a bug fix, i.e. if
> > > anything should be backported it's that commit.
> > 
> > Actually, a better idea.  I think it makes sense to fully commit to not flushing
> > when overwriting SPTEs, and instead rely on the dirty logging logic to do a remote
> > TLB flush.
> > 
> > E.g. on top of this change in the mega-series is a cleanup to unify the TDP MMU
> > and shadow MMU logic for clearing Writable and Dirty bits, with this comment
> > (which is a massaged version of an existing comment for mmu_spte_update()):
> > 
> > /*
> >  * Whenever an MMU-writable SPTE is overwritten with a read-only SPTE, remote
> >  * TLBs must be flushed.  Otherwise write-protecting the gfn may find a read-
> >  * only SPTE, even though the writable SPTE might be cached in a CPU's TLB.
> >  *
> >  * Remote TLBs also need to be flushed if the Dirty bit is cleared, as false
> >  * negatives are not acceptable, e.g. if KVM is using D-bit based PML on VMX.
> >  *
> >  * Don't flush if the Accessed bit is cleared, as access tracking tolerates
> >  * false negatives, and the one path that does care about TLB flushes,
> >  * kvm_mmu_notifier_clear_flush_young(), uses mmu_spte_update_no_track().
> I have a question about why access tracking tolerates false negatives on the
> path kvm_mmu_notifier_clear_flush_young().
> 
> kvm_mmu_notifier_clear_flush_young() invokes kvm_flush_remote_tlbs()
> only when kvm_age_gfn() returns true. But age_gfn_range()/kvm_age_rmap() will
> return false if the old spte is !is_accessed_spte().
> 
> So, if the Access bit is cleared in make_spte(), is a TLB flush required to
> avoid that it's not done in kvm_mmu_notifier_clear_flush_young()?

Because access tracking in general is tolerant of stale results due to lack of
TLB flushes.  E.g. on many architectures, the primary MMU has omitted TLB flushes
for years (10+ years on x86, commit b13b1d2d8692).  The basic argument is that if
there is enough memory pressure to trigger reclaim, then there will be enough TLB
pressure to ensure that omitting the TLB flush doesn't result in a large number
of "bad" reclaims[1].  And conversely, if there isn't much TLB pressure, then the
kernel shouldn't be reclaiming.

For KVM, I want to completely eliminate the TLB flush[2] for all architectures
where it's architecturally legal.  Because for KVM, the flushes are often even
more expensive than they are for the primary MMU, e.g. due to lack of batching,
the cost of VM-Exit => VM-Enter (for architectures without broadcast flushes).

[1] https://lore.kernel.org/all/CAOUHufYCmYNngmS=rOSAQRB0N9ai+mA0aDrB9RopBvPHEK42Ng@mail.gmail.com
[2] https://lore.kernel.org/all/Zmnbb-Xlyz4VXNHI@google.com
Yan Zhao Oct. 8, 2024, 12:44 a.m. UTC | #10
On Mon, Sep 30, 2024 at 08:54:38AM -0700, Sean Christopherson wrote:
> On Sun, Sep 29, 2024, Yan Zhao wrote:
> > On Fri, Sep 27, 2024 at 07:32:10AM -0700, Sean Christopherson wrote:
> > > On Fri, Sep 27, 2024, Sean Christopherson wrote:
> > > > On Thu, Sep 26, 2024, Yan Zhao wrote:
> > > > > On Thu, Sep 12, 2024 at 05:07:57PM -0700, Sean Christopherson wrote:
> > > > > > On Thu, Sep 12, 2024, Isaku Yamahata wrote:
> > > > > > Right now, the fixes for make_spte() are sitting toward the end of the massive
> > > > > > kvm_follow_pfn() rework (80+ patches and counting), but despite the size, I am
> > > > > > fairly confident that series can land in 6.13 (lots and lots of small patches).
> > > > > > 
> > > > > > ---
> > > > > > Author:     Sean Christopherson <seanjc@google.com>
> > > > > > AuthorDate: Thu Sep 12 16:23:21 2024 -0700
> > > > > > Commit:     Sean Christopherson <seanjc@google.com>
> > > > > > CommitDate: Thu Sep 12 16:35:06 2024 -0700
> > > > > > 
> > > > > >     KVM: x86/mmu: Flush TLBs if resolving a TDP MMU fault clears W or D bits
> > > > > >     
> > > > > >     Do a remote TLB flush if installing a leaf SPTE overwrites an existing
> > > > > >     leaf SPTE (with the same target pfn) and clears the Writable bit or the
> > > > > >     Dirty bit.  KVM isn't _supposed_ to clear Writable or Dirty bits in such
> > > > > >     a scenario, but make_spte() has a flaw where it will fail to set the Dirty
> > > > > >     if the existing SPTE is writable.
> > > > > >     
> > > > > >     E.g. if two vCPUs race to handle faults, the KVM will install a W=1,D=1
> > > > > >     SPTE for the first vCPU, and then overwrite it with a W=1,D=0 SPTE for the
> > > > > >     second vCPU.  If the first vCPU (or another vCPU) accesses memory using
> > > > > >     the W=1,D=1 SPTE, i.e. creates a writable, dirty TLB entry, and that is
> > > > > >     the only SPTE that is dirty at the time of the next relevant clearing of
> > > > > >     the dirty logs, then clear_dirty_gfn_range() will not modify any SPTEs
> > > > > >     because it sees the D=0 SPTE, and thus will complete the clearing of the
> > > > > >     dirty logs without performing a TLB flush.
> > > > > But it looks that kvm_flush_remote_tlbs_memslot() will always be invoked no
> > > > > matter clear_dirty_gfn_range() finds a D bit or not.
> > > > 
> > > > Oh, right, I forgot about that.  I'll tweak the changelog to call that out before
> > > > posting.  Hmm, and I'll drop the Cc: stable@ too, as commit b64d740ea7dd ("kvm:
> > > > x86: mmu: Always flush TLBs when enabling dirty logging") was a bug fix, i.e. if
> > > > anything should be backported it's that commit.
> > > 
> > > Actually, a better idea.  I think it makes sense to fully commit to not flushing
> > > when overwriting SPTEs, and instead rely on the dirty logging logic to do a remote
> > > TLB flush.
> > > 
> > > E.g. on top of this change in the mega-series is a cleanup to unify the TDP MMU
> > > and shadow MMU logic for clearing Writable and Dirty bits, with this comment
> > > (which is a massaged version of an existing comment for mmu_spte_update()):
> > > 
> > > /*
> > >  * Whenever an MMU-writable SPTE is overwritten with a read-only SPTE, remote
> > >  * TLBs must be flushed.  Otherwise write-protecting the gfn may find a read-
> > >  * only SPTE, even though the writable SPTE might be cached in a CPU's TLB.
> > >  *
> > >  * Remote TLBs also need to be flushed if the Dirty bit is cleared, as false
> > >  * negatives are not acceptable, e.g. if KVM is using D-bit based PML on VMX.
> > >  *
> > >  * Don't flush if the Accessed bit is cleared, as access tracking tolerates
> > >  * false negatives, and the one path that does care about TLB flushes,
> > >  * kvm_mmu_notifier_clear_flush_young(), uses mmu_spte_update_no_track().
> > I have a question about why access tracking tolerates false negatives on the
> > path kvm_mmu_notifier_clear_flush_young().
> > 
> > kvm_mmu_notifier_clear_flush_young() invokes kvm_flush_remote_tlbs()
> > only when kvm_age_gfn() returns true. But age_gfn_range()/kvm_age_rmap() will
> > return false if the old spte is !is_accessed_spte().
> > 
> > So, if the Access bit is cleared in make_spte(), is a TLB flush required to
> > avoid that it's not done in kvm_mmu_notifier_clear_flush_young()?
> 
> Because access tracking in general is tolerant of stale results due to lack of
> TLB flushes.  E.g. on many architectures, the primary MMU has omitted TLB flushes
> for years (10+ years on x86, commit b13b1d2d8692).  The basic argument is that if
> there is enough memory pressure to trigger reclaim, then there will be enough TLB
> pressure to ensure that omitting the TLB flush doesn't result in a large number
> of "bad" reclaims[1].  And conversely, if there isn't much TLB pressure, then the
> kernel shouldn't be reclaiming.
> 
> For KVM, I want to completely eliminate the TLB flush[2] for all architectures
> where it's architecturally legal.  Because for KVM, the flushes are often even
> more expensive than they are for the primary MMU, e.g. due to lack of batching,
> the cost of VM-Exit => VM-Enter (for architectures without broadcast flushes).
> 
> [1] https://lore.kernel.org/all/CAOUHufYCmYNngmS=rOSAQRB0N9ai+mA0aDrB9RopBvPHEK42Ng@mail.gmail.com
> [2] https://lore.kernel.org/all/Zmnbb-Xlyz4VXNHI@google.com

It makes sense. Thanks for explanation and the provided links!

Thinking more about the prefetched SPTEs, though the A bit tolerates fault
negative, do you think we still can have a small optimization to grant A bit to
prefetched SPTEs if the old_spte has already set it? So that if a prefault
happens right after a real fault, the A bit would not be cleared, basing on that
KVM not changing PFNs without first zapping the old SPTE.
(but I'm not sure if you have already covered this optmication in the
mega-series).

--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -163,6 +163,8 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
        int level = sp->role.level;
        u64 spte = SPTE_MMU_PRESENT_MASK;
        bool wrprot = false;
+       bool remove_accessed = prefetch && (!is_shadow_present_pte(old_spte) ||
+                              !s_last_spte(old_spte, level) || !is_accessed_spte(old_spte))
 
        /*
         * For the EPT case, shadow_present_mask has no RWX bits set if
@@ -178,7 +180,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
                spte |= SPTE_TDP_AD_WRPROT_ONLY;
 
        spte |= shadow_present_mask;
-       if (!prefetch)
+       if (!remove_accessed)
                spte |= spte_shadow_accessed_mask(spte);
 
        /*
@@ -259,7 +261,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
                spte |= spte_shadow_dirty_mask(spte);
 
 out:
-       if (prefetch)
+       if (remove_accessed)
                spte = mark_spte_for_access_track(spte);
 
        WARN_ONCE(is_rsvd_spte(&vcpu->arch.mmu->shadow_zero_check, spte, level),
Yan Zhao Oct. 8, 2024, 2:57 a.m. UTC | #11
On Fri, Sep 27, 2024 at 08:06:22AM -0700, Sean Christopherson wrote:
> On Fri, Sep 27, 2024, Sean Christopherson wrote:
> > On Fri, Sep 27, 2024, Sean Christopherson wrote:
> > > On Thu, Sep 26, 2024, Yan Zhao wrote:
> > > > On Thu, Sep 12, 2024 at 05:07:57PM -0700, Sean Christopherson wrote:
> > > > > On Thu, Sep 12, 2024, Isaku Yamahata wrote:
> > > > > Right now, the fixes for make_spte() are sitting toward the end of the massive
> > > > > kvm_follow_pfn() rework (80+ patches and counting), but despite the size, I am
> > > > > fairly confident that series can land in 6.13 (lots and lots of small patches).
> > > > > 
> > > > > ---
> > > > > Author:     Sean Christopherson <seanjc@google.com>
> > > > > AuthorDate: Thu Sep 12 16:23:21 2024 -0700
> > > > > Commit:     Sean Christopherson <seanjc@google.com>
> > > > > CommitDate: Thu Sep 12 16:35:06 2024 -0700
> > > > > 
> > > > >     KVM: x86/mmu: Flush TLBs if resolving a TDP MMU fault clears W or D bits
> > > > >     
> > > > >     Do a remote TLB flush if installing a leaf SPTE overwrites an existing
> > > > >     leaf SPTE (with the same target pfn) and clears the Writable bit or the
> > > > >     Dirty bit.  KVM isn't _supposed_ to clear Writable or Dirty bits in such
> > > > >     a scenario, but make_spte() has a flaw where it will fail to set the Dirty
> > > > >     if the existing SPTE is writable.
> > > > >     
> > > > >     E.g. if two vCPUs race to handle faults, the KVM will install a W=1,D=1
> > > > >     SPTE for the first vCPU, and then overwrite it with a W=1,D=0 SPTE for the
> > > > >     second vCPU.  If the first vCPU (or another vCPU) accesses memory using
> > > > >     the W=1,D=1 SPTE, i.e. creates a writable, dirty TLB entry, and that is
> > > > >     the only SPTE that is dirty at the time of the next relevant clearing of
> > > > >     the dirty logs, then clear_dirty_gfn_range() will not modify any SPTEs
> > > > >     because it sees the D=0 SPTE, and thus will complete the clearing of the
> > > > >     dirty logs without performing a TLB flush.
> > > > But it looks that kvm_flush_remote_tlbs_memslot() will always be invoked no
> > > > matter clear_dirty_gfn_range() finds a D bit or not.
> > > 
> > > Oh, right, I forgot about that.  I'll tweak the changelog to call that out before
> > > posting.  Hmm, and I'll drop the Cc: stable@ too, as commit b64d740ea7dd ("kvm:
> > > x86: mmu: Always flush TLBs when enabling dirty logging") was a bug fix, i.e. if
> > > anything should be backported it's that commit.
> > 
> > Actually, a better idea.  I think it makes sense to fully commit to not flushing
> > when overwriting SPTEs, and instead rely on the dirty logging logic to do a remote
> > TLB flush.
> 
> Oooh, but there's a bug.  KVM can tolerate/handle stale Dirty/Writable TLB entries
> when dirty logging, but KVM cannot tolerate stale Writable TLB entries when write-
> protecting for shadow paging.  The TDP MMU always flushes when clearing the MMU-
> writable flag (modulo a bug that would cause KVM to make the SPTE !MMU-writable
> in the page fault path), but the shadow MMU does not.
> 
> So I'm pretty sure we need the below, and then it may or may not make sense to have
> a common "flush needed" helper (outside of the write-protecting flows, KVM probably
> should WARN if MMU-writable is cleared).
> 
> ---
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index ce8323354d2d..7bd9c296f70e 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -514,9 +514,12 @@ static u64 mmu_spte_update_no_track(u64 *sptep, u64 new_spte)
>  /* Rules for using mmu_spte_update:
>   * Update the state bits, it means the mapped pfn is not changed.
>   *
> - * Whenever an MMU-writable SPTE is overwritten with a read-only SPTE, remote
> - * TLBs must be flushed. Otherwise rmap_write_protect will find a read-only
> - * spte, even though the writable spte might be cached on a CPU's TLB.
> + * If the MMU-writable flag is cleared, i.e. the SPTE is write-protected for
> + * write-tracking, remote TLBs must be flushed, even if the SPTE was read-only,
> + * as KVM allows stale Writable TLB entries to exist.  When dirty logging, KVM
> + * flushes TLBs based on whether or not dirty bitmap/ring entries were reaped,
> + * not whether or not SPTEs were modified, i.e. only the write-protected case
> + * needs to precisely flush when modifying SPTEs.
>   *
>   * Returns true if the TLB needs to be flushed
>   */
> @@ -533,8 +536,7 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
Given all callers of mmu_spte_update() except mmu_set_spte() have handled TLB
flushes by themselves, could we just remove mmu_spte_update() and have
mmu_set_spte() directly checks
"if (is_mmu_writable_spte(old_spte) && !is_mmu_writable_pte(new_spte))" instead?

Then maybe the old checking of
"if (is_mmu_writable_spte(old_spte) && !is_writable_pte(new_spte))" is also
good in mmu_set_spte() due to shadow_mmu_writable_mask and PT_WRITABLE_MASK
appearing in pairs.

>          * we always atomically update it, see the comments in
>          * spte_has_volatile_bits().
>          */
> -       if (is_mmu_writable_spte(old_spte) &&
> -             !is_writable_pte(new_spte))
> +       if (is_mmu_writable_spte(old_spte) && !is_mmu_writable_spte(new_spte))
>                 flush = true;
>  
>         /*
> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index 345c7115b632..aa1ca24d1168 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -133,12 +133,6 @@ static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
>   */
>  bool spte_has_volatile_bits(u64 spte)
>  {
> -       /*
> -        * Always atomically update spte if it can be updated
> -        * out of mmu-lock, it can ensure dirty bit is not lost,
> -        * also, it can help us to get a stable is_writable_pte()
> -        * to ensure tlb flush is not missed.
> -        */
>         if (!is_writable_pte(spte) && is_mmu_writable_spte(spte))
>                 return true;
Sean Christopherson Oct. 8, 2024, 4:07 p.m. UTC | #12
On Tue, Oct 08, 2024, Yan Zhao wrote:
> On Mon, Sep 30, 2024 at 08:54:38AM -0700, Sean Christopherson wrote:
> > On Sun, Sep 29, 2024, Yan Zhao wrote:
> > > On Fri, Sep 27, 2024 at 07:32:10AM -0700, Sean Christopherson wrote:
> > > >  * Don't flush if the Accessed bit is cleared, as access tracking tolerates
> > > >  * false negatives, and the one path that does care about TLB flushes,
> > > >  * kvm_mmu_notifier_clear_flush_young(), uses mmu_spte_update_no_track().
> > > I have a question about why access tracking tolerates false negatives on the
> > > path kvm_mmu_notifier_clear_flush_young().
> > > 
> > > kvm_mmu_notifier_clear_flush_young() invokes kvm_flush_remote_tlbs()
> > > only when kvm_age_gfn() returns true. But age_gfn_range()/kvm_age_rmap() will
> > > return false if the old spte is !is_accessed_spte().
> > > 
> > > So, if the Access bit is cleared in make_spte(), is a TLB flush required to
> > > avoid that it's not done in kvm_mmu_notifier_clear_flush_young()?
> > 
> > Because access tracking in general is tolerant of stale results due to lack of
> > TLB flushes.  E.g. on many architectures, the primary MMU has omitted TLB flushes
> > for years (10+ years on x86, commit b13b1d2d8692).  The basic argument is that if
> > there is enough memory pressure to trigger reclaim, then there will be enough TLB
> > pressure to ensure that omitting the TLB flush doesn't result in a large number
> > of "bad" reclaims[1].  And conversely, if there isn't much TLB pressure, then the
> > kernel shouldn't be reclaiming.
> > 
> > For KVM, I want to completely eliminate the TLB flush[2] for all architectures
> > where it's architecturally legal.  Because for KVM, the flushes are often even
> > more expensive than they are for the primary MMU, e.g. due to lack of batching,
> > the cost of VM-Exit => VM-Enter (for architectures without broadcast flushes).
> > 
> > [1] https://lore.kernel.org/all/CAOUHufYCmYNngmS=rOSAQRB0N9ai+mA0aDrB9RopBvPHEK42Ng@mail.gmail.com
> > [2] https://lore.kernel.org/all/Zmnbb-Xlyz4VXNHI@google.com
> 
> It makes sense. Thanks for explanation and the provided links!
> 
> Thinking more about the prefetched SPTEs, though the A bit tolerates fault
> negative, do you think we still can have a small optimization to grant A bit to
> prefetched SPTEs if the old_spte has already set it? So that if a prefault
> happens right after a real fault, the A bit would not be cleared, basing on that
> KVM not changing PFNs without first zapping the old SPTE.
> (but I'm not sure if you have already covered this optmication in the
> mega-series).

Already handled :-)

Though I need to test the shadow MMU code, as I was initially thinking the issue
was unique to the TDP MMU (no idea why I thought that).

Hmm, though I think something like what you've proposed may be in order.  There
are currently four cases where @prefetch will be true:

 1. Prefault
 2. Async #PF
 3. Prefetching
 4. FNAME(sync_spte)

For 1-3, KVM shouldn't overwrite an existing shadow-present SPTE, which is what
my code does.

But for 4, FNAME(sync_spte) _needs_ to overwrite an existing SPTE.  And, ignoring
the awful A/D-disabled case, FNAME(prefetch_invalid_gpte) ensures the gPTE is
Accessed, i.e. there's no strong argument for clearing the Accessed bit.

Hrm, but the _only_ "speculative" access type when commit 947da5383069 ("KVM:
MMU: Set the accessed bit on non-speculative shadow ptes") went in was the
FNAME(sync_spte) case (at the time called FNAME(update_pte)).  I.e. KVM deliberately
clears the Accessed bit for that case.

But, I'm unconvinced that's actually appropriate.  As above, the gPTE is accessed,
and kvm_sync_spte() ensures the SPTE is shadow-present (with an asterisk, because
it deliberately doesn't use is_shadow_present() SPTE so that KVM can sync MMIO
SPTEs).

Aha!  Before commit c76e0ad27084 ("KVM: x86/mmu: Mark page/folio accessed only
when zapping leaf SPTEs"), clearing the Accessed bit kinda sorta makes sense,
because KVM propagates the information to the underlying struct page.  But when
that code is removed, KVM's SPTEs are the "source of truth" so to speak.

Double aha!  Spurious clearing of the Accessed (and Dirty) was mitigated by commit
e6722d9211b2 ("KVM: x86/mmu: Reduce the update to the spte in FNAME(sync_spte)"),
which changed FNAME(sync_spte) to only overwrite SPTEs if the protections are
actually changing.

So at the very least, somewhere in all of this, we should do as you suggest and
explicitly preserve the Accessed bit.  Though I'm quite tempted to be more aggressive
and always mark the SPTE as accessed when synchronizing, because odds are very
good that the guest still cares about the page that's pointed at by the unsync
SPTE.  E.g. the most common case where FNAME(sync_spte) actually overwrites an
existing SPTE is CoW in the guest.

I'll think a bit more on this, and either go with a variant of the below, or
something like:

	if (!prefetch || synchronizing)
		spte |= spte_shadow_accessed_mask(spte);

> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -163,6 +163,8 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>         int level = sp->role.level;
>         u64 spte = SPTE_MMU_PRESENT_MASK;
>         bool wrprot = false;
> +       bool remove_accessed = prefetch && (!is_shadow_present_pte(old_spte) ||
> +                              !s_last_spte(old_spte, level) || !is_accessed_spte(old_spte))
>         /*
>          * For the EPT case, shadow_present_mask has no RWX bits set if
> @@ -178,7 +180,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>                 spte |= SPTE_TDP_AD_WRPROT_ONLY;
>  
>         spte |= shadow_present_mask;
> -       if (!prefetch)
> +       if (!remove_accessed)
>                 spte |= spte_shadow_accessed_mask(spte);
>  
>         /*
> @@ -259,7 +261,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>                 spte |= spte_shadow_dirty_mask(spte);
>  
>  out:
> -       if (prefetch)
> +       if (remove_accessed)
>                 spte = mark_spte_for_access_track(spte);
>  
>         WARN_ONCE(is_rsvd_spte(&vcpu->arch.mmu->shadow_zero_check, spte, level),
Sean Christopherson Oct. 8, 2024, 7:15 p.m. UTC | #13
On Fri, Sep 27, 2024, Sean Christopherson wrote:
> On Fri, Sep 27, 2024, Sean Christopherson wrote:
> > On Fri, Sep 27, 2024, Sean Christopherson wrote:

...

> > > Oh, right, I forgot about that.  I'll tweak the changelog to call that out before
> > > posting.  Hmm, and I'll drop the Cc: stable@ too, as commit b64d740ea7dd ("kvm:
> > > x86: mmu: Always flush TLBs when enabling dirty logging") was a bug fix, i.e. if
> > > anything should be backported it's that commit.
> > 
> > Actually, a better idea.  I think it makes sense to fully commit to not flushing
> > when overwriting SPTEs, and instead rely on the dirty logging logic to do a remote
> > TLB flush.
> 
> Oooh, but there's a bug.

Nope, there's not.

> KVM can tolerate/handle stale Dirty/Writable TLB entries when dirty logging,
> but KVM cannot tolerate stale Writable TLB entries when write- protecting for
> shadow paging.  The TDP MMU always flushes when clearing the MMU- writable
> flag (modulo a bug that would cause KVM to make the SPTE !MMU-writable in the
> page fault path), but the shadow MMU does not.
> 
> So I'm pretty sure we need the below, and then it may or may not make sense to have
> a common "flush needed" helper (outside of the write-protecting flows, KVM probably
> should WARN if MMU-writable is cleared).
> 
> ---
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index ce8323354d2d..7bd9c296f70e 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -514,9 +514,12 @@ static u64 mmu_spte_update_no_track(u64 *sptep, u64 new_spte)
>  /* Rules for using mmu_spte_update:
>   * Update the state bits, it means the mapped pfn is not changed.
>   *
> - * Whenever an MMU-writable SPTE is overwritten with a read-only SPTE, remote
> - * TLBs must be flushed. Otherwise rmap_write_protect will find a read-only
> - * spte, even though the writable spte might be cached on a CPU's TLB.
> + * If the MMU-writable flag is cleared, i.e. the SPTE is write-protected for
> + * write-tracking, remote TLBs must be flushed, even if the SPTE was read-only,
> + * as KVM allows stale Writable TLB entries to exist.  When dirty logging, KVM
> + * flushes TLBs based on whether or not dirty bitmap/ring entries were reaped,
> + * not whether or not SPTEs were modified, i.e. only the write-protected case
> + * needs to precisely flush when modifying SPTEs.
>   *
>   * Returns true if the TLB needs to be flushed
>   */
> @@ -533,8 +536,7 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
>          * we always atomically update it, see the comments in
>          * spte_has_volatile_bits().
>          */
> -       if (is_mmu_writable_spte(old_spte) &&
> -             !is_writable_pte(new_spte))
> +       if (is_mmu_writable_spte(old_spte) && !is_mmu_writable_spte(new_spte))

It took me forever and a day to realize this, but !is_writable_pte(new_spte) is
correct, because the logic is checking if the new SPTE is !Writable, it's *not*
checking to see if the Writable bit is _cleared_.  I.e. KVM will flush if the
old SPTE is read-only but MMU-writable.

That said, I'm still going to include this change, albet with a drastically
different changelog.  Checking is_mmu_writable_spte() instead of is_writable_pte()
is still desirable, as it avoids unnecessary TLB flushes in the rare case where
KVM "refreshes" a !Writable SPTE.  Of course, with the other change to not clobber
SPTEs when prefetching, that scenario becomes even more rare, but it's still worth
doing, especially since IMO it makes it more obvious when KVM _does_ need to do a
remote TLB flush (before dropping mmu_lock).
Yan Zhao Oct. 9, 2024, 9:21 a.m. UTC | #14
On Tue, Oct 08, 2024 at 12:15:14PM -0700, Sean Christopherson wrote:
> On Fri, Sep 27, 2024, Sean Christopherson wrote:
> > On Fri, Sep 27, 2024, Sean Christopherson wrote:
> > > On Fri, Sep 27, 2024, Sean Christopherson wrote:
> 
> ...
> 
> > > > Oh, right, I forgot about that.  I'll tweak the changelog to call that out before
> > > > posting.  Hmm, and I'll drop the Cc: stable@ too, as commit b64d740ea7dd ("kvm:
> > > > x86: mmu: Always flush TLBs when enabling dirty logging") was a bug fix, i.e. if
> > > > anything should be backported it's that commit.
> > > 
> > > Actually, a better idea.  I think it makes sense to fully commit to not flushing
> > > when overwriting SPTEs, and instead rely on the dirty logging logic to do a remote
> > > TLB flush.
> > 
> > Oooh, but there's a bug.
> 
> Nope, there's not.
> 
> > KVM can tolerate/handle stale Dirty/Writable TLB entries when dirty logging,
> > but KVM cannot tolerate stale Writable TLB entries when write- protecting for
> > shadow paging.  The TDP MMU always flushes when clearing the MMU- writable
> > flag (modulo a bug that would cause KVM to make the SPTE !MMU-writable in the
> > page fault path), but the shadow MMU does not.
> > 
> > So I'm pretty sure we need the below, and then it may or may not make sense to have
> > a common "flush needed" helper (outside of the write-protecting flows, KVM probably
> > should WARN if MMU-writable is cleared).
> > 
> > ---
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index ce8323354d2d..7bd9c296f70e 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -514,9 +514,12 @@ static u64 mmu_spte_update_no_track(u64 *sptep, u64 new_spte)
> >  /* Rules for using mmu_spte_update:
> >   * Update the state bits, it means the mapped pfn is not changed.
> >   *
> > - * Whenever an MMU-writable SPTE is overwritten with a read-only SPTE, remote
> > - * TLBs must be flushed. Otherwise rmap_write_protect will find a read-only
> > - * spte, even though the writable spte might be cached on a CPU's TLB.
> > + * If the MMU-writable flag is cleared, i.e. the SPTE is write-protected for
> > + * write-tracking, remote TLBs must be flushed, even if the SPTE was read-only,
> > + * as KVM allows stale Writable TLB entries to exist.  When dirty logging, KVM
> > + * flushes TLBs based on whether or not dirty bitmap/ring entries were reaped,
> > + * not whether or not SPTEs were modified, i.e. only the write-protected case
> > + * needs to precisely flush when modifying SPTEs.
> >   *
> >   * Returns true if the TLB needs to be flushed
> >   */
> > @@ -533,8 +536,7 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
> >          * we always atomically update it, see the comments in
> >          * spte_has_volatile_bits().
> >          */
> > -       if (is_mmu_writable_spte(old_spte) &&
> > -             !is_writable_pte(new_spte))
> > +       if (is_mmu_writable_spte(old_spte) && !is_mmu_writable_spte(new_spte))
> 
> It took me forever and a day to realize this, but !is_writable_pte(new_spte) is
> correct, because the logic is checking if the new SPTE is !Writable, it's *not*
> checking to see if the Writable bit is _cleared_.  I.e. KVM will flush if the
> old SPTE is read-only but MMU-writable.
For read-only, host-writable is false, so MMU-writable can't be true?

Compared to "!is_writable_pte(new_spte)", "!is_mmu_writable_spte(new_spte)" just
skips the case "MMU-writalbe=1 + !Writable", which is for dirty logging.
> 
> That said, I'm still going to include this change, albet with a drastically
> different changelog.  Checking is_mmu_writable_spte() instead of is_writable_pte()
> is still desirable, as it avoids unnecessary TLB flushes in the rare case where
> KVM "refreshes" a !Writable SPTE.  Of course, with the other change to not clobber
> SPTEs when prefetching, that scenario becomes even more rare, but it's still worth
> doing, especially since IMO it makes it more obvious when KVM _does_ need to do a
> remote TLB flush (before dropping mmu_lock).
>
Sean Christopherson Oct. 9, 2024, 1:14 p.m. UTC | #15
On Wed, Oct 09, 2024, Yan Zhao wrote:
> On Tue, Oct 08, 2024 at 12:15:14PM -0700, Sean Christopherson wrote:
> > On Fri, Sep 27, 2024, Sean Christopherson wrote:
> > > ---
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index ce8323354d2d..7bd9c296f70e 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -514,9 +514,12 @@ static u64 mmu_spte_update_no_track(u64 *sptep, u64 new_spte)
> > >  /* Rules for using mmu_spte_update:
> > >   * Update the state bits, it means the mapped pfn is not changed.
> > >   *
> > > - * Whenever an MMU-writable SPTE is overwritten with a read-only SPTE, remote
> > > - * TLBs must be flushed. Otherwise rmap_write_protect will find a read-only
> > > - * spte, even though the writable spte might be cached on a CPU's TLB.
> > > + * If the MMU-writable flag is cleared, i.e. the SPTE is write-protected for
> > > + * write-tracking, remote TLBs must be flushed, even if the SPTE was read-only,
> > > + * as KVM allows stale Writable TLB entries to exist.  When dirty logging, KVM
> > > + * flushes TLBs based on whether or not dirty bitmap/ring entries were reaped,
> > > + * not whether or not SPTEs were modified, i.e. only the write-protected case
> > > + * needs to precisely flush when modifying SPTEs.
> > >   *
> > >   * Returns true if the TLB needs to be flushed
> > >   */
> > > @@ -533,8 +536,7 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
> > >          * we always atomically update it, see the comments in
> > >          * spte_has_volatile_bits().
> > >          */
> > > -       if (is_mmu_writable_spte(old_spte) &&
> > > -             !is_writable_pte(new_spte))
> > > +       if (is_mmu_writable_spte(old_spte) && !is_mmu_writable_spte(new_spte))
> > 
> > It took me forever and a day to realize this, but !is_writable_pte(new_spte) is
> > correct, because the logic is checking if the new SPTE is !Writable, it's *not*
> > checking to see if the Writable bit is _cleared_.  I.e. KVM will flush if the
> > old SPTE is read-only but MMU-writable.
> For read-only, host-writable is false, so MMU-writable can't be true?

Read-only here refers to the SPTE itself, i.e. the !is_writable_pte() case.
diff mbox series

Patch

diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index a72f0e3bde17..1726f8ec5a50 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -214,6 +214,12 @@  extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
  */
 #define FROZEN_SPTE	(SHADOW_NONPRESENT_VALUE | 0x5a0ULL)
 
+#define EXTERNAL_SPTE_IGNORE_CHANGE_MASK		\
+	(shadow_acc_track_mask |			\
+	 (SHADOW_ACC_TRACK_SAVED_BITS_MASK <<		\
+	  SHADOW_ACC_TRACK_SAVED_BITS_SHIFT) |		\
+	 shadow_dirty_mask | shadow_accessed_mask)
+
 /* Removed SPTEs must not be misconstrued as shadow present PTEs. */
 static_assert(!(FROZEN_SPTE & SPTE_MMU_PRESENT_MASK));
 
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 019b43723d90..cfb82ede8982 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -503,8 +503,6 @@  static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sp
 	kvm_pfn_t new_pfn = spte_to_pfn(new_spte);
 	int ret = 0;
 
-	KVM_BUG_ON(was_present, kvm);
-
 	lockdep_assert_held(&kvm->mmu_lock);
 	/*
 	 * We need to lock out other updates to the SPTE until the external
@@ -519,10 +517,34 @@  static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sp
 	 * external page table, or leaf.
 	 */
 	if (is_leaf) {
-		ret = static_call(kvm_x86_set_external_spte)(kvm, gfn, level, new_pfn);
+		/*
+		 * SPTE is state machine with software available bits used.
+		 * Check if the change is interesting to the backend.
+		 */
+		if (!was_present)
+			ret = static_call(kvm_x86_set_external_spte)(kvm, gfn, level, new_pfn);
+		else {
+			/*
+			 * The external PTEs don't need updates for some bits,
+			 * but if others are changed, bug the VM.
+			 */
+			if (KVM_BUG_ON(~EXTERNAL_SPTE_IGNORE_CHANGE_MASK &
+				       (old_spte ^ new_spte), kvm)) {
+				ret = -EIO;
+			}
+		}
+
+		/*
+		 * The backend shouldn't return an error except EAGAIN.
+		 * It's hard to debug without those info.
+		 */
+		if (ret && ret != EAGAIN)
+			pr_debug("gfn 0x%llx old_spte 0x%llx new_spte 0x%llx level %d\n",
+				 gfn, old_spte, new_spte, level);
 	} else {
 		void *external_spt = get_external_spt(gfn, new_spte, level);
 
+		KVM_BUG_ON(was_present, kvm);
 		KVM_BUG_ON(!external_spt, kvm);
 		ret = static_call(kvm_x86_link_external_spt)(kvm, gfn, level, external_spt);
 	}