
[v3,04/28] KVM: x86/mmu: Formalize TDP MMU's (unintended?) deferred TLB flush logic

Message ID 20220226001546.360188-5-seanjc@google.com (mailing list archive)
State New, archived
Series KVM: x86/mmu: Overhaul TDP MMU zapping and flushing

Commit Message

Sean Christopherson Feb. 26, 2022, 12:15 a.m. UTC
Explicitly ignore the result of zap_gfn_range() when putting the last
reference to a TDP MMU root, and add a pile of comments to formalize the
TDP MMU's behavior of deferring TLB flushes to alloc/reuse.  Note, this
only affects the !shared case, as zap_gfn_range() subtly never returns
true for "flush" as the flush is handled by tdp_mmu_zap_spte_atomic().

Putting the root without a flush is ok because even if there are stale
references to the root in the TLB, they are unreachable because KVM will
not run the guest with the same ASID without first flushing (where ASID
in this context refers to both SVM's explicit ASID and Intel's implicit
ASID that is constructed from VPID+PCID+EPT4A+etc...).

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c     |  8 ++++++++
 arch/x86/kvm/mmu/tdp_mmu.c | 10 +++++++++-
 2 files changed, 17 insertions(+), 1 deletion(-)

Comments

Mingwei Zhang March 2, 2022, 11:59 p.m. UTC | #1
On Sat, Feb 26, 2022, Sean Christopherson wrote:
> Explicitly ignore the result of zap_gfn_range() when putting the last
> reference to a TDP MMU root, and add a pile of comments to formalize the
> TDP MMU's behavior of deferring TLB flushes to alloc/reuse.  Note, this
> only affects the !shared case, as zap_gfn_range() subtly never returns
> true for "flush" as the flush is handled by tdp_mmu_zap_spte_atomic().
> 
> Putting the root without a flush is ok because even if there are stale
> references to the root in the TLB, they are unreachable because KVM will
> not run the guest with the same ASID without first flushing (where ASID
> in this context refers to both SVM's explicit ASID and Intel's implicit
> ASID that is constructed from VPID+PCID+EPT4A+etc...).
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c     |  8 ++++++++
>  arch/x86/kvm/mmu/tdp_mmu.c | 10 +++++++++-
>  2 files changed, 17 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 80607513a1f2..5a931c89d27b 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5069,6 +5069,14 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
>  	kvm_mmu_sync_roots(vcpu);
>  
>  	kvm_mmu_load_pgd(vcpu);
> +
> +	/*
> +	 * Flush any TLB entries for the new root, the provenance of the root
> +	 * is unknown.  In theory, even if KVM ensures there are no stale TLB
> +	 * entries for a freed root, in theory, an out-of-tree hypervisor could
> +	 * have left stale entries.  Flushing on alloc also allows KVM to skip
> +	 * the TLB flush when freeing a root (see kvm_tdp_mmu_put_root()).
> +	 */
>  	static_call(kvm_x86_flush_tlb_current)(vcpu);
>  out:
>  	return r;
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 12866113fb4f..e35bd88d92fd 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -93,7 +93,15 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
>  	list_del_rcu(&root->link);
>  	spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
>  
> -	zap_gfn_range(kvm, root, 0, -1ull, false, false, shared);
> +	/*
> +	 * A TLB flush is not necessary as KVM performs a local TLB flush when
> +	 * allocating a new root (see kvm_mmu_load()), and when migrating vCPU
> +	 * to a different pCPU.  Note, the local TLB flush on reuse also
> +	 * invalidates any paging-structure-cache entries, i.e. TLB entries for
> +	 * intermediate paging structures, that may be zapped, as such entries
> +	 * are associated with the ASID on both VMX and SVM.
> +	 */
> +	(void)zap_gfn_range(kvm, root, 0, -1ull, false, false, shared);

Understood that we could avoid the TLB flush here. Just curious why the
"(void)" is needed here? Is it for compile time reason?
>  
>  	call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
>  }
> -- 
> 2.35.1.574.g5d30c73bfb-goog
>
Sean Christopherson March 3, 2022, 12:12 a.m. UTC | #2
On Wed, Mar 02, 2022, Mingwei Zhang wrote:
> On Sat, Feb 26, 2022, Sean Christopherson wrote:
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 12866113fb4f..e35bd88d92fd 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -93,7 +93,15 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
> >  	list_del_rcu(&root->link);
> >  	spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> >  
> > -	zap_gfn_range(kvm, root, 0, -1ull, false, false, shared);
> > +	/*
> > +	 * A TLB flush is not necessary as KVM performs a local TLB flush when
> > +	 * allocating a new root (see kvm_mmu_load()), and when migrating vCPU
> > +	 * to a different pCPU.  Note, the local TLB flush on reuse also
> > +	 * invalidates any paging-structure-cache entries, i.e. TLB entries for
> > +	 * intermediate paging structures, that may be zapped, as such entries
> > +	 * are associated with the ASID on both VMX and SVM.
> > +	 */
> > +	(void)zap_gfn_range(kvm, root, 0, -1ull, false, false, shared);
> 
> Understood that we could avoid the TLB flush here. Just curious why the
> "(void)" is needed here? Is it for compile time reason?

Nope, no functional purpose, though there might be some "advanced" warning or
static checkers that care.

The "(void)" is to communicate to human readers that the result is intentionally
ignored, e.g. to reduce the probability of someone "fixing" the code by acting on
the result of zap_gfn_range().  The comment should suffice, but it's nice to have
the code be self-documenting as much as possible.
Mingwei Zhang March 3, 2022, 1:20 a.m. UTC | #3
On Thu, Mar 03, 2022, Sean Christopherson wrote:
> On Wed, Mar 02, 2022, Mingwei Zhang wrote:
> > On Sat, Feb 26, 2022, Sean Christopherson wrote:
> > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > > index 12866113fb4f..e35bd88d92fd 100644
> > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > > @@ -93,7 +93,15 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
> > >  	list_del_rcu(&root->link);
> > >  	spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> > >  
> > > -	zap_gfn_range(kvm, root, 0, -1ull, false, false, shared);
> > > +	/*
> > > +	 * A TLB flush is not necessary as KVM performs a local TLB flush when
> > > +	 * allocating a new root (see kvm_mmu_load()), and when migrating vCPU
> > > +	 * to a different pCPU.  Note, the local TLB flush on reuse also
> > > +	 * invalidates any paging-structure-cache entries, i.e. TLB entries for
> > > +	 * intermediate paging structures, that may be zapped, as such entries
> > > +	 * are associated with the ASID on both VMX and SVM.
> > > +	 */
> > > +	(void)zap_gfn_range(kvm, root, 0, -1ull, false, false, shared);
> > 
> > Understood that we could avoid the TLB flush here. Just curious why the
> > "(void)" is needed here? Is it for compile time reason?
> 
> Nope, no functional purpose, though there might be some "advanced" warning or
> static checkers that care.
> 
> The "(void)" is to communicate to human readers that the result is intentionally
> ignored, e.g. to reduce the probability of someone "fixing" the code by acting on
> the result of zap_gfn_range().  The comment should suffice, but it's nice to have
> the code be self-documenting as much as possible.

Right, I got the point. Thanks.

Coming back. It seems that I pretended to understand that we should
avoid the TLB flush without really knowing why.

I mean, leaving (part of the) stale TLB entries unflushed will still be
dangerous right? Or am I missing something that guarantees to flush the
local TLB before returning to the guest? For instance,
kvm_mmu_{re,}load()?
Sean Christopherson March 3, 2022, 1:41 a.m. UTC | #4
On Thu, Mar 03, 2022, Mingwei Zhang wrote:
> On Thu, Mar 03, 2022, Sean Christopherson wrote:
> > On Wed, Mar 02, 2022, Mingwei Zhang wrote:
> > > On Sat, Feb 26, 2022, Sean Christopherson wrote:
> > > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > > > index 12866113fb4f..e35bd88d92fd 100644
> > > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > > > @@ -93,7 +93,15 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
> > > >  	list_del_rcu(&root->link);
> > > >  	spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> > > >  
> > > > -	zap_gfn_range(kvm, root, 0, -1ull, false, false, shared);
> > > > +	/*
> > > > +	 * A TLB flush is not necessary as KVM performs a local TLB flush when
> > > > +	 * allocating a new root (see kvm_mmu_load()), and when migrating vCPU
> > > > +	 * to a different pCPU.  Note, the local TLB flush on reuse also
> > > > +	 * invalidates any paging-structure-cache entries, i.e. TLB entries for
> > > > +	 * intermediate paging structures, that may be zapped, as such entries
> > > > +	 * are associated with the ASID on both VMX and SVM.
> > > > +	 */
> > > > +	(void)zap_gfn_range(kvm, root, 0, -1ull, false, false, shared);
> > > 
> > > Understood that we could avoid the TLB flush here. Just curious why the
> > > "(void)" is needed here? Is it for compile time reason?
> > 
> > Nope, no functional purpose, though there might be some "advanced" warning or
> > static checkers that care.
> > 
> > The "(void)" is to communicate to human readers that the result is intentionally
> > ignored, e.g. to reduce the probability of someone "fixing" the code by acting on
> > the result of zap_gfn_range().  The comment should suffice, but it's nice to have
> > the code be self-documenting as much as possible.
> 
> Right, I got the point. Thanks.
> 
> Coming back. It seems that I pretended to understand that we should
> avoid the TLB flush without really knowing why.
> 
> I mean, leaving (part of the) stale TLB entries unflushed will still be
> dangerous right? Or am I missing something that guarantees to flush the
> local TLB before returning to the guest? For instance,
> kvm_mmu_{re,}load()?

Heh, if SVM's ASID management wasn't a mess[*], it'd be totally fine.  The idea,
and what the EPT architecture mandates, is that each TDP root is associated with an
ASID.  So even though there may be stale entries in the TLB for a root, because
that root is no longer used those stale entries are unreachable.  And if KVM ever
happens to reallocate the same physical page for a root, that's ok because KVM must
be paranoid and flush that root (see code comment in this patch).

What we're missing on SVM is proper ASID handling.  If KVM uses ASIDs the way AMD
intends them to be used, then this works as intended because each root is again
associated with a specific ASID, and KVM just needs to flush when (re)allocating
a root and when reusing an ASID (which it already handles).

[*] https://lore.kernel.org/all/Yh%2FJdHphCLOm4evG@google.com
Mingwei Zhang March 3, 2022, 4:50 a.m. UTC | #5
On Thu, Mar 03, 2022, Sean Christopherson wrote:
> On Thu, Mar 03, 2022, Mingwei Zhang wrote:
> > On Thu, Mar 03, 2022, Sean Christopherson wrote:
> > > On Wed, Mar 02, 2022, Mingwei Zhang wrote:
> > > > On Sat, Feb 26, 2022, Sean Christopherson wrote:
> > > > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > > > > index 12866113fb4f..e35bd88d92fd 100644
> > > > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > > > > @@ -93,7 +93,15 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
> > > > >  	list_del_rcu(&root->link);
> > > > >  	spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> > > > >  
> > > > > -	zap_gfn_range(kvm, root, 0, -1ull, false, false, shared);
> > > > > +	/*
> > > > > +	 * A TLB flush is not necessary as KVM performs a local TLB flush when
> > > > > +	 * allocating a new root (see kvm_mmu_load()), and when migrating vCPU
> > > > > +	 * to a different pCPU.  Note, the local TLB flush on reuse also
> > > > > +	 * invalidates any paging-structure-cache entries, i.e. TLB entries for
> > > > > +	 * intermediate paging structures, that may be zapped, as such entries
> > > > > +	 * are associated with the ASID on both VMX and SVM.
> > > > > +	 */
> > > > > +	(void)zap_gfn_range(kvm, root, 0, -1ull, false, false, shared);
> > > > 
> > > > Understood that we could avoid the TLB flush here. Just curious why the
> > > > "(void)" is needed here? Is it for compile time reason?
> > > 
> > > Nope, no functional purpose, though there might be some "advanced" warning or
> > > static checkers that care.
> > > 
> > > The "(void)" is to communicate to human readers that the result is intentionally
> > > ignored, e.g. to reduce the probability of someone "fixing" the code by acting on
> > > the result of zap_gfn_range().  The comment should suffice, but it's nice to have
> > > the code be self-documenting as much as possible.
> > 
> > Right, I got the point. Thanks.
> > 
> > Coming back. It seems that I pretended to understand that we should
> > avoid the TLB flush without really knowing why.
> > 
> > I mean, leaving (part of the) stale TLB entries unflushed will still be
> > dangerous right? Or am I missing something that guarantees to flush the
> > local TLB before returning to the guest? For instance,
> > kvm_mmu_{re,}load()?
> 
> Heh, if SVM's ASID management wasn't a mess[*], it'd be totally fine.  The idea,
> and what EPT architectures mandates, is that each TDP root is associated with an
> ASID.  So even though there may be stale entries in the TLB for a root, because
> that root is no longer used those stale entries are unreachable.  And if KVM ever
> happens to reallocate the same physical page for a root, that's ok because KVM must
> be paranoid and flush that root (see code comment in this patch).
> 
> What we're missing on SVM is proper ASID handling.  If KVM uses ASIDs the way AMD
> intends them to be used, then this works as intended because each root is again
> associated with a specific ASID, and KVM just needs to flush when (re)allocating
> a root and when reusing an ASID (which it already handles).
> 
> [*] https://lore.kernel.org/all/Yh%2FJdHphCLOm4evG@google.com

Oh, putting AMD issues aside for now.

I think I was previously focused too narrowly on the zapping logic. So,
I originally thought that anytime we want to zap, we have to do the
following things in strict order:

1) zap SPTEs.
2) flush TLBs.
3) flush cache (AMD SEV only).
4) deallocate shadow pages.

However, if you have already invalidated EPTP (pgd ptr), then step 2)
becomes optional, since those stale TLBs are no longer useable by the
guest due to the change of ASID.

Am I understanding the point correctly? So, for all invalidated roots,
the assumption is that we have already called "kvm_reload_remote_mmus()",
which basically updates the ASID.
Sean Christopherson March 3, 2022, 4:45 p.m. UTC | #6
On Thu, Mar 03, 2022, Mingwei Zhang wrote:
> On Thu, Mar 03, 2022, Sean Christopherson wrote:
> > Heh, if SVM's ASID management wasn't a mess[*], it'd be totally fine.  The idea,
> > and what the EPT architecture mandates, is that each TDP root is associated with an
> > ASID.  So even though there may be stale entries in the TLB for a root, because
> > that root is no longer used those stale entries are unreachable.  And if KVM ever
> > happens to reallocate the same physical page for a root, that's ok because KVM must
> > be paranoid and flush that root (see code comment in this patch).
> > 
> > What we're missing on SVM is proper ASID handling.  If KVM uses ASIDs the way AMD
> > intends them to be used, then this works as intended because each root is again
> > associated with a specific ASID, and KVM just needs to flush when (re)allocating
> > a root and when reusing an ASID (which it already handles).
> > 
> > [*] https://lore.kernel.org/all/Yh%2FJdHphCLOm4evG@google.com
> 
> Oh, putting AMD issues aside for now.
> 
> I think I was previously focused too narrowly on the zapping logic. So,
> I originally thought that anytime we want to zap, we have to do the
> following things in strict order:
> 
> 1) zap SPTEs.
> 2) flush TLBs.
> 3) flush cache (AMD SEV only).
> 4) deallocate shadow pages.

Not necessarily.  1-3 are actually all optional.  E.g. for #1, if KVM somehow
knew that the host didn't care about A/D bits (no writeback needed, no LRU info
needed), then KVM could skip straight to freeing the shadow pages when destroying
a VM.

Flushing the TLB before freeing pages is optional because KVM only needs to ensure
the guest can no longer access the memory.  E.g. at kvm_mmu_notifier_release(),
because KVM disallows KVM_RUN from a different mm, KVM knows that the guest will
never run again and so can skip the TLB flushes.

For the TLB, that does mean KVM needs to flush when using an ASID/EPT4A for the
first time, but KVM needs to do that regardless to guard against a different
hypervisor being loaded previously (where a "different" hypervisor could very
well be an older, buggier version of KVM).

> However, if you have already invalidated EPTP (pgd ptr), then step 2)
> becomes optional, since those stale TLBs are no longer useable by the
> guest due to the change of ASID.

Mostly.  It doesn't require an "invalidated EPTP", just a different EPT4A (or
ASID on SVM).

> Am I understanding the point correctly? So, for all invalidated roots,
> the assumption is that we have already called "kvm_reload_remote_mmus()",
> which basically updates the ASID.

No, the assumption (though I'd describe it a requirement) is that vCPUs can no
longer consume the TLB entries.  That could be due to a reload, but as above it
could also be due to KVM knowing KVM_RUN is unreachable.

Patch

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 80607513a1f2..5a931c89d27b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5069,6 +5069,14 @@  int kvm_mmu_load(struct kvm_vcpu *vcpu)
 	kvm_mmu_sync_roots(vcpu);
 
 	kvm_mmu_load_pgd(vcpu);
+
+	/*
+	 * Flush any TLB entries for the new root, the provenance of the root
+	 * is unknown.  In theory, even if KVM ensures there are no stale TLB
+	 * entries for a freed root, in theory, an out-of-tree hypervisor could
+	 * have left stale entries.  Flushing on alloc also allows KVM to skip
+	 * the TLB flush when freeing a root (see kvm_tdp_mmu_put_root()).
+	 */
 	static_call(kvm_x86_flush_tlb_current)(vcpu);
 out:
 	return r;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 12866113fb4f..e35bd88d92fd 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -93,7 +93,15 @@  void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
 	list_del_rcu(&root->link);
 	spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
 
-	zap_gfn_range(kvm, root, 0, -1ull, false, false, shared);
+	/*
+	 * A TLB flush is not necessary as KVM performs a local TLB flush when
+	 * allocating a new root (see kvm_mmu_load()), and when migrating vCPU
+	 * to a different pCPU.  Note, the local TLB flush on reuse also
+	 * invalidates any paging-structure-cache entries, i.e. TLB entries for
+	 * intermediate paging structures, that may be zapped, as such entries
+	 * are associated with the ASID on both VMX and SVM.
+	 */
+	(void)zap_gfn_range(kvm, root, 0, -1ull, false, false, shared);
 
 	call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
 }