[v3,22/28] KVM: x86/mmu: Zap defunct roots via asynchronous worker

Message ID 20220226001546.360188-23-seanjc@google.com (mailing list archive)
State New, archived
Series KVM: x86/mmu: Overhaul TDP MMU zapping and flushing

Commit Message

Sean Christopherson Feb. 26, 2022, 12:15 a.m. UTC
Zap defunct roots, a.k.a. roots that have been invalidated after their
last reference was initially dropped, asynchronously via the system work
queue instead of forcing the work upon the unfortunate task that happened
to drop the last reference.

If a vCPU task drops the last reference, the vCPU is effectively blocked
by the host for the entire duration of the zap.  If the root being zapped
happens to be fully populated with 4kb leaf SPTEs, e.g. due to dirty logging
being active, the zap can take several hundred seconds.  Unsurprisingly,
most guests are unhappy if a vCPU disappears for hundreds of seconds.

E.g. running a synthetic selftest that triggers a vCPU root zap with
~64tb of guest memory and 4kb SPTEs blocks the vCPU for 900+ seconds.
Offloading the zap to a worker drops the block time to <100ms.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu_internal.h |  8 +++-
 arch/x86/kvm/mmu/tdp_mmu.c      | 65 ++++++++++++++++++++++++++++-----
 2 files changed, 63 insertions(+), 10 deletions(-)
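
Editor's sketch (not part of the patch): the put/zap/put flow described in the
commit message can be modeled in plain userspace C.  Everything below is a
hypothetical stand-in: struct demo_root, demo_put_root() and the work_pending
flag are illustration names only, the "work queue" is a single flag, and
marking the root invalid on its first final put is an assumption taken from the
later discussion rather than from the diff itself.  The point it tries to show
is that the task dropping the last reference returns immediately, and that the
second final put (from the worker) frees the root instead of looping.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a TDP MMU root; NOT the kernel's kvm_mmu_page. */
struct demo_root {
	int refcount;       /* models tdp_mmu_root_count */
	bool invalid;       /* models role.invalid */
	bool zapped;        /* set once the expensive zap has run */
	bool freed;         /* set once the root would be freed */
	bool work_pending;  /* models a work item queued via schedule_work() */
};

static void demo_put_root(struct demo_root *root, bool vm_alive);

/* Models the worker: do the slow zap, then put the "acquired" reference. */
static void demo_zap_worker(struct demo_root *root)
{
	root->work_pending = false;
	root->zapped = true;
	demo_put_root(root, true);
}

/*
 * Models the put path.  vm_alive stands in for a successful
 * kvm_get_kvm_safe(); when it is false the zap happens synchronously.
 */
static void demo_put_root(struct demo_root *root, bool vm_alive)
{
	assert(root->refcount > 0);
	if (--root->refcount)
		return;

	/* Final put of an already-invalid root: free it, no infinite loop. */
	if (root->invalid) {
		root->freed = true;
		return;
	}

	/* First final put: mark invalid and re-take one reference. */
	root->invalid = true;
	root->refcount = 1;

	if (vm_alive) {
		root->work_pending = true;      /* defer the zap */
	} else {
		root->zapped = true;            /* zap directly */
		demo_put_root(root, vm_alive);
	}
}

/* Runs the queued work, if any; returns true if a work item ran. */
static bool demo_run_pending_work(struct demo_root *root)
{
	if (!root->work_pending)
		return false;
	demo_zap_worker(root);
	return true;
}
```

In this model, a caller that drops the last reference with vm_alive set sees
work_pending become true and zapped stay false until demo_run_pending_work()
is called, which is the analog of the vCPU no longer being blocked by the zap.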

Comments

Ben Gardon March 1, 2022, 5:57 p.m. UTC | #1
On Fri, Feb 25, 2022 at 4:16 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Zap defunct roots, a.k.a. roots that have been invalidated after their
> last reference was initially dropped, asynchronously via the system work
> queue instead of forcing the work upon the unfortunate task that happened
> to drop the last reference.
>
> If a vCPU task drops the last reference, the vCPU is effectively blocked
> by the host for the entire duration of the zap.  If the root being zapped
> happens to be fully populated with 4kb leaf SPTEs, e.g. due to dirty logging
> being active, the zap can take several hundred seconds.  Unsurprisingly,
> most guests are unhappy if a vCPU disappears for hundreds of seconds.
>
> E.g. running a synthetic selftest that triggers a vCPU root zap with
> ~64tb of guest memory and 4kb SPTEs blocks the vCPU for 900+ seconds.
> Offloading the zap to a worker drops the block time to <100ms.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: Ben Gardon <bgardon@google.com>

> ---
>  arch/x86/kvm/mmu/mmu_internal.h |  8 +++-
>  arch/x86/kvm/mmu/tdp_mmu.c      | 65 ++++++++++++++++++++++++++++-----
>  2 files changed, 63 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index be063b6c91b7..1bff453f7cbe 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -65,7 +65,13 @@ struct kvm_mmu_page {
>                 struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */
>                 tdp_ptep_t ptep;
>         };
> -       DECLARE_BITMAP(unsync_child_bitmap, 512);
> +       union {
> +               DECLARE_BITMAP(unsync_child_bitmap, 512);
> +               struct {
> +                       struct work_struct tdp_mmu_async_work;
> +                       void *tdp_mmu_async_data;
> +               };
> +       };

At some point (probably not in this series since it's so long already)
it would be good to organize kvm_mmu_page. It looks like we've got
quite a few anonymous unions in there for TDP / Shadow MMU fields.

>
>         struct list_head lpage_disallowed_link;
>  #ifdef CONFIG_X86_32
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index ec28a88c6376..4151e61245a7 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -81,6 +81,38 @@ static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
>  static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
>                              bool shared);
>
> +static void tdp_mmu_zap_root_async(struct work_struct *work)
> +{
> +       struct kvm_mmu_page *root = container_of(work, struct kvm_mmu_page,
> +                                                tdp_mmu_async_work);
> +       struct kvm *kvm = root->tdp_mmu_async_data;
> +
> +       read_lock(&kvm->mmu_lock);
> +
> +       /*
> +        * A TLB flush is not necessary as KVM performs a local TLB flush when
> +        * allocating a new root (see kvm_mmu_load()), and when migrating vCPU
> +        * to a different pCPU.  Note, the local TLB flush on reuse also
> +        * invalidates any paging-structure-cache entries, i.e. TLB entries for
> +        * intermediate paging structures, that may be zapped, as such entries
> +        * are associated with the ASID on both VMX and SVM.
> +        */
> +       tdp_mmu_zap_root(kvm, root, true);
> +
> +       /*
> +        * Drop the refcount using kvm_tdp_mmu_put_root() to test its logic for
> +        * avoiding an infinite loop.  By design, the root is reachable while
> +        * it's being asynchronously zapped, thus a different task can put its
> +        * last reference, i.e. flowing through kvm_tdp_mmu_put_root() for an
> +        * asynchronously zapped root is unavoidable.
> +        */
> +       kvm_tdp_mmu_put_root(kvm, root, true);
> +
> +       read_unlock(&kvm->mmu_lock);
> +
> +       kvm_put_kvm(kvm);
> +}
> +
>  void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
>                           bool shared)
>  {
> @@ -142,15 +174,26 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
>         refcount_set(&root->tdp_mmu_root_count, 1);
>
>         /*
> -        * Zap the root, then put the refcount "acquired" above.   Recursively
> -        * call kvm_tdp_mmu_put_root() to test the above logic for avoiding an
> -        * infinite loop by freeing invalid roots.  By design, the root is
> -        * reachable while it's being zapped, thus a different task can put its
> -        * last reference, i.e. flowing through kvm_tdp_mmu_put_root() for a
> -        * defunct root is unavoidable.
> +        * Attempt to acquire a reference to KVM itself.  If KVM is alive, then
> +        * zap the root asynchronously in a worker, otherwise it must be zapped
> +        * directly here.  Wait to do this check until after the refcount is
> +        * reset so that tdp_mmu_zap_root() can safely yield.
> +        *
> +        * In both flows, zap the root, then put the refcount "acquired" above.
> +        * When putting the reference, use kvm_tdp_mmu_put_root() to test the
> +        * above logic for avoiding an infinite loop by freeing invalid roots.
> +        * By design, the root is reachable while it's being zapped, thus a
> +        * different task can put its last reference, i.e. flowing through
> +        * kvm_tdp_mmu_put_root() for a defunct root is unavoidable.
>          */
> -       tdp_mmu_zap_root(kvm, root, shared);
> -       kvm_tdp_mmu_put_root(kvm, root, shared);
> +       if (kvm_get_kvm_safe(kvm)) {
> +               root->tdp_mmu_async_data = kvm;
> +               INIT_WORK(&root->tdp_mmu_async_work, tdp_mmu_zap_root_async);
> +               schedule_work(&root->tdp_mmu_async_work);
> +       } else {
> +               tdp_mmu_zap_root(kvm, root, shared);
> +               kvm_tdp_mmu_put_root(kvm, root, shared);
> +       }
>  }
>
>  enum tdp_mmu_roots_iter_type {
> @@ -954,7 +997,11 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
>
>         /*
>          * Zap all roots, including invalid roots, as all SPTEs must be dropped
> -        * before returning to the caller.
> +        * before returning to the caller.  Zap directly even if the root is
> +        * also being zapped by a worker.  Walking zapped top-level SPTEs isn't
> +        * all that expensive and mmu_lock is already held, which means the
> +        * worker has yielded, i.e. flushing the work instead of zapping here
> +        * isn't guaranteed to be any faster.
>          *
>          * A TLB flush is unnecessary, KVM zaps everything if and only the VM
>          * is being destroyed or the userspace VMM has exited.  In both cases,
> --
> 2.35.1.574.g5d30c73bfb-goog
>
Paolo Bonzini March 2, 2022, 5:25 p.m. UTC | #2
On 2/26/22 01:15, Sean Christopherson wrote:
> Zap defunct roots, a.k.a. roots that have been invalidated after their
> last reference was initially dropped, asynchronously via the system work
> queue instead of forcing the work upon the unfortunate task that happened
> to drop the last reference.
> 
> If a vCPU task drops the last reference, the vCPU is effectively blocked
> by the host for the entire duration of the zap.  If the root being zapped
> happens to be fully populated with 4kb leaf SPTEs, e.g. due to dirty logging
> being active, the zap can take several hundred seconds.  Unsurprisingly,
> most guests are unhappy if a vCPU disappears for hundreds of seconds.
> 
> E.g. running a synthetic selftest that triggers a vCPU root zap with
> ~64tb of guest memory and 4kb SPTEs blocks the vCPU for 900+ seconds.
> Offloading the zap to a worker drops the block time to <100ms.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---

Do we even need kvm_tdp_mmu_zap_invalidated_roots() now?  That is,
something like the following:

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index bd3625a875ef..5fd8bc858c6f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5698,6 +5698,16 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
  {
  	lockdep_assert_held(&kvm->slots_lock);
  
+	/*
+	 * kvm_tdp_mmu_invalidate_all_roots() needs a nonzero reference
+	 * count.  If we're dying, zap everything as it's going to happen
+	 * soon anyway.
+	 */
+	if (!refcount_read(&kvm->users_count)) {
+		kvm_mmu_zap_all(kvm);
+		return;
+	}
+
  	write_lock(&kvm->mmu_lock);
  	trace_kvm_mmu_zap_all_fast(kvm);
  
@@ -5732,20 +5742,6 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
  	kvm_zap_obsolete_pages(kvm);
  
  	write_unlock(&kvm->mmu_lock);
-
-	/*
-	 * Zap the invalidated TDP MMU roots, all SPTEs must be dropped before
-	 * returning to the caller, e.g. if the zap is in response to a memslot
-	 * deletion, mmu_notifier callbacks will be unable to reach the SPTEs
-	 * associated with the deleted memslot once the update completes, and
-	 * Deferring the zap until the final reference to the root is put would
-	 * lead to use-after-free.
-	 */
-	if (is_tdp_mmu_enabled(kvm)) {
-		read_lock(&kvm->mmu_lock);
-		kvm_tdp_mmu_zap_invalidated_roots(kvm);
-		read_unlock(&kvm->mmu_lock);
-	}
  }
  
  static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index cd1bf68e7511..af9db5b8f713 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -142,10 +142,12 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
  	WARN_ON(!root->tdp_mmu_page);
  
  	/*
-	 * The root now has refcount=0 and is valid.  Readers cannot acquire
-	 * a reference to it (they all visit valid roots only, except for
-	 * kvm_tdp_mmu_zap_invalidated_roots() which however does not acquire
-	 * any reference itself.
+	 * The root now has refcount=0.  It is valid, but readers already
+	 * cannot acquire a reference to it because kvm_tdp_mmu_get_root()
+	 * rejects it.  This remains true for the rest of the execution
+	 * of this function, because readers visit valid roots only
+	 * (except for tdp_mmu_zap_root_work(), which however operates only
+	 * on one specific root and does not acquire any reference itself).

  	 *
  	 * Even though there are flows that need to visit all roots for
  	 * correctness, they all take mmu_lock for write, so they cannot yet
@@ -996,103 +994,16 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
  	}
  }
  
-static struct kvm_mmu_page *next_invalidated_root(struct kvm *kvm,
-						  struct kvm_mmu_page *prev_root)
-{
-	struct kvm_mmu_page *next_root;
-
-	if (prev_root)
-		next_root = list_next_or_null_rcu(&kvm->arch.tdp_mmu_roots,
-						  &prev_root->link,
-						  typeof(*prev_root), link);
-	else
-		next_root = list_first_or_null_rcu(&kvm->arch.tdp_mmu_roots,
-						   typeof(*next_root), link);
-
-	while (next_root && !(next_root->role.invalid &&
-			      refcount_read(&next_root->tdp_mmu_root_count)))
-		next_root = list_next_or_null_rcu(&kvm->arch.tdp_mmu_roots,
-						  &next_root->link,
-						  typeof(*next_root), link);
-
-	return next_root;
-}
-
-/*
- * Zap all invalidated roots to ensure all SPTEs are dropped before the "fast
- * zap" completes.  Since kvm_tdp_mmu_invalidate_all_roots() has acquired a
- * reference to each invalidated root, roots will not be freed until after this
- * function drops the gifted reference, e.g. so that vCPUs don't get stuck with
- * tearing paging structures.
- */
-void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
-{
-	struct kvm_mmu_page *next_root;
-	struct kvm_mmu_page *root;
-
-	lockdep_assert_held_read(&kvm->mmu_lock);
-
-	rcu_read_lock();
-
-	root = next_invalidated_root(kvm, NULL);
-
-	while (root) {
-		next_root = next_invalidated_root(kvm, root);
-
-		rcu_read_unlock();
-
-		/*
-		 * Zap the root regardless of what marked it invalid, e.g. even
-		 * if the root was marked invalid by kvm_tdp_mmu_put_root() due
-		 * to its last reference being put.  All SPTEs must be dropped
-		 * before returning to the caller, e.g. if a memslot is deleted
-		 * or moved, the memslot's associated SPTEs are unreachable via
-		 * the mmu_notifier once the memslot update completes.
-		 *
-		 * A TLB flush is unnecessary, invalidated roots are guaranteed
-		 * to be unreachable by the guest (see kvm_tdp_mmu_put_root()
-		 * for more details), and unlike the legacy MMU, no vCPU kick
-		 * is needed to play nice with lockless shadow walks as the TDP
-		 * MMU protects its paging structures via RCU.  Note, zapping
-		 * will still flush on yield, but that's a minor performance
-		 * blip and not a functional issue.
-		 */
-		tdp_mmu_zap_root(kvm, root, true);
-
-		/*
-		 * Put the reference acquired in
-		 * kvm_tdp_mmu_invalidate_roots
-		 */
-		kvm_tdp_mmu_put_root(kvm, root, true);
-
-		root = next_root;
-
-		rcu_read_lock();
-	}
-
-	rcu_read_unlock();
-}
-
  /*
   * Mark each TDP MMU root as invalid to prevent vCPUs from reusing a root that
- * is about to be zapped, e.g. in response to a memslots update.  The caller is
- * responsible for invoking kvm_tdp_mmu_zap_invalidated_roots() to the actual
- * zapping.
- *
- * Take a reference on all roots to prevent the root from being freed before it
- * is zapped by this thread.  Freeing a root is not a correctness issue, but if
- * a vCPU drops the last reference to a root prior to the root being zapped, it
- * will get stuck with tearing down the entire paging structure.
- *
- * Get a reference even if the root is already invalid,
- * kvm_tdp_mmu_zap_invalidated_roots() assumes it was gifted a reference to all
- * invalid roots, e.g. there's no epoch to identify roots that were invalidated
- * by a previous call.  Roots stay on the list until the last reference is
- * dropped, so even though all invalid roots are zapped, a root may not go away
- * for quite some time, e.g. if a vCPU blocks across multiple memslot updates.
+ * is about to be zapped, e.g. in response to a memslots update.  The actual
+ * zapping is performed asynchronously, so a reference is taken on all roots
+ * as well as (once per root) on the struct kvm.
   *
- * Because mmu_lock is held for write, it should be impossible to observe a
- * root with zero refcount, i.e. the list of roots cannot be stale.
+ * Get a reference even if the root is already invalid, the asynchronous worker
+ * assumes it was gifted a reference to the root it processes.  Because mmu_lock
+ * is held for write, it should be impossible to observe a root with zero refcount,
+ * i.e. the list of roots cannot be stale.
   *
   * This has essentially the same effect for the TDP MMU
   * as updating mmu_valid_gen does for the shadow MMU.
@@ -1103,8 +1014,11 @@ void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
  
  	lockdep_assert_held_write(&kvm->mmu_lock);
  	list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
-		if (!WARN_ON_ONCE(!kvm_tdp_mmu_get_root(root)))
+		if (!WARN_ON_ONCE(!kvm_tdp_mmu_get_root(root))) {
  			root->role.invalid = true;
+			kvm_get_kvm(kvm);
+			tdp_mmu_schedule_zap_root(kvm, root);
+		}
  	}
  }
  

It passes a smoke test, and also resolves the debate on the fate of patch 1.

However, I think we now need a module_get/module_put when creating/destroying
a VM; the workers can outlive kvm_vm_release and therefore any reference
automatically taken by VFS's fops_get/fops_put.

Paolo
Sean Christopherson March 2, 2022, 5:35 p.m. UTC | #3
On Wed, Mar 02, 2022, Paolo Bonzini wrote:
> However, I think we now need a module_get/module_put when creating/destroying
> a VM; the workers can outlive kvm_vm_release and therefore any reference
> automatically taken by VFS's fops_get/fops_put.

Haven't read the rest of the patch, but this caught my eye.  We _already_ need
to handle this scenario.  As you noted, any worker, i.e. anything that takes a
reference via kvm_get_kvm() without any additional guarantee that the module can't
be unloaded is suspect. x86 is mostly fine, though kvm_setup_async_pf() is likely
affected, and other architectures seem to have bugs.

Google has an internal patch that addresses this.  I believe David is going to post
the fix... David?
Sean Christopherson March 2, 2022, 6:01 p.m. UTC | #4
On Wed, Mar 02, 2022, Paolo Bonzini wrote:
> On 2/26/22 01:15, Sean Christopherson wrote:
> > Zap defunct roots, a.k.a. roots that have been invalidated after their
> > last reference was initially dropped, asynchronously via the system work
> > queue instead of forcing the work upon the unfortunate task that happened
> > to drop the last reference.
> > 
> > If a vCPU task drops the last reference, the vCPU is effectively blocked
> > by the host for the entire duration of the zap.  If the root being zapped
> > happens to be fully populated with 4kb leaf SPTEs, e.g. due to dirty logging
> > being active, the zap can take several hundred seconds.  Unsurprisingly,
> > most guests are unhappy if a vCPU disappears for hundreds of seconds.
> > 
> > E.g. running a synthetic selftest that triggers a vCPU root zap with
> > ~64tb of guest memory and 4kb SPTEs blocks the vCPU for 900+ seconds.
> > Offloading the zap to a worker drops the block time to <100ms.
> > 
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> 
> Do we even need kvm_tdp_mmu_zap_invalidated_roots() now?  That is,
> something like the following:

Nice!  I initially did something similar (moving invalidated roots to a separate
list), but never circled back to the idea after implementing the worker stuff.

> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index bd3625a875ef..5fd8bc858c6f 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5698,6 +5698,16 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
>  {
>  	lockdep_assert_held(&kvm->slots_lock);
> +	/*
> +	 * kvm_tdp_mmu_invalidate_all_roots() needs a nonzero reference
> +	 * count.  If we're dying, zap everything as it's going to happen
> +	 * soon anyway.
> +	 */
> +	if (!refcount_read(&kvm->users_count)) {
> +		kvm_mmu_zap_all(kvm);
> +		return;
> +	}

I'd prefer we make this an assertion and shove this logic to set_nx_huge_pages(),
because in that case there's no need to zap anything, the guest can never run
again.  E.g. (I'm trying to remember why I didn't do this before...)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b2c1c4eb6007..d4d25ab88ae7 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6132,7 +6132,8 @@ static int set_nx_huge_pages(const char *val, const struct kernel_param *kp)
 
                list_for_each_entry(kvm, &vm_list, vm_list) {
                        mutex_lock(&kvm->slots_lock);
-                       kvm_mmu_zap_all_fast(kvm);
+                       if (refcount_read(&kvm->users_count))
+                               kvm_mmu_zap_all_fast(kvm);
                        mutex_unlock(&kvm->slots_lock);
 
                        wake_up_process(kvm->arch.nx_lpage_recovery_thread);


> +
>  	write_lock(&kvm->mmu_lock);
>  	trace_kvm_mmu_zap_all_fast(kvm);
> @@ -5732,20 +5742,6 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
>  	kvm_zap_obsolete_pages(kvm);
>  	write_unlock(&kvm->mmu_lock);
> -
> -	/*
> -	 * Zap the invalidated TDP MMU roots, all SPTEs must be dropped before
> -	 * returning to the caller, e.g. if the zap is in response to a memslot
> -	 * deletion, mmu_notifier callbacks will be unable to reach the SPTEs
> -	 * associated with the deleted memslot once the update completes, and
> -	 * Deferring the zap until the final reference to the root is put would
> -	 * lead to use-after-free.
> -	 */
> -	if (is_tdp_mmu_enabled(kvm)) {
> -		read_lock(&kvm->mmu_lock);
> -		kvm_tdp_mmu_zap_invalidated_roots(kvm);
> -		read_unlock(&kvm->mmu_lock);
> -	}
>  }
>  static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index cd1bf68e7511..af9db5b8f713 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -142,10 +142,12 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
>  	WARN_ON(!root->tdp_mmu_page);
>  	/*
> -	 * The root now has refcount=0 and is valid.  Readers cannot acquire
> -	 * a reference to it (they all visit valid roots only, except for
> -	 * kvm_tdp_mmu_zap_invalidated_roots() which however does not acquire
> -	 * any reference itself.
> +	 * The root now has refcount=0.  It is valid, but readers already
> +	 * cannot acquire a reference to it because kvm_tdp_mmu_get_root()
> +	 * rejects it.  This remains true for the rest of the execution
> +	 * of this function, because readers visit valid roots only

One thing that keeps tripping me up is the "readers" verbiage.  I get confused
because taking mmu_lock for read vs. write doesn't really have anything to do with
reading or writing state, e.g. "readers" still write SPTEs, and so I keep thinking
"readers" means anything iterating over the set of roots.  Not sure if there's a
shorthand that won't be confusing.

> +	 * (except for tdp_mmu_zap_root_work(), which however operates only
> +	 * on one specific root and does not acquire any reference itself).
> 
>  	 *
>  	 * Even though there are flows that need to visit all roots for
>  	 * correctness, they all take mmu_lock for write, so they cannot yet

...

> It passes a smoke test, and also resolves the debate on the fate of patch 1.

+1000, I love this approach.  Do you want me to work on a v3, or shall I let you
have the honors?
Paolo Bonzini March 2, 2022, 6:20 p.m. UTC | #5
On 3/2/22 19:01, Sean Christopherson wrote:
>> +	 */
>> +	if (!refcount_read(&kvm->users_count)) {
>> +		kvm_mmu_zap_all(kvm);
>> +		return;
>> +	}
> 
> I'd prefer we make this an assertion and shove this logic to set_nx_huge_pages(),
> because in that case there's no need to zap anything, the guest can never run
> again.  E.g. (I'm trying to remember why I didn't do this before...)

I did it this way because it seemed like a reasonable fallback for any 
present or future caller.

> One thing that keeps tripping me up is the "readers" verbiage.  I get confused
> because taking mmu_lock for read vs. write doesn't really have anything to do with
> reading or writing state, e.g. "readers" still write SPTEs, and so I keep thinking
> "readers" means anything iterating over the set of roots.  Not sure if there's a
> shorthand that won't be confusing.

Not that I know of.  You really need to know that the rwlock is being
used for its shared/exclusive locking behavior.  But even on other OSes
that use shared/exclusive instead of read/write, there are no analogous
nouns and people end up using readers/writers anyway.

>> It passes a smoke test, and also resolves the debate on the fate of patch 1.
> +1000, I love this approach.  Do you want me to work on a v3, or shall I let you
> have the honors?

I'm already running the usual battery of tests, so I should be able to 
post it either tomorrow (early in my evening) or Friday morning.

Paolo
David Matlack March 2, 2022, 6:33 p.m. UTC | #6
On Wed, Mar 2, 2022 at 9:35 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Mar 02, 2022, Paolo Bonzini wrote:
> > However, I think we now need a module_get/module_put when creating/destroying
> > a VM; the workers can outlive kvm_vm_release and therefore any reference
> > automatically taken by VFS's fops_get/fops_put.
>
> Haven't read the rest of the patch, but this caught my eye.  We _already_ need
> to handle this scenario.  As you noted, any worker, i.e. anything that takes a
> reference via kvm_get_kvm() without any additional guarantee that the module can't
> be unloaded is suspect. x86 is mostly fine, though kvm_setup_async_pf() is likely
> affected, and other architectures seem to have bugs.
>
> Google has an internal patch that addresses this.  I believe David is going to post
> the fix... David?

This was towards the back of my queue but I can bump it to the front.
I'll have the patches out this week.
Paolo Bonzini March 2, 2022, 6:36 p.m. UTC | #7
On 3/2/22 19:33, David Matlack wrote:
> On Wed, Mar 2, 2022 at 9:35 AM Sean Christopherson <seanjc@google.com> wrote:
>>
>> On Wed, Mar 02, 2022, Paolo Bonzini wrote:
>>> However, I think we now need a module_get/module_put when creating/destroying
>>> a VM; the workers can outlive kvm_vm_release and therefore any reference
>>> automatically taken by VFS's fops_get/fops_put.
>>
>> Haven't read the rest of the patch, but this caught my eye.  We _already_ need
>> to handle this scenario.  As you noted, any worker, i.e. anything that takes a
>> reference via kvm_get_kvm() without any additional guarantee that the module can't
>> be unloaded is suspect. x86 is mostly fine, though kvm_setup_async_pf() is likely
>> affected, and other architectures seem to have bugs.
>>
>> Google has an internal patch that addresses this.  I believe David is going to post
>> the fix... David?
> 
> This was towards the back of my queue but I can bump it to the front.
> I'll have the patches out this week.

Thanks!

Paolo
Sean Christopherson March 2, 2022, 7:33 p.m. UTC | #8
On Wed, Mar 02, 2022, Paolo Bonzini wrote:
> On 3/2/22 19:01, Sean Christopherson wrote:
> > > It passes a smoke test, and also resolves the debate on the fate of patch 1.
> > +1000, I love this approach.  Do you want me to work on a v3, or shall I let you
> > have the honors?
> 
> I'm already running the usual battery of tests, so I should be able to post
> it either tomorrow (early in my evening) or Friday morning.

Gah, now I remember why I didn't use an async worker.  kvm_mmu_zap_all_fast()
must ensure all SPTEs are zapped and their dirty/accessed data written back to
the primary MMU prior to returning.  Once the memslot update completes, the old
deleted/moved memslot is no longer reachable by the mmu_notifier.  If an mmu_notifier
zaps pfns reachable via the root, KVM will do nothing because there's no relevant
memslot.

So we can use workers, but kvm_mmu_zap_all_fast() would need to flush all workers
before returning, which ends up being no different than putting the invalid roots
on a different list.

What about that idea?  Put roots invalidated by "fast zap" on _another_ list?
My very original idea of moving the roots to a separate list didn't work because
the roots needed to be reachable by the mmu_notifier.  But we could just add
another list_head (inside the unsync_child_bitmap union) and add the roots to
_that_ list.

Let me go resurrect that patch from v1 and tweak it to keep the roots on the old
list, but add them to a new list as well.  That would get rid of the invalid
root iterator stuff.
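
Editor's sketch (not from the thread): one way to picture the "second list"
idea is a root that carries two links, one for the list of all roots and one
for a list of roots invalidated by the fast zap.  The names below (demo_vm,
demo_root, next_invalid, and so on) are hypothetical, singly linked pointers
stand in for the kernel's list_head, and skipping already-invalid roots is an
assumption made for the illustration.  The roots stay reachable on the main
list the whole time, and the zap phase simply drains the second list instead of
filtering the main one.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical root with two independent list links. */
struct demo_root {
	int id;
	bool invalid;
	struct demo_root *next;          /* link on the list of all roots */
	struct demo_root *next_invalid;  /* link on the invalidated list */
};

struct demo_vm {
	struct demo_root *roots;         /* every root, valid or not */
	struct demo_root *invalidated;   /* roots awaiting a zap */
};

static void demo_add_root(struct demo_vm *vm, struct demo_root *root)
{
	root->invalid = false;
	root->next_invalid = NULL;
	root->next = vm->roots;
	vm->roots = root;
}

/*
 * Mark every valid root invalid and also thread it onto the second list.
 * Roots are NOT removed from the main list.  Returns the number queued.
 */
static int demo_invalidate_all(struct demo_vm *vm)
{
	int queued = 0;

	for (struct demo_root *r = vm->roots; r; r = r->next) {
		if (r->invalid)
			continue;
		r->invalid = true;
		r->next_invalid = vm->invalidated;
		vm->invalidated = r;
		queued++;
	}
	return queued;
}

/* Drain the second list; no "next invalid root" iterator is needed. */
static int demo_zap_invalidated(struct demo_vm *vm)
{
	int zapped = 0;

	while (vm->invalidated) {
		struct demo_root *r = vm->invalidated;

		vm->invalidated = r->next_invalid;
		r->next_invalid = NULL;
		zapped++;
	}
	return zapped;
}

/* Count roots still reachable via the main list. */
static int demo_count_roots(const struct demo_vm *vm)
{
	int n = 0;

	for (const struct demo_root *r = vm->roots; r; r = r->next)
		n++;
	return n;
}
```

After demo_invalidate_all() and demo_zap_invalidated(), demo_count_roots()
still reports every root, which is the property the mmu_notifier argument
above depends on.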
Paolo Bonzini March 2, 2022, 8:14 p.m. UTC | #9
On 3/2/22 20:33, Sean Christopherson wrote:
> What about that idea?  Put roots invalidated by "fast zap" on _another_ list?
> My very original idea of moving the roots to a separate list didn't work because
> the roots needed to be reachable by the mmu_notifier.  But we could just add
> another list_head (inside the unsync_child_bitmap union) and add the roots to
> _that_ list.

Perhaps the "separate list" idea could be extended to have a single 
worker for all kvm_tdp_mmu_put_root() work, and then indeed replace 
kvm_tdp_mmu_zap_invalidated_roots() with a flush of _that_ worker.  The 
disadvantage is a little less parallelism in zapping invalidated roots; 
but what is good for kvm_tdp_mmu_zap_invalidated_roots() is just as good 
for kvm_tdp_mmu_put_root(), I suppose.  If one wants separate work 
items, KVM could have its own workqueue, and then you flush that workqueue.

For now let's do it the simple but ugly way.  Keeping 
next_invalidated_root() does not make things worse than the status quo, 
and further work will be easier to review if it's kept separate from 
this already-complex work.

Paolo
Sean Christopherson March 2, 2022, 8:47 p.m. UTC | #10
On Wed, Mar 02, 2022, Paolo Bonzini wrote:
> On 3/2/22 20:33, Sean Christopherson wrote:
> > What about that idea?  Put roots invalidated by "fast zap" on _another_ list?
> > My very original idea of moving the roots to a separate list didn't work because
> > the roots needed to be reachable by the mmu_notifier.  But we could just add
> > another list_head (inside the unsync_child_bitmap union) and add the roots to
> > _that_ list.
> 
> Perhaps the "separate list" idea could be extended to have a single worker
> for all kvm_tdp_mmu_put_root() work, and then indeed replace
> kvm_tdp_mmu_zap_invalidated_roots() with a flush of _that_ worker.  The
> disadvantage is a little less parallelism in zapping invalidated roots; but
> what is good for kvm_tdp_mmu_zap_invalidated_roots() is just as good for
> kvm_tdp_mmu_put_root(), I suppose.  If one wants separate work items, KVM
> could have its own workqueue, and then you flush that workqueue.
> 
> For now let's do it the simple but ugly way.  Keeping
> next_invalidated_root() does not make things worse than the status quo, and
> further work will be easier to review if it's kept separate from this
> already-complex work.

Oof, that's not gonna work.  My approach here in v3 doesn't work either.  I finally
remembered why I had the dedicated tdp_mmu_defunct_root flag and thus the smp_mb_*()
dance.

kvm_tdp_mmu_zap_invalidated_roots() assumes that it was gifted a reference to
_all_ invalid roots by kvm_tdp_mmu_invalidate_all_roots().  This works in the
current code base only because kvm->slots_lock is held for the entire duration,
i.e. roots can't become invalid between the end of kvm_tdp_mmu_invalidate_all_roots()
and the end of kvm_tdp_mmu_zap_invalidated_roots().

Marking a root invalid in kvm_tdp_mmu_put_root() breaks that assumption, e.g. if a
new root is created and then dropped, it will be marked invalid but the "fast zap"
will not have a reference.  The "defunct" flag prevents this scenario by allowing
the "fast zap" path to identify invalid roots for which it did not take a reference.
By virtue of holding a reference, "fast zap" also guarantees that the roots it needs
to invalidate and put can't become defunct.
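
Editor's sketch (not from the thread): the refcount problem described above can
be reproduced in a small userspace model.  All names here are hypothetical
(demo_root, demo_last_put(), the honor_defunct switch), and the model assumes
that the last-put path re-takes exactly one reference on behalf of whoever zaps
the root.  With the flag honored, the fast zap only puts references it was
actually gifted; without it, the fast zap consumes a reference it never took,
so the deferred put would later underflow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical root; NOT the kernel's kvm_mmu_page. */
struct demo_root {
	int refcount;
	bool invalid;
	bool defunct;   /* invalidated by the last-put path, not by fast zap */
};

/* Fast zap, phase 1: gift a reference to each root and mark it invalid. */
static void demo_invalidate_all(struct demo_root *roots, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		roots[i].refcount++;
		roots[i].invalid = true;
	}
}

/* Last reference dropped elsewhere: the root becomes invalid and defunct. */
static void demo_last_put(struct demo_root *root)
{
	root->refcount--;
	assert(root->refcount == 0);
	root->invalid = true;
	root->defunct = true;
	root->refcount = 1;     /* held for the deferred zap */
}

/*
 * Fast zap, phase 2: put the gifted reference on each invalid root.
 * Returns the number of puts performed.
 */
static int demo_zap_invalidated(struct demo_root *roots, size_t n,
				bool honor_defunct)
{
	int puts = 0;

	for (size_t i = 0; i < n; i++) {
		if (!roots[i].invalid || roots[i].refcount == 0)
			continue;
		if (honor_defunct && roots[i].defunct)
			continue;       /* no reference was gifted for this one */
		roots[i].refcount--;
		puts++;
	}
	return puts;
}
```

For example, invalidating one root, then creating and dropping a second root,
leaves the second root with refcount 1 and defunct set; phase 2 with
honor_defunct leaves that count alone, while phase 2 without it drives the
count to 0 before the deferred put has run.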

My preference would be to either go back to a variant of v2, or to implement my
"second list" idea.  

I also need to figure out why I didn't encounter errors in v3, because I distinctly
remember underflowing the refcount before adding the defunct flag...
Paolo Bonzini March 2, 2022, 9:22 p.m. UTC | #11
On 3/2/22 21:47, Sean Christopherson wrote:
> On Wed, Mar 02, 2022, Paolo Bonzini wrote:
>> For now let's do it the simple but ugly way.  Keeping
>> next_invalidated_root() does not make things worse than the status quo, and
>> further work will be easier to review if it's kept separate from this
>> already-complex work.
> 
> Oof, that's not gonna work.  My approach here in v3 doesn't work either.  I finally
> remembered why I had the dedicated tdp_mmu_defunct_root flag and thus the smp_mb_*()
> dance.
> 
> kvm_tdp_mmu_zap_invalidated_roots() assumes that it was gifted a reference to
> _all_ invalid roots by kvm_tdp_mmu_invalidate_all_roots().  This works in the
> current code base only because kvm->slots_lock is held for the entire duration,
> i.e. roots can't become invalid between the end of kvm_tdp_mmu_invalidate_all_roots()
> and the end of kvm_tdp_mmu_zap_invalidated_roots().

Yeah, of course that doesn't work if kvm_tdp_mmu_zap_invalidated_roots() 
calls kvm_tdp_mmu_put_root() and the worker also does the same 
kvm_tdp_mmu_put_root().

But, it seems to me that we were so close to something that works and is 
elegant with the worker idea.  It does avoid the possibility of two 
"puts", because the work item is created on the valid->invalid 
transition.  What do you think of having a separate workqueue for each 
struct kvm, so that kvm_tdp_mmu_zap_invalidated_roots() can be replaced 
with a flush?  I can probably do it next Friday.
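
The "queue on the valid->invalid transition, then flush" shape suggested above can be sketched as a userspace toy. This is an illustration only: it is single-threaded, the toy_* names are hypothetical, and it does not use the kernel's workqueue API (a real version would presumably rely on a dedicated workqueue and flush_workqueue()). The point it shows is that each root is queued exactly once, so the flush runs each zap exactly once and no second "put" can occur.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_WORK 8

/* Toy root; only tracks whether its zap has run. */
struct toy_wq_root {
	bool zapped;
};

/* Toy per-VM queue; a stand-in for a dedicated kernel workqueue. */
struct toy_vm {
	struct toy_wq_root *pending[TOY_MAX_WORK];
	size_t nr_pending;
};

/* Queue a zap on the valid->invalid transition; false if the queue is full. */
bool toy_queue_zap(struct toy_vm *vm, struct toy_wq_root *root)
{
	if (vm->nr_pending == TOY_MAX_WORK)
		return false;
	vm->pending[vm->nr_pending++] = root;
	return true;
}

/* Run every queued zap to completion; returns how many items ran. */
size_t toy_flush(struct toy_vm *vm)
{
	size_t ran = 0;

	for (size_t i = 0; i < vm->nr_pending; i++) {
		assert(vm->pending[i] != NULL);
		vm->pending[i]->zapped = true;
		ran++;
	}
	vm->nr_pending = 0;	/* the queue is empty once the flush returns */
	return ran;
}
```

A caller that previously walked the list of invalidated roots would instead call toy_flush() and rely on the queue being empty afterwards.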

Paolo

> 
> Marking a root invalid in kvm_tdp_mmu_put_root() breaks that assumption, e.g. if a
> new root is created and then dropped, it will be marked invalid but the "fast zap"
> will not have a reference.  The "defunct" flag prevents this scenario by allowing
> the "fast zap" path to identify invalid roots for which it did not take a reference.
> By virtue of holding a reference, "fast zap" also guarantees that the roots it needs
> to invalidate and put can't become defunct.
> 
> My preference would be to either go back to a variant of v2, or to implement my
> "second list" idea.
> 
> I also need to figure out why I didn't encounter errors in v3, because I distinctly
> remember underflowing the refcount before adding the defunct flag...

Sean Christopherson March 2, 2022, 10:25 p.m. UTC | #12
On Wed, Mar 02, 2022, Paolo Bonzini wrote:
> On 3/2/22 21:47, Sean Christopherson wrote:
> > On Wed, Mar 02, 2022, Paolo Bonzini wrote:
> > > For now let's do it the simple but ugly way.  Keeping
> > > next_invalidated_root() does not make things worse than the status quo, and
> > > further work will be easier to review if it's kept separate from this
> > > already-complex work.
> > 
> > Oof, that's not gonna work.  My approach here in v3 doesn't work either.  I finally
> > remembered why I had the dedicated tdp_mmu_defunct_root flag and thus the smp_mb_*()
> > dance.
> > 
> > kvm_tdp_mmu_zap_invalidated_roots() assumes that it was gifted a reference to
> > _all_ invalid roots by kvm_tdp_mmu_invalidate_all_roots().  This works in the
> > current code base only because kvm->slots_lock is held for the entire duration,
> > i.e. roots can't become invalid between the end of kvm_tdp_mmu_invalidate_all_roots()
> > and the end of kvm_tdp_mmu_zap_invalidated_roots().
> 
> Yeah, of course that doesn't work if kvm_tdp_mmu_zap_invalidated_roots()
> calls kvm_tdp_mmu_put_root() and the worker also does the same
> kvm_tdp_mmu_put_root().
> 
> But, it seems to me that we were so close to something that works and is
> elegant with the worker idea.  It does avoid the possibility of two "puts",
> because the work item is created on the valid->invalid transition.  What do
> you think of having a separate workqueue for each struct kvm, so that
> kvm_tdp_mmu_zap_invalidated_roots() can be replaced with a flush?

I definitely like the idea, but I'm getting another feeling of deja vu.  Ah, I
think the mess I created was zapping via async worker without a dedicated workqueue,
and so the flush became very annoying/painful.

I have the "dedicated list" idea coded up.  If testing looks good, I'll post it as
a v3.5 (without your xchg() magic or other kvm_tdp_mmu_put_root() changes).  That
way we have a less-awful backup (and/or an intermediate step) if the workqueue
idea is delayed or doesn't work.  Assuming it works, it's much prettier than having
a defunct flag.

> I can probably do it next Friday.

Early-ish warning, I'll be offline March 11th - March 23rd inclusive.  

FWIW, other than saving me from another painful rebase, there's no urgent need to
get this series into 5.18.

Patch

diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index be063b6c91b7..1bff453f7cbe 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -65,7 +65,13 @@  struct kvm_mmu_page {
 		struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */
 		tdp_ptep_t ptep;
 	};
-	DECLARE_BITMAP(unsync_child_bitmap, 512);
+	union {
+		DECLARE_BITMAP(unsync_child_bitmap, 512);
+		struct {
+			struct work_struct tdp_mmu_async_work;
+			void *tdp_mmu_async_data;
+		};
+	};
 
 	struct list_head lpage_disallowed_link;
 #ifdef CONFIG_X86_32
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index ec28a88c6376..4151e61245a7 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -81,6 +81,38 @@  static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
 static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
 			     bool shared);
 
+static void tdp_mmu_zap_root_async(struct work_struct *work)
+{
+	struct kvm_mmu_page *root = container_of(work, struct kvm_mmu_page,
+						 tdp_mmu_async_work);
+	struct kvm *kvm = root->tdp_mmu_async_data;
+
+	read_lock(&kvm->mmu_lock);
+
+	/*
+	 * A TLB flush is not necessary as KVM performs a local TLB flush when
+	 * allocating a new root (see kvm_mmu_load()), and when migrating vCPU
+	 * to a different pCPU.  Note, the local TLB flush on reuse also
+	 * invalidates any paging-structure-cache entries, i.e. TLB entries for
+	 * intermediate paging structures, that may be zapped, as such entries
+	 * are associated with the ASID on both VMX and SVM.
+	 */
+	tdp_mmu_zap_root(kvm, root, true);
+
+	/*
+	 * Drop the refcount using kvm_tdp_mmu_put_root() to test its logic for
+	 * avoiding an infinite loop.  By design, the root is reachable while
+	 * it's being asynchronously zapped, thus a different task can put its
+	 * last reference, i.e. flowing through kvm_tdp_mmu_put_root() for an
+	 * asynchronously zapped root is unavoidable.
+	 */
+	kvm_tdp_mmu_put_root(kvm, root, true);
+
+	read_unlock(&kvm->mmu_lock);
+
+	kvm_put_kvm(kvm);
+}
+
 void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
 			  bool shared)
 {
@@ -142,15 +174,26 @@  void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
 	refcount_set(&root->tdp_mmu_root_count, 1);
 
 	/*
-	 * Zap the root, then put the refcount "acquired" above.   Recursively
-	 * call kvm_tdp_mmu_put_root() to test the above logic for avoiding an
-	 * infinite loop by freeing invalid roots.  By design, the root is
-	 * reachable while it's being zapped, thus a different task can put its
-	 * last reference, i.e. flowing through kvm_tdp_mmu_put_root() for a
-	 * defunct root is unavoidable.
+	 * Attempt to acquire a reference to KVM itself.  If KVM is alive, then
+	 * zap the root asynchronously in a worker, otherwise it must be zapped
+	 * directly here.  Wait to do this check until after the refcount is
+	 * reset so that tdp_mmu_zap_root() can safely yield.
+	 *
+	 * In both flows, zap the root, then put the refcount "acquired" above.
+	 * When putting the reference, use kvm_tdp_mmu_put_root() to test the
+	 * above logic for avoiding an infinite loop by freeing invalid roots.
+	 * By design, the root is reachable while it's being zapped, thus a
+	 * different task can put its last reference, i.e. flowing through
+	 * kvm_tdp_mmu_put_root() for a defunct root is unavoidable.
 	 */
-	tdp_mmu_zap_root(kvm, root, shared);
-	kvm_tdp_mmu_put_root(kvm, root, shared);
+	if (kvm_get_kvm_safe(kvm)) {
+		root->tdp_mmu_async_data = kvm;
+		INIT_WORK(&root->tdp_mmu_async_work, tdp_mmu_zap_root_async);
+		schedule_work(&root->tdp_mmu_async_work);
+	} else {
+		tdp_mmu_zap_root(kvm, root, shared);
+		kvm_tdp_mmu_put_root(kvm, root, shared);
+	}
 }
 
 enum tdp_mmu_roots_iter_type {
@@ -954,7 +997,11 @@  void kvm_tdp_mmu_zap_all(struct kvm *kvm)
 
 	/*
 	 * Zap all roots, including invalid roots, as all SPTEs must be dropped
-	 * before returning to the caller.
+	 * before returning to the caller.  Zap directly even if the root is
+	 * also being zapped by a worker.  Walking zapped top-level SPTEs isn't
+	 * all that expensive and mmu_lock is already held, which means the
+	 * worker has yielded, i.e. flushing the work instead of zapping here
+	 * isn't guaranteed to be any faster.
 	 *
 	 * A TLB flush is unnecessary, KVM zaps everything if and only the VM
 	 * is being destroyed or the userspace VMM has exited.  In both cases,