
[v4,21/30] KVM: x86/mmu: Zap invalidated roots via asynchronous worker

Message ID 20220303193842.370645-22-pbonzini@redhat.com (mailing list archive)
State New, archived
Series KVM: x86/mmu: Overhaul TDP MMU zapping and flushing

Commit Message

Paolo Bonzini March 3, 2022, 7:38 p.m. UTC
Use the system work queue also for roots invalidated by the TDP MMU's
"fast zap" mechanism implemented by kvm_tdp_mmu_invalidate_all_roots().
Currently this is done by kvm_tdp_mmu_zap_invalidated_roots(), but
there is no need to duplicate the code between the "normal"
kvm_tdp_mmu_put_root() path and the invalidation case.  The
only issue is that kvm_tdp_mmu_invalidate_all_roots() now
assumes that there is at least one reference in kvm->users_count;
so if the VM is dying just go through the slow path, as there is
nothing to gain by using the fast zapping.

Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |   2 +
 arch/x86/kvm/mmu/mmu.c          |   6 +-
 arch/x86/kvm/mmu/mmu_internal.h |   8 +-
 arch/x86/kvm/mmu/tdp_mmu.c      | 158 +++++++++++++++-----------------
 4 files changed, 86 insertions(+), 88 deletions(-)

Comments

Sean Christopherson March 3, 2022, 8:54 p.m. UTC | #1
On Thu, Mar 03, 2022, Paolo Bonzini wrote:
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 0b88592495f8..9287ee078c49 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5730,7 +5730,6 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
>  	kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_FREE_OBSOLETE_ROOTS);
>  
>  	kvm_zap_obsolete_pages(kvm);
> -

Spurious whitespace deletion.

>  	write_unlock(&kvm->mmu_lock);
>  
>  	/*
> @@ -5741,11 +5740,8 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
>  	 * Deferring the zap until the final reference to the root is put would
>  	 * lead to use-after-free.
>  	 */
> -	if (is_tdp_mmu_enabled(kvm)) {
> -		read_lock(&kvm->mmu_lock);
> +	if (is_tdp_mmu_enabled(kvm))
>  		kvm_tdp_mmu_zap_invalidated_roots(kvm);
> -		read_unlock(&kvm->mmu_lock);
> -	}
>  }
>  
>  static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)

...

> +static void tdp_mmu_schedule_zap_root(struct kvm *kvm, struct kvm_mmu_page *root)
> +{

Definitely worth doing (I'll provide more info in the "Zap defunct roots" patch):

	WARN_ON_ONCE(!root->role.invalid || root->tdp_mmu_async_data);

The assertion on role.invalid is a little overkill, but might help document when
and how this is used.

> +	root->tdp_mmu_async_data = kvm;
> +	INIT_WORK(&root->tdp_mmu_async_work, tdp_mmu_zap_root_work);
> +	queue_work(kvm->arch.tdp_mmu_zap_wq, &root->tdp_mmu_async_work);
> +}
> +
> +static inline bool kvm_tdp_root_mark_invalid(struct kvm_mmu_page *page)
> +{
> +	union kvm_mmu_page_role role = page->role;
> +	role.invalid = true;
> +
> +	/* No need to use cmpxchg, only the invalid bit can change.  */
> +	role.word = xchg(&page->role.word, role.word);
> +	return role.invalid;

This helper is unused.  It _could_ be used here, but I think it belongs in the
next patch.  Critically, until zapping defunct roots creates the invariant that
invalid roots are _always_ zapped via worker, kvm_tdp_mmu_invalidate_all_roots()
must not assume that an invalid root is queued for zapping.  I.e. doing this
before the "Zap defunct roots" would be wrong:

	list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
		if (kvm_tdp_root_mark_invalid(root))
			continue;

		if (WARN_ON_ONCE(!kvm_tdp_mmu_get_root(root)))
			continue;

		tdp_mmu_schedule_zap_root(kvm, root);
	}
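
For readers who want to see the "set the invalid bit and report whether it was
already set" semantics of kvm_tdp_root_mark_invalid() in isolation, here is a
minimal userspace sketch.  It is a model, not the kernel code: it swaps in C11
atomic_fetch_or() for the kernel's xchg() on role.word, and ROLE_INVALID,
struct model_root and model_root_mark_invalid() are hypothetical names invented
for the illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the invalid bit inside role.word. */
#define ROLE_INVALID (1u << 0)

/* Hypothetical, heavily reduced model of a TDP MMU root. */
struct model_root {
	_Atomic uint32_t role_word;
};

/*
 * Atomically set the invalid bit and return true if the root was
 * already invalid, i.e. if some other path marked it first.
 */
static bool model_root_mark_invalid(struct model_root *root)
{
	uint32_t old = atomic_fetch_or(&root->role_word, ROLE_INVALID);

	return (old & ROLE_INVALID) != 0;
}
```

A caller that sees "true" knows another path invalidated the root first; as
noted above, that alone does not say whether the root is currently queued for
zapping.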
Sean Christopherson March 3, 2022, 9:06 p.m. UTC | #2
On Thu, Mar 03, 2022, Sean Christopherson wrote:
> On Thu, Mar 03, 2022, Paolo Bonzini wrote:
> > +	root->tdp_mmu_async_data = kvm;
> > +	INIT_WORK(&root->tdp_mmu_async_work, tdp_mmu_zap_root_work);
> > +	queue_work(kvm->arch.tdp_mmu_zap_wq, &root->tdp_mmu_async_work);
> > +}
> > +
> > +static inline bool kvm_tdp_root_mark_invalid(struct kvm_mmu_page *page)
> > +{
> > +	union kvm_mmu_page_role role = page->role;
> > +	role.invalid = true;
> > +
> > +	/* No need to use cmpxchg, only the invalid bit can change.  */
> > +	role.word = xchg(&page->role.word, role.word);
> > +	return role.invalid;
> 
> This helper is unused.  It _could_ be used here, but I think it belongs in the
> next patch.  Critically, until zapping defunct roots creates the invariant that
> invalid roots are _always_ zapped via worker, kvm_tdp_mmu_invalidate_all_roots()
> must not assume that an invalid root is queued for zapping.  I.e. doing this
> before the "Zap defunct roots" would be wrong:
> 
> 	list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
> 		if (kvm_tdp_root_mark_invalid(root))
> 			continue;
> 
> 		if (WARN_ON_ONCE(!kvm_tdp_mmu_get_root(root)));
> 			continue;
> 
> 		tdp_mmu_schedule_zap_root(kvm, root);
> 	}

Gah, lost my train of thought and forgot that this _can_ re-queue a root even in
this patch, it just can't re-queue a root that is _currently_ queued.

The re-queue scenario happens if a root is queued and zapped, but is kept alive
by a vCPU that hasn't yet put its reference.  If another memslot update comes
along before the (sleeping) vCPU drops its reference, this will re-queue the root.

It's not a major problem in this patch as it's a small amount of wasted effort,
but it will be an issue when the "put" path starts using the queue, as that will
create a scenario where a memslot update (or NX toggle) can come along while a
defunct root is in the zap queue.

Checking for role.invalid is wrong (as above), so for this patch I think the
easiest thing is to use tdp_mmu_async_data as a sentinel that the root was zapped
in the past and doesn't need to be re-zapped.

/*
 * Mark each TDP MMU root as invalid to prevent vCPUs from reusing a root that
 * is about to be zapped, e.g. in response to a memslots update.  The actual
 * zapping is performed asynchronously, so a reference is taken on all roots.
 * Using a separate workqueue makes it easy to ensure that the destruction is
 * performed before the "fast zap" completes, without keeping a separate list
 * of invalidated roots; the list is effectively the list of work items in
 * the workqueue.
 *
 * Skip roots that were already queued for zapping, the "fast zap" path is the
 * only user of the zap queue and always flushes the queue under slots_lock,
 * i.e. the queued zap is guaranteed to have completed already.
 *
 * Because mmu_lock is held for write, it should be impossible to observe a
 * root with zero refcount, i.e. the list of roots cannot be stale.
 *
 * This has essentially the same effect for the TDP MMU
 * as updating mmu_valid_gen does for the shadow MMU.
 */
void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
{
	struct kvm_mmu_page *root;

	lockdep_assert_held_write(&kvm->mmu_lock);
	list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
		if (root->tdp_mmu_async_data)
			continue;

		if (WARN_ON_ONCE(!kvm_tdp_mmu_get_root(root)))
			continue;

		root->role.invalid = true;
		tdp_mmu_schedule_zap_root(kvm, root);
	}
}
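
The sentinel logic above can be exercised outside the kernel with a small
userspace model.  This is only a sketch under stated assumptions: the list is
replaced by an array, locking is omitted, "queueing" is reduced to setting the
sentinel and counting, and every name (struct zap_root, model_get_root(),
model_invalidate_all_roots()) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced model of a TDP MMU root. */
struct zap_root {
	int refcount;
	bool invalid;
	void *async_data;	/* non-NULL once the root has been queued */
};

/* Take a reference unless the refcount already dropped to zero. */
static bool model_get_root(struct zap_root *root)
{
	if (root->refcount == 0)
		return false;
	root->refcount++;
	return true;
}

/*
 * Invalidate every root that has not been queued before and return how
 * many roots were "queued".  Roots whose sentinel is already set are
 * skipped, so no extra reference is taken on them.
 */
static size_t model_invalidate_all_roots(void *vm, struct zap_root *roots,
					 size_t nr_roots)
{
	size_t queued = 0;

	for (size_t i = 0; i < nr_roots; i++) {
		struct zap_root *root = &roots[i];

		if (root->async_data)
			continue;

		if (!model_get_root(root))
			continue;

		root->invalid = true;
		root->async_data = vm;	/* stands in for queue_work() */
		queued++;
	}
	return queued;
}
```

Calling the function twice on the same array queues each root only once; the
second pass leaves the reference counts untouched.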
Sean Christopherson March 3, 2022, 9:20 p.m. UTC | #3
On Thu, Mar 03, 2022, Paolo Bonzini wrote:
> The only issue is that kvm_tdp_mmu_invalidate_all_roots() now assumes that
> there is at least one reference in kvm->users_count; so if the VM is dying
> just go through the slow path, as there is nothing to gain by using the fast
> zapping.

This isn't actually implemented. :-)
Sean Christopherson March 3, 2022, 9:32 p.m. UTC | #4
On Thu, Mar 03, 2022, Sean Christopherson wrote:
> On Thu, Mar 03, 2022, Paolo Bonzini wrote:
> > The only issue is that kvm_tdp_mmu_invalidate_all_roots() now assumes that
> > there is at least one reference in kvm->users_count; so if the VM is dying
> > just go through the slow path, as there is nothing to gain by using the fast
> > zapping.
> 
> This isn't actually implemented. :-)

Oh, and when you implement it (or copy paste), can you also add a lockdep and a
comment about the check being racy, but that the race is benign?  It took me a
second to realize why it's safe to use a work queue without holding a reference
to @kvm.

static void kvm_mmu_zap_all_fast(struct kvm *kvm)
{
	lockdep_assert_held(&kvm->slots_lock);

	/*
	 * Zap using the "slow" path if the VM is being destroyed.  The "slow"
	 * path isn't actually slower, it just doesn't block vCPUs for an
	 * extended duration, which is irrelevant if the VM is dying.
	 *
	 * Note, this doesn't guarantee users_count won't go to '0' immediately
	 * after this check, but that race is benign as callers that don't hold
	 * a reference to @kvm must hold kvm_lock to prevent use-after-free.
	 */
	if (unlikely(!refcount_read(&kvm->users_count))) {
		lockdep_assert_held(&kvm_lock);
		kvm_mmu_zap_all(kvm);
		return;
	}

	write_lock(&kvm->mmu_lock);
	trace_kvm_mmu_zap_all_fast(kvm);
Paolo Bonzini March 4, 2022, 6:48 a.m. UTC | #5
On 3/3/22 22:32, Sean Christopherson wrote:

> The re-queue scenario happens if a root is queued and zapped, but is kept alive
> by a vCPU that hasn't yet put its reference.  If another memslot comes along before
> the (sleeping) vCPU drops its reference, this will re-queue the root.
> 
> It's not a major problem in this patch as it's a small amount of wasted effort,
> but it will be an issue when the "put" path starts using the queue, as that will
> create a scenario where a memslot update (or NX toggle) can come along while a
> defunct root is in the zap queue.

As of this patch it's not a problem because 
kvm_tdp_mmu_invalidate_all_roots()'s caller holds kvm->slots_lock, so 
kvm_tdp_mmu_invalidate_all_roots() is guaranteed to queue its work items 
on an empty workqueue.  In effect the workqueue is just a fancy list. 
But as you point out in the review to patch 24, it becomes a problem 
when there's no kvm->slots_lock to guarantee that.  Then it needs to 
check that the root isn't already invalid.

>>> The only issue is that kvm_tdp_mmu_invalidate_all_roots() now assumes that
>>> there is at least one reference in kvm->users_count; so if the VM is dying
>>> just go through the slow path, as there is nothing to gain by using the fast
>>> zapping.
>> This isn't actually implemented. :-)
> Oh, and when you implement it (or copy paste), can you also add a lockdep and a
> comment about the check being racy, but that the race is benign?  It took me a
> second to realize why it's safe to use a work queue without holding a reference
> to @kvm.

I didn't remove the paragraph from the commit message, but I think it's 
unnecessary now.  The workqueue is flushed in kvm_mmu_zap_all_fast() and 
kvm_mmu_uninit_tdp_mmu(), unlike the buggy patch, so it doesn't need to 
take a reference to the VM.

I think I don't even need to check kvm->users_count in the defunct root 
case, as long as kvm_mmu_uninit_tdp_mmu() flushes and destroys the 
workqueue before it checks that the lists are empty.
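
The ordering argument (flush the workqueue first, then check that the lists are
empty) can be illustrated with a tiny single-threaded model.  This is a sketch
only, with a fixed-size array standing in for the workqueue; struct model_wq,
model_queue(), model_flush() and model_roots_list_empty() are hypothetical
names, not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_WORK 8

/* Hypothetical root: stays on the roots list while references remain. */
struct model_root {
	int refcount;
	bool on_list;
};

/* Hypothetical workqueue: just an array of pending roots. */
struct model_wq {
	struct model_root *pending[MAX_WORK];
	size_t count;
};

/* Queue a root whose single reference is gifted to the worker. */
static bool model_queue(struct model_wq *wq, struct model_root *root)
{
	if (wq->count == MAX_WORK)
		return false;
	wq->pending[wq->count++] = root;
	return true;
}

/*
 * Run every pending item: drop the gifted reference and take the root
 * off the list once the last reference is gone.
 */
static void model_flush(struct model_wq *wq)
{
	for (size_t i = 0; i < wq->count; i++) {
		struct model_root *root = wq->pending[i];

		if (--root->refcount == 0)
			root->on_list = false;
	}
	wq->count = 0;
}

/* The "lists are empty" check that must only run after the flush. */
static bool model_roots_list_empty(const struct model_root *roots, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (roots[i].on_list)
			return false;
	}
	return true;
}
```

In this model the emptiness check fails while work is still pending and passes
once model_flush() has run, which is the reason the flush has to come first.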

I'll wait to hear from you later today before sending out v5.

Paolo
Sean Christopherson March 4, 2022, 4:02 p.m. UTC | #6
On Fri, Mar 04, 2022, Paolo Bonzini wrote:
> On 3/3/22 22:32, Sean Christopherson wrote:
> I didn't remove the paragraph from the commit message, but I think it's
> unnecessary now.  The workqueue is flushed in kvm_mmu_zap_all_fast() and
> kvm_mmu_uninit_tdp_mmu(), unlike the buggy patch, so it doesn't need to take
> a reference to the VM.
> 
> I think I don't even need to check kvm->users_count in the defunct root
> case, as long as kvm_mmu_uninit_tdp_mmu() flushes and destroys the workqueue
> before it checks that the lists are empty.

Yes, that should work.  IIRC, the WARN_ONs will tell us/you quite quickly if
we're wrong :-)  mmu_notifier_unregister() will call the "slow" kvm_mmu_zap_all()
and thus ensure all non-root pages are zapped, but "leaking" a worker will trigger
the WARN_ON that there are no roots on the list.
Paolo Bonzini March 4, 2022, 6:11 p.m. UTC | #7
On 3/4/22 17:02, Sean Christopherson wrote:
> On Fri, Mar 04, 2022, Paolo Bonzini wrote:
>> On 3/3/22 22:32, Sean Christopherson wrote:
>> I didn't remove the paragraph from the commit message, but I think it's
>> unnecessary now.  The workqueue is flushed in kvm_mmu_zap_all_fast() and
>> kvm_mmu_uninit_tdp_mmu(), unlike the buggy patch, so it doesn't need to take
>> a reference to the VM.
>>
>> I think I don't even need to check kvm->users_count in the defunct root
>> case, as long as kvm_mmu_uninit_tdp_mmu() flushes and destroys the workqueue
>> before it checks that the lists are empty.
> 
> Yes, that should work.  IIRC, the WARN_ONs will tell us/you quite quickly if
> we're wrong :-)  mmu_notifier_unregister() will call the "slow" kvm_mmu_zap_all()
> and thus ensure all non-root pages zapped, but "leaking" a worker will trigger
> the WARN_ON that there are no roots on the list.

Good, for the record these are the commit messages I have:

     KVM: x86/mmu: Zap invalidated roots via asynchronous worker
     
     Use the system worker threads to zap the roots invalidated
     by the TDP MMU's "fast zap" mechanism, implemented by
     kvm_tdp_mmu_invalidate_all_roots().
     
     At this point, apart from allowing some parallelism in the zapping of
     roots, the workqueue is a glorified linked list: work items are added and
     flushed entirely within a single kvm->slots_lock critical section.  However,
     the workqueue fixes a latent issue where kvm_mmu_zap_all_invalidated_roots()
     assumes that it owns a reference to all invalid roots; therefore, no
     one can set the invalid bit outside kvm_mmu_zap_all_fast().  Putting the
     invalidated roots on a linked list... erm, on a workqueue ensures that
     tdp_mmu_zap_root_work() only puts back those extra references that
     kvm_mmu_zap_all_invalidated_roots() had gifted to it.

and

     KVM: x86/mmu: Zap defunct roots via asynchronous worker
     
     Zap defunct roots, a.k.a. roots that have been invalidated after their
     last reference was initially dropped, asynchronously via the existing work
     queue instead of forcing the work upon the unfortunate task that happened
     to drop the last reference.
     
     If a vCPU task drops the last reference, the vCPU is effectively blocked
     by the host for the entire duration of the zap.  If the root being zapped
     happens to be fully populated with 4kb leaf SPTEs, e.g. due to dirty logging
     being active, the zap can take several hundred seconds.  Unsurprisingly,
     most guests are unhappy if a vCPU disappears for hundreds of seconds.
     
     E.g. running a synthetic selftest that triggers a vCPU root zap with
     ~64tb of guest memory and 4kb SPTEs blocks the vCPU for 900+ seconds.
     Offloading the zap to a worker drops the block time to <100ms.
     
     There is an important nuance to this change.  If the same work item
     were queued twice before the work function has run, it would only
     execute once and one reference would be leaked.  Therefore, now that
     queueing items is no longer protected by write_lock(&kvm->mmu_lock),
     kvm_tdp_mmu_invalidate_all_roots() has to check root->role.invalid and
     skip already invalid roots.  On the other hand, kvm_mmu_zap_all_fast()
     must return only after those skipped roots have been zapped as well.
     These two requirements can be satisfied only if _all_ places that
     change invalid to true now schedule the worker before releasing the
     mmu_lock.  There are just two, kvm_tdp_mmu_put_root() and
     kvm_tdp_mmu_invalidate_all_roots().
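
The reference leak described in the second commit message can be reproduced in
a small userspace model.  This is a sketch under stated assumptions: the
"pending" flag mimics the way queue_work() declines to queue an item that is
already pending, everything runs single-threaded, and all names (struct
model_work, model_schedule_unchecked(), model_schedule_checked(), ...) are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical work item: can be pending at most once. */
struct model_work {
	bool pending;
};

/* Hypothetical, reduced model of a TDP MMU root. */
struct model_root {
	int refcount;
	bool invalid;
	struct model_work work;
};

/* Returns false, and does nothing, if the item is already pending. */
static bool model_queue_work(struct model_work *work)
{
	if (work->pending)
		return false;
	work->pending = true;
	return true;
}

/* The worker runs once per pending item and drops one gifted reference. */
static void model_run_work(struct model_root *root)
{
	root->work.pending = false;
	root->refcount--;
}

/* Buggy variant: gifts a reference even if the item is already queued. */
static void model_schedule_unchecked(struct model_root *root)
{
	root->refcount++;
	root->invalid = true;
	model_queue_work(&root->work);
}

/* Fixed variant: skip roots that are already invalid, as described above. */
static bool model_schedule_checked(struct model_root *root)
{
	if (root->invalid)
		return false;
	root->refcount++;
	root->invalid = true;
	return model_queue_work(&root->work);
}
```

Scheduling twice with the unchecked variant gifts two references but the worker
runs only once, so one reference is never put; the checked variant gifts
exactly one reference per queued item.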

Paolo
Sean Christopherson March 5, 2022, 12:34 a.m. UTC | #8
On Fri, Mar 04, 2022, Paolo Bonzini wrote:
> On 3/4/22 17:02, Sean Christopherson wrote:
> > On Fri, Mar 04, 2022, Paolo Bonzini wrote:
> > > On 3/3/22 22:32, Sean Christopherson wrote:
> > > I didn't remove the paragraph from the commit message, but I think it's
> > > unnecessary now.  The workqueue is flushed in kvm_mmu_zap_all_fast() and
> > > kvm_mmu_uninit_tdp_mmu(), unlike the buggy patch, so it doesn't need to take
> > > a reference to the VM.
> > > 
> > > I think I don't even need to check kvm->users_count in the defunct root
> > > case, as long as kvm_mmu_uninit_tdp_mmu() flushes and destroys the workqueue
> > > before it checks that the lists are empty.
> > 
> > Yes, that should work.  IIRC, the WARN_ONs will tell us/you quite quickly if
> > we're wrong :-)  mmu_notifier_unregister() will call the "slow" kvm_mmu_zap_all()
> > and thus ensure all non-root pages zapped, but "leaking" a worker will trigger
> > the WARN_ON that there are no roots on the list.
> 
> Good, for the record these are the commit messages I have:
> 
>     KVM: x86/mmu: Zap invalidated roots via asynchronous worker
>     Use the system worker threads to zap the roots invalidated
>     by the TDP MMU's "fast zap" mechanism, implemented by
>     kvm_tdp_mmu_invalidate_all_roots().
>     At this point, apart from allowing some parallelism in the zapping of
>     roots, the workqueue is a glorified linked list: work items are added and
>     flushed entirely within a single kvm->slots_lock critical section.  However,
>     the workqueue fixes a latent issue where kvm_mmu_zap_all_invalidated_roots()
>     assumes that it owns a reference to all invalid roots; therefore, no
>     one can set the invalid bit outside kvm_mmu_zap_all_fast().  Putting the
>     invalidated roots on a linked list... erm, on a workqueue ensures that
>     tdp_mmu_zap_root_work() only puts back those extra references that
>     kvm_mmu_zap_all_invalidated_roots() had gifted to it.
> 
> and
> 
>     KVM: x86/mmu: Zap defunct roots via asynchronous worker
>     Zap defunct roots, a.k.a. roots that have been invalidated after their
>     last reference was initially dropped, asynchronously via the existing work
>     queue instead of forcing the work upon the unfortunate task that happened
>     to drop the last reference.
>     If a vCPU task drops the last reference, the vCPU is effectively blocked
>     by the host for the entire duration of the zap.  If the root being zapped
>     happens be fully populated with 4kb leaf SPTEs, e.g. due to dirty logging
>     being active, the zap can take several hundred seconds.  Unsurprisingly,
>     most guests are unhappy if a vCPU disappears for hundreds of seconds.
>     E.g. running a synthetic selftest that triggers a vCPU root zap with
>     ~64tb of guest memory and 4kb SPTEs blocks the vCPU for 900+ seconds.
>     Offloading the zap to a worker drops the block time to <100ms.
>     There is an important nuance to this change.  If the same work item
>     was queued twice before the work function has run, it would only
>     execute once and one reference would be leaked.  Therefore, now that
>     queueing items is not anymore protected by write_lock(&kvm->mmu_lock),
>     kvm_tdp_mmu_invalidate_all_roots() has to check root->role.invalid and
>     skip already invalid roots.  On the other hand, kvm_mmu_zap_all_fast()
>     must return only after those skipped roots have been zapped as well.
>     These two requirements can be satisfied only if _all_ places that
>     change invalid to true now schedule the worker before releasing the
>     mmu_lock.  There are just two, kvm_tdp_mmu_put_root() and
>     kvm_tdp_mmu_invalidate_all_roots().

Very nice!
Paolo Bonzini March 5, 2022, 7:53 p.m. UTC | #9
On 3/5/22 01:34, Sean Christopherson wrote:
> On Fri, Mar 04, 2022, Paolo Bonzini wrote:
>> On 3/4/22 17:02, Sean Christopherson wrote:
>>> On Fri, Mar 04, 2022, Paolo Bonzini wrote:
>>>> On 3/3/22 22:32, Sean Christopherson wrote:
>>>> I didn't remove the paragraph from the commit message, but I think it's
>>>> unnecessary now.  The workqueue is flushed in kvm_mmu_zap_all_fast() and
>>>> kvm_mmu_uninit_tdp_mmu(), unlike the buggy patch, so it doesn't need to take
>>>> a reference to the VM.
>>>>
>>>> I think I don't even need to check kvm->users_count in the defunct root
>>>> case, as long as kvm_mmu_uninit_tdp_mmu() flushes and destroys the workqueue
>>>> before it checks that the lists are empty.
>>>
>>> Yes, that should work.  IIRC, the WARN_ONs will tell us/you quite quickly if
>>> we're wrong :-)  mmu_notifier_unregister() will call the "slow" kvm_mmu_zap_all()
>>> and thus ensure all non-root pages zapped, but "leaking" a worker will trigger
>>> the WARN_ON that there are no roots on the list.
>>
>> Good, for the record these are the commit messages I have:

I'm seeing some hangs in ~50% of installation jobs, both Windows and 
Linux.  I have not yet tried to reproduce outside the automated tests, 
or to bisect, but I'll try to push at least the first part of the series 
for 5.18.

Paolo

>>      KVM: x86/mmu: Zap invalidated roots via asynchronous worker
>>      Use the system worker threads to zap the roots invalidated
>>      by the TDP MMU's "fast zap" mechanism, implemented by
>>      kvm_tdp_mmu_invalidate_all_roots().
>>      At this point, apart from allowing some parallelism in the zapping of
>>      roots, the workqueue is a glorified linked list: work items are added and
>>      flushed entirely within a single kvm->slots_lock critical section.  However,
>>      the workqueue fixes a latent issue where kvm_mmu_zap_all_invalidated_roots()
>>      assumes that it owns a reference to all invalid roots; therefore, no
>>      one can set the invalid bit outside kvm_mmu_zap_all_fast().  Putting the
>>      invalidated roots on a linked list... erm, on a workqueue ensures that
>>      tdp_mmu_zap_root_work() only puts back those extra references that
>>      kvm_mmu_zap_all_invalidated_roots() had gifted to it.
>>
>> and
>>
>>      KVM: x86/mmu: Zap defunct roots via asynchronous worker
>>      Zap defunct roots, a.k.a. roots that have been invalidated after their
>>      last reference was initially dropped, asynchronously via the existing work
>>      queue instead of forcing the work upon the unfortunate task that happened
>>      to drop the last reference.
>>      If a vCPU task drops the last reference, the vCPU is effectively blocked
>>      by the host for the entire duration of the zap.  If the root being zapped
>>      happens be fully populated with 4kb leaf SPTEs, e.g. due to dirty logging
>>      being active, the zap can take several hundred seconds.  Unsurprisingly,
>>      most guests are unhappy if a vCPU disappears for hundreds of seconds.
>>      E.g. running a synthetic selftest that triggers a vCPU root zap with
>>      ~64tb of guest memory and 4kb SPTEs blocks the vCPU for 900+ seconds.
>>      Offloading the zap to a worker drops the block time to <100ms.
>>      There is an important nuance to this change.  If the same work item
>>      was queued twice before the work function has run, it would only
>>      execute once and one reference would be leaked.  Therefore, now that
>>      queueing items is not anymore protected by write_lock(&kvm->mmu_lock),
>>      kvm_tdp_mmu_invalidate_all_roots() has to check root->role.invalid and
>>      skip already invalid roots.  On the other hand, kvm_mmu_zap_all_fast()
>>      must return only after those skipped roots have been zapped as well.
>>      These two requirements can be satisfied only if _all_ places that
>>      change invalid to true now schedule the worker before releasing the
>>      mmu_lock.  There are just two, kvm_tdp_mmu_put_root() and
>>      kvm_tdp_mmu_invalidate_all_roots().
> 
> Very nice!
>
Sean Christopherson March 8, 2022, 9:29 p.m. UTC | #10
On Sat, Mar 05, 2022, Paolo Bonzini wrote:
> On 3/5/22 01:34, Sean Christopherson wrote:
> > On Fri, Mar 04, 2022, Paolo Bonzini wrote:
> > > On 3/4/22 17:02, Sean Christopherson wrote:
> > > > On Fri, Mar 04, 2022, Paolo Bonzini wrote:
> > > > > On 3/3/22 22:32, Sean Christopherson wrote:
> > > > > I didn't remove the paragraph from the commit message, but I think it's
> > > > > unnecessary now.  The workqueue is flushed in kvm_mmu_zap_all_fast() and
> > > > > kvm_mmu_uninit_tdp_mmu(), unlike the buggy patch, so it doesn't need to take
> > > > > a reference to the VM.
> > > > > 
> > > > > I think I don't even need to check kvm->users_count in the defunct root
> > > > > case, as long as kvm_mmu_uninit_tdp_mmu() flushes and destroys the workqueue
> > > > > before it checks that the lists are empty.
> > > > 
> > > > Yes, that should work.  IIRC, the WARN_ONs will tell us/you quite quickly if
> > > > we're wrong :-)  mmu_notifier_unregister() will call the "slow" kvm_mmu_zap_all()
> > > > and thus ensure all non-root pages zapped, but "leaking" a worker will trigger
> > > > the WARN_ON that there are no roots on the list.
> > > 
> > > Good, for the record these are the commit messages I have:
> 
> I'm seeing some hangs in ~50% of installation jobs, both Windows and Linux.
> I have not yet tried to reproduce outside the automated tests, or to bisect,
> but I'll try to push at least the first part of the series for 5.18.

Out of curiosity, what was the bug?  I see this got pushed to kvm/next.
Paolo Bonzini March 11, 2022, 5:50 p.m. UTC | #11
On 3/8/22 22:29, Sean Christopherson wrote:
>>>> Good, for the record these are the commit messages I have:
>> I'm seeing some hangs in ~50% of installation jobs, both Windows and Linux.
>> I have not yet tried to reproduce outside the automated tests, or to bisect,
>> but I'll try to push at least the first part of the series for 5.18.
> Out of curiosity, what was the bug?  I see this got pushed to kvm/next.
> 

Of course it was in another, "harmless" patch that was in front of it. :)

Paolo

Patch

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c45ab8b5c37f..fd05ad52b65c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -15,6 +15,7 @@ 
 #include <linux/cpumask.h>
 #include <linux/irq_work.h>
 #include <linux/irq.h>
+#include <linux/workqueue.h>
 
 #include <linux/kvm.h>
 #include <linux/kvm_para.h>
@@ -1218,6 +1219,7 @@  struct kvm_arch {
 	 * the thread holds the MMU lock in write mode.
 	 */
 	spinlock_t tdp_mmu_pages_lock;
+	struct workqueue_struct *tdp_mmu_zap_wq;
 #endif /* CONFIG_X86_64 */
 
 	/*
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 0b88592495f8..9287ee078c49 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5730,7 +5730,6 @@  static void kvm_mmu_zap_all_fast(struct kvm *kvm)
 	kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_FREE_OBSOLETE_ROOTS);
 
 	kvm_zap_obsolete_pages(kvm);
-
 	write_unlock(&kvm->mmu_lock);
 
 	/*
@@ -5741,11 +5740,8 @@  static void kvm_mmu_zap_all_fast(struct kvm *kvm)
 	 * Deferring the zap until the final reference to the root is put would
 	 * lead to use-after-free.
 	 */
-	if (is_tdp_mmu_enabled(kvm)) {
-		read_lock(&kvm->mmu_lock);
+	if (is_tdp_mmu_enabled(kvm))
 		kvm_tdp_mmu_zap_invalidated_roots(kvm);
-		read_unlock(&kvm->mmu_lock);
-	}
 }
 
 static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index be063b6c91b7..1bff453f7cbe 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -65,7 +65,13 @@  struct kvm_mmu_page {
 		struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */
 		tdp_ptep_t ptep;
 	};
-	DECLARE_BITMAP(unsync_child_bitmap, 512);
+	union {
+		DECLARE_BITMAP(unsync_child_bitmap, 512);
+		struct {
+			struct work_struct tdp_mmu_async_work;
+			void *tdp_mmu_async_data;
+		};
+	};
 
 	struct list_head lpage_disallowed_link;
 #ifdef CONFIG_X86_32
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 5038de0c872d..ed1bb63b342d 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -25,6 +25,8 @@  bool kvm_mmu_init_tdp_mmu(struct kvm *kvm)
 	INIT_LIST_HEAD(&kvm->arch.tdp_mmu_roots);
 	spin_lock_init(&kvm->arch.tdp_mmu_pages_lock);
 	INIT_LIST_HEAD(&kvm->arch.tdp_mmu_pages);
+	kvm->arch.tdp_mmu_zap_wq =
+		alloc_workqueue("kvm", WQ_UNBOUND|WQ_MEM_RECLAIM|WQ_CPU_INTENSIVE, 0);
 
 	return true;
 }
@@ -49,11 +51,15 @@  void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
 	WARN_ON(!list_empty(&kvm->arch.tdp_mmu_pages));
 	WARN_ON(!list_empty(&kvm->arch.tdp_mmu_roots));
 
+	flush_workqueue(kvm->arch.tdp_mmu_zap_wq);
+
 	/*
 	 * Ensure that all the outstanding RCU callbacks to free shadow pages
-	 * can run before the VM is torn down.
+	 * can run before the VM is torn down.  Work items on tdp_mmu_zap_wq
+	 * can call kvm_tdp_mmu_put_root and create new callbacks.
 	 */
 	rcu_barrier();
+	destroy_workqueue(kvm->arch.tdp_mmu_zap_wq);
 }
 
 static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
@@ -81,6 +87,53 @@  static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
 static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
 			     bool shared);
 
+static void tdp_mmu_zap_root_work(struct work_struct *work)
+{
+	struct kvm_mmu_page *root = container_of(work, struct kvm_mmu_page,
+						 tdp_mmu_async_work);
+	struct kvm *kvm = root->tdp_mmu_async_data;
+
+	read_lock(&kvm->mmu_lock);
+
+	/*
+	 * A TLB flush is not necessary as KVM performs a local TLB flush when
+	 * allocating a new root (see kvm_mmu_load()), and when migrating vCPU
+	 * to a different pCPU.  Note, the local TLB flush on reuse also
+	 * invalidates any paging-structure-cache entries, i.e. TLB entries for
+	 * intermediate paging structures, that may be zapped, as such entries
+	 * are associated with the ASID on both VMX and SVM.
+	 */
+	tdp_mmu_zap_root(kvm, root, true);
+
+	/*
+	 * Drop the refcount using kvm_tdp_mmu_put_root() to test its logic for
+	 * avoiding an infinite loop.  By design, the root is reachable while
+	 * it's being asynchronously zapped, thus a different task can put its
+	 * last reference, i.e. flowing through kvm_tdp_mmu_put_root() for an
+	 * asynchronously zapped root is unavoidable.
+	 */
+	kvm_tdp_mmu_put_root(kvm, root, true);
+
+	read_unlock(&kvm->mmu_lock);
+}
+
+static void tdp_mmu_schedule_zap_root(struct kvm *kvm, struct kvm_mmu_page *root)
+{
+	root->tdp_mmu_async_data = kvm;
+	INIT_WORK(&root->tdp_mmu_async_work, tdp_mmu_zap_root_work);
+	queue_work(kvm->arch.tdp_mmu_zap_wq, &root->tdp_mmu_async_work);
+}
+
+static inline bool kvm_tdp_root_mark_invalid(struct kvm_mmu_page *page)
+{
+	union kvm_mmu_page_role role = page->role;
+	role.invalid = true;
+
+	/* No need to use cmpxchg, only the invalid bit can change.  */
+	role.word = xchg(&page->role.word, role.word);
+	return role.invalid;
+}
+
 void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
 			  bool shared)
 {
@@ -892,6 +945,13 @@  void kvm_tdp_mmu_zap_all(struct kvm *kvm)
 	int i;
 
 	/*
+	 * Zap all roots, including invalid roots, as all SPTEs must be dropped
+	 * before returning to the caller.  Zap directly even if the root is
+	 * also being zapped by a worker.  Walking zapped top-level SPTEs isn't
+	 * all that expensive and mmu_lock is already held, which means the
+	 * worker has yielded, i.e. flushing the work instead of zapping here
+	 * isn't guaranteed to be any faster.
+	 *
 	 * A TLB flush is unnecessary, KVM zaps everything if and only the VM
 	 * is being destroyed or the userspace VMM has exited.  In both cases,
 	 * KVM_RUN is unreachable, i.e. no vCPUs will ever service the request.
@@ -902,96 +962,28 @@  void kvm_tdp_mmu_zap_all(struct kvm *kvm)
 	}
 }
 
-static struct kvm_mmu_page *next_invalidated_root(struct kvm *kvm,
-						  struct kvm_mmu_page *prev_root)
-{
-	struct kvm_mmu_page *next_root;
-
-	if (prev_root)
-		next_root = list_next_or_null_rcu(&kvm->arch.tdp_mmu_roots,
-						  &prev_root->link,
-						  typeof(*prev_root), link);
-	else
-		next_root = list_first_or_null_rcu(&kvm->arch.tdp_mmu_roots,
-						   typeof(*next_root), link);
-
-	while (next_root && !(next_root->role.invalid &&
-			      refcount_read(&next_root->tdp_mmu_root_count)))
-		next_root = list_next_or_null_rcu(&kvm->arch.tdp_mmu_roots,
-						  &next_root->link,
-						  typeof(*next_root), link);
-
-	return next_root;
-}
-
 /*
  * Zap all invalidated roots to ensure all SPTEs are dropped before the "fast
- * zap" completes.  Since kvm_tdp_mmu_invalidate_all_roots() has acquired a
- * reference to each invalidated root, roots will not be freed until after this
- * function drops the gifted reference, e.g. so that vCPUs don't get stuck with
- * tearing down paging structures.
+ * zap" completes.
  */
 void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
 {
-	struct kvm_mmu_page *next_root;
-	struct kvm_mmu_page *root;
-
-	lockdep_assert_held_read(&kvm->mmu_lock);
-
-	rcu_read_lock();
-
-	root = next_invalidated_root(kvm, NULL);
-
-	while (root) {
-		next_root = next_invalidated_root(kvm, root);
-
-		rcu_read_unlock();
-
-		/*
-		 * A TLB flush is unnecessary, invalidated roots are guaranteed
-		 * to be unreachable by the guest (see kvm_tdp_mmu_put_root()
-		 * for more details), and unlike the legacy MMU, no vCPU kick
-		 * is needed to play nice with lockless shadow walks as the TDP
-		 * MMU protects its paging structures via RCU.  Note, zapping
-		 * will still flush on yield, but that's a minor performance
-		 * blip and not a functional issue.
-		 */
-		tdp_mmu_zap_root(kvm, root, true);
-
-		/*
-		 * Put the reference acquired in
-		 * kvm_tdp_mmu_invalidate_roots
-		 */
-		kvm_tdp_mmu_put_root(kvm, root, true);
-
-		root = next_root;
-
-		rcu_read_lock();
-	}
-
-	rcu_read_unlock();
+	flush_workqueue(kvm->arch.tdp_mmu_zap_wq);
 }
 
 /*
  * Mark each TDP MMU root as invalid to prevent vCPUs from reusing a root that
- * is about to be zapped, e.g. in response to a memslots update.  The caller is
- * responsible for invoking kvm_tdp_mmu_zap_invalidated_roots() to do the actual
- * zapping.
- *
- * Take a reference on all roots to prevent the root from being freed before it
- * is zapped by this thread.  Freeing a root is not a correctness issue, but if
- * a vCPU drops the last reference to a root prior to the root being zapped, it
- * will get stuck with tearing down the entire paging structure.
+ * is about to be zapped, e.g. in response to a memslots update.  The actual
+ * zapping is performed asynchronously, so a reference is taken on all roots.
+ * Using a separate workqueue makes it easy to ensure that the destruction is
+ * performed before the "fast zap" completes, without keeping a separate list
+ * of invalidated roots; the list is effectively the list of work items in
+ * the workqueue.
  *
- * Get a reference even if the root is already invalid,
- * kvm_tdp_mmu_zap_invalidated_roots() assumes it was gifted a reference to all
- * invalid roots, e.g. there's no epoch to identify roots that were invalidated
- * by a previous call.  Roots stay on the list until the last reference is
- * dropped, so even though all invalid roots are zapped, a root may not go away
- * for quite some time, e.g. if a vCPU blocks across multiple memslot updates.
- *
- * Because mmu_lock is held for write, it should be impossible to observe a
- * root with zero refcount, i.e. the list of roots cannot be stale.
+ * Get a reference even if the root is already invalid, the asynchronous worker
+ * assumes it was gifted a reference to the root it processes.  Because mmu_lock
+ * is held for write, it should be impossible to observe a root with zero refcount,
+ * i.e. the list of roots cannot be stale.
  *
  * This has essentially the same effect for the TDP MMU
  * as updating mmu_valid_gen does for the shadow MMU.
@@ -1002,8 +994,10 @@  void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
 
 	lockdep_assert_held_write(&kvm->mmu_lock);
 	list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
-		if (!WARN_ON_ONCE(!kvm_tdp_mmu_get_root(root)))
+		if (!WARN_ON_ONCE(!kvm_tdp_mmu_get_root(root))) {
 			root->role.invalid = true;
+			tdp_mmu_schedule_zap_root(kvm, root);
+		}
 	}
 }