diff mbox series

[1/4] KVM: Always flush async #PF workqueue when vCPU is being destroyed

Message ID 20240110011533.503302-2-seanjc@google.com (mailing list archive)
State New, archived
Headers show
Series KVM: Async #PF fixes and cleanups | expand

Commit Message

Sean Christopherson Jan. 10, 2024, 1:15 a.m. UTC
Always flush the per-vCPU async #PF workqueue when a vCPU is clearing its
completion queue, e.g. when a VM and all its vCPUs is being destroyed.
KVM must ensure that none of its workqueue callbacks is running when the
last reference to the KVM _module_ is put.  Gifting a reference to the
associated VM prevents the workqueue callback from dereferencing freed
vCPU/VM memory, but does not prevent the KVM module from being unloaded
before the callback completes.

Drop the misguided VM refcount gifting, as calling kvm_put_kvm() from
async_pf_execute() if kvm_put_kvm() flushes the async #PF workqueue will
result in deadlock.  async_pf_execute() can't return until kvm_put_kvm()
finishes, and kvm_put_kvm() can't return until async_pf_execute() finishes:

 WARNING: CPU: 8 PID: 251 at virt/kvm/kvm_main.c:1435 kvm_put_kvm+0x2d/0x320 [kvm]
 Modules linked in: vhost_net vhost vhost_iotlb tap kvm_intel kvm irqbypass
 CPU: 8 PID: 251 Comm: kworker/8:1 Tainted: G        W          6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
 Workqueue: events async_pf_execute [kvm]
 RIP: 0010:kvm_put_kvm+0x2d/0x320 [kvm]
 Call Trace:
  <TASK>
  async_pf_execute+0x198/0x260 [kvm]
  process_one_work+0x145/0x2d0
  worker_thread+0x27e/0x3a0
  kthread+0xba/0xe0
  ret_from_fork+0x2d/0x50
  ret_from_fork_asm+0x11/0x20
  </TASK>
 ---[ end trace 0000000000000000 ]---
 INFO: task kworker/8:1:251 blocked for more than 120 seconds.
       Tainted: G        W          6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 task:kworker/8:1     state:D stack:0     pid:251   ppid:2      flags:0x00004000
 Workqueue: events async_pf_execute [kvm]
 Call Trace:
  <TASK>
  __schedule+0x33f/0xa40
  schedule+0x53/0xc0
  schedule_timeout+0x12a/0x140
  __wait_for_common+0x8d/0x1d0
  __flush_work.isra.0+0x19f/0x2c0
  kvm_clear_async_pf_completion_queue+0x129/0x190 [kvm]
  kvm_arch_destroy_vm+0x78/0x1b0 [kvm]
  kvm_put_kvm+0x1c1/0x320 [kvm]
  async_pf_execute+0x198/0x260 [kvm]
  process_one_work+0x145/0x2d0
  worker_thread+0x27e/0x3a0
  kthread+0xba/0xe0
  ret_from_fork+0x2d/0x50
  ret_from_fork_asm+0x11/0x20
  </TASK>

If kvm_clear_async_pf_completion_queue() actually flushes the workqueue,
then there's no need to gift async_pf_execute() a reference because all
invocations of async_pf_execute() will be forced to complete before the
vCPU and its VM are destroyed/freed.  And that in turn fixes the module
unloading bug as __fput() won't do module_put() on the last vCPU reference
until the vCPU has been freed, e.g. if closing the vCPU file also puts the
last reference to the KVM module.

Note that kvm_check_async_pf_completion() may also take the work item off
the completion queue and so also needs to flush the work queue, as the
work will not be seen by kvm_clear_async_pf_completion_queue().  Waiting
on the workqueue could theoretically delay a vCPU due to waiting for the
work to complete, but that's a very, very small chance, and likely a very
small delay.  kvm_arch_async_page_present_queued() unconditionally makes a
new request, i.e. will effectively delay entering the guest, so the
remaining work is really just:

        trace_kvm_async_pf_completed(addr, cr2_or_gpa);

        __kvm_vcpu_wake_up(vcpu);

        mmput(mm);

and mmput() can't drop the last reference to the page tables if the vCPU is
still alive, i.e. the vCPU won't get stuck tearing down page tables.

Add a helper to do the flushing, specifically to deal with "wakeup all"
work items, as they aren't actually work items, i.e. are never placed in a
workqueue.  Trying to flush a bogus workqueue entry rightly makes
__flush_work() complain (kudos to whoever added that sanity check).

Note, commit 5f6de5cbebee ("KVM: Prevent module exit until all VMs are
freed") *tried* to fix the module refcounting issue by having VMs grab a
reference to the module, but that only made the bug slightly harder to hit
as it gave async_pf_execute() a bit more time to complete before the KVM
module could be unloaded.

Fixes: af585b921e5d ("KVM: Halt vcpu if page it tries to access is swapped out")
Cc: stable@vger.kernel.org
Cc: David Matlack <dmatlack@google.com>
Cc: Xu Yilun <yilun.xu@linux.intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 virt/kvm/async_pf.c | 37 ++++++++++++++++++++++++++++++++-----
 1 file changed, 32 insertions(+), 5 deletions(-)

Comments

Xu Yilun Jan. 20, 2024, 12:40 p.m. UTC | #1
On Tue, Jan 09, 2024 at 05:15:30PM -0800, Sean Christopherson wrote:
> Always flush the per-vCPU async #PF workqueue when a vCPU is clearing its
> completion queue, e.g. when a VM and all its vCPUs is being destroyed.
> KVM must ensure that none of its workqueue callbacks is running when the
> last reference to the KVM _module_ is put.  Gifting a reference to the
> associated VM prevents the workqueue callback from dereferencing freed
> vCPU/VM memory, but does not prevent the KVM module from being unloaded
> before the callback completes.
> 
> Drop the misguided VM refcount gifting, as calling kvm_put_kvm() from
> async_pf_execute() if kvm_put_kvm() flushes the async #PF workqueue will
> result in deadlock.  async_pf_execute() can't return until kvm_put_kvm()
> finishes, and kvm_put_kvm() can't return until async_pf_execute() finishes:
> 
>  WARNING: CPU: 8 PID: 251 at virt/kvm/kvm_main.c:1435 kvm_put_kvm+0x2d/0x320 [kvm]
>  Modules linked in: vhost_net vhost vhost_iotlb tap kvm_intel kvm irqbypass
>  CPU: 8 PID: 251 Comm: kworker/8:1 Tainted: G        W          6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119
>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>  Workqueue: events async_pf_execute [kvm]
>  RIP: 0010:kvm_put_kvm+0x2d/0x320 [kvm]
>  Call Trace:
>   <TASK>
>   async_pf_execute+0x198/0x260 [kvm]
>   process_one_work+0x145/0x2d0
>   worker_thread+0x27e/0x3a0
>   kthread+0xba/0xe0
>   ret_from_fork+0x2d/0x50
>   ret_from_fork_asm+0x11/0x20
>   </TASK>
>  ---[ end trace 0000000000000000 ]---
>  INFO: task kworker/8:1:251 blocked for more than 120 seconds.
>        Tainted: G        W          6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119
>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>  task:kworker/8:1     state:D stack:0     pid:251   ppid:2      flags:0x00004000
>  Workqueue: events async_pf_execute [kvm]
>  Call Trace:
>   <TASK>
>   __schedule+0x33f/0xa40
>   schedule+0x53/0xc0
>   schedule_timeout+0x12a/0x140
>   __wait_for_common+0x8d/0x1d0
>   __flush_work.isra.0+0x19f/0x2c0
>   kvm_clear_async_pf_completion_queue+0x129/0x190 [kvm]
>   kvm_arch_destroy_vm+0x78/0x1b0 [kvm]
>   kvm_put_kvm+0x1c1/0x320 [kvm]
>   async_pf_execute+0x198/0x260 [kvm]
>   process_one_work+0x145/0x2d0
>   worker_thread+0x27e/0x3a0
>   kthread+0xba/0xe0
>   ret_from_fork+0x2d/0x50
>   ret_from_fork_asm+0x11/0x20
>   </TASK>
> 
> If kvm_clear_async_pf_completion_queue() actually flushes the workqueue,
> then there's no need to gift async_pf_execute() a reference because all
> invocations of async_pf_execute() will be forced to complete before the
> vCPU and its VM are destroyed/freed.  And that in turn fixes the module
> unloading bug as __fput() won't do module_put() on the last vCPU reference
> until the vCPU has been freed, e.g. if closing the vCPU file also puts the

I'm not sure why __fput() of vCPU fd should be mentioned here. I assume
we just need to say that vCPUs are freed before module_put(KVM the module)
in kvm_destroy_vm(), then the whole logic for module unloading fix is:

  1. All workqueue callbacks complete when kvm_clear_async_pf_completion_queue(vcpu)
  2. kvm_clear_async_pf_completion_queue(vcpu) must be executed before vCPU free.
  3. vCPUs must be freed before module_put(KVM the module).

  So all workqueue callbacks complete before module_put(KVM the module).


__fput() of vCPU fd is not the only trigger of kvm_destroy_vm(), that
makes me distracted from reason of the fix.

> last reference to the KVM module.
> 
> Note that kvm_check_async_pf_completion() may also take the work item off
> the completion queue and so also needs to flush the work queue, as the
> work will not be seen by kvm_clear_async_pf_completion_queue().  Waiting
> on the workqueue could theoretically delay a vCPU due to waiting for the
> work to complete, but that's a very, very small chance, and likely a very
> small delay.  kvm_arch_async_page_present_queued() unconditionally makes a
> new request, i.e. will effectively delay entering the guest, so the
> remaining work is really just:
> 
>         trace_kvm_async_pf_completed(addr, cr2_or_gpa);
> 
>         __kvm_vcpu_wake_up(vcpu);
> 
>         mmput(mm);
> 
> and mmput() can't drop the last reference to the page tables if the vCPU is
> still alive, i.e. the vCPU won't get stuck tearing down page tables.
> 
> Add a helper to do the flushing, specifically to deal with "wakeup all"
> work items, as they aren't actually work items, i.e. are never placed in a
> workqueue.  Trying to flush a bogus workqueue entry rightly makes
> __flush_work() complain (kudos to whoever added that sanity check).
> 
> Note, commit 5f6de5cbebee ("KVM: Prevent module exit until all VMs are
> freed") *tried* to fix the module refcounting issue by having VMs grab a
> reference to the module, but that only made the bug slightly harder to hit
> as it gave async_pf_execute() a bit more time to complete before the KVM
> module could be unloaded.
> 
> Fixes: af585b921e5d ("KVM: Halt vcpu if page it tries to access is swapped out")
> Cc: stable@vger.kernel.org
> Cc: David Matlack <dmatlack@google.com>
> Cc: Xu Yilun <yilun.xu@linux.intel.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  virt/kvm/async_pf.c | 37 ++++++++++++++++++++++++++++++++-----
>  1 file changed, 32 insertions(+), 5 deletions(-)
> 
> diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
> index e033c79d528e..876927a558ad 100644
> --- a/virt/kvm/async_pf.c
> +++ b/virt/kvm/async_pf.c
> @@ -87,7 +87,25 @@ static void async_pf_execute(struct work_struct *work)
>  	__kvm_vcpu_wake_up(vcpu);
>  
>  	mmput(mm);
> -	kvm_put_kvm(vcpu->kvm);
> +}
> +
> +static void kvm_flush_and_free_async_pf_work(struct kvm_async_pf *work)
> +{
> +	/*
> +	 * The async #PF is "done", but KVM must wait for the work item itself,
> +	 * i.e. async_pf_execute(), to run to completion.  If KVM is a module,
> +	 * KVM must ensure *no* code owned by the KVM (the module) can be run
> +	 * after the last call to module_put(), i.e. after the last reference
> +	 * to the last vCPU's file is put.

Maybe drop the i.e? It is not exactly true, other components like VFIO
may also be the last one to put KVM reference?

> +	 *
> +	 * Wake all events skip the queue and go straight done, i.e. don't
> +	 * need to be flushed (but sanity check that the work wasn't queued).
> +	 */
> +	if (work->wakeup_all)
> +		WARN_ON_ONCE(work->work.func);
> +	else
> +		flush_work(&work->work);
> +	kmem_cache_free(async_pf_cache, work);
>  }
>  
>  void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> @@ -114,7 +132,6 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
>  #else
>  		if (cancel_work_sync(&work->work)) {
>  			mmput(work->mm);
> -			kvm_put_kvm(vcpu->kvm); /* == work->vcpu->kvm */
>  			kmem_cache_free(async_pf_cache, work);
>  		}
>  #endif
> @@ -126,7 +143,18 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
>  			list_first_entry(&vcpu->async_pf.done,
>  					 typeof(*work), link);
>  		list_del(&work->link);
> -		kmem_cache_free(async_pf_cache, work);
> +
> +		spin_unlock(&vcpu->async_pf.lock);
> +
> +		/*
> +		 * The async #PF is "done", but KVM must wait for the work item
> +		 * itself, i.e. async_pf_execute(), to run to completion.  If
> +		 * KVM is a module, KVM must ensure *no* code owned by the KVM
> +		 * (the module) can be run after the last call to module_put(),
> +		 * i.e. after the last reference to the last vCPU's file is put.
> +		 */
> +		kvm_flush_and_free_async_pf_work(work);
> +		spin_lock(&vcpu->async_pf.lock);
>  	}
>  	spin_unlock(&vcpu->async_pf.lock);
>  
> @@ -151,7 +179,7 @@ void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
>  
>  		list_del(&work->queue);
>  		vcpu->async_pf.queued--;
> -		kmem_cache_free(async_pf_cache, work);
> +		kvm_flush_and_free_async_pf_work(work);
>  	}
>  }
>  
> @@ -186,7 +214,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  	work->arch = *arch;
>  	work->mm = current->mm;
>  	mmget(work->mm);
> -	kvm_get_kvm(work->vcpu->kvm);
>  
>  	INIT_WORK(&work->work, async_pf_execute);

Overall LGTM, Reviewed-by: Xu Yilun <yilun.xu@intel.com>

>  
> -- 
> 2.43.0.472.g3155946c3a-goog
>
Sean Christopherson Jan. 24, 2024, 7:04 p.m. UTC | #2
On Sat, Jan 20, 2024, Xu Yilun wrote:
> On Tue, Jan 09, 2024 at 05:15:30PM -0800, Sean Christopherson wrote:
> > Always flush the per-vCPU async #PF workqueue when a vCPU is clearing its
> > completion queue, e.g. when a VM and all its vCPUs is being destroyed.
> > KVM must ensure that none of its workqueue callbacks is running when the
> > last reference to the KVM _module_ is put.  Gifting a reference to the
> > associated VM prevents the workqueue callback from dereferencing freed
> > vCPU/VM memory, but does not prevent the KVM module from being unloaded
> > before the callback completes.
> > 
> > Drop the misguided VM refcount gifting, as calling kvm_put_kvm() from
> > async_pf_execute() if kvm_put_kvm() flushes the async #PF workqueue will
> > result in deadlock.  async_pf_execute() can't return until kvm_put_kvm()
> > finishes, and kvm_put_kvm() can't return until async_pf_execute() finishes:
> > 
> >  WARNING: CPU: 8 PID: 251 at virt/kvm/kvm_main.c:1435 kvm_put_kvm+0x2d/0x320 [kvm]
> >  Modules linked in: vhost_net vhost vhost_iotlb tap kvm_intel kvm irqbypass
> >  CPU: 8 PID: 251 Comm: kworker/8:1 Tainted: G        W          6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119
> >  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> >  Workqueue: events async_pf_execute [kvm]
> >  RIP: 0010:kvm_put_kvm+0x2d/0x320 [kvm]
> >  Call Trace:
> >   <TASK>
> >   async_pf_execute+0x198/0x260 [kvm]
> >   process_one_work+0x145/0x2d0
> >   worker_thread+0x27e/0x3a0
> >   kthread+0xba/0xe0
> >   ret_from_fork+0x2d/0x50
> >   ret_from_fork_asm+0x11/0x20
> >   </TASK>
> >  ---[ end trace 0000000000000000 ]---
> >  INFO: task kworker/8:1:251 blocked for more than 120 seconds.
> >        Tainted: G        W          6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119
> >  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >  task:kworker/8:1     state:D stack:0     pid:251   ppid:2      flags:0x00004000
> >  Workqueue: events async_pf_execute [kvm]
> >  Call Trace:
> >   <TASK>
> >   __schedule+0x33f/0xa40
> >   schedule+0x53/0xc0
> >   schedule_timeout+0x12a/0x140
> >   __wait_for_common+0x8d/0x1d0
> >   __flush_work.isra.0+0x19f/0x2c0
> >   kvm_clear_async_pf_completion_queue+0x129/0x190 [kvm]
> >   kvm_arch_destroy_vm+0x78/0x1b0 [kvm]
> >   kvm_put_kvm+0x1c1/0x320 [kvm]
> >   async_pf_execute+0x198/0x260 [kvm]
> >   process_one_work+0x145/0x2d0
> >   worker_thread+0x27e/0x3a0
> >   kthread+0xba/0xe0
> >   ret_from_fork+0x2d/0x50
> >   ret_from_fork_asm+0x11/0x20
> >   </TASK>
> > 
> > If kvm_clear_async_pf_completion_queue() actually flushes the workqueue,
> > then there's no need to gift async_pf_execute() a reference because all
> > invocations of async_pf_execute() will be forced to complete before the
> > vCPU and its VM are destroyed/freed.  And that in turn fixes the module
> > unloading bug as __fput() won't do module_put() on the last vCPU reference
> > until the vCPU has been freed, e.g. if closing the vCPU file also puts the
> 
> I'm not sure why __fput() of vCPU fd should be mentioned here. I assume
> we just need to say that vCPUs are freed before module_put(KVM the module)
> in kvm_destroy_vm(), then the whole logic for module unloading fix is:
> 
>   1. All workqueue callbacks complete when kvm_clear_async_pf_completion_queue(vcpu)
>   2. kvm_clear_async_pf_completion_queue(vcpu) must be executed before vCPU free.
>   3. vCPUs must be freed before module_put(KVM the module).
> 
>   So all workqueue callbacks complete before module_put(KVM the module).
> 
> 
> __fput() of vCPU fd is not the only trigger of kvm_destroy_vm(), that
> makes me distracted from reason of the fix.

My goal was to call out that (a) the vCPU file descriptor is what ensures kvm.ko
is alive at this point and (b) that __fput() very deliberately ensures module_put()
is called after all module function callbacks/hooks complete, as there was quite
a bit of confusion around who/what can safely do module_put().

> > last reference to the KVM module.
> > 
> > Note that kvm_check_async_pf_completion() may also take the work item off
> > the completion queue and so also needs to flush the work queue, as the
> > work will not be seen by kvm_clear_async_pf_completion_queue().  Waiting
> > on the workqueue could theoretically delay a vCPU due to waiting for the
> > work to complete, but that's a very, very small chance, and likely a very
> > small delay.  kvm_arch_async_page_present_queued() unconditionally makes a
> > new request, i.e. will effectively delay entering the guest, so the
> > remaining work is really just:
> > 
> >         trace_kvm_async_pf_completed(addr, cr2_or_gpa);
> > 
> >         __kvm_vcpu_wake_up(vcpu);
> > 
> >         mmput(mm);
> > 
> > and mmput() can't drop the last reference to the page tables if the vCPU is
> > still alive, i.e. the vCPU won't get stuck tearing down page tables.
> > 
> > Add a helper to do the flushing, specifically to deal with "wakeup all"
> > work items, as they aren't actually work items, i.e. are never placed in a
> > workqueue.  Trying to flush a bogus workqueue entry rightly makes
> > __flush_work() complain (kudos to whoever added that sanity check).
> > 
> > Note, commit 5f6de5cbebee ("KVM: Prevent module exit until all VMs are
> > freed") *tried* to fix the module refcounting issue by having VMs grab a
> > reference to the module, but that only made the bug slightly harder to hit
> > as it gave async_pf_execute() a bit more time to complete before the KVM
> > module could be unloaded.
> > 
> > Fixes: af585b921e5d ("KVM: Halt vcpu if page it tries to access is swapped out")
> > Cc: stable@vger.kernel.org
> > Cc: David Matlack <dmatlack@google.com>
> > Cc: Xu Yilun <yilun.xu@linux.intel.com>
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> >  virt/kvm/async_pf.c | 37 ++++++++++++++++++++++++++++++++-----
> >  1 file changed, 32 insertions(+), 5 deletions(-)
> > 
> > diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
> > index e033c79d528e..876927a558ad 100644
> > --- a/virt/kvm/async_pf.c
> > +++ b/virt/kvm/async_pf.c
> > @@ -87,7 +87,25 @@ static void async_pf_execute(struct work_struct *work)
> >  	__kvm_vcpu_wake_up(vcpu);
> >  
> >  	mmput(mm);
> > -	kvm_put_kvm(vcpu->kvm);
> > +}
> > +
> > +static void kvm_flush_and_free_async_pf_work(struct kvm_async_pf *work)
> > +{
> > +	/*
> > +	 * The async #PF is "done", but KVM must wait for the work item itself,
> > +	 * i.e. async_pf_execute(), to run to completion.  If KVM is a module,
> > +	 * KVM must ensure *no* code owned by the KVM (the module) can be run
> > +	 * after the last call to module_put(), i.e. after the last reference
> > +	 * to the last vCPU's file is put.
> 
> Maybe drop the i.e? It is not exactly true, other components like VFIO
> may also be the last one to put KVM reference?

Ah, yeah, agreed.  I'll drop that last snippet, it doesn't and much value.
Xu Yilun Jan. 26, 2024, 7:36 a.m. UTC | #3
On Wed, Jan 24, 2024 at 11:04:55AM -0800, Sean Christopherson wrote:
> On Sat, Jan 20, 2024, Xu Yilun wrote:
> > On Tue, Jan 09, 2024 at 05:15:30PM -0800, Sean Christopherson wrote:
> > > Always flush the per-vCPU async #PF workqueue when a vCPU is clearing its
> > > completion queue, e.g. when a VM and all its vCPUs is being destroyed.
> > > KVM must ensure that none of its workqueue callbacks is running when the
> > > last reference to the KVM _module_ is put.  Gifting a reference to the
> > > associated VM prevents the workqueue callback from dereferencing freed
> > > vCPU/VM memory, but does not prevent the KVM module from being unloaded
> > > before the callback completes.
> > > 
> > > Drop the misguided VM refcount gifting, as calling kvm_put_kvm() from
> > > async_pf_execute() if kvm_put_kvm() flushes the async #PF workqueue will
> > > result in deadlock.  async_pf_execute() can't return until kvm_put_kvm()
> > > finishes, and kvm_put_kvm() can't return until async_pf_execute() finishes:
> > > 
> > >  WARNING: CPU: 8 PID: 251 at virt/kvm/kvm_main.c:1435 kvm_put_kvm+0x2d/0x320 [kvm]
> > >  Modules linked in: vhost_net vhost vhost_iotlb tap kvm_intel kvm irqbypass
> > >  CPU: 8 PID: 251 Comm: kworker/8:1 Tainted: G        W          6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119
> > >  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> > >  Workqueue: events async_pf_execute [kvm]
> > >  RIP: 0010:kvm_put_kvm+0x2d/0x320 [kvm]
> > >  Call Trace:
> > >   <TASK>
> > >   async_pf_execute+0x198/0x260 [kvm]
> > >   process_one_work+0x145/0x2d0
> > >   worker_thread+0x27e/0x3a0
> > >   kthread+0xba/0xe0
> > >   ret_from_fork+0x2d/0x50
> > >   ret_from_fork_asm+0x11/0x20
> > >   </TASK>
> > >  ---[ end trace 0000000000000000 ]---
> > >  INFO: task kworker/8:1:251 blocked for more than 120 seconds.
> > >        Tainted: G        W          6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119
> > >  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > >  task:kworker/8:1     state:D stack:0     pid:251   ppid:2      flags:0x00004000
> > >  Workqueue: events async_pf_execute [kvm]
> > >  Call Trace:
> > >   <TASK>
> > >   __schedule+0x33f/0xa40
> > >   schedule+0x53/0xc0
> > >   schedule_timeout+0x12a/0x140
> > >   __wait_for_common+0x8d/0x1d0
> > >   __flush_work.isra.0+0x19f/0x2c0
> > >   kvm_clear_async_pf_completion_queue+0x129/0x190 [kvm]
> > >   kvm_arch_destroy_vm+0x78/0x1b0 [kvm]
> > >   kvm_put_kvm+0x1c1/0x320 [kvm]
> > >   async_pf_execute+0x198/0x260 [kvm]
> > >   process_one_work+0x145/0x2d0
> > >   worker_thread+0x27e/0x3a0
> > >   kthread+0xba/0xe0
> > >   ret_from_fork+0x2d/0x50
> > >   ret_from_fork_asm+0x11/0x20
> > >   </TASK>
> > > 
> > > If kvm_clear_async_pf_completion_queue() actually flushes the workqueue,
> > > then there's no need to gift async_pf_execute() a reference because all
> > > invocations of async_pf_execute() will be forced to complete before the
> > > vCPU and its VM are destroyed/freed.  And that in turn fixes the module
> > > unloading bug as __fput() won't do module_put() on the last vCPU reference
> > > until the vCPU has been freed, e.g. if closing the vCPU file also puts the
> > 
> > I'm not sure why __fput() of vCPU fd should be mentioned here. I assume
> > we just need to say that vCPUs are freed before module_put(KVM the module)
> > in kvm_destroy_vm(), then the whole logic for module unloading fix is:
> > 
> >   1. All workqueue callbacks complete when kvm_clear_async_pf_completion_queue(vcpu)
> >   2. kvm_clear_async_pf_completion_queue(vcpu) must be executed before vCPU free.
> >   3. vCPUs must be freed before module_put(KVM the module).
> > 
> >   So all workqueue callbacks complete before module_put(KVM the module).
> > 
> > 
> > __fput() of vCPU fd is not the only trigger of kvm_destroy_vm(), that
> > makes me distracted from reason of the fix.
> 
> My goal was to call out that (a) the vCPU file descriptor is what ensures kvm.ko
> is alive at this point and (b) that __fput() very deliberately ensures module_put()
> is called after all module function callbacks/hooks complete, as there was quite

Ah, I understood. These are ensured by your previous fix which grants
kvm_vcpu_fops the module owner. LGTM now.

Thanks,
Yilun
Vitaly Kuznetsov Jan. 26, 2024, 4:51 p.m. UTC | #4
Sean Christopherson <seanjc@google.com> writes:

> Always flush the per-vCPU async #PF workqueue when a vCPU is clearing its
> completion queue, e.g. when a VM and all its vCPUs is being destroyed.
> KVM must ensure that none of its workqueue callbacks is running when the
> last reference to the KVM _module_ is put.  Gifting a reference to the
> associated VM prevents the workqueue callback from dereferencing freed
> vCPU/VM memory, but does not prevent the KVM module from being unloaded
> before the callback completes.
>
> Drop the misguided VM refcount gifting, as calling kvm_put_kvm() from
> async_pf_execute() if kvm_put_kvm() flushes the async #PF workqueue will
> result in deadlock.  async_pf_execute() can't return until kvm_put_kvm()
> finishes, and kvm_put_kvm() can't return until async_pf_execute() finishes:
>
>  WARNING: CPU: 8 PID: 251 at virt/kvm/kvm_main.c:1435 kvm_put_kvm+0x2d/0x320 [kvm]
>  Modules linked in: vhost_net vhost vhost_iotlb tap kvm_intel kvm irqbypass
>  CPU: 8 PID: 251 Comm: kworker/8:1 Tainted: G        W          6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119
>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>  Workqueue: events async_pf_execute [kvm]
>  RIP: 0010:kvm_put_kvm+0x2d/0x320 [kvm]
>  Call Trace:
>   <TASK>
>   async_pf_execute+0x198/0x260 [kvm]
>   process_one_work+0x145/0x2d0
>   worker_thread+0x27e/0x3a0
>   kthread+0xba/0xe0
>   ret_from_fork+0x2d/0x50
>   ret_from_fork_asm+0x11/0x20
>   </TASK>
>  ---[ end trace 0000000000000000 ]---
>  INFO: task kworker/8:1:251 blocked for more than 120 seconds.
>        Tainted: G        W          6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119
>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>  task:kworker/8:1     state:D stack:0     pid:251   ppid:2      flags:0x00004000
>  Workqueue: events async_pf_execute [kvm]
>  Call Trace:
>   <TASK>
>   __schedule+0x33f/0xa40
>   schedule+0x53/0xc0
>   schedule_timeout+0x12a/0x140
>   __wait_for_common+0x8d/0x1d0
>   __flush_work.isra.0+0x19f/0x2c0
>   kvm_clear_async_pf_completion_queue+0x129/0x190 [kvm]
>   kvm_arch_destroy_vm+0x78/0x1b0 [kvm]
>   kvm_put_kvm+0x1c1/0x320 [kvm]
>   async_pf_execute+0x198/0x260 [kvm]
>   process_one_work+0x145/0x2d0
>   worker_thread+0x27e/0x3a0
>   kthread+0xba/0xe0
>   ret_from_fork+0x2d/0x50
>   ret_from_fork_asm+0x11/0x20
>   </TASK>
>
> If kvm_clear_async_pf_completion_queue() actually flushes the workqueue,
> then there's no need to gift async_pf_execute() a reference because all
> invocations of async_pf_execute() will be forced to complete before the
> vCPU and its VM are destroyed/freed.  And that in turn fixes the module
> unloading bug as __fput() won't do module_put() on the last vCPU reference
> until the vCPU has been freed, e.g. if closing the vCPU file also puts the
> last reference to the KVM module.
>
> Note that kvm_check_async_pf_completion() may also take the work item off
> the completion queue and so also needs to flush the work queue, as the
> work will not be seen by kvm_clear_async_pf_completion_queue().  Waiting
> on the workqueue could theoretically delay a vCPU due to waiting for the
> work to complete, but that's a very, very small chance, and likely a very
> small delay.  kvm_arch_async_page_present_queued() unconditionally makes a
> new request, i.e. will effectively delay entering the guest, so the
> remaining work is really just:
>
>         trace_kvm_async_pf_completed(addr, cr2_or_gpa);
>
>         __kvm_vcpu_wake_up(vcpu);
>
>         mmput(mm);
>
> and mmput() can't drop the last reference to the page tables if the vCPU is
> still alive, i.e. the vCPU won't get stuck tearing down page tables.
>
> Add a helper to do the flushing, specifically to deal with "wakeup all"
> work items, as they aren't actually work items, i.e. are never placed in a
> workqueue.  Trying to flush a bogus workqueue entry rightly makes
> __flush_work() complain (kudos to whoever added that sanity check).
>
> Note, commit 5f6de5cbebee ("KVM: Prevent module exit until all VMs are
> freed") *tried* to fix the module refcounting issue by having VMs grab a
> reference to the module, but that only made the bug slightly harder to hit
> as it gave async_pf_execute() a bit more time to complete before the KVM
> module could be unloaded.
>
> Fixes: af585b921e5d ("KVM: Halt vcpu if page it tries to access is swapped out")
> Cc: stable@vger.kernel.org
> Cc: David Matlack <dmatlack@google.com>
> Cc: Xu Yilun <yilun.xu@linux.intel.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  virt/kvm/async_pf.c | 37 ++++++++++++++++++++++++++++++++-----
>  1 file changed, 32 insertions(+), 5 deletions(-)
>
> diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
> index e033c79d528e..876927a558ad 100644
> --- a/virt/kvm/async_pf.c
> +++ b/virt/kvm/async_pf.c
> @@ -87,7 +87,25 @@ static void async_pf_execute(struct work_struct *work)
>  	__kvm_vcpu_wake_up(vcpu);
>  
>  	mmput(mm);
> -	kvm_put_kvm(vcpu->kvm);
> +}
> +
> +static void kvm_flush_and_free_async_pf_work(struct kvm_async_pf *work)
> +{
> +	/*
> +	 * The async #PF is "done", but KVM must wait for the work item itself,
> +	 * i.e. async_pf_execute(), to run to completion.  If KVM is a module,
> +	 * KVM must ensure *no* code owned by the KVM (the module) can be run
> +	 * after the last call to module_put(), i.e. after the last reference
> +	 * to the last vCPU's file is put.
> +	 *

Do I understand correctly that the problem is also present on the
"normal" path, i.e.:

  KVM_REQ_APF_READY
     kvm_check_async_pf_completion()
         kmem_cache_free(,work)

on one CPU can actually finish _before_ work is fully flushed on the
other (async_pf_execute() has already added an item to 'done' list but
hasn't completed)? Is it just the fact that the window of opportunity
to get the freed item re-purposed is so short that no real issue was
ever noticed? In that case I'd suggest we emphasize that in the comment
as currently it sounds like kvm_arch_destroy_vm() is the only
probemmatic path.

> +	 * Wake all events skip the queue and go straight done, i.e. don't
> +	 * need to be flushed (but sanity check that the work wasn't queued).
> +	 */
> +	if (work->wakeup_all)
> +		WARN_ON_ONCE(work->work.func);
> +	else
> +		flush_work(&work->work);
> +	kmem_cache_free(async_pf_cache, work);
>  }
>  
>  void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> @@ -114,7 +132,6 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
>  #else
>  		if (cancel_work_sync(&work->work)) {
>  			mmput(work->mm);
> -			kvm_put_kvm(vcpu->kvm); /* == work->vcpu->kvm */
>  			kmem_cache_free(async_pf_cache, work);
>  		}
>  #endif
> @@ -126,7 +143,18 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
>  			list_first_entry(&vcpu->async_pf.done,
>  					 typeof(*work), link);
>  		list_del(&work->link);
> -		kmem_cache_free(async_pf_cache, work);
> +
> +		spin_unlock(&vcpu->async_pf.lock);
> +
> +		/*
> +		 * The async #PF is "done", but KVM must wait for the work item
> +		 * itself, i.e. async_pf_execute(), to run to completion.  If
> +		 * KVM is a module, KVM must ensure *no* code owned by the KVM
> +		 * (the module) can be run after the last call to module_put(),
> +		 * i.e. after the last reference to the last vCPU's file is put.
> +		 */
> +		kvm_flush_and_free_async_pf_work(work);
> +		spin_lock(&vcpu->async_pf.lock);
>  	}
>  	spin_unlock(&vcpu->async_pf.lock);
>  
> @@ -151,7 +179,7 @@ void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
>  
>  		list_del(&work->queue);
>  		vcpu->async_pf.queued--;
> -		kmem_cache_free(async_pf_cache, work);
> +		kvm_flush_and_free_async_pf_work(work);
>  	}
>  }
>  
> @@ -186,7 +214,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  	work->arch = *arch;
>  	work->mm = current->mm;
>  	mmget(work->mm);
> -	kvm_get_kvm(work->vcpu->kvm);
>  
>  	INIT_WORK(&work->work, async_pf_execute);

Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Sean Christopherson Jan. 26, 2024, 5:19 p.m. UTC | #5
On Fri, Jan 26, 2024, Vitaly Kuznetsov wrote:
> > +static void kvm_flush_and_free_async_pf_work(struct kvm_async_pf *work)
> > +{
> > +	/*
> > +	 * The async #PF is "done", but KVM must wait for the work item itself,
> > +	 * i.e. async_pf_execute(), to run to completion.  If KVM is a module,
> > +	 * KVM must ensure *no* code owned by the KVM (the module) can be run
> > +	 * after the last call to module_put(), i.e. after the last reference
> > +	 * to the last vCPU's file is put.
> > +	 *
> 
> Do I understand correctly that the problem is also present on the
> "normal" path, i.e.:
> 
>   KVM_REQ_APF_READY
>      kvm_check_async_pf_completion()
>          kmem_cache_free(,work)
> 
> on one CPU can actually finish _before_ work is fully flushed on the
> other (async_pf_execute() has already added an item to 'done' list but
> hasn't completed)? Is it just the fact that the window of opportunity
> to get the freed item re-purposed is so short that no real issue was
> ever noticed?

Sort of?  It's not a problem with accessing a freed obect, the issue is that
async_pf_execute() can still be executing while KVM-the-module is unloaded.

The reason the "normal" path is problematic is because it removes the
work entry from async_pf.done and from async_pf.queue, i.e. makes it invisible
to kvm_arch_destroy_vm().  So to hit the bug where the "normal" path processes
the completed work, userspace would need to terminate the VM (i.e. exit() or
close all fds), and KVM would need to completely tear down the VM, all before
async_pf_execute() finished it's last few instructions.

Which is possible, just comically unlikely :-)

> In that case I'd suggest we emphasize that in the comment as currently it
> sounds like kvm_arch_destroy_vm() is the only probemmatic path.

How's this?

	/*
	 * The async #PF is "done", but KVM must wait for the work item itself,
	 * i.e. async_pf_execute(), to run to completion.  If KVM is a module,
	 * KVM must ensure *no* code owned by the KVM (the module) can be run
	 * after the last call to module_put().  Note, flushing the work item
	 * is always required when the item is taken off the completion queue.
	 * E.g. even if the vCPU handles the item in the "normal" path, the VM
	 * could be terminated before async_pf_execute() completes.
	 *
	 * Wake all events skip the queue and go straight done, i.e. don't
	 * need to be flushed (but sanity check that the work wasn't queued).
	 */

> > +	 * Wake all events skip the queue and go straight done, i.e. don't
> > +	 * need to be flushed (but sanity check that the work wasn't queued).
> > +	 */
> > +	if (work->wakeup_all)
> > +		WARN_ON_ONCE(work->work.func);
> > +	else
> > +		flush_work(&work->work);
> > +	kmem_cache_free(async_pf_cache, work);
> >  }
> >  
> >  void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> > @@ -114,7 +132,6 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> >  #else
> >  		if (cancel_work_sync(&work->work)) {
> >  			mmput(work->mm);
> > -			kvm_put_kvm(vcpu->kvm); /* == work->vcpu->kvm */
> >  			kmem_cache_free(async_pf_cache, work);
> >  		}
> >  #endif
> > @@ -126,7 +143,18 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> >  			list_first_entry(&vcpu->async_pf.done,
> >  					 typeof(*work), link);
> >  		list_del(&work->link);
> > -		kmem_cache_free(async_pf_cache, work);
> > +
> > +		spin_unlock(&vcpu->async_pf.lock);
> > +
> > +		/*
> > +		 * The async #PF is "done", but KVM must wait for the work item
> > +		 * itself, i.e. async_pf_execute(), to run to completion.  If
> > +		 * KVM is a module, KVM must ensure *no* code owned by the KVM
> > +		 * (the module) can be run after the last call to module_put(),
> > +		 * i.e. after the last reference to the last vCPU's file is put.
> > +		 */

Doh, this is a duplicate comment, I'll delete it.
 
> > +		kvm_flush_and_free_async_pf_work(work);
> > +		spin_lock(&vcpu->async_pf.lock);
> >  	}
> >  	spin_unlock(&vcpu->async_pf.lock);
> >  
> > @@ -151,7 +179,7 @@ void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
> >  
> >  		list_del(&work->queue);
> >  		vcpu->async_pf.queued--;
> > -		kmem_cache_free(async_pf_cache, work);
> > +		kvm_flush_and_free_async_pf_work(work);
> >  	}
> >  }
> >  
> > @@ -186,7 +214,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> >  	work->arch = *arch;
> >  	work->mm = current->mm;
> >  	mmget(work->mm);
> > -	kvm_get_kvm(work->vcpu->kvm);
> >  
> >  	INIT_WORK(&work->work, async_pf_execute);
> 
> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> 
> -- 
> Vitaly
>
Vitaly Kuznetsov Jan. 29, 2024, 9:02 a.m. UTC | #6
Sean Christopherson <seanjc@google.com> writes:

> On Fri, Jan 26, 2024, Vitaly Kuznetsov wrote:
>> > +static void kvm_flush_and_free_async_pf_work(struct kvm_async_pf *work)
>> > +{
>> > +	/*
>> > +	 * The async #PF is "done", but KVM must wait for the work item itself,
>> > +	 * i.e. async_pf_execute(), to run to completion.  If KVM is a module,
>> > +	 * KVM must ensure *no* code owned by the KVM (the module) can be run
>> > +	 * after the last call to module_put(), i.e. after the last reference
>> > +	 * to the last vCPU's file is put.
>> > +	 *
>> 
>> Do I understand correctly that the problem is also present on the
>> "normal" path, i.e.:
>> 
>>   KVM_REQ_APF_READY
>>      kvm_check_async_pf_completion()
>>          kmem_cache_free(,work)
>> 
>> on one CPU can actually finish _before_ work is fully flushed on the
>> other (async_pf_execute() has already added an item to 'done' list but
>> hasn't completed)? Is it just the fact that the window of opportunity
>> to get the freed item re-purposed is so short that no real issue was
>> ever noticed?
>
> Sort of?  It's not a problem with accessing a freed obect, the issue is that
> async_pf_execute() can still be executing while KVM-the-module is unloaded.
>
> The reason the "normal" path is problematic is because it removes the
> work entry from async_pf.done and from async_pf.queue, i.e. makes it invisible
> to kvm_arch_destroy_vm().  So to hit the bug where the "normal" path processes
> the completed work, userspace would need to terminate the VM (i.e. exit() or
> close all fds), and KVM would need to completely tear down the VM, all before
> async_pf_execute() finished it's last few instructions.
>
> Which is possible, just comically unlikely :-)
>
>> In that case I'd suggest we emphasize that in the comment as currently it
>> sounds like kvm_arch_destroy_vm() is the only probemmatic path.
>
> How's this?
>
> 	/*
> 	 * The async #PF is "done", but KVM must wait for the work item itself,
> 	 * i.e. async_pf_execute(), to run to completion.  If KVM is a module,
> 	 * KVM must ensure *no* code owned by the KVM (the module) can be run
> 	 * after the last call to module_put().  Note, flushing the work item
> 	 * is always required when the item is taken off the completion queue.
> 	 * E.g. even if the vCPU handles the item in the "normal" path, the VM
> 	 * could be terminated before async_pf_execute() completes.
> 	 *
> 	 * Wake all events skip the queue and go straight done, i.e. don't
> 	 * need to be flushed (but sanity check that the work wasn't queued).
> 	 */
>

Sounds good to me, thanks!

>> > +	 * Wake all events skip the queue and go straight done, i.e. don't
>> > +	 * need to be flushed (but sanity check that the work wasn't queued).
>> > +	 */
>> > +	if (work->wakeup_all)
>> > +		WARN_ON_ONCE(work->work.func);
>> > +	else
>> > +		flush_work(&work->work);
>> > +	kmem_cache_free(async_pf_cache, work);
>> >  }
>> >  
>> >  void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
>> > @@ -114,7 +132,6 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
>> >  #else
>> >  		if (cancel_work_sync(&work->work)) {
>> >  			mmput(work->mm);
>> > -			kvm_put_kvm(vcpu->kvm); /* == work->vcpu->kvm */
>> >  			kmem_cache_free(async_pf_cache, work);
>> >  		}
>> >  #endif
>> > @@ -126,7 +143,18 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
>> >  			list_first_entry(&vcpu->async_pf.done,
>> >  					 typeof(*work), link);
>> >  		list_del(&work->link);
>> > -		kmem_cache_free(async_pf_cache, work);
>> > +
>> > +		spin_unlock(&vcpu->async_pf.lock);
>> > +
>> > +		/*
>> > +		 * The async #PF is "done", but KVM must wait for the work item
>> > +		 * itself, i.e. async_pf_execute(), to run to completion.  If
>> > +		 * KVM is a module, KVM must ensure *no* code owned by the KVM
>> > +		 * (the module) can be run after the last call to module_put(),
>> > +		 * i.e. after the last reference to the last vCPU's file is put.
>> > +		 */
>
> Doh, this is a duplicate comment, I'll delete it.
>  
>> > +		kvm_flush_and_free_async_pf_work(work);
>> > +		spin_lock(&vcpu->async_pf.lock);
>> >  	}
>> >  	spin_unlock(&vcpu->async_pf.lock);
>> >  
>> > @@ -151,7 +179,7 @@ void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
>> >  
>> >  		list_del(&work->queue);
>> >  		vcpu->async_pf.queued--;
>> > -		kmem_cache_free(async_pf_cache, work);
>> > +		kvm_flush_and_free_async_pf_work(work);
>> >  	}
>> >  }
>> >  
>> > @@ -186,7 +214,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>> >  	work->arch = *arch;
>> >  	work->mm = current->mm;
>> >  	mmget(work->mm);
>> > -	kvm_get_kvm(work->vcpu->kvm);
>> >  
>> >  	INIT_WORK(&work->work, async_pf_execute);
>> 
>> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
>> 
>> -- 
>> Vitaly
>> 
>
Sean Christopherson Feb. 6, 2024, 7:06 p.m. UTC | #7
On Sat, Jan 20, 2024, Xu Yilun wrote:
> Overall LGTM, Reviewed-by: Xu Yilun <yilun.xu@intel.com>

A very random thing.  AFAICT, b4 doesn't appear to pickup tags that are "buried"
partway through a line.  You didn't do anything wrong (IMO, at least :-D), and
it was easy enough to apply manually, just an FYI for the future.
Xu Yilun Feb. 19, 2024, 1:59 p.m. UTC | #8
>  void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> @@ -114,7 +132,6 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
>  #else
>  		if (cancel_work_sync(&work->work)) {
>  			mmput(work->mm);
> -			kvm_put_kvm(vcpu->kvm); /* == work->vcpu->kvm */
>  			kmem_cache_free(async_pf_cache, work);
>  		}
>  #endif
> @@ -126,7 +143,18 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
>  			list_first_entry(&vcpu->async_pf.done,
>  					 typeof(*work), link);
>  		list_del(&work->link);
> -		kmem_cache_free(async_pf_cache, work);
> +
> +		spin_unlock(&vcpu->async_pf.lock);
> +
> +		/*
> +		 * The async #PF is "done", but KVM must wait for the work item
> +		 * itself, i.e. async_pf_execute(), to run to completion.  If
> +		 * KVM is a module, KVM must ensure *no* code owned by the KVM
> +		 * (the module) can be run after the last call to module_put(),
> +		 * i.e. after the last reference to the last vCPU's file is put.
> +		 */
> +		kvm_flush_and_free_async_pf_work(work);

I have a new concern when I re-visit this patchset.

Form kvm_check_async_pf_completion(), I see async_pf.queue is always a
superset of async_pf.done (except wake-all work, which is not within
concern).  And done work would be skipped from sync (cancel_work_sync()) by:

                if (!work->vcpu)
                        continue;

But now with this patch we also sync done works, how about we just sync all
queued work instead.

diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index e033c79d528e..2268f16a36c2 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -71,7 +71,6 @@ static void async_pf_execute(struct work_struct *work)
        spin_lock(&vcpu->async_pf.lock);
        first = list_empty(&vcpu->async_pf.done);
        list_add_tail(&apf->link, &vcpu->async_pf.done);
-       apf->vcpu = NULL;
        spin_unlock(&vcpu->async_pf.lock);

        if (!IS_ENABLED(CONFIG_KVM_ASYNC_PF_SYNC) && first)
@@ -101,13 +100,6 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
                                         typeof(*work), queue);
                list_del(&work->queue);

-               /*
-                * We know it's present in vcpu->async_pf.done, do
-                * nothing here.
-                */
-               if (!work->vcpu)
-                       continue;
-
                spin_unlock(&vcpu->async_pf.lock);
 #ifdef CONFIG_KVM_ASYNC_PF_SYNC
                flush_work(&work->work);


This way we don't have to sync twice for the same purpose, also we could
avoid reusing work->vcpu as a "done" flag which confused me a bit.

Thanks,
Yilun

> +		spin_lock(&vcpu->async_pf.lock);
>  	}
>  	spin_unlock(&vcpu->async_pf.lock);
>
Sean Christopherson Feb. 19, 2024, 3:51 p.m. UTC | #9
On Mon, Feb 19, 2024, Xu Yilun wrote:
> >  void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> > @@ -114,7 +132,6 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> >  #else
> >  		if (cancel_work_sync(&work->work)) {
> >  			mmput(work->mm);
> > -			kvm_put_kvm(vcpu->kvm); /* == work->vcpu->kvm */
> >  			kmem_cache_free(async_pf_cache, work);
> >  		}
> >  #endif
> > @@ -126,7 +143,18 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> >  			list_first_entry(&vcpu->async_pf.done,
> >  					 typeof(*work), link);
> >  		list_del(&work->link);
> > -		kmem_cache_free(async_pf_cache, work);
> > +
> > +		spin_unlock(&vcpu->async_pf.lock);
> > +
> > +		/*
> > +		 * The async #PF is "done", but KVM must wait for the work item
> > +		 * itself, i.e. async_pf_execute(), to run to completion.  If
> > +		 * KVM is a module, KVM must ensure *no* code owned by the KVM
> > +		 * (the module) can be run after the last call to module_put(),
> > +		 * i.e. after the last reference to the last vCPU's file is put.
> > +		 */
> > +		kvm_flush_and_free_async_pf_work(work);
> 
> I have a new concern when I re-visit this patchset.
> 
> Form kvm_check_async_pf_completion(), I see async_pf.queue is always a
> superset of async_pf.done (except wake-all work, which is not within
> concern).  And done work would be skipped from sync (cancel_work_sync()) by:
> 
>                 if (!work->vcpu)
>                         continue;
> 
> But now with this patch we also sync done works, how about we just sync all
> queued work instead.

Hmm, IIUC, I think we can simply revert commit 22583f0d9c85 ("KVM: async_pf: avoid
recursive flushing of work items").
Xu Yilun Feb. 20, 2024, 3:02 a.m. UTC | #10
On Mon, Feb 19, 2024 at 07:51:24AM -0800, Sean Christopherson wrote:
> On Mon, Feb 19, 2024, Xu Yilun wrote:
> > >  void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> > > @@ -114,7 +132,6 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> > >  #else
> > >  		if (cancel_work_sync(&work->work)) {
> > >  			mmput(work->mm);
> > > -			kvm_put_kvm(vcpu->kvm); /* == work->vcpu->kvm */
> > >  			kmem_cache_free(async_pf_cache, work);
> > >  		}
> > >  #endif
> > > @@ -126,7 +143,18 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> > >  			list_first_entry(&vcpu->async_pf.done,
> > >  					 typeof(*work), link);
> > >  		list_del(&work->link);
> > > -		kmem_cache_free(async_pf_cache, work);
> > > +
> > > +		spin_unlock(&vcpu->async_pf.lock);
> > > +
> > > +		/*
> > > +		 * The async #PF is "done", but KVM must wait for the work item
> > > +		 * itself, i.e. async_pf_execute(), to run to completion.  If
> > > +		 * KVM is a module, KVM must ensure *no* code owned by the KVM
> > > +		 * (the module) can be run after the last call to module_put(),
> > > +		 * i.e. after the last reference to the last vCPU's file is put.
> > > +		 */
> > > +		kvm_flush_and_free_async_pf_work(work);
> > 
> > I have a new concern when I re-visit this patchset.
> > 
> > Form kvm_check_async_pf_completion(), I see async_pf.queue is always a
> > superset of async_pf.done (except wake-all work, which is not within
> > concern).  And done work would be skipped from sync (cancel_work_sync()) by:
> > 
> >                 if (!work->vcpu)
> >                         continue;
> > 
> > But now with this patch we also sync done works, how about we just sync all
> > queued work instead.
> 
> Hmm, IIUC, I think we can simply revert commit 22583f0d9c85 ("KVM: async_pf: avoid
> recursive flushing of work items").

Ah, yes. This also make me clear about the history of the confusing spin_lock.
Reverting is good to me.

Thanks,
Yilun
diff mbox series

Patch

diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index e033c79d528e..876927a558ad 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -87,7 +87,25 @@  static void async_pf_execute(struct work_struct *work)
 	__kvm_vcpu_wake_up(vcpu);
 
 	mmput(mm);
-	kvm_put_kvm(vcpu->kvm);
+}
+
+static void kvm_flush_and_free_async_pf_work(struct kvm_async_pf *work)
+{
+	/*
+	 * The async #PF is "done", but KVM must wait for the work item itself,
+	 * i.e. async_pf_execute(), to run to completion.  If KVM is a module,
+	 * KVM must ensure *no* code owned by the KVM (the module) can be run
+	 * after the last call to module_put(), i.e. after the last reference
+	 * to the last vCPU's file is put.
+	 *
+	 * Wake all events skip the queue and go straight done, i.e. don't
+	 * need to be flushed (but sanity check that the work wasn't queued).
+	 */
+	if (work->wakeup_all)
+		WARN_ON_ONCE(work->work.func);
+	else
+		flush_work(&work->work);
+	kmem_cache_free(async_pf_cache, work);
 }
 
 void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
@@ -114,7 +132,6 @@  void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
 #else
 		if (cancel_work_sync(&work->work)) {
 			mmput(work->mm);
-			kvm_put_kvm(vcpu->kvm); /* == work->vcpu->kvm */
 			kmem_cache_free(async_pf_cache, work);
 		}
 #endif
@@ -126,7 +143,18 @@  void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
 			list_first_entry(&vcpu->async_pf.done,
 					 typeof(*work), link);
 		list_del(&work->link);
-		kmem_cache_free(async_pf_cache, work);
+
+		spin_unlock(&vcpu->async_pf.lock);
+
+		/*
+		 * The async #PF is "done", but KVM must wait for the work item
+		 * itself, i.e. async_pf_execute(), to run to completion.  If
+		 * KVM is a module, KVM must ensure *no* code owned by the KVM
+		 * (the module) can be run after the last call to module_put(),
+		 * i.e. after the last reference to the last vCPU's file is put.
+		 */
+		kvm_flush_and_free_async_pf_work(work);
+		spin_lock(&vcpu->async_pf.lock);
 	}
 	spin_unlock(&vcpu->async_pf.lock);
 
@@ -151,7 +179,7 @@  void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
 
 		list_del(&work->queue);
 		vcpu->async_pf.queued--;
-		kmem_cache_free(async_pf_cache, work);
+		kvm_flush_and_free_async_pf_work(work);
 	}
 }
 
@@ -186,7 +214,6 @@  bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 	work->arch = *arch;
 	work->mm = current->mm;
 	mmget(work->mm);
-	kvm_get_kvm(work->vcpu->kvm);
 
 	INIT_WORK(&work->work, async_pf_execute);