
virtio-blk: dataplane: release AioContext before blk_set_aio_context

Message ID 20190227165212.34901-1-slp@redhat.com (mailing list archive)
State New, archived
Series virtio-blk: dataplane: release AioContext before blk_set_aio_context

Commit Message

Sergio Lopez Feb. 27, 2019, 4:52 p.m. UTC
Stopping the dataplane requires calling blk_set_aio_context(), which
may need to wait for a running job to complete or pause.

Stopping the dataplane can be triggered from a vcpu thread (because the
guest requested that the device be stopped), while the job itself may
be managed by an iothread. Holding the AioContext lock while waiting
therefore leads to a deadlock: the vcpu thread waits for the job to
pause or finish, while the iothread can't make any progress because it
is waiting for the AioContext lock to be released:

 Thread 6 (LWP 90472)
 #0  0x000055a6f8497aee in blk_root_drained_end
 #1  0x000055a6f84a7926 in bdrv_parent_drained_end
 #2  0x000055a6f84a799f in bdrv_do_drained_end
 #3  0x000055a6f84a82ab in bdrv_drained_end
 #4  0x000055a6f8498be8 in blk_drain
 #5  0x000055a6f84a22cd in mirror_drain
 #6  0x000055a6f8457708 in block_job_detach_aio_context
 #7  0x000055a6f84538f1 in bdrv_detach_aio_context
 #8  0x000055a6f8453a96 in bdrv_set_aio_context
 #9  0x000055a6f84999f8 in blk_set_aio_context
 #10 0x000055a6f81802c8 in virtio_blk_data_plane_stop
 #11 0x000055a6f83cfc95 in virtio_bus_stop_ioeventfd
 #12 0x000055a6f83d1f81 in virtio_pci_common_write
 #13 0x000055a6f83d1f81 in virtio_pci_common_write
 #14 0x000055a6f8148d62 in memory_region_write_accessor
 #15 0x000055a6f81459c9 in access_with_adjusted_size
 #16 0x000055a6f814a608 in memory_region_dispatch_write
 #17 0x000055a6f80f1f98 in flatview_write_continue
 #18 0x000055a6f80f214f in flatview_write
 #19 0x000055a6f80f6a7b in address_space_write
 #20 0x000055a6f80f6b15 in address_space_rw
 #21 0x000055a6f815da08 in kvm_cpu_exec
 #22 0x000055a6f8132fee in qemu_kvm_cpu_thread_fn
 #23 0x000055a6f8551306 in qemu_thread_start
 #24 0x00007f9bdf5b9dd5 in start_thread
 #25 0x00007f9bdf2e2ead in clone

 Thread 8 (LWP 90467)
 #0  0x00007f9bdf5c04ed in __lll_lock_wait
 #1  0x00007f9bdf5bbde6 in _L_lock_941
 #2  0x00007f9bdf5bbcdf in __GI___pthread_mutex_lock
 #3  0x000055a6f8551447 in qemu_mutex_lock_impl
 #4  0x000055a6f854be77 in co_schedule_bh_cb
 #5  0x000055a6f854b781 in aio_bh_poll
 #6  0x000055a6f854b781 in aio_bh_poll
 #7  0x000055a6f854f01b in aio_poll
 #8  0x000055a6f825a488 in iothread_run
 #9  0x000055a6f8551306 in qemu_thread_start
 #10 0x00007f9bdf5b9dd5 in start_thread
 #11 0x00007f9bdf2e2ead in clone

 (gdb) thread 8
 [Switching to thread 8 (Thread 0x7f9bd6dae700 (LWP 90467))]
 #3  0x000055a6f8551447 in qemu_mutex_lock_impl (mutex=0x55a6fac8fea0,
     file=0x55a6f8730c3f "util/async.c", line=511) at util/qemu-thread-posix.c:66
 66	    err = pthread_mutex_lock(&mutex->lock);
 (gdb) up 3
 #3  0x000055a6f8551447 in qemu_mutex_lock_impl (mutex=0x55a6fac8fea0,
     file=0x55a6f8730c3f "util/async.c", line=511) at util/qemu-thread-posix.c:66
 66	    err = pthread_mutex_lock(&mutex->lock);
 (gdb) p mutex.lock.__data.__owner
 $6 = 90472

Signed-off-by: Sergio Lopez <slp@redhat.com>
---
 hw/block/dataplane/virtio-blk.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

Comments

Kevin Wolf Feb. 27, 2019, 5:37 p.m. UTC | #1
Am 27.02.2019 um 17:52 hat Sergio Lopez geschrieben:
> [...]
> 
> diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
> index 8c37bd314a..358e6ae89b 100644
> --- a/hw/block/dataplane/virtio-blk.c
> +++ b/hw/block/dataplane/virtio-blk.c
> @@ -280,12 +280,11 @@ void virtio_blk_data_plane_stop(VirtIODevice *vdev)
>  
>      aio_context_acquire(s->ctx);
>      aio_wait_bh_oneshot(s->ctx, virtio_blk_data_plane_stop_bh, s);
> +    aio_context_release(s->ctx);
>  
>      /* Drain and switch bs back to the QEMU main loop */
>      blk_set_aio_context(s->conf->conf.blk, qemu_get_aio_context());
>  
> -    aio_context_release(s->ctx);
> -

blk_set_aio_context() requires that the caller hold the lock for the
source context, so I'm afraid this is wrong.
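
To illustrate the contract (a rough sketch of the expected calling
pattern, not a quote of the actual code): the caller keeps holding the
lock of the node's current (source) context across the switch, which is
what the unpatched function does:

    /* sketch only: hold the source AioContext lock around the switch */
    AioContext *old_ctx = blk_get_aio_context(blk);   /* e.g. the iothread */

    aio_context_acquire(old_ctx);
    blk_set_aio_context(blk, qemu_get_aio_context()); /* drain + move node */
    aio_context_release(old_ctx);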

However, I think the problem might already be fixed with my "block: Use
normal drain for bdrv_set_aio_context()" (contained in a pull request,
which isn't merged yet), which makes bdrv_set_aio_context() use a real
drain operation. This way, the requests of the mirror jobs should be
already drained before we even call into block_job_detach_aio_context().

Can you give this a try and see whether it fixes your problem?

Kevin
Sergio Lopez Feb. 28, 2019, 3:01 p.m. UTC | #2
On Wed, Feb 27, 2019 at 06:37:14PM +0100, Kevin Wolf wrote:
> Am 27.02.2019 um 17:52 hat Sergio Lopez geschrieben:
> > [...]
> 
> blk_set_aio_context() requires that the caller hold the lock for the
> source context, so I'm afraid this is wrong.

TBH, I was quite sure this patch was wrong myself, but I thought it was
still a good way to illustrate the issue.

> However, I think the problem might already be fixed with my "block: Use
> normal drain for bdrv_set_aio_context()" (contained in a pull request,
> which isn't merged yet), which makes bdrv_set_aio_context() use a real
> drain operation. This way, the requests of the mirror jobs should be
> already drained before we even call into block_job_detach_aio_context().
> 
> Can you give this a try and see whether it fixes your problem?

I've applied your patchset to my local copy, but it doesn't fix the
issue.

The problem is that the coroutine is already scheduled to run in the
iothread context, which means job->busy == true, so we can't switch to
it from any other place.

I think we should give it a chance to run. What do you think of
something like this:

diff --git a/blockjob.c b/blockjob.c
index 58de8cb..1695bef 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -135,9 +135,8 @@ static void block_job_detach_aio_context(void *opaque)

     job_pause(&job->job);

-    while (!job->job.paused && !job_is_completed(&job->job)) {
-        job_drain(&job->job);
-    }
+    AIO_WAIT_WHILE(job->job.aio_context,
+        (job_drain(&job->job), !job->job.paused && !job_is_completed(&job->job)));

     job->job.aio_context = NULL;
     job_unref(&job->job);
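
The key difference (a simplification of the AIO_WAIT_WHILE() semantics
in include/block/aio-wait.h of that era, so treat the details as an
assumption): the plain while loop above spins in the caller's thread
while it keeps holding the job's AioContext lock, whereas
AIO_WAIT_WHILE() drops that lock around each wait, so the iothread can
acquire it, run the scheduled coroutine, and eventually make the
condition false:

    /* sketch: why the macro can make progress where the loop hangs */
    AIO_WAIT_WHILE(job->job.aio_context,
                   (job_drain(&job->job),              /* kick the job     */
                    !job->job.paused &&                /* until paused ... */
                    !job_is_completed(&job->job)));    /* ... or completed */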

Thanks,
Sergio (slp).
Kevin Wolf Feb. 28, 2019, 3:50 p.m. UTC | #3
Am 28.02.2019 um 16:01 hat Sergio Lopez geschrieben:
> On Wed, Feb 27, 2019 at 06:37:14PM +0100, Kevin Wolf wrote:
> > Am 27.02.2019 um 17:52 hat Sergio Lopez geschrieben:
> > > [...]
> > 
> > blk_set_aio_context() requires that the caller hold the lock for the
> > source context, so I'm afraid this is wrong.
> 
> TBH, I was quite sure this patch was wrong myself, but I thought it was
> still a good way to illustrate the issue.
> 
> > However, I think the problem might already be fixed with my "block: Use
> > normal drain for bdrv_set_aio_context()" (contained in a pull request,
> > which isn't merged yet), which makes bdrv_set_aio_context() use a real
> > drain operation. This way, the requests of the mirror jobs should be
> > already drained before we even call into block_job_detach_aio_context().
> > 
> > Can you give this a try and see whether it fixes your problem?
> 
> I've applied your patchset to my local copy, but it doesn't fix the
> issue.
> 
> The problem is the coroutine is already scheduled to be run in the
> iothread context, which means job->busy == true, so we can't switch to
> it from any other place.

I still don't understand this because with job->paused == true the
hanging loop wouldn't even be started. And after bdrv_drained_begin()
returns, all jobs that use the node in question should be paused (see
child_job_drained_begin/poll).

So somehow that drain (called in bdrv_set_aio_context()) doesn't seem to
fully do what it is supposed to do?

> I think we should give it a chance to run. What do you think of
> something like this:
> 
> diff --git a/blockjob.c b/blockjob.c
> index 58de8cb..1695bef 100644
> --- a/blockjob.c
> +++ b/blockjob.c
> @@ -135,9 +135,8 @@ static void block_job_detach_aio_context(void *opaque)
> 
>      job_pause(&job->job);
> 
> -    while (!job->job.paused && !job_is_completed(&job->job)) {
> -        job_drain(&job->job);
> -    }
> +    AIO_WAIT_WHILE(job->job.aio_context,
> +        (job_drain(&job->job), !job->job.paused && !job_is_completed(&job->job)));
> 
>      job->job.aio_context = NULL;
>      job_unref(&job->job);

If that's enough, it looks okay to me (and very similar to what we
already do in job_finish_sync()).

But I'd still like to understand why the drain didn't work.

Kevin
Sergio Lopez Feb. 28, 2019, 5:04 p.m. UTC | #4
On Thu, Feb 28, 2019 at 04:50:53PM +0100, Kevin Wolf wrote:
> Am 28.02.2019 um 16:01 hat Sergio Lopez geschrieben:
> > On Wed, Feb 27, 2019 at 06:37:14PM +0100, Kevin Wolf wrote:
> > > Am 27.02.2019 um 17:52 hat Sergio Lopez geschrieben:
> > > > [...]
> > > 
> > > blk_set_aio_context() requires that the caller hold the lock for the
> > > source context, so I'm afraid this is wrong.
> > 
> > TBH, I was quite sure this patch was wrong myself, but I thought it was
> > still a good way to illustrate the issue.
> > 
> > > However, I think the problem might already be fixed with my "block: Use
> > > normal drain for bdrv_set_aio_context()" (contained in a pull request,
> > > which isn't merged yet), which makes bdrv_set_aio_context() use a real
> > > drain operation. This way, the requests of the mirror jobs should be
> > > already drained before we even call into block_job_detach_aio_context().
> > > 
> > > Can you give this a try and see whether it fixes your problem?
> > 
> > I've applied your patchset to my local copy, but it doesn't fix the
> > issue.
> > 
> > The problem is the coroutine is already scheduled to be run in the
> > iothread context, which means job->busy == true, so we can't switch to
> > it from any other place.
> 
> I still don't understand this because with job->paused == true the
> hanging loop wouldn't even be started. And after bdrv_drained_begin()
> returns, all jobs that use the node in question should be paused (see
> child_job_drained_begin/poll).
> 
> So somehow that drain (called in bdrv_set_aio_context()) doesn't seem to
> fully do what it is supposed to do?

IIUC, child_job_drained_begin() requests the job to be paused (something
that block_job_detach_aio_context() also does), but we don't get to
child_job_drained_poll() as bdrv_parent_drained_begin_single() is called
with "poll == false" by bdrv_parent_drained_begin().

But even if we did, that probably won't help in some scenarios, as
mirror's drained_poll implementation just checks if there are in_flight
requests, so it might happen that BDRV_POLL_WHILE returns without having
called aio_wait() a single time.
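
For reference, mirror's .drained_poll at that time was roughly the
following (reproduced from memory of block/mirror.c, so treat the exact
names as an assumption); it only reports outstanding requests and says
nothing about whether the job coroutine has actually paused:

    static bool mirror_drained_poll(BlockJob *job)
    {
        MirrorBlockJob *s = container_of(job, MirrorBlockJob, common);

        /* only looks at in-flight I/O, not at the pause state */
        return !!s->in_flight;
    }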

In other words, I think the drain code makes sure there aren't any
in_flight requests in the chain, but doesn't provide by itself a
guarantee that the jobs have been paused.

Thanks,
Sergio (slp).
Kevin Wolf Feb. 28, 2019, 5:22 p.m. UTC | #5
Am 28.02.2019 um 18:04 hat Sergio Lopez geschrieben:
> On Thu, Feb 28, 2019 at 04:50:53PM +0100, Kevin Wolf wrote:
> > Am 28.02.2019 um 16:01 hat Sergio Lopez geschrieben:
> > > On Wed, Feb 27, 2019 at 06:37:14PM +0100, Kevin Wolf wrote:
> > > > Am 27.02.2019 um 17:52 hat Sergio Lopez geschrieben:
> > > > > [...]
> > > > 
> > > > blk_set_aio_context() requires that the caller hold the lock for the
> > > > source context, so I'm afraid this is wrong.
> > > 
> > > TBH, I was quite sure this patch was wrong myself, but I thought it was
> > > still a good way to illustrate the issue.
> > > 
> > > > However, I think the problem might already be fixed with my "block: Use
> > > > normal drain for bdrv_set_aio_context()" (contained in a pull request,
> > > > which isn't merged yet), which makes bdrv_set_aio_context() use a real
> > > > drain operation. This way, the requests of the mirror jobs should be
> > > > already drained before we even call into block_job_detach_aio_context().
> > > > 
> > > > Can you give this a try and see whether it fixes your problem?
> > > 
> > > I've applied your patchset to my local copy, but it doesn't fix the
> > > issue.
> > > 
> > > The problem is the coroutine is already scheduled to be run in the
> > > iothread context, which means job->busy == true, so we can't switch to
> > > it from any other place.
> > 
> > I still don't understand this because with job->paused == true the
> > hanging loop wouldn't even be started. And after bdrv_drained_begin()
> > returns, all jobs that use the node in question should be paused (see
> > child_job_drained_begin/poll).
> > 
> > So somehow that drain (called in bdrv_set_aio_context()) doesn't seem to
> > fully do what it is supposed to do?
> 
> IIUC, child_job_drained_begin() requests the job to be paused (something
> that block_job_detach_aio_context() also does), but we don't get to
> child_job_drained_poll() as bdrv_parent_drained_begin_single() is called
> with "poll == false" by bdrv_parent_drained_begin().
> 
> But even if we did, that probably won't help in some scenarios, as
> mirror's drained_poll implementation just checks if there are in_flight
> requests, so it might happen that BDRV_POLL_WHILE returns without having
> called aio_wait() a single time.
> 
> In other words, I think the drain code makes sure there aren't any
> in_flight requests in the chain, but doesn't provide by itself a
> guarantee that the jobs have been paused.

Yes, seems this is what it does. But I think that's not enough because
if the job isn't paused, it can still issue new requests.

I'm not sure whether mirror_drained_poll() or child_job_drained_poll()
is to blame, but the logic is wrong there: We should only return that
the job is quiescent when it has reached a pause point _and_
s->in_flight == 0.

IIUC, the reason why mirror even needs this .drained_poll implementation
is that it spawns additional coroutines, so waiting just for the main
coroutine to reach a pause point is not enough because the other
coroutines could still be active. But this is only an additional
condition, not one that could replace the condition that the main
coroutine must be inactive, too.

Kevin
Sergio Lopez Feb. 28, 2019, 6:36 p.m. UTC | #6
On Thu, Feb 28, 2019 at 06:22:02PM +0100, Kevin Wolf wrote:
> Am 28.02.2019 um 18:04 hat Sergio Lopez geschrieben:
> > On Thu, Feb 28, 2019 at 04:50:53PM +0100, Kevin Wolf wrote:
> > > Am 28.02.2019 um 16:01 hat Sergio Lopez geschrieben:
> > > > On Wed, Feb 27, 2019 at 06:37:14PM +0100, Kevin Wolf wrote:
> > > > > Am 27.02.2019 um 17:52 hat Sergio Lopez geschrieben:
> > > > > > [...]
> > > > > 
> > > > > blk_set_aio_context() requires that the caller hold the lock for the
> > > > > source context, so I'm afraid this is wrong.
> > > > 
> > > > TBH, I was quite sure this patch was wrong myself, but I thought it was
> > > > still a good way to illustrate the issue.
> > > > 
> > > > > However, I think the problem might already be fixed with my "block: Use
> > > > > normal drain for bdrv_set_aio_context()" (contained in a pull request,
> > > > > which isn't merged yet), which makes bdrv_set_aio_context() use a real
> > > > > drain operation. This way, the requests of the mirror jobs should be
> > > > > already drained before we even call into block_job_detach_aio_context().
> > > > > 
> > > > > Can you give this a try and see whether it fixes your problem?
> > > > 
> > > > I've applied your patchset to my local copy, but it doesn't fix the
> > > > issue.
> > > > 
> > > > The problem is the coroutine is already scheduled to be run in the
> > > > iothread context, which means job->busy == true, so we can't switch to
> > > > it from any other place.
> > > 
> > > I still don't understand this because with job->paused == true the
> > > hanging loop wouldn't even be started. And after bdrv_drained_begin()
> > > returns, all jobs that use the node in question should be paused (see
> > > child_job_drained_begin/poll).
> > > 
> > > So somehow that drain (called in bdrv_set_aio_context()) doesn't seem to
> > > fully do what it is supposed to do?
> > 
> > IIUC, child_job_drained_begin() requests the job to be paused (something
> > that block_job_detach_aio_context() also does), but we don't get to
> > child_job_drained_poll() as bdrv_parent_drained_begin_single() is called
> > with "poll == false" by bdrv_parent_drained_begin().
> > 
> > But even if we did, that probably won't help in some scenarios, as
> > mirror's drained_poll implementation just checks if there are in_flight
> > requests, so it might happen that BDRV_POLL_WHILE returns without having
> > called aio_wait() a single time.
> > 
> > In other words, I think the drain code makes sure there aren't any
> > in_flight requests in the chain, but doesn't provide by itself a
> > guarantee that the jobs have been paused.
> 
> Yes, seems this is what it does. But I think that's not enough because
> if the job isn't paused, it can still issue new requests.
> 
> I'm not sure whether mirror_drained_poll() or child_job_drained_poll()
> is to blame, but the logic is wrong there: We should only return that
> the job is quiescent when it has reached a pause point _and_
> s->in_flight == 0.

If we expect this to be the case even when
bdrv_parent_drained_begin_single() is called from
bdrv_parent_drained_begin() with poll == false, then we need to make
the job_pause() at child_job_drained_begin() take effect immediately,
probably putting an AIO_WAIT_WHILE afterwards.

Otherwise, we can simply add an extra condition at
child_job_drained_poll(), before the drv->drained_poll(), to return true
if the job isn't yet paused.

Thanks,
Sergio (slp).
Kevin Wolf March 1, 2019, 11:44 a.m. UTC | #7
Am 28.02.2019 um 19:36 hat Sergio Lopez geschrieben:
> On Thu, Feb 28, 2019 at 06:22:02PM +0100, Kevin Wolf wrote:
> > Am 28.02.2019 um 18:04 hat Sergio Lopez geschrieben:
> > > On Thu, Feb 28, 2019 at 04:50:53PM +0100, Kevin Wolf wrote:
> > > > Am 28.02.2019 um 16:01 hat Sergio Lopez geschrieben:
> > > > > On Wed, Feb 27, 2019 at 06:37:14PM +0100, Kevin Wolf wrote:
> > > > > > Am 27.02.2019 um 17:52 hat Sergio Lopez geschrieben:
> > > > > > > [...]
> > > > > > 
> > > > > > blk_set_aio_context() requires that the caller hold the lock for the
> > > > > > source context, so I'm afraid this is wrong.
> > > > > 
> > > > > TBH, I was quite sure this patch was wrong myself, but I thought it was
> > > > > still a good way to illustrate the issue.
> > > > > 
> > > > > > However, I think the problem might already be fixed with my "block: Use
> > > > > > normal drain for bdrv_set_aio_context()" (contained in a pull request,
> > > > > > which isn't merged yet), which makes bdrv_set_aio_context() use a real
> > > > > > drain operation. This way, the requests of the mirror jobs should be
> > > > > > already drained before we even call into block_job_detach_aio_context().
> > > > > > 
> > > > > > Can you give this a try and see whether it fixes your problem?
> > > > > 
> > > > > I've applied your patchset to my local copy, but it doesn't fix the
> > > > > issue.
> > > > > 
> > > > > The problem is the coroutine is already scheduled to be run in the
> > > > > iothread context, which means job->busy == true, so we can't switch to
> > > > > it from any other place.
> > > > 
> > > > I still don't understand this because with job->paused == true the
> > > > hanging loop wouldn't even be started. And after bdrv_drained_begin()
> > > > returns, all jobs that use the node in question should be paused (see
> > > > child_job_drained_begin/poll).
> > > > 
> > > > So somehow that drain (called in bdrv_set_aio_context()) doesn't seem to
> > > > fully do what it is supposed to do?
> > > 
> > > IIUC, child_job_drained_begin() requests the job to be paused (something
> > > that block_job_detach_aio_context() also does), but we don't get to
> > > child_job_drained_poll() as bdrv_parent_drained_begin_single() is called
> > > with "poll == false" by bdrv_parent_drained_begin().
> > > 
> > > But even if we did, that probably won't help in some scenarios, as
> > > mirror's drained_poll implementation just checks if there are in_flight
> > > requests, so it might happen that BDRV_POLL_WHILE returns without having
> > > called aio_wait() a single time.
> > > 
> > > In other words, I think the drain code makes sure there aren't any
> > > in_flight requests in the chain, but doesn't provide by itself a
> > > guarantee that the jobs have been paused.
> > 
> > Yes, seems this is what it does. But I think that's not enough because
> > if the job isn't paused, it can still issue new requests.
> > 
> > I'm not sure whether mirror_drained_poll() or child_job_drained_poll()
> > is to blame, but the logic is wrong there: We should only return that
> > the job is quiescent when it has reached a pause point _and_
> > s->in_flight == 0.
> 
> If we expect this to be the case even when
> bdrv_parent_drained_begin_single() is called from
> bdrv_parent_drained_begin() with poll == false, then we need to make
> the job_pause() at child_job_drained_begin() take effect immediately,
> probably putting an AIO_WAIT_WHILE afterwards.

.drained_begin callbacks must not call AIO_WAIT_WHILE (or any kind of
aio_poll()) because that could run callbacks that change the graph,
which would mess with the drained_begin recursion.

The design of drain after my recent fixes is that .drained_begin does
whatever is needed so that things can become quiescent, and
then there is only a single AIO_WAIT_WHILE() at the top level which
recursively calls .drained_poll callbacks to check whether we need to
wait longer (and which may not make further aio_poll() calls either).

This way, graph changes can never occur during a recursion.

bdrv_parent_drained_begin_single() is only called with poll == false if
we know that the caller will already poll.
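
Schematically, the shape is roughly this (heavily simplified from
block/io.c of that era; an approximation, not the literal code):

    static void bdrv_do_drained_begin(BlockDriverState *bs, ..., bool poll)
    {
        /* .drained_begin / driver callbacks only *initiate* quiescence;
         * they must not poll or change the graph */
        bdrv_parent_drained_begin(bs, ...);
        bdrv_drain_invoke(bs, true);

        if (poll) {
            /* the single top-level poll loop: keep waiting while any
             * .drained_poll in the subtree still reports activity */
            BDRV_POLL_WHILE(bs, bdrv_drain_poll(bs, ...));
        }
    }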

> Otherwise, we can simply add an extra condition at
> child_job_drained_poll(), before the drv->drained_poll(), to return
> true if the job isn't yet paused.

Yes, I think something like this is the right fix.

Kevin
Sergio Lopez March 1, 2019, 1:47 p.m. UTC | #8
Kevin Wolf writes:

> Am 28.02.2019 um 19:36 hat Sergio Lopez geschrieben:
>> On Thu, Feb 28, 2019 at 06:22:02PM +0100, Kevin Wolf wrote:
>> > Am 28.02.2019 um 18:04 hat Sergio Lopez geschrieben:
>> > > On Thu, Feb 28, 2019 at 04:50:53PM +0100, Kevin Wolf wrote:
>> > > > Am 28.02.2019 um 16:01 hat Sergio Lopez geschrieben:
>> > > > > On Wed, Feb 27, 2019 at 06:37:14PM +0100, Kevin Wolf wrote:
>> > > > > > Am 27.02.2019 um 17:52 hat Sergio Lopez geschrieben:
>> > > > > > > [...]
>> > > > > > 
>> > > > > > blk_set_aio_context() requires that the caller hold the lock for the
>> > > > > > source context, so I'm afraid this is wrong.
>> > > > > 
>> > > > > TBH, I was quite sure this patch was wrong myself, but I thought it was
>> > > > > still a good way to illustrate the issue.
>> > > > > 
>> > > > > > However, I think the problem might already be fixed with my "block: Use
>> > > > > > normal drain for bdrv_set_aio_context()" (contained in a pull request,
>> > > > > > which isn't merged yet), which makes bdrv_set_aio_context() use a real
>> > > > > > drain operation. This way, the requests of the mirror jobs should be
>> > > > > > already drained before we even call into block_job_detach_aio_context().
>> > > > > > 
>> > > > > > Can you give this a try and see whether it fixes your problem?
>> > > > > 
>> > > > > I've applied your patchset to my local copy, but it doesn't fix the
>> > > > > issue.
>> > > > > 
>> > > > > The problem is the coroutine is already scheduled to be run in the
>> > > > > iothread context, which means job->busy == true, so we can't switch to
>> > > > > it from any other place.
>> > > > 
>> > > > I still don't understand this because with job->paused == true the
>> > > > hanging loop wouldn't even be started. And after bdrv_drained_begin()
>> > > > returns, all jobs that use the node in question should be paused (see
>> > > > child_job_drained_begin/poll).
>> > > > 
>> > > > So somehow that drain (called in bdrv_set_aio_context()) doesn't seem to
>> > > > fully do what it is supposed to do?
>> > > 
>> > > IIUC, child_job_drained_begin() requests the job to be paused (something
>> > > that block_job_detach_aio_context() also does), but we don't get to
>> > > child_job_drained_poll() as bdrv_parent_drained_begin_single() is called
>> > > with "poll == false" by bdrv_parent_drained_begin().
>> > > 
>> > > But even if we did, that probably won't help in some scenarios, as
>> > > mirror's drained_poll implementation just checks if there are in_flight
>> > > requests, so it might happen that BDRV_POLL_WHILE returns without having
>> > > called aio_wait() a single time.
>> > > 
>> > > In other words, I think the drain code makes sure there aren't any
>> > > in_flight requests in the chain, but doesn't provide by itself a
>> > > guarantee that the jobs have been paused.
>> > 
>> > Yes, seems this is what it does. But I think that's not enough because
>> > if the job isn't paused, it can still issue new requests.
>> > 
>> > I'm not sure whether mirror_drained_poll() or child_job_drained_poll()
>> > is to blame, but the logic is wrong there: We should only return that
>> > the job is quiescent when it has reached a pause point _and_
>> > s->in_flight == 0.
>> 
>> If we expect this to be the case even when
>> bdrv_parent_drained_begin_single() is called from
>> bdrv_parent_drained_begin() with poll == false, then we need to make the
>> job_pause() at child_job_drained_begin() take effect immediately,
>> probably by putting an AIO_WAIT_WHILE afterwards.
>
> .drained_begin callbacks must not call AIO_WAIT_WHILE (or any kind of
> aio_poll()) because that could run callbacks that change the graph,
> which would mess with the drained_begin recursion.
>
> The design of drain after my recent fixes is that .drained_begin does
> whatever is needed so that things can become quiescent, and
> then there is only a single AIO_WAIT_WHILE() at the top level which
> recursively calls .drained_poll callbacks to check whether we need to
> wait longer (and which may not make further aio_poll() calls either).
>
> This way, graph changes can never occur during a recursion.
>
> bdrv_parent_drained_begin_single() is only called with poll == false if
> we know that the caller will already poll.

Makes sense, thanks for the explanation.

>> Otherwise, we can simply add an extra condition at
>> child_job_drained_poll(), before the drv->drained_poll(), to return
>> true if the job isn't yet paused.
>
> Yes, I think something like this is the right fix.
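
(For the archive, this is roughly the kind of check I have in mind. I'm
writing the names from memory based on blockjob.c, so please read it as a
sketch rather than the final patch:)

    static bool child_job_drained_poll(BdrvChild *c)
    {
        BlockJob *bjob = c->opaque;
        Job *job = &bjob->job;
        const BlockJobDriver *drv = block_job_driver(bjob);

        /* A job that hasn't reached its pause point yet can still issue
         * new requests, so report it as busy no matter what the
         * driver-specific callback would say. */
        if (!job->paused && !job_is_completed(job)) {
            return true;
        }

        /* Only then defer to the driver (e.g. mirror's in_flight check). */
        if (drv->drained_poll) {
            return drv->drained_poll(bjob);
        }
        return true;
    }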

Fixing this has uncovered another issue, also triggered by issuing
'block_commit' and 'device_del' consecutively. At the end, mirror_run()
calls bdrv_drained_begin(), which is scheduled to run later (via
bdrv_co_yield_to_drain()) because the mirror job is running in a coroutine.

At the same time, the Guest requests the device to be unplugged, which
leads to blk_unref()->blk_drain()->bdrv_do_drained_begin(). When the
latter reaches BDRV_POLL_WHILE, the bdrv_drained_begin scheduled above
is run, which also runs BDRV_POLL_WHILE, leading to the thread getting
stuck in aio_poll().
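
(To make the nesting easier to see, here is a much simplified sketch of the
entry point, with the argument list trimmed down -- so not a verbatim copy
of block/io.c:)

    void bdrv_do_drained_begin(BlockDriverState *bs, bool poll /* , ... */)
    {
        if (qemu_in_coroutine()) {
            /* Called from the mirror job coroutine: the whole drain,
             * including its BDRV_POLL_WHILE(), is deferred to a BH in
             * bs's AioContext. */
            bdrv_co_yield_to_drain(bs, true, poll /* , ... */);
            return;
        }

        /* Non-coroutine path (e.g. blk_unref() -> blk_drain() from
         * device_del): this is the BDRV_POLL_WHILE() that ends up running
         * the BH scheduled above, which then starts a second, nested
         * BDRV_POLL_WHILE(). */
    }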

Is it really safe to schedule a bdrv_drained_begin() with poll == true?

Thanks,
Sergio (slp).
Kevin Wolf March 1, 2019, 3:50 p.m. UTC | #9
Am 01.03.2019 um 14:47 hat Sergio Lopez geschrieben:
> >> Otherwise, we can simply add an extra condition at
> >> child_job_drained_poll(), before the drv->drained_poll(), to return
> >> true if the job isn't yet paused.
> >
> > Yes, I think something like this is the right fix.
> 
> Fixing this has uncovered another issue, also triggered by issuing
> 'block_commit' and 'device_del' consecutively. At the end, mirror_run()
> calls bdrv_drained_begin(), which is scheduled to run later (via
> bdrv_co_yield_to_drain()) because the mirror job is running in a coroutine.
> 
> At the same time, the Guest requests the device to be unplugged, which
> leads to blk_unref()->blk_drain()->bdrv_do_drained_begin(). When the
> latter reaches BDRV_POLL_WHILE, the bdrv_drained_begin scheduled above
> is run, which also runs BDRV_POLL_WHILE, leading to the thread getting
> stuck in aio_poll().
> 
> Is it really safe to schedule a bdrv_drained_begin() with poll == true?

I don't see what the problem would be with it in theory. Once the node
becomes idle, both the inner and the outer BDRV_POLL_WHILE() should
return.

The question with such hangs is usually: what is the condition that made
bdrv_drain_poll() return true, and why aren't we making progress so that
it would become false. With iothreads, it could also be that the
condition has actually already changed, but aio_wait_kick() wasn't
called, so aio_poll() isn't woken up.
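
(Schematically, and with made-up placeholder conditions rather than real
code from the tree, the pairing I mean is:)

    static void waiter_example(AioContext *ctx, BlockDriverState *bs)
    {
        /* Re-evaluates the condition each time aio_wait_kick() wakes it
         * up; until then it sits in aio_poll(). */
        AIO_WAIT_WHILE(ctx, still_waiting_for_something(bs));
    }

    static void iothread_example(BlockDriverState *bs)
    {
        /* After making the condition false (last request completed, job
         * reached its pause point, ...) the other thread must kick the
         * waiter, otherwise it can stay blocked in aio_poll() even though
         * there is nothing left to wait for. */
        make_condition_false(bs);
        aio_wait_kick();
    }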

Kevin
Sergio Lopez March 6, 2019, 12:47 p.m. UTC | #10
Kevin Wolf writes:

> Am 01.03.2019 um 14:47 hat Sergio Lopez geschrieben:
>> >> Otherwise, we can simply add an extra condition at
>> >> child_job_drained_poll(), before the drv->drained_poll(), to return
>> >> true if the job isn't yet paused.
>> >
>> > Yes, I think something like this is the right fix.
>> 
>> Fixing this has uncovered another issue, also triggered by issuing
>> 'block_commit' and 'device_del' consecutively. At the end, mirror_run()
>> calls bdrv_drained_begin(), which is scheduled to run later (via
>> bdrv_co_yield_to_drain()) because the mirror job is running in a coroutine.
>> 
>> At the same time, the Guest requests the device to be unplugged, which
>> leads to blk_unref()->blk_drain()->bdrv_do_drained_begin(). When the
>> latter reaches BDRV_POLL_WHILE, the bdrv_drained_begin scheduled above
>> is run, which also runs BDRV_POLL_WHILE, leading to the thread getting
>> stuck in aio_poll().
>> 
>> Is it really safe to schedule a bdrv_drained_begin() with poll == true?
>
> I don't see what the problem would be with it in theory. Once the node
> becomes idle, both the inner and the outer BDRV_POLL_WHILE() should
> return.
>
> The question with such hangs is usually: what is the condition that made
> bdrv_drain_poll() return true, and why aren't we making progress so that
> it would become false. With iothreads, it could also be that the
> condition has actually already changed, but aio_wait_kick() wasn't
> called, so aio_poll() isn't woken up.

Turns out we can't restrict child_job_drained_poll() to report the job as
quiescent only when it has already been effectively paused or cancelled,
as we may reach this point from job_finish_sync().

Do you think it's worth it to keep trying to make bdrv_drained_begin()
only return when the related jobs are completely paused, or should we
just use AIO_WAIT_WHILE at block_job_detach_aio_context() as previously
suggested?
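
For reference, the AIO_WAIT_WHILE variant I'm thinking of would look
something like this (untested, field names written from memory, so take it
as a sketch only):

    static void block_job_detach_aio_context(void *opaque)
    {
        BlockJob *job = opaque;

        /* Keep the job alive in case it completes while we wait. */
        job_ref(&job->job);

        job_pause(&job->job);

        /* Instead of only draining requests, wait until the job has
         * actually reached a pause point (or finished). */
        AIO_WAIT_WHILE(job->job.aio_context,
                       !job->job.paused && !job_is_completed(&job->job));

        job->job.aio_context = NULL;
        job_unref(&job->job);
    }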

Thanks,
Sergio (slp).
diff mbox series

Patch

diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
index 8c37bd314a..358e6ae89b 100644
--- a/hw/block/dataplane/virtio-blk.c
+++ b/hw/block/dataplane/virtio-blk.c
@@ -280,12 +280,11 @@  void virtio_blk_data_plane_stop(VirtIODevice *vdev)
 
     aio_context_acquire(s->ctx);
     aio_wait_bh_oneshot(s->ctx, virtio_blk_data_plane_stop_bh, s);
+    aio_context_release(s->ctx);
 
     /* Drain and switch bs back to the QEMU main loop */
     blk_set_aio_context(s->conf->conf.blk, qemu_get_aio_context());
 
-    aio_context_release(s->ctx);
-
     for (i = 0; i < nvqs; i++) {
         virtio_bus_set_host_notifier(VIRTIO_BUS(qbus), i, false);
         virtio_bus_cleanup_host_notifier(VIRTIO_BUS(qbus), i);