
[05/21] block: Introduce bdrv_schedule_unref()

Message ID 20230817125020.208339-6-kwolf@redhat.com (mailing list archive)
State New, archived
Series Graph locking part 4 (node management)

Commit Message

Kevin Wolf Aug. 17, 2023, 12:50 p.m. UTC
bdrv_unref() is called by a lot of places that need to hold the graph
lock (it naturally happens in the context of operations that change the
graph). However, bdrv_unref() takes the graph writer lock internally, so
it can't actually be called while already holding a graph lock without
causing a deadlock.

bdrv_unref() also can't just become GRAPH_WRLOCK because it drains the
node before closing it, and draining requires that the graph is
unlocked.

The solution is to defer deleting the node until we don't hold the lock
any more and draining is possible again.

Note that keeping images open for longer than necessary can create
problems, too: You can't open an image again before it is really closed
(if image locking didn't prevent it, it would cause corruption).
Reopening an image immediately happens at least during bdrv_open() and
bdrv_co_create().

In order to solve this problem, make sure to run the deferred unref in
bdrv_graph_wrunlock(), i.e. the first possible place where we can drain
again. This is also why bdrv_schedule_unref() is marked GRAPH_WRLOCK.

The output of iotest 051 is updated because the additional polling
changes the order of HMP output, resulting in a new "(qemu)" prompt in
the test output that was previously on a separate line and filtered out.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 include/block/block-global-state.h |  1 +
 block.c                            |  9 +++++++++
 block/graph-lock.c                 | 23 ++++++++++++++++-------
 tests/qemu-iotests/051.pc.out      |  6 +++---
 4 files changed, 29 insertions(+), 10 deletions(-)

Comments

Eric Blake Aug. 18, 2023, 4:23 p.m. UTC | #1
On Thu, Aug 17, 2023 at 02:50:04PM +0200, Kevin Wolf wrote:
> bdrv_unref() is called by a lot of places that need to hold the graph
> lock (it naturally happens in the context of operations that change the
> graph). However, bdrv_unref() takes the graph writer lock internally, so
> it can't actually be called while already holding a graph lock without
> causing a deadlock.
> 
> bdrv_unref() also can't just become GRAPH_WRLOCK because it drains the
> node before closing it, and draining requires that the graph is
> unlocked.
> 
> The solution is to defer deleting the node until we don't hold the lock
> any more and draining is possible again.
> 
> Note that keeping images open for longer than necessary can create
> problems, too: You can't open an image again before it is really closed
> (if image locking didn't prevent it, it would cause corruption).
> Reopening an image immediately happens at least during bdrv_open() and
> bdrv_co_create().
> 
> In order to solve this problem, make sure to run the deferred unref in
> bdrv_graph_wrunlock(), i.e. the first possible place where we can drain
> again. This is also why bdrv_schedule_unref() is marked GRAPH_WRLOCK.
> 
> The output of iotest 051 is updated because the additional polling
> changes the order of HMP output, resulting in a new "(qemu)" prompt in
> the test output that was previously on a separate line and filtered out.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---

> +++ b/block/graph-lock.c
> @@ -163,17 +163,26 @@ void bdrv_graph_wrlock(BlockDriverState *bs)
>  void bdrv_graph_wrunlock(void)
>  {
>      GLOBAL_STATE_CODE();
> -    QEMU_LOCK_GUARD(&aio_context_list_lock);
>      assert(qatomic_read(&has_writer));
>  
> +    WITH_QEMU_LOCK_GUARD(&aio_context_list_lock) {
> +        /*
> +         * No need for memory barriers, this works in pair with
> +         * the slow path of rdlock() and both take the lock.
> +         */
> +        qatomic_store_release(&has_writer, 0);
> +
> +        /* Wake up all coroutine that are waiting to read the graph */
> +        qemu_co_enter_all(&reader_queue, &aio_context_list_lock);

So if I understand coroutines correctly, this says all pending
coroutines are now scheduled to run (or maybe they do try to run here,
but then immediately return control back to this coroutine to await
the right lock conditions since we are still in the block guarded by
list_lock)...

> +    }
> +
>      /*
> -     * No need for memory barriers, this works in pair with
> -     * the slow path of rdlock() and both take the lock.
> +     * Run any BHs that were scheduled during the wrlock section and that
> +     * callers might expect to have finished (e.g. bdrv_unref() calls). Do this
> +     * only after restarting coroutines so that nested event loops in BHs don't
> +     * deadlock if their condition relies on the coroutine making progress.
>       */
> -    qatomic_store_release(&has_writer, 0);
> -
> -    /* Wake up all coroutine that are waiting to read the graph */
> -    qemu_co_enter_all(&reader_queue, &aio_context_list_lock);
> +    aio_bh_poll(qemu_get_aio_context());

...and as long as the other coroutines sharing this thread don't
actually get to make progress until the next point at which the
current coroutine yields, and as long as our aio_bh_poll() doesn't
also include a yield point, then we are ensured that the BH has
completed before the next yield point in our caller.

There are times (like today) when I'm still trying to wrap my mind
around the subtle differences between true multi-threading
vs. cooperative coroutines sharing a single thread via the use of
yield points.  Coroutines are cool when they can get rid of some of
the races that you have to worry about in true multi-threading.

Reviewed-by: Eric Blake <eblake@redhat.com>
Eric Blake Aug. 18, 2023, 4:26 p.m. UTC | #2
On Fri, Aug 18, 2023 at 11:24:00AM -0500, Eric Blake wrote:
> > +++ b/block/graph-lock.c
> > @@ -163,17 +163,26 @@ void bdrv_graph_wrlock(BlockDriverState *bs)
> >  void bdrv_graph_wrunlock(void)
> >  {
> >      GLOBAL_STATE_CODE();
> > -    QEMU_LOCK_GUARD(&aio_context_list_lock);
> >      assert(qatomic_read(&has_writer));
> >  
> > +    WITH_QEMU_LOCK_GUARD(&aio_context_list_lock) {
> > +        /*
> > +         * No need for memory barriers, this works in pair with
> > +         * the slow path of rdlock() and both take the lock.
> > +         */
> > +        qatomic_store_release(&has_writer, 0);
> > +
> > +        /* Wake up all coroutine that are waiting to read the graph */
> > +        qemu_co_enter_all(&reader_queue, &aio_context_list_lock);
> 
> So if I understand coroutines correctly, this says all pending
> coroutines are now scheduled to run (or maybe they do try to run here,
> but then immediately return control back to this coroutine to await
> the right lock conditions since we are still in the block guarded by
> list_lock)...
> 
> > +    }
> > +
> >      /*
> > -     * No need for memory barriers, this works in pair with
> > -     * the slow path of rdlock() and both take the lock.
> > +     * Run any BHs that were scheduled during the wrlock section and that
> > +     * callers might expect to have finished (e.g. bdrv_unref() calls). Do this
> > +     * only after restarting coroutines so that nested event loops in BHs don't
> > +     * deadlock if their condition relies on the coroutine making progress.
> >       */
> > -    qatomic_store_release(&has_writer, 0);
> > -
> > -    /* Wake up all coroutine that are waiting to read the graph */
> > -    qemu_co_enter_all(&reader_queue, &aio_context_list_lock);
> > +    aio_bh_poll(qemu_get_aio_context());
> 
> ...and as long as the other coroutines sharing this thread don't
> actually get to make progress until the next point at which the
> current coroutine yields, and as long as our aio_bh_poll() doesn't
> also include a yield point, then we are ensured that the BH has
> completed before the next yield point in our caller.
> 
> There are times (like today) where I'm still trying to wrap my mind
> about the subtle differences between true multi-threading
> vs. cooperative coroutines sharing a single thread via the use of
> yield points.  coroutines are cool when they can get rid of some of
> the races that you have to worry about in true multi-threading.

That said, once we introduce multi-queue, can we ever have a scenario
where a different iothread might be trying to access the graph and
perform a reopen in the time while this thread has not completed the
BH close?  Or is that protected by some other mutual exclusion (where
the only one we have to worry about is reopen by a coroutine in the
same thread, because all other threads are locked out of graph
modifications)?
Kevin Wolf Aug. 21, 2023, 4:59 p.m. UTC | #3
Am 18.08.2023 um 18:26 hat Eric Blake geschrieben:
> On Fri, Aug 18, 2023 at 11:24:00AM -0500, Eric Blake wrote:
> > > +++ b/block/graph-lock.c
> > > @@ -163,17 +163,26 @@ void bdrv_graph_wrlock(BlockDriverState *bs)
> > >  void bdrv_graph_wrunlock(void)
> > >  {
> > >      GLOBAL_STATE_CODE();
> > > -    QEMU_LOCK_GUARD(&aio_context_list_lock);
> > >      assert(qatomic_read(&has_writer));
> > >  
> > > +    WITH_QEMU_LOCK_GUARD(&aio_context_list_lock) {
> > > +        /*
> > > +         * No need for memory barriers, this works in pair with
> > > +         * the slow path of rdlock() and both take the lock.
> > > +         */
> > > +        qatomic_store_release(&has_writer, 0);
> > > +
> > > +        /* Wake up all coroutine that are waiting to read the graph */
> > > +        qemu_co_enter_all(&reader_queue, &aio_context_list_lock);
> > 
> > So if I understand coroutines correctly, this says all pending
> > coroutines are now scheduled to run (or maybe they do try to run here,
> > but then immediately return control back to this coroutine to await
> > the right lock conditions since we are still in the block guarded by
> > list_lock)...
> > 
> > > +    }
> > > +
> > >      /*
> > > -     * No need for memory barriers, this works in pair with
> > > -     * the slow path of rdlock() and both take the lock.
> > > +     * Run any BHs that were scheduled during the wrlock section and that
> > > +     * callers might expect to have finished (e.g. bdrv_unref() calls). Do this
> > > +     * only after restarting coroutines so that nested event loops in BHs don't
> > > +     * deadlock if their condition relies on the coroutine making progress.
> > >       */
> > > -    qatomic_store_release(&has_writer, 0);
> > > -
> > > -    /* Wake up all coroutine that are waiting to read the graph */
> > > -    qemu_co_enter_all(&reader_queue, &aio_context_list_lock);
> > > +    aio_bh_poll(qemu_get_aio_context());
> > 
> > ...and as long as the other coroutines sharing this thread don't
> > actually get to make progress until the next point at which the
> > current coroutine yields, and as long as our aio_bh_poll() doesn't
> > also include a yield point, then we are ensured that the BH has
> > completed before the next yield point in our caller.
> > 
> > There are times (like today) where I'm still trying to wrap my mind
> > about the subtle differences between true multi-threading
> > vs. cooperative coroutines sharing a single thread via the use of
> > yield points.  coroutines are cool when they can get rid of some of
> > the races that you have to worry about in true multi-threading.
> 
> That said, once we introduce multi-queue, can we ever have a scenario
> where a different iothread might be trying to access the graph and
> perform a reopen in the time while this thread has not completed the
> BH close?  Or is that protected by some other mutual exclusion (where
> the only one we have to worry about is reopen by a coroutine in the
> same thread, because all other threads are locked out of graph
> modifications)?

We don't have to worry about that one because reopen (and taking the
writer lock in general) only happen in the main thread, which is exactly
the thread that this code runs in.

The thing that we need to take into consideration is that aio_bh_poll()
could call something that wants to take the writer lock and modify the
graph again. It's not really a problem, though, because we're already
done at that point. Any readers that we resumed above will just
synchronise with the writer in the usual way and one of them will have
to wait. But there is nothing that is still waiting to be resumed and
could deadlock.

Kevin
Stefan Hajnoczi Aug. 22, 2023, 7:01 p.m. UTC | #4
On Thu, Aug 17, 2023 at 02:50:04PM +0200, Kevin Wolf wrote:
> bdrv_unref() is called by a lot of places that need to hold the graph
> lock (it naturally happens in the context of operations that change the
> graph). However, bdrv_unref() takes the graph writer lock internally, so
> it can't actually be called while already holding a graph lock without
> causing a deadlock.
> 
> bdrv_unref() also can't just become GRAPH_WRLOCK because it drains the
> node before closing it, and draining requires that the graph is
> unlocked.
> 
> The solution is to defer deleting the node until we don't hold the lock
> any more and draining is possible again.
> 
> Note that keeping images open for longer than necessary can create
> problems, too: You can't open an image again before it is really closed
> (if image locking didn't prevent it, it would cause corruption).
> Reopening an image immediately happens at least during bdrv_open() and
> bdrv_co_create().
> 
> In order to solve this problem, make sure to run the deferred unref in
> bdrv_graph_wrunlock(), i.e. the first possible place where we can drain
> again. This is also why bdrv_schedule_unref() is marked GRAPH_WRLOCK.
> 
> The output of iotest 051 is updated because the additional polling
> changes the order of HMP output, resulting in a new "(qemu)" prompt in
> the test output that was previously on a separate line and filtered out.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>  include/block/block-global-state.h |  1 +
>  block.c                            |  9 +++++++++
>  block/graph-lock.c                 | 23 ++++++++++++++++-------
>  tests/qemu-iotests/051.pc.out      |  6 +++---
>  4 files changed, 29 insertions(+), 10 deletions(-)
> 
> diff --git a/include/block/block-global-state.h b/include/block/block-global-state.h
> index f347199bff..e570799f85 100644
> --- a/include/block/block-global-state.h
> +++ b/include/block/block-global-state.h
> @@ -224,6 +224,7 @@ void bdrv_img_create(const char *filename, const char *fmt,
>  void bdrv_ref(BlockDriverState *bs);
>  void no_coroutine_fn bdrv_unref(BlockDriverState *bs);
>  void coroutine_fn no_co_wrapper bdrv_co_unref(BlockDriverState *bs);
> +void GRAPH_WRLOCK bdrv_schedule_unref(BlockDriverState *bs);
>  void bdrv_unref_child(BlockDriverState *parent, BdrvChild *child);
>  BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
>                               BlockDriverState *child_bs,
> diff --git a/block.c b/block.c
> index 6376452768..9c4f24f4b9 100644
> --- a/block.c
> +++ b/block.c
> @@ -7033,6 +7033,15 @@ void bdrv_unref(BlockDriverState *bs)
>      }
>  }
>  
> +void bdrv_schedule_unref(BlockDriverState *bs)

Please add a doc comment explaining when and why this should be used.

> +{
> +    if (!bs) {
> +        return;
> +    }
> +    aio_bh_schedule_oneshot(qemu_get_aio_context(),
> +                            (QEMUBHFunc *) bdrv_unref, bs);
> +}
> +
>  struct BdrvOpBlocker {
>      Error *reason;
>      QLIST_ENTRY(BdrvOpBlocker) list;
> diff --git a/block/graph-lock.c b/block/graph-lock.c
> index 5e66f01ae8..395d387651 100644
> --- a/block/graph-lock.c
> +++ b/block/graph-lock.c
> @@ -163,17 +163,26 @@ void bdrv_graph_wrlock(BlockDriverState *bs)
>  void bdrv_graph_wrunlock(void)
>  {
>      GLOBAL_STATE_CODE();
> -    QEMU_LOCK_GUARD(&aio_context_list_lock);
>      assert(qatomic_read(&has_writer));
>  
> +    WITH_QEMU_LOCK_GUARD(&aio_context_list_lock) {
> +        /*
> +         * No need for memory barriers, this works in pair with
> +         * the slow path of rdlock() and both take the lock.
> +         */
> +        qatomic_store_release(&has_writer, 0);
> +
> +        /* Wake up all coroutine that are waiting to read the graph */

s/coroutine/coroutines/

> +        qemu_co_enter_all(&reader_queue, &aio_context_list_lock);
> +    }
> +
>      /*
> -     * No need for memory barriers, this works in pair with
> -     * the slow path of rdlock() and both take the lock.
> +     * Run any BHs that were scheduled during the wrlock section and that
> +     * callers might expect to have finished (e.g. bdrv_unref() calls). Do this

Referring directly to bdrv_schedule_unref() would help make it clearer
what you mean.

> +     * only after restarting coroutines so that nested event loops in BHs don't
> +     * deadlock if their condition relies on the coroutine making progress.
>       */
> -    qatomic_store_release(&has_writer, 0);
> -
> -    /* Wake up all coroutine that are waiting to read the graph */
> -    qemu_co_enter_all(&reader_queue, &aio_context_list_lock);
> +    aio_bh_poll(qemu_get_aio_context());

Keeping a dedicated list of BDSes pending unref seems cleaner than using
the aio_bh_poll(). There's a lot of stuff happening in the event loop
and I wonder if those other BHs might cause issues if they run at the
end of bdrv_graph_wrunlock(). The change to qemu-iotests 051's output is
an example of this, but other things could happen too (e.g. monitor
command processing).

>  }
>  
>  void coroutine_fn bdrv_graph_co_rdlock(void)
> diff --git a/tests/qemu-iotests/051.pc.out b/tests/qemu-iotests/051.pc.out
> index 4d4af5a486..7e10c5fa1b 100644
> --- a/tests/qemu-iotests/051.pc.out
> +++ b/tests/qemu-iotests/051.pc.out
> @@ -169,11 +169,11 @@ QEMU_PROG: -device scsi-hd,drive=disk: Device needs media, but drive is empty
>  
>  Testing: -drive file=TEST_DIR/t.qcow2,if=none,node-name=disk -object iothread,id=thread0 -device virtio-scsi,iothread=thread0,id=virtio-scsi0 -device scsi-hd,bus=virtio-scsi0.0,drive=disk,share-rw=on -device ide-hd,drive=disk,share-rw=on
>  QEMU X.Y.Z monitor - type 'help' for more information
> -QEMU_PROG: -device ide-hd,drive=disk,share-rw=on: Cannot change iothread of active block backend
> +(qemu) QEMU_PROG: -device ide-hd,drive=disk,share-rw=on: Cannot change iothread of active block backend
>  
>  Testing: -drive file=TEST_DIR/t.qcow2,if=none,node-name=disk -object iothread,id=thread0 -device virtio-scsi,iothread=thread0,id=virtio-scsi0 -device scsi-hd,bus=virtio-scsi0.0,drive=disk,share-rw=on -device virtio-blk-pci,drive=disk,share-rw=on
>  QEMU X.Y.Z monitor - type 'help' for more information
> -QEMU_PROG: -device virtio-blk-pci,drive=disk,share-rw=on: Cannot change iothread of active block backend
> +(qemu) QEMU_PROG: -device virtio-blk-pci,drive=disk,share-rw=on: Cannot change iothread of active block backend
>  
>  Testing: -drive file=TEST_DIR/t.qcow2,if=none,node-name=disk -object iothread,id=thread0 -device virtio-scsi,iothread=thread0,id=virtio-scsi0 -device scsi-hd,bus=virtio-scsi0.0,drive=disk,share-rw=on -device lsi53c895a,id=lsi0 -device scsi-hd,bus=lsi0.0,drive=disk,share-rw=on
>  QEMU X.Y.Z monitor - type 'help' for more information
> @@ -185,7 +185,7 @@ QEMU X.Y.Z monitor - type 'help' for more information
>  
>  Testing: -drive file=TEST_DIR/t.qcow2,if=none,node-name=disk -object iothread,id=thread0 -device virtio-scsi,iothread=thread0,id=virtio-scsi0 -device scsi-hd,bus=virtio-scsi0.0,drive=disk,share-rw=on -device virtio-blk-pci,drive=disk,iothread=thread0,share-rw=on
>  QEMU X.Y.Z monitor - type 'help' for more information
> -QEMU_PROG: -device virtio-blk-pci,drive=disk,iothread=thread0,share-rw=on: Cannot change iothread of active block backend
> +(qemu) QEMU_PROG: -device virtio-blk-pci,drive=disk,iothread=thread0,share-rw=on: Cannot change iothread of active block backend
>  
>  Testing: -drive file=TEST_DIR/t.qcow2,if=none,node-name=disk -object iothread,id=thread0 -device virtio-scsi,iothread=thread0,id=virtio-scsi0 -device scsi-hd,bus=virtio-scsi0.0,drive=disk,share-rw=on -device virtio-scsi,id=virtio-scsi1,iothread=thread0 -device scsi-hd,bus=virtio-scsi1.0,drive=disk,share-rw=on
>  QEMU X.Y.Z monitor - type 'help' for more information
> -- 
> 2.41.0
>
Kevin Wolf Sept. 5, 2023, 4:39 p.m. UTC | #5
Am 22.08.2023 um 21:01 hat Stefan Hajnoczi geschrieben:
> On Thu, Aug 17, 2023 at 02:50:04PM +0200, Kevin Wolf wrote:
> > bdrv_unref() is called by a lot of places that need to hold the graph
> > lock (it naturally happens in the context of operations that change the
> > graph). However, bdrv_unref() takes the graph writer lock internally, so
> > it can't actually be called while already holding a graph lock without
> > causing a deadlock.
> > 
> > bdrv_unref() also can't just become GRAPH_WRLOCK because it drains the
> > node before closing it, and draining requires that the graph is
> > unlocked.
> > 
> > The solution is to defer deleting the node until we don't hold the lock
> > any more and draining is possible again.
> > 
> > Note that keeping images open for longer than necessary can create
> > problems, too: You can't open an image again before it is really closed
> > (if image locking didn't prevent it, it would cause corruption).
> > Reopening an image immediately happens at least during bdrv_open() and
> > bdrv_co_create().
> > 
> > In order to solve this problem, make sure to run the deferred unref in
> > bdrv_graph_wrunlock(), i.e. the first possible place where we can drain
> > again. This is also why bdrv_schedule_unref() is marked GRAPH_WRLOCK.
> > 
> > The output of iotest 051 is updated because the additional polling
> > changes the order of HMP output, resulting in a new "(qemu)" prompt in
> > the test output that was previously on a separate line and filtered out.
> > 
> > Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> > ---
> >  include/block/block-global-state.h |  1 +
> >  block.c                            |  9 +++++++++
> >  block/graph-lock.c                 | 23 ++++++++++++++++-------
> >  tests/qemu-iotests/051.pc.out      |  6 +++---
> >  4 files changed, 29 insertions(+), 10 deletions(-)
> > 
> > diff --git a/include/block/block-global-state.h b/include/block/block-global-state.h
> > index f347199bff..e570799f85 100644
> > --- a/include/block/block-global-state.h
> > +++ b/include/block/block-global-state.h
> > @@ -224,6 +224,7 @@ void bdrv_img_create(const char *filename, const char *fmt,
> >  void bdrv_ref(BlockDriverState *bs);
> >  void no_coroutine_fn bdrv_unref(BlockDriverState *bs);
> >  void coroutine_fn no_co_wrapper bdrv_co_unref(BlockDriverState *bs);
> > +void GRAPH_WRLOCK bdrv_schedule_unref(BlockDriverState *bs);
> >  void bdrv_unref_child(BlockDriverState *parent, BdrvChild *child);
> >  BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
> >                               BlockDriverState *child_bs,
> > diff --git a/block.c b/block.c
> > index 6376452768..9c4f24f4b9 100644
> > --- a/block.c
> > +++ b/block.c
> > @@ -7033,6 +7033,15 @@ void bdrv_unref(BlockDriverState *bs)
> >      }
> >  }
> >  
> > +void bdrv_schedule_unref(BlockDriverState *bs)
> 
> Please add a doc comment explaining when and why this should be used.

Ok.

> > +{
> > +    if (!bs) {
> > +        return;
> > +    }
> > +    aio_bh_schedule_oneshot(qemu_get_aio_context(),
> > +                            (QEMUBHFunc *) bdrv_unref, bs);
> > +}
> > +
> >  struct BdrvOpBlocker {
> >      Error *reason;
> >      QLIST_ENTRY(BdrvOpBlocker) list;
> > diff --git a/block/graph-lock.c b/block/graph-lock.c
> > index 5e66f01ae8..395d387651 100644
> > --- a/block/graph-lock.c
> > +++ b/block/graph-lock.c
> > @@ -163,17 +163,26 @@ void bdrv_graph_wrlock(BlockDriverState *bs)
> >  void bdrv_graph_wrunlock(void)
> >  {
> >      GLOBAL_STATE_CODE();
> > -    QEMU_LOCK_GUARD(&aio_context_list_lock);
> >      assert(qatomic_read(&has_writer));
> >  
> > +    WITH_QEMU_LOCK_GUARD(&aio_context_list_lock) {
> > +        /*
> > +         * No need for memory barriers, this works in pair with
> > +         * the slow path of rdlock() and both take the lock.
> > +         */
> > +        qatomic_store_release(&has_writer, 0);
> > +
> > +        /* Wake up all coroutine that are waiting to read the graph */
> 
> s/coroutine/coroutines/

I only changed the indentation, but I guess I can just fix it while I
touch it.

> > +        qemu_co_enter_all(&reader_queue, &aio_context_list_lock);
> > +    }
> > +
> >      /*
> > -     * No need for memory barriers, this works in pair with
> > -     * the slow path of rdlock() and both take the lock.
> > +     * Run any BHs that were scheduled during the wrlock section and that
> > +     * callers might expect to have finished (e.g. bdrv_unref() calls). Do this
> 
> Referring directly to bdrv_schedule_unref() would help make it clearer
> what you mean.
> 
> > +     * only after restarting coroutines so that nested event loops in BHs don't
> > +     * deadlock if their condition relies on the coroutine making progress.
> >       */
> > -    qatomic_store_release(&has_writer, 0);
> > -
> > -    /* Wake up all coroutine that are waiting to read the graph */
> > -    qemu_co_enter_all(&reader_queue, &aio_context_list_lock);
> > +    aio_bh_poll(qemu_get_aio_context());
> 
> Keeping a dedicated list of BDSes pending unref seems cleaner than using
> the aio_bh_poll(). There's a lot of stuff happening in the event loop
> and I wonder if those other BHs might cause issues if they run at the
> end of bdrv_graph_wrunlock(). The change to qemu-iotests 051's output is
> an example of this, but other things could happen too (e.g. monitor
> command processing).

I would argue that it's no worse than the old code: The bdrv_unref()
that we're replacing would already run a nested event loop if it was the
last reference. We only moved this a bit later, making the part of the
code that doesn't run an event loop a bit larger. The difference is that
we're running it unconditionally now, not only in a special case, but I
think that's actually an improvement (in terms of test coverage, etc.).

We run nested event loops in all kinds of different places without
thinking much about it. If we're worried about it here, what do we do
about all these other places?

Anyway, if you think that we should process only bdrv_schedule_unref()
here, how would you approach this? Restrict it to bdrv_schedule_unref()
in the hope that this is the only operation we'll need to delay, or
implement another BH mechanism from scratch in graph-lock.c? In theory
we could also add another AioContext where actual BHs can be queued and
that is only polled here, but the iohandler context is already painful
enough that I don't necessarily want to create another thing like it.

Kevin
Kevin Wolf Sept. 6, 2023, 9:17 a.m. UTC | #6
Am 05.09.2023 um 18:39 hat Kevin Wolf geschrieben:
> Am 22.08.2023 um 21:01 hat Stefan Hajnoczi geschrieben:
> > On Thu, Aug 17, 2023 at 02:50:04PM +0200, Kevin Wolf wrote:
> > > bdrv_unref() is called by a lot of places that need to hold the graph
> > > lock (it naturally happens in the context of operations that change the
> > > graph). However, bdrv_unref() takes the graph writer lock internally, so
> > > it can't actually be called while already holding a graph lock without
> > > causing a deadlock.
> > > 
> > > bdrv_unref() also can't just become GRAPH_WRLOCK because it drains the
> > > node before closing it, and draining requires that the graph is
> > > unlocked.
> > > 
> > > The solution is to defer deleting the node until we don't hold the lock
> > > any more and draining is possible again.
> > > 
> > > Note that keeping images open for longer than necessary can create
> > > problems, too: You can't open an image again before it is really closed
> > > (if image locking didn't prevent it, it would cause corruption).
> > > Reopening an image immediately happens at least during bdrv_open() and
> > > bdrv_co_create().
> > > 
> > > In order to solve this problem, make sure to run the deferred unref in
> > > bdrv_graph_wrunlock(), i.e. the first possible place where we can drain
> > > again. This is also why bdrv_schedule_unref() is marked GRAPH_WRLOCK.
> > > 
> > > The output of iotest 051 is updated because the additional polling
> > > changes the order of HMP output, resulting in a new "(qemu)" prompt in
> > > the test output that was previously on a separate line and filtered out.
> > > 
> > > Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> > > ---
> > >  include/block/block-global-state.h |  1 +
> > >  block.c                            |  9 +++++++++
> > >  block/graph-lock.c                 | 23 ++++++++++++++++-------
> > >  tests/qemu-iotests/051.pc.out      |  6 +++---
> > >  4 files changed, 29 insertions(+), 10 deletions(-)
> > > 
> > > diff --git a/include/block/block-global-state.h b/include/block/block-global-state.h
> > > index f347199bff..e570799f85 100644
> > > --- a/include/block/block-global-state.h
> > > +++ b/include/block/block-global-state.h
> > > @@ -224,6 +224,7 @@ void bdrv_img_create(const char *filename, const char *fmt,
> > >  void bdrv_ref(BlockDriverState *bs);
> > >  void no_coroutine_fn bdrv_unref(BlockDriverState *bs);
> > >  void coroutine_fn no_co_wrapper bdrv_co_unref(BlockDriverState *bs);
> > > +void GRAPH_WRLOCK bdrv_schedule_unref(BlockDriverState *bs);
> > >  void bdrv_unref_child(BlockDriverState *parent, BdrvChild *child);
> > >  BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
> > >                               BlockDriverState *child_bs,
> > > diff --git a/block.c b/block.c
> > > index 6376452768..9c4f24f4b9 100644
> > > --- a/block.c
> > > +++ b/block.c
> > > @@ -7033,6 +7033,15 @@ void bdrv_unref(BlockDriverState *bs)
> > >      }
> > >  }
> > >  
> > > +void bdrv_schedule_unref(BlockDriverState *bs)
> > 
> > Please add a doc comment explaining when and why this should be used.
> 
> Ok.
> 
> > > +{
> > > +    if (!bs) {
> > > +        return;
> > > +    }
> > > +    aio_bh_schedule_oneshot(qemu_get_aio_context(),
> > > +                            (QEMUBHFunc *) bdrv_unref, bs);
> > > +}
> > > +
> > >  struct BdrvOpBlocker {
> > >      Error *reason;
> > >      QLIST_ENTRY(BdrvOpBlocker) list;
> > > diff --git a/block/graph-lock.c b/block/graph-lock.c
> > > index 5e66f01ae8..395d387651 100644
> > > --- a/block/graph-lock.c
> > > +++ b/block/graph-lock.c
> > > @@ -163,17 +163,26 @@ void bdrv_graph_wrlock(BlockDriverState *bs)
> > >  void bdrv_graph_wrunlock(void)
> > >  {
> > >      GLOBAL_STATE_CODE();
> > > -    QEMU_LOCK_GUARD(&aio_context_list_lock);
> > >      assert(qatomic_read(&has_writer));
> > >  
> > > +    WITH_QEMU_LOCK_GUARD(&aio_context_list_lock) {
> > > +        /*
> > > +         * No need for memory barriers, this works in pair with
> > > +         * the slow path of rdlock() and both take the lock.
> > > +         */
> > > +        qatomic_store_release(&has_writer, 0);
> > > +
> > > +        /* Wake up all coroutine that are waiting to read the graph */
> > 
> > s/coroutine/coroutines/
> 
> I only changed the indentation, but I guess I can just fix it while I
> touch it.
> 
> > > +        qemu_co_enter_all(&reader_queue, &aio_context_list_lock);
> > > +    }
> > > +
> > >      /*
> > > -     * No need for memory barriers, this works in pair with
> > > -     * the slow path of rdlock() and both take the lock.
> > > +     * Run any BHs that were scheduled during the wrlock section and that
> > > +     * callers might expect to have finished (e.g. bdrv_unref() calls). Do this
> > 
> > Referring directly to bdrv_schedule_unref() would help make it clearer
> > what you mean.
> > 
> > > +     * only after restarting coroutines so that nested event loops in BHs don't
> > > +     * deadlock if their condition relies on the coroutine making progress.
> > >       */
> > > -    qatomic_store_release(&has_writer, 0);
> > > -
> > > -    /* Wake up all coroutine that are waiting to read the graph */
> > > -    qemu_co_enter_all(&reader_queue, &aio_context_list_lock);
> > > +    aio_bh_poll(qemu_get_aio_context());
> > 
> > Keeping a dedicated list of BDSes pending unref seems cleaner than using
> > the aio_bh_poll(). There's a lot of stuff happening in the event loop
> > and I wonder if those other BHs might cause issues if they run at the
> > end of bdrv_graph_wrunlock(). The change to qemu-iotests 051's output is
> > an example of this, but other things could happen too (e.g. monitor
> > command processing).
> 
> I would argue that it's no worse than the old code: The bdrv_unref()
> that we're replacing would already run a nested event loop if it was the
> last reference. We only moved this a bit later, making the part of the
> code that doesn't run an event loop a bit larger. The difference is that
> we're running it unconditionally now, not only in a special case, but I
> think that's actually an improvement (in terms of test coverage etc.)
> 
> We run nested event loops in all kinds of different places without
> thinking much about it. If we're worried about it here, what do we do
> about all these other places?
> 
> Anyway, if you think that we should process only bdrv_schedule_unref()
> here, how would you approach this? Restrict it to bdrv_schedule_unref()
> in the hope that this is the only operation we'll need to delay, or
> implement another BH mechanism from scratch in graph-lock.c? In theory
> we could also add another AioContext where actual BHs can be queued and
> that is only polled here, but the iohandler context is already painful
> enough that I don't necessarily want to create another thing like it.

Actually, come to think of it, even processing only scheduled unrefs is
effectively a nested event loop (via bdrv_close()) that could run for
any caller. The only thing we would achieve is making it conditional
again so that it triggers only sometimes. But callers still have to make
sure that polling is safe because it could be the last reference.

So even the little use I saw in this yesterday, I'm not sure about it
any more now.

Kevin

Patch

diff --git a/include/block/block-global-state.h b/include/block/block-global-state.h
index f347199bff..e570799f85 100644
--- a/include/block/block-global-state.h
+++ b/include/block/block-global-state.h
@@ -224,6 +224,7 @@  void bdrv_img_create(const char *filename, const char *fmt,
 void bdrv_ref(BlockDriverState *bs);
 void no_coroutine_fn bdrv_unref(BlockDriverState *bs);
 void coroutine_fn no_co_wrapper bdrv_co_unref(BlockDriverState *bs);
+void GRAPH_WRLOCK bdrv_schedule_unref(BlockDriverState *bs);
 void bdrv_unref_child(BlockDriverState *parent, BdrvChild *child);
 BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
                              BlockDriverState *child_bs,
diff --git a/block.c b/block.c
index 6376452768..9c4f24f4b9 100644
--- a/block.c
+++ b/block.c
@@ -7033,6 +7033,15 @@  void bdrv_unref(BlockDriverState *bs)
     }
 }
 
+void bdrv_schedule_unref(BlockDriverState *bs)
+{
+    if (!bs) {
+        return;
+    }
+    aio_bh_schedule_oneshot(qemu_get_aio_context(),
+                            (QEMUBHFunc *) bdrv_unref, bs);
+}
+
 struct BdrvOpBlocker {
     Error *reason;
     QLIST_ENTRY(BdrvOpBlocker) list;
diff --git a/block/graph-lock.c b/block/graph-lock.c
index 5e66f01ae8..395d387651 100644
--- a/block/graph-lock.c
+++ b/block/graph-lock.c
@@ -163,17 +163,26 @@  void bdrv_graph_wrlock(BlockDriverState *bs)
 void bdrv_graph_wrunlock(void)
 {
     GLOBAL_STATE_CODE();
-    QEMU_LOCK_GUARD(&aio_context_list_lock);
     assert(qatomic_read(&has_writer));
 
+    WITH_QEMU_LOCK_GUARD(&aio_context_list_lock) {
+        /*
+         * No need for memory barriers, this works in pair with
+         * the slow path of rdlock() and both take the lock.
+         */
+        qatomic_store_release(&has_writer, 0);
+
+        /* Wake up all coroutine that are waiting to read the graph */
+        qemu_co_enter_all(&reader_queue, &aio_context_list_lock);
+    }
+
     /*
-     * No need for memory barriers, this works in pair with
-     * the slow path of rdlock() and both take the lock.
+     * Run any BHs that were scheduled during the wrlock section and that
+     * callers might expect to have finished (e.g. bdrv_unref() calls). Do this
+     * only after restarting coroutines so that nested event loops in BHs don't
+     * deadlock if their condition relies on the coroutine making progress.
      */
-    qatomic_store_release(&has_writer, 0);
-
-    /* Wake up all coroutine that are waiting to read the graph */
-    qemu_co_enter_all(&reader_queue, &aio_context_list_lock);
+    aio_bh_poll(qemu_get_aio_context());
 }
 
 void coroutine_fn bdrv_graph_co_rdlock(void)
diff --git a/tests/qemu-iotests/051.pc.out b/tests/qemu-iotests/051.pc.out
index 4d4af5a486..7e10c5fa1b 100644
--- a/tests/qemu-iotests/051.pc.out
+++ b/tests/qemu-iotests/051.pc.out
@@ -169,11 +169,11 @@  QEMU_PROG: -device scsi-hd,drive=disk: Device needs media, but drive is empty
 
 Testing: -drive file=TEST_DIR/t.qcow2,if=none,node-name=disk -object iothread,id=thread0 -device virtio-scsi,iothread=thread0,id=virtio-scsi0 -device scsi-hd,bus=virtio-scsi0.0,drive=disk,share-rw=on -device ide-hd,drive=disk,share-rw=on
 QEMU X.Y.Z monitor - type 'help' for more information
-QEMU_PROG: -device ide-hd,drive=disk,share-rw=on: Cannot change iothread of active block backend
+(qemu) QEMU_PROG: -device ide-hd,drive=disk,share-rw=on: Cannot change iothread of active block backend
 
 Testing: -drive file=TEST_DIR/t.qcow2,if=none,node-name=disk -object iothread,id=thread0 -device virtio-scsi,iothread=thread0,id=virtio-scsi0 -device scsi-hd,bus=virtio-scsi0.0,drive=disk,share-rw=on -device virtio-blk-pci,drive=disk,share-rw=on
 QEMU X.Y.Z monitor - type 'help' for more information
-QEMU_PROG: -device virtio-blk-pci,drive=disk,share-rw=on: Cannot change iothread of active block backend
+(qemu) QEMU_PROG: -device virtio-blk-pci,drive=disk,share-rw=on: Cannot change iothread of active block backend
 
 Testing: -drive file=TEST_DIR/t.qcow2,if=none,node-name=disk -object iothread,id=thread0 -device virtio-scsi,iothread=thread0,id=virtio-scsi0 -device scsi-hd,bus=virtio-scsi0.0,drive=disk,share-rw=on -device lsi53c895a,id=lsi0 -device scsi-hd,bus=lsi0.0,drive=disk,share-rw=on
 QEMU X.Y.Z monitor - type 'help' for more information
@@ -185,7 +185,7 @@  QEMU X.Y.Z monitor - type 'help' for more information
 
 Testing: -drive file=TEST_DIR/t.qcow2,if=none,node-name=disk -object iothread,id=thread0 -device virtio-scsi,iothread=thread0,id=virtio-scsi0 -device scsi-hd,bus=virtio-scsi0.0,drive=disk,share-rw=on -device virtio-blk-pci,drive=disk,iothread=thread0,share-rw=on
 QEMU X.Y.Z monitor - type 'help' for more information
-QEMU_PROG: -device virtio-blk-pci,drive=disk,iothread=thread0,share-rw=on: Cannot change iothread of active block backend
+(qemu) QEMU_PROG: -device virtio-blk-pci,drive=disk,iothread=thread0,share-rw=on: Cannot change iothread of active block backend
 
 Testing: -drive file=TEST_DIR/t.qcow2,if=none,node-name=disk -object iothread,id=thread0 -device virtio-scsi,iothread=thread0,id=virtio-scsi0 -device scsi-hd,bus=virtio-scsi0.0,drive=disk,share-rw=on -device virtio-scsi,id=virtio-scsi1,iothread=thread0 -device scsi-hd,bus=virtio-scsi1.0,drive=disk,share-rw=on
 QEMU X.Y.Z monitor - type 'help' for more information