
[10/12] block.c: add subtree_drains where needed

Message ID 20220118162738.1366281-11-eesposit@redhat.com (mailing list archive)
State New, archived
Series Removal of Aiocontext lock through drains: protect bdrv_replace_child_noperm.

Commit Message

Emanuele Giuseppe Esposito Jan. 18, 2022, 4:27 p.m. UTC
Protect bdrv_replace_child_noperm, as it modifies the
graph by adding/removing elements to the .children and .parents
lists of a bs. Use the newly introduced
bdrv_subtree_drained_{begin/end}_unlocked drains to achieve
that and be free from the AioContext lock.

One important criterion to keep in mind is that if the caller of
bdrv_replace_child_noperm creates a transaction, we need to make sure that the
whole transaction is under the same drain block. This is imperative, as having
multiple drains also in the .abort() class of functions causes discrepancies
in the drained counters (as nodes are put back into the original positions),
making it really hard to return them all to zero and leaving the code very buggy.
See https://patchew.org/QEMU/20211213104014.69858-1-eesposit@redhat.com/
for more explanations.

Unfortunately we still need to have bdrv_subtree_drained_begin/end
in bdrv_detach_child() releasing and then re-acquiring the AioContext
lock, since it later invokes bdrv_try_set_aio_context(), which is
not safe yet. Once all of this is cleaned up, we can also remove the
acquire/release locks in job_unref, artificially added because of this.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
---
 block.c | 50 ++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 44 insertions(+), 6 deletions(-)
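
To make the drain-block criterion above concrete, here is a minimal sketch of
the pattern this patch applies to its callers, assuming the
bdrv_subtree_drained_{begin/end}_unlocked helpers introduced earlier in the
series; some_graph_change_noperm() is a hypothetical stand-in for helpers such
as bdrv_attach_child_noperm() or bdrv_set_backing_noperm(), not a real
function:

/* Hypothetical stand-in for the _noperm graph-changing helpers. */
static int some_graph_change_noperm(BlockDriverState *bs, Transaction *tran,
                                    Error **errp);

static int graph_change_example(BlockDriverState *bs, Error **errp)
{
    int ret;
    Transaction *tran = tran_new();

    /* The drained section encloses the whole transaction... */
    bdrv_subtree_drained_begin_unlocked(bs);

    ret = some_graph_change_noperm(bs, tran, errp);
    if (ret < 0) {
        goto out;
    }

    ret = bdrv_refresh_perms(bs, errp);
out:
    /*
     * ...including tran_finalize(), so that .commit()/.abort() callbacks
     * run with the drained counters still balanced.
     */
    tran_finalize(tran, ret);
    bdrv_subtree_drained_end_unlocked(bs);

    return ret;
}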

Comments

Paolo Bonzini Jan. 19, 2022, 9:47 a.m. UTC | #1
Subject doesn't say what needs them: "block: drain block devices across 
graph modifications"

On 1/18/22 17:27, Emanuele Giuseppe Esposito wrote:
> Protect bdrv_replace_child_noperm, as it modifies the
> graph by adding/removing elements to .children and .parents
> list of a bs. Use the newly introduced
> bdrv_subtree_drained_{begin/end}_unlocked drains to achieve
> that and be free from the aiocontext lock.
> 
> One important criterion to keep in mind is that if the caller of
> bdrv_replace_child_noperm creates a transaction, we need to make sure that the
> whole transaction is under the same drain block.

Okay, this is the important part and it should be mentioned in patch 8 
as well.  It should also be in a comment above bdrv_replace_child_noperm().

> This is imperative, as having
> multiple drains also in the .abort() class of functions causes discrepancies
> in the drained counters (as nodes are put back into the original positions),
> making it really hard to return them all to zero and leaving the code very buggy.
> See https://patchew.org/QEMU/20211213104014.69858-1-eesposit@redhat.com/
> for more explanations.
> 
> Unfortunately we still need to have bdrv_subtree_drained_begin/end
> in bdrv_detach_child() releasing and then holding the AioContext
> lock, since it later invokes bdrv_try_set_aio_context() that is
> not safe yet. Once all is cleaned up, we can also remove the
> acquire/release locks in job_unref, artificially added because of this.

About this:

> +         * TODO: this is called by job_unref with lock held, because
> +         * afterwards it calls bdrv_try_set_aio_context.
> +         * Once all of this is fixed, take care of removing
> +         * the aiocontext lock and make this function _unlocked.

It may be clear to you, but it's quite cryptic:

- which lock is held by job_unref()?  Also, would it make more sense to 
talk about block_job_free() rather than job_unref()?  I can't quite 
follow where the AioContext lock is taken.

- what is "all of this", and what do you mean by "not safe yet"?  Do 
both refer to bdrv_try_set_aio_context() needing the AioContext lock?

- what is "this function" (that should become _unlocked)?

I think you could also split the patch in multiple parts for different 
call chains.  In particular bdrv_set_backing_hd can be merged with the 
patch to bdrv_reopen_parse_file_or_backing, since both of them deal with 
bdrv_set_file_or_backing_noperm.

Paolo

> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
> ---
>   block.c | 50 ++++++++++++++++++++++++++++++++++++++++++++------
>   1 file changed, 44 insertions(+), 6 deletions(-)
> 
> diff --git a/block.c b/block.c
> index fcc44a49a0..6196c95aae 100644
> --- a/block.c
> +++ b/block.c
> @@ -3114,8 +3114,22 @@ static void bdrv_detach_child(BdrvChild **childp)
>       BlockDriverState *old_bs = (*childp)->bs;
>   
>       assert(qemu_in_main_thread());
> +    if (old_bs) {
> +        /*
> +         * TODO: this is called by job_unref with lock held, because
> +         * afterwards it calls bdrv_try_set_aio_context.
> +         * Once all of this is fixed, take care of removing
> +         * the aiocontext lock and make this function _unlocked.
> +         */
> +        bdrv_subtree_drained_begin(old_bs);
> +    }
> +
>       bdrv_replace_child_noperm(childp, NULL, true);
>   
> +    if (old_bs) {
> +        bdrv_subtree_drained_end(old_bs);
> +    }
> +
>       if (old_bs) {
>           /*
>            * Update permissions for old node. We're just taking a parent away, so
> @@ -3154,6 +3168,7 @@ BdrvChild *bdrv_root_attach_child(BlockDriverState *child_bs,
>       Transaction *tran = tran_new();
>   
>       assert(qemu_in_main_thread());
> +    bdrv_subtree_drained_begin_unlocked(child_bs);
>   
>       ret = bdrv_attach_child_common(child_bs, child_name, child_class,
>                                      child_role, perm, shared_perm, opaque,
> @@ -3168,6 +3183,7 @@ out:
>       tran_finalize(tran, ret);
>       /* child is unset on failure by bdrv_attach_child_common_abort() */
>       assert((ret < 0) == !child);
> +    bdrv_subtree_drained_end_unlocked(child_bs);
>   
>       bdrv_unref(child_bs);
>       return child;
> @@ -3197,6 +3213,9 @@ BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
>   
>       assert(qemu_in_main_thread());
>   
> +    bdrv_subtree_drained_begin_unlocked(parent_bs);
> +    bdrv_subtree_drained_begin_unlocked(child_bs);
> +
>       ret = bdrv_attach_child_noperm(parent_bs, child_bs, child_name, child_class,
>                                      child_role, &child, tran, errp);
>       if (ret < 0) {
> @@ -3211,6 +3230,9 @@ BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
>   out:
>       tran_finalize(tran, ret);
>       /* child is unset on failure by bdrv_attach_child_common_abort() */
> +    bdrv_subtree_drained_end_unlocked(child_bs);
> +    bdrv_subtree_drained_end_unlocked(parent_bs);
> +
>       assert((ret < 0) == !child);
>   
>       bdrv_unref(child_bs);
> @@ -3456,6 +3478,11 @@ int bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd,
>   
>       assert(qemu_in_main_thread());
>   
> +    bdrv_subtree_drained_begin_unlocked(bs);
> +    if (backing_hd) {
> +        bdrv_subtree_drained_begin_unlocked(backing_hd);
> +    }
> +
>       ret = bdrv_set_backing_noperm(bs, backing_hd, tran, errp);
>       if (ret < 0) {
>           goto out;
> @@ -3464,6 +3491,10 @@ int bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd,
>       ret = bdrv_refresh_perms(bs, errp);
>   out:
>       tran_finalize(tran, ret);
> +    if (backing_hd) {
> +        bdrv_subtree_drained_end_unlocked(backing_hd);
> +    }
> +    bdrv_subtree_drained_end_unlocked(bs);
>   
>       return ret;
>   }
> @@ -5266,7 +5297,8 @@ static int bdrv_replace_node_common(BlockDriverState *from,
>   
>       assert(qemu_get_current_aio_context() == qemu_get_aio_context());
>       assert(bdrv_get_aio_context(from) == bdrv_get_aio_context(to));
> -    bdrv_drained_begin(from);
> +    bdrv_subtree_drained_begin_unlocked(from);
> +    bdrv_subtree_drained_begin_unlocked(to);
>   
>       /*
>        * Do the replacement without permission update.
> @@ -5298,7 +5330,8 @@ static int bdrv_replace_node_common(BlockDriverState *from,
>   out:
>       tran_finalize(tran, ret);
>   
> -    bdrv_drained_end(from);
> +    bdrv_subtree_drained_end_unlocked(to);
> +    bdrv_subtree_drained_end_unlocked(from);
>       bdrv_unref(from);
>   
>       return ret;
> @@ -5345,6 +5378,9 @@ int bdrv_append(BlockDriverState *bs_new, BlockDriverState *bs_top,
>   
>       assert(!bs_new->backing);
>   
> +    bdrv_subtree_drained_begin_unlocked(bs_new);
> +    bdrv_subtree_drained_begin_unlocked(bs_top);
> +
>       ret = bdrv_attach_child_noperm(bs_new, bs_top, "backing",
>                                      &child_of_bds, bdrv_backing_role(bs_new),
>                                      &bs_new->backing, tran, errp);
> @@ -5360,6 +5396,8 @@ int bdrv_append(BlockDriverState *bs_new, BlockDriverState *bs_top,
>       ret = bdrv_refresh_perms(bs_new, errp);
>   out:
>       tran_finalize(tran, ret);
> +    bdrv_subtree_drained_end_unlocked(bs_top);
> +    bdrv_subtree_drained_end_unlocked(bs_new);
>   
>       bdrv_refresh_limits(bs_top, NULL, NULL);
>   
> @@ -5379,8 +5417,8 @@ int bdrv_replace_child_bs(BdrvChild *child, BlockDriverState *new_bs,
>       assert(qemu_in_main_thread());
>   
>       bdrv_ref(old_bs);
> -    bdrv_drained_begin(old_bs);
> -    bdrv_drained_begin(new_bs);
> +    bdrv_subtree_drained_begin_unlocked(old_bs);
> +    bdrv_subtree_drained_begin_unlocked(new_bs);
>   
>       bdrv_replace_child_tran(&child, new_bs, tran, true);
>       /* @new_bs must have been non-NULL, so @child must not have been freed */
> @@ -5394,8 +5432,8 @@ int bdrv_replace_child_bs(BdrvChild *child, BlockDriverState *new_bs,
>   
>       tran_finalize(tran, ret);
>   
> -    bdrv_drained_end(old_bs);
> -    bdrv_drained_end(new_bs);
> +    bdrv_subtree_drained_end_unlocked(new_bs);
> +    bdrv_subtree_drained_end_unlocked(old_bs);
>       bdrv_unref(old_bs);
>   
>       return ret;
Vladimir Sementsov-Ogievskiy Feb. 1, 2022, 2:47 p.m. UTC | #2
18.01.2022 19:27, Emanuele Giuseppe Esposito wrote:
> Protect bdrv_replace_child_noperm, as it modifies the
> graph by adding/removing elements to .children and .parents
> list of a bs. Use the newly introduced
> bdrv_subtree_drained_{begin/end}_unlocked drains to achieve
> that and be free from the aiocontext lock.
> 
> One important criterion to keep in mind is that if the caller of
> bdrv_replace_child_noperm creates a transaction, we need to make sure that the
> whole transaction is under the same drain block. This is imperative, as having
> multiple drains also in the .abort() class of functions causes discrepancies
> in the drained counters (as nodes are put back into the original positions),
> making it really hard to return them all to zero and leaving the code very buggy.
> See https://patchew.org/QEMU/20211213104014.69858-1-eesposit@redhat.com/
> for more explanations.
> 
> Unfortunately we still need to have bdrv_subtree_drained_begin/end
> in bdrv_detach_child() releasing and then holding the AioContext
> lock, since it later invokes bdrv_try_set_aio_context() that is
> not safe yet. Once all is cleaned up, we can also remove the
> acquire/release locks in job_unref, artificially added because of this.
> 
> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
> ---
>   block.c | 50 ++++++++++++++++++++++++++++++++++++++++++++------
>   1 file changed, 44 insertions(+), 6 deletions(-)
> 
> diff --git a/block.c b/block.c
> index fcc44a49a0..6196c95aae 100644
> --- a/block.c
> +++ b/block.c
> @@ -3114,8 +3114,22 @@ static void bdrv_detach_child(BdrvChild **childp)
>       BlockDriverState *old_bs = (*childp)->bs;
>   
>       assert(qemu_in_main_thread());
> +    if (old_bs) {
> +        /*
> +         * TODO: this is called by job_unref with lock held, because
> +         * afterwards it calls bdrv_try_set_aio_context.
> +         * Once all of this is fixed, take care of removing
> +         * the aiocontext lock and make this function _unlocked.
> +         */
> +        bdrv_subtree_drained_begin(old_bs);
> +    }
> +
>       bdrv_replace_child_noperm(childp, NULL, true);
>   
> +    if (old_bs) {
> +        bdrv_subtree_drained_end(old_bs);
> +    }
> +
>       if (old_bs) {
>           /*
>            * Update permissions for old node. We're just taking a parent away, so
> @@ -3154,6 +3168,7 @@ BdrvChild *bdrv_root_attach_child(BlockDriverState *child_bs,
>       Transaction *tran = tran_new();
>   
>       assert(qemu_in_main_thread());
> +    bdrv_subtree_drained_begin_unlocked(child_bs);
>   
>       ret = bdrv_attach_child_common(child_bs, child_name, child_class,
>                                      child_role, perm, shared_perm, opaque,
> @@ -3168,6 +3183,7 @@ out:
>       tran_finalize(tran, ret);
>       /* child is unset on failure by bdrv_attach_child_common_abort() */
>       assert((ret < 0) == !child);
> +    bdrv_subtree_drained_end_unlocked(child_bs);
>   
>       bdrv_unref(child_bs);
>       return child;
> @@ -3197,6 +3213,9 @@ BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
>   
>       assert(qemu_in_main_thread());
>   
> +    bdrv_subtree_drained_begin_unlocked(parent_bs);
> +    bdrv_subtree_drained_begin_unlocked(child_bs);
> +
>       ret = bdrv_attach_child_noperm(parent_bs, child_bs, child_name, child_class,
>                                      child_role, &child, tran, errp);
>       if (ret < 0) {
> @@ -3211,6 +3230,9 @@ BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
>   out:
>       tran_finalize(tran, ret);
>       /* child is unset on failure by bdrv_attach_child_common_abort() */
> +    bdrv_subtree_drained_end_unlocked(child_bs);
> +    bdrv_subtree_drained_end_unlocked(parent_bs);
> +
>       assert((ret < 0) == !child);
>   
>       bdrv_unref(child_bs);
> @@ -3456,6 +3478,11 @@ int bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd,
>   
>       assert(qemu_in_main_thread());
>   
> +    bdrv_subtree_drained_begin_unlocked(bs);
> +    if (backing_hd) {
> +        bdrv_subtree_drained_begin_unlocked(backing_hd);
> +    }
> +
>       ret = bdrv_set_backing_noperm(bs, backing_hd, tran, errp);
>       if (ret < 0) {
>           goto out;
> @@ -3464,6 +3491,10 @@ int bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd,
>       ret = bdrv_refresh_perms(bs, errp);
>   out:
>       tran_finalize(tran, ret);
> +    if (backing_hd) {
> +        bdrv_subtree_drained_end_unlocked(backing_hd);
> +    }
> +    bdrv_subtree_drained_end_unlocked(bs);
>   
>       return ret;
>   }
> @@ -5266,7 +5297,8 @@ static int bdrv_replace_node_common(BlockDriverState *from,
>   
>       assert(qemu_get_current_aio_context() == qemu_get_aio_context());
>       assert(bdrv_get_aio_context(from) == bdrv_get_aio_context(to));
> -    bdrv_drained_begin(from);
> +    bdrv_subtree_drained_begin_unlocked(from);
> +    bdrv_subtree_drained_begin_unlocked(to);
>   
>       /*
>        * Do the replacement without permission update.
> @@ -5298,7 +5330,8 @@ static int bdrv_replace_node_common(BlockDriverState *from,
>   out:
>       tran_finalize(tran, ret);
>   
> -    bdrv_drained_end(from);
> +    bdrv_subtree_drained_end_unlocked(to);
> +    bdrv_subtree_drained_end_unlocked(from);
>       bdrv_unref(from);
>   
>       return ret;
> @@ -5345,6 +5378,9 @@ int bdrv_append(BlockDriverState *bs_new, BlockDriverState *bs_top,
>   
>       assert(!bs_new->backing);
>   
> +    bdrv_subtree_drained_begin_unlocked(bs_new);
> +    bdrv_subtree_drained_begin_unlocked(bs_top);
> +
>       ret = bdrv_attach_child_noperm(bs_new, bs_top, "backing",
>                                      &child_of_bds, bdrv_backing_role(bs_new),
>                                      &bs_new->backing, tran, errp);
> @@ -5360,6 +5396,8 @@ int bdrv_append(BlockDriverState *bs_new, BlockDriverState *bs_top,
>       ret = bdrv_refresh_perms(bs_new, errp);
>   out:
>       tran_finalize(tran, ret);
> +    bdrv_subtree_drained_end_unlocked(bs_top);
> +    bdrv_subtree_drained_end_unlocked(bs_new);
>   
>       bdrv_refresh_limits(bs_top, NULL, NULL);
>   
> @@ -5379,8 +5417,8 @@ int bdrv_replace_child_bs(BdrvChild *child, BlockDriverState *new_bs,
>       assert(qemu_in_main_thread());
>   
>       bdrv_ref(old_bs);
> -    bdrv_drained_begin(old_bs);
> -    bdrv_drained_begin(new_bs);
> +    bdrv_subtree_drained_begin_unlocked(old_bs);
> +    bdrv_subtree_drained_begin_unlocked(new_bs);
>   
>       bdrv_replace_child_tran(&child, new_bs, tran, true);
>       /* @new_bs must have been non-NULL, so @child must not have been freed */
> @@ -5394,8 +5432,8 @@ int bdrv_replace_child_bs(BdrvChild *child, BlockDriverState *new_bs,
>   
>       tran_finalize(tran, ret);
>   
> -    bdrv_drained_end(old_bs);
> -    bdrv_drained_end(new_bs);
> +    bdrv_subtree_drained_end_unlocked(new_bs);
> +    bdrv_subtree_drained_end_unlocked(old_bs);
>       bdrv_unref(old_bs);
>   
>       return ret;
> 

As we started to discuss in a thread started with my "[PATCH] block: bdrv_set_backing_hd(): use drained section", I think draining the whole subtree when we modify some parent-child relation is too much. In-flight requests in the subtree don't touch these relations, so why wait for/stop them? Imagine two disks A and B with the same backing file C. If we want to detach A from C, what is the reason to drain an in-flight read request of B in C?

Modifying the children/parents lists should be protected somehow. Currently it's protected by the aio-context lock. If we drop it, we probably need some new mutex for it. Also, graph modifications should be protected from each other, so that one graph modification is not started while another is in flight; we probably need some global mutex for graph modification. But that's all not about drains.

Drains are about in-flight requests. And when we switch a child of node X, we should do it in a drained section for X, as in-flight requests in X can touch its children. But other in-flight requests in the subtree are safe and should not be awaited.
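
As a minimal sketch of the narrower draining argued for here (an assumption
about the proposal, not code from the posted series), switching a child of
node X would quiesce only X itself, reusing the bdrv_replace_child_noperm()
signature used in this patch:

static void switch_child_of_x(BlockDriverState *x, BdrvChild **childp,
                              BlockDriverState *new_bs)
{
    /*
     * Only requests in flight in x may walk x->children, so only x is
     * drained; requests elsewhere in the subtree keep running.
     */
    bdrv_drained_begin(x);
    bdrv_replace_child_noperm(childp, new_bs, true);
    bdrv_drained_end(x);
}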
Emanuele Giuseppe Esposito Feb. 2, 2022, 3:37 p.m. UTC | #3
On 01/02/2022 15:47, Vladimir Sementsov-Ogievskiy wrote:
> 18.01.2022 19:27, Emanuele Giuseppe Esposito wrote:
>> Protect bdrv_replace_child_noperm, as it modifies the
>> graph by adding/removing elements to .children and .parents
>> list of a bs. Use the newly introduced
>> bdrv_subtree_drained_{begin/end}_unlocked drains to achieve
>> that and be free from the aiocontext lock.
>>
>> One important criterion to keep in mind is that if the caller of
>> bdrv_replace_child_noperm creates a transaction, we need to make sure
>> that the
>> whole transaction is under the same drain block. This is imperative,
>> as having
>> multiple drains also in the .abort() class of functions causes
>> discrepancies
>> in the drained counters (as nodes are put back into the original
>> positions),
>> making it really hard to return them all to zero and leaving the code very
>> buggy.
>> See https://patchew.org/QEMU/20211213104014.69858-1-eesposit@redhat.com/
>> for more explanations.
>>
>> Unfortunately we still need to have bdrv_subtree_drained_begin/end
>> in bdrv_detach_child() releasing and then holding the AioContext
>> lock, since it later invokes bdrv_try_set_aio_context() that is
>> not safe yet. Once all is cleaned up, we can also remove the
>> acquire/release locks in job_unref, artificially added because of this.
>>
>> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
>> ---
>>   block.c | 50 ++++++++++++++++++++++++++++++++++++++++++++------
>>   1 file changed, 44 insertions(+), 6 deletions(-)
>>
>> diff --git a/block.c b/block.c
>> index fcc44a49a0..6196c95aae 100644
>> --- a/block.c
>> +++ b/block.c
>> @@ -3114,8 +3114,22 @@ static void bdrv_detach_child(BdrvChild **childp)
>>       BlockDriverState *old_bs = (*childp)->bs;
>>         assert(qemu_in_main_thread());
>> +    if (old_bs) {
>> +        /*
>> +         * TODO: this is called by job_unref with lock held, because
>> +         * afterwards it calls bdrv_try_set_aio_context.
>> +         * Once all of this is fixed, take care of removing
>> +         * the aiocontext lock and make this function _unlocked.
>> +         */
>> +        bdrv_subtree_drained_begin(old_bs);
>> +    }
>> +
>>       bdrv_replace_child_noperm(childp, NULL, true);
>>   +    if (old_bs) {
>> +        bdrv_subtree_drained_end(old_bs);
>> +    }
>> +
>>       if (old_bs) {
>>           /*
>>            * Update permissions for old node. We're just taking a
>> parent away, so
>> @@ -3154,6 +3168,7 @@ BdrvChild
>> *bdrv_root_attach_child(BlockDriverState *child_bs,
>>       Transaction *tran = tran_new();
>>         assert(qemu_in_main_thread());
>> +    bdrv_subtree_drained_begin_unlocked(child_bs);
>>         ret = bdrv_attach_child_common(child_bs, child_name, child_class,
>>                                      child_role, perm, shared_perm,
>> opaque,
>> @@ -3168,6 +3183,7 @@ out:
>>       tran_finalize(tran, ret);
>>       /* child is unset on failure by bdrv_attach_child_common_abort() */
>>       assert((ret < 0) == !child);
>> +    bdrv_subtree_drained_end_unlocked(child_bs);
>>         bdrv_unref(child_bs);
>>       return child;
>> @@ -3197,6 +3213,9 @@ BdrvChild *bdrv_attach_child(BlockDriverState
>> *parent_bs,
>>         assert(qemu_in_main_thread());
>>   +    bdrv_subtree_drained_begin_unlocked(parent_bs);
>> +    bdrv_subtree_drained_begin_unlocked(child_bs);
>> +
>>       ret = bdrv_attach_child_noperm(parent_bs, child_bs, child_name,
>> child_class,
>>                                      child_role, &child, tran, errp);
>>       if (ret < 0) {
>> @@ -3211,6 +3230,9 @@ BdrvChild *bdrv_attach_child(BlockDriverState
>> *parent_bs,
>>   out:
>>       tran_finalize(tran, ret);
>>       /* child is unset on failure by bdrv_attach_child_common_abort() */
>> +    bdrv_subtree_drained_end_unlocked(child_bs);
>> +    bdrv_subtree_drained_end_unlocked(parent_bs);
>> +
>>       assert((ret < 0) == !child);
>>         bdrv_unref(child_bs);
>> @@ -3456,6 +3478,11 @@ int bdrv_set_backing_hd(BlockDriverState *bs,
>> BlockDriverState *backing_hd,
>>         assert(qemu_in_main_thread());
>>   +    bdrv_subtree_drained_begin_unlocked(bs);
>> +    if (backing_hd) {
>> +        bdrv_subtree_drained_begin_unlocked(backing_hd);
>> +    }
>> +
>>       ret = bdrv_set_backing_noperm(bs, backing_hd, tran, errp);
>>       if (ret < 0) {
>>           goto out;
>> @@ -3464,6 +3491,10 @@ int bdrv_set_backing_hd(BlockDriverState *bs,
>> BlockDriverState *backing_hd,
>>       ret = bdrv_refresh_perms(bs, errp);
>>   out:
>>       tran_finalize(tran, ret);
>> +    if (backing_hd) {
>> +        bdrv_subtree_drained_end_unlocked(backing_hd);
>> +    }
>> +    bdrv_subtree_drained_end_unlocked(bs);
>>         return ret;
>>   }
>> @@ -5266,7 +5297,8 @@ static int
>> bdrv_replace_node_common(BlockDriverState *from,
>>         assert(qemu_get_current_aio_context() == qemu_get_aio_context());
>>       assert(bdrv_get_aio_context(from) == bdrv_get_aio_context(to));
>> -    bdrv_drained_begin(from);
>> +    bdrv_subtree_drained_begin_unlocked(from);
>> +    bdrv_subtree_drained_begin_unlocked(to);
>>         /*
>>        * Do the replacement without permission update.
>> @@ -5298,7 +5330,8 @@ static int
>> bdrv_replace_node_common(BlockDriverState *from,
>>   out:
>>       tran_finalize(tran, ret);
>>   -    bdrv_drained_end(from);
>> +    bdrv_subtree_drained_end_unlocked(to);
>> +    bdrv_subtree_drained_end_unlocked(from);
>>       bdrv_unref(from);
>>         return ret;
>> @@ -5345,6 +5378,9 @@ int bdrv_append(BlockDriverState *bs_new,
>> BlockDriverState *bs_top,
>>         assert(!bs_new->backing);
>>   +    bdrv_subtree_drained_begin_unlocked(bs_new);
>> +    bdrv_subtree_drained_begin_unlocked(bs_top);
>> +
>>       ret = bdrv_attach_child_noperm(bs_new, bs_top, "backing",
>>                                      &child_of_bds,
>> bdrv_backing_role(bs_new),
>>                                      &bs_new->backing, tran, errp);
>> @@ -5360,6 +5396,8 @@ int bdrv_append(BlockDriverState *bs_new,
>> BlockDriverState *bs_top,
>>       ret = bdrv_refresh_perms(bs_new, errp);
>>   out:
>>       tran_finalize(tran, ret);
>> +    bdrv_subtree_drained_end_unlocked(bs_top);
>> +    bdrv_subtree_drained_end_unlocked(bs_new);
>>         bdrv_refresh_limits(bs_top, NULL, NULL);
>>   @@ -5379,8 +5417,8 @@ int bdrv_replace_child_bs(BdrvChild *child,
>> BlockDriverState *new_bs,
>>       assert(qemu_in_main_thread());
>>         bdrv_ref(old_bs);
>> -    bdrv_drained_begin(old_bs);
>> -    bdrv_drained_begin(new_bs);
>> +    bdrv_subtree_drained_begin_unlocked(old_bs);
>> +    bdrv_subtree_drained_begin_unlocked(new_bs);
>>         bdrv_replace_child_tran(&child, new_bs, tran, true);
>>       /* @new_bs must have been non-NULL, so @child must not have been
>> freed */
>> @@ -5394,8 +5432,8 @@ int bdrv_replace_child_bs(BdrvChild *child,
>> BlockDriverState *new_bs,
>>         tran_finalize(tran, ret);
>>   -    bdrv_drained_end(old_bs);
>> -    bdrv_drained_end(new_bs);
>> +    bdrv_subtree_drained_end_unlocked(new_bs);
>> +    bdrv_subtree_drained_end_unlocked(old_bs);
>>       bdrv_unref(old_bs);
>>         return ret;
>>

From the other thread:
> So you mean that backing_hd is modified as its parent is changed, do I understand correctly?

Yes

> 
> As we started to discuss in a thread started with my "[PATCH] block:
> bdrv_set_backing_hd(): use drained section", I think draining the whole
> subtree when we modify some parent-child relation is too much. In-flight
> requests in subtree don't touch these relations, so why to wait/stop
> them? Imagine two disks A and B with same backing file C. If we want to
> detach A from C, what is the reason to drain an in-flight read request of B
> in C?

If I am not mistaken, bdrv_set_backing_hd adds a new node (yes, it removes
the old backing_hd, but let's not think about that for a moment).
So we have disk B with backing file C, and a new disk A that wants to have
backing file C.

I think I understand what you mean, so in theory the operation would be
- create new child
- add child to A->children list
- add child to C->parents list

So in theory we need to
* drain A (without subtree), because it can't happen that child nodes of
  A have in-flight requests that look at A's state (children list), right?
  In other words, if A has another node X, can a request in X inspect
  A->children?
* drain C, as parents can inspect C's state (like B). Same assumption
  here, C->children[x]->bs cannot have requests inspecting C->parents
  list?

> Modifying children/parents lists should be protected somehow. Currently
> it's protected by aio-context lock.. If we drop it, we probably need
> some new mutex for it. Also, graph modification should be protected from
> each other, so that one graph modification is not started until another
> is in-flight, probably we need some global mutex for graph modification.
> But that's all not about drains.

The idea was to use BQL + drain to obtain the same effect as the AioContext
lock. The BQL should prevent concurrent graph modifications, and drain should
prevent concurrent I/O from reading the state while it is being modified.
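
A minimal sketch of that invariant (names and shape are illustrative only,
not code from this series):

static void graph_writer_sketch(BlockDriverState *bs)
{
    /* BQL: no other graph writer can run concurrently. */
    assert(qemu_in_main_thread());

    /*
     * Drain: no I/O-path reader can walk the child/parent lists while
     * they are being modified.
     */
    bdrv_subtree_drained_begin_unlocked(bs);

    /* ... add/remove entries in bs->children / bs->parents here ... */

    bdrv_subtree_drained_end_unlocked(bs);
}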

Emanuele
> 
> Drains are about in-flight requests. And when we switch a child of node
> X, we should do it in drained section for X, as in-flight requests in X
> can touch its children. But another in-flight requests in subtree are
> safe and should not be awaited.
> 
>
Paolo Bonzini Feb. 2, 2022, 5:38 p.m. UTC | #4
On 2/2/22 16:37, Emanuele Giuseppe Esposito wrote:
> So we have disk B with backing file C, and new disk A that wants to have
> backing file C.
> 
> I think I understand what you mean, so in theory the operation would be
> - create new child
> - add child to A->children list
> - add child to C->parents list
> 
> So in theory we need to
> * drain A (without subtree), because it can't happen that child nodes of
>    A have in-flight requests that look at A status (children list), right?
>    In other words, if A has another node X, can a request in X inspect
>    A->children
> * drain C, as parents can inspect C status (like B). Same assumption
>    here, C->children[x]->bs cannot have requests inspecting C->parents
>    list?

In that case (i.e. if parents have to be drained, but children need not) 
bdrv_drained_begin_unlocked would be enough, right?

That would mean that ->children is I/O state but ->parents is global 
state.  I think it's quite a bit more complicated to analyze and to 
understand.

Paolo
Emanuele Giuseppe Esposito Feb. 3, 2022, 10:09 a.m. UTC | #5
On 02/02/2022 18:38, Paolo Bonzini wrote:
> On 2/2/22 16:37, Emanuele Giuseppe Esposito wrote:
>> So we have disk B with backing file C, and new disk A that wants to have
>> backing file C.
>>
>> I think I understand what you mean, so in theory the operation would be
>> - create new child
>> - add child to A->children list
>> - add child to C->parents list
>>
>> So in theory we need to
>> * drain A (without subtree), because it can't happen that child nodes of
>>    A have in-flight requests that look at A status (children list),
>> right?
>>    In other words, if A has another node X, can a request in X inspect
>>    A->children

I am not sure if this can happen. It doesn't seem so, though. All
functions inspecting ->parents are either GS or don't call recursive
functions that go down the children list again.

>> * drain C, as parents can inspect C status (like B). Same assumption
>>    here, C->children[x]->bs cannot have requests inspecting C->parents
>>    list?

What if C->children[x]->bs drains? We would have a child inspecting
C->parents.

> 
> In that case (i.e. if parents have to be drained, but children need not)
> bdrv_drained_begin_unlocked would be enough, right?

Should be, if that is the case.

> 
> That would mean that ->children is I/O state but ->parents is global
> state.  I think it's quite a bit more complicated to analyze and to
> understand.

Not sure I follow this one: why is ->parents GS? It is also used by the
drain API...

Emanuele
Emanuele Giuseppe Esposito Feb. 3, 2022, 1:13 p.m. UTC | #6
On 19/01/2022 10:47, Paolo Bonzini wrote:
> 
> About this:
> 
>> +         * TODO: this is called by job_unref with lock held, because
>> +         * afterwards it calls bdrv_try_set_aio_context.
>> +         * Once all of this is fixed, take care of removing
>> +         * the aiocontext lock and make this function _unlocked.
> 
> It may be clear to you, but it's quite cryptic:
> 

I think you figured it by looking at the job patches, but:

> - which lock is held by job_unref()?  Also, would it make more sense to
> talk about block_job_free() rather than job_unref()?  I can't quite
> follow where the AioContext lock is taken.

The AioContext lock. I think this is a change I made in the job patches, so
comparing it with master would make this piece harder to understand.
In the job series, I reduce the granularity of the AioContext lock,
ending up having it only around a few callbacks of JobDriver, namely
->free(). This is why I talk about job_unref, because it calls ->free.

The actual lock is taken in job_unref, but the function invoked through
->free is block_job_free. Yes, it makes more sense to mention block_job_free.

> - what is "all of this", and what do you mean by "not safe yet"?  Do
> both refer to bdrv_try_set_aio_context() needing the AioContext lock?

Yes

> - what is "this function" (that should become _unlocked)?

bdrv_subtree_drained_begin

This is the new comment I intend to put:
/*
 * TODO: this function is called by BlockJobDriver's ->free()
 * callback, block_job_free.
 * We unfortunately must still take the AioContext lock around
 * ->free() in job_unref, because of the bdrv_try_set_aio_context
 * call below that still releases/acquires it.
 * Once bdrv_try_set_aio_context does not require the AioContext lock,
 * take care of removing it around ->free() and replace
 * the following bdrv_subtree_drained_begin with its
 * _unlocked version.
 */
> 
> I think you could also split the patch in multiple parts for different
> call chains.  In particular bdrv_set_backing_hd can be merged with the
> patch to bdrv_reopen_parse_file_or_backing, since both of them deal with
> bdrv_set_file_or_backing_noperm.

Ok, I will try to do that.

Emanuele
Vladimir Sementsov-Ogievskiy Feb. 4, 2022, 9:49 a.m. UTC | #7
02.02.2022 18:37, Emanuele Giuseppe Esposito wrote:
> 
> 
> On 01/02/2022 15:47, Vladimir Sementsov-Ogievskiy wrote:
>> 18.01.2022 19:27, Emanuele Giuseppe Esposito wrote:
>>> Protect bdrv_replace_child_noperm, as it modifies the
>>> graph by adding/removing elements to .children and .parents
>>> list of a bs. Use the newly introduced
>>> bdrv_subtree_drained_{begin/end}_unlocked drains to achieve
>>> that and be free from the aiocontext lock.
>>>
>>> One important criterion to keep in mind is that if the caller of
>>> bdrv_replace_child_noperm creates a transaction, we need to make sure
>>> that the
>>> whole transaction is under the same drain block. This is imperative,
>>> as having
>>> multiple drains also in the .abort() class of functions causes
>>> discrepancies
>>> in the drained counters (as nodes are put back into the original
>>> positions),
>>> making it really hard to return them all to zero and leaving the code very
>>> buggy.
>>> See https://patchew.org/QEMU/20211213104014.69858-1-eesposit@redhat.com/
>>> for more explanations.
>>>
>>> Unfortunately we still need to have bdrv_subtree_drained_begin/end
>>> in bdrv_detach_child() releasing and then holding the AioContext
>>> lock, since it later invokes bdrv_try_set_aio_context() that is
>>> not safe yet. Once all is cleaned up, we can also remove the
>>> acquire/release locks in job_unref, artificially added because of this.
>>>
>>> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
>>> ---
>>>    block.c | 50 ++++++++++++++++++++++++++++++++++++++++++++------
>>>    1 file changed, 44 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/block.c b/block.c
>>> index fcc44a49a0..6196c95aae 100644
>>> --- a/block.c
>>> +++ b/block.c
>>> @@ -3114,8 +3114,22 @@ static void bdrv_detach_child(BdrvChild **childp)
>>>        BlockDriverState *old_bs = (*childp)->bs;
>>>          assert(qemu_in_main_thread());
>>> +    if (old_bs) {
>>> +        /*
>>> +         * TODO: this is called by job_unref with lock held, because
>>> +         * afterwards it calls bdrv_try_set_aio_context.
>>> +         * Once all of this is fixed, take care of removing
>>> +         * the aiocontext lock and make this function _unlocked.
>>> +         */
>>> +        bdrv_subtree_drained_begin(old_bs);
>>> +    }
>>> +
>>>        bdrv_replace_child_noperm(childp, NULL, true);
>>>    +    if (old_bs) {
>>> +        bdrv_subtree_drained_end(old_bs);
>>> +    }
>>> +
>>>        if (old_bs) {
>>>            /*
>>>             * Update permissions for old node. We're just taking a
>>> parent away, so
>>> @@ -3154,6 +3168,7 @@ BdrvChild
>>> *bdrv_root_attach_child(BlockDriverState *child_bs,
>>>        Transaction *tran = tran_new();
>>>          assert(qemu_in_main_thread());
>>> +    bdrv_subtree_drained_begin_unlocked(child_bs);
>>>          ret = bdrv_attach_child_common(child_bs, child_name, child_class,
>>>                                       child_role, perm, shared_perm,
>>> opaque,
>>> @@ -3168,6 +3183,7 @@ out:
>>>        tran_finalize(tran, ret);
>>>        /* child is unset on failure by bdrv_attach_child_common_abort() */
>>>        assert((ret < 0) == !child);
>>> +    bdrv_subtree_drained_end_unlocked(child_bs);
>>>          bdrv_unref(child_bs);
>>>        return child;
>>> @@ -3197,6 +3213,9 @@ BdrvChild *bdrv_attach_child(BlockDriverState
>>> *parent_bs,
>>>          assert(qemu_in_main_thread());
>>>    +    bdrv_subtree_drained_begin_unlocked(parent_bs);
>>> +    bdrv_subtree_drained_begin_unlocked(child_bs);
>>> +
>>>        ret = bdrv_attach_child_noperm(parent_bs, child_bs, child_name,
>>> child_class,
>>>                                       child_role, &child, tran, errp);
>>>        if (ret < 0) {
>>> @@ -3211,6 +3230,9 @@ BdrvChild *bdrv_attach_child(BlockDriverState
>>> *parent_bs,
>>>    out:
>>>        tran_finalize(tran, ret);
>>>        /* child is unset on failure by bdrv_attach_child_common_abort() */
>>> +    bdrv_subtree_drained_end_unlocked(child_bs);
>>> +    bdrv_subtree_drained_end_unlocked(parent_bs);
>>> +
>>>        assert((ret < 0) == !child);
>>>          bdrv_unref(child_bs);
>>> @@ -3456,6 +3478,11 @@ int bdrv_set_backing_hd(BlockDriverState *bs,
>>> BlockDriverState *backing_hd,
>>>          assert(qemu_in_main_thread());
>>>    +    bdrv_subtree_drained_begin_unlocked(bs);
>>> +    if (backing_hd) {
>>> +        bdrv_subtree_drained_begin_unlocked(backing_hd);
>>> +    }
>>> +
>>>        ret = bdrv_set_backing_noperm(bs, backing_hd, tran, errp);
>>>        if (ret < 0) {
>>>            goto out;
>>> @@ -3464,6 +3491,10 @@ int bdrv_set_backing_hd(BlockDriverState *bs,
>>> BlockDriverState *backing_hd,
>>>        ret = bdrv_refresh_perms(bs, errp);
>>>    out:
>>>        tran_finalize(tran, ret);
>>> +    if (backing_hd) {
>>> +        bdrv_subtree_drained_end_unlocked(backing_hd);
>>> +    }
>>> +    bdrv_subtree_drained_end_unlocked(bs);
>>>          return ret;
>>>    }
>>> @@ -5266,7 +5297,8 @@ static int
>>> bdrv_replace_node_common(BlockDriverState *from,
>>>          assert(qemu_get_current_aio_context() == qemu_get_aio_context());
>>>        assert(bdrv_get_aio_context(from) == bdrv_get_aio_context(to));
>>> -    bdrv_drained_begin(from);
>>> +    bdrv_subtree_drained_begin_unlocked(from);
>>> +    bdrv_subtree_drained_begin_unlocked(to);
>>>          /*
>>>         * Do the replacement without permission update.
>>> @@ -5298,7 +5330,8 @@ static int
>>> bdrv_replace_node_common(BlockDriverState *from,
>>>    out:
>>>        tran_finalize(tran, ret);
>>>    -    bdrv_drained_end(from);
>>> +    bdrv_subtree_drained_end_unlocked(to);
>>> +    bdrv_subtree_drained_end_unlocked(from);
>>>        bdrv_unref(from);
>>>          return ret;
>>> @@ -5345,6 +5378,9 @@ int bdrv_append(BlockDriverState *bs_new,
>>> BlockDriverState *bs_top,
>>>          assert(!bs_new->backing);
>>>    +    bdrv_subtree_drained_begin_unlocked(bs_new);
>>> +    bdrv_subtree_drained_begin_unlocked(bs_top);
>>> +
>>>        ret = bdrv_attach_child_noperm(bs_new, bs_top, "backing",
>>>                                       &child_of_bds,
>>> bdrv_backing_role(bs_new),
>>>                                       &bs_new->backing, tran, errp);
>>> @@ -5360,6 +5396,8 @@ int bdrv_append(BlockDriverState *bs_new,
>>> BlockDriverState *bs_top,
>>>        ret = bdrv_refresh_perms(bs_new, errp);
>>>    out:
>>>        tran_finalize(tran, ret);
>>> +    bdrv_subtree_drained_end_unlocked(bs_top);
>>> +    bdrv_subtree_drained_end_unlocked(bs_new);
>>>          bdrv_refresh_limits(bs_top, NULL, NULL);
>>>    @@ -5379,8 +5417,8 @@ int bdrv_replace_child_bs(BdrvChild *child,
>>> BlockDriverState *new_bs,
>>>        assert(qemu_in_main_thread());
>>>          bdrv_ref(old_bs);
>>> -    bdrv_drained_begin(old_bs);
>>> -    bdrv_drained_begin(new_bs);
>>> +    bdrv_subtree_drained_begin_unlocked(old_bs);
>>> +    bdrv_subtree_drained_begin_unlocked(new_bs);
>>>          bdrv_replace_child_tran(&child, new_bs, tran, true);
>>>        /* @new_bs must have been non-NULL, so @child must not have been
>>> freed */
>>> @@ -5394,8 +5432,8 @@ int bdrv_replace_child_bs(BdrvChild *child,
>>> BlockDriverState *new_bs,
>>>          tran_finalize(tran, ret);
>>>    -    bdrv_drained_end(old_bs);
>>> -    bdrv_drained_end(new_bs);
>>> +    bdrv_subtree_drained_end_unlocked(new_bs);
>>> +    bdrv_subtree_drained_end_unlocked(old_bs);
>>>        bdrv_unref(old_bs);
>>>          return ret;
>>>
> 
>  From the other thread:
>> So you mean that backing_hd is modified as its parent is changed, do I understand correctly?
> 
> Yes
> 
>>
>> As we started to discuss in a thread started with my "[PATCH] block:
>> bdrv_set_backing_hd(): use drained section", I think draining the whole
>> subtree when we modify some parent-child relation is too much. In-flight
>> requests in subtree don't touch these relations, so why to wait/stop
>> them? Imagine two disks A and B with same backing file C. If we want to
>> detach A from C, what is the reason to drain an in-flight read request of B
>> in C?
> 
> If I am not mistaken, bdrv_set_backing_hd adds a new node (yes removes
> the old backing_hd, but let's not think about this for a moment).
> So we have disk B with backing file C, and new disk A that wants to have
> backing file C.
> 
> I think I understand what you mean, so in theory the operation would be
> - create new child
> - add child to A->children list
> - add child to C->parents list
> 
> So in theory we need to
> * drain A (without subtree), because it can't happen that child nodes of
>    A have in-flight requests that look at A status (children list), right?
>    In other words, if A has another node X, can a request in X inspect
>    A->children

It should not happen. In my understanding, I/O requests never access the parents of a node.

> * drain C, as parents can inspect C status (like B). Same assumption
>    here, C->children[x]->bs cannot have requests inspecting C->parents
>    list?

No, I think we should not drain C. I/O requests don't inspect the parents list. And if some _other_ operations do inspect it, drain will not help, as drain is only about I/O requests.

> 
>> Modifying children/parents lists should be protected somehow. Currently
>> it's protected by aio-context lock.. If we drop it, we probably need
>> some new mutex for it. Also, graph modification should be protected from
>> each other, so that one graph modification is not started until another
>> is in-flight, probably we need some global mutex for graph modification.
>> But that's all not about drains.
> 
> The idea was to use BQL + drain, to obtain the same effect of aiocontext
> lock. BQL should prevent concurrent graph modifications, drain
> concurrent I/O reading the state while being modified.
> 

Concurrent I/O should not touch bs->parents list, so we don't need to drain C.

>>
>> Drains are about in-flight requests. And when we switch a child of node
>> X, we should do it in drained section for X, as in-flight requests in X
>> can touch its children. But another in-flight requests in subtree are
>> safe and should not be awaited.
>>
>>
>
Emanuele Giuseppe Esposito Feb. 4, 2022, 1:30 p.m. UTC | #8
>>
>>  From the other thread:
>>> So you mean that backing_hd is modified as its parent is changed, do
>>> I understand correctly?
>>
>> Yes
>>
>>>
>>> As we started to discuss in a thread started with my "[PATCH] block:
>>> bdrv_set_backing_hd(): use drained section", I think draining the whole
>>> subtree when we modify some parent-child relation is too much. In-flight
>>> requests in subtree don't touch these relations, so why to wait/stop
>>> them? Imagine two disks A and B with same backing file C. If we want to
>>> detach A from C, what is the reason to drain an in-flight read request of B
>>> in C?
>>
>> If I am not mistaken, bdrv_set_backing_hd adds a new node (yes removes
>> the old backing_hd, but let's not think about this for a moment).
>> So we have disk B with backing file C, and new disk A that wants to have
>> backing file C.
>>
>> I think I understand what you mean, so in theory the operation would be
>> - create new child
>> - add child to A->children list
>> - add child to C->parents list
>>
>> So in theory we need to
>> * drain A (without subtree), because it can't happen that child nodes of
>>    A have in-flight requests that look at A status (children list),
>> right?
>>    In other words, if A has another node X, can a request in X inspect
>>    A->children
> 
> It should not happen. In my understanding, IO request never access
> parents of the node.
> 
>> * drain C, as parents can inspect C status (like B). Same assumption
>>    here, C->children[x]->bs cannot have requests inspecting C->parents
>>    list?
> 
> No, I think we should not drain C. IO requests don't inspect parents
> list. And if some _other_ operations do inspect it, drain will not help,
> as drain is only about IO requests.

I am still not convinced about this, because of the draining.

While drain can only be called by either the main loop or the "home
iothread" (the one running the AioContext), it still means there are two
threads involved. So while the iothread runs set_backing_hd, the main loop
could easily drain one of the children of the nodes, or the other way
around.
I am not saying that this happens, but it *might* happen.

I am a little bit confused about this; it would be nice to hear opinions
from others as well.

Once we figure this out, I will know where exactly to put the assertions,
and consequently what to drain and with which function.
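
For example, a hypothetical assertion of this kind (the helper name is made
up; bs->quiesce_counter is the existing drain counter in BlockDriverState):

static void assert_graph_writable(BlockDriverState *bs)
{
    /*
     * Whoever modifies ->children/->parents must hold the BQL and have
     * the node drained.
     */
    assert(qemu_in_main_thread());
    assert(bs->quiesce_counter > 0);
}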

Emanuele
Vladimir Sementsov-Ogievskiy Feb. 4, 2022, 2:03 p.m. UTC | #9
04.02.2022 16:30, Emanuele Giuseppe Esposito wrote:
>>>
>>>   From the other thread:
>>>> So you mean that backing_hd is modified as its parent is changed, do
>>>> I understand correctly?
>>>
>>> Yes
>>>
>>>>
>>>> As we started to discuss in a thread started with my "[PATCH] block:
>>>> bdrv_set_backing_hd(): use drained section", I think draining the whole
>>>> subtree when we modify some parent-child relation is too much. In-flight
>>>> requests in subtree don't touch these relations, so why to wait/stop
>>>> them? Imagine two disks A and B with same backing file C. If we want to
>>>> detach A from C, what is the reason to drain an in-flight read request of B
>>>> in C?
>>>
>>> If I am not mistaken, bdrv_set_backing_hd adds a new node (yes removes
>>> the old backing_hd, but let's not think about this for a moment).
>>> So we have disk B with backing file C, and new disk A that wants to have
>>> backing file C.
>>>
>>> I think I understand what you mean, so in theory the operation would be
>>> - create new child
>>> - add child to A->children list
>>> - add child to C->parents list
>>>
>>> So in theory we need to
>>> * drain A (without subtree), because it can't happen that child nodes of
>>>     A have in-flight requests that look at A status (children list),
>>> right?
>>>     In other words, if A has another node X, can a request in X inspect
>>>     A->children
>>
>> It should not happen. In my understanding, IO request never access
>> parents of the node.
>>
>>> * drain C, as parents can inspect C status (like B). Same assumption
>>>     here, C->children[x]->bs cannot have requests inspecting C->parents
>>>     list?
>>
>> No, I think we should not drain C. IO requests don't inspect parents
>> list. And if some _other_ operations do inspect it, drain will not help,
>> as drain is only about IO requests.
> 
> I am still not convinced about this, because of the draining.
> 
> While drain can only be called by either main loop or the "home
> iothread" (the one running the AioContext), it still means there are 2
> threads involved. So while the iothread runs set_backing_hd, main loop
> could easily drain one of the children of the nodes. Or the other way
> around.
> I am not saying that this happens, but it *might* happen.

I agree that it might happen. So, parallel drains are a problem. But drain is always part of a graph modification, and any graph modifications running in parallel are already a problem; we should forbid that somehow.

When you drain a node, you say: please, no in-flight requests at this node now. But I/O requests don't do drains themselves, so why restrict them?

And anyway, I don't see how this helps. OK, assume you drain the additional node C or even the whole subtree. What protects us from a parallel drain from another thread?

> 
> I am a little bit confused about this, it would be nice to hear opinions
> from others as well.
> 
> Once we figure this, I will know where exactly to put the assertions,
> and consequently what to drain and with which function.
> 
> Emanuele
>

Patch

diff --git a/block.c b/block.c
index fcc44a49a0..6196c95aae 100644
--- a/block.c
+++ b/block.c
@@ -3114,8 +3114,22 @@  static void bdrv_detach_child(BdrvChild **childp)
     BlockDriverState *old_bs = (*childp)->bs;
 
     assert(qemu_in_main_thread());
+    if (old_bs) {
+        /*
+         * TODO: this is called by job_unref with lock held, because
+         * afterwards it calls bdrv_try_set_aio_context.
+         * Once all of this is fixed, take care of removing
+         * the aiocontext lock and make this function _unlocked.
+         */
+        bdrv_subtree_drained_begin(old_bs);
+    }
+
     bdrv_replace_child_noperm(childp, NULL, true);
 
+    if (old_bs) {
+        bdrv_subtree_drained_end(old_bs);
+    }
+
     if (old_bs) {
         /*
          * Update permissions for old node. We're just taking a parent away, so
@@ -3154,6 +3168,7 @@  BdrvChild *bdrv_root_attach_child(BlockDriverState *child_bs,
     Transaction *tran = tran_new();
 
     assert(qemu_in_main_thread());
+    bdrv_subtree_drained_begin_unlocked(child_bs);
 
     ret = bdrv_attach_child_common(child_bs, child_name, child_class,
                                    child_role, perm, shared_perm, opaque,
@@ -3168,6 +3183,7 @@  out:
     tran_finalize(tran, ret);
     /* child is unset on failure by bdrv_attach_child_common_abort() */
     assert((ret < 0) == !child);
+    bdrv_subtree_drained_end_unlocked(child_bs);
 
     bdrv_unref(child_bs);
     return child;
@@ -3197,6 +3213,9 @@  BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
 
     assert(qemu_in_main_thread());
 
+    bdrv_subtree_drained_begin_unlocked(parent_bs);
+    bdrv_subtree_drained_begin_unlocked(child_bs);
+
     ret = bdrv_attach_child_noperm(parent_bs, child_bs, child_name, child_class,
                                    child_role, &child, tran, errp);
     if (ret < 0) {
@@ -3211,6 +3230,9 @@  BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
 out:
     tran_finalize(tran, ret);
     /* child is unset on failure by bdrv_attach_child_common_abort() */
+    bdrv_subtree_drained_end_unlocked(child_bs);
+    bdrv_subtree_drained_end_unlocked(parent_bs);
+
     assert((ret < 0) == !child);
 
     bdrv_unref(child_bs);
@@ -3456,6 +3478,11 @@  int bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd,
 
     assert(qemu_in_main_thread());
 
+    bdrv_subtree_drained_begin_unlocked(bs);
+    if (backing_hd) {
+        bdrv_subtree_drained_begin_unlocked(backing_hd);
+    }
+
     ret = bdrv_set_backing_noperm(bs, backing_hd, tran, errp);
     if (ret < 0) {
         goto out;
@@ -3464,6 +3491,10 @@  int bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd,
     ret = bdrv_refresh_perms(bs, errp);
 out:
     tran_finalize(tran, ret);
+    if (backing_hd) {
+        bdrv_subtree_drained_end_unlocked(backing_hd);
+    }
+    bdrv_subtree_drained_end_unlocked(bs);
 
     return ret;
 }
@@ -5266,7 +5297,8 @@  static int bdrv_replace_node_common(BlockDriverState *from,
 
     assert(qemu_get_current_aio_context() == qemu_get_aio_context());
     assert(bdrv_get_aio_context(from) == bdrv_get_aio_context(to));
-    bdrv_drained_begin(from);
+    bdrv_subtree_drained_begin_unlocked(from);
+    bdrv_subtree_drained_begin_unlocked(to);
 
     /*
      * Do the replacement without permission update.
@@ -5298,7 +5330,8 @@  static int bdrv_replace_node_common(BlockDriverState *from,
 out:
     tran_finalize(tran, ret);
 
-    bdrv_drained_end(from);
+    bdrv_subtree_drained_end_unlocked(to);
+    bdrv_subtree_drained_end_unlocked(from);
     bdrv_unref(from);
 
     return ret;
@@ -5345,6 +5378,9 @@  int bdrv_append(BlockDriverState *bs_new, BlockDriverState *bs_top,
 
     assert(!bs_new->backing);
 
+    bdrv_subtree_drained_begin_unlocked(bs_new);
+    bdrv_subtree_drained_begin_unlocked(bs_top);
+
     ret = bdrv_attach_child_noperm(bs_new, bs_top, "backing",
                                    &child_of_bds, bdrv_backing_role(bs_new),
                                    &bs_new->backing, tran, errp);
@@ -5360,6 +5396,8 @@  int bdrv_append(BlockDriverState *bs_new, BlockDriverState *bs_top,
     ret = bdrv_refresh_perms(bs_new, errp);
 out:
     tran_finalize(tran, ret);
+    bdrv_subtree_drained_end_unlocked(bs_top);
+    bdrv_subtree_drained_end_unlocked(bs_new);
 
     bdrv_refresh_limits(bs_top, NULL, NULL);
 
@@ -5379,8 +5417,8 @@  int bdrv_replace_child_bs(BdrvChild *child, BlockDriverState *new_bs,
     assert(qemu_in_main_thread());
 
     bdrv_ref(old_bs);
-    bdrv_drained_begin(old_bs);
-    bdrv_drained_begin(new_bs);
+    bdrv_subtree_drained_begin_unlocked(old_bs);
+    bdrv_subtree_drained_begin_unlocked(new_bs);
 
     bdrv_replace_child_tran(&child, new_bs, tran, true);
     /* @new_bs must have been non-NULL, so @child must not have been freed */
@@ -5394,8 +5432,8 @@  int bdrv_replace_child_bs(BdrvChild *child, BlockDriverState *new_bs,
 
     tran_finalize(tran, ret);
 
-    bdrv_drained_end(old_bs);
-    bdrv_drained_end(new_bs);
+    bdrv_subtree_drained_end_unlocked(new_bs);
+    bdrv_subtree_drained_end_unlocked(old_bs);
     bdrv_unref(old_bs);
 
     return ret;