
[RFC,v2,0/8] Removal of AioContext lock, bs->parents and ->children: new rwlock

Message ID 20220426085114.199647-1-eesposit@redhat.com (mailing list archive)

Message

Emanuele Giuseppe Esposito April 26, 2022, 8:51 a.m. UTC
This is a new attempt to replace the need to take the AioContext lock to
protect graph modifications. In particular, we aim to remove
(or better, replace) the AioContext lock around bdrv_replace_child_noperm,
since this function changes a BlockDriverState's ->parents and ->children
lists.

In the previous version, we decided against using subtree_drains to
protect the nodes, for several reasons; those unfamiliar with the discussion
can see https://patchew.org/QEMU/20220301142113.163174-1-eesposit@redhat.com/

In this new attempt, we introduce a custom rwlock implementation.
It is inspired by the one in cpus-common.c and aims to
reduce cacheline bouncing by keeping a per-AioContext counter of readers.
All details and the implementation of the lock are in patch 3.

We distinguish between the writer (the main loop, under BQL), which modifies
the graph, and readers (all other coroutines running in the various
AioContexts), which walk the graph edges, reading ->parents and ->children.
The writer (main loop) has exclusive access: it first waits for
current reads to finish, and then prevents incoming ones from
entering while it holds exclusive access.
The readers (coroutines in multiple AioContexts) are free to
access the graph as long as the writer is not modifying it.
If it is, they go into a CoQueue and sleep until the writer
is done.
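The reader/writer protocol above can be reduced to a small single-threaded
model. This is only an illustrative sketch: the names (reader_count,
has_writer, the tryacquire functions) are invented, and the real lock in
patch 3 waits on the writer side (AIO_WAIT_WHILE) and puts blocked readers
into a CoQueue instead of returning false:

```c
#include <assert.h>
#include <stdbool.h>

enum { NUM_CTX = 4 };

static int reader_count[NUM_CTX];   /* one counter per AioContext */
static bool has_writer;

/* The writer must sum every per-context counter to learn the total. */
static int total_readers(void)
{
    int total = 0;
    for (int i = 0; i < NUM_CTX; i++) {
        total += reader_count[i];
    }
    return total;
}

/* Reader side: bounce off if a writer holds the lock. */
static bool graph_rdlock_tryacquire(int ctx)
{
    if (has_writer) {
        return false;       /* real code: sleep in a CoQueue until woken */
    }
    reader_count[ctx]++;    /* context-local increment: no cacheline sharing */
    return true;
}

static void graph_rdunlock(int ctx)
{
    reader_count[ctx]--;
}

/* Writer side: block new readers first, then check for current ones. */
static bool graph_wrlock_tryacquire(void)
{
    has_writer = true;
    return total_readers() == 0;   /* real code: wait until the sum is 0 */
}

static void graph_wrunlock(void)
{
    has_writer = false;     /* real code: also restart the queued readers */
}
```

The key property the model shows is the exclusion in both directions: a
pending writer blocks new readers, and active readers block the writer.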

Patches 1 and 2 prepare for the lock and fix bugs or missing
cases that popped up while implementing it.
Patches 3-7 implement and use the graph lock and its assertions.
Note that assert_graph_readable is unfortunately not very efficient,
because it iterates through the list of all AioContexts to compute the
total number of readers.
For now we will use it only for debugging purposes and to make sure all
places are properly protected, but eventually we might want to disable it.
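To see why the assertion is not cheap, here is a hypothetical sketch (types
and names invented, not the actual patch 3 code) of what summing the
per-context counters looks like; every call walks the whole context list:

```c
#include <assert.h>
#include <stddef.h>

typedef struct CtxModel {
    int reader_count;           /* readers that entered from this context */
    struct CtxModel *next;
} CtxModel;

/* O(#AioContexts) per call: every counter must be visited. */
static int reader_count_total(CtxModel *head)
{
    int total = 0;
    for (CtxModel *c = head; c != NULL; c = c->next) {
        total += c->reader_count;
    }
    return total;
}

/* Debug-only check that some reader currently holds the lock. */
static void assert_graph_readable(CtxModel *head)
{
    assert(reader_count_total(head) > 0);
}
```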

Patch 8 is an example of usage: as you can see, it is not essential that
the right coroutine takes the rdlock, just that it is taken
somewhere. Otherwise the writer is not aware of any readers and can write
while others are reading.

The next step is the most complex one: figuring out where to put the rdlocks.
While the wrlock is easy, the rdlock should ideally be:
- not recursive, otherwise it is not very intuitive
- not everywhere, preferably only in the top caller of each recursion tree
- still able to protect ->parents and ->children reads.
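As a hedged illustration of the second point - taking the rdlock only in the
top caller of a recursion tree - consider this toy model (all names invented;
graph_rdlock() here is just a depth counter):

```c
#include <assert.h>
#include <stddef.h>

static int rdlock_depth;

static void graph_rdlock(void)   { rdlock_depth++; }
static void graph_rdunlock(void) { rdlock_depth--; }

typedef struct NodeModel {
    struct NodeModel *children[2];   /* stand-in for bs->children */
} NodeModel;

/* Recursive helper: reads ->children, so the caller must hold the rdlock.
 * It does not take the lock itself, keeping the lock non-recursive. */
static int count_nodes(NodeModel *bs)
{
    assert(rdlock_depth > 0);
    int n = 1;
    for (int i = 0; i < 2; i++) {
        if (bs->children[i]) {
            n += count_nodes(bs->children[i]);
        }
    }
    return n;
}

/* Top caller: the only place that takes the lock. */
static int count_nodes_locked(NodeModel *bs)
{
    graph_rdlock();
    int n = count_nodes(bs);
    graph_rdunlock();
    return n;
}
```

The whole traversal is covered by a single acquire/release pair, which is
exactly why the lock does not need to be recursive.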

Luckily, most of the cases where we recursively go through the graph are
the BlockDriverState callback functions in block_int-common.h.
In order to understand what to protect, I categorized the callbacks in
block_int-common.h depending on the type of function that calls them:

1) If the caller is a generated_co_wrapper, the function must be
   protected by the rdlock. The reason is that generated_co_wrappers create
   coroutines that run in the given bs AioContext, so it doesn't matter
   whether we are running in the main loop or not: the coroutine might run
   in an iothread.
2) If the caller calls it directly and has the GLOBAL_STATE_CODE() macro,
   then the function is safe. The main loop is the writer and thus won't
   read and write at the same time.
3) If the caller calls it directly but does not have the GLOBAL_STATE_CODE()
   macro, then we need to check the callers and see case by case whether the
   caller is in the main loop, whether it needs to take the lock, or whether
   it can delegate this duty to its own caller (to reduce the places where
   the lock is taken).
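A sketch of what category (1) means in practice, with all names invented
(this is not the real generated code): the coroutine wrapper brackets the
driver callback with the rdlock, so the callback is covered no matter which
thread ends up running the coroutine:

```c
#include <assert.h>

static int readers;   /* stand-in for the per-AioContext reader counters */

typedef struct BsModel {
    int (*drv_cb)(struct BsModel *bs);  /* stand-in for a BlockDriver hook */
} BsModel;

/* What a generated_co_wrapper-style coroutine entry could look like:
 * rdlock around the callback, because the coroutine may be scheduled
 * in the bs AioContext, i.e. possibly an iothread. */
static int co_wrapper_entry(BsModel *bs)
{
    readers++;                  /* graph_rdlock() */
    int ret = bs->drv_cb(bs);   /* may traverse ->parents / ->children */
    readers--;                  /* graph_rdunlock() */
    return ret;
}

/* Example callback, checking the invariant it can rely on. */
static int sample_cb(BsModel *bs)
{
    (void)bs;
    assert(readers > 0);   /* callback always runs under the rdlock */
    return 42;
}
```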

I used the vrc script (https://github.com/bonzini/vrc) to help find
all the callers of a callback. Using its filter function, I can
omit all functions already protected by the added lock, to avoid duplicates
when querying for new callbacks.


Emanuele Giuseppe Esposito (8):
  aio_wait_kick: add missing memory barrier
  coroutine-lock: release lock when restarting all coroutines
  block: introduce a lock to protect graph operations
  async: register/unregister aiocontext in graph lock list
  block.c: wrlock in bdrv_replace_child_noperm
  block: assert that graph read and writes are performed correctly
  graph-lock: implement WITH_GRAPH_RDLOCK_GUARD and GRAPH_RDLOCK_GUARD
    macros
  mirror: protect drains in coroutine with rdlock

 block.c                                |  39 ++++-
 block/block-backend.c                  |   6 +-
 block/graph-lock.c                     | 227 +++++++++++++++++++++++++
 block/meson.build                      |   1 +
 block/mirror.c                         |   6 +
 include/block/aio.h                    |   9 +
 include/block/block_int-common.h       |   8 +-
 include/block/block_int-global-state.h |  17 --
 include/block/block_int.h              |   1 +
 include/block/graph-lock.h             | 101 +++++++++++
 include/qemu/coroutine.h               |  10 ++
 util/aio-wait.c                        |   3 +-
 util/async.c                           |   4 +
 util/meson.build                       |   1 +
 util/qemu-coroutine-lock.c             |  26 ++-
 15 files changed, 413 insertions(+), 46 deletions(-)
 create mode 100644 block/graph-lock.c
 create mode 100644 include/block/graph-lock.h

Comments

Emanuele Giuseppe Esposito April 27, 2022, 6:55 a.m. UTC | #1
Am 26/04/2022 um 10:51 schrieb Emanuele Giuseppe Esposito:
> Luckly, most of the cases where we recursively go through a graph are
> the BlockDriverState callback functions in block_int-common.h
> In order to understand what to protect, I categorized the callbacks in
> block_int-common.h depending on the type of function that calls them:
> 
> 1) If the caller is a generated_co_wrapper, this function must be
>    protected by rdlock. The reason is that generated_co_wrapper create
>    coroutines that run in the given bs AioContext, so it doesn't matter
>    if we are running in the main loop or not, the coroutine might run
>    in an iothread.
> 2) If the caller calls it directly, and has the GLOBAL_STATE_CODE() macro,
>    then the function is safe. The main loop is the writer and thus won't
>    read and write at the same time.
> 3) If the caller calls it directly, but has not the GLOBAL_STATE_CODE()
>    macro, then we need to check the callers and see case-by-case if the
>    caller is in the main loop, if it needs to take the lock, or delegate
>    this duty to its caller (to reduce the places where to take it).
> 
> I used the vrc script (https://github.com/bonzini/vrc) to get help finding
> all the callers of a callback. Using its filter function, I can
> omit all functions protected by the added lock to avoid having duplicates
> when querying for new callbacks.

I was wondering: if a function is in category (3) and runs in an
iothread, but the function itself is not (currently) recursive, meaning
it doesn't really traverse the graph or call anything that traverses it,
should I add the rdlock anyway or not?

Example: bdrv_co_drain_end

Pros:
   + Covers the case where a new recursive callback for a new/existing
     BlockDriver is implemented in the future.
   + Also covers the case where I or someone else missed a recursive part.

Cons:
   - Potentially introducing an unnecessary critical section.

What do you think?

Emanuele
Stefan Hajnoczi April 28, 2022, 10:34 a.m. UTC | #2
On Tue, Apr 26, 2022 at 04:51:06AM -0400, Emanuele Giuseppe Esposito wrote:
> Next step is the most complex one: figure where to put the rdlocks.

I got a little lost at this step but will hopefully understand more when
reading the patches. Earlier it said there is a counter for readers, so
it seems trivial to make it a recursive lock. I don't understand why you
want to avoid recursive locking - do the complications you described stem
from the fact that you want to avoid recursive locks?

Secondly, can you describe the role of the lock? The text mentions the
old AioContext lock that is being replaced but doesn't define this new
lock in its own right. Is the idea to make ->parents and ->children list
accesses thread-safe and all loops "safe" so they don't break when a BDS
is removed?

Stefan
Stefan Hajnoczi April 28, 2022, 10:45 a.m. UTC | #3
On Wed, Apr 27, 2022 at 08:55:35AM +0200, Emanuele Giuseppe Esposito wrote:
> 
> 
> Am 26/04/2022 um 10:51 schrieb Emanuele Giuseppe Esposito:
> > Luckly, most of the cases where we recursively go through a graph are
> > the BlockDriverState callback functions in block_int-common.h
> > In order to understand what to protect, I categorized the callbacks in
> > block_int-common.h depending on the type of function that calls them:
> > 
> > 1) If the caller is a generated_co_wrapper, this function must be
> >    protected by rdlock. The reason is that generated_co_wrapper create
> >    coroutines that run in the given bs AioContext, so it doesn't matter
> >    if we are running in the main loop or not, the coroutine might run
> >    in an iothread.
> > 2) If the caller calls it directly, and has the GLOBAL_STATE_CODE() macro,
> >    then the function is safe. The main loop is the writer and thus won't
> >    read and write at the same time.
> > 3) If the caller calls it directly, but has not the GLOBAL_STATE_CODE()
> >    macro, then we need to check the callers and see case-by-case if the
> >    caller is in the main loop, if it needs to take the lock, or delegate
> >    this duty to its caller (to reduce the places where to take it).
> > 
> > I used the vrc script (https://github.com/bonzini/vrc) to get help finding
> > all the callers of a callback. Using its filter function, I can
> > omit all functions protected by the added lock to avoid having duplicates
> > when querying for new callbacks.
> 
> I was wondering, if a function is in category (3) and runs in an
> Iothread but the function itself is not (currently) recursive, meaning
> it doesn't really traverse the graph or calls someone that traverses it,
> should I add the rdlock anyways or not?
> 
> Example: bdrv_co_drain_end
> 
> Pros:
>    + Covers if in future a new recursive callback for a new/existing
>      BlockDriver is implemented.
>    + Covers also the case where I or someone missed the recursive part.
> 
> Cons:
>    - Potentially introducing an unnecessary critical section.
> 
> What do you think?

->bdrv_co_drain_end() is a callback function. Do you mean whether its
caller, bdrv_drain_invoke_entry(), should take the rdlock around
->bdrv_co_drain_end()?

Going up further in the call chain (and maybe switching threads),
bdrv_do_drained_end() has QLIST_FOREACH(child, &bs->children, next), so
it needs protection. If the caller of bdrv_do_drained_end() holds the
rdlock, then I think none of the child functions (including
->bdrv_co_drain_end()) need to take it explicitly.

Stefan
Emanuele Giuseppe Esposito April 28, 2022, 9:56 p.m. UTC | #4
Am 28/04/2022 um 12:45 schrieb Stefan Hajnoczi:
> On Wed, Apr 27, 2022 at 08:55:35AM +0200, Emanuele Giuseppe Esposito wrote:
>>
>>
>> Am 26/04/2022 um 10:51 schrieb Emanuele Giuseppe Esposito:
>>> Luckly, most of the cases where we recursively go through a graph are
>>> the BlockDriverState callback functions in block_int-common.h
>>> In order to understand what to protect, I categorized the callbacks in
>>> block_int-common.h depending on the type of function that calls them:
>>>
>>> 1) If the caller is a generated_co_wrapper, this function must be
>>>    protected by rdlock. The reason is that generated_co_wrapper create
>>>    coroutines that run in the given bs AioContext, so it doesn't matter
>>>    if we are running in the main loop or not, the coroutine might run
>>>    in an iothread.
>>> 2) If the caller calls it directly, and has the GLOBAL_STATE_CODE() macro,
>>>    then the function is safe. The main loop is the writer and thus won't
>>>    read and write at the same time.
>>> 3) If the caller calls it directly, but has not the GLOBAL_STATE_CODE()
>>>    macro, then we need to check the callers and see case-by-case if the
>>>    caller is in the main loop, if it needs to take the lock, or delegate
>>>    this duty to its caller (to reduce the places where to take it).
>>>
>>> I used the vrc script (https://github.com/bonzini/vrc) to get help finding
>>> all the callers of a callback. Using its filter function, I can
>>> omit all functions protected by the added lock to avoid having duplicates
>>> when querying for new callbacks.
>>
>> I was wondering, if a function is in category (3) and runs in an
>> Iothread but the function itself is not (currently) recursive, meaning
>> it doesn't really traverse the graph or calls someone that traverses it,
>> should I add the rdlock anyways or not?
>>
>> Example: bdrv_co_drain_end
>>
>> Pros:
>>    + Covers if in future a new recursive callback for a new/existing
>>      BlockDriver is implemented.
>>    + Covers also the case where I or someone missed the recursive part.
>>
>> Cons:
>>    - Potentially introducing an unnecessary critical section.
>>
>> What do you think?
> 
> ->bdrv_co_drain_end() is a callback function. Do you mean whether its
> caller, bdrv_drain_invoke_entry(), should take the rdlock around
> ->bdrv_co_drain_end()?

Yes. The problem is that the coroutine is created in bs AioContext, so
it might be in an iothread.

> 
> Going up further in the call chain (and maybe switching threads),
> bdrv_do_drained_end() has QLIST_FOREACH(child, &bs->children, next) so
> it needs protection. If the caller of bdrv_do_drained_end() holds then
> rdlock then I think none of the child functions (including
> ->bdrv_co_drain_end()) need to take it explicitly.

Regarding bdrv_do_drained_end and similar functions, they run either in
the main loop (or they will, if coming from a coroutine) or in the
iothread running the AioContext of the bs involved.

I think that most of the drains, except for mirror.c, come from the main
loop. I protected mirror.c in patch 8, even though right now I am not
really sure that what I did is necessary, since the bh will be scheduled
in the main loop.

Therefore we don't really need locks around drains.

Emanuele
> 
> Stefan
>
Emanuele Giuseppe Esposito April 29, 2022, 8:06 a.m. UTC | #5
Am 28/04/2022 um 12:34 schrieb Stefan Hajnoczi:
> On Tue, Apr 26, 2022 at 04:51:06AM -0400, Emanuele Giuseppe Esposito wrote:
>> Next step is the most complex one: figure where to put the rdlocks.
> 
> I got a little lost at this step but will hopefully understand more when
> reading the patches. Earlier it said there is a counter for readers, so
> it seems trivial to make it a recursive lock. I don't understand why you
> want to avoid a recursive locking - the complications you described stem
> from the fact that you want to avoid recursive locks?

I'll let Paolo further elaborate on this. I think it would be cleaner
if we at least try not to have recursive locks.

> 
> Secondly, can you describe the role of the lock? The text mentions the
> old AioContext lock that is being replaced but doesn't define this new
> lock in its own right. Is the idea to make ->parents and ->children list
> accesses thread-safe and all loops "safe" so they don't break when a BDS
> is removed?

Yes. Replace what AioContext lock is doing right now.

Emanuele
> 
> Stefan
>
Stefan Hajnoczi April 30, 2022, 5:17 a.m. UTC | #6
On Thu, Apr 28, 2022 at 11:56:09PM +0200, Emanuele Giuseppe Esposito wrote:
> 
> 
> Am 28/04/2022 um 12:45 schrieb Stefan Hajnoczi:
> > On Wed, Apr 27, 2022 at 08:55:35AM +0200, Emanuele Giuseppe Esposito wrote:
> >>
> >>
> >> Am 26/04/2022 um 10:51 schrieb Emanuele Giuseppe Esposito:
> >>> Luckly, most of the cases where we recursively go through a graph are
> >>> the BlockDriverState callback functions in block_int-common.h
> >>> In order to understand what to protect, I categorized the callbacks in
> >>> block_int-common.h depending on the type of function that calls them:
> >>>
> >>> 1) If the caller is a generated_co_wrapper, this function must be
> >>>    protected by rdlock. The reason is that generated_co_wrapper create
> >>>    coroutines that run in the given bs AioContext, so it doesn't matter
> >>>    if we are running in the main loop or not, the coroutine might run
> >>>    in an iothread.
> >>> 2) If the caller calls it directly, and has the GLOBAL_STATE_CODE() macro,
> >>>    then the function is safe. The main loop is the writer and thus won't
> >>>    read and write at the same time.
> >>> 3) If the caller calls it directly, but has not the GLOBAL_STATE_CODE()
> >>>    macro, then we need to check the callers and see case-by-case if the
> >>>    caller is in the main loop, if it needs to take the lock, or delegate
> >>>    this duty to its caller (to reduce the places where to take it).
> >>>
> >>> I used the vrc script (https://github.com/bonzini/vrc) to get help finding
> >>> all the callers of a callback. Using its filter function, I can
> >>> omit all functions protected by the added lock to avoid having duplicates
> >>> when querying for new callbacks.
> >>
> >> I was wondering, if a function is in category (3) and runs in an
> >> Iothread but the function itself is not (currently) recursive, meaning
> >> it doesn't really traverse the graph or calls someone that traverses it,
> >> should I add the rdlock anyways or not?
> >>
> >> Example: bdrv_co_drain_end
> >>
> >> Pros:
> >>    + Covers if in future a new recursive callback for a new/existing
> >>      BlockDriver is implemented.
> >>    + Covers also the case where I or someone missed the recursive part.
> >>
> >> Cons:
> >>    - Potentially introducing an unnecessary critical section.
> >>
> >> What do you think?
> > 
> > ->bdrv_co_drain_end() is a callback function. Do you mean whether its
> > caller, bdrv_drain_invoke_entry(), should take the rdlock around
> > ->bdrv_co_drain_end()?
> 
> Yes. The problem is that the coroutine is created in bs AioContext, so
> it might be in an iothread.
> 
> > 
> > Going up further in the call chain (and maybe switching threads),
> > bdrv_do_drained_end() has QLIST_FOREACH(child, &bs->children, next) so
> > it needs protection. If the caller of bdrv_do_drained_end() holds then
> > rdlock then I think none of the child functions (including
> > ->bdrv_co_drain_end()) need to take it explicitly.
> 
> Regarding bdrv_do_drained_end and similar, they are either running in
> the main loop (or they will be, if coming from a coroutine) or in the
> iothread running the AioContext of the bs involved.
> 
> I think that most of the drains except for mirror.c are coming from main
> loop. I protected mirror.c in patch 8, even though right now I am not
> really sure that what I did is necessary, since the bh will be scheduled
> in the main loop.
> 
> Therefore we don't really need locks around drains.

Are you saying rdlock isn't necessary in the main loop because nothing
can take the wrlock while our code is executing in the main loop?

Stefan
Emanuele Giuseppe Esposito May 2, 2022, 8:02 a.m. UTC | #7
Am 30/04/2022 um 07:17 schrieb Stefan Hajnoczi:
> On Thu, Apr 28, 2022 at 11:56:09PM +0200, Emanuele Giuseppe Esposito wrote:
>>
>>
>> Am 28/04/2022 um 12:45 schrieb Stefan Hajnoczi:
>>> On Wed, Apr 27, 2022 at 08:55:35AM +0200, Emanuele Giuseppe Esposito wrote:
>>>>
>>>>
>>>> Am 26/04/2022 um 10:51 schrieb Emanuele Giuseppe Esposito:
>>>>> Luckly, most of the cases where we recursively go through a graph are
>>>>> the BlockDriverState callback functions in block_int-common.h
>>>>> In order to understand what to protect, I categorized the callbacks in
>>>>> block_int-common.h depending on the type of function that calls them:
>>>>>
>>>>> 1) If the caller is a generated_co_wrapper, this function must be
>>>>>    protected by rdlock. The reason is that generated_co_wrapper create
>>>>>    coroutines that run in the given bs AioContext, so it doesn't matter
>>>>>    if we are running in the main loop or not, the coroutine might run
>>>>>    in an iothread.
>>>>> 2) If the caller calls it directly, and has the GLOBAL_STATE_CODE() macro,
>>>>>    then the function is safe. The main loop is the writer and thus won't
>>>>>    read and write at the same time.
>>>>> 3) If the caller calls it directly, but has not the GLOBAL_STATE_CODE()
>>>>>    macro, then we need to check the callers and see case-by-case if the
>>>>>    caller is in the main loop, if it needs to take the lock, or delegate
>>>>>    this duty to its caller (to reduce the places where to take it).
>>>>>
>>>>> I used the vrc script (https://github.com/bonzini/vrc) to get help finding
>>>>> all the callers of a callback. Using its filter function, I can
>>>>> omit all functions protected by the added lock to avoid having duplicates
>>>>> when querying for new callbacks.
>>>>
>>>> I was wondering, if a function is in category (3) and runs in an
>>>> Iothread but the function itself is not (currently) recursive, meaning
>>>> it doesn't really traverse the graph or calls someone that traverses it,
>>>> should I add the rdlock anyways or not?
>>>>
>>>> Example: bdrv_co_drain_end
>>>>
>>>> Pros:
>>>>    + Covers if in future a new recursive callback for a new/existing
>>>>      BlockDriver is implemented.
>>>>    + Covers also the case where I or someone missed the recursive part.
>>>>
>>>> Cons:
>>>>    - Potentially introducing an unnecessary critical section.
>>>>
>>>> What do you think?
>>>
>>> ->bdrv_co_drain_end() is a callback function. Do you mean whether its
>>> caller, bdrv_drain_invoke_entry(), should take the rdlock around
>>> ->bdrv_co_drain_end()?
>>
>> Yes. The problem is that the coroutine is created in bs AioContext, so
>> it might be in an iothread.
>>
>>>
>>> Going up further in the call chain (and maybe switching threads),
>>> bdrv_do_drained_end() has QLIST_FOREACH(child, &bs->children, next) so
>>> it needs protection. If the caller of bdrv_do_drained_end() holds then
>>> rdlock then I think none of the child functions (including
>>> ->bdrv_co_drain_end()) need to take it explicitly.
>>
>> Regarding bdrv_do_drained_end and similar, they are either running in
>> the main loop (or they will be, if coming from a coroutine) or in the
>> iothread running the AioContext of the bs involved.
>>
>> I think that most of the drains except for mirror.c are coming from main
>> loop. I protected mirror.c in patch 8, even though right now I am not
>> really sure that what I did is necessary, since the bh will be scheduled
>> in the main loop.
>>
>> Therefore we don't really need locks around drains.
> 
> Are you saying rdlock isn't necessary in the main loop because nothing
> can take the wrlock while our code is executing in the main loop?

Yes, that's the idea.
If I am not mistaken (and I hope I am not), only the main loop currently
modifies/is allowed to modify the graph.

The only case where currently we need to take the rdlock in main loop is
when we have the case

simplified_flush_callback(bs) {
	for (child in bs)
		bdrv_flush(child->bs);
}

some_function() {
	GLOBAL_STATE_CODE();
	/* assume bdrv_get_aio_context(bs) != qemu_in_main_thread() */

	bdrv_flush(bs);
		co = coroutine_create(bdrv_get_aio_context(bs))
		qemu_coroutine_enter(co, simplified_flush_callback)
}
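To make the case above concrete, here is a toy model (names invented, no
real coroutines) of the one pattern that does need the rdlock: the flush
coroutine that iterates ->children takes the rdlock itself, so the writer
sees a reader even though the call originated under GLOBAL_STATE_CODE()
in the main loop:

```c
#include <assert.h>
#include <stddef.h>

static int readers;   /* stand-in for the per-AioContext reader counters */

typedef struct Bs {
    struct Bs *children[2];   /* stand-in for bs->children */
    int flushed;
} Bs;

/* Runs in the bs AioContext (possibly an iothread), hence it is the one
 * that must bracket the ->children iteration with the rdlock. */
static void simplified_flush_callback(Bs *bs)
{
    readers++;                            /* graph_rdlock() */
    for (int i = 0; i < 2; i++) {
        if (bs->children[i]) {
            bs->children[i]->flushed = 1; /* stand-in for bdrv_flush(child) */
        }
    }
    readers--;                            /* graph_rdunlock() */
}
```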
> 
> Stefan
>
Paolo Bonzini May 2, 2022, 1:15 p.m. UTC | #8
On 5/2/22 10:02, Emanuele Giuseppe Esposito wrote:
>> Are you saying rdlock isn't necessary in the main loop because nothing
>> can take the wrlock while our code is executing in the main loop?
> Yes, that's the idea.
> If I am not mistaken (and I hope I am not), only the main loop currently
> modifies/is allowed to modify the graph.
> 
> The only case where currently we need to take the rdlock in main loop is
> when we have the case
> 
> simplified_flush_callback(bs) {
> 	for (child in bs)
> 		bdrv_flush(child->bs);
> }
> 
> some_function() {
> 	GLOBAL_STATE_CODE();
> 	/* assume bdrv_get_aio_context(bs) != qemu_in_main_thread() */
> 
> 	bdrv_flush(bs);
> 		co = coroutine_create(bdrv_get_aio_context(bs))
> 		qemu_coroutine_enter(co, simplified_flush_callback)
> }

This is correct, but it is very unsafe as long as bdrv_flush(bs) is
allowed to run both in coroutine context and outside it.  So we circle
back to the same issue that was there in the stackless coroutine
experiment, i.e. functions that can run both in coroutine context and
outside it.

This issue is fundamentally one of unclear invariants, and reminds me a 
lot of the problems with recursive mutexes.

Paolo
Kevin Wolf May 3, 2022, 8:24 a.m. UTC | #9
Am 02.05.2022 um 10:02 hat Emanuele Giuseppe Esposito geschrieben:
> 
> 
> Am 30/04/2022 um 07:17 schrieb Stefan Hajnoczi:
> > On Thu, Apr 28, 2022 at 11:56:09PM +0200, Emanuele Giuseppe Esposito wrote:
> >>
> >>
> >> Am 28/04/2022 um 12:45 schrieb Stefan Hajnoczi:
> >>> On Wed, Apr 27, 2022 at 08:55:35AM +0200, Emanuele Giuseppe Esposito wrote:
> >>>>
> >>>>
> >>>> Am 26/04/2022 um 10:51 schrieb Emanuele Giuseppe Esposito:
> >>>>> Luckly, most of the cases where we recursively go through a graph are
> >>>>> the BlockDriverState callback functions in block_int-common.h
> >>>>> In order to understand what to protect, I categorized the callbacks in
> >>>>> block_int-common.h depending on the type of function that calls them:
> >>>>>
> >>>>> 1) If the caller is a generated_co_wrapper, this function must be
> >>>>>    protected by rdlock. The reason is that generated_co_wrapper create
> >>>>>    coroutines that run in the given bs AioContext, so it doesn't matter
> >>>>>    if we are running in the main loop or not, the coroutine might run
> >>>>>    in an iothread.
> >>>>> 2) If the caller calls it directly, and has the GLOBAL_STATE_CODE() macro,
> >>>>>    then the function is safe. The main loop is the writer and thus won't
> >>>>>    read and write at the same time.
> >>>>> 3) If the caller calls it directly, but has not the GLOBAL_STATE_CODE()
> >>>>>    macro, then we need to check the callers and see case-by-case if the
> >>>>>    caller is in the main loop, if it needs to take the lock, or delegate
> >>>>>    this duty to its caller (to reduce the places where to take it).
> >>>>>
> >>>>> I used the vrc script (https://github.com/bonzini/vrc) to get help finding
> >>>>> all the callers of a callback. Using its filter function, I can
> >>>>> omit all functions protected by the added lock to avoid having duplicates
> >>>>> when querying for new callbacks.
> >>>>
> >>>> I was wondering, if a function is in category (3) and runs in an
> >>>> Iothread but the function itself is not (currently) recursive, meaning
> >>>> it doesn't really traverse the graph or calls someone that traverses it,
> >>>> should I add the rdlock anyways or not?
> >>>>
> >>>> Example: bdrv_co_drain_end
> >>>>
> >>>> Pros:
> >>>>    + Covers if in future a new recursive callback for a new/existing
> >>>>      BlockDriver is implemented.
> >>>>    + Covers also the case where I or someone missed the recursive part.
> >>>>
> >>>> Cons:
> >>>>    - Potentially introducing an unnecessary critical section.
> >>>>
> >>>> What do you think?
> >>>
> >>> ->bdrv_co_drain_end() is a callback function. Do you mean whether its
> >>> caller, bdrv_drain_invoke_entry(), should take the rdlock around
> >>> ->bdrv_co_drain_end()?
> >>
> >> Yes. The problem is that the coroutine is created in bs AioContext, so
> >> it might be in an iothread.
> >>
> >>>
> >>> Going up further in the call chain (and maybe switching threads),
> >>> bdrv_do_drained_end() has QLIST_FOREACH(child, &bs->children, next) so
> >>> it needs protection. If the caller of bdrv_do_drained_end() holds then
> >>> rdlock then I think none of the child functions (including
> >>> ->bdrv_co_drain_end()) need to take it explicitly.
> >>
> >> Regarding bdrv_do_drained_end and similar, they are either running in
> >> the main loop (or they will be, if coming from a coroutine) or in the
> >> iothread running the AioContext of the bs involved.
> >>
> >> I think that most of the drains except for mirror.c are coming from main
> >> loop. I protected mirror.c in patch 8, even though right now I am not
> >> really sure that what I did is necessary, since the bh will be scheduled
> >> in the main loop.
> >>
> >> Therefore we don't really need locks around drains.
> > 
> > Are you saying rdlock isn't necessary in the main loop because nothing
> > can take the wrlock while our code is executing in the main loop?
> 
> Yes, that's the idea.
> If I am not mistaken (and I hope I am not), only the main loop currently
> modifies/is allowed to modify the graph.

Aren't you completely ignoring coroutines in this? What would a
coroutine do that requires the graph not to change across a yield?

(It's not easily possible to protect this today, and I think this was
the source of some bugs in the past. But if we introduce some proper
locking, I would expect it to solve this.)

Kevin
Stefan Hajnoczi May 3, 2022, 11:04 a.m. UTC | #10
On Mon, May 02, 2022 at 10:02:14AM +0200, Emanuele Giuseppe Esposito wrote:
> 
> 
> Am 30/04/2022 um 07:17 schrieb Stefan Hajnoczi:
> > On Thu, Apr 28, 2022 at 11:56:09PM +0200, Emanuele Giuseppe Esposito wrote:
> >>
> >>
> >> Am 28/04/2022 um 12:45 schrieb Stefan Hajnoczi:
> >>> On Wed, Apr 27, 2022 at 08:55:35AM +0200, Emanuele Giuseppe Esposito wrote:
> >>>>
> >>>>
> >>>> Am 26/04/2022 um 10:51 schrieb Emanuele Giuseppe Esposito:
> >>>>> Luckly, most of the cases where we recursively go through a graph are
> >>>>> the BlockDriverState callback functions in block_int-common.h
> >>>>> In order to understand what to protect, I categorized the callbacks in
> >>>>> block_int-common.h depending on the type of function that calls them:
> >>>>>
> >>>>> 1) If the caller is a generated_co_wrapper, this function must be
> >>>>>    protected by rdlock. The reason is that generated_co_wrapper create
> >>>>>    coroutines that run in the given bs AioContext, so it doesn't matter
> >>>>>    if we are running in the main loop or not, the coroutine might run
> >>>>>    in an iothread.
> >>>>> 2) If the caller calls it directly, and has the GLOBAL_STATE_CODE() macro,
> >>>>>    then the function is safe. The main loop is the writer and thus won't
> >>>>>    read and write at the same time.
> >>>>> 3) If the caller calls it directly, but has not the GLOBAL_STATE_CODE()
> >>>>>    macro, then we need to check the callers and see case-by-case if the
> >>>>>    caller is in the main loop, if it needs to take the lock, or delegate
> >>>>>    this duty to its caller (to reduce the places where to take it).
> >>>>>
> >>>>> I used the vrc script (https://github.com/bonzini/vrc) to get help finding
> >>>>> all the callers of a callback. Using its filter function, I can
> >>>>> omit all functions protected by the added lock to avoid having duplicates
> >>>>> when querying for new callbacks.
> >>>>
> >>>> I was wondering, if a function is in category (3) and runs in an
> >>>> Iothread but the function itself is not (currently) recursive, meaning
> >>>> it doesn't really traverse the graph or calls someone that traverses it,
> >>>> should I add the rdlock anyways or not?
> >>>>
> >>>> Example: bdrv_co_drain_end
> >>>>
> >>>> Pros:
> >>>>    + Covers if in future a new recursive callback for a new/existing
> >>>>      BlockDriver is implemented.
> >>>>    + Covers also the case where I or someone missed the recursive part.
> >>>>
> >>>> Cons:
> >>>>    - Potentially introducing an unnecessary critical section.
> >>>>
> >>>> What do you think?
> >>>
> >>> ->bdrv_co_drain_end() is a callback function. Do you mean whether its
> >>> caller, bdrv_drain_invoke_entry(), should take the rdlock around
> >>> ->bdrv_co_drain_end()?
> >>
> >> Yes. The problem is that the coroutine is created in bs AioContext, so
> >> it might be in an iothread.
> >>
> >>>
> >>> Going up further in the call chain (and maybe switching threads),
> >>> bdrv_do_drained_end() has QLIST_FOREACH(child, &bs->children, next) so
> >>> it needs protection. If the caller of bdrv_do_drained_end() holds the
> >>> rdlock then I think none of the child functions (including
> >>> ->bdrv_co_drain_end()) need to take it explicitly.
> >>
> >> Regarding bdrv_do_drained_end and similar, they are either running in
> >> the main loop (or they will be, if coming from a coroutine) or in the
> >> iothread running the AioContext of the bs involved.
> >>
> >> I think that most of the drains except for mirror.c are coming from main
> >> loop. I protected mirror.c in patch 8, even though right now I am not
> >> really sure that what I did is necessary, since the bh will be scheduled
> >> in the main loop.
> >>
> >> Therefore we don't really need locks around drains.
> > 
> > Are you saying rdlock isn't necessary in the main loop because nothing
> > can take the wrlock while our code is executing in the main loop?
> 
> Yes, that's the idea.
> If I am not mistaken (and I hope I am not), only the main loop currently
> modifies/is allowed to modify the graph.
> 
> The only case where currently we need to take the rdlock in main loop is
> when we have the case
> 
> simplified_flush_callback(bs) {
> 	for (child in bs)
> 		bdrv_flush(child->bs);
> }
> 
> some_function() {
> 	GLOBAL_STATE_CODE();
> 	/* assume bdrv_get_aio_context(bs) != qemu_in_main_thread() */
> 
> 	bdrv_flush(bs);
> 		co = coroutine_create(bdrv_get_aio_context(bs))
> 		qemu_coroutine_enter(co, simplified_flush_callback)
> }

Why does the main loop need to take rdlock in this case? Only the
coroutine that runs in the non-main loop AioContext iterates over
children.

Also, there are a bunch of other functions in build/block/block-gen.c
that seem to follow the same pattern, so I'm not sure why bdrv_flush()
would be the only case?

In general I'm having a hard time following this patch series and the
discussion because you aren't giving a chain of reasoning. I can make
guesses myself but some of them will be incorrect. It's better to have a
full explanation so reviewers understand your perspective and the logic
behind it.

Stefan
Stefan Hajnoczi May 4, 2022, 1:39 p.m. UTC | #11
On Tue, Apr 26, 2022 at 04:51:06AM -0400, Emanuele Giuseppe Esposito wrote:
> This is a new attempt to replace the need to take the AioContext lock to
> protect graph modifications. In particular, we aim to remove
> (or better, substitute) the AioContext around bdrv_replace_child_noperm,
> since this function changes BlockDriverState's ->parents and ->children
> lists.
> 
> In the previous version, we decided to discard using subtree_drains to
> protect the nodes, for different reasons: for those unfamiliar with it,
> please see https://patchew.org/QEMU/20220301142113.163174-1-eesposit@redhat.com/

I reread the thread and it's unclear to me why drain is the wrong
mechanism for protecting graph modifications. We theorized a lot but
ultimately is this new mechanism sufficiently different from
bdrv_drained_begin()/end() to make it worth implementing?

Instead of invoking .drained_begin() callbacks to stop further I/O,
we're now queuing coroutines (without backpressure information that
whoever is spawning I/O needs so they can stop). The writer still waits
for in-flight I/O to finish, including I/O not associated with the bdrv
graph we wish to modify (because rdlock is per-AioContext and unrelated
to a specific graph). Is this really more lightweight than drain?

If I understand correctly, the original goal was to avoid the need to
hold the AioContext lock across bdrv_replace_child_noperm(). I would
focus on that and use drain for now.

Maybe I've missed an important point about why the new mechanism is
needed?

Stefan
Stefan Hajnoczi May 17, 2022, 10:59 a.m. UTC | #12
On Wed, May 04, 2022 at 02:39:05PM +0100, Stefan Hajnoczi wrote:
> On Tue, Apr 26, 2022 at 04:51:06AM -0400, Emanuele Giuseppe Esposito wrote:
> > This is a new attempt to replace the need to take the AioContext lock to
> > protect graph modifications. In particular, we aim to remove
> > (or better, substitute) the AioContext around bdrv_replace_child_noperm,
> > since this function changes BlockDriverState's ->parents and ->children
> > lists.
> > 
> > In the previous version, we decided to discard using subtree_drains to
> > protect the nodes, for different reasons: for those unfamiliar with it,
> > please see https://patchew.org/QEMU/20220301142113.163174-1-eesposit@redhat.com/
> 
> I reread the thread and it's unclear to me why drain is the wrong
> mechanism for protecting graph modifications. We theorized a lot but
> ultimately is this new mechanism sufficiently different from
> bdrv_drained_begin()/end() to make it worth implementing?
> 
> Instead of invoking .drained_begin() callbacks to stop further I/O,
> we're now queuing coroutines (without backpressure information that
> whoever is spawning I/O needs so they can stop). The writer still waits
> for in-flight I/O to finish, including I/O not associated with the bdrv
> graph we wish to modify (because rdlock is per-AioContext and unrelated
> to a specific graph). Is this really more lightweight than drain?
> 
> If I understand correctly, the original goal was to avoid the need to
> hold the AioContext lock across bdrv_replace_child_noperm(). I would
> focus on that and use drain for now.
> 
> Maybe I've missed an important point about why the new mechanism is
> needed?

Ping?

Stefan
Emanuele Giuseppe Esposito May 18, 2022, 12:28 p.m. UTC | #13
Am 17/05/2022 um 12:59 schrieb Stefan Hajnoczi:
> On Wed, May 04, 2022 at 02:39:05PM +0100, Stefan Hajnoczi wrote:
>> On Tue, Apr 26, 2022 at 04:51:06AM -0400, Emanuele Giuseppe Esposito wrote:
>>> This is a new attempt to replace the need to take the AioContext lock to
>>> protect graph modifications. In particular, we aim to remove
>>> (or better, substitute) the AioContext around bdrv_replace_child_noperm,
>>> since this function changes BlockDriverState's ->parents and ->children
>>> lists.
>>>
>>> In the previous version, we decided to discard using subtree_drains to
>>> protect the nodes, for different reasons: for those unfamiliar with it,
>>> please see https://patchew.org/QEMU/20220301142113.163174-1-eesposit@redhat.com/
>>
>> I reread the thread and it's unclear to me why drain is the wrong
>> mechanism for protecting graph modifications. We theorized a lot but
>> ultimately is this new mechanism sufficiently different from
>> bdrv_drained_begin()/end() to make it worth implementing?
>>
>> Instead of invoking .drained_begin() callbacks to stop further I/O,
>> we're now queuing coroutines (without backpressure information that
>> whoever is spawning I/O needs so they can stop). The writer still waits
>> for in-flight I/O to finish, including I/O not associated with the bdrv
>> graph we wish to modify (because rdlock is per-AioContext and unrelated
>> to a specific graph). Is this really more lightweight than drain?
>>
>> If I understand correctly, the original goal was to avoid the need to
>> hold the AioContext lock across bdrv_replace_child_noperm(). I would
>> focus on that and use drain for now.
>>
>> Maybe I've missed an important point about why the new mechanism is
>> needed?
> 
> Ping?

label: // read till the end to see why I wrote this here

I was hoping someone from the "No" party would answer your question,
because as you know we reached this same conclusion together.

We thought about dropping the drain for various reasons, the main one
(at least as far as I understood) being that we are not sure whether
something can still happen in between drain_begin/end, and it is a
little bit confusing to use the same mechanism to block I/O and to
protect the graph.

We then thought about implementing a rwlock. A rdlock would clarify what
we are protecting and who is using the lock. I sent a draft rwlock
implementation in this thread, but this also led to additional
problems.
The main problem was that this new lock would introduce nested event
loops, which together with such locking would just create deadlocks.
If readers are in coroutines and writers are not (because graph
operations are not running in coroutines), we have a lot of deadlocks.
If a writer has to take the lock, it must wait for all other readers to
finish, and it does so by internally calling AIO_WAIT_WHILE, creating a
nested event loop. We don't know what could execute when polling for
events; for example, another writer could be resumed.
Ideally, we want writers in coroutines too.

Additionally, many readers run in what we call "mixed" functions:
usually implemented automatically with the generated_co_wrapper tag,
they let a function (usually a bdrv callback) always run in a
coroutine, creating one if necessary. For example, bdrv_flush() makes
sure that bs->bdrv_co_flush() is always run in a coroutine.
Such mixed functions are used in other callbacks too, making it really
difficult to understand whether we are in a coroutine or not, and most
importantly making rwlock usage very difficult.

This led us to stepping back once more and trying to convert all
BlockDriverState callbacks into coroutines. This would greatly simplify
rwlock usage, because we could make the rwlock coroutine-friendly
(without any AIO_WAIT_WHILE, allowing a writer to wait for readers by
just yielding and queuing itself in coroutine queues).

The first step was then to convert all callbacks into coroutines, using
generated_co_wrapper (g_c_w).
A typical g_c_w is implemented in this way (much simplified):
	if (qemu_in_coroutine()) {
		callback();
	} else {
		co = qemu_coroutine_create(callback, &opaque);
		bdrv_coroutine_enter(bs, co);
		BDRV_POLL_WHILE(bs, coroutine_in_progress);
	}
Once all callbacks are implemented using g_c_w, we can start splitting
the two sides of the if statement, so that a coroutine is only created
when we are outside a bdrv callback.

However, we immediately found a problem while starting to convert the
first callbacks: the AioContext lock is taken around some non-coroutine
callbacks! For example, bs->bdrv_open() is always called with the
AioContext lock taken. In addition, callbacks like bdrv_open are
graph-modifying functions, which is probably why we take the
AioContext lock, and they do not like to run in coroutines.
Anyway, the real problem comes when we create a coroutine in such
places where the AioContext lock is taken and we have a graph-modifying
function.

bdrv_coroutine_enter() calls aio_co_enter(), which in turn first checks
whether the coroutine is entering a context different from the current
one (which is not the case for open) and whether we are already in a
coroutine (for sure not). Therefore it resorts to the following calls:
	aio_context_acquire(ctx);
	qemu_aio_coroutine_enter(ctx, co);
	aio_context_release(ctx);
This is clearly a problem, because we are taking the lock twice: once
from the original caller of the callback, and once here due to the
coroutine. This creates a lot of deadlock situations.

For example, all callers of bdrv_open() always take the AioContext lock.
Often it is taken very high in the call stack, but it's always taken.

Getting rid of the lock around qemu_aio_coroutine_enter() is difficult
too, because coroutines expect to have the lock taken. For example, if
we want to drain from a coroutine, bdrv_co_yield_to_drain releases the
lock for us.

The solution is to get rid of the AioContext lock that was originally
taken in the caller, but the problem is: can we get rid of the
AioContext lock at all? What is that lock protecting, being so high in
the call stack?
What is the alternative for its usage? We are getting rid of it and
replacing it with nothing.
We are going to drop the AioContext lock to allow coroutine callbacks,
but that leaves graph operations unprotected.

We might suppose that many callbacks are called under drain and in
GLOBAL_STATE, which should be enough, but from our experimentation in
the previous series we saw that currently not everything is under drain,
leaving some operations unprotected (remember assert_graph_writable
temporarily disabled, since drain coverage for bdrv_replace_child_noperm
was not 100%?).
Therefore we need to add more drains. But isn't drain what we decided to
drop at the beginning? Why isn't drain good?
goto label;

Thank you,
Emanuele
> 
> Stefan
>
Paolo Bonzini May 18, 2022, 12:43 p.m. UTC | #14
On 5/18/22 14:28, Emanuele Giuseppe Esposito wrote:
> For example, all callers of bdrv_open() always take the AioContext lock.
> Often it is taken very high in the call stack, but it's always taken.

I think it's actually not a problem of who takes the AioContext lock or 
where; the requirements are contradictory:

* IO_OR_GS_CODE() functions, when called from coroutine context, expect 
to be called with the AioContext lock taken (example: 
bdrv_co_yield_to_drain)

* to call these functions with the lock taken, the code has to run in 
the BDS's home iothread.  Attempts to do otherwise result in deadlocks 
(the main loop's AIO_WAIT_WHILEs expect progress from the iothread, 
which cannot happen without releasing the AioContext lock)

* running the code in the BDS's home iothread is not possible for 
GLOBAL_STATE_CODE() functions (unless the BDS home iothread is the main 
thread, but that cannot be guaranteed in general)

> We might suppose that many callbacks are called under drain and in
> GLOBAL_STATE, which should be enough, but from our experimentation in
> the previous series we saw that currently not everything is under drain,
> leaving some operations unprotected (remember assert_graph_writable
> temporarily disabled, since drain coverage for bdrv_replace_child_noperm
> was not 100%?).
> Therefore we need to add more drains. But isn't drain what we decided to
> drop at the beginning? Why isn't drain good?

To sum up the patch ordering deadlock that we have right now:

* in some cases, graph manipulations are protected by the AioContext lock

* eliminating the AioContext lock is needed to move callbacks to 
coroutine contexts (see above for the deadlock scenario)

* moving callbacks to coroutine context is needed by the graph rwlock 
implementation

On one hand, we cannot protect the graph across manipulations with a 
graph rwlock without removing the AioContext lock; on the other hand, 
the AioContext lock is what _right now_ protects the graph.

So I'd rather go back to Emanuele's draining approach.  It may not be 
beautiful, but it allows progress.  Once that is in place, we can remove 
the AioContext lock (which mostly protects virtio-blk/virtio-scsi code 
right now) and reevaluate our next steps.

Paolo
Stefan Hajnoczi May 18, 2022, 2:27 p.m. UTC | #15
On Wed, May 18, 2022 at 02:28:41PM +0200, Emanuele Giuseppe Esposito wrote:
> 
> 
> Am 17/05/2022 um 12:59 schrieb Stefan Hajnoczi:
> > On Wed, May 04, 2022 at 02:39:05PM +0100, Stefan Hajnoczi wrote:
> >> On Tue, Apr 26, 2022 at 04:51:06AM -0400, Emanuele Giuseppe Esposito wrote:
> >>> This is a new attempt to replace the need to take the AioContext lock to
> >>> protect graph modifications. In particular, we aim to remove
> >>> (or better, substitute) the AioContext around bdrv_replace_child_noperm,
> >>> since this function changes BlockDriverState's ->parents and ->children
> >>> lists.
> >>>
> >>> In the previous version, we decided to discard using subtree_drains to
> >>> protect the nodes, for different reasons: for those unfamiliar with it,
> >>> please see https://patchew.org/QEMU/20220301142113.163174-1-eesposit@redhat.com/
> >>
> >> I reread the thread and it's unclear to me why drain is the wrong
> >> mechanism for protecting graph modifications. We theorized a lot but
> >> ultimately is this new mechanism sufficiently different from
> >> bdrv_drained_begin()/end() to make it worth implementing?
> >>
> >> Instead of invoking .drained_begin() callbacks to stop further I/O,
> >> we're now queuing coroutines (without backpressure information that
> >> whoever is spawning I/O needs so they can stop). The writer still waits
> >> for in-flight I/O to finish, including I/O not associated with the bdrv
> >> graph we wish to modify (because rdlock is per-AioContext and unrelated
> >> to a specific graph). Is this really more lightweight than drain?
> >>
> >> If I understand correctly, the original goal was to avoid the need to
> >> hold the AioContext lock across bdrv_replace_child_noperm(). I would
> >> focus on that and use drain for now.
> >>
> >> Maybe I've missed an important point about why the new mechanism is
> >> needed?
> > 
> > Ping?
> 
> label: // read till the end to see why I wrote this here
> 
> I was hoping someone from the "No" party would answer to your question,
> because as you know we reached this same conclusion together.
> 
> We thought about dropping the drain for various reasons, the main one
> (at least as far as I understood) is that we are not sure if something
> can still happen in between drain_begin/end, and it is a little bit
> confusing to use the same mechanism to block I/O and protect the graph.

We had discussions about what could go wrong and there was a feeling
that maybe a separate mechanism is appropriate for graph modifications,
but there was no concrete reason why draining around graph modification
won't work.

If no one has a concrete reason then drain still seems like the most
promising approach to protecting graph modifications. The rwlock patch
wasn't sufficiently different from drain to have significant advantages
in my opinion.

> We then thought about implementing a rwlock. A rdlock would clarify what
> we are protecting and who is using the lock. I had a rwlock draft
> implementation sent in this thread, but this also lead to additional
> problems.
> Main problem was that this new lock would introduce nested event loops,
> that together with such locking would just create deadlocks.
> If readers are in coroutines and writers are not (because graph
> operations are not running in coroutines), we have a lot of deadlocks.
> If a writer has to take the lock, it must wait all other readers to
> finish. And it does it by internally calling AIO_WAIT_WHILE, creating
> nested event loop. We don't know what could execute when polling for
> events, and for example another writer could be resumed.
> Ideally, we want writers in coroutines too.

What is the deadlock? Do the readers depend on the writer somehow?

> Additionally, many readers are running in what we call "mixed"
> functions: usually implemented automatically with generated_co_wrapper
> tag, they let a function (usually bdrv callback) run always in a
> coroutine, creating one if necessary. For example, bdrv_flush() makes
> sure hat bs->bdrv_co_flush() is always run in a coroutine.
> Such mixed functions are used in other callbacks too, making it really
> difficult to understand if we are in a coroutine or not, and mostly
> important make rwlock usage very difficult.
> 
> Which lead us to stepping back once more and try to convert all
> BlockDriverState callbacks in coroutines. This would greatly simplify
> rwlock usage, because we could make the rwlock coroutine-frendly
> (without any AIO_WAIT_WHILE, allowing a writer to wait for readers by
> just yielding and queuing itself in coroutine queues).
> 
> First step was then to convert all callbacks in coroutines, using
> generated_coroutine_wrapper (g_c_w).
> A typical g_c_w is implemented in this way:
> 	if (qemu_in_coroutine()) {
> 		callback();
> 	} else { // much simplified
> 		co = qemu_coroutine_create(callback);
> 		bdrv_coroutine_enter(bs, co);
> 		BDRV_POLL_WHILE(bs, coroutine_in_progress);
> 	}
> Once all callbacks are implemented using g_c_w, we can start splitting
> the two sides of the if function to only create a coroutine when we are
> outside from a bdrv callback.
> 
> However, we immediately found a problem while starting to convert the
> first callbacks: the AioContext lock is taken around some non coroutine
> callbacks! For example, bs->bdrv_open() is always called with the
> AioContext lock taken. In addition, callbacks like bdrv_open are
> graph-modifying functions, which is probably why we are taking the
> Aiocontext lock, and they do not like to run in coroutines.
> Anyways, the real problem comes when we create a coroutine in such
> places where the AioContext lock is taken and we have a graph-modifying
> function.
> 
> bdrv_coroutine_enter() calls aio_co_enter(), which in turns first checks
>  if the coroutine is entering another context from the current (which is
> not the case for open) and if we are already in coroutine (for sure
> not). Therefore it resorts to the following calls;
> 	aio_context_acquire(ctx);
>         qemu_aio_coroutine_enter(ctx, co);
>         aio_context_release(ctx);
> Which is clearly a problem, because we are taking the lock twice: once
> from the original caller of the callback, and once here due to the
> coroutine. This creates a lot of deadlock situations.

aio_context_acquire() is a recursive lock. Where is the deadlock?

Stefan
Stefan Hajnoczi May 18, 2022, 2:57 p.m. UTC | #16
On Wed, May 18, 2022 at 02:43:50PM +0200, Paolo Bonzini wrote:
> On 5/18/22 14:28, Emanuele Giuseppe Esposito wrote:
> > For example, all callers of bdrv_open() always take the AioContext lock.
> > Often it is taken very high in the call stack, but it's always taken.
> 
> I think it's actually not a problem of who takes the AioContext lock or
> where; the requirements are contradictory:
> 
> * IO_OR_GS_CODE() functions, when called from coroutine context, expect to
> be called with the AioContext lock taken (example: bdrv_co_yield_to_drain)
> 
> * to call these functions with the lock taken, the code has to run in the
> BDS's home iothread.  Attempts to do otherwise results in deadlocks (the
> main loop's AIO_WAIT_WHILEs expect progress from the iothread, that cannot
> happen without releasing the aiocontext lock)
> 
> * running the code in the BDS's home iothread is not possible for
> GLOBAL_STATE_CODE() functions (unless the BDS home iothread is the main
> thread, but that cannot be guaranteed in general)
> 
> > We might suppose that many callbacks are called under drain and in
> > GLOBAL_STATE, which should be enough, but from our experimentation in
> > the previous series we saw that currently not everything is under drain,
> > leaving some operations unprotected (remember assert_graph_writable
> > temporarily disabled, since drain coverage for bdrv_replace_child_noperm
> > was not 100%?).
> > Therefore we need to add more drains. But isn't drain what we decided to
> > drop at the beginning? Why isn't drain good?
> 
> To sum up the patch ordering deadlock that we have right now:
> 
> * in some cases, graph manipulations are protected by the AioContext lock
> 
> * eliminating the AioContext lock is needed to move callbacks to coroutine
> contexts (see above for the deadlock scenario)
> 
> * moving callbacks to coroutine context is needed by the graph rwlock
> implementation
> 
> On one hand, we cannot protect the graph across manipulations with a graph
> rwlock without removing the AioContext lock; on the other hand, the
> AioContext lock is what _right now_ protects the graph.
> 
> So I'd rather go back to Emanuele's draining approach.  It may not be
> beautiful, but it allows progress.  Once that is in place, we can remove the
> AioContext lock (which mostly protects virtio-blk/virtio-scsi code right
> now) and reevaluate our next steps.

Me too, I don't think the rwlock was particularly nice either.

Stefan
Kevin Wolf May 18, 2022, 4:14 p.m. UTC | #17
Am 18.05.2022 um 14:43 hat Paolo Bonzini geschrieben:
> On 5/18/22 14:28, Emanuele Giuseppe Esposito wrote:
> > For example, all callers of bdrv_open() always take the AioContext lock.
> > Often it is taken very high in the call stack, but it's always taken.
> 
> I think it's actually not a problem of who takes the AioContext lock or
> where; the requirements are contradictory:
> 
> * IO_OR_GS_CODE() functions, when called from coroutine context, expect to
> be called with the AioContext lock taken (example: bdrv_co_yield_to_drain)
> 
> * to call these functions with the lock taken, the code has to run in the
> BDS's home iothread.  Attempts to do otherwise results in deadlocks (the
> main loop's AIO_WAIT_WHILEs expect progress from the iothread, that cannot
> happen without releasing the aiocontext lock)
> 
> * running the code in the BDS's home iothread is not possible for
> GLOBAL_STATE_CODE() functions (unless the BDS home iothread is the main
> thread, but that cannot be guaranteed in general)
> 
> > We might suppose that many callbacks are called under drain and in
> > GLOBAL_STATE, which should be enough, but from our experimentation in
> > the previous series we saw that currently not everything is under drain,
> > leaving some operations unprotected (remember assert_graph_writable
> > temporarily disabled, since drain coverage for bdrv_replace_child_noperm
> > was not 100%?).
> > Therefore we need to add more drains. But isn't drain what we decided to
> > drop at the beginning? Why isn't drain good?
> 
> To sum up the patch ordering deadlock that we have right now:
> 
> * in some cases, graph manipulations are protected by the AioContext lock
> 
> * eliminating the AioContext lock is needed to move callbacks to coroutine
> contexts (see above for the deadlock scenario)
> 
> * moving callbacks to coroutine context is needed by the graph rwlock
> implementation
> 
> On one hand, we cannot protect the graph across manipulations with a graph
> rwlock without removing the AioContext lock; on the other hand, the
> AioContext lock is what _right now_ protects the graph.
> 
> So I'd rather go back to Emanuele's draining approach.  It may not be
> beautiful, but it allows progress.  Once that is in place, we can remove the
> AioContext lock (which mostly protects virtio-blk/virtio-scsi code right
> now) and reevaluate our next steps.

If we want to use drain for locking, we need to make sure that drain
actually does the job correctly. I see two major problems with it:

The first one is that drain only covers I/O paths, but we need to
protect against _anything_ touching block nodes. This might mean a
massive audit and making sure that everything in QEMU that could
possibly touch a block node is integrated with drain.

I think Emanuele has argued before that because writes to the graph only
happen in the main thread and we believe that currently only I/O
requests are processed in iothreads, this is safe and we don't actually
need to audit everything.

This is true as long as the assumption holds true (how do we ensure that
nobody ever introduces non-I/O code touching a block node in an
iothread?) and as long as the graph writer never yields or polls. I
think the latter condition is violated today, a random example is that
adjusting drain counts in bdrv_replace_child_noperm() does poll. Without
cooperation from all relevant places, the effectively locked code
section ends right there, even if the drained section continues. Even if
we can fix this, verifying that the conditions are met everywhere seems
not trivial.

And that's exactly my second major concern: Even if we manage to
correctly implement things with drain, I don't see a way to meaningfully
review it. I just don't know how to verify with some confidence that
it's actually correct and covering everything that needs to be covered.

Drain is not really a lock. But if you use it as one, the best it can
provide is advisory locking (callers, inside and outside the block
layer, need to explicitly support drain instead of having the lock
applied somewhere in the block layer functions). And even if all
relevant pieces actually make use of it, it still has an awkward
interface for locking:

/* Similar to rdlock(), but doesn't wait for writers to finish. It is
 * the caller's responsibility to make sure that there are no writers. */
bdrv_inc_in_flight()

/* Similar to wrlock(). Waits for readers to finish. New readers are not
 * prevented from starting after it returns. Third parties are politely
 * asked not to touch the block node while it is drained. */
bdrv_drained_begin()

(I think the unlock counterparts actually behave as expected from a real
lock.)

Having an actual rdlock() (that waits for writers), and using static
analysis to verify that all relevant places use it (so that wrlock()
doesn't rely on politely asking third parties to leave the node alone)
gave me some confidence that we could verify the result.

I'm not sure at all how to achieve the same with the drain interface. In
theory, it's possible. But it complicates the conditions so much that
I'm pretty sure the review wouldn't only be very time-consuming, but
that I would also make mistakes during it, rendering it useless.

Maybe throwing some more static analysis on the code can help, not sure.
It's going to be a bit more complex than with the other approach,
though.

Kevin
Stefan Hajnoczi May 19, 2022, 11:27 a.m. UTC | #18
On Wed, May 18, 2022 at 06:14:17PM +0200, Kevin Wolf wrote:
> If we want to use drain for locking, we need to make sure that drain
> actually does the job correctly. I see two major problems with it:
> 
> The first one is that drain only covers I/O paths, but we need to
> protect against _anything_ touching block nodes. This might mean a
> massive audit and making sure that everything in QEMU that could
> possibly touch a block node is integrated with drain.

> I think Emanuele has argued before that because writes to the graph only
> happen in the main thread and we believe that currently only I/O
> requests are processed in iothreads, this is safe and we don't actually
> need to audit everything.

I'm interested in the non-I/O code path cases you're thinking about:

Block jobs receive callbacks during drain. They are safe.

Exports:
- The nbd export has code to deal with drain and looks safe.
- vhost-user-blk uses aio_set_fd_handler(is_external=true) for virtqueue
  kick fds but not for the vhost-user UNIX domain socket (not sure if
  that's a problem).
- FUSE uses aio_set_fd_handler(is_external=true) and looks safe.

The monitor runs with the BQL in the main loop and doesn't use
coroutines. It should be safe.

Anything else?

Stefan
Kevin Wolf May 19, 2022, 12:52 p.m. UTC | #19
Am 19.05.2022 um 13:27 hat Stefan Hajnoczi geschrieben:
> On Wed, May 18, 2022 at 06:14:17PM +0200, Kevin Wolf wrote:
> > If we want to use drain for locking, we need to make sure that drain
> > actually does the job correctly. I see two major problems with it:
> > 
> > The first one is that drain only covers I/O paths, but we need to
> > protect against _anything_ touching block nodes. This might mean a
> > massive audit and making sure that everything in QEMU that could
> > possibly touch a block node is integrated with drain.
> 
> > I think Emanuele has argued before that because writes to the graph only
> > happen in the main thread and we believe that currently only I/O
> > requests are processed in iothreads, this is safe and we don't actually
> > need to audit everything.
> 
> I'm interested in the non-I/O code path cases you're thinking about:
> 
> Block jobs receive callbacks during drain. They are safe.

We've had bugs in the callbacks before, so while they are probably as
safe as they can get when each user has to actively support drain, I'm
a bit less than 100% confident.

> Exports:
> - The nbd export has code to deal with drain and looks safe.
> - vhost-user-blk uses aio_set_fd_handler(is_external=true) for virtqueue
>   kick fds but not for the vhost-user UNIX domain socket (not sure if
>   that's a problem).
> - FUSE uses aio_set_fd_handler(is_external=true) and looks safe.
> 
> The monitor runs with the BQL in the main loop and doesn't use
> coroutines. It should be safe.

The monitor does use coroutines (though I think this doesn't make a
difference; the more important point is that this coroutine runs in
qemu_aio_context while executing commands) and is not safe with respect
to drain and we're seeing bugs coming from it:

https://lists.gnu.org/archive/html/qemu-block/2022-03/msg00582.html

The broader point here is that even running with the BQL in the main
loop doesn't help you at all if the graph writer you could interfere
with polls or is a coroutine that yields. You're only safe if _both_
sides run with the BQL in the main thread and neither polls nor yields.
This is the condition I explained under which Emanuele's reasoning holds
true.

So you can choose between verifying that the monitor actively respects
drain (it doesn't currently) or verifying that no graph writer can poll
or yield during its drain section so that holding the BQL is enough (not
true today and I'm not sure if it could be made true even in theory).

> Anything else?

How to figure this out is precisely my question to you. :-) Maybe
migration completion? Some random BHs? A guest enabling a virtio-blk
device so that the dataplane gets started and its backend is moved to a
different AioContext? I'm not sure if these cases are bad. Maybe they
aren't. But how do I tell without reading every code path?

Well, and the follow-up question: How do we make sure that the list of
"anything else" doesn't change over time when we rely on auditing every
item on it for the correctness of drain?

Kevin
Stefan Hajnoczi May 22, 2022, 3:06 p.m. UTC | #20
On Wed, May 18, 2022 at 06:14:17PM +0200, Kevin Wolf wrote:
> Am 18.05.2022 um 14:43 hat Paolo Bonzini geschrieben:
> > On 5/18/22 14:28, Emanuele Giuseppe Esposito wrote:
> > > For example, all callers of bdrv_open() always take the AioContext lock.
> > > Often it is taken very high in the call stack, but it's always taken.
> > 
> > I think it's actually not a problem of who takes the AioContext lock or
> > where; the requirements are contradictory:
> > 
> > * IO_OR_GS_CODE() functions, when called from coroutine context, expect to
> > be called with the AioContext lock taken (example: bdrv_co_yield_to_drain)
> > 
> > * to call these functions with the lock taken, the code has to run in the
> > BDS's home iothread.  Attempts to do otherwise results in deadlocks (the
> > main loop's AIO_WAIT_WHILEs expect progress from the iothread, that cannot
> > happen without releasing the aiocontext lock)
> > 
> > * running the code in the BDS's home iothread is not possible for
> > GLOBAL_STATE_CODE() functions (unless the BDS home iothread is the main
> > thread, but that cannot be guaranteed in general)
> > 
> > > We might suppose that many callbacks are called under drain and in
> > > GLOBAL_STATE, which should be enough, but from our experimentation in
> > > the previous series we saw that currently not everything is under drain,
> > > leaving some operations unprotected (remember assert_graph_writable
> > > temporarily disabled, since drain coverage for bdrv_replace_child_noperm
> > > was not 100%?).
> > > Therefore we need to add more drains. But isn't drain what we decided to
> > > drop at the beginning? Why isn't drain good?
> > 
> > To sum up the patch ordering deadlock that we have right now:
> > 
> > * in some cases, graph manipulations are protected by the AioContext lock
> > 
> > * eliminating the AioContext lock is needed to move callbacks to coroutine
> > contexts (see above for the deadlock scenario)
> > 
> > * moving callbacks to coroutine context is needed by the graph rwlock
> > implementation
> > 
> > On one hand, we cannot protect the graph across manipulations with a graph
> > rwlock without removing the AioContext lock; on the other hand, the
> > AioContext lock is what _right now_ protects the graph.
> > 
> > So I'd rather go back to Emanuele's draining approach.  It may not be
> > beautiful, but it allows progress.  Once that is in place, we can remove the
> > AioContext lock (which mostly protects virtio-blk/virtio-scsi code right
> > now) and reevaluate our next steps.
> 
> If we want to use drain for locking, we need to make sure that drain
> actually does the job correctly. I see two major problems with it:
> 
> The first one is that drain only covers I/O paths, but we need to
> protect against _anything_ touching block nodes. This might mean a
> massive audit and making sure that everything in QEMU that could
> possibly touch a block node is integrated with drain.
> 
> I think Emanuele has argued before that because writes to the graph only
> happen in the main thread and we believe that currently only I/O
> requests are processed in iothreads, this is safe and we don't actually
> need to audit everything.
> 
> This is true as long as the assumption holds true (how do we ensure that
> nobody ever introduces non-I/O code touching a block node in an
> iothread?) and as long as the graph writer never yields or polls. I
> think the latter condition is violated today, a random example is that
> adjusting drain counts in bdrv_replace_child_noperm() does poll. Without
> cooperation from all relevant places, the effectively locked code
> section ends right there, even if the drained section continues. Even if
> we can fix this, verifying that the conditions are met everywhere seems
> not trivial.
> 
> And that's exactly my second major concern: Even if we manage to
> correctly implement things with drain, I don't see a way to meaningfully
> review it. I just don't know how to verify with some confidence that
> it's actually correct and covering everything that needs to be covered.
> 
> Drain is not really a lock. But if you use it as one, the best it can
> provide is advisory locking (callers, inside and outside the block
> layer, need to explicitly support drain instead of having the lock
> applied somewhere in the block layer functions). And even if all
> relevant pieces actually make use of it, it still has an awkward
> interface for locking:
> 
> /* Similar to rdlock(), but doesn't wait for writers to finish. It is
>  * the callers responsibility to make sure that there are no writers. */
> bdrv_inc_in_flight()
> 
> /* Similar to wrlock(). Waits for readers to finish. New readers are not
>  * prevented from starting after it returns. Third parties are politely
>  * asked not to touch the block node while it is drained. */
> bdrv_drained_begin()
> 
> (I think the unlock counterparts actually behave as expected from a real
> lock.)
> 
> Having an actual rdlock() (that waits for writers), and using static
> analysis to verify that all relevant places use it (so that wrlock()
> doesn't rely on politely asking third parties to leave the node alone)
> gave me some confidence that we could verify the result.
> 
> I'm not sure at all how to achieve the same with the drain interface. In
> theory, it's possible. But it complicates the conditions so much that
> I'm pretty much sure that the review wouldn't only be very time
> consuming, but I would make mistakes during the review, rendering it
> useless.
> 
> Maybe throwing some more static analysis on the code can help, not sure.
> It's going to be a bit more complex than with the other approach,
> though.

Hi,
Sorry for the long email. I've included three topics that may help us discuss
draining and AioContext removal further.

The shortcomings of drain
-------------------------
I wanted to explore the logical argument that making graph modifications within
a drained section is correct:
- Graph modifications and BB/BDS lookup are Global State (GS).
- Graph traversal from a BB/BDS pointer is IO.
- Only one thread executes GS code at a time.
- IO is quiesced within a drained section.
- Therefore a drained section in GS code suspends graph traversal, other graph
  modifications, and BB/BDS lookup.
- Therefore it is safe to modify the graph from a GS drained section.

However, I hit on a problem that I think Emanuele and Paolo have already
pointed out: draining is GS & IO. This might have worked under the 1 IOThread
model but it does not make sense for multi-queue. It is possible to submit I/O
requests in drained sections. How can multiple threads be in drained sections
simultaneously and possibly submit further I/O requests in their drained
sections? Those sections wouldn't be "drained" in any useful sense of the word.

It is necessary to adjust how recursive drained sections work:
bdrv_drained_begin() must not return while there are deeper nested drained
sections.

This is allowed:

     Monitor command            Block job
     ---------------            ---------
  > bdrv_drained_begin()
           .                 > bdrv_drained_begin()
           .                 < bdrv_drained_begin()
           .                          .
           .                          .
           .                 > bdrv_drained_end()
           .                 < bdrv_drained_end()
  < bdrv_drained_begin()
           .
           .
  > bdrv_drained_end()
  < bdrv_drained_end()

This is not allowed:

     Monitor command            Block job
     ---------------            ---------
  > bdrv_drained_begin()
           .                 > bdrv_drained_begin()
           .                 < bdrv_drained_begin()
           .                          .
           .                          .
  < bdrv_drained_begin()              .
           .                          .
           .                 > bdrv_drained_end()
           .                 < bdrv_drained_end()
  > bdrv_drained_end()
  < bdrv_drained_end()

This restriction creates an ordering between bdrv_drained_begin() callers. In
this example the block job must not depend on the monitor command completing
first. Otherwise there would be a deadlock, just like when two threads wait
for each other while holding a mutex.

For sanity I think draining should be GS-only. IO code should not invoke
bdrv_drained_begin() to avoid ordering problems when multiple threads drain
simultaneously and have dependencies on each other.

block/mirror.c needs to be modified because it currently drains from IO when
mirroring is about to end.

With this change to draining I think the logical argument for correctness with
graph modifications holds.

Enforcing GS/IO separation at compile time
------------------------------------------
Right now GS/IO asserts check assumptions at runtime. The next step may be
using the type system to separate GS and IO APIs so it's impossible for IO code
to accidentally call GS code, for example.

  typedef struct {
      BlockDriverState *bs;
  } BlockDriverStateIOPtr;

  typedef struct {
      BlockDriverState *bs;
  } BlockDriverStateGSPtr;

Then APIs can be protected as follows:

  void bdrv_set_aio_context_ignore(BlockDriverStateGSPtr bs, ...);

A function that only has a BlockDriverStateIOPtr cannot call
bdrv_set_aio_context_ignore() by mistake since the compiler will complain that
the first argument must be a BlockDriverStateGSPtr.

Possible steps for AioContext removal
-------------------------------------
I also wanted to share my assumptions about multi-queue and AioContext removal.
Please let me know if anything seems wrong or questionable:

- IO code can execute in any thread that has an AioContext.
- Multiple threads may execute IO code at the same time.
- GS code only executes under the BQL.

For AioContext removal this means:

- bdrv_get_aio_context() becomes mostly meaningless since there is no need for
  a special "home" AioContext.
- bdrv_coroutine_enter() becomes mostly meaningless because there is no need to
  run a coroutine in the BDS's AioContext.
- aio_disable_external(bdrv_get_aio_context(bs)) no longer works because many
  threads/AioContexts may submit new I/O requests. BlockDevOps.drained_begin()
  may be used instead (e.g. to temporarily disable ioeventfds on a multi-queue
  virtio-blk device).
- AIO_WAIT_WHILE() simplifies to

    while ((cond)) {
        aio_poll(qemu_get_current_aio_context(), true);
        ...
    }

  and the distinction between home AioContext and non-home context is
  eliminated. AioContext unlocking is dropped.

Does this make sense? I haven't seen these things in recent patch series.

Stefan
Emanuele Giuseppe Esposito May 23, 2022, 8:48 a.m. UTC | #21
Am 22/05/2022 um 17:06 schrieb Stefan Hajnoczi:
> On Wed, May 18, 2022 at 06:14:17PM +0200, Kevin Wolf wrote:
>> Am 18.05.2022 um 14:43 hat Paolo Bonzini geschrieben:
>>> On 5/18/22 14:28, Emanuele Giuseppe Esposito wrote:
>>>> For example, all callers of bdrv_open() always take the AioContext lock.
>>>> Often it is taken very high in the call stack, but it's always taken.
>>>
>>> I think it's actually not a problem of who takes the AioContext lock or
>>> where; the requirements are contradictory:
>>>
>>> * IO_OR_GS_CODE() functions, when called from coroutine context, expect to
>>> be called with the AioContext lock taken (example: bdrv_co_yield_to_drain)
>>>
>>> * to call these functions with the lock taken, the code has to run in the
>>> BDS's home iothread.  Attempts to do otherwise results in deadlocks (the
>>> main loop's AIO_WAIT_WHILEs expect progress from the iothread, that cannot
>>> happen without releasing the aiocontext lock)
>>>
>>> * running the code in the BDS's home iothread is not possible for
>>> GLOBAL_STATE_CODE() functions (unless the BDS home iothread is the main
>>> thread, but that cannot be guaranteed in general)
>>>
>>>> We might suppose that many callbacks are called under drain and in
>>>> GLOBAL_STATE, which should be enough, but from our experimentation in
>>>> the previous series we saw that currently not everything is under drain,
>>>> leaving some operations unprotected (remember assert_graph_writable
>>>> temporarily disabled, since drain coverage for bdrv_replace_child_noperm
>>>> was not 100%?).
>>>> Therefore we need to add more drains. But isn't drain what we decided to
>>>> drop at the beginning? Why isn't drain good?
>>>
>>> To sum up the patch ordering deadlock that we have right now:
>>>
>>> * in some cases, graph manipulations are protected by the AioContext lock
>>>
>>> * eliminating the AioContext lock is needed to move callbacks to coroutine
>>> contexts (see above for the deadlock scenario)
>>>
>>> * moving callbacks to coroutine context is needed by the graph rwlock
>>> implementation
>>>
>>> On one hand, we cannot protect the graph across manipulations with a graph
>>> rwlock without removing the AioContext lock; on the other hand, the
>>> AioContext lock is what _right now_ protects the graph.
>>>
>>> So I'd rather go back to Emanuele's draining approach.  It may not be
>>> beautiful, but it allows progress.  Once that is in place, we can remove the
>>> AioContext lock (which mostly protects virtio-blk/virtio-scsi code right
>>> now) and reevaluate our next steps.
>>
>> If we want to use drain for locking, we need to make sure that drain
>> actually does the job correctly. I see two major problems with it:
>>
>> The first one is that drain only covers I/O paths, but we need to
>> protect against _anything_ touching block nodes. This might mean a
>> massive audit and making sure that everything in QEMU that could
>> possibly touch a block node is integrated with drain.
>>
>> I think Emanuele has argued before that because writes to the graph only
>> happen in the main thread and we believe that currently only I/O
>> requests are processed in iothreads, this is safe and we don't actually
>> need to audit everything.
>>
>> This is true as long as the assumption holds true (how do we ensure that
>> nobody ever introduces non-I/O code touching a block node in an
>> iothread?) and as long as the graph writer never yields or polls. I
>> think the latter condition is violated today, a random example is that
>> adjusting drain counts in bdrv_replace_child_noperm() does poll. Without
>> cooperation from all relevant places, the effectively locked code
>> section ends right there, even if the drained section continues. Even if
>> we can fix this, verifying that the conditions are met everywhere seems
>> not trivial.
>>
>> And that's exactly my second major concern: Even if we manage to
>> correctly implement things with drain, I don't see a way to meaningfully
>> review it. I just don't know how to verify with some confidence that
>> it's actually correct and covering everything that needs to be covered.
>>
>> Drain is not really a lock. But if you use it as one, the best it can
>> provide is advisory locking (callers, inside and outside the block
>> layer, need to explicitly support drain instead of having the lock
>> applied somewhere in the block layer functions). And even if all
>> relevant pieces actually make use of it, it still has an awkward
>> interface for locking:
>>
>> /* Similar to rdlock(), but doesn't wait for writers to finish. It is
>>  * the callers responsibility to make sure that there are no writers. */
>> bdrv_inc_in_flight()
>>
>> /* Similar to wrlock(). Waits for readers to finish. New readers are not
>>  * prevented from starting after it returns. Third parties are politely
>>  * asked not to touch the block node while it is drained. */
>> bdrv_drained_begin()
>>
>> (I think the unlock counterparts actually behave as expected from a real
>> lock.)
>>
>> Having an actual rdlock() (that waits for writers), and using static
>> analysis to verify that all relevant places use it (so that wrlock()
>> doesn't rely on politely asking third parties to leave the node alone)
>> gave me some confidence that we could verify the result.
>>
>> I'm not sure at all how to achieve the same with the drain interface. In
>> theory, it's possible. But it complicates the conditions so much that
>> I'm pretty much sure that the review wouldn't only be very time
>> consuming, but I would make mistakes during the review, rendering it
>> useless.
>>
>> Maybe throwing some more static analysis on the code can help, not sure.
>> It's going to be a bit more complex than with the other approach,
>> though.
> 
> Hi,
> Sorry for the long email. I've included three topics that may help us discuss
> draining and AioContext removal further.
> 
> The shortcomings of drain
> -------------------------
> I wanted to explore the logical argument that making graph modifications within
> a drained section is correct:
> - Graph modifications and BB/BDS lookup are Global State (GS).
> - Graph traversal from a BB/BDS pointer is IO.
> - Only one thread executes GS code at a time.
> - IO is quiesced within a drained section.
> - Therefore a drained section in GS code suspends graph traversal, other graph
>   modifications, and BB/BDS lookup.
> - Therefore it is safe to modify the graph from a GS drained section.
> 
> However, I hit on a problem that I think Emanuele and Paolo have already
> pointed out: draining is GS & IO. This might have worked under the 1 IOThread
> model but it does not make sense for multi-queue. It is possible to submit I/O
> requests in drained sections. How can multiple threads be in drained sections
> simultaneously and possibly submit further I/O requests in their drained
> sections? Those sections wouldn't be "drained" in any useful sense of the word.
> 
> It is necessary to adjust how recursive drained sections work:
> bdrv_drained_begin() must not return while there are deeper nested drained
> sections.
> 
> This is allowed:
> 
>      Monitor command            Block job
>      ---------------            ---------
>   > bdrv_drained_begin()
>            .                 > bdrv_drained_begin()
>            .                 < bdrv_drained_begin()
>            .                          .
>            .                          .
>            .                 > bdrv_drained_end()
>            .                 < bdrv_drained_end()
>   < bdrv_drained_begin()
>            .
>            .
>   > bdrv_drained_end()
>   < bdrv_drained_end()
> 
> This is not allowed:
> 
>      Monitor command            Block job
>      ---------------            ---------
>   > bdrv_drained_begin()
>            .                 > bdrv_drained_begin()
>            .                 < bdrv_drained_begin()
>            .                          .
>            .                          .
>   < bdrv_drained_begin()              .
>            .                          .
>            .                 > bdrv_drained_end()
>            .                 < bdrv_drained_end()
>   > bdrv_drained_end()
>   < bdrv_drained_end()
> 
> This restriction creates an ordering between bdrv_drained_begin() callers. In
> this example the block job must not depend on the monitor command completing
> first. Otherwise there would be a deadlock just like with two threads wait for
> each other while holding a mutex.
> 
> For sanity I think draining should be GS-only. IO code should not invoke
> bdrv_drained_begin() to avoid ordering problems when multiple threads drain
> simultaneously and have dependencies on each other.
> 
> block/mirror.c needs to be modified because it currently drains from IO when
> mirroring is about to end.

Yes, mirror seems to be the only clear place where draining is performed
from IO, namely in mirror_run. It's also a little bit weird because only
drained_begin is invoked from IO, while drained_end runs in the main loop.

If I understand correctly, the reason for that is that we want to
prevent bdrv_get_dirty_count from being modified (i.e. that additional
modifications come in) after we have just finished mirroring.

Not sure how to deal with this, though.

> 
> With this change to draining I think the logical argument for correctness with
> graph modifications holds.
> 
> Enforcing GS/IO separation at compile time
> ------------------------------------------
> Right now GS/IO asserts check assumptions at runtime. The next step may be
> using the type system to separate GS and IO APIs so it's impossible for IO code
> to accidentally call GS code, for example.
> 
>   typedef struct {
>       BlockDriverState *bs;
>   } BlockDriverStateIOPtr;
> 
>   typedef struct {
>       BlockDriverState *bs;
>   } BlockDriverStateGSPtr;
> 
> Then APIs can be protected as follows:
> 
>   void bdrv_set_aio_context_ignore(BlockDriverStateGSPtr bs, ...);
> 
> A function that only has a BlockDriverStateIOPtr cannot call
> bdrv_set_aio_context_ignore() by mistake since the compiler will complain that
> the first argument must be a BlockDriverStateGSPtr.

What about functions that do not take a BlockDriverState *, but for
example a BdrvChild? I would assume that we need to replicate the
wrapper types for those, too.

And what about GS & IO functions? Not only drain, but also all the
generated_co_wrapper functions should be GS & IO.

> 
> Possible steps for AioContext removal
> -------------------------------------
> I also wanted to share my assumptions about multi-queue and AioContext removal.
> Please let me know if anything seems wrong or questionable:
> 
> - IO code can execute in any thread that has an AioContext.
> - Multiple threads may execute a IO code at the same time.
> - GS code only execute under the BQL.
> 
> For AioContext removal this means:
> 
> - bdrv_get_aio_context() becomes mostly meaningless since there is no need for
>   a special "home" AioContext.

Makes sense

> - bdrv_coroutine_enter() becomes mostly meaningless because there is no need to
>   run a coroutine in the BDS's AioContext.

Why is there no need?

> - aio_disable_external(bdrv_get_aio_context(bs)) no longer works because many
>   threads/AioContexts may submit new I/O requests. BlockDevOps.drained_begin()
>   may be used instead (e.g. to temporarily disable ioeventfds on a multi-queue
>   virtio-blk device).

Ok

> - AIO_WAIT_WHILE() simplifies to
> 
>     while ((cond)) {
>         aio_poll(qemu_get_current_aio_context(), true);
>         ...
>     }
> 
>   and the distinction between home AioContext and non-home context is
>   eliminated. AioContext unlocking is dropped.

I guess this implies no coroutine in BDS's AioContext.

> 
> Does this make sense? I haven't seen these things in recent patch series.

You haven't seen them because there's no way to do it if we don't even
agree on what to replace the AioContext lock with.
Getting to that point would imply first having something (drain, rwlock,
whatever) in place alongside the AioContext lock, and then gradually
removing the lock. At least, that's how I understood it.

Emanuele
Kevin Wolf May 23, 2022, 1:02 p.m. UTC | #22
Am 22.05.2022 um 17:06 hat Stefan Hajnoczi geschrieben:
> Hi,
> Sorry for the long email. I've included three topics that may help us discuss
> draining and AioContext removal further.
> 
> The shortcomings of drain
> -------------------------
> I wanted to explore the logical argument that making graph modifications within
> a drained section is correct:
> - Graph modifications and BB/BDS lookup are Global State (GS).
> - Graph traversal from a BB/BDS pointer is IO.
> - Only one thread executes GS code at a time.
> - IO is quiesced within a drained section.

I think you're mixing two things here and calling both of them IO.

If a function is marked as IO, that means that it is safe to call from
I/O requests, which may be running in any iothread (they currently only
run in the home thread of the node's AioContext, but the function
annotations already promise that any thread is safe).

However, if a function is marked as IO, this doesn't necessarily mean
that it is always only called in the context of an I/O request. It can
be called by any other code, too.

So while drain does quiesce all I/O requests, this doesn't mean that
functions marked as IO won't run any more.

> - Therefore a drained section in GS code suspends graph traversal, other graph
>   modifications, and BB/BDS lookup.
> - Therefore it is safe to modify the graph from a GS drained section.

So unfortunately, I don't think this conclusion is correct.

In order to make your assumption true, we would have to require that all
callers of IO functions must have called bdrv_inc_in_flight(). We would
also have to find a way to enforce this, preferably at compile time or
with static analysis, or at least with runtime assertions, because it
would be very easy to get wrong.

> However, I hit on a problem that I think Emanuele and Paolo have already
> pointed out: draining is GS & IO. This might have worked under the 1 IOThread
> model but it does not make sense for multi-queue. It is possible to submit I/O
> requests in drained sections. How can multiple threads be in drained sections
> simultaneously and possibly submit further I/O requests in their drained
> sections? Those sections wouldn't be "drained" in any useful sense of the word.

Drain asks other users not to touch the block node. Currently this only
covers I/O requests, but the desire to use drain for locking means that
we need to extend it to pretty much any operation on the node, which
would include calling drain on that block node. So you should never have
two callers in a drain section at the same time; it would be a bug in
this model.

Of course, we know that currently drain is not respected by all relevant
callers (most importantly, the monitor). If you want to use drain as a
form of locking, you need to solve this first.

> It is necessary to adjust how recursive drained sections work:
> bdrv_drained_begin() must not return while there are deeper nested drained
> sections.
> 
> This is allowed:
> 
>      Monitor command            Block job
>      ---------------            ---------
>   > bdrv_drained_begin()
>            .                 > bdrv_drained_begin()
>            .                 < bdrv_drained_begin()
>            .                          .
>            .                          .
>            .                 > bdrv_drained_end()
>            .                 < bdrv_drained_end()
>   < bdrv_drained_begin()
>            .
>            .
>   > bdrv_drained_end()
>   < bdrv_drained_end()

Just to make sure I understand the scenario that you have in mind here:
The monitor command is not what causes the block job to do the draining
because this is inside bdrv_drained_begin(), so the block job just
randomly got a callback that caused it to do so (maybe completion)?

Before bdrv_drained_begin() returns, anything is still allowed to run,
so I agree that this is valid.

> This is not allowed:
> 
>      Monitor command            Block job
>      ---------------            ---------
>   > bdrv_drained_begin()
>            .                 > bdrv_drained_begin()
>            .                 < bdrv_drained_begin()
>            .                          .
>            .                          .
>   < bdrv_drained_begin()              .
>            .                          .
>            .                 > bdrv_drained_end()
>            .                 < bdrv_drained_end()
>   > bdrv_drained_end()
>   < bdrv_drained_end()
> 
> This restriction creates an ordering between bdrv_drained_begin() callers. In
> this example the block job must not depend on the monitor command completing
> first. Otherwise there would be a deadlock just like with two threads wait for
> each other while holding a mutex.

So essentially drain needs to increase bs->in_flight, so that the outer
drain has to wait for the inner drain section to end before its
bdrv_drained_begin() can return.

> For sanity I think draining should be GS-only. IO code should not invoke
> bdrv_drained_begin() to avoid ordering problems when multiple threads drain
> simultaneously and have dependencies on each other.
> 
> block/mirror.c needs to be modified because it currently drains from IO when
> mirroring is about to end.
> 
> With this change to draining I think the logical argument for correctness with
> graph modifications holds.

What is your suggestion how to modify mirror? It drains so that no new
requests can sneak in and source and target stay in sync while it
switches to the completion code running in the main AioContext. You
can't just drop this.

> Enforcing GS/IO separation at compile time
> ------------------------------------------
> Right now GS/IO asserts check assumptions at runtime. The next step may be
> using the type system to separate GS and IO APIs so it's impossible for IO code
> to accidentally call GS code, for example.
> 
>   typedef struct {
>       BlockDriverState *bs;
>   } BlockDriverStateIOPtr;
> 
>   typedef struct {
>       BlockDriverState *bs;
>   } BlockDriverStateGSPtr;
> 
> Then APIs can be protected as follows:
> 
>   void bdrv_set_aio_context_ignore(BlockDriverStateGSPtr bs, ...);
> 
> A function that only has a BlockDriverStateIOPtr cannot call
> bdrv_set_aio_context_ignore() by mistake since the compiler will complain that
> the first argument must be a BlockDriverStateGSPtr.

And then you have a set of functions that cast from GS to IO pointers,
but not the other way around? Or maybe getting an IO pointer would even
be coupled with automatically increasing bs->in_flight?

The other option that we were looking into for this was static analysis.
I had hacked up a small script to check consistency of these annotations
(without covering function pointers, though) to help me with the review
of Emanuele's patches that introduced them. If I understand correctly,
Paolo has scripts to do the same a little bit better.

As long as we can integrate such a script in 'make check', we wouldn't
necessarily have to have the churn in the C code to switch everything to
new wrapper types.

> Possible steps for AioContext removal
> -------------------------------------
> I also wanted to share my assumptions about multi-queue and AioContext removal.
> Please let me know if anything seems wrong or questionable:
> 
> - IO code can execute in any thread that has an AioContext.
> - Multiple threads may execute a IO code at the same time.
> - GS code only execute under the BQL.
> 
> For AioContext removal this means:
> 
> - bdrv_get_aio_context() becomes mostly meaningless since there is no need for
>   a special "home" AioContext.
> - bdrv_coroutine_enter() becomes mostly meaningless because there is no need to
>   run a coroutine in the BDS's AioContext.
> - aio_disable_external(bdrv_get_aio_context(bs)) no longer works because many
>   threads/AioContexts may submit new I/O requests. BlockDevOps.drained_begin()
>   may be used instead (e.g. to temporarily disable ioeventfds on a multi-queue
>   virtio-blk device).
> - AIO_WAIT_WHILE() simplifies to
> 
>     while ((cond)) {
>         aio_poll(qemu_get_current_aio_context(), true);
>         ...
>     }

Probably not exactly, because you still need aio_wait_kick() to wake up
the thread. We use AioWait.num_waiters != 0 to decide whether we need to
schedule a BH in the main thread (because doing so unconditionally for
every completed request would be very wasteful).

If you want to be able to run this loop from any thread instead of just
the main thread, you would have to store somewhere which thread to wake.

>   and the distinction between home AioContext and non-home context is
>   eliminated. AioContext unlocking is dropped.
> 
> Does this make sense? I haven't seen these things in recent patch series.

Intuitively I would agree with most. There may be details that I'm not
aware of at the moment. I'm not surprised that we haven't seen any such
things in recent patch series because there is an awful lot of
preparational work to be done before we can reach this final state.

Kevin
Stefan Hajnoczi May 23, 2022, 1:15 p.m. UTC | #23
On Mon, May 23, 2022 at 10:48:48AM +0200, Emanuele Giuseppe Esposito wrote:
> 
> 
> Am 22/05/2022 um 17:06 schrieb Stefan Hajnoczi:
> > On Wed, May 18, 2022 at 06:14:17PM +0200, Kevin Wolf wrote:
> >> Am 18.05.2022 um 14:43 hat Paolo Bonzini geschrieben:
> >>> On 5/18/22 14:28, Emanuele Giuseppe Esposito wrote:
> >>>> For example, all callers of bdrv_open() always take the AioContext lock.
> >>>> Often it is taken very high in the call stack, but it's always taken.
> >>>
> >>> I think it's actually not a problem of who takes the AioContext lock or
> >>> where; the requirements are contradictory:
> >>>
> >>> * IO_OR_GS_CODE() functions, when called from coroutine context, expect to
> >>> be called with the AioContext lock taken (example: bdrv_co_yield_to_drain)
> >>>
> >>> * to call these functions with the lock taken, the code has to run in the
> >>> BDS's home iothread.  Attempts to do otherwise result in deadlocks (the
> >>> main loop's AIO_WAIT_WHILEs expect progress from the iothread, that cannot
> >>> happen without releasing the aiocontext lock)
> >>>
> >>> * running the code in the BDS's home iothread is not possible for
> >>> GLOBAL_STATE_CODE() functions (unless the BDS home iothread is the main
> >>> thread, but that cannot be guaranteed in general)
> >>>
> >>>> We might suppose that many callbacks are called under drain and in
> >>>> GLOBAL_STATE, which should be enough, but from our experimentation in
> >>>> the previous series we saw that currently not everything is under drain,
> >>>> leaving some operations unprotected (remember assert_graph_writable
> >>>> temporarily disabled, since drain coverage for bdrv_replace_child_noperm
> >>>> was not 100%?).
> >>>> Therefore we need to add more drains. But isn't drain what we decided to
> >>>> drop at the beginning? Why isn't drain good?
> >>>
> >>> To sum up the patch ordering deadlock that we have right now:
> >>>
> >>> * in some cases, graph manipulations are protected by the AioContext lock
> >>>
> >>> * eliminating the AioContext lock is needed to move callbacks to coroutine
> >>> contexts (see above for the deadlock scenario)
> >>>
> >>> * moving callbacks to coroutine context is needed by the graph rwlock
> >>> implementation
> >>>
> >>> On one hand, we cannot protect the graph across manipulations with a graph
> >>> rwlock without removing the AioContext lock; on the other hand, the
> >>> AioContext lock is what _right now_ protects the graph.
> >>>
> >>> So I'd rather go back to Emanuele's draining approach.  It may not be
> >>> beautiful, but it allows progress.  Once that is in place, we can remove the
> >>> AioContext lock (which mostly protects virtio-blk/virtio-scsi code right
> >>> now) and reevaluate our next steps.
> >>
> >> If we want to use drain for locking, we need to make sure that drain
> >> actually does the job correctly. I see two major problems with it:
> >>
> >> The first one is that drain only covers I/O paths, but we need to
> >> protect against _anything_ touching block nodes. This might mean a
> >> massive audit and making sure that everything in QEMU that could
> >> possibly touch a block node is integrated with drain.
> >>
> >> I think Emanuele has argued before that because writes to the graph only
> >> happen in the main thread and we believe that currently only I/O
> >> requests are processed in iothreads, this is safe and we don't actually
> >> need to audit everything.
> >>
> >> This is true as long as the assumption holds true (how do we ensure that
> >> nobody ever introduces non-I/O code touching a block node in an
> >> iothread?) and as long as the graph writer never yields or polls. I
> >> think the latter condition is violated today, a random example is that
> >> adjusting drain counts in bdrv_replace_child_noperm() does poll. Without
> >> cooperation from all relevant places, the effectively locked code
> >> section ends right there, even if the drained section continues. Even if
> >> we can fix this, verifying that the conditions are met everywhere seems
> >> not trivial.
> >>
> >> And that's exactly my second major concern: Even if we manage to
> >> correctly implement things with drain, I don't see a way to meaningfully
> >> review it. I just don't know how to verify with some confidence that
> >> it's actually correct and covering everything that needs to be covered.
> >>
> >> Drain is not really a lock. But if you use it as one, the best it can
> >> provide is advisory locking (callers, inside and outside the block
> >> layer, need to explicitly support drain instead of having the lock
> >> applied somewhere in the block layer functions). And even if all
> >> relevant pieces actually make use of it, it still has an awkward
> >> interface for locking:
> >>
> >> /* Similar to rdlock(), but doesn't wait for writers to finish. It is
> >>  * the caller's responsibility to make sure that there are no writers. */
> >> bdrv_inc_in_flight()
> >>
> >> /* Similar to wrlock(). Waits for readers to finish. New readers are not
> >>  * prevented from starting after it returns. Third parties are politely
> >>  * asked not to touch the block node while it is drained. */
> >> bdrv_drained_begin()
> >>
> >> (I think the unlock counterparts actually behave as expected from a real
> >> lock.)
> >>
> >> Having an actual rdlock() (that waits for writers), and using static
> >> analysis to verify that all relevant places use it (so that wrlock()
> >> doesn't rely on politely asking third parties to leave the node alone)
> >> gave me some confidence that we could verify the result.
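For comparison, a toy standalone version of such a rdlock()/wrlock() pair could look like this. The names are hypothetical, and the lock proposed in this series additionally keeps one reader counter per AioContext to reduce cacheline bouncing, which this sketch omits for brevity:

```c
#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    int readers;     /* active readers (graph traversals) */
    int has_writer;  /* a writer (graph modification) is active */
} ToyGraphLock;

static ToyGraphLock graph_lock = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0
};

/* Readers wait for the writer instead of being politely asked to stay
 * away -- the key difference from using drain as a lock. */
static void toy_graph_rdlock(ToyGraphLock *gl)
{
    pthread_mutex_lock(&gl->lock);
    while (gl->has_writer) {
        pthread_cond_wait(&gl->cond, &gl->lock);
    }
    gl->readers++;
    pthread_mutex_unlock(&gl->lock);
}

static void toy_graph_rdunlock(ToyGraphLock *gl)
{
    pthread_mutex_lock(&gl->lock);
    if (--gl->readers == 0) {
        pthread_cond_broadcast(&gl->cond);
    }
    pthread_mutex_unlock(&gl->lock);
}

/* The writer excludes other writers, then waits for readers to finish;
 * new readers block in rdlock() as soon as has_writer is set. */
static void toy_graph_wrlock(ToyGraphLock *gl)
{
    pthread_mutex_lock(&gl->lock);
    while (gl->has_writer) {
        pthread_cond_wait(&gl->cond, &gl->lock);
    }
    gl->has_writer = 1;
    while (gl->readers > 0) {
        pthread_cond_wait(&gl->cond, &gl->lock);
    }
    pthread_mutex_unlock(&gl->lock);
}

static void toy_graph_wrunlock(ToyGraphLock *gl)
{
    pthread_mutex_lock(&gl->lock);
    gl->has_writer = 0;
    pthread_cond_broadcast(&gl->cond);
    pthread_mutex_unlock(&gl->lock);
}
```

Unlike bdrv_inc_in_flight()/bdrv_drained_begin(), both sides here actually wait for each other, so a static checker only has to verify that every graph access is bracketed by one of the two lock calls.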
> >>
> >> I'm not sure at all how to achieve the same with the drain interface. In
> >> theory, it's possible. But it complicates the conditions so much that
> >> I'm pretty much sure that the review wouldn't only be very time
> >> consuming, but I would make mistakes during the review, rendering it
> >> useless.
> >>
> >> Maybe throwing some more static analysis on the code can help, not sure.
> >> It's going to be a bit more complex than with the other approach,
> >> though.
> > 
> > Hi,
> > Sorry for the long email. I've included three topics that may help us discuss
> > draining and AioContext removal further.
> > 
> > The shortcomings of drain
> > -------------------------
> > I wanted to explore the logical argument that making graph modifications within
> > a drained section is correct:
> > - Graph modifications and BB/BDS lookup are Global State (GS).
> > - Graph traversal from a BB/BDS pointer is IO.
> > - Only one thread executes GS code at a time.
> > - IO is quiesced within a drained section.
> > - Therefore a drained section in GS code suspends graph traversal, other graph
> >   modifications, and BB/BDS lookup.
> > - Therefore it is safe to modify the graph from a GS drained section.
> > 
> > However, I hit on a problem that I think Emanuele and Paolo have already
> > pointed out: draining is GS & IO. This might have worked under the 1 IOThread
> > model but it does not make sense for multi-queue. It is possible to submit I/O
> > requests in drained sections. How can multiple threads be in drained sections
> > simultaneously and possibly submit further I/O requests in their drained
> > sections? Those sections wouldn't be "drained" in any useful sense of the word.
> > 
> > It is necessary to adjust how recursive drained sections work:
> > bdrv_drained_begin() must not return while there are deeper nested drained
> > sections.
> > 
> > This is allowed:
> > 
> >      Monitor command            Block job
> >      ---------------            ---------
> >   > bdrv_drained_begin()
> >            .                 > bdrv_drained_begin()
> >            .                 < bdrv_drained_begin()
> >            .                          .
> >            .                          .
> >            .                 > bdrv_drained_end()
> >            .                 < bdrv_drained_end()
> >   < bdrv_drained_begin()
> >            .
> >            .
> >   > bdrv_drained_end()
> >   < bdrv_drained_end()
> > 
> > This is not allowed:
> > 
> >      Monitor command            Block job
> >      ---------------            ---------
> >   > bdrv_drained_begin()
> >            .                 > bdrv_drained_begin()
> >            .                 < bdrv_drained_begin()
> >            .                          .
> >            .                          .
> >   < bdrv_drained_begin()              .
> >            .                          .
> >            .                 > bdrv_drained_end()
> >            .                 < bdrv_drained_end()
> >   > bdrv_drained_end()
> >   < bdrv_drained_end()
> > 
> > This restriction creates an ordering between bdrv_drained_begin() callers. In
> > this example the block job must not depend on the monitor command completing
> > first. Otherwise there would be a deadlock, just like when two threads wait
> > for each other while each holds a mutex.
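A toy model of these adjusted semantics, with hypothetical toy_* names and pthreads standing in for coroutines. It only covers the non-nested case: here a second drained section simply cannot start until the first one has ended, which is what makes the ordering restriction bite:

```c
#include <pthread.h>

static pthread_mutex_t drain_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t drain_cond = PTHREAD_COND_INITIALIZER;
static int drain_sections; /* currently open drained sections */
static int job_done;

/* Does not return while another drained section is still open. */
static void toy_drained_begin(void)
{
    pthread_mutex_lock(&drain_mutex);
    while (drain_sections > 0) {
        pthread_cond_wait(&drain_cond, &drain_mutex);
    }
    drain_sections++;
    pthread_mutex_unlock(&drain_mutex);
}

static void toy_drained_end(void)
{
    pthread_mutex_lock(&drain_mutex);
    if (--drain_sections == 0) {
        pthread_cond_broadcast(&drain_cond);
    }
    pthread_mutex_unlock(&drain_mutex);
}

/* A "block job" that needs its own drained section. */
static void *toy_job(void *arg)
{
    (void)arg;
    toy_drained_begin(); /* blocks while the monitor section is open */
    job_done = 1;        /* graph work would go here */
    toy_drained_end();
    return NULL;
}
```

With these semantics two callers are never inside a drained section at the same time, so a drained caller that waits on another caller's section ending deadlocks immediately instead of silently racing.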
> > 
> > For sanity I think draining should be GS-only. IO code should not invoke
> > bdrv_drained_begin() to avoid ordering problems when multiple threads drain
> > simultaneously and have dependencies on each other.
> > 
> > block/mirror.c needs to be modified because it currently drains from IO when
> > mirroring is about to end.
> 
> Yes, mirror seems to be the only clear place where draining is performed
> from IO, namely in mirror_run. It's also a little bit weird because only
> drained_begin is invoked from IO, while drained_end is in the main loop.
> 
> If I understand correctly, the reason for that is that we want to
> prevent bdrv_get_dirty_count from being modified (i.e. that additional
> modifications arrive) after we have just finished mirroring.
> 
> Not sure how to deal with this though.

I don't know the code so I don't have a concrete solution in mind.

> > 
> > With this change to draining I think the logical argument for correctness with
> > graph modifications holds.
> > 
> > Enforcing GS/IO separation at compile time
> > ------------------------------------------
> > Right now GS/IO asserts check assumptions at runtime. The next step may be
> > using the type system to separate GS and IO APIs so it's impossible for IO code
> > to accidentally call GS code, for example.
> > 
> >   typedef struct {
> >       BlockDriverState *bs;
> >   } BlockDriverStateIOPtr;
> > 
> >   typedef struct {
> >       BlockDriverState *bs;
> >   } BlockDriverStateGSPtr;
> > 
> > Then APIs can be protected as follows:
> > 
> >   void bdrv_set_aio_context_ignore(BlockDriverStateGSPtr bs, ...);
> > 
> > A function that only has a BlockDriverStateIOPtr cannot call
> > bdrv_set_aio_context_ignore() by mistake since the compiler will complain that
> > the first argument must be a BlockDriverStateGSPtr.
> 
> What about functions that do not need a BlockDriverState * and, for
> example, use a BdrvChild instead? I assume we would need to replicate
> this for that type as well.
> 
> And what about GS & IO functions? Not only drain, but also all the
> generated_co_wrapper should be GS & IO.

What is the meaning of GS & IO in a multi-queue world? I guess it's
effectively IO except it's well-behaved if called with the BQL held?

> 
> > 
> > Possible steps for AioContext removal
> > -------------------------------------
> > I also wanted to share my assumptions about multi-queue and AioContext removal.
> > Please let me know if anything seems wrong or questionable:
> > 
> > - IO code can execute in any thread that has an AioContext.
> > - Multiple threads may execute IO code at the same time.
> > - GS code only executes under the BQL.
> > 
> > For AioContext removal this means:
> > 
> > - bdrv_get_aio_context() becomes mostly meaningless since there is no need for
> >   a special "home" AioContext.
> 
> Makes sense
> 
> > - bdrv_coroutine_enter() becomes mostly meaningless because there is no need to
> >   run a coroutine in the BDS's AioContext.
> 
> Why is there no need?

The coroutine can execute in the current thread since the BDS must be
thread-safe.

If bdrv_coroutine_enter() is used in cases like block jobs to mean "run
in the same AioContext as the job coroutine" then the code should
probably be changed to explicitly use the job AioContext instead of the
more vague BDS AioContext.

> > - aio_disable_external(bdrv_get_aio_context(bs)) no longer works because many
> >   threads/AioContexts may submit new I/O requests. BlockDevOps.drained_begin()
> >   may be used instead (e.g. to temporarily disable ioeventfds on a multi-queue
> >   virtio-blk device).
> 
> Ok
> 
> > - AIO_WAIT_WHILE() simplifies to
> > 
> >     while ((cond)) {
> >         aio_poll(qemu_get_current_aio_context(), true);
> >         ...
> >     }
> > 
> >   and the distinction between home AioContext and non-home context is
> >   eliminated. AioContext unlocking is dropped.
> 
> I guess this implies no coroutine in BDS's AioContext.

I'm not sure what you mean but coroutines accessing the BDS can run in
any AioContext with multi-queue. If they have any locking requirements
amongst each other then that should be explicit instead of just throwing
them all into the BDS AioContext.

> > 
> > Does this make sense? I haven't seen these things in recent patch series.
> 
> You haven't seen them because there's no way to do it if we don't even
> agree on what to replace the AioContext lock with.
> Getting to this point would imply first having something (drain,
> rwlock, whatever) in place together with AioContext, and then gradually
> removing it. At least, that's how I understood it.

What, besides graph modification and the things I listed in this email,
still needs to be untangled from AioContext?

Stefan
Emanuele Giuseppe Esposito May 23, 2022, 1:54 p.m. UTC | #24
Am 23/05/2022 um 15:15 schrieb Stefan Hajnoczi:
> On Mon, May 23, 2022 at 10:48:48AM +0200, Emanuele Giuseppe Esposito wrote:
>>
>>
>> Am 22/05/2022 um 17:06 schrieb Stefan Hajnoczi:
>>> On Wed, May 18, 2022 at 06:14:17PM +0200, Kevin Wolf wrote:
>>>> Am 18.05.2022 um 14:43 hat Paolo Bonzini geschrieben:
>>>>> On 5/18/22 14:28, Emanuele Giuseppe Esposito wrote:
>>>>>> For example, all callers of bdrv_open() always take the AioContext lock.
>>>>>> Often it is taken very high in the call stack, but it's always taken.
>>>>>
>>>>> I think it's actually not a problem of who takes the AioContext lock or
>>>>> where; the requirements are contradictory:
>>>>>
>>>>> * IO_OR_GS_CODE() functions, when called from coroutine context, expect to
>>>>> be called with the AioContext lock taken (example: bdrv_co_yield_to_drain)
>>>>>
>>>>> * to call these functions with the lock taken, the code has to run in the
>>>>> BDS's home iothread.  Attempts to do otherwise result in deadlocks (the
>>>>> main loop's AIO_WAIT_WHILEs expect progress from the iothread, that cannot
>>>>> happen without releasing the aiocontext lock)
>>>>>
>>>>> * running the code in the BDS's home iothread is not possible for
>>>>> GLOBAL_STATE_CODE() functions (unless the BDS home iothread is the main
>>>>> thread, but that cannot be guaranteed in general)
>>>>>
>>>>>> We might suppose that many callbacks are called under drain and in
>>>>>> GLOBAL_STATE, which should be enough, but from our experimentation in
>>>>>> the previous series we saw that currently not everything is under drain,
>>>>>> leaving some operations unprotected (remember assert_graph_writable
>>>>>> temporarily disabled, since drain coverage for bdrv_replace_child_noperm
>>>>>> was not 100%?).
>>>>>> Therefore we need to add more drains. But isn't drain what we decided to
>>>>>> drop at the beginning? Why isn't drain good?
>>>>>
>>>>> To sum up the patch ordering deadlock that we have right now:
>>>>>
>>>>> * in some cases, graph manipulations are protected by the AioContext lock
>>>>>
>>>>> * eliminating the AioContext lock is needed to move callbacks to coroutine
>>>>> contexts (see above for the deadlock scenario)
>>>>>
>>>>> * moving callbacks to coroutine context is needed by the graph rwlock
>>>>> implementation
>>>>>
>>>>> On one hand, we cannot protect the graph across manipulations with a graph
>>>>> rwlock without removing the AioContext lock; on the other hand, the
>>>>> AioContext lock is what _right now_ protects the graph.
>>>>>
>>>>> So I'd rather go back to Emanuele's draining approach.  It may not be
>>>>> beautiful, but it allows progress.  Once that is in place, we can remove the
>>>>> AioContext lock (which mostly protects virtio-blk/virtio-scsi code right
>>>>> now) and reevaluate our next steps.
>>>>
>>>> If we want to use drain for locking, we need to make sure that drain
>>>> actually does the job correctly. I see two major problems with it:
>>>>
>>>> The first one is that drain only covers I/O paths, but we need to
>>>> protect against _anything_ touching block nodes. This might mean a
>>>> massive audit and making sure that everything in QEMU that could
>>>> possibly touch a block node is integrated with drain.
>>>>
>>>> I think Emanuele has argued before that because writes to the graph only
>>>> happen in the main thread and we believe that currently only I/O
>>>> requests are processed in iothreads, this is safe and we don't actually
>>>> need to audit everything.
>>>>
>>>> This is true as long as the assumption holds true (how do we ensure that
>>>> nobody ever introduces non-I/O code touching a block node in an
>>>> iothread?) and as long as the graph writer never yields or polls. I
>>>> think the latter condition is violated today, a random example is that
>>>> adjusting drain counts in bdrv_replace_child_noperm() does poll. Without
>>>> cooperation from all relevant places, the effectively locked code
>>>> section ends right there, even if the drained section continues. Even if
>>>> we can fix this, verifying that the conditions are met everywhere seems
>>>> not trivial.
>>>>
>>>> And that's exactly my second major concern: Even if we manage to
>>>> correctly implement things with drain, I don't see a way to meaningfully
>>>> review it. I just don't know how to verify with some confidence that
>>>> it's actually correct and covering everything that needs to be covered.
>>>>
>>>> Drain is not really a lock. But if you use it as one, the best it can
>>>> provide is advisory locking (callers, inside and outside the block
>>>> layer, need to explicitly support drain instead of having the lock
>>>> applied somewhere in the block layer functions). And even if all
>>>> relevant pieces actually make use of it, it still has an awkward
>>>> interface for locking:
>>>>
>>>> /* Similar to rdlock(), but doesn't wait for writers to finish. It is
>>>>  * the caller's responsibility to make sure that there are no writers. */
>>>> bdrv_inc_in_flight()
>>>>
>>>> /* Similar to wrlock(). Waits for readers to finish. New readers are not
>>>>  * prevented from starting after it returns. Third parties are politely
>>>>  * asked not to touch the block node while it is drained. */
>>>> bdrv_drained_begin()
>>>>
>>>> (I think the unlock counterparts actually behave as expected from a real
>>>> lock.)
>>>>
>>>> Having an actual rdlock() (that waits for writers), and using static
>>>> analysis to verify that all relevant places use it (so that wrlock()
>>>> doesn't rely on politely asking third parties to leave the node alone)
>>>> gave me some confidence that we could verify the result.
>>>>
>>>> I'm not sure at all how to achieve the same with the drain interface. In
>>>> theory, it's possible. But it complicates the conditions so much that
>>>> I'm pretty much sure that the review wouldn't only be very time
>>>> consuming, but I would make mistakes during the review, rendering it
>>>> useless.
>>>>
>>>> Maybe throwing some more static analysis on the code can help, not sure.
>>>> It's going to be a bit more complex than with the other approach,
>>>> though.
>>>
>>> Hi,
>>> Sorry for the long email. I've included three topics that may help us discuss
>>> draining and AioContext removal further.
>>>
>>> The shortcomings of drain
>>> -------------------------
>>> I wanted to explore the logical argument that making graph modifications within
>>> a drained section is correct:
>>> - Graph modifications and BB/BDS lookup are Global State (GS).
>>> - Graph traversal from a BB/BDS pointer is IO.
>>> - Only one thread executes GS code at a time.
>>> - IO is quiesced within a drained section.
>>> - Therefore a drained section in GS code suspends graph traversal, other graph
>>>   modifications, and BB/BDS lookup.
>>> - Therefore it is safe to modify the graph from a GS drained section.
>>>
>>> However, I hit on a problem that I think Emanuele and Paolo have already
>>> pointed out: draining is GS & IO. This might have worked under the 1 IOThread
>>> model but it does not make sense for multi-queue. It is possible to submit I/O
>>> requests in drained sections. How can multiple threads be in drained sections
>>> simultaneously and possibly submit further I/O requests in their drained
>>> sections? Those sections wouldn't be "drained" in any useful sense of the word.
>>>
>>> It is necessary to adjust how recursive drained sections work:
>>> bdrv_drained_begin() must not return while there are deeper nested drained
>>> sections.
>>>
>>> This is allowed:
>>>
>>>      Monitor command            Block job
>>>      ---------------            ---------
>>>   > bdrv_drained_begin()
>>>            .                 > bdrv_drained_begin()
>>>            .                 < bdrv_drained_begin()
>>>            .                          .
>>>            .                          .
>>>            .                 > bdrv_drained_end()
>>>            .                 < bdrv_drained_end()
>>>   < bdrv_drained_begin()
>>>            .
>>>            .
>>>   > bdrv_drained_end()
>>>   < bdrv_drained_end()
>>>
>>> This is not allowed:
>>>
>>>      Monitor command            Block job
>>>      ---------------            ---------
>>>   > bdrv_drained_begin()
>>>            .                 > bdrv_drained_begin()
>>>            .                 < bdrv_drained_begin()
>>>            .                          .
>>>            .                          .
>>>   < bdrv_drained_begin()              .
>>>            .                          .
>>>            .                 > bdrv_drained_end()
>>>            .                 < bdrv_drained_end()
>>>   > bdrv_drained_end()
>>>   < bdrv_drained_end()
>>>
>>> This restriction creates an ordering between bdrv_drained_begin() callers. In
>>> this example the block job must not depend on the monitor command completing
>>> first. Otherwise there would be a deadlock, just like when two threads wait
>>> for each other while each holds a mutex.
>>>
>>> For sanity I think draining should be GS-only. IO code should not invoke
>>> bdrv_drained_begin() to avoid ordering problems when multiple threads drain
>>> simultaneously and have dependencies on each other.
>>>
>>> block/mirror.c needs to be modified because it currently drains from IO when
>>> mirroring is about to end.
>>
>> Yes, mirror seems to be the only clear place where draining is performed
>> from IO, namely in mirror_run. It's also a little bit weird because only
>> drained_begin is invoked from IO, while drained_end is in the main loop.
>>
>> If I understand correctly, the reason for that is that we want to
>> prevent bdrv_get_dirty_count from being modified (i.e. that additional
>> modifications arrive) after we have just finished mirroring.
>>
>> Not sure how to deal with this though.
> 
> I don't know the code so I don't have a concrete solution in mind.
> 
>>>
>>> With this change to draining I think the logical argument for correctness with
>>> graph modifications holds.
>>>
>>> Enforcing GS/IO separation at compile time
>>> ------------------------------------------
>>> Right now GS/IO asserts check assumptions at runtime. The next step may be
>>> using the type system to separate GS and IO APIs so it's impossible for IO code
>>> to accidentally call GS code, for example.
>>>
>>>   typedef struct {
>>>       BlockDriverState *bs;
>>>   } BlockDriverStateIOPtr;
>>>
>>>   typedef struct {
>>>       BlockDriverState *bs;
>>>   } BlockDriverStateGSPtr;
>>>
>>> Then APIs can be protected as follows:
>>>
>>>   void bdrv_set_aio_context_ignore(BlockDriverStateGSPtr bs, ...);
>>>
>>> A function that only has a BlockDriverStateIOPtr cannot call
>>> bdrv_set_aio_context_ignore() by mistake since the compiler will complain that
>>> the first argument must be a BlockDriverStateGSPtr.
>>
>> What about functions that do not need a BlockDriverState * and, for
>> example, use a BdrvChild instead? I assume we would need to replicate
>> this for that type as well.
>>
>> And what about GS & IO functions? Not only drain, but also all the
>> generated_co_wrapper should be GS & IO.
> 
> What is the meaning of GS & IO in a multi-queue world? I guess it's
> effectively IO except it's well-behaved if called with the BQL held?

Good question. I think so. But they will need to be protected, as they
will probably be traversing the graph whether called from GS or IO context.

> 
>>
>>>
>>> Possible steps for AioContext removal
>>> -------------------------------------
>>> I also wanted to share my assumptions about multi-queue and AioContext removal.
>>> Please let me know if anything seems wrong or questionable:
>>>
>>> - IO code can execute in any thread that has an AioContext.
>>> - Multiple threads may execute IO code at the same time.
>>> - GS code only executes under the BQL.
>>>
>>> For AioContext removal this means:
>>>
>>> - bdrv_get_aio_context() becomes mostly meaningless since there is no need for
>>>   a special "home" AioContext.
>>
>> Makes sense
>>
>>> - bdrv_coroutine_enter() becomes mostly meaningless because there is no need to
>>>   run a coroutine in the BDS's AioContext.
>>
>> Why is there no need?
> 
> The coroutine can execute in the current thread since the BDS must be
> thread-safe.
> 
> If bdrv_coroutine_enter() is used in cases like block jobs to mean "run
> in the same AioContext as the job coroutine" then the code should
> probably be changed to explicitly use the job AioContext instead of the
> more vague BDS AioContext.

Ok makes sense

> 
>>> - aio_disable_external(bdrv_get_aio_context(bs)) no longer works because many
>>>   threads/AioContexts may submit new I/O requests. BlockDevOps.drained_begin()
>>>   may be used instead (e.g. to temporarily disable ioeventfds on a multi-queue
>>>   virtio-blk device).
>>
>> Ok
>>
>>> - AIO_WAIT_WHILE() simplifies to
>>>
>>>     while ((cond)) {
>>>         aio_poll(qemu_get_current_aio_context(), true);
>>>         ...
>>>     }
>>>
>>>   and the distinction between home AioContext and non-home context is
>>>   eliminated. AioContext unlocking is dropped.
>>
>> I guess this implies no coroutine in BDS's AioContext.
> 
> I'm not sure what you mean but coroutines accessing the BDS can run in
> any AioContext with multi-queue. If they have any locking requirements
> amongst each other then that should be explicit instead of just throwing
> them all into the BDS AioContext.

Ok
> 
>>>
>>> Does this make sense? I haven't seen these things in recent patch series.
>>
>> You haven't seen them because there's no way to do it if we don't even
>> agree on what to replace the AioContext lock with.
>> Getting to this point would imply first having something (drain,
>> rwlock, whatever) in place together with AioContext, and then gradually
>> removing it. At least, that's how I understood it.
> 
> What, besides graph modification and the things I listed in this email,
> still needs to be untangled from AioContext?

Nothing else (I hope), but as Kevin also said, this requires a lot of
preliminary work before getting to this state.

Emanuele

> 
> Stefan
>
Stefan Hajnoczi May 23, 2022, 3:13 p.m. UTC | #25
On Mon, May 23, 2022 at 03:02:05PM +0200, Kevin Wolf wrote:
> Am 22.05.2022 um 17:06 hat Stefan Hajnoczi geschrieben:
> > Hi,
> > Sorry for the long email. I've included three topics that may help us discuss
> > draining and AioContext removal further.
> > 
> > The shortcomings of drain
> > -------------------------
> > I wanted to explore the logical argument that making graph modifications within
> > a drained section is correct:
> > - Graph modifications and BB/BDS lookup are Global State (GS).
> > - Graph traversal from a BB/BDS pointer is IO.
> > - Only one thread executes GS code at a time.
> > - IO is quiesced within a drained section.
> 
> I think you're mixing two things here and calling both of them IO.
> 
> If a function is marked as IO, that means that it is safe to call from
> I/O requests, which may be running in any iothread (they currently only
> run in the home thread of the node's AioContext, but the function
> annotations already promise that any thread is safe).
> 
> However, if a function is marked as IO, this doesn't necessarily mean
> that it is always only called in the context of an I/O request. It can
> be called by any other code, too.
> 
> So while drain does quiesce all I/O requests, this doesn't mean that
> functions marked as IO won't run any more.

My model is that between bdrv_drained_begin() and bdrv_drained_end()
only the caller is allowed to invoke BB/BDS functions (including
functions marked IO).

The caller isn't strictly limited to one thread and at most one coroutine. The
caller could use threads and coroutines, but the important thing is that
everything else that accesses the BB/BDS is suspended due to
.drained_begin() callbacks (and is_external).

So when you say a function marked IO can be called from outside an I/O
request, that "outside an I/O request" code must be quiesced too.
Otherwise drain is definitely not safe for graph modifications.
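Concretely, this model can be sketched as an in_flight counter plus a quiesced flag. The node_* names are hypothetical, and returning an error from a suspended submitter is a simplification; the real .drained_begin() callbacks disable ioeventfds and the like instead:

```c
#include <pthread.h>

static pthread_mutex_t node_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t node_cond = PTHREAD_COND_INITIALIZER;
static int node_in_flight; /* accesses currently touching the node */
static int node_quiesced;  /* a drained section is open */

/* Anyone touching the node registers here first; while the node is
 * quiesced they must not start (modeled by returning -1). */
static int node_request_begin(void)
{
    pthread_mutex_lock(&node_lock);
    if (node_quiesced) {
        pthread_mutex_unlock(&node_lock);
        return -1; /* suspended by .drained_begin() */
    }
    node_in_flight++;
    pthread_mutex_unlock(&node_lock);
    return 0;
}

static void node_request_end(void)
{
    pthread_mutex_lock(&node_lock);
    if (--node_in_flight == 0) {
        pthread_cond_broadcast(&node_cond);
    }
    pthread_mutex_unlock(&node_lock);
}

/* Stop new accesses, then wait for pending ones to finish; afterwards
 * the caller has the node to itself. */
static void node_drained_begin(void)
{
    pthread_mutex_lock(&node_lock);
    node_quiesced = 1;
    while (node_in_flight > 0) {
        pthread_cond_wait(&node_cond, &node_lock);
    }
    pthread_mutex_unlock(&node_lock);
}

static void node_drained_end(void)
{
    pthread_mutex_lock(&node_lock);
    node_quiesced = 0;
    pthread_mutex_unlock(&node_lock);
}
```

The point of the sketch: drain is only safe for graph modifications if everything that touches the node, not just I/O requests, goes through node_request_begin() or is otherwise quiesced.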

> > - Therefore a drained section in GS code suspends graph traversal, other graph
> >   modifications, and BB/BDS lookup.
> > - Therefore it is safe to modify the graph from a GS drained section.
> 
> So unfortunately, I don't think this conclusion is correct.
> 
> In order to make your assumption true, we would have to require that all
> callers of IO functions must have called bdrv_inc_in_flight(). We would
> also have to find a way to enforce this, preferably at compile time or
> with static analysis, or at least with runtime assertions, because it
> would be very easy to get wrong.

Or that they are quiesced by .drained_begin() callbacks or is_external.

Do you have a concrete example?

> 
> > However, I hit on a problem that I think Emanuele and Paolo have already
> > pointed out: draining is GS & IO. This might have worked under the 1 IOThread
> > model but it does not make sense for multi-queue. It is possible to submit I/O
> > requests in drained sections. How can multiple threads be in drained sections
> > simultaneously and possibly submit further I/O requests in their drained
> > sections? Those sections wouldn't be "drained" in any useful sense of the word.
> 
> Drain asks other users not to touch the block node. Currently this only

BTW interesting language choice here: you're using the more general
definition of "other users" and "touch[ing] the block node" rather than
the narrow definition of just "I/O requests". That's exactly how I think
of drain but based on what you wrote above I'm not sure that's how you
think of it too?

> includes I/O requests, but the desire to use drain for locking means that we need to
> extend it to pretty much any operation on the node, which would include
> calling drain on that block node. So you should never have two callers
> in a drain section at the same time, it would be a bug in this model.

Yes! Drain in its current form doesn't make sense for multi-queue since
multiple threads could be in drained sections at the same time and they
would all be allowed to submit new I/O requests.

> Of course, we know that currently drain is not respected by all relevant
> callers (most importantly, the monitor). If you want to use drain as a
> form of locking, you need to solve this first.
> 
> > It is necessary to adjust how recursive drained sections work:
> > bdrv_drained_begin() must not return while there are deeper nested drained
> > sections.
> > 
> > This is allowed:
> > 
> >      Monitor command            Block job
> >      ---------------            ---------
> >   > bdrv_drained_begin()
> >            .                 > bdrv_drained_begin()
> >            .                 < bdrv_drained_begin()
> >            .                          .
> >            .                          .
> >            .                 > bdrv_drained_end()
> >            .                 < bdrv_drained_end()
> >   < bdrv_drained_begin()
> >            .
> >            .
> >   > bdrv_drained_end()
> >   < bdrv_drained_end()
> 
> Just to make sure I understand the scenario that you have in mind here:
> The monitor command is not what causes the block job to do the draining
> because this is inside bdrv_drained_begin(), so the block job just
> randomly got a callback that caused it to do so (maybe completion)?

Yes, exactly that completion scenario. When the mirror block job
completes exactly when a monitor command calls bdrv_drained_begin(),
mirror_exit_common() deletes a temporary filter BDS node. It involves
drain and modifies the graph.

> 
> Before bdrv_drained_begin() returns, anything is still allowed to run,
> so I agree that this is valid.

Thanks for confirming!

> > This is not allowed:
> > 
> >      Monitor command            Block job
> >      ---------------            ---------
> >   > bdrv_drained_begin()
> >            .                 > bdrv_drained_begin()
> >            .                 < bdrv_drained_begin()
> >            .                          .
> >            .                          .
> >   < bdrv_drained_begin()              .
> >            .                          .
> >            .                 > bdrv_drained_end()
> >            .                 < bdrv_drained_end()
> >   > bdrv_drained_end()
> >   < bdrv_drained_end()
> > 
> > This restriction creates an ordering between bdrv_drained_begin() callers. In
> > this example the block job must not depend on the monitor command completing
> > first. Otherwise there would be a deadlock just like with two threads waiting for
> > each other while holding a mutex.
> 
> So essentially drain needs to increase bs->in_flight, so that the outer
> drain has to wait for the inner drain section to end before its
> bdrv_drained_begin() can return.
> 
> > For sanity I think draining should be GS-only. IO code should not invoke
> > bdrv_drained_begin() to avoid ordering problems when multiple threads drain
> > simultaneously and have dependencies on each other.
> > 
> > block/mirror.c needs to be modified because it currently drains from IO when
> > mirroring is about to end.
> > 
> > With this change to draining I think the logical argument for correctness with
> > graph modifications holds.
> 
> What is your suggestion how to modify mirror? It drains so that no new
> requests can sneak in and source and target stay in sync while it
> switches to the completion code running in the main AioContext. You
> can't just drop this.

I haven't read the code in enough depth to come up with a solution.

> 
> > Enforcing GS/IO separation at compile time
> > ------------------------------------------
> > Right now GS/IO asserts check assumptions at runtime. The next step may be
> > using the type system to separate GS and IO APIs so it's impossible for IO code
> > to accidentally call GS code, for example.
> > 
> >   typedef struct {
> >       BlockDriverState *bs;
> >   } BlockDriverStateIOPtr;
> > 
> >   typedef struct {
> >       BlockDriverState *bs;
> >   } BlockDriverStateGSPtr;
> > 
> > Then APIs can be protected as follows:
> > 
> >   void bdrv_set_aio_context_ignore(BlockDriverStateGSPtr bs, ...);
> > 
> > A function that only has a BlockDriverStateIOPtr cannot call
> > bdrv_set_aio_context_ignore() by mistake since the compiler will complain that
> > the first argument must be a BlockDriverStateGSPtr.
> 
> And then you have a set of functions that cast from GS to IO pointers,
> but not for the other way around? Or maybe getting an IO pointer would
> even be coupled with automatically increasing bs->in_flight?
> 
> The other option that we were looking into for this was static analysis.
> I had hacked up a small script to check consistency of these annotations
> (without covering function pointers, though) to help me with the review
> of Emanuele's patches that introduced them. If I understand correctly,
> Paolo has scripts to do the same a little bit better.
> 
> As long as we can integrate such a script in 'make check', we wouldn't
> necessarily have to have the churn in the C code to switch everything to
> new wrapper types.

Yes, that's the problem with the typedef Ptr approach. It would involve
a lot of code changes. Maybe static analysis tools are better.

> 
> > Possible steps for AioContext removal
> > -------------------------------------
> > I also wanted to share my assumptions about multi-queue and AioContext removal.
> > Please let me know if anything seems wrong or questionable:
> > 
> > - IO code can execute in any thread that has an AioContext.
> > - Multiple threads may execute IO code at the same time.
> > - GS code only executes under the BQL.
> > 
> > For AioContext removal this means:
> > 
> > - bdrv_get_aio_context() becomes mostly meaningless since there is no need for
> >   a special "home" AioContext.
> > - bdrv_coroutine_enter() becomes mostly meaningless because there is no need to
> >   run a coroutine in the BDS's AioContext.
> > - aio_disable_external(bdrv_get_aio_context(bs)) no longer works because many
> >   threads/AioContexts may submit new I/O requests. BlockDevOps.drained_begin()
> >   may be used instead (e.g. to temporarily disable ioeventfds on a multi-queue
> >   virtio-blk device).
> > - AIO_WAIT_WHILE() simplifies to
> > 
> >     while ((cond)) {
> >         aio_poll(qemu_get_current_aio_context(), true);
> >         ...
> >     }
> 
> Probably not exactly, because you still need aio_wait_kick() to wake up
> the thread. We use AioWait.num_waiters != 0 to decide whether we need to
> schedule a BH in the main thread (because doing so unconditionally for
> every completed request would be very wasteful).
> 
> If you want to be able to run this loop from any thread instead of just
> the main thread, you would have to store somewhere which thread to wake.

qemu_cond_broadcast() can be used since the wakeup is a slow path.

> >   and the distinction between home AioContext and non-home context is
> >   eliminated. AioContext unlocking is dropped.
> > 
> > Does this make sense? I haven't seen these things in recent patch series.
> 
> Intuitively I would agree with most. There may be details that I'm not
> aware of at the moment. I'm not surprised that we haven't seen any such
> things in recent patch series because there is an awful lot of
> preparational work to be done before we can reach this final state.

Yes. I'm mostly hoping to find out whether my view is fundamentally
flawed or very different from what you and others think.

Stefan
Kevin Wolf May 23, 2022, 4:04 p.m. UTC | #26
Am 23.05.2022 um 17:13 hat Stefan Hajnoczi geschrieben:
> On Mon, May 23, 2022 at 03:02:05PM +0200, Kevin Wolf wrote:
> > Am 22.05.2022 um 17:06 hat Stefan Hajnoczi geschrieben:
> > > Hi,
> > > Sorry for the long email. I've included three topics that may help us discuss
> > > draining and AioContext removal further.
> > > 
> > > The shortcomings of drain
> > > -------------------------
> > > I wanted to explore the logical argument that making graph modifications within
> > > a drained section is correct:
> > > - Graph modifications and BB/BDS lookup are Global State (GS).
> > > - Graph traversal from a BB/BDS pointer is IO.
> > > - Only one thread executes GS code at a time.
> > > - IO is quiesced within a drained section.
> > 
> > I think you're mixing two things here and calling both of them IO.
> > 
> > If a function is marked as IO, that means that it is safe to call from
> > I/O requests, which may be running in any iothread (they currently only
> > run in the home thread of the node's AioContext, but the function
> > annotations already promise that any thread is safe).
> > 
> > However, if a function is marked as IO, this doesn't necessarily mean
> > that it is always only called in the context of an I/O request. It can
> > be called by any other code, too.
> > 
> > So while drain does quiesce all I/O requests, this doesn't mean that
> > functions marked as IO won't run any more.
> 
> My model is that between bdrv_drained_begin() and bdrv_drained_end()
> only the caller is allowed to invoke BB/BDS functions (including
> functions marked IO).

Okay, this sounds better. It's not actually related to IO/GS, but to
BB/BDS functions, including both IO and GS functions.

So graph traversal from a BB/BDS pointer makes it a BB/BDS function, and
BB/BDS functions need to be quiesced within a drained section.

> The caller isn't strictly limited to a single thread and at most one coroutine. The
> caller could use threads and coroutines, but the important thing is
> that everything else that accesses the BB/BDS is suspended due to
> .drained_begin() callbacks (and is_external).
> 
> So when you say a function marked IO can be called from outside an I/O
> request, that "outside an I/O request" code must be quiesced too.
> Otherwise drain is definitely not safe for graph modifications.

If you phrase it as a condition and as a goal to achieve, then I agree
that this is required when you want to use drain for locking.

My impression was that you were using this to argue that the code is
already doing this and is already safe in this scenario, and this isn't
true. I think I misunderstood you there and you were really describing
the future state that you're envisioning.

> > > - Therefore a drained section in GS code suspends graph traversal, other graph
> > >   modifications, and BB/BDS lookup.
> > > - Therefore it is safe to modify the graph from a GS drained section.
> > 
> > So unfortunately, I don't think this conclusion is correct.
> > 
> > In order to make your assumption true, we would have to require that all
> > callers of IO functions must have called bdrv_inc_in_flight(). We would
> > also have to find a way to enforce this preferably at compile time or
> > with static analysis, or at least with runtime assertions, because it
> > would be very easy to get wrong.
> 
> Or that they are quiesced by .drained_begin() callbacks or is_external.
> 
> Do you have a concrete example?

Yes, you're right, bdrv_inc_in_flight() is only one way to achieve this.
They just need to make bdrv_drain_poll() return true as long as they are
active so that bdrv_drained_begin() waits for them. .drained_poll() is
another valid way to achieve this.

However, if we want to rely on static analysis to verify that everything
is covered, always requiring bdrv_inc_in_flight() might make this
easier. So possibly we want to require it even in cases where
.drained_poll() or aio_disable_external() would be enough in theory.

> > 
> > > However, I hit on a problem that I think Emanuele and Paolo have already
> > > pointed out: draining is GS & IO. This might have worked under the 1 IOThread
> > > model but it does not make sense for multi-queue. It is possible to submit I/O
> > > requests in drained sections. How can multiple threads be in drained sections
> > > simultaneously and possibly submit further I/O requests in their drained
> > > sections? Those sections wouldn't be "drained" in any useful sense of the word.
> > 
> > Drain asks other users not to touch the block node. Currently this only
> 
> BTW interesting language choice here: you're using the more general
> definition of "other users" and "touch[ing] the block node" rather than
> the narrow definition of just "I/O requests". That's exactly how I think
> of drain but based on what you wrote above I'm not sure that's how you
> think of it too?

I hope that my reply above made it clearer: The broader definition is
what is needed if we want to use drain to replace the AioContext lock
for protecting the graph, but the narrow definition is what is
implemented today.

> > includes I/O requests, but the desire to use drain for locking means that we need to
> > extend it to pretty much any operation on the node, which would include
> > calling drain on that block node. So you should never have two callers
> > in a drain section at the same time, it would be a bug in this model.
> 
> Yes! Drain in its current form doesn't make sense for multi-queue since
> multiple threads could be in drained sections at the same time and they
> would all be allowed to submit new I/O requests.

Nobody would be allowed to submit new requests (because someone else has
drained the node), but they would do so anyway. ;-)

Actually, only half joking, because this shows how weak the protection
by drain is if we don't have a way to verify that the whole codebase
supports drain correctly.

> > Of course, we know that currently drain is not respected by all relevant
> > callers (most importantly, the monitor). If you want to use drain as a
> > form of locking, you need to solve this first.
> > 
> > > It is necessary to adjust how recursive drained sections work:
> > > bdrv_drained_begin() must not return while there are deeper nested drained
> > > sections.
> > > 
> > > This is allowed:
> > > 
> > >      Monitor command            Block job
> > >      ---------------            ---------
> > >   > bdrv_drained_begin()
> > >            .                 > bdrv_drained_begin()
> > >            .                 < bdrv_drained_begin()
> > >            .                          .
> > >            .                          .
> > >            .                 > bdrv_drained_end()
> > >            .                 < bdrv_drained_end()
> > >   < bdrv_drained_begin()
> > >            .
> > >            .
> > >   > bdrv_drained_end()
> > >   < bdrv_drained_end()
> > 
> > Just to make sure I understand the scenario that you have in mind here:
> > The monitor command is not what causes the block job to do the draining
> > because this is inside bdrv_drained_begin(), so the block job just
> > randomly got a callback that caused it to do so (maybe completion)?
> 
> Yes, exactly that completion scenario. When the mirror block job
> completes exactly when a monitor command calls bdrv_drained_begin(),
> mirror_exit_common() deletes a temporary filter BDS node. It involves
> drain and modifies the graph.
> 
> > 
> > Before bdrv_drained_begin() returns, anything is still allowed to run,
> > so I agree that this is valid.
> 
> Thanks for confirming!
> 
> > > This is not allowed:
> > > 
> > >      Monitor command            Block job
> > >      ---------------            ---------
> > >   > bdrv_drained_begin()
> > >            .                 > bdrv_drained_begin()
> > >            .                 < bdrv_drained_begin()
> > >            .                          .
> > >            .                          .
> > >   < bdrv_drained_begin()              .
> > >            .                          .
> > >            .                 > bdrv_drained_end()
> > >            .                 < bdrv_drained_end()
> > >   > bdrv_drained_end()
> > >   < bdrv_drained_end()
> > > 
> > > This restriction creates an ordering between bdrv_drained_begin() callers. In
> > > this example the block job must not depend on the monitor command completing
> > > first. Otherwise there would be a deadlock just like with two threads waiting for
> > > each other while holding a mutex.
> > 
> > So essentially drain needs to increase bs->in_flight, so that the outer
> > drain has to wait for the inner drain section to end before its
> > bdrv_drained_begin() can return.
> > 
> > > For sanity I think draining should be GS-only. IO code should not invoke
> > > bdrv_drained_begin() to avoid ordering problems when multiple threads drain
> > > simultaneously and have dependencies on each other.
> > > 
> > > block/mirror.c needs to be modified because it currently drains from IO when
> > > mirroring is about to end.
> > > 
> > > With this change to draining I think the logical argument for correctness with
> > > graph modifications holds.
> > 
> > What is your suggestion how to modify mirror? It drains so that no new
> > requests can sneak in and source and target stay in sync while it
> > switches to the completion code running in the main AioContext. You
> > can't just drop this.
> 
> I haven't read the code in enough depth to come up with a solution.

So this sounds like it needs more thought before we can assume that
we'll have a final state where drain is GS-only.

> > 
> > > Enforcing GS/IO separation at compile time
> > > ------------------------------------------
> > > Right now GS/IO asserts check assumptions at runtime. The next step may be
> > > using the type system to separate GS and IO APIs so it's impossible for IO code
> > > to accidentally call GS code, for example.
> > > 
> > >   typedef struct {
> > >       BlockDriverState *bs;
> > >   } BlockDriverStateIOPtr;
> > > 
> > >   typedef struct {
> > >       BlockDriverState *bs;
> > >   } BlockDriverStateGSPtr;
> > > 
> > > Then APIs can be protected as follows:
> > > 
> > >   void bdrv_set_aio_context_ignore(BlockDriverStateGSPtr bs, ...);
> > > 
> > > A function that only has a BlockDriverStateIOPtr cannot call
> > > bdrv_set_aio_context_ignore() by mistake since the compiler will complain that
> > > the first argument must be a BlockDriverStateGSPtr.
> > 
> > And then you have a set of functions that cast from GS to IO pointers,
> > but not for the other way around? Or maybe getting an IO pointer would
> > even be coupled with automatically increasing bs->in_flight?
> > 
> > The other option that we were looking into for this was static analysis.
> > I had hacked up a small script to check consistency of these annotations
> > (without covering function pointers, though) to help me with the review
> > of Emanuele's patches that introduced them. If I understand correctly,
> > Paolo has scripts to do the same a little bit better.
> > 
> > As long as we can integrate such a script in 'make check', we wouldn't
> > necessarily have to have the churn in the C code to switch everything to
> > new wrapper types.
> 
> Yes, that's the problem with the typedef Ptr approach. It would involve
> a lot of code changes. Maybe static analysis tools are better.
> 
> > 
> > > Possible steps for AioContext removal
> > > -------------------------------------
> > > I also wanted to share my assumptions about multi-queue and AioContext removal.
> > > Please let me know if anything seems wrong or questionable:
> > > 
> > > - IO code can execute in any thread that has an AioContext.
> > > - Multiple threads may execute IO code at the same time.
> > > - GS code only executes under the BQL.
> > > 
> > > For AioContext removal this means:
> > > 
> > > - bdrv_get_aio_context() becomes mostly meaningless since there is no need for
> > >   a special "home" AioContext.
> > > - bdrv_coroutine_enter() becomes mostly meaningless because there is no need to
> > >   run a coroutine in the BDS's AioContext.
> > > - aio_disable_external(bdrv_get_aio_context(bs)) no longer works because many
> > >   threads/AioContexts may submit new I/O requests. BlockDevOps.drained_begin()
> > >   may be used instead (e.g. to temporarily disable ioeventfds on a multi-queue
> > >   virtio-blk device).
> > > - AIO_WAIT_WHILE() simplifies to
> > > 
> > >     while ((cond)) {
> > >         aio_poll(qemu_get_current_aio_context(), true);
> > >         ...
> > >     }
> > 
> > Probably not exactly, because you still need aio_wait_kick() to wake up
> > the thread. We use AioWait.num_waiters != 0 to decide whether we need to
> > schedule a BH in the main thread (because doing so unconditionally for
> > every completed request would be very wasteful).
> > 
> > If you want to be able to run this loop from any thread instead of just
> > the main thread, you would have to store somewhere which thread to wake.
> 
> qemu_cond_broadcast() can be used since the wakeup is a slow path.

I don't think we have a QemuCond for waking up a blocking aio_poll()?
Doesn't this usually involve writing to the event notifier file
descriptor of the specific AioContext?

> > >   and the distinction between home AioContext and non-home context is
> > >   eliminated. AioContext unlocking is dropped.
> > > 
> > > Does this make sense? I haven't seen these things in recent patch series.
> > 
> > Intuitively I would agree with most. There may be details that I'm not
> > aware of at the moment. I'm not surprised that we haven't seen any such
> > things in recent patch series because there is an awful lot of
> > preparational work to be done before we can reach this final state.
> 
> Yes. I'm mostly hoping to find out whether my view is fundamentally
> flawed or very different from what you and others think.

No, not fundamentally.

The big challenge in my mind is how to verify that the conditions are
actually met. I'm fairly sure that using drain this way is by far too
complex to rely on review only.

Kevin
Stefan Hajnoczi May 23, 2022, 4:45 p.m. UTC | #27
On Mon, May 23, 2022 at 06:04:55PM +0200, Kevin Wolf wrote:
> Am 23.05.2022 um 17:13 hat Stefan Hajnoczi geschrieben:
> > On Mon, May 23, 2022 at 03:02:05PM +0200, Kevin Wolf wrote:
> > > Am 22.05.2022 um 17:06 hat Stefan Hajnoczi geschrieben:
> > > > Hi,
> > > > Sorry for the long email. I've included three topics that may help us discuss
> > > > draining and AioContext removal further.
> > > > 
> > > > The shortcomings of drain
> > > > -------------------------
> > > > I wanted to explore the logical argument that making graph modifications within
> > > > a drained section is correct:
> > > > - Graph modifications and BB/BDS lookup are Global State (GS).
> > > > - Graph traversal from a BB/BDS pointer is IO.
> > > > - Only one thread executes GS code at a time.
> > > > - IO is quiesced within a drained section.
> > > 
> > > I think you're mixing two things here and calling both of them IO.
> > > 
> > > If a function is marked as IO, that means that it is safe to call from
> > > I/O requests, which may be running in any iothread (they currently only
> > > run in the home thread of the node's AioContext, but the function
> > > annotations already promise that any thread is safe).
> > > 
> > > However, if a function is marked as IO, this doesn't necessarily mean
> > > that it is always only called in the context of an I/O request. It can
> > > be called by any other code, too.
> > > 
> > > So while drain does quiesce all I/O requests, this doesn't mean that
> > > functions marked as IO won't run any more.
> > 
> > My model is that between bdrv_drained_begin() and bdrv_drained_end()
> > only the caller is allowed to invoke BB/BDS functions (including
> > functions marked IO).
> 
> Okay, this sounds better. It's not actually related to IO/GS, but to
> BB/BDS functions, including both IO and GS functions.
> 
> So graph traversal from a BB/BDS pointer makes it a BB/BDS function, and
> BB/BDS functions need to be quiesced within a drained section.
> 
> > The caller isn't strictly one thread and one or no coroutines. The
> > caller could use threads and coroutines, but the important thing is
> > that everything else that accesses the BB/BDS is suspended due to
> > .drained_begin() callbacks (and is_external).
> > 
> > So when you say a function marked IO can be called from outside an I/O
> > request, that "outside an I/O request" code must be quiesced too.
> > Otherwise drain is definitely not safe for graph modifications.
> 
> If you phrase it as a condition and as a goal to achieve, then I agree
> that this is required when you want to use drain for locking.
> 
> My impression was that you were using this to argue that the code is
> already doing this and is already safe in this scenario, and this isn't
> true. I think I misunderstood you there and you were really describing
> the future state that you're envisioning.

Sorry, I didn't present it clearly. I'm trying to think through the
model needed to make graph modifications under drain safe.

> > > > - Therefore a drained section in GS code suspends graph traversal, other graph
> > > >   modifications, and BB/BDS lookup.
> > > > - Therefore it is safe to modify the graph from a GS drained section.
> > > 
> > > So unfortunately, I don't think this conclusion is correct.
> > > 
> > > In order to make your assumption true, we would have to require that all
> > > callers of IO functions must have called bdrv_inc_in_flight(). We would
> > > also have to find a way to enforce this preferably at compile time or
> > > with static analysis, or at least with runtime assertions, because it
> > > would be very easy to get wrong.
> > 
> > Or that they are quiesced by .drained_begin() callbacks or is_external.
> > 
> > Do you have a concrete example?
> 
> Yes, you're right, bdrv_inc_in_flight() is only one way to achieve this.
> They just need to make bdrv_drain_poll() return true as long as they are
> active so that bdrv_drained_begin() waits for them. .drained_poll() is
> another valid way to achieve this.
> 
> However, if we want to rely on static analysis to verify that everything
> is covered, always requiring bdrv_inc_in_flight() might make this
> easier. So possibly we want to require it even in cases where
> .drained_poll() or aio_disable_external() would be enough in theory.
> 
> > > 
> > > > However, I hit on a problem that I think Emanuele and Paolo have already
> > > > pointed out: draining is GS & IO. This might have worked under the 1 IOThread
> > > > model but it does not make sense for multi-queue. It is possible to submit I/O
> > > > requests in drained sections. How can multiple threads be in drained sections
> > > > simultaneously and possibly submit further I/O requests in their drained
> > > > sections? Those sections wouldn't be "drained" in any useful sense of the word.
> > > 
> > > Drain asks other users not to touch the block node. Currently this only
> > 
> > BTW interesting language choice here: you're using the more general
> > definition of "other users" and "touch[ing] the block node" rather than
> > the narrow definition of just "I/O requests". That's exactly how I think
> > of drain but based on what you wrote above I'm not sure that's how you
> > think of it too?
> 
> I hope that my reply above made it clearer: The broader definition is
> what is needed if we want to use drain to replace the AioContext lock
> for protecting the graph, but the narrow definition is what is
> implemented today.

Okay.

> > > includes I/O requests, but the desire to use drain for locking means that we need to
> > > extend it to pretty much any operation on the node, which would include
> > > calling drain on that block node. So you should never have two callers
> > > in a drain section at the same time, it would be a bug in this model.
> > 
> > Yes! Drain in its current form doesn't make sense for multi-queue since
> > multiple threads could be in drained sections at the same time and they
> > would all be allowed to submit new I/O requests.
> 
> Nobody would be allowed to submit new requests (because someone else has
> drained the node), but they would do so anyway. ;-)
> 
> Actually, only half joking, because this shows how weak the protection
> by drain is if we don't have a way to verify that the whole codebase
> supports drain correctly.
> 
> > > Of course, we know that currently drain is not respected by all relevant
> > > callers (most importantly, the monitor). If you want to use drain as a
> > > form of locking, you need to solve this first.
> > > 
> > > > It is necessary to adjust how recursive drained sections work:
> > > > bdrv_drained_begin() must not return while there are deeper nested drained
> > > > sections.
> > > > 
> > > > This is allowed:
> > > > 
> > > >      Monitor command            Block job
> > > >      ---------------            ---------
> > > >   > bdrv_drained_begin()
> > > >            .                 > bdrv_drained_begin()
> > > >            .                 < bdrv_drained_begin()
> > > >            .                          .
> > > >            .                          .
> > > >            .                 > bdrv_drained_end()
> > > >            .                 < bdrv_drained_end()
> > > >   < bdrv_drained_begin()
> > > >            .
> > > >            .
> > > >   > bdrv_drained_end()
> > > >   < bdrv_drained_end()
> > > 
> > > Just to make sure I understand the scenario that you have in mind here:
> > > The monitor command is not what causes the block job to do the draining
> > > because this is inside bdrv_drained_begin(), so the block job just
> > > randomly got a callback that caused it to do so (maybe completion)?
> > 
> > Yes, exactly that completion scenario. When the mirror block job
> > completes exactly when a monitor command calls bdrv_drained_begin(),
> > mirror_exit_common() deletes a temporary filter BDS node. It involves
> > drain and modifies the graph.
> > 
> > > 
> > > Before bdrv_drained_begin() returns, anything is still allowed to run,
> > > so I agree that this is valid.
> > 
> > Thanks for confirming!
> > 
> > > > This is not allowed:
> > > > 
> > > >      Monitor command            Block job
> > > >      ---------------            ---------
> > > >   > bdrv_drained_begin()
> > > >            .                 > bdrv_drained_begin()
> > > >            .                 < bdrv_drained_begin()
> > > >            .                          .
> > > >            .                          .
> > > >   < bdrv_drained_begin()              .
> > > >            .                          .
> > > >            .                 > bdrv_drained_end()
> > > >            .                 < bdrv_drained_end()
> > > >   > bdrv_drained_end()
> > > >   < bdrv_drained_end()
> > > > 
> > > > This restriction creates an ordering between bdrv_drained_begin() callers. In
> > > > this example the block job must not depend on the monitor command completing
> > > > first. Otherwise there would be a deadlock just like with two threads waiting for
> > > > each other while holding a mutex.
> > > 
> > > So essentially drain needs to increase bs->in_flight, so that the outer
> > > drain has to wait for the inner drain section to end before its
> > > bdrv_drained_begin() can return.
> > > 
> > > > For sanity I think draining should be GS-only. IO code should not invoke
> > > > bdrv_drained_begin() to avoid ordering problems when multiple threads drain
> > > > simultaneously and have dependencies on each other.
> > > > 
> > > > block/mirror.c needs to be modified because it currently drains from IO when
> > > > mirroring is about to end.
> > > > 
> > > > With this change to draining I think the logical argument for correctness with
> > > > graph modifications holds.
> > > 
> > > What is your suggestion how to modify mirror? It drains so that no new
> > > requests can sneak in and source and target stay in sync while it
> > > switches to the completion code running in the main AioContext. You
> > > can't just drop this.
> > 
> > I haven't read the code in enough depth to come up with a solution.
> 
> So this sounds like it needs more thought before we can assume that
> we'll have a final state where drain is GS-only.

I will look into solutions for mirror but it might be a few days before
I have time.
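As a side note on the in_flight idea quoted above, here is a minimal, sequential C model of it (the struct and function names echo the discussion but are invented for this sketch, not the real QEMU code): an open drained section itself counts as an in-flight operation, so an outer bdrv_drained_begin() cannot return while an inner drained section is still open.

```c
#include <assert.h>
#include <stdatomic.h>

/* Toy model: a drained section bumps in_flight, so it is visible to
 * an outer drain just like an in-flight request would be. */
typedef struct { atomic_int in_flight; } BDS;

static void drained_begin(BDS *bs)
{
    /* the real code would also poll until pending requests settle;
     * the model only records that a drained section is open */
    atomic_fetch_add(&bs->in_flight, 1);
}

static void drained_end(BDS *bs)
{
    atomic_fetch_sub(&bs->in_flight, 1);
}

/* The condition an outer bdrv_drained_begin() would poll on before
 * returning: no requests *and* no inner drained sections in flight. */
static int outer_drain_may_return(BDS *bs)
{
    return atomic_load(&bs->in_flight) == 0;
}
```

In this model the ordering restriction falls out naturally: the outer drain simply keeps polling until the inner section has ended.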

> > > 
> > > > Enforcing GS/IO separation at compile time
> > > > ------------------------------------------
> > > > Right now GS/IO asserts check assumptions at runtime. The next step may be
> > > > using the type system to separate GS and IO APIs so it's impossible for IO code
> > > > to accidentally call GS code, for example.
> > > > 
> > > >   typedef struct {
> > > >       BlockDriverState *bs;
> > > >   } BlockDriverStateIOPtr;
> > > > 
> > > >   typedef struct {
> > > >       BlockDriverState *bs;
> > > >   } BlockDriverStateGSPtr;
> > > > 
> > > > Then APIs can be protected as follows:
> > > > 
> > > >   void bdrv_set_aio_context_ignore(BlockDriverStateGSPtr bs, ...);
> > > > 
> > > > A function that only has a BlockDriverStateIOPtr cannot call
> > > > bdrv_set_aio_context_ignore() by mistake since the compiler will complain that
> > > > the first argument must be a BlockDriverStateGSPtr.
> > > 
> > > And then you have a set of functions that cast from GS to IO pointers,
> > > but not for the other way around? Or maybe getting as IO pointer would
> > > even be coupled with automatically increasing bs->in_flight?
> > > 
> > > The other option that we were looking into for this was static analysis.
> > > I had hacked up a small script to check consistency of these annotations
> > > (without covering function pointers, though) to help me with the review
> > > of Emanuele's patches that introduced them. If I understand correctly,
> > > Paolo has scripts to do the same a little bit better.
> > > 
> > > As long as we can integrate such a script in 'make check', we wouldn't
> > > necessarily have to have the churn in the C code to switch everything to
> > > new wrapper types.
> > 
> > Yes, that's the problem with the typedef Ptr approach. It would involve
> > a lot of code changes. Maybe static analysis tools are better.
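For readers skimming the wrapper-type proposal quoted above, a minimal self-contained sketch of the idea (the BlockDriverStateGSPtr/BlockDriverStateIOPtr names come from the quoted mail; the conversion helper and demo function are invented for this example):

```c
#include <assert.h>

/* Stand-in for the real BlockDriverState. */
typedef struct BlockDriverState { int x; } BlockDriverState;

/* Two distinct struct types wrapping the same pointer, so the compiler
 * can tell GS and IO usage apart. */
typedef struct { BlockDriverState *bs; } BlockDriverStateGSPtr;
typedef struct { BlockDriverState *bs; } BlockDriverStateIOPtr;

/* One-way conversion: GS code may hand out IO pointers, never the
 * other way around. */
static BlockDriverStateIOPtr bdrv_gs_to_io(BlockDriverStateGSPtr p)
{
    return (BlockDriverStateIOPtr){ p.bs };
}

/* A GS-only API: passing a BlockDriverStateIOPtr here is a compile
 * error, which is the whole point of the scheme. */
static void bdrv_gs_only_op(BlockDriverStateGSPtr p)
{
    p.bs->x = 1;
}

static int run_demo(void)
{
    BlockDriverState bs = { 0 };
    BlockDriverStateGSPtr gs = { &bs };

    bdrv_gs_only_op(gs);                      /* OK: GS pointer */
    BlockDriverStateIOPtr io = bdrv_gs_to_io(gs);
    /* bdrv_gs_only_op(io);  <- would not compile: wrong struct type */
    (void)io;
    return bs.x;
}
```

The downside raised in the thread is visible even in this sketch: every existing caller would have to be converted to the wrapper types, which is exactly the churn that makes static analysis look attractive instead.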
> > 
> > > 
> > > > Possible steps for AioContext removal
> > > > -------------------------------------
> > > > I also wanted to share my assumptions about multi-queue and AioContext removal.
> > > > Please let me know if anything seems wrong or questionable:
> > > > 
> > > > - IO code can execute in any thread that has an AioContext.
> > > > - Multiple threads may execute IO code at the same time.
> > > > - GS code only executes under the BQL.
> > > > 
> > > > For AioContext removal this means:
> > > > 
> > > > - bdrv_get_aio_context() becomes mostly meaningless since there is no need for
> > > >   a special "home" AioContext.
> > > > - bdrv_coroutine_enter() becomes mostly meaningless because there is no need to
> > > >   run a coroutine in the BDS's AioContext.
> > > > - aio_disable_external(bdrv_get_aio_context(bs)) no longer works because many
> > > >   threads/AioContexts may submit new I/O requests. BlockDevOps.drained_begin()
> > > >   may be used instead (e.g. to temporarily disable ioeventfds on a multi-queue
> > > >   virtio-blk device).
> > > > - AIO_WAIT_WHILE() simplifies to
> > > > 
> > > >     while ((cond)) {
> > > >         aio_poll(qemu_get_current_aio_context(), true);
> > > >         ...
> > > >     }
> > > 
> > > Probably not exactly, because you still need aio_wait_kick() to wake up
> > > the thread. We use AioWait.num_waiters != 0 to decide whether we need to
> > > schedule a BH in the main thread (because doing so unconditionally for
> > > every completed request would be very wasteful).
> > > 
> > > If you want to be able to run this loop from any thread instead of just
> > > the main thread, you would have to store somewhere which thread to wake.
> > 
> > qemu_cond_broadcast() can be used since the wakeup is a slow path.
> 
> I don't think we have a QemuCond for waking up a blocking aio_poll()?
> Doesn't this usually involve writing to the event notifier file
> descriptor of the specific AioContext?

You're right, I forgot that the waiter is sitting in their own
aio_poll() so we cannot use qemu_cond_wait(). Another mechanism would be
necessary.
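To illustrate why a QemuCond does not fit here, a small pthread model of the alternative mechanism being alluded to (all names are invented for this sketch; the real code would kick the AioContext's event notifier via aio_notify()): the waiter sits in its own poll() loop, standing in for a blocking aio_poll(), so the completing thread wakes it by writing to a file descriptor rather than signalling a condition variable.

```c
#include <assert.h>
#include <poll.h>
#include <pthread.h>
#include <unistd.h>

static int notify_fd[2];            /* [0] = read end polled by waiter */
static _Atomic int request_done;

static void *completer(void *arg)
{
    (void)arg;
    usleep(10000);                  /* simulate request completion */
    request_done = 1;
    char c = 0;
    ssize_t n = write(notify_fd[1], &c, 1);   /* the "kick" */
    (void)n;
    return NULL;
}

static int wait_for_completion(void)
{
    struct pollfd pfd = { .fd = notify_fd[0], .events = POLLIN };
    while (!request_done) {         /* AIO_WAIT_WHILE-style condition */
        poll(&pfd, 1, -1);          /* blocks like aio_poll(ctx, true) */
        char c;
        ssize_t n = read(notify_fd[0], &c, 1);
        (void)n;
    }
    return 0;
}

static int run_demo(void)
{
    pthread_t t;
    if (pipe(notify_fd) != 0) {
        return -1;
    }
    pthread_create(&t, NULL, completer, NULL);
    int r = wait_for_completion();
    pthread_join(t, NULL);
    close(notify_fd[0]);
    close(notify_fd[1]);
    return r;
}
```

A qemu_cond_wait() would require the waiter to block on the condition variable, but the waiter here must keep servicing its own event loop, hence the fd-based kick.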

> > > >   and the distinction between home AioContext and non-home context is
> > > >   eliminated. AioContext unlocking is dropped.
> > > > 
> > > > Does this make sense? I haven't seen these things in recent patch series.
> > > 
> > > Intuitively I would agree with most. There may be details that I'm not
> > > aware of at the moment. I'm not surprised that we haven't seen any such
> > > things in recent patch series because there is an awful lot of
> > > preparational work to be done before we can reach this final state.
> > 
> > Yes. I'm mostly hoping to find out whether my view in fundamentally
> > flawed or very different from what you and others think.
> 
> No, not fundamentally.
> 
> The big challenge in my mind is how to verify that the conditions are
> actually met. I'm fairly sure that using drain this way is by far too
> complex to rely on review only.

I agree there needs to be some kind of automated checker because relying on
code review won't work.

Stefan
Paolo Bonzini May 24, 2022, 7:55 a.m. UTC | #28
On 5/22/22 17:06, Stefan Hajnoczi wrote:
> However, I hit on a problem that I think Emanuele and Paolo have already
> pointed out: draining is GS & IO. This might have worked under the 1 IOThread
> model but it does not make sense for multi-queue. It is possible to submit I/O
> requests in drained sections. How can multiple threads be in drained sections
> simultaneously and possibly submit further I/O requests in their drained
> sections? Those sections wouldn't be "drained" in any useful sense of the word.

Yeah, that works only if the drained sections are well-behaved.

"External" sources of I/O are fine; they are disabled using is_external, 
and don't drain themselves I think.

Mirror is the only I/O user of drain, and it's fine because it never 
submits I/O to the drained BDS.

Drained sections in the main thread can be special cased to allow I/O 
(wrlock in this series would also allow I/O).

So I think that the "cooperation from all relevant places" that Kevin 
mentioned is already there, except for coroutine commands in the 
monitor.  Those are a bad idea in my opinion and I'd rather revert 
commit eb94b81a94 ("block: Convert 'block_resize' to coroutine") until 
we have a clearer idea of how to handle them.

I agree that it's basically impossible to review the change.  On the 
other hand, there's already a substantial amount of faith involved in 
the correctness of the current code.

In particular the AioContext lock does absolutely nothing to protect 
coroutines in the main thread against graph changes---both from the
monitor (including BHs as in "block: Fix BB.root changing across 
bdrv_next()") and from BDS coroutines.  The former are unprotected; the 
latter are protected by drain only: using drain to protect against graph 
writes would be a matter of extending *existing* faith to the 
multi-iothread case.

Once the deadlock is broken, we can proceed to remove the AioContext 
lock and then introduce actual coroutine-based locking.

> Possible steps for AioContext removal
> -------------------------------------
> I also wanted to share my assumptions about multi-queue and AioContext removal.
> Please let me know if anything seems wrong or questionable:
> 
> - IO code can execute in any thread that has an AioContext.
> - Multiple threads may execute IO code at the same time.
> - GS code only executes under the BQL.
> 
> For AioContext removal this means:
> 
> - bdrv_get_aio_context() becomes mostly meaningless since there is no need for
>    a special "home" AioContext.

Correct.  bdrv_set_aio_context() remains useful as a way to set a home 
AioContext for sockets.

> - bdrv_coroutine_enter() becomes mostly meaningless because there is no need to
>    run a coroutine in the BDS's AioContext.
> - aio_disable_external(bdrv_get_aio_context(bs)) no longer works because many
>    threads/AioContexts may submit new I/O requests. BlockDevOps.drained_begin()
>    may be used instead (e.g. to temporarily disable ioeventfds on a multi-queue
>    virtio-blk device).

This is a change that can be done independent of this work.

> - AIO_WAIT_WHILE() simplifies to
> 
>      while ((cond)) {
>          aio_poll(qemu_get_current_aio_context(), true);
>          ...
>      }
> 
>    and the distinction between home AioContext and non-home context is
>    eliminated. AioContext unlocking is dropped.

(I'll reply on this from elsewhere in the thread).

> Does this make sense? I haven't seen these things in recent patch series.

I agree, and yeah all these are blocked on protecting graph modifications.

In parallel to the block layer discussions, it's possible to work on 
introducing a request queue lock in virtio-blk and virtio-scsi.  That's 
the only thing that relies on the AioContext lock outside the block layer.

Paolo
Stefan Hajnoczi May 24, 2022, 8:08 a.m. UTC | #29
On Tue, May 24, 2022 at 09:55:39AM +0200, Paolo Bonzini wrote:
> On 5/22/22 17:06, Stefan Hajnoczi wrote:
> > However, I hit on a problem that I think Emanuele and Paolo have already
> > pointed out: draining is GS & IO. This might have worked under the 1 IOThread
> > model but it does not make sense for multi-queue. It is possible to submit I/O
> > requests in drained sections. How can multiple threads be in drained sections
> > simultaneously and possibly submit further I/O requests in their drained
> > sections? Those sections wouldn't be "drained" in any useful sense of the word.
> 
> Yeah, that works only if the drained sections are well-behaved.
> 
> "External" sources of I/O are fine; they are disabled using is_external, and
> don't drain themselves I think.

I/O requests for a given BDS may be executing in multiple AioContexts,
so how do you call aio_disable_external() on all relevant AioContexts?

We covered this below but I wanted to reply here in case someone else
reads this part without reading the rest.

> Mirror is the only I/O user of drain, and it's fine because it never submits
> I/O to the drained BDS.
>
> Drained sections in the main thread can be special cased to allow I/O
> (wrlock in this series would also allow I/O).
> 
> So I think that the "cooperation from all relevant places" that Kevin
> mentioned is already there, except for coroutine commands in the monitor.
> Those are a bad idea in my opinion and I'd rather revert commit eb94b81a94
> ("block: Convert 'block_resize' to coroutine") until we have a clearer idea
> of how to handle them.
> 
> I agree that it's basically impossible to review the change.  On the other
> hand, there's already a substantial amount of faith involved in the
> correctness of the current code.
> 
> In particular the AioContext lock does absolutely nothing to protect
> coroutines in the main thread against graph changes---both from the monitor
> (including BHs as in "block: Fix BB.root changing across bdrv_next()") and
> from BDS coroutines.  The former are unprotected; the latter are protected
> by drain only: using drain to protect against graph writes would be a matter
> of extending *existing* faith to the multi-iothread case.
> 
> Once the deadlock is broken, we can proceed to remove the AioContext lock
> and then introduce actual coroutine-based locking.
> 
> > Possible steps for AioContext removal
> > -------------------------------------
> > I also wanted to share my assumptions about multi-queue and AioContext removal.
> > Please let me know if anything seems wrong or questionable:
> > 
> > - IO code can execute in any thread that has an AioContext.
> > - Multiple threads may execute IO code at the same time.
> > - GS code only executes under the BQL.
> > 
> > For AioContext removal this means:
> > 
> > - bdrv_get_aio_context() becomes mostly meaningless since there is no need for
> >    a special "home" AioContext.
> 
> Correct.  bdrv_set_aio_context() remains useful as a way to set a home
> AioContext for sockets.
> 
> > - bdrv_coroutine_enter() becomes mostly meaningless because there is no need to
> >    run a coroutine in the BDS's AioContext.
> > - aio_disable_external(bdrv_get_aio_context(bs)) no longer works because many
> >    threads/AioContexts may submit new I/O requests. BlockDevOps.drained_begin()
> >    may be used instead (e.g. to temporarily disable ioeventfds on a multi-queue
> >    virtio-blk device).
> 
> This is a change that can be done independent of this work.
> 
> > - AIO_WAIT_WHILE() simplifies to
> > 
> >      while ((cond)) {
> >          aio_poll(qemu_get_current_aio_context(), true);
> >          ...
> >      }
> > 
> >    and the distinction between home AioContext and non-home context is
> >    eliminated. AioContext unlocking is dropped.
> 
> (I'll reply on this from elsewhere in the thread).
> 
> > Does this make sense? I haven't seen these things in recent patch series.
> 
> I agree, and yeah all these are blocked on protecting graph modifications.
> 
> In parallel to the block layer discussions, it's possible to work on
> introducing a request queue lock in virtio-blk and virtio-scsi.  That's the
> only thing that relies on the AioContext lock outside the block layer.

I'm not sure what the request queue lock protects in virtio-blk? In
virtio-scsi I guess a lock is needed to protect SCSI target emulation
state?

Stefan
Paolo Bonzini May 24, 2022, 9:17 a.m. UTC | #30
On 5/24/22 10:08, Stefan Hajnoczi wrote:
> On Tue, May 24, 2022 at 09:55:39AM +0200, Paolo Bonzini wrote:
>> On 5/22/22 17:06, Stefan Hajnoczi wrote:
>>> However, I hit on a problem that I think Emanuele and Paolo have already
>>> pointed out: draining is GS & IO. This might have worked under the 1 IOThread
>>> model but it does not make sense for multi-queue. It is possible to submit I/O
>>> requests in drained sections. How can multiple threads be in drained sections
>>> simultaneously and possibly submit further I/O requests in their drained
>>> sections? Those sections wouldn't be "drained" in any useful sense of the word.
>>
>> Yeah, that works only if the drained sections are well-behaved.
>>
>> "External" sources of I/O are fine; they are disabled using is_external, and
>> don't drain themselves I think.
> 
> I/O requests for a given BDS may be executing in multiple AioContexts,
> so how do you call aio_disable_external() on all relevant AioContexts?

With multiqueue yeah, we have to replace aio_disable_external() with 
drained_begin/end() callbacks; but I'm not talking about that yet.

>> In parallel to the block layer discussions, it's possible to work on
>> introducing a request queue lock in virtio-blk and virtio-scsi.  That's the
>> only thing that relies on the AioContext lock outside the block layer.
> 
> I'm not sure what the request queue lock protects in virtio-blk? In
> virtio-scsi I guess a lock is needed to protect SCSI target emulation
> state?

Yes, but even in virtio-blk there is this code that runs in the main 
thread and is currently protected by aio_context_acquire/release:

     blk_drain(s->blk);

     /* We drop queued requests after blk_drain() because blk_drain()
      * itself can produce them. */
     while (s->rq) {
         req = s->rq;
         s->rq = req->next;
         virtqueue_detach_element(req->vq, &req->elem, 0);
         virtio_blk_free_request(req);
     }

Maybe it's safe to run it without a lock because it runs after 
virtio_set_status(vdev, 0) but I'd rather play it safe and protect s->rq 
with a lock.

Paolo
Stefan Hajnoczi May 24, 2022, 10:20 a.m. UTC | #31
On Tue, May 24, 2022 at 11:17:06AM +0200, Paolo Bonzini wrote:
> On 5/24/22 10:08, Stefan Hajnoczi wrote:
> > On Tue, May 24, 2022 at 09:55:39AM +0200, Paolo Bonzini wrote:
> > > On 5/22/22 17:06, Stefan Hajnoczi wrote:
> > > > However, I hit on a problem that I think Emanuele and Paolo have already
> > > > pointed out: draining is GS & IO. This might have worked under the 1 IOThread
> > > > model but it does not make sense for multi-queue. It is possible to submit I/O
> > > > requests in drained sections. How can multiple threads be in drained sections
> > > > simultaneously and possibly submit further I/O requests in their drained
> > > > sections? Those sections wouldn't be "drained" in any useful sense of the word.
> > > 
> > > Yeah, that works only if the drained sections are well-behaved.
> > > 
> > > "External" sources of I/O are fine; they are disabled using is_external, and
> > > don't drain themselves I think.
> > 
> > I/O requests for a given BDS may be executing in multiple AioContexts,
> > so how do you call aio_disable_external() on all relevant AioContexts?
> 
> With multiqueue yeah, we have to replace aio_disable_external() with
> drained_begin/end() callbacks; but I'm not talking about that yet.
> 
> > > In parallel to the block layer discussions, it's possible to work on
> > > introducing a request queue lock in virtio-blk and virtio-scsi.  That's the
> > > only thing that relies on the AioContext lock outside the block layer.
> > 
> > I'm not sure what the request queue lock protects in virtio-blk? In
> > virtio-scsi I guess a lock is needed to protect SCSI target emulation
> > state?
> 
> Yes, but even in virtio-blk there is this code that runs in the main thread
> and is currently protected by aio_context_acquire/release:
> 
>     blk_drain(s->blk);
> 
>     /* We drop queued requests after blk_drain() because blk_drain()
>      * itself can produce them. */
>     while (s->rq) {
>         req = s->rq;
>         s->rq = req->next;
>         virtqueue_detach_element(req->vq, &req->elem, 0);
>         virtio_blk_free_request(req);
>     }
> 
> Maybe it's safe to run it without a lock because it runs after
> virtio_set_status(vdev, 0) but I'd rather play it safe and protect s->rq
> with a lock.

What does the lock protect?

A lock can prevent s->rq or req->vq corruption but it cannot prevent
request leaks. This loop's job is to free all requests so there is no
leak. If a lock is necessary then this code is already broken in a more
fundamental way because it can leak.

Stefan
Kevin Wolf May 24, 2022, 10:36 a.m. UTC | #32
Am 18.05.2022 um 14:43 hat Paolo Bonzini geschrieben:
> On 5/18/22 14:28, Emanuele Giuseppe Esposito wrote:
> > For example, all callers of bdrv_open() always take the AioContext lock.
> > Often it is taken very high in the call stack, but it's always taken.
> 
> I think it's actually not a problem of who takes the AioContext lock or
> where; the requirements are contradictory:

Okay, now that I have explained which challenges I see with the drain
solution, I'd also like to understand the problem that even motivates
you to go back to drains.

> * IO_OR_GS_CODE() functions, when called from coroutine context, expect to
> be called with the AioContext lock taken (example: bdrv_co_yield_to_drain)

Am I right to say this is not inherently part of the definition of
IO_OR_GS_CODE(), but just a property that these functions have in
practice?

In practice, the functions that are IO_OR_GS_CODE() are those that call
AIO_WAIT_WHILE() - which drops the lock, so it must have been taken
first. Of course, when calling from coroutine context, AIO_WAIT_WHILE()
is wrong, so these functions all have a different code path for
coroutines (or they aren't suitable to be called in coroutines at all).

Using a different code path means that the restrictions from
AIO_WAIT_WHILE() don't really apply any more and these functions become
effectively IO_CODE() when called in a coroutine. (At least I'm not
aware of any other piece of code apart from AIO_WAIT_WHILE() that makes
a function IO_OR_GS_CODE().)

Surrounding IO_CODE() with aio_context_acquire/release() is in the
definition of IO_CODE(), so your assumption seems right. (Not sure if
it's actually necessary in all cases and whether all callers do this
correctly, but with this definition we have declared that exceptions to
this are in fact bugs.)

(You mention bdrv_co_yield_to_drain() as an example. I don't think it's
a good example. The function isn't annotated, but it seems to me that
the correct annotation would be IO_CODE() anyway, i.e. safe to call from
any thread, not just IO_OR_GS_CODE().)

> * to call these functions with the lock taken, the code has to run in the
> BDS's home iothread.  Attempts to do otherwise results in deadlocks (the
> main loop's AIO_WAIT_WHILEs expect progress from the iothread, that cannot
> happen without releasing the aiocontext lock)

This problem can't happen in the main thread itself, AIO_WAIT_WHILE() is
safe both in the home thread and the main thread (at least as long as
you lock only once) because it temporarily drops the lock. It has also
become the definition of IO_OR_GS_CODE(): This code has to run in the
home thread or the main thread.


Of course, above I just concluded that when called from coroutines, in
practice IO_OR_GS_CODE() essentially turns into IO_CODE(). This is
supposed to allow much more:

 * I/O API functions. These functions are thread-safe, and therefore
 * can run in any thread as long as the thread has called
 * aio_context_acquire/release().

Come to think of it, I believe that many of the functions we declared
IO_CODE() are actually just IO_OR_GS_CODE() (at best; for iothreads,
they certainly require running in the home thread, but the main thread
allowed by IO_OR_GS_CODE() might not work). We have all the coroutine
machinery so that the AioContext lock of the current thread is
automatically released and reacquired across yield. However, this is the
wrong AioContext when called from a different thread, so we need an
additional lock - which should be dropped during yield, too, but it
doesn't happen.

Maybe this is really the scenario you mean with this point?

Switching to drain for locking doesn't solve the problem, but only
possibly defer it. In order to complete the multiqueue work, we need
to make IO_CODE() functions actually conform to the definition of
IO_CODE().

Do we have a plan what this should look like in the final state when all
the multiqueue work is completed? Will it require replacing the more or
less automatic AioContext lock handling that we currently have in
coroutines with explicit unlock/lock around all yield points? I assume
that some kind of lock will still have to be held and it wouldn't just
disappear with the removal of the AioContext lock? Or will we only have
coroutine locks which don't have this problem?

> * running the code in the BDS's home iothread is not possible for
> GLOBAL_STATE_CODE() functions (unless the BDS home iothread is the main
> thread, but that cannot be guaranteed in general)

There is nothing that stops GLOBAL_STATE_CODE() from scheduling work in
the home iothread of a BDS and then waiting for it - or if it is a
coroutine, even to reschedule itself into the BDS home thread
temporarily.

This is essentially what all of the generated_co_wrapper functions do,
and the coroutine case is what bdrv_co_enter() does.

So I'm not sure what you mean by this?

Kevin
Kevin Wolf May 24, 2022, 12:10 p.m. UTC | #33
Am 18.05.2022 um 14:28 hat Emanuele Giuseppe Esposito geschrieben:
> label: // read till the end to see why I wrote this here
> 
> I was hoping someone from the "No" party would answer to your question,
> because as you know we reached this same conclusion together.
> 
> We thought about dropping the drain for various reasons, the main one
> (at least as far as I understood) is that we are not sure if something
> can still happen in between drain_begin/end, and it is a little bit
> confusing to use the same mechanism to block I/O and protect the graph.
> 
> We then thought about implementing a rwlock. A rdlock would clarify what
> we are protecting and who is using the lock. I had a rwlock draft
> implementation sent in this thread, but this also lead to additional
> problems.
> Main problem was that this new lock would introduce nested event loops,
> that together with such locking would just create deadlocks.
> If readers are in coroutines and writers are not (because graph
> operations are not running in coroutines), we have a lot of deadlocks.
> If a writer has to take the lock, it must wait for all other readers to
> finish. And it does it by internally calling AIO_WAIT_WHILE, creating
> nested event loop. We don't know what could execute when polling for
> events, and for example another writer could be resumed.

Why is this a problem? Your AIO_WAIT_WHILE() condition would be that
there are neither readers nor writers, so you would just keep waiting
until the other writer is done.
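For context on the rwlock draft being discussed, here is a toy pthread model of the per-context reader-count design from the cover letter (inspired by cpus-common.c): each "AioContext" bumps its own counter so readers do not bounce a shared cacheline, and the writer blocks new readers and then waits for the sum of all counters to reach zero. All names are invented for the sketch; the version debated in this thread is coroutine-based and waits by yielding into a CoQueue rather than by blocking on a condition variable.

```c
#include <assert.h>
#include <pthread.h>

#define NCTX 4

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    int reader_count[NCTX];     /* one counter per "AioContext" */
    int writer_active;
} GraphLock;

/* Sum all per-context counters; this is the O(n) walk that makes
 * the readable assertion expensive. Call with gl->lock held. */
static int total_readers(GraphLock *gl)
{
    int sum = 0;
    for (int i = 0; i < NCTX; i++) {
        sum += gl->reader_count[i];
    }
    return sum;
}

static void graph_rdlock(GraphLock *gl, int ctx)
{
    pthread_mutex_lock(&gl->lock);
    while (gl->writer_active) {         /* readers queue while writing */
        pthread_cond_wait(&gl->cond, &gl->lock);
    }
    gl->reader_count[ctx]++;
    pthread_mutex_unlock(&gl->lock);
}

static void graph_rdunlock(GraphLock *gl, int ctx)
{
    pthread_mutex_lock(&gl->lock);
    gl->reader_count[ctx]--;
    pthread_cond_broadcast(&gl->cond);  /* may wake a waiting writer */
    pthread_mutex_unlock(&gl->lock);
}

static void graph_wrlock(GraphLock *gl)
{
    pthread_mutex_lock(&gl->lock);
    gl->writer_active = 1;              /* block incoming readers */
    while (total_readers(gl) > 0) {     /* wait for current readers */
        pthread_cond_wait(&gl->cond, &gl->lock);
    }
    pthread_mutex_unlock(&gl->lock);
}

static void graph_wrunlock(GraphLock *gl)
{
    pthread_mutex_lock(&gl->lock);
    gl->writer_active = 0;
    pthread_cond_broadcast(&gl->cond);
    pthread_mutex_unlock(&gl->lock);
}
```

The deadlock Emanuele describes is exactly what happens when graph_wrlock()'s wait is replaced by an AIO_WAIT_WHILE(): the nested event loop can run arbitrary callbacks, including ones that need the lock being waited for.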

> Ideally, we want writers in coroutines too.
>
> Additionally, many readers are running in what we call "mixed"
> functions: usually implemented automatically with generated_co_wrapper
> tag, they let a function (usually bdrv callback) run always in a
> coroutine, creating one if necessary. For example, bdrv_flush() makes
> sure that bs->bdrv_co_flush() is always run in a coroutine.
> Such mixed functions are used in other callbacks too, making it really
> difficult to understand if we are in a coroutine or not, and mostly
> important make rwlock usage very difficult.

How do they make rwlock usage difficult?

*goes back to old IRC discussions*

Ah, the problem was not the AIO_WAIT_WHILE() while taking the lock, but
taking the lock first and then running an AIO_WAIT_WHILE() inside the
locked section. This is forbidden because the callbacks that run during
AIO_WAIT_WHILE() may in turn wait for the lock that you own, causing a
deadlock.

This is indeed a problem that running in coroutines would avoid because
the inner waiter would just yield and the outer one could complete its
job as soon as it's its turn.

My conclusion in the IRC discussion was that maybe we need to take the
graph locks when we're entering coroutine context, i.e. the "mixed"
functions would rdlock the graph when called from non-coroutine context
and would assume that it's already locked when called from coroutine
context.

> Which lead us to stepping back once more and try to convert all
> BlockDriverState callbacks in coroutines. This would greatly simplify
> rwlock usage, because we could make the rwlock coroutine-friendly
> (without any AIO_WAIT_WHILE, allowing a writer to wait for readers by
> just yielding and queuing itself in coroutine queues).
> 
> First step was then to convert all callbacks in coroutines, using
> generated_coroutine_wrapper (g_c_w).
> A typical g_c_w is implemented in this way:
> 	if (qemu_in_coroutine()) {
> 		callback();
> 	} else { // much simplified
> 		co = qemu_coroutine_create(callback);
> 		bdrv_coroutine_enter(bs, co);
> 		BDRV_POLL_WHILE(bs, coroutine_in_progress);
> 	}
> Once all callbacks are implemented using g_c_w, we can start splitting
> the two sides of the if function to only create a coroutine when we are
> outside from a bdrv callback.
> 
> However, we immediately found a problem while starting to convert the
> first callbacks: the AioContext lock is taken around some non coroutine
> callbacks! For example, bs->bdrv_open() is always called with the
> AioContext lock taken. In addition, callbacks like bdrv_open are
> graph-modifying functions, which is probably why we are taking the
> Aiocontext lock, and they do not like to run in coroutines.
> Anyways, the real problem comes when we create a coroutine in such
> places where the AioContext lock is taken and we have a graph-modifying
> function.
> 
> bdrv_coroutine_enter() calls aio_co_enter(), which in turns first checks
>  if the coroutine is entering another context from the current (which is
> not the case for open) and if we are already in coroutine (for sure
> not). Therefore it resorts to the following calls;
> 	aio_context_acquire(ctx);
>         qemu_aio_coroutine_enter(ctx, co);
>         aio_context_release(ctx);
> Which is clearly a problem, because we are taking the lock twice: once
> from the original caller of the callback, and once here due to the
> coroutine. This creates a lot of deadlock situations.

What are the deadlock situations that are created by locking twice?

The only problem I'm aware of is AIO_WAIT_WHILE(), which wants to
temporarily unlock the AioContext. It calls aio_context_release() once to
achieve this, which obviously isn't enough when the context was locked
twice.

But AIO_WAIT_WHILE() isn't allowed in coroutines anyway. So how are we
running into deadlocks here?

Note that we're probably already doing this inside the .bdrv_open
implementations: They will ususally read something from the image file,
calling bdrv_preadv() which is already a generated_coroutine_wrapper
today and creates a coroutine internally with the same locking pattern
applied that you describe as problematic here.

Making .bdrv_open itself a generated_coroutine_wrapper wouldn't really
change anything fundamental, it would just pull the existing mechanism
one function higher in the call stack.

> For example, all callers of bdrv_open() always take the AioContext lock.
> Often it is taken very high in the call stack, but it's always taken.
> 
> Getting rid of the lock around qemu_aio_coroutine_enter() is difficult
> too, because coroutines expect to have the lock taken. For example, if
> we want to drain from a coroutine, bdrv_co_yield_to_drain releases the
> lock for us.

It's not difficult at all in your case where you know that you're
already in the right thread and the lock is taken: You can call
qemu_aio_coroutine_enter() directly instead of bdrv_coroutine_enter() in
this case.

But as I said, I'm not sure why we need to get rid of it at all.

Kevin
Paolo Bonzini May 24, 2022, 5:25 p.m. UTC | #34
On 5/24/22 12:20, Stefan Hajnoczi wrote:
>> Maybe it's safe to run it without a lock because it runs after
>> virtio_set_status(vdev, 0) but I'd rather play it safe and protect s->rq
>> with a lock.
>
> What does the lock protect?
> 
> A lock can prevent s->rq or req->vq corruption but it cannot prevent
> request leaks. This loop's job is to free all requests so there is no
> leak. If a lock is necessary then this code is already broken in a more
> fundamental way because it can leak.

Yes, you're right.  This particular list is always accessed in the 
iothread (if any) and blk_drain() is enough.  virtio-blk should already 
not need aio_context_{acquire,release}.

It's worth a comment, though!

Paolo
Paolo Bonzini May 25, 2022, 7:41 a.m. UTC | #35
On 5/24/22 12:36, Kevin Wolf wrote:
>> * IO_OR_GS_CODE() functions, when called from coroutine context, expect to
>> be called with the AioContext lock taken (example: bdrv_co_yield_to_drain)
> 
> Am I right to say this is not inherently part of the definition of
> IO_OR_GS_CODE(), but just a property that these functions have in
> practice? In practice, the functions that are IO_OR_GS_CODE()
> are those that call AIO_WAIT_WHILE().

Yes.

> Using a different code path means that the restrictions from
> AIO_WAIT_WHILE() don't really apply any more and these functions become
> effectively IO_CODE() when called in a coroutine. (At least I'm not
> aware of any other piece of code apart from AIO_WAIT_WHILE() that makes
> a function IO_OR_GS_CODE().)

The second point is correct.

The first depends on the definition of the coroutine path.  For example, 
bdrv_co_yield_to_drain() expects to run with bdrv_get_aio_context(bs)'s 
lock taken.  An iothread cannot take another iothread's AioContext lock 
to avoid AB-BA deadlocks; therefore, bdrv_co_yield_to_drain() can only 
run in the main thread or in bs's home iothread.
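The AB-BA constraint mentioned above can be shown in a couple of lines. This is a toy Python sketch, not QEMU code, and both "iothreads" are simulated sequentially in one thread; the non-blocking acquires stand in for the point where each real thread would block forever:

```python
import threading

lock_a, lock_b = threading.Lock(), threading.Lock()
lock_a.acquire()     # iothread 1 holds its own AioContext lock (A)
lock_b.acquire()     # iothread 2 holds its own AioContext lock (B)
# Each thread now wants the other's lock; neither can ever get it:
assert not lock_b.acquire(blocking=False)   # iothread 1 would block on B
assert not lock_a.acquire(blocking=False)   # iothread 2 would block on A
```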

>> * to call these functions with the lock taken, the code has to run in the
>> BDS's home iothread.  Attempts to do otherwise results in deadlocks (the
>> main loop's AIO_WAIT_WHILEs expect progress from the iothread, that cannot
>> happen without releasing the aiocontext lock)
> 
> This problem can't happen in the main thread itself, AIO_WAIT_WHILE() is
> safe both in the home thread and the main thread (at least as long as
> you lock only once) because it temporarily drops the lock. It has also
> become the definition of IO_OR_GS_CODE(): This code has to run in the
> home thread or the main thread.

It doesn't work to run bdrv_open_driver() in the home iothread because 
bdrv_open_driver() can change the AioContext of a BDS (causing extreme 
confusion about what runs where and which AioContext locks have to be 
taken and released).

So in order to have bdrv_open_driver() run in the main thread, Emanuele 
added a variant of generated_co_wrapper that waits on the main thread. 
But it didn't work either, because then AIO_WAIT_WHILE() does not 
release the BDS's AioContext lock.

When ->bdrv_open() calls bdrv_replace_child_noperm() and it tries to 
drain, bdrv_co_yield_to_drain() schedules a bottom half on the iothread 
and yields; the bottom half can never run, because the AioContext lock 
is already taken elsewhere.

	main thread				ctx home thread
	aio_context_acquire(ctx)
	bdrv_open()
	  drv->bdrv_co_open()
	    bdrv_replace_child_noperm()
	      bdrv_co_yield_to_drain()
	        aio_context_release(ctx)
		aio_schedule_bh_oneshot()

So, bdrv_open_driver() and bdrv_close() are un-coroutin-izable.  I guess 
we could split the callback into two parts, one doing I/O and one doing 
graph manipulation (it may even be a good idea---the ->bdrv_inactivate() 
callback already exists in the case of bdrv_close()---but I'm not 
sure it's always applicable).
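The trace above boils down to something a toy model can capture (plain Python with an invented ToyAioContext type, not QEMU code): bottom halves on the context's home thread can only be dispatched while the AioContext lock is free, so a main thread that schedules one while holding the lock waits forever.

```python
import threading
import queue

class ToyAioContext:
    def __init__(self):
        self.lock = threading.Lock()
        self.bh_queue = queue.Queue()

    def schedule_bh_oneshot(self, fn):
        # models aio_schedule_bh_oneshot()
        self.bh_queue.put(fn)

    def try_dispatch_bhs(self):
        # The home thread needs the AioContext lock to run bottom halves.
        if not self.lock.acquire(blocking=False):
            return 0              # lock held elsewhere: no progress possible
        try:
            n = 0
            while not self.bh_queue.empty():
                self.bh_queue.get()()
                n += 1
            return n
        finally:
            self.lock.release()

ctx = ToyAioContext()
done = []

ctx.lock.acquire()                    # main thread: aio_context_acquire(ctx)
ctx.schedule_bh_oneshot(lambda: done.append(True))  # yield_to_drain's BH
assert ctx.try_dispatch_bhs() == 0    # home thread cannot run the BH
ctx.lock.release()                    # only aio_context_release() unblocks it
assert ctx.try_dispatch_bhs() == 1 and done == [True]
```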

> Come to think of it, I believe that many of the functions we declared
> IO_CODE() are actually just IO_OR_GS_CODE() (at best; for iothreads,
> they certainly require running in the home thread, but the main thread
> allowed by IO_OR_GS_CODE() might not work). We have all the coroutine
> machinery so that the AioContext lock of the current thread is
> automatically released and reacquired across yield. However, this is the
> wrong AioContext when called from a different thread, so we need an
> additional lock - which should be dropped during yield, too, but it
> doesn't happen.

There's no need for additional locks.  Drivers have to be protected by 
individual locks, which are either QemuMutex and dropped during yield, 
or CoMutex.  A QemuMutex used to protect a CoQueue is also dropped 
safely during yield.
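The CoQueue property relied on here can be sketched with a toy cooperative model (plain Python, invented names; qemu_co_queue_wait() is the real API being imitated): the waiter releases the mutex as part of going to sleep, so the side that wakes it can always acquire the mutex and make progress.

```python
from collections import deque

class ToyCoQueue:
    def __init__(self):
        self.waiters = deque()

# A toy stand-in for the protecting mutex: records who holds it, if anyone.
mutex_held = {"by": None}

def co_queue_wait(q, co):
    # models qemu_co_queue_wait(&q, &mutex): drop the mutex, then sleep
    assert mutex_held["by"] == co
    mutex_held["by"] = None
    q.waiters.append(co)

def co_queue_next(q):
    # models qemu_co_queue_next(): wake the first sleeping coroutine
    return q.waiters.popleft() if q.waiters else None

q = ToyCoQueue()
mutex_held["by"] = "reader"
co_queue_wait(q, "reader")          # reader sleeps, mutex is free again
assert mutex_held["by"] is None     # so the writer can take the mutex
mutex_held["by"] = "writer"
assert co_queue_next(q) == "reader" # and wake the sleeping reader
```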

The IO_CODE() distinction is indeed mostly theoretical as long as the 
file I/O BlockDriver protocols are protected by the AioContext lock and 
are therefore actually IO_OR_GS_CODE().  But that's the least of our 
worries; if file-posix AIO is done on qemu_get_current_aio_context() 
instead of bdrv_get_aio_context(bs), the AioContext lock is not needed 
anymore for I/O.

Apart from file I/O, all drivers were thread-safe at some point (now it 
seems blklogwrites.c is not, but the code also seems not safe against 
power loss and I wonder if it should be just deleted).

> Switching to drain for locking doesn't solve the problem, but only
> possibly defer it. In order to complete the multiqueue work, we need
> to make IO_CODE() functions actually conform to the definition of
> IO_CODE(). Do we have a plan what this should look like in the final
> state when all the multiqueue work is completed? Or will we only have
> coroutine locks which don't have this problem?

The latter, apart from the occasional QemuMutex, which is used only when 
it does not cause the problem.

>> * running the code in the BDS's home iothread is not possible for
>> GLOBAL_STATE_CODE() functions (unless the BDS home iothread is the main
>> thread, but that cannot be guaranteed in general)
> 
> There is nothing that stops GLOBAL_STATE_CODE() from scheduling work in
> the home iothread of a BDS and then waiting for it - or if it is a
> coroutine, even to reschedule itself into the BDS home thread
> temporarily.

There are still things that an I/O thread cannot do, for example 
bdrv_set_aio_context().  Even if it worked, I don't think it would be a 
good idea: it would basically mean a "movable" main thread, which is 
even harder to reason about.

So that's how I got to the point where I don't think it's possible to 
proceed with the idea of coroutin-izing more code (a prerequisite for a 
coroutine-based graph rwlock) without first removing the AioContext 
lock.  But removing the AioContext lock requires a way to make the graph 
operations thread-safe, and then we go back to draining.

Paolo
Emanuele Giuseppe Esposito May 25, 2022, 8:27 a.m. UTC | #36
Am 24/05/2022 um 14:10 schrieb Kevin Wolf:
> Am 18.05.2022 um 14:28 hat Emanuele Giuseppe Esposito geschrieben:
>> label: // read till the end to see why I wrote this here
>>
>> I was hoping someone from the "No" party would answer to your question,
>> because as you know we reached this same conclusion together.
>>
>> We thought about dropping the drain for various reasons, the main one
>> (at least as far as I understood) is that we are not sure if something
>> can still happen in between drain_begin/end, and it is a little bit
>> confusing to use the same mechanism to block I/O and protect the graph.
>>
>> We then thought about implementing a rwlock. A rdlock would clarify what
>> we are protecting and who is using the lock. I sent a draft rwlock
>> implementation in this thread, but this also led to additional
>> problems.
>> The main problem was that this new lock would introduce nested event
>> loops, which together with such locking would just create deadlocks.
>> If readers are in coroutines and writers are not (because graph
>> operations are not running in coroutines), we get a lot of deadlocks.
>> If a writer has to take the lock, it must wait for all other readers to
>> finish. It does so by internally calling AIO_WAIT_WHILE, creating a
>> nested event loop. We don't know what could execute when polling for
>> events, and for example another writer could be resumed.
> 
> Why is this a problem? Your AIO_WAIT_WHILE() condition would be that
> there are neither readers nor writers, so you would just keep waiting
> until the other writer is done.

Yes, but when we get to the AIO_WAIT_WHILE() condition, the wrlock has
already been taken by the current writer.
I think that's what you also mean below.
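The situation can be reduced to a toy sketch (plain Python with invented names, not the actual draft lock): once the first writer has marked the lock taken and starts polling in AIO_WAIT_WHILE(), any second writer that the poll resumes finds the lock already held and can never proceed.

```python
class ToyGraphLock:
    def __init__(self):
        self.readers = 0
        self.writer = False

    def try_wrlock(self):
        # A writer needs exclusive access: no readers, no other writer.
        if self.writer or self.readers:
            return False
        self.writer = True
        return True

lock = ToyGraphLock()
assert lock.try_wrlock()       # first writer takes the lock ...
# ... and enters AIO_WAIT_WHILE(); polling may resume a second writer:
assert not lock.try_wrlock()   # which can never succeed -> deadlock
```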

> 
>> Ideally, we want writers in coroutines too.
>>
>> Additionally, many readers are running in what we call "mixed"
>> functions: usually implemented automatically with the
>> generated_co_wrapper tag, they let a function (usually a bdrv callback)
>> always run in a coroutine, creating one if necessary. For example,
>> bdrv_flush() makes sure that bs->bdrv_co_flush() is always run in a
>> coroutine.
>> Such mixed functions are used in other callbacks too, making it really
>> difficult to understand if we are in a coroutine or not, and, most
>> importantly, making rwlock usage very difficult.
> 
> How do they make rwlock usage difficult?
> 
> *goes back to old IRC discussions*
> 
> Ah, the problem was not the AIO_WAIT_WHILE() while taking the lock, but
> taking the lock first and then running an AIO_WAIT_WHILE() inside the
> locked section. This is forbidden because the callbacks that run during
> AIO_WAIT_WHILE() may in turn wait for the lock that you own, causing a
> deadlock.
> 

Yes

> This is indeed a problem that running in coroutines would avoid because
> the inner waiter would just yield and the outer one could complete its
> job as soon as it's its turn.
> 
> My conclusion in the IRC discussion was that maybe we need to take the
> graph locks when we're entering coroutine context, i.e. the "mixed"
> functions would rdlock the graph when called from non-coroutine context
> and would assume that it's already locked when called from coroutine
> context.

Yes, and that's what I tried to do.
But the first step was to transform all callbacks into coroutines. I
think you also agree with this, correct?

And therefore the easiest step was to convert all callbacks into
generated_co_wrapper functions, so that afterwards we could split them
into coroutine and non-coroutine logic, as discussed on IRC.
Once split, we would add the lock in the way you suggested.

However, I didn't even get past this first step, because tests were
deadlocking after converting just 2-3 callbacks.

See Paolo's thread for a nice explanation of why they deadlock and why
converting these callbacks is difficult.

> 
>> Which lead us to stepping back once more and try to convert all
>> BlockDriverState callbacks in coroutines. This would greatly simplify
>> rwlock usage, because we could make the rwlock coroutine-friendly
>> (without any AIO_WAIT_WHILE, allowing a writer to wait for readers by
>> just yielding and queuing itself in coroutine queues).
>>
>> First step was then to convert all callbacks in coroutines, using
>> generated_coroutine_wrapper (g_c_w).
>> A typical g_c_w is implemented in this way:
>> 	if (qemu_in_coroutine()) {
>> 		callback();
>> 	} else { // much simplified
>> 		co = qemu_coroutine_create(callback);
>> 		bdrv_coroutine_enter(bs, co);
>> 		BDRV_POLL_WHILE(bs, coroutine_in_progress);
>> 	}
>> Once all callbacks are implemented using g_c_w, we can start splitting
>> the two sides of the if statement to only create a coroutine when we
>> are outside of a bdrv callback.
>>
>> However, we immediately found a problem while starting to convert the
>> first callbacks: the AioContext lock is taken around some non-coroutine
>> callbacks! For example, bs->bdrv_open() is always called with the
>> AioContext lock taken. In addition, callbacks like bdrv_open are
>> graph-modifying functions, which is probably why we are taking the
>> AioContext lock, and they do not like to run in coroutines.
>> Anyways, the real problem comes when we create a coroutine in such
>> places where the AioContext lock is taken and we have a graph-modifying
>> function.
>>
>> bdrv_coroutine_enter() calls aio_co_enter(), which in turn first checks
>> if the coroutine is entering a context other than the current one (which
>> is not the case for open) and if we are already in a coroutine (for sure
>> not). Therefore it resorts to the following calls:
>> 	aio_context_acquire(ctx);
>> 	qemu_aio_coroutine_enter(ctx, co);
>> 	aio_context_release(ctx);
>> This is clearly a problem, because we are taking the lock twice: once
>> in the original caller of the callback, and once here due to the
>> coroutine. This creates a lot of deadlock situations.
> 
> What are the deadlock situations that are created by locking twice?

Scratch this, and refer to Paolo's thread.
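(For context: aio_context_acquire() takes a recursive mutex, so the double acquire itself does not hang; the trouble is that code which releases the lock once, as AIO_WAIT_WHILE() does, leaves it still held. A toy Python model with invented names, not QEMU code:)

```python
class RecLock:
    """Minimal recursive lock: one owner, a hold count."""
    def __init__(self):
        self.owner = None
        self.count = 0

    def acquire(self, who):
        if self.owner not in (None, who):
            return False            # contended: someone else holds it
        self.owner = who
        self.count += 1
        return True

    def release(self, who):
        assert self.owner == who and self.count > 0
        self.count -= 1
        if self.count == 0:
            self.owner = None

ctx = RecLock()
assert ctx.acquire("main")      # caller of the callback takes the lock
assert ctx.acquire("main")      # bdrv_coroutine_enter() takes it again
ctx.release("main")             # AIO_WAIT_WHILE() releases only once ...
assert not ctx.acquire("iothread")  # ... so the iothread still can't run
```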

> 
> The only problem I'm aware of is AIO_WAIT_WHILE(), which wants to
> temporarily unlock the AioContext. It calls aio_context_release() once to
> achieve this, which obviously isn't enough when the context was locked
> twice.
> 
> But AIO_WAIT_WHILE() isn't allowed in coroutines anyway. So how are we
> running into deadlocks here?
> 
> Note that we're probably already doing this inside the .bdrv_open
> implementations: They will usually read something from the image file,
> calling bdrv_preadv() which is already a generated_coroutine_wrapper
> today and creates a coroutine internally with the same locking pattern
> applied that you describe as problematic here.
> 
> Making .bdrv_open itself a generated_coroutine_wrapper wouldn't really
> change anything fundamental, it would just pull the existing mechanism
> one function higher in the call stack.
> 
>> For example, all callers of bdrv_open() always take the AioContext lock.
>> Often it is taken very high in the call stack, but it's always taken.
>>
>> Getting rid of the lock around qemu_aio_coroutine_enter() is difficult
>> too, because coroutines expect to have the lock taken. For example, if
>> we want to drain from a coroutine, bdrv_co_yield_to_drain releases the
>> lock for us.
> 
> It's not difficult at all in your case where you know that you're
> already in the right thread and the lock is taken: You can call
> qemu_aio_coroutine_enter() directly instead of bdrv_coroutine_enter() in
> this case.
> 
> But as I said, I'm not sure why we need to get rid of it at all.
> 
> Kevin
>