Message ID: 20220426085114.199647-1-eesposit@redhat.com (mailing list archive)
Series: Removal of AioContext lock, bs->parents and ->children: new rwlock
Am 26/04/2022 um 10:51 schrieb Emanuele Giuseppe Esposito:
> Luckily, most of the cases where we recursively go through a graph are
> the BlockDriverState callback functions in block_int-common.h.
> In order to understand what to protect, I categorized the callbacks in
> block_int-common.h depending on the type of function that calls them:
>
> 1) If the caller is a generated_co_wrapper, this function must be
>    protected by rdlock. The reason is that generated_co_wrapper creates
>    coroutines that run in the given bs AioContext, so it doesn't matter
>    if we are running in the main loop or not, the coroutine might run
>    in an iothread.
> 2) If the caller calls it directly and has the GLOBAL_STATE_CODE() macro,
>    then the function is safe. The main loop is the writer and thus won't
>    read and write at the same time.
> 3) If the caller calls it directly but does not have the GLOBAL_STATE_CODE()
>    macro, then we need to check the callers and see case-by-case if the
>    caller is in the main loop, if it needs to take the lock, or delegate
>    this duty to its caller (to reduce the places where to take it).
>
> I used the vrc script (https://github.com/bonzini/vrc) to get help finding
> all the callers of a callback. Using its filter function, I can
> omit all functions protected by the added lock to avoid having duplicates
> when querying for new callbacks.

I was wondering: if a function is in category (3) and runs in an iothread,
but the function itself is not (currently) recursive, meaning it doesn't
really traverse the graph or call someone that traverses it, should I add
the rdlock anyway or not?

Example: bdrv_co_drain_end

Pros:
+ Covers the case where a new recursive callback for a new/existing
  BlockDriver is implemented in the future.
+ Covers the case where I or someone else missed the recursive part.

Cons:
- Potentially introduces an unnecessary critical section.

What do you think?

Emanuele
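For readers following along, here is an illustrative sketch of the
generated_co_wrapper pattern that category (1) refers to: the wrapper
guarantees the driver callback runs in a coroutine in the node's home
AioContext, which may be an iothread, and that is why such callers need the
graph rdlock. bdrv_foo(), bdrv_co_foo() and FooCo are hypothetical names
invented for this sketch; qemu_in_coroutine(), qemu_coroutine_create(),
bdrv_coroutine_enter() and BDRV_POLL_WHILE() are existing QEMU APIs.

#include "block/block_int.h"
#include "qemu/coroutine.h"

typedef struct FooCo {
    BlockDriverState *bs;
    int ret;
    bool done;
} FooCo;

/* Stand-in for a real driver callback wrapper (the category (1) callee) */
static int coroutine_fn bdrv_co_foo(BlockDriverState *bs)
{
    return 0;
}

static void coroutine_fn bdrv_foo_co_entry(void *opaque)
{
    FooCo *foo = opaque;

    /*
     * Runs in bdrv_get_aio_context(foo->bs), possibly an iothread: this is
     * why category (1) needs the rdlock regardless of where the original
     * caller runs.
     */
    foo->ret = bdrv_co_foo(foo->bs);
    foo->done = true;
}

int bdrv_foo(BlockDriverState *bs)
{
    FooCo foo = { .bs = bs, .done = false };

    if (qemu_in_coroutine()) {
        /* Already in a coroutine: call the callback directly */
        bdrv_foo_co_entry(&foo);
    } else {
        /* Spawn a coroutine in the node's home AioContext and wait */
        Coroutine *co = qemu_coroutine_create(bdrv_foo_co_entry, &foo);
        bdrv_coroutine_enter(bs, co);
        BDRV_POLL_WHILE(bs, !foo.done);
    }
    return foo.ret;
}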
On Tue, Apr 26, 2022 at 04:51:06AM -0400, Emanuele Giuseppe Esposito wrote:
> Next step is the most complex one: figure where to put the rdlocks.
I got a little lost at this step but will hopefully understand more when
reading the patches. Earlier it said there is a counter for readers, so
it seems trivial to make it a recursive lock. I don't understand why you
want to avoid recursive locking - do the complications you described stem
from the fact that you want to avoid recursive locks?
Secondly, can you describe the role of the lock? The text mentions the
old AioContext lock that is being replaced but doesn't define this new
lock in its own right. Is the idea to make ->parents and ->children list
accesses thread-safe and all loops "safe" so they don't break when a BDS
is removed?
Stefan
On Wed, Apr 27, 2022 at 08:55:35AM +0200, Emanuele Giuseppe Esposito wrote:
> Am 26/04/2022 um 10:51 schrieb Emanuele Giuseppe Esposito:
> [...]
>
> I was wondering: if a function is in category (3) and runs in an iothread,
> but the function itself is not (currently) recursive, meaning it doesn't
> really traverse the graph or call someone that traverses it, should I add
> the rdlock anyway or not?
>
> Example: bdrv_co_drain_end
>
> Pros:
> + Covers the case where a new recursive callback for a new/existing
>   BlockDriver is implemented in the future.
> + Covers the case where I or someone else missed the recursive part.
>
> Cons:
> - Potentially introduces an unnecessary critical section.
>
> What do you think?

->bdrv_co_drain_end() is a callback function. Do you mean whether its
caller, bdrv_drain_invoke_entry(), should take the rdlock around
->bdrv_co_drain_end()?

Going up further in the call chain (and maybe switching threads),
bdrv_do_drained_end() has QLIST_FOREACH(child, &bs->children, next), so
it needs protection. If the caller of bdrv_do_drained_end() holds the
rdlock, then I think none of the child functions (including
->bdrv_co_drain_end()) need to take it explicitly.

Stefan
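A minimal sketch of the delegation described above, assuming a hypothetical
bdrv_graph_rdlock()/bdrv_graph_rdunlock() pair rather than the series'
actual API: the outermost caller takes the reader lock once, and the whole
recursive traversal, including the per-driver callbacks, runs under it
without re-acquiring the lock.

#include "block/block_int.h"

/* Hypothetical graph reader lock, for illustration only */
void bdrv_graph_rdlock(void);
void bdrv_graph_rdunlock(void);

/* Assumes the caller already holds the graph rdlock, so bs->children
 * cannot change while we walk it. */
static void drained_end_recurse(BlockDriverState *bs)
{
    BdrvChild *child;

    QLIST_FOREACH(child, &bs->children, next) {
        drained_end_recurse(child->bs);
    }
    /* the per-driver ->bdrv_co_drain_end() would be invoked here,
     * still covered by the caller's rdlock */
}

static void drained_end_locked(BlockDriverState *bs)
{
    bdrv_graph_rdlock();      /* taken once, at the outermost level */
    drained_end_recurse(bs);
    bdrv_graph_rdunlock();
}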
Am 28/04/2022 um 12:45 schrieb Stefan Hajnoczi:
> On Wed, Apr 27, 2022 at 08:55:35AM +0200, Emanuele Giuseppe Esposito wrote:
> [...]
>
> ->bdrv_co_drain_end() is a callback function. Do you mean whether its
> caller, bdrv_drain_invoke_entry(), should take the rdlock around
> ->bdrv_co_drain_end()?

Yes. The problem is that the coroutine is created in the bs AioContext, so
it might run in an iothread.

> Going up further in the call chain (and maybe switching threads),
> bdrv_do_drained_end() has QLIST_FOREACH(child, &bs->children, next), so
> it needs protection. If the caller of bdrv_do_drained_end() holds the
> rdlock, then I think none of the child functions (including
> ->bdrv_co_drain_end()) need to take it explicitly.

Regarding bdrv_do_drained_end and similar functions: they are either
running in the main loop (or they will be, if coming from a coroutine) or
in the iothread running the AioContext of the bs involved.

I think that most of the drains, except for mirror.c, come from the main
loop. I protected mirror.c in patch 8, even though right now I am not
really sure that what I did is necessary, since the BH will be scheduled
in the main loop.

Therefore we don't really need locks around drains.

Emanuele
Am 28/04/2022 um 12:34 schrieb Stefan Hajnoczi:
> On Tue, Apr 26, 2022 at 04:51:06AM -0400, Emanuele Giuseppe Esposito wrote:
>> Next step is the most complex one: figure where to put the rdlocks.
>
> I got a little lost at this step but will hopefully understand more when
> reading the patches. Earlier it said there is a counter for readers, so
> it seems trivial to make it a recursive lock. I don't understand why you
> want to avoid recursive locking - do the complications you described stem
> from the fact that you want to avoid recursive locks?

I'll let Paolo elaborate further on this. I think it would be cleaner if
we tried at least not to have recursive locks.

> Secondly, can you describe the role of the lock? The text mentions the
> old AioContext lock that is being replaced but doesn't define this new
> lock in its own right. Is the idea to make ->parents and ->children list
> accesses thread-safe and all loops "safe" so they don't break when a BDS
> is removed?

Yes. Replace what the AioContext lock is doing right now.

Emanuele
On Thu, Apr 28, 2022 at 11:56:09PM +0200, Emanuele Giuseppe Esposito wrote:
> Am 28/04/2022 um 12:45 schrieb Stefan Hajnoczi:
> [...]
>
> Regarding bdrv_do_drained_end and similar functions: they are either
> running in the main loop (or they will be, if coming from a coroutine) or
> in the iothread running the AioContext of the bs involved.
>
> I think that most of the drains, except for mirror.c, come from the main
> loop. I protected mirror.c in patch 8, even though right now I am not
> really sure that what I did is necessary, since the BH will be scheduled
> in the main loop.
>
> Therefore we don't really need locks around drains.

Are you saying rdlock isn't necessary in the main loop because nothing
can take the wrlock while our code is executing in the main loop?

Stefan
Am 30/04/2022 um 07:17 schrieb Stefan Hajnoczi:
> On Thu, Apr 28, 2022 at 11:56:09PM +0200, Emanuele Giuseppe Esposito wrote:
> [...]
>> Therefore we don't really need locks around drains.
>
> Are you saying rdlock isn't necessary in the main loop because nothing
> can take the wrlock while our code is executing in the main loop?

Yes, that's the idea.
If I am not mistaken (and I hope I am not), only the main loop currently
modifies/is allowed to modify the graph.

The only case where we currently need to take the rdlock in the main loop
is when we have something like:

simplified_flush_callback(bs) {
    for (child in bs)
        bdrv_flush(child->bs);
}

some_function() {
    GLOBAL_STATE_CODE();
    /* assume bdrv_get_aio_context(bs) != qemu_in_main_thread() */

    bdrv_flush(bs);
    co = coroutine_create(bdrv_get_aio_context(bs))
    qemu_coroutine_enter(co, simplified_flush_callback)
}
On 5/2/22 10:02, Emanuele Giuseppe Esposito wrote:
>> Are you saying rdlock isn't necessary in the main loop because nothing
>> can take the wrlock while our code is executing in the main loop?
>
> Yes, that's the idea.
> If I am not mistaken (and I hope I am not), only the main loop currently
> modifies/is allowed to modify the graph.
>
> The only case where we currently need to take the rdlock in the main loop
> is when we have something like:
>
> simplified_flush_callback(bs) {
>     for (child in bs)
>         bdrv_flush(child->bs);
> }
>
> some_function() {
>     GLOBAL_STATE_CODE();
>     /* assume bdrv_get_aio_context(bs) != qemu_in_main_thread() */
>
>     bdrv_flush(bs);
>     co = coroutine_create(bdrv_get_aio_context(bs))
>     qemu_coroutine_enter(co, simplified_flush_callback)
> }

This is correct, but it is very unsafe as long as bdrv_flush(bs) is
allowed to run both in coroutine context and outside. So we go circling
back to the same issue that was there in the stackless coroutine
experiment, i.e. functions that can run both in coroutine context and
outside.

This issue is fundamentally one of unclear invariants, and reminds me a
lot of the problems with recursive mutexes.

Paolo
Am 02.05.2022 um 10:02 hat Emanuele Giuseppe Esposito geschrieben:
> Am 30/04/2022 um 07:17 schrieb Stefan Hajnoczi:
> [...]
>> Are you saying rdlock isn't necessary in the main loop because nothing
>> can take the wrlock while our code is executing in the main loop?
>
> Yes, that's the idea.
> If I am not mistaken (and I hope I am not), only the main loop currently
> modifies/is allowed to modify the graph.

Aren't you completely ignoring coroutines in this? What would a coroutine
do that requires the graph not to change across a yield?

(It's not easily possible to protect this today, and I think this was the
source of some bugs in the past. But if we introduce some proper locking,
I would expect it to solve this.)

Kevin
On Mon, May 02, 2022 at 10:02:14AM +0200, Emanuele Giuseppe Esposito wrote:
> Am 30/04/2022 um 07:17 schrieb Stefan Hajnoczi:
> [...]
>
> Yes, that's the idea.
> If I am not mistaken (and I hope I am not), only the main loop currently
> modifies/is allowed to modify the graph.
>
> The only case where we currently need to take the rdlock in the main loop
> is when we have something like:
>
> simplified_flush_callback(bs) {
>     for (child in bs)
>         bdrv_flush(child->bs);
> }
>
> some_function() {
>     GLOBAL_STATE_CODE();
>     /* assume bdrv_get_aio_context(bs) != qemu_in_main_thread() */
>
>     bdrv_flush(bs);
>     co = coroutine_create(bdrv_get_aio_context(bs))
>     qemu_coroutine_enter(co, simplified_flush_callback)
> }

Why does the main loop need to take rdlock in this case? Only the
coroutine that runs in the non-main-loop AioContext iterates over the
children.

Also, there are a bunch of other functions in build/block/block-gen.c
that seem to follow the same pattern, so I'm not sure why bdrv_flush()
would be the only case?

In general I'm having a hard time following this patch series and the
discussion because you aren't giving a chain of reasoning. I can make
guesses myself but some of them will be incorrect. It's better to have a
full explanation so reviewers understand your perspective and the logic
behind it.

Stefan
On Tue, Apr 26, 2022 at 04:51:06AM -0400, Emanuele Giuseppe Esposito wrote:
> This is a new attempt to replace the need to take the AioContext lock to
> protect graph modifications. In particular, we aim to remove
> (or better, substitute) the AioContext lock around bdrv_replace_child_noperm,
> since this function changes BlockDriverState's ->parents and ->children
> lists.
>
> In the previous version, we decided to discard using subtree_drains to
> protect the nodes, for different reasons: for those unfamiliar with it,
> please see https://patchew.org/QEMU/20220301142113.163174-1-eesposit@redhat.com/

I reread the thread and it's unclear to me why drain is the wrong
mechanism for protecting graph modifications. We theorized a lot but
ultimately is this new mechanism sufficiently different from
bdrv_drained_begin()/end() to make it worth implementing?

Instead of invoking .drained_begin() callbacks to stop further I/O,
we're now queuing coroutines (without backpressure information that
whoever is spawning I/O needs so they can stop). The writer still waits
for in-flight I/O to finish, including I/O not associated with the bdrv
graph we wish to modify (because rdlock is per-AioContext and unrelated
to a specific graph). Is this really more lightweight than drain?

If I understand correctly, the original goal was to avoid the need to
hold the AioContext lock across bdrv_replace_child_noperm(). I would
focus on that and use drain for now.

Maybe I've missed an important point about why the new mechanism is
needed?

Stefan
On Wed, May 04, 2022 at 02:39:05PM +0100, Stefan Hajnoczi wrote:
> [...]
> Maybe I've missed an important point about why the new mechanism is
> needed?

Ping?

Stefan
Am 17/05/2022 um 12:59 schrieb Stefan Hajnoczi:
> On Wed, May 04, 2022 at 02:39:05PM +0100, Stefan Hajnoczi wrote:
>> I reread the thread and it's unclear to me why drain is the wrong
>> mechanism for protecting graph modifications. We theorized a lot but
>> ultimately is this new mechanism sufficiently different from
>> bdrv_drained_begin()/end() to make it worth implementing?
>> [...]
>> Maybe I've missed an important point about why the new mechanism is
>> needed?
>
> Ping?

label: // read till the end to see why I wrote this here

I was hoping someone from the "No" party would answer your question,
because as you know we reached this same conclusion together.

We thought about dropping the drain for various reasons, the main one
(at least as far as I understood) being that we are not sure whether
something can still happen in between drain_begin/end, and it is a little
bit confusing to use the same mechanism to block I/O and to protect the
graph.

We then thought about implementing a rwlock. An rdlock would clarify what
we are protecting and who is using the lock. I had a rwlock draft
implementation sent in this thread, but this also led to additional
problems.
The main problem was that this new lock would introduce nested event
loops, which together with such locking would just create deadlocks.
If readers are in coroutines and writers are not (because graph
operations are not running in coroutines), we have a lot of deadlocks.
If a writer has to take the lock, it must wait for all other readers to
finish. And it does so by internally calling AIO_WAIT_WHILE, creating a
nested event loop. We don't know what could execute when polling for
events, and for example another writer could be resumed.
Ideally, we want writers in coroutines too.

Additionally, many readers run in what we call "mixed" functions:
usually implemented automatically with the generated_co_wrapper tag, they
let a function (usually a bdrv callback) always run in a coroutine,
creating one if necessary. For example, bdrv_flush() makes sure that
bs->bdrv_co_flush() is always run in a coroutine.
Such mixed functions are used in other callbacks too, making it really
difficult to understand whether we are in a coroutine or not, and more
importantly making rwlock usage very difficult.

Which led us to stepping back once more and trying to convert all
BlockDriverState callbacks into coroutines. This would greatly simplify
rwlock usage, because we could make the rwlock coroutine-friendly
(without any AIO_WAIT_WHILE, allowing a writer to wait for readers by
just yielding and queuing itself in coroutine queues); see the sketch
after this message.

The first step was then to convert all callbacks into coroutines, using
generated_coroutine_wrapper (g_c_w).
A typical g_c_w is implemented in this way:

    if (qemu_in_coroutine()) {
        callback();
    } else { // much simplified
        co = qemu_coroutine_create(callback);
        bdrv_coroutine_enter(bs, co);
        BDRV_POLL_WHILE(bs, coroutine_in_progress);
    }

Once all callbacks are implemented using g_c_w, we can start splitting
the two sides of the if statement so that a coroutine is only created
when we are outside a bdrv callback.

However, we immediately found a problem while starting to convert the
first callbacks: the AioContext lock is taken around some non-coroutine
callbacks! For example, bs->bdrv_open() is always called with the
AioContext lock taken. In addition, callbacks like bdrv_open are
graph-modifying functions, which is probably why we are taking the
AioContext lock, and they do not like to run in coroutines.
Anyway, the real problem comes when we create a coroutine in such
places where the AioContext lock is taken and we have a graph-modifying
function.

bdrv_coroutine_enter() calls aio_co_enter(), which in turn first checks
whether the coroutine is entering another context from the current one
(which is not the case for open) and whether we are already in a
coroutine (for sure not). Therefore it resorts to the following calls:

    aio_context_acquire(ctx);
    qemu_aio_coroutine_enter(ctx, co);
    aio_context_release(ctx);

This is clearly a problem, because we are taking the lock twice: once
from the original caller of the callback, and once here due to the
coroutine. This creates a lot of deadlock situations.

For example, all callers of bdrv_open() always take the AioContext lock.
Often it is taken very high in the call stack, but it's always taken.

Getting rid of the lock around qemu_aio_coroutine_enter() is difficult
too, because coroutines expect to have the lock taken. For example, if
we want to drain from a coroutine, bdrv_co_yield_to_drain releases the
lock for us.

The solution is to get rid of the AioContext lock that was originally
taken in the caller, but the problem is: can we get rid of the AioContext
lock? What is that lock protecting, being so high in the call stack? What
is the alternative for its usage?

We are getting rid of it and replacing it with nothing. We are going to
drop the AioContext lock to allow coroutine callbacks, but leave graph
operations unprotected.
We might suppose that many callbacks are called under drain and in
GLOBAL_STATE, which should be enough, but from our experimentation in
the previous series we saw that currently not everything is under drain,
leaving some operations unprotected (remember assert_graph_writable
temporarily disabled, since drain coverage for bdrv_replace_child_noperm
was not 100%?).
Therefore we need to add more drains. But isn't drain what we decided to
drop at the beginning? Why isn't drain good?

goto label;

Thank you,
Emanuele
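To make the "coroutine-friendly rwlock" idea from the message above
concrete, here is a minimal sketch built on QEMU's CoMutex/CoQueue
primitives: the writer waits for readers by yielding on a coroutine queue
instead of polling in a nested AIO_WAIT_WHILE event loop. The BdrvGraphLock
type and function names are invented for this sketch (it is not the draft
lock posted in this thread), and fairness and the main-loop/iothread split
are ignored.

#include "qemu/coroutine.h"

/* Hypothetical graph lock state, illustration only */
typedef struct BdrvGraphLock {
    CoMutex mutex;
    CoQueue queue;      /* writers and blocked readers yield here */
    int readers;
    bool writer;
} BdrvGraphLock;

static void graph_lock_init(BdrvGraphLock *gl)
{
    qemu_co_mutex_init(&gl->mutex);
    qemu_co_queue_init(&gl->queue);
    gl->readers = 0;
    gl->writer = false;
}

static void coroutine_fn graph_rdlock(BdrvGraphLock *gl)
{
    qemu_co_mutex_lock(&gl->mutex);
    while (gl->writer) {
        /* yield instead of polling: no nested event loop */
        qemu_co_queue_wait(&gl->queue, &gl->mutex);
    }
    gl->readers++;
    qemu_co_mutex_unlock(&gl->mutex);
}

static void coroutine_fn graph_rdunlock(BdrvGraphLock *gl)
{
    qemu_co_mutex_lock(&gl->mutex);
    if (--gl->readers == 0) {
        qemu_co_queue_restart_all(&gl->queue);  /* wake a waiting writer */
    }
    qemu_co_mutex_unlock(&gl->mutex);
}

static void coroutine_fn graph_wrlock(BdrvGraphLock *gl)
{
    qemu_co_mutex_lock(&gl->mutex);
    while (gl->writer || gl->readers > 0) {
        qemu_co_queue_wait(&gl->queue, &gl->mutex);
    }
    gl->writer = true;
    qemu_co_mutex_unlock(&gl->mutex);
}

static void coroutine_fn graph_wrunlock(BdrvGraphLock *gl)
{
    qemu_co_mutex_lock(&gl->mutex);
    gl->writer = false;
    qemu_co_queue_restart_all(&gl->queue);
    qemu_co_mutex_unlock(&gl->mutex);
}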
On 5/18/22 14:28, Emanuele Giuseppe Esposito wrote:
> For example, all callers of bdrv_open() always take the AioContext lock.
> Often it is taken very high in the call stack, but it's always taken.

I think it's actually not a problem of who takes the AioContext lock or
where; the requirements are contradictory:

* IO_OR_GS_CODE() functions, when called from coroutine context, expect
  to be called with the AioContext lock taken (example:
  bdrv_co_yield_to_drain)

* to call these functions with the lock taken, the code has to run in the
  BDS's home iothread. Attempts to do otherwise result in deadlocks (the
  main loop's AIO_WAIT_WHILEs expect progress from the iothread, which
  cannot happen without releasing the aiocontext lock)

* running the code in the BDS's home iothread is not possible for
  GLOBAL_STATE_CODE() functions (unless the BDS home iothread is the main
  thread, but that cannot be guaranteed in general)

> We might suppose that many callbacks are called under drain and in
> GLOBAL_STATE, which should be enough, but from our experimentation in
> the previous series we saw that currently not everything is under drain,
> leaving some operations unprotected (remember assert_graph_writable
> temporarily disabled, since drain coverage for bdrv_replace_child_noperm
> was not 100%?).
> Therefore we need to add more drains. But isn't drain what we decided to
> drop at the beginning? Why isn't drain good?

To sum up the patch ordering deadlock that we have right now:

* in some cases, graph manipulations are protected by the AioContext lock

* eliminating the AioContext lock is needed to move callbacks to coroutine
  contexts (see above for the deadlock scenario)

* moving callbacks to coroutine context is needed by the graph rwlock
  implementation

On one hand, we cannot protect the graph across manipulations with a
graph rwlock without removing the AioContext lock; on the other hand, the
AioContext lock is what _right now_ protects the graph.

So I'd rather go back to Emanuele's draining approach. It may not be
beautiful, but it allows progress. Once that is in place, we can remove
the AioContext lock (which mostly protects virtio-blk/virtio-scsi code
right now) and reevaluate our next steps.

Paolo
On Wed, May 18, 2022 at 02:28:41PM +0200, Emanuele Giuseppe Esposito wrote:
> [...]
>
> We thought about dropping the drain for various reasons, the main one
> (at least as far as I understood) being that we are not sure whether
> something can still happen in between drain_begin/end, and it is a little
> bit confusing to use the same mechanism to block I/O and to protect the
> graph.

We had discussions about what could go wrong and there was a feeling that
maybe a separate mechanism is appropriate for graph modifications, but
there was no concrete reason why draining around graph modification won't
work.

If no one has a concrete reason then drain still seems like the most
promising approach to protecting graph modifications. The rwlock patch
wasn't sufficiently different from drain to have significant advantages
in my opinion.

> We then thought about implementing a rwlock. An rdlock would clarify what
> we are protecting and who is using the lock. I had a rwlock draft
> implementation sent in this thread, but this also led to additional
> problems.
> The main problem was that this new lock would introduce nested event
> loops, which together with such locking would just create deadlocks.
> If readers are in coroutines and writers are not (because graph
> operations are not running in coroutines), we have a lot of deadlocks.
> If a writer has to take the lock, it must wait for all other readers to
> finish. And it does so by internally calling AIO_WAIT_WHILE, creating a
> nested event loop. We don't know what could execute when polling for
> events, and for example another writer could be resumed.
> Ideally, we want writers in coroutines too.

What is the deadlock? Do the readers depend on the writer somehow?

> [...]
> bdrv_coroutine_enter() calls aio_co_enter(), which in turn first checks
> whether the coroutine is entering another context from the current one
> (which is not the case for open) and whether we are already in a
> coroutine (for sure not). Therefore it resorts to the following calls:
>
>     aio_context_acquire(ctx);
>     qemu_aio_coroutine_enter(ctx, co);
>     aio_context_release(ctx);
>
> This is clearly a problem, because we are taking the lock twice: once
> from the original caller of the callback, and once here due to the
> coroutine. This creates a lot of deadlock situations.

aio_context_acquire() is a recursive lock. Where is the deadlock?

Stefan
On Wed, May 18, 2022 at 02:43:50PM +0200, Paolo Bonzini wrote:
> [...]
> So I'd rather go back to Emanuele's draining approach. It may not be
> beautiful, but it allows progress. Once that is in place, we can remove
> the AioContext lock (which mostly protects virtio-blk/virtio-scsi code
> right now) and reevaluate our next steps.

Me too, I don't think the rwlock was particularly nice either.

Stefan
Am 18.05.2022 um 14:43 hat Paolo Bonzini geschrieben:
> [...]
> To sum up the patch ordering deadlock that we have right now:
>
> * in some cases, graph manipulations are protected by the AioContext lock
>
> * eliminating the AioContext lock is needed to move callbacks to coroutine
>   contexts (see above for the deadlock scenario)
>
> * moving callbacks to coroutine context is needed by the graph rwlock
>   implementation
>
> On one hand, we cannot protect the graph across manipulations with a
> graph rwlock without removing the AioContext lock; on the other hand, the
> AioContext lock is what _right now_ protects the graph.
>
> So I'd rather go back to Emanuele's draining approach. It may not be
> beautiful, but it allows progress. Once that is in place, we can remove
> the AioContext lock (which mostly protects virtio-blk/virtio-scsi code
> right now) and reevaluate our next steps.

If we want to use drain for locking, we need to make sure that drain
actually does the job correctly. I see two major problems with it:

The first one is that drain only covers I/O paths, but we need to
protect against _anything_ touching block nodes. This might mean a
massive audit and making sure that everything in QEMU that could
possibly touch a block node is integrated with drain.

I think Emanuele has argued before that because writes to the graph only
happen in the main thread and we believe that currently only I/O
requests are processed in iothreads, this is safe and we don't actually
need to audit everything.

This is true as long as the assumption holds true (how do we ensure that
nobody ever introduces non-I/O code touching a block node in an
iothread?) and as long as the graph writer never yields or polls. I
think the latter condition is violated today; a random example is that
adjusting drain counts in bdrv_replace_child_noperm() does poll. Without
cooperation from all relevant places, the effectively locked code
section ends right there, even if the drained section continues. Even if
we can fix this, verifying that the conditions are met everywhere seems
not trivial.

And that's exactly my second major concern: Even if we manage to
correctly implement things with drain, I don't see a way to meaningfully
review it. I just don't know how to verify with some confidence that
it's actually correct and covering everything that needs to be covered.

Drain is not really a lock. But if you use it as one, the best it can
provide is advisory locking (callers, inside and outside the block
layer, need to explicitly support drain instead of having the lock
applied somewhere in the block layer functions). And even if all
relevant pieces actually make use of it, it still has an awkward
interface for locking:

/* Similar to rdlock(), but doesn't wait for writers to finish. It is
 * the caller's responsibility to make sure that there are no writers. */
bdrv_inc_in_flight()

/* Similar to wrlock(). Waits for readers to finish. New readers are not
 * prevented from starting after it returns. Third parties are politely
 * asked not to touch the block node while it is drained. */
bdrv_drained_begin()

(I think the unlock counterparts actually behave as expected from a real
lock.)

Having an actual rdlock() (that waits for writers), and using static
analysis to verify that all relevant places use it (so that wrlock()
doesn't rely on politely asking third parties to leave the node alone),
gave me some confidence that we could verify the result.

I'm not sure at all how to achieve the same with the drain interface. In
theory, it's possible. But it complicates the conditions so much that
I'm pretty sure that the review wouldn't only be very time consuming,
but that I would make mistakes during the review, rendering it useless.

Maybe throwing some more static analysis at the code can help, not sure.
It's going to be a bit more complex than with the other approach,
though.

Kevin
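To make the comparison concrete, here is a sketch of what "drain as an
advisory lock" looks like around a graph modification. my_graph_update()
and my_graph_reader() are hypothetical names; bdrv_drained_begin()/end()
and bdrv_inc/dec_in_flight() are real APIs, but the locking semantics in
the comments are only the convention questioned above, not something the
functions enforce.

#include "block/block_int.h"
#include "qemu/main-loop.h"

/* "Writer": quiesce I/O, then touch ->children / ->parents. */
static void my_graph_update(BlockDriverState *bs)
{
    GLOBAL_STATE_CODE();

    bdrv_drained_begin(bs);   /* ~wrlock: waits for in-flight requests,
                               * but only politely discourages new ones */
    /* the graph modification, e.g. bdrv_replace_child_noperm(), would go
     * here; it must not poll or yield, or the effectively locked section
     * silently ends while the drained section continues */
    bdrv_drained_end(bs);
}

/* "Reader": keep the node busy so a drain (the "writer") waits for us. */
static void my_graph_reader(BlockDriverState *bs)
{
    bdrv_inc_in_flight(bs);   /* ~rdlock, but does not wait for writers */
    /* graph traversal, e.g. a QLIST_FOREACH over bs->children */
    bdrv_dec_in_flight(bs);
}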
On Wed, May 18, 2022 at 06:14:17PM +0200, Kevin Wolf wrote:
> If we want to use drain for locking, we need to make sure that drain
> actually does the job correctly. I see two major problems with it:
>
> The first one is that drain only covers I/O paths, but we need to
> protect against _anything_ touching block nodes. This might mean a
> massive audit and making sure that everything in QEMU that could
> possibly touch a block node is integrated with drain.
>
> I think Emanuele has argued before that because writes to the graph only
> happen in the main thread and we believe that currently only I/O
> requests are processed in iothreads, this is safe and we don't actually
> need to audit everything.

I'm interested in the non-I/O code path cases you're thinking about:

Block jobs receive callbacks during drain. They are safe.

Exports:
- The NBD export has code to deal with drain and looks safe.
- vhost-user-blk uses aio_set_fd_handler(is_external=true) for virtqueue
  kick fds but not for the vhost-user UNIX domain socket (not sure if
  that's a problem).
- FUSE uses aio_set_fd_handler(is_external=true) and looks safe.

The monitor runs with the BQL in the main loop and doesn't use
coroutines. It should be safe.

Anything else?

Stefan
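For context on the is_external handlers mentioned above: handlers
registered with is_external=true are skipped while their AioContext is
disabled, which is how bdrv_drained_begin() stops exports from picking up
new requests. The sketch below uses invented names (export_attach,
export_quiesce, kick_fd_read) and assumes the QEMU 7.0-era
aio_set_fd_handler() signature (including the io_poll_ready parameter),
which has changed across versions.

#include "block/aio.h"

static void kick_fd_read(void *opaque)
{
    /* process virtqueue kicks / new export requests here */
}

static void export_attach(AioContext *ctx, int kick_fd, void *opaque)
{
    /* is_external=true: this handler is ignored while ctx is disabled */
    aio_set_fd_handler(ctx, kick_fd, true /* is_external */,
                       kick_fd_read, NULL, NULL, NULL, opaque);
}

static void export_quiesce(AioContext *ctx)
{
    /* bdrv_drained_begin() does this implicitly for the node's context */
    aio_disable_external(ctx);
}

static void export_resume(AioContext *ctx)
{
    aio_enable_external(ctx);
}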
Am 19.05.2022 um 13:27 hat Stefan Hajnoczi geschrieben:
> On Wed, May 18, 2022 at 06:14:17PM +0200, Kevin Wolf wrote:
> [...]
>
> I'm interested in the non-I/O code path cases you're thinking about:
>
> Block jobs receive callbacks during drain. They are safe.

We've had bugs in the callbacks before, so while they are probably as
safe as they can get when each user has to actively support drain, I'm a
bit less than 100% confident.

> Exports:
> - The NBD export has code to deal with drain and looks safe.
> - vhost-user-blk uses aio_set_fd_handler(is_external=true) for virtqueue
>   kick fds but not for the vhost-user UNIX domain socket (not sure if
>   that's a problem).
> - FUSE uses aio_set_fd_handler(is_external=true) and looks safe.
>
> The monitor runs with the BQL in the main loop and doesn't use
> coroutines. It should be safe.

The monitor does use coroutines (though I think this doesn't make a
difference; the more important point is that this coroutine runs in
qemu_aio_context while executing commands) and is not safe with respect
to drain, and we're seeing bugs coming from it:

https://lists.gnu.org/archive/html/qemu-block/2022-03/msg00582.html

The broader point here is that even running with the BQL in the main
loop doesn't help you at all if the graph writer it could interfere with
either polls or is a coroutine that yields. You're only safe if _both_
sides run with the BQL in the main thread and neither polls nor yields.
This is the condition I explained under which Emanuele's reasoning holds
true.

So you can choose between verifying that the monitor actively respects
drain (it doesn't currently) or verifying that no graph writer can poll
or yield during its drain section so that holding the BQL is enough (not
true today, and I'm not sure if it could be made true even in theory).

> Anything else?

How to figure this out is precisely my question to you. :-)

Maybe migration completion? Some random BHs? A guest enabling a
virtio-blk device so that the dataplane gets started and its backend is
moved to a different AioContext?

I'm not sure if these cases are bad. Maybe they aren't. But how do I
tell without reading every code path?

Well, and the follow-up question: How do we make sure that the list of
"anything else" doesn't change over time when we rely on auditing every
item on it for the correctness of drain?

Kevin
On Wed, May 18, 2022 at 06:14:17PM +0200, Kevin Wolf wrote: > Am 18.05.2022 um 14:43 hat Paolo Bonzini geschrieben: > > On 5/18/22 14:28, Emanuele Giuseppe Esposito wrote: > > > For example, all callers of bdrv_open() always take the AioContext lock. > > > Often it is taken very high in the call stack, but it's always taken. > > > > I think it's actually not a problem of who takes the AioContext lock or > > where; the requirements are contradictory: > > > > * IO_OR_GS_CODE() functions, when called from coroutine context, expect to > > be called with the AioContext lock taken (example: bdrv_co_yield_to_drain) > > > > * to call these functions with the lock taken, the code has to run in the > > BDS's home iothread. Attempts to do otherwise results in deadlocks (the > > main loop's AIO_WAIT_WHILEs expect progress from the iothread, that cannot > > happen without releasing the aiocontext lock) > > > > * running the code in the BDS's home iothread is not possible for > > GLOBAL_STATE_CODE() functions (unless the BDS home iothread is the main > > thread, but that cannot be guaranteed in general) > > > > > We might suppose that many callbacks are called under drain and in > > > GLOBAL_STATE, which should be enough, but from our experimentation in > > > the previous series we saw that currently not everything is under drain, > > > leaving some operations unprotected (remember assert_graph_writable > > > temporarily disabled, since drain coverage for bdrv_replace_child_noperm > > > was not 100%?). > > > Therefore we need to add more drains. But isn't drain what we decided to > > > drop at the beginning? Why isn't drain good? > > > > To sum up the patch ordering deadlock that we have right now: > > > > * in some cases, graph manipulations are protected by the AioContext lock > > > > * eliminating the AioContext lock is needed to move callbacks to coroutine > > contexts (see above for the deadlock scenario) > > > > * moving callbacks to coroutine context is needed by the graph rwlock > > implementation > > > > On one hand, we cannot protect the graph across manipulations with a graph > > rwlock without removing the AioContext lock; on the other hand, the > > AioContext lock is what _right now_ protects the graph. > > > > So I'd rather go back to Emanuele's draining approach. It may not be > > beautiful, but it allows progress. Once that is in place, we can remove the > > AioContext lock (which mostly protects virtio-blk/virtio-scsi code right > > now) and reevaluate our next steps. > > If we want to use drain for locking, we need to make sure that drain > actually does the job correctly. I see two major problems with it: > > The first one is that drain only covers I/O paths, but we need to > protect against _anything_ touching block nodes. This might mean a > massive audit and making sure that everything in QEMU that could > possibly touch a block node is integrated with drain. > > I think Emanuele has argued before that because writes to the graph only > happen in the main thread and we believe that currently only I/O > requests are processed in iothreads, this is safe and we don't actually > need to audit everything. > > This is true as long as the assumption holds true (how do we ensure that > nobody ever introduces non-I/O code touching a block node in an > iothread?) and as long as the graph writer never yields or polls. I > think the latter condition is violated today, a random example is that > adjusting drain counts in bdrv_replace_child_noperm() does poll. 
Without > cooperation from all relevant places, the effectively locked code > section ends right there, even if the drained section continues. Even if > we can fix this, verifying that the conditions are met everywhere seems > not trivial. > > And that's exactly my second major concern: Even if we manage to > correctly implement things with drain, I don't see a way to meaningfully > review it. I just don't know how to verify with some confidence that > it's actually correct and covering everything that needs to be covered. > > Drain is not really a lock. But if you use it as one, the best it can > provide is advisory locking (callers, inside and outside the block > layer, need to explicitly support drain instead of having the lock > applied somewhere in the block layer functions). And even if all > relevant pieces actually make use of it, it still has an awkward > interface for locking: > > /* Similar to rdlock(), but doesn't wait for writers to finish. It is > * the callers responsibility to make sure that there are no writers. */ > bdrv_inc_in_flight() > > /* Similar to wrlock(). Waits for readers to finish. New readers are not > * prevented from starting after it returns. Third parties are politely > * asked not to touch the block node while it is drained. */ > bdrv_drained_begin() > > (I think the unlock counterparts actually behave as expected from a real > lock.) > > Having an actual rdlock() (that waits for writers), and using static > analysis to verify that all relevant places use it (so that wrlock() > doesn't rely on politely asking third parties to leave the node alone) > gave me some confidence that we could verify the result. > > I'm not sure at all how to achieve the same with the drain interface. In > theory, it's possible. But it complicates the conditions so much that > I'm pretty much sure that the review wouldn't only be very time > consuming, but I would make mistakes during the review, rendering it > useless. > > Maybe throwing some more static analysis on the code can help, not sure. > It's going to be a bit more complex than with the other approach, > though. Hi, Sorry for the long email. I've included three topics that may help us discuss draining and AioContext removal further. The shortcomings of drain ------------------------- I wanted to explore the logical argument that making graph modifications within a drained section is correct: - Graph modifications and BB/BDS lookup are Global State (GS). - Graph traversal from a BB/BDS pointer is IO. - Only one thread executes GS code at a time. - IO is quiesced within a drained section. - Therefore a drained section in GS code suspends graph traversal, other graph modifications, and BB/BDS lookup. - Therefore it is safe to modify the graph from a GS drained section. However, I hit on a problem that I think Emanuele and Paolo have already pointed out: draining is GS & IO. This might have worked under the 1 IOThread model but it does not make sense for multi-queue. It is possible to submit I/O requests in drained sections. How can multiple threads be in drained sections simultaneously and possibly submit further I/O requests in their drained sections? Those sections wouldn't be "drained" in any useful sense of the word. It is necessary to adjust how recursive drained sections work: bdrv_drained_begin() must not return while there are deeper nested drained sections. This is allowed: Monitor command Block job --------------- --------- > bdrv_drained_begin() . > bdrv_drained_begin() . < bdrv_drained_begin() . . . . . 
> bdrv_drained_end() . < bdrv_drained_end() < bdrv_drained_begin() . . > bdrv_drained_end() < bdrv_drained_end() This is not allowed: Monitor command Block job --------------- --------- > bdrv_drained_begin() . > bdrv_drained_begin() . < bdrv_drained_begin() . . . . < bdrv_drained_begin() . . . . > bdrv_drained_end() . < bdrv_drained_end() > bdrv_drained_end() < bdrv_drained_end() This restriction creates an ordering between bdrv_drained_begin() callers. In this example the block job must not depend on the monitor command completing first. Otherwise there would be a deadlock just like with two threads wait for each other while holding a mutex. For sanity I think draining should be GS-only. IO code should not invoke bdrv_drained_begin() to avoid ordering problems when multiple threads drain simultaneously and have dependencies on each other. block/mirror.c needs to be modified because it currently drains from IO when mirroring is about to end. With this change to draining I think the logical argument for correctness with graph modifications holds. Enforcing GS/IO separation at compile time ------------------------------------------ Right now GS/IO asserts check assumptions at runtime. The next step may be using the type system to separate GS and IO APIs so it's impossible for IO code to accidentally call GS code, for example. typedef struct { BlockDriverState *bs; } BlockDriverStateIOPtr; typedef struct { BlockDriverState *bs; } BlockDriverStateGSPtr; Then APIs can be protected as follows: void bdrv_set_aio_context_ignore(BlockDriverStateGSPtr bs, ...); A function that only has a BlockDriverStateIOPtr cannot call bdrv_set_aio_context_ignore() by mistake since the compiler will complain that the first argument must be a BlockDriverStateGSPtr. Possible steps for AioContext removal ------------------------------------- I also wanted to share my assumptions about multi-queue and AioContext removal. Please let me know if anything seems wrong or questionable: - IO code can execute in any thread that has an AioContext. - Multiple threads may execute a IO code at the same time. - GS code only execute under the BQL. For AioContext removal this means: - bdrv_get_aio_context() becomes mostly meaningless since there is no need for a special "home" AioContext. - bdrv_coroutine_enter() becomes mostly meaningless because there is no need to run a coroutine in the BDS's AioContext. - aio_disable_external(bdrv_get_aio_context(bs)) no longer works because many threads/AioContexts may submit new I/O requests. BlockDevOps.drained_begin() may be used instead (e.g. to temporarily disable ioeventfds on a multi-queue virtio-blk device). - AIO_WAIT_WHILE() simplifies to while ((cond)) { aio_poll(qemu_get_current_aio_context(), true); ... } and the distinction between home AioContext and non-home context is eliminated. AioContext unlocking is dropped. Does this make sense? I haven't seen these things in recent patch series. Stefan
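To make the typedef idea above a bit more concrete, here is one possible shape of it. This is a sketch, not an existing API: the conversion helper and the _sketch function name are invented, and the elided arguments mirror the elision above.

    typedef struct BlockDriverState BlockDriverState;

    typedef struct { BlockDriverState *bs; } BlockDriverStateGSPtr;
    typedef struct { BlockDriverState *bs; } BlockDriverStateIOPtr;

    /* GS-only API: only callable with a GS handle, i.e. from the main
     * loop under the BQL. */
    void bdrv_set_aio_context_ignore(BlockDriverStateGSPtr bs, ...);

    /* IO API: callable from any thread. */
    int bdrv_co_preadv_sketch(BlockDriverStateIOPtr bs, long offset, long bytes);

    /* One-way downgrade: GS code may hand out IO handles, IO code cannot
     * conjure up a GS handle. This is also where bs->in_flight could be
     * bumped automatically, as comes up later in the thread. */
    static inline BlockDriverStateIOPtr bdrv_gs_to_io(BlockDriverStateGSPtr p)
    {
        return (BlockDriverStateIOPtr) { .bs = p.bs };
    }

    static void some_io_path(BlockDriverStateIOPtr bs)
    {
        bdrv_co_preadv_sketch(bs, 0, 0);
        /* bdrv_set_aio_context_ignore(bs, ...);   <-- compile error */
    }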
Am 22/05/2022 um 17:06 schrieb Stefan Hajnoczi: > On Wed, May 18, 2022 at 06:14:17PM +0200, Kevin Wolf wrote: >> Am 18.05.2022 um 14:43 hat Paolo Bonzini geschrieben: >>> On 5/18/22 14:28, Emanuele Giuseppe Esposito wrote: >>>> For example, all callers of bdrv_open() always take the AioContext lock. >>>> Often it is taken very high in the call stack, but it's always taken. >>> >>> I think it's actually not a problem of who takes the AioContext lock or >>> where; the requirements are contradictory: >>> >>> * IO_OR_GS_CODE() functions, when called from coroutine context, expect to >>> be called with the AioContext lock taken (example: bdrv_co_yield_to_drain) >>> >>> * to call these functions with the lock taken, the code has to run in the >>> BDS's home iothread. Attempts to do otherwise results in deadlocks (the >>> main loop's AIO_WAIT_WHILEs expect progress from the iothread, that cannot >>> happen without releasing the aiocontext lock) >>> >>> * running the code in the BDS's home iothread is not possible for >>> GLOBAL_STATE_CODE() functions (unless the BDS home iothread is the main >>> thread, but that cannot be guaranteed in general) >>> >>>> We might suppose that many callbacks are called under drain and in >>>> GLOBAL_STATE, which should be enough, but from our experimentation in >>>> the previous series we saw that currently not everything is under drain, >>>> leaving some operations unprotected (remember assert_graph_writable >>>> temporarily disabled, since drain coverage for bdrv_replace_child_noperm >>>> was not 100%?). >>>> Therefore we need to add more drains. But isn't drain what we decided to >>>> drop at the beginning? Why isn't drain good? >>> >>> To sum up the patch ordering deadlock that we have right now: >>> >>> * in some cases, graph manipulations are protected by the AioContext lock >>> >>> * eliminating the AioContext lock is needed to move callbacks to coroutine >>> contexts (see above for the deadlock scenario) >>> >>> * moving callbacks to coroutine context is needed by the graph rwlock >>> implementation >>> >>> On one hand, we cannot protect the graph across manipulations with a graph >>> rwlock without removing the AioContext lock; on the other hand, the >>> AioContext lock is what _right now_ protects the graph. >>> >>> So I'd rather go back to Emanuele's draining approach. It may not be >>> beautiful, but it allows progress. Once that is in place, we can remove the >>> AioContext lock (which mostly protects virtio-blk/virtio-scsi code right >>> now) and reevaluate our next steps. >> >> If we want to use drain for locking, we need to make sure that drain >> actually does the job correctly. I see two major problems with it: >> >> The first one is that drain only covers I/O paths, but we need to >> protect against _anything_ touching block nodes. This might mean a >> massive audit and making sure that everything in QEMU that could >> possibly touch a block node is integrated with drain. >> >> I think Emanuele has argued before that because writes to the graph only >> happen in the main thread and we believe that currently only I/O >> requests are processed in iothreads, this is safe and we don't actually >> need to audit everything. >> >> This is true as long as the assumption holds true (how do we ensure that >> nobody ever introduces non-I/O code touching a block node in an >> iothread?) and as long as the graph writer never yields or polls. 
I >> think the latter condition is violated today, a random example is that >> adjusting drain counts in bdrv_replace_child_noperm() does poll. Without >> cooperation from all relevant places, the effectively locked code >> section ends right there, even if the drained section continues. Even if >> we can fix this, verifying that the conditions are met everywhere seems >> not trivial. >> >> And that's exactly my second major concern: Even if we manage to >> correctly implement things with drain, I don't see a way to meaningfully >> review it. I just don't know how to verify with some confidence that >> it's actually correct and covering everything that needs to be covered. >> >> Drain is not really a lock. But if you use it as one, the best it can >> provide is advisory locking (callers, inside and outside the block >> layer, need to explicitly support drain instead of having the lock >> applied somewhere in the block layer functions). And even if all >> relevant pieces actually make use of it, it still has an awkward >> interface for locking: >> >> /* Similar to rdlock(), but doesn't wait for writers to finish. It is >> * the callers responsibility to make sure that there are no writers. */ >> bdrv_inc_in_flight() >> >> /* Similar to wrlock(). Waits for readers to finish. New readers are not >> * prevented from starting after it returns. Third parties are politely >> * asked not to touch the block node while it is drained. */ >> bdrv_drained_begin() >> >> (I think the unlock counterparts actually behave as expected from a real >> lock.) >> >> Having an actual rdlock() (that waits for writers), and using static >> analysis to verify that all relevant places use it (so that wrlock() >> doesn't rely on politely asking third parties to leave the node alone) >> gave me some confidence that we could verify the result. >> >> I'm not sure at all how to achieve the same with the drain interface. In >> theory, it's possible. But it complicates the conditions so much that >> I'm pretty much sure that the review wouldn't only be very time >> consuming, but I would make mistakes during the review, rendering it >> useless. >> >> Maybe throwing some more static analysis on the code can help, not sure. >> It's going to be a bit more complex than with the other approach, >> though. > > Hi, > Sorry for the long email. I've included three topics that may help us discuss > draining and AioContext removal further. > > The shortcomings of drain > ------------------------- > I wanted to explore the logical argument that making graph modifications within > a drained section is correct: > - Graph modifications and BB/BDS lookup are Global State (GS). > - Graph traversal from a BB/BDS pointer is IO. > - Only one thread executes GS code at a time. > - IO is quiesced within a drained section. > - Therefore a drained section in GS code suspends graph traversal, other graph > modifications, and BB/BDS lookup. > - Therefore it is safe to modify the graph from a GS drained section. > > However, I hit on a problem that I think Emanuele and Paolo have already > pointed out: draining is GS & IO. This might have worked under the 1 IOThread > model but it does not make sense for multi-queue. It is possible to submit I/O > requests in drained sections. How can multiple threads be in drained sections > simultaneously and possibly submit further I/O requests in their drained > sections? Those sections wouldn't be "drained" in any useful sense of the word. 
> > It is necessary to adjust how recursive drained sections work: > bdrv_drained_begin() must not return while there are deeper nested drained > sections. > > This is allowed: > > Monitor command Block job > --------------- --------- > > bdrv_drained_begin() > . > bdrv_drained_begin() > . < bdrv_drained_begin() > . . > . . > . > bdrv_drained_end() > . < bdrv_drained_end() > < bdrv_drained_begin() > . > . > > bdrv_drained_end() > < bdrv_drained_end() > > This is not allowed: > > Monitor command Block job > --------------- --------- > > bdrv_drained_begin() > . > bdrv_drained_begin() > . < bdrv_drained_begin() > . . > . . > < bdrv_drained_begin() . > . . > . > bdrv_drained_end() > . < bdrv_drained_end() > > bdrv_drained_end() > < bdrv_drained_end() > > This restriction creates an ordering between bdrv_drained_begin() callers. In > this example the block job must not depend on the monitor command completing > first. Otherwise there would be a deadlock just like with two threads wait for > each other while holding a mutex. > > For sanity I think draining should be GS-only. IO code should not invoke > bdrv_drained_begin() to avoid ordering problems when multiple threads drain > simultaneously and have dependencies on each other. > > block/mirror.c needs to be modified because it currently drains from IO when > mirroring is about to end. Yes, mirror seems to be the only clear place where draining is performed from IO, namely in mirror_run. It's also a little bit weird because just drained_begin is invoked from IO, while drained_end is in the main loop. If I understand correctly, the reason for that is that we want to prevent bdrv_get_dirty_count to be modified (ie that additional modifications come) after we just finished mirroring. Not sure how to deal with this though. > > With this change to draining I think the logical argument for correctness with > graph modifications holds. > > Enforcing GS/IO separation at compile time > ------------------------------------------ > Right now GS/IO asserts check assumptions at runtime. The next step may be > using the type system to separate GS and IO APIs so it's impossible for IO code > to accidentally call GS code, for example. > > typedef struct { > BlockDriverState *bs; > } BlockDriverStateIOPtr; > > typedef struct { > BlockDriverState *bs; > } BlockDriverStateGSPtr; > > Then APIs can be protected as follows: > > void bdrv_set_aio_context_ignore(BlockDriverStateGSPtr bs, ...); > > A function that only has a BlockDriverStateIOPtr cannot call > bdrv_set_aio_context_ignore() by mistake since the compiler will complain that > the first argument must be a BlockDriverStateGSPtr. What about functions that do not need a BlockDriverState *, and for example they use BdrvChild? I would assume that we need to replicate it also for that. And what about GS & IO functions? Not only drain, but also all the generated_co_wrapper should be GS & IO. > > Possible steps for AioContext removal > ------------------------------------- > I also wanted to share my assumptions about multi-queue and AioContext removal. > Please let me know if anything seems wrong or questionable: > > - IO code can execute in any thread that has an AioContext. > - Multiple threads may execute a IO code at the same time. > - GS code only execute under the BQL. > > For AioContext removal this means: > > - bdrv_get_aio_context() becomes mostly meaningless since there is no need for > a special "home" AioContext. 
Makes sense > - bdrv_coroutine_enter() becomes mostly meaningless because there is no need to > run a coroutine in the BDS's AioContext. Why is there no need? > - aio_disable_external(bdrv_get_aio_context(bs)) no longer works because many > threads/AioContexts may submit new I/O requests. BlockDevOps.drained_begin() > may be used instead (e.g. to temporarily disable ioeventfds on a multi-queue > virtio-blk device). Ok > - AIO_WAIT_WHILE() simplifies to > > while ((cond)) { > aio_poll(qemu_get_current_aio_context(), true); > ... > } > > and the distinction between home AioContext and non-home context is > eliminated. AioContext unlocking is dropped. I guess this implies no coroutine in BDS's AioContext. > > Does this make sense? I haven't seen these things in recent patch series. You haven't seen them because there's no way to do it if we don't even agree on what to replace the AioContext lock with. Getting to this point would imply firstly having something (drain, rwlock, whatever) together with AioContext, and then gradually remove it. At least, that's how I understood it. Emanuele
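The mirror case Emanuele describes earlier in this message has roughly the following shape. This is a paraphrase of the description in this thread, not the actual block/mirror.c code; MirrorSketch and its field are placeholders:

    #include <stdbool.h>

    typedef struct {
        long dirty_count;         /* stands in for bdrv_get_dirty_count() */
    } MirrorSketch;

    static void drained_begin(void) { /* bdrv_drained_begin(bs) in QEMU */ }
    static void drained_end(void)   { /* bdrv_drained_end(bs) in QEMU   */ }

    /* Runs in IO context (the job coroutine) under today's model. */
    static bool mirror_try_complete(MirrorSketch *s)
    {
        drained_begin();              /* drain *from IO* ...                */
        if (s->dirty_count > 0) {
            drained_end();            /* ... new writes sneaked in, undrain */
            return false;             /* and keep copying                   */
        }
        /* Source and target are in sync: stay drained. The matching
         * drained_end() only happens later, from the completion code in
         * the main loop -- the begin/end asymmetry that makes a GS-only
         * drain rule awkward for mirror. */
        return true;
    }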
Am 22.05.2022 um 17:06 hat Stefan Hajnoczi geschrieben: > Hi, > Sorry for the long email. I've included three topics that may help us discuss > draining and AioContext removal further. > > The shortcomings of drain > ------------------------- > I wanted to explore the logical argument that making graph modifications within > a drained section is correct: > - Graph modifications and BB/BDS lookup are Global State (GS). > - Graph traversal from a BB/BDS pointer is IO. > - Only one thread executes GS code at a time. > - IO is quiesced within a drained section. I think you're mixing two things here and calling both of them IO. If a function is marked as IO, that means that it is safe to call from I/O requests, which may be running in any iothread (they currently only run in the home thread of the node's AioContext, but the function annotations already promise that any thread is safe). However, if a function is marked as IO, this doesn't necessarily mean that it is always only called in the context of an I/O request. It can be called by any other code, too. So while drain does quiesce all I/O requests, this doesn't mean that functions marked as IO won't run any more. > - Therefore a drained section in GS code suspends graph traversal, other graph > modifications, and BB/BDS lookup. > - Therefore it is safe to modify the graph from a GS drained section. So unfortunately, I don't think this conclusion is correct. In order to make your assumption true, we would have to require that all callers of IO functions must have called bdrv_inc_in_flight(). We would also have to find a way to enforce this preferable at compile time or with static analysis, or at least with runtime assertions, because it would be very easy to get wrong. > However, I hit on a problem that I think Emanuele and Paolo have already > pointed out: draining is GS & IO. This might have worked under the 1 IOThread > model but it does not make sense for multi-queue. It is possible to submit I/O > requests in drained sections. How can multiple threads be in drained sections > simultaneously and possibly submit further I/O requests in their drained > sections? Those sections wouldn't be "drained" in any useful sense of the word. Drains asks other users not to touch the block node. Currently this only includes, but the desire to use drain for locking means that we need to extend it to pretty much any operation on the node, which would include calling drain on that block node. So you should never have two callers in a drain section at the same time, it would be a bug in this model. Of course, we know that currently drain is not respected by all relevant callers (most importantly, the monitor). If you want to use drain as a form of locking, you need to solve this first. > It is necessary to adjust how recursive drained sections work: > bdrv_drained_begin() must not return while there are deeper nested drained > sections. > > This is allowed: > > Monitor command Block job > --------------- --------- > > bdrv_drained_begin() > . > bdrv_drained_begin() > . < bdrv_drained_begin() > . . > . . > . > bdrv_drained_end() > . < bdrv_drained_end() > < bdrv_drained_begin() > . > . > > bdrv_drained_end() > < bdrv_drained_end() Just to make sure I understand the scenario that you have in mind here: The monitor command is not what causes the block job to do the draining because this is inside bdrv_drained_begin(), so the block job just randomly got a callback that caused it to do so (maybe completion)? 
Before bdrv_drained_begin() returns, anything is still allowed to run, so I agree that this is valid. > This is not allowed: > > Monitor command Block job > --------------- --------- > > bdrv_drained_begin() > . > bdrv_drained_begin() > . < bdrv_drained_begin() > . . > . . > < bdrv_drained_begin() . > . . > . > bdrv_drained_end() > . < bdrv_drained_end() > > bdrv_drained_end() > < bdrv_drained_end() > > This restriction creates an ordering between bdrv_drained_begin() callers. In > this example the block job must not depend on the monitor command completing > first. Otherwise there would be a deadlock just like with two threads wait for > each other while holding a mutex. So essentially drain needs to increase bs->in_flight, so that the outer drain has to wait for the inner drain section to end before its bdrv_drained_begin() can return. > For sanity I think draining should be GS-only. IO code should not invoke > bdrv_drained_begin() to avoid ordering problems when multiple threads drain > simultaneously and have dependencies on each other. > > block/mirror.c needs to be modified because it currently drains from IO when > mirroring is about to end. > > With this change to draining I think the logical argument for correctness with > graph modifications holds. What is your suggestion how to modify mirror? It drains so that no new requests can sneak in and source and target stay in sync while it switches to the completion code running in the main AioContext. You can't just drop this. > Enforcing GS/IO separation at compile time > ------------------------------------------ > Right now GS/IO asserts check assumptions at runtime. The next step may be > using the type system to separate GS and IO APIs so it's impossible for IO code > to accidentally call GS code, for example. > > typedef struct { > BlockDriverState *bs; > } BlockDriverStateIOPtr; > > typedef struct { > BlockDriverState *bs; > } BlockDriverStateGSPtr; > > Then APIs can be protected as follows: > > void bdrv_set_aio_context_ignore(BlockDriverStateGSPtr bs, ...); > > A function that only has a BlockDriverStateIOPtr cannot call > bdrv_set_aio_context_ignore() by mistake since the compiler will complain that > the first argument must be a BlockDriverStateGSPtr. And then you have a set of functions that cast from GS to IO pointers, but not for the other way around? Or maybe getting as IO pointer would even be coupled with automatically increasing bs->in_flight? The other option that we were looking into for this was static analysis. I had hacked up a small script to check consistency of these annotations (without covering function pointers, though) to help me with the review of Emanuele's patches that introduced them. If I understand correctly, Paolo has scripts to do the same a little bit better. As long as we can integrate such a script in 'make check', we wouldn't necessarily have to have the churn in the C code to switch everything to new wrapper types. > Possible steps for AioContext removal > ------------------------------------- > I also wanted to share my assumptions about multi-queue and AioContext removal. > Please let me know if anything seems wrong or questionable: > > - IO code can execute in any thread that has an AioContext. > - Multiple threads may execute a IO code at the same time. > - GS code only execute under the BQL. > > For AioContext removal this means: > > - bdrv_get_aio_context() becomes mostly meaningless since there is no need for > a special "home" AioContext. 
> - bdrv_coroutine_enter() becomes mostly meaningless because there is no need to > run a coroutine in the BDS's AioContext. > - aio_disable_external(bdrv_get_aio_context(bs)) no longer works because many > threads/AioContexts may submit new I/O requests. BlockDevOps.drained_begin() > may be used instead (e.g. to temporarily disable ioeventfds on a multi-queue > virtio-blk device). > - AIO_WAIT_WHILE() simplifies to > > while ((cond)) { > aio_poll(qemu_get_current_aio_context(), true); > ... > } Probably not exactly, because you still need aio_wait_kick() to wake up the thread. We use AioWait.num_waiters != 0 to decide whether we need to schedule a BH in the main thread (because doing so unconditionally for every completed request would be very wasteful). If you want to be able to run this loop from any thread instead of just the main thread, you would have to store somewhere which thread to wake. > and the distinction between home AioContext and non-home context is > eliminated. AioContext unlocking is dropped. > > Does this make sense? I haven't seen these things in recent patch series. Intuitively I would agree with most. There may be details that I'm not aware of at the moment. I'm not surprised that we haven't seen any such things in recent patch series because there is an awful lot of preparational work to be done before we can reach this final state. Kevin
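For readers without the AIO_WAIT_WHILE() machinery in their head, the wakeup side Kevin refers to looks roughly like this. It is a simplified paraphrase from memory, not the literal code in include/block/aio-wait.h, so check the tree for the real thing:

    static unsigned num_waiters;              /* AioWait.num_waiters       */

    static void dummy_cb(void *opaque) { }    /* only exists to wake poll  */

    /* Called after a request completes (aio_wait_kick() in QEMU). */
    static void wait_kick_sketch(void)
    {
        if (qatomic_read(&num_waiters)) {
            /* Today the only possible waiter is the main loop, so a BH in
             * the main AioContext is enough. A waiter in an arbitrary
             * thread would have to record *which* context to wake here. */
            aio_bh_schedule_oneshot(qemu_get_aio_context(), dummy_cb, NULL);
        }
    }

    /* Waiter side, roughly:
     *     qatomic_inc(&num_waiters);
     *     while (cond) { aio_poll(ctx, true); }
     *     qatomic_dec(&num_waiters);
     */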
On Mon, May 23, 2022 at 10:48:48AM +0200, Emanuele Giuseppe Esposito wrote: > > > Am 22/05/2022 um 17:06 schrieb Stefan Hajnoczi: > > On Wed, May 18, 2022 at 06:14:17PM +0200, Kevin Wolf wrote: > >> Am 18.05.2022 um 14:43 hat Paolo Bonzini geschrieben: > >>> On 5/18/22 14:28, Emanuele Giuseppe Esposito wrote: > >>>> For example, all callers of bdrv_open() always take the AioContext lock. > >>>> Often it is taken very high in the call stack, but it's always taken. > >>> > >>> I think it's actually not a problem of who takes the AioContext lock or > >>> where; the requirements are contradictory: > >>> > >>> * IO_OR_GS_CODE() functions, when called from coroutine context, expect to > >>> be called with the AioContext lock taken (example: bdrv_co_yield_to_drain) > >>> > >>> * to call these functions with the lock taken, the code has to run in the > >>> BDS's home iothread. Attempts to do otherwise results in deadlocks (the > >>> main loop's AIO_WAIT_WHILEs expect progress from the iothread, that cannot > >>> happen without releasing the aiocontext lock) > >>> > >>> * running the code in the BDS's home iothread is not possible for > >>> GLOBAL_STATE_CODE() functions (unless the BDS home iothread is the main > >>> thread, but that cannot be guaranteed in general) > >>> > >>>> We might suppose that many callbacks are called under drain and in > >>>> GLOBAL_STATE, which should be enough, but from our experimentation in > >>>> the previous series we saw that currently not everything is under drain, > >>>> leaving some operations unprotected (remember assert_graph_writable > >>>> temporarily disabled, since drain coverage for bdrv_replace_child_noperm > >>>> was not 100%?). > >>>> Therefore we need to add more drains. But isn't drain what we decided to > >>>> drop at the beginning? Why isn't drain good? > >>> > >>> To sum up the patch ordering deadlock that we have right now: > >>> > >>> * in some cases, graph manipulations are protected by the AioContext lock > >>> > >>> * eliminating the AioContext lock is needed to move callbacks to coroutine > >>> contexts (see above for the deadlock scenario) > >>> > >>> * moving callbacks to coroutine context is needed by the graph rwlock > >>> implementation > >>> > >>> On one hand, we cannot protect the graph across manipulations with a graph > >>> rwlock without removing the AioContext lock; on the other hand, the > >>> AioContext lock is what _right now_ protects the graph. > >>> > >>> So I'd rather go back to Emanuele's draining approach. It may not be > >>> beautiful, but it allows progress. Once that is in place, we can remove the > >>> AioContext lock (which mostly protects virtio-blk/virtio-scsi code right > >>> now) and reevaluate our next steps. > >> > >> If we want to use drain for locking, we need to make sure that drain > >> actually does the job correctly. I see two major problems with it: > >> > >> The first one is that drain only covers I/O paths, but we need to > >> protect against _anything_ touching block nodes. This might mean a > >> massive audit and making sure that everything in QEMU that could > >> possibly touch a block node is integrated with drain. > >> > >> I think Emanuele has argued before that because writes to the graph only > >> happen in the main thread and we believe that currently only I/O > >> requests are processed in iothreads, this is safe and we don't actually > >> need to audit everything. 
> >> > >> This is true as long as the assumption holds true (how do we ensure that > >> nobody ever introduces non-I/O code touching a block node in an > >> iothread?) and as long as the graph writer never yields or polls. I > >> think the latter condition is violated today, a random example is that > >> adjusting drain counts in bdrv_replace_child_noperm() does poll. Without > >> cooperation from all relevant places, the effectively locked code > >> section ends right there, even if the drained section continues. Even if > >> we can fix this, verifying that the conditions are met everywhere seems > >> not trivial. > >> > >> And that's exactly my second major concern: Even if we manage to > >> correctly implement things with drain, I don't see a way to meaningfully > >> review it. I just don't know how to verify with some confidence that > >> it's actually correct and covering everything that needs to be covered. > >> > >> Drain is not really a lock. But if you use it as one, the best it can > >> provide is advisory locking (callers, inside and outside the block > >> layer, need to explicitly support drain instead of having the lock > >> applied somewhere in the block layer functions). And even if all > >> relevant pieces actually make use of it, it still has an awkward > >> interface for locking: > >> > >> /* Similar to rdlock(), but doesn't wait for writers to finish. It is > >> * the callers responsibility to make sure that there are no writers. */ > >> bdrv_inc_in_flight() > >> > >> /* Similar to wrlock(). Waits for readers to finish. New readers are not > >> * prevented from starting after it returns. Third parties are politely > >> * asked not to touch the block node while it is drained. */ > >> bdrv_drained_begin() > >> > >> (I think the unlock counterparts actually behave as expected from a real > >> lock.) > >> > >> Having an actual rdlock() (that waits for writers), and using static > >> analysis to verify that all relevant places use it (so that wrlock() > >> doesn't rely on politely asking third parties to leave the node alone) > >> gave me some confidence that we could verify the result. > >> > >> I'm not sure at all how to achieve the same with the drain interface. In > >> theory, it's possible. But it complicates the conditions so much that > >> I'm pretty much sure that the review wouldn't only be very time > >> consuming, but I would make mistakes during the review, rendering it > >> useless. > >> > >> Maybe throwing some more static analysis on the code can help, not sure. > >> It's going to be a bit more complex than with the other approach, > >> though. > > > > Hi, > > Sorry for the long email. I've included three topics that may help us discuss > > draining and AioContext removal further. > > > > The shortcomings of drain > > ------------------------- > > I wanted to explore the logical argument that making graph modifications within > > a drained section is correct: > > - Graph modifications and BB/BDS lookup are Global State (GS). > > - Graph traversal from a BB/BDS pointer is IO. > > - Only one thread executes GS code at a time. > > - IO is quiesced within a drained section. > > - Therefore a drained section in GS code suspends graph traversal, other graph > > modifications, and BB/BDS lookup. > > - Therefore it is safe to modify the graph from a GS drained section. > > > > However, I hit on a problem that I think Emanuele and Paolo have already > > pointed out: draining is GS & IO. 
This might have worked under the 1 IOThread > > model but it does not make sense for multi-queue. It is possible to submit I/O > > requests in drained sections. How can multiple threads be in drained sections > > simultaneously and possibly submit further I/O requests in their drained > > sections? Those sections wouldn't be "drained" in any useful sense of the word. > > > > It is necessary to adjust how recursive drained sections work: > > bdrv_drained_begin() must not return while there are deeper nested drained > > sections. > > > > This is allowed: > > > > Monitor command Block job > > --------------- --------- > > > bdrv_drained_begin() > > . > bdrv_drained_begin() > > . < bdrv_drained_begin() > > . . > > . . > > . > bdrv_drained_end() > > . < bdrv_drained_end() > > < bdrv_drained_begin() > > . > > . > > > bdrv_drained_end() > > < bdrv_drained_end() > > > > This is not allowed: > > > > Monitor command Block job > > --------------- --------- > > > bdrv_drained_begin() > > . > bdrv_drained_begin() > > . < bdrv_drained_begin() > > . . > > . . > > < bdrv_drained_begin() . > > . . > > . > bdrv_drained_end() > > . < bdrv_drained_end() > > > bdrv_drained_end() > > < bdrv_drained_end() > > > > This restriction creates an ordering between bdrv_drained_begin() callers. In > > this example the block job must not depend on the monitor command completing > > first. Otherwise there would be a deadlock just like with two threads wait for > > each other while holding a mutex. > > > > For sanity I think draining should be GS-only. IO code should not invoke > > bdrv_drained_begin() to avoid ordering problems when multiple threads drain > > simultaneously and have dependencies on each other. > > > > block/mirror.c needs to be modified because it currently drains from IO when > > mirroring is about to end. > > Yes, mirror seems to be the only clear place where draining is performed > from IO, namely in mirror_run. It's also a little bit weird because just > drained_begin is invoked from IO, while drained_end is in the main loop. > > If I understand correctly, the reason for that is that we want to > prevent bdrv_get_dirty_count to be modified (ie that additional > modifications come) after we just finished mirroring. > > Not sure how to deal with this though. I don't know the code so I don't have a concrete solution in mind. > > > > With this change to draining I think the logical argument for correctness with > > graph modifications holds. > > > > Enforcing GS/IO separation at compile time > > ------------------------------------------ > > Right now GS/IO asserts check assumptions at runtime. The next step may be > > using the type system to separate GS and IO APIs so it's impossible for IO code > > to accidentally call GS code, for example. > > > > typedef struct { > > BlockDriverState *bs; > > } BlockDriverStateIOPtr; > > > > typedef struct { > > BlockDriverState *bs; > > } BlockDriverStateGSPtr; > > > > Then APIs can be protected as follows: > > > > void bdrv_set_aio_context_ignore(BlockDriverStateGSPtr bs, ...); > > > > A function that only has a BlockDriverStateIOPtr cannot call > > bdrv_set_aio_context_ignore() by mistake since the compiler will complain that > > the first argument must be a BlockDriverStateGSPtr. > > What about functions that do not need a BlockDriverState *, and for > example they use BdrvChild? I would assume that we need to replicate it > also for that. > > And what about GS & IO functions? Not only drain, but also all the > generated_co_wrapper should be GS & IO. 
What is the meaning of GS & IO in a multi-queue world? I guess it's effectively IO except it's well-behaved if called with the BQL held? > > > > > Possible steps for AioContext removal > > ------------------------------------- > > I also wanted to share my assumptions about multi-queue and AioContext removal. > > Please let me know if anything seems wrong or questionable: > > > > - IO code can execute in any thread that has an AioContext. > > - Multiple threads may execute a IO code at the same time. > > - GS code only execute under the BQL. > > > > For AioContext removal this means: > > > > - bdrv_get_aio_context() becomes mostly meaningless since there is no need for > > a special "home" AioContext. > > Makes sense > > > - bdrv_coroutine_enter() becomes mostly meaningless because there is no need to > > run a coroutine in the BDS's AioContext. > > Why is there no need? The coroutine can execute in the current thread since the BDS must be thread-safe. If bdrv_coroutine_enter() is used in cases like block jobs to mean "run in the same AioContext as the job coroutine" then the code should probably be changed to explicitly use the job AioContext instead of the more vague BDS AioContext. > > - aio_disable_external(bdrv_get_aio_context(bs)) no longer works because many > > threads/AioContexts may submit new I/O requests. BlockDevOps.drained_begin() > > may be used instead (e.g. to temporarily disable ioeventfds on a multi-queue > > virtio-blk device). > > Ok > > > - AIO_WAIT_WHILE() simplifies to > > > > while ((cond)) { > > aio_poll(qemu_get_current_aio_context(), true); > > ... > > } > > > > and the distinction between home AioContext and non-home context is > > eliminated. AioContext unlocking is dropped. > > I guess this implies no coroutine in BDS's AioContext. I'm not sure what you mean but coroutines accessing the BDS can run in any AioContext with multi-queue. If they have any locking requirements amongst each other then that should be explicit instead of just throwing them all into the BDS AioContext. > > > > Does this make sense? I haven't seen these things in recent patch series. > > You haven't seen them because there's no way to do it if we don't even > agree on what to replace the AioContext lock with. > Getting to this point would imply firstly having something (drain, > rwlock, whatever) together with AioContext, and then gradually remove > it. At least, that's how I understood it. What, besides graph modification and the things I listed in this email, still needs to be untangled from AioContext? Stefan
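As a concrete illustration of "explicitly use the job AioContext": a sketch under the assumption that Job keeps its aio_context field and that aio_co_enter()/qemu_coroutine_enter() keep their current signatures; this is not a proposed patch.

    /* Old habit: enter the coroutine "wherever the BDS lives", i.e.
     * bdrv_coroutine_enter(bs, co). In a multi-queue world that no longer
     * names a real requirement. */

    static void enter_job_co(Job *job, Coroutine *co)
    {
        /* If the requirement is "run together with the job", say so: */
        aio_co_enter(job->aio_context, co);
    }

    static void enter_local_co(Coroutine *co)
    {
        /* If there is no cross-thread requirement at all, just run it in
         * the current thread: */
        qemu_coroutine_enter(co);
    }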
Am 23/05/2022 um 15:15 schrieb Stefan Hajnoczi: > On Mon, May 23, 2022 at 10:48:48AM +0200, Emanuele Giuseppe Esposito wrote: >> >> >> Am 22/05/2022 um 17:06 schrieb Stefan Hajnoczi: >>> On Wed, May 18, 2022 at 06:14:17PM +0200, Kevin Wolf wrote: >>>> Am 18.05.2022 um 14:43 hat Paolo Bonzini geschrieben: >>>>> On 5/18/22 14:28, Emanuele Giuseppe Esposito wrote: >>>>>> For example, all callers of bdrv_open() always take the AioContext lock. >>>>>> Often it is taken very high in the call stack, but it's always taken. >>>>> >>>>> I think it's actually not a problem of who takes the AioContext lock or >>>>> where; the requirements are contradictory: >>>>> >>>>> * IO_OR_GS_CODE() functions, when called from coroutine context, expect to >>>>> be called with the AioContext lock taken (example: bdrv_co_yield_to_drain) >>>>> >>>>> * to call these functions with the lock taken, the code has to run in the >>>>> BDS's home iothread. Attempts to do otherwise results in deadlocks (the >>>>> main loop's AIO_WAIT_WHILEs expect progress from the iothread, that cannot >>>>> happen without releasing the aiocontext lock) >>>>> >>>>> * running the code in the BDS's home iothread is not possible for >>>>> GLOBAL_STATE_CODE() functions (unless the BDS home iothread is the main >>>>> thread, but that cannot be guaranteed in general) >>>>> >>>>>> We might suppose that many callbacks are called under drain and in >>>>>> GLOBAL_STATE, which should be enough, but from our experimentation in >>>>>> the previous series we saw that currently not everything is under drain, >>>>>> leaving some operations unprotected (remember assert_graph_writable >>>>>> temporarily disabled, since drain coverage for bdrv_replace_child_noperm >>>>>> was not 100%?). >>>>>> Therefore we need to add more drains. But isn't drain what we decided to >>>>>> drop at the beginning? Why isn't drain good? >>>>> >>>>> To sum up the patch ordering deadlock that we have right now: >>>>> >>>>> * in some cases, graph manipulations are protected by the AioContext lock >>>>> >>>>> * eliminating the AioContext lock is needed to move callbacks to coroutine >>>>> contexts (see above for the deadlock scenario) >>>>> >>>>> * moving callbacks to coroutine context is needed by the graph rwlock >>>>> implementation >>>>> >>>>> On one hand, we cannot protect the graph across manipulations with a graph >>>>> rwlock without removing the AioContext lock; on the other hand, the >>>>> AioContext lock is what _right now_ protects the graph. >>>>> >>>>> So I'd rather go back to Emanuele's draining approach. It may not be >>>>> beautiful, but it allows progress. Once that is in place, we can remove the >>>>> AioContext lock (which mostly protects virtio-blk/virtio-scsi code right >>>>> now) and reevaluate our next steps. >>>> >>>> If we want to use drain for locking, we need to make sure that drain >>>> actually does the job correctly. I see two major problems with it: >>>> >>>> The first one is that drain only covers I/O paths, but we need to >>>> protect against _anything_ touching block nodes. This might mean a >>>> massive audit and making sure that everything in QEMU that could >>>> possibly touch a block node is integrated with drain. >>>> >>>> I think Emanuele has argued before that because writes to the graph only >>>> happen in the main thread and we believe that currently only I/O >>>> requests are processed in iothreads, this is safe and we don't actually >>>> need to audit everything. 
>>>> >>>> This is true as long as the assumption holds true (how do we ensure that >>>> nobody ever introduces non-I/O code touching a block node in an >>>> iothread?) and as long as the graph writer never yields or polls. I >>>> think the latter condition is violated today, a random example is that >>>> adjusting drain counts in bdrv_replace_child_noperm() does poll. Without >>>> cooperation from all relevant places, the effectively locked code >>>> section ends right there, even if the drained section continues. Even if >>>> we can fix this, verifying that the conditions are met everywhere seems >>>> not trivial. >>>> >>>> And that's exactly my second major concern: Even if we manage to >>>> correctly implement things with drain, I don't see a way to meaningfully >>>> review it. I just don't know how to verify with some confidence that >>>> it's actually correct and covering everything that needs to be covered. >>>> >>>> Drain is not really a lock. But if you use it as one, the best it can >>>> provide is advisory locking (callers, inside and outside the block >>>> layer, need to explicitly support drain instead of having the lock >>>> applied somewhere in the block layer functions). And even if all >>>> relevant pieces actually make use of it, it still has an awkward >>>> interface for locking: >>>> >>>> /* Similar to rdlock(), but doesn't wait for writers to finish. It is >>>> * the callers responsibility to make sure that there are no writers. */ >>>> bdrv_inc_in_flight() >>>> >>>> /* Similar to wrlock(). Waits for readers to finish. New readers are not >>>> * prevented from starting after it returns. Third parties are politely >>>> * asked not to touch the block node while it is drained. */ >>>> bdrv_drained_begin() >>>> >>>> (I think the unlock counterparts actually behave as expected from a real >>>> lock.) >>>> >>>> Having an actual rdlock() (that waits for writers), and using static >>>> analysis to verify that all relevant places use it (so that wrlock() >>>> doesn't rely on politely asking third parties to leave the node alone) >>>> gave me some confidence that we could verify the result. >>>> >>>> I'm not sure at all how to achieve the same with the drain interface. In >>>> theory, it's possible. But it complicates the conditions so much that >>>> I'm pretty much sure that the review wouldn't only be very time >>>> consuming, but I would make mistakes during the review, rendering it >>>> useless. >>>> >>>> Maybe throwing some more static analysis on the code can help, not sure. >>>> It's going to be a bit more complex than with the other approach, >>>> though. >>> >>> Hi, >>> Sorry for the long email. I've included three topics that may help us discuss >>> draining and AioContext removal further. >>> >>> The shortcomings of drain >>> ------------------------- >>> I wanted to explore the logical argument that making graph modifications within >>> a drained section is correct: >>> - Graph modifications and BB/BDS lookup are Global State (GS). >>> - Graph traversal from a BB/BDS pointer is IO. >>> - Only one thread executes GS code at a time. >>> - IO is quiesced within a drained section. >>> - Therefore a drained section in GS code suspends graph traversal, other graph >>> modifications, and BB/BDS lookup. >>> - Therefore it is safe to modify the graph from a GS drained section. >>> >>> However, I hit on a problem that I think Emanuele and Paolo have already >>> pointed out: draining is GS & IO. 
This might have worked under the 1 IOThread >>> model but it does not make sense for multi-queue. It is possible to submit I/O >>> requests in drained sections. How can multiple threads be in drained sections >>> simultaneously and possibly submit further I/O requests in their drained >>> sections? Those sections wouldn't be "drained" in any useful sense of the word. >>> >>> It is necessary to adjust how recursive drained sections work: >>> bdrv_drained_begin() must not return while there are deeper nested drained >>> sections. >>> >>> This is allowed: >>> >>> Monitor command Block job >>> --------------- --------- >>> > bdrv_drained_begin() >>> . > bdrv_drained_begin() >>> . < bdrv_drained_begin() >>> . . >>> . . >>> . > bdrv_drained_end() >>> . < bdrv_drained_end() >>> < bdrv_drained_begin() >>> . >>> . >>> > bdrv_drained_end() >>> < bdrv_drained_end() >>> >>> This is not allowed: >>> >>> Monitor command Block job >>> --------------- --------- >>> > bdrv_drained_begin() >>> . > bdrv_drained_begin() >>> . < bdrv_drained_begin() >>> . . >>> . . >>> < bdrv_drained_begin() . >>> . . >>> . > bdrv_drained_end() >>> . < bdrv_drained_end() >>> > bdrv_drained_end() >>> < bdrv_drained_end() >>> >>> This restriction creates an ordering between bdrv_drained_begin() callers. In >>> this example the block job must not depend on the monitor command completing >>> first. Otherwise there would be a deadlock just like with two threads wait for >>> each other while holding a mutex. >>> >>> For sanity I think draining should be GS-only. IO code should not invoke >>> bdrv_drained_begin() to avoid ordering problems when multiple threads drain >>> simultaneously and have dependencies on each other. >>> >>> block/mirror.c needs to be modified because it currently drains from IO when >>> mirroring is about to end. >> >> Yes, mirror seems to be the only clear place where draining is performed >> from IO, namely in mirror_run. It's also a little bit weird because just >> drained_begin is invoked from IO, while drained_end is in the main loop. >> >> If I understand correctly, the reason for that is that we want to >> prevent bdrv_get_dirty_count to be modified (ie that additional >> modifications come) after we just finished mirroring. >> >> Not sure how to deal with this though. > > I don't know the code so I don't have a concrete solution in mind. > >>> >>> With this change to draining I think the logical argument for correctness with >>> graph modifications holds. >>> >>> Enforcing GS/IO separation at compile time >>> ------------------------------------------ >>> Right now GS/IO asserts check assumptions at runtime. The next step may be >>> using the type system to separate GS and IO APIs so it's impossible for IO code >>> to accidentally call GS code, for example. >>> >>> typedef struct { >>> BlockDriverState *bs; >>> } BlockDriverStateIOPtr; >>> >>> typedef struct { >>> BlockDriverState *bs; >>> } BlockDriverStateGSPtr; >>> >>> Then APIs can be protected as follows: >>> >>> void bdrv_set_aio_context_ignore(BlockDriverStateGSPtr bs, ...); >>> >>> A function that only has a BlockDriverStateIOPtr cannot call >>> bdrv_set_aio_context_ignore() by mistake since the compiler will complain that >>> the first argument must be a BlockDriverStateGSPtr. >> >> What about functions that do not need a BlockDriverState *, and for >> example they use BdrvChild? I would assume that we need to replicate it >> also for that. >> >> And what about GS & IO functions? 
Not only drain, but also all the >> generated_co_wrapper should be GS & IO. > > What is the meaning of GS & IO in a multi-queue world? I guess it's > effectively IO except it's well-behaved if called with the BQL held? Good question. I think so. But will need to be protected, as they will be probably graph traversing while being either GS or IO. > >> >>> >>> Possible steps for AioContext removal >>> ------------------------------------- >>> I also wanted to share my assumptions about multi-queue and AioContext removal. >>> Please let me know if anything seems wrong or questionable: >>> >>> - IO code can execute in any thread that has an AioContext. >>> - Multiple threads may execute a IO code at the same time. >>> - GS code only execute under the BQL. >>> >>> For AioContext removal this means: >>> >>> - bdrv_get_aio_context() becomes mostly meaningless since there is no need for >>> a special "home" AioContext. >> >> Makes sense >> >>> - bdrv_coroutine_enter() becomes mostly meaningless because there is no need to >>> run a coroutine in the BDS's AioContext. >> >> Why is there no need? > > The coroutine can execute in the current thread since the BDS must be > thread-safe. > > If bdrv_coroutine_enter() is used in cases like block jobs to mean "run > in the same AioContext as the job coroutine" then the code should > probably be changed to explicitly use the job AioContext instead of the > more vague BDS AioContext. Ok makes sense > >>> - aio_disable_external(bdrv_get_aio_context(bs)) no longer works because many >>> threads/AioContexts may submit new I/O requests. BlockDevOps.drained_begin() >>> may be used instead (e.g. to temporarily disable ioeventfds on a multi-queue >>> virtio-blk device). >> >> Ok >> >>> - AIO_WAIT_WHILE() simplifies to >>> >>> while ((cond)) { >>> aio_poll(qemu_get_current_aio_context(), true); >>> ... >>> } >>> >>> and the distinction between home AioContext and non-home context is >>> eliminated. AioContext unlocking is dropped. >> >> I guess this implies no coroutine in BDS's AioContext. > > I'm not sure what you mean but coroutines accessing the BDS can run in > any AioContext with multi-queue. If they have any locking requirements > amongst each other then that should be explicit instead of just throwing > them all into the BDS AioContext. Ok > >>> >>> Does this make sense? I haven't seen these things in recent patch series. >> >> You haven't seen them because there's no way to do it if we don't even >> agree on what to replace the AioContext lock with. >> Getting to this point would imply firstly having something (drain, >> rwlock, whatever) together with AioContext, and then gradually remove >> it. At least, that's how I understood it. > > What, besides graph modification and the things I listed in this email, > still needs to be untangled from AioContext? Nothing else (I hope), but as Kevin also said, this requires a lot of preliminary work before getting to this state. Emanuele > > Stefan >
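One way to picture "GS & IO but still needs protection" for a generated_co_wrapper-style function: a sketch in which bdrv_graph_co_rdlock()/bdrv_graph_co_rdunlock() are assumed names for the reader side of the rdlock discussed above, and bdrv_co_foo() is a made-up driver callback.

    /* Sketch, not generated code. The wrapper can be reached from GS or IO
     * context, so the coroutine side takes the graph rdlock itself rather
     * than relying on the caller being the main loop. */
    static int coroutine_fn bdrv_co_foo_locked(BlockDriverState *bs)
    {
        int ret;

        bdrv_graph_co_rdlock();      /* assumed name, reader side           */
        ret = bdrv_co_foo(bs);       /* may traverse bs->children/->parents */
        bdrv_graph_co_rdunlock();

        return ret;
    }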
On Mon, May 23, 2022 at 03:02:05PM +0200, Kevin Wolf wrote: > Am 22.05.2022 um 17:06 hat Stefan Hajnoczi geschrieben: > > Hi, > > Sorry for the long email. I've included three topics that may help us discuss > > draining and AioContext removal further. > > > > The shortcomings of drain > > ------------------------- > > I wanted to explore the logical argument that making graph modifications within > > a drained section is correct: > > - Graph modifications and BB/BDS lookup are Global State (GS). > > - Graph traversal from a BB/BDS pointer is IO. > > - Only one thread executes GS code at a time. > > - IO is quiesced within a drained section. > > I think you're mixing two things here and calling both of them IO. > > If a function is marked as IO, that means that it is safe to call from > I/O requests, which may be running in any iothread (they currently only > run in the home thread of the node's AioContext, but the function > annotations already promise that any thread is safe). > > However, if a function is marked as IO, this doesn't necessarily mean > that it is always only called in the context of an I/O request. It can > be called by any other code, too. > > So while drain does quiesce all I/O requests, this doesn't mean that > functions marked as IO won't run any more. My model is that between bdrv_drained_begin() and bdrv_drained_end() only the caller is allowed to invoke BB/BDS functions (including functions marked IO). The caller isn't strictly one thread and one or no coroutines. The caller could use threads and coroutines, but the important thing is that everything else that accesses the BB/BDS is suspended due to .drained_begin() callbacks (and is_external). So when you say a function marked IO can be called from outside an I/O request, that "outside an I/O request" code must be quiesced too. Otherwise drain is definitely not safe for graph modifications. > > - Therefore a drained section in GS code suspends graph traversal, other graph > > modifications, and BB/BDS lookup. > > - Therefore it is safe to modify the graph from a GS drained section. > > So unfortunately, I don't think this conclusion is correct. > > In order to make your assumption true, we would have to require that all > callers of IO functions must have called bdrv_inc_in_flight(). We would > also have to find a way to enforce this preferable at compile time or > with static analysis, or at least with runtime assertions, because it > would be very easy to get wrong. Or that they are quiesced by .drained_begin() callbacks or is_external. Do you have a concrete example? > > > However, I hit on a problem that I think Emanuele and Paolo have already > > pointed out: draining is GS & IO. This might have worked under the 1 IOThread > > model but it does not make sense for multi-queue. It is possible to submit I/O > > requests in drained sections. How can multiple threads be in drained sections > > simultaneously and possibly submit further I/O requests in their drained > > sections? Those sections wouldn't be "drained" in any useful sense of the word. > > Drains asks other users not to touch the block node. Currently this only BTW interesting language choice here: you're using the more general definition of "other users" and "touch[ing] the block node" rather than the narrow definition of just "I/O requests". That's exactly how I think of drain but based on what you wrote above I'm not sure that's how you think of it too? 
> includes, but the desire to use drain for locking means that we need to > extend it to pretty much any operation on the node, which would include > calling drain on that block node. So you should never have two callers > in a drain section at the same time, it would be a bug in this model. Yes! Drain in its current form doesn't make sense for multi-queue since multiple threads could be in drained sections at the same time and they would all be allowed to submit new I/O requests. > Of course, we know that currently drain is not respected by all relevant > callers (most importantly, the monitor). If you want to use drain as a > form of locking, you need to solve this first. > > > It is necessary to adjust how recursive drained sections work: > > bdrv_drained_begin() must not return while there are deeper nested drained > > sections. > > > > This is allowed: > > > > Monitor command Block job > > --------------- --------- > > > bdrv_drained_begin() > > . > bdrv_drained_begin() > > . < bdrv_drained_begin() > > . . > > . . > > . > bdrv_drained_end() > > . < bdrv_drained_end() > > < bdrv_drained_begin() > > . > > . > > > bdrv_drained_end() > > < bdrv_drained_end() > > Just to make sure I understand the scenario that you have in mind here: > The monitor command is not what causes the block job to do the draining > because this is inside bdrv_drained_begin(), so the block job just > randomly got a callback that caused it to do so (maybe completion)? Yes, exactly that completion scenario. When the mirror block job completes exactly when a monitor command calls bdrv_drained_begin(), mirror_exit_common() deletes a temporary filter BDS node. It involves drain and modifies the graph. > > Before bdrv_drained_begin() returns, anything is still allowed to run, > so I agree that this is valid. Thanks for confirming! > > This is not allowed: > > > > Monitor command Block job > > --------------- --------- > > > bdrv_drained_begin() > > . > bdrv_drained_begin() > > . < bdrv_drained_begin() > > . . > > . . > > < bdrv_drained_begin() . > > . . > > . > bdrv_drained_end() > > . < bdrv_drained_end() > > > bdrv_drained_end() > > < bdrv_drained_end() > > > > This restriction creates an ordering between bdrv_drained_begin() callers. In > > this example the block job must not depend on the monitor command completing > > first. Otherwise there would be a deadlock just like with two threads wait for > > each other while holding a mutex. > > So essentially drain needs to increase bs->in_flight, so that the outer > drain has to wait for the inner drain section to end before its > bdrv_drained_begin() can return. > > > For sanity I think draining should be GS-only. IO code should not invoke > > bdrv_drained_begin() to avoid ordering problems when multiple threads drain > > simultaneously and have dependencies on each other. > > > > block/mirror.c needs to be modified because it currently drains from IO when > > mirroring is about to end. > > > > With this change to draining I think the logical argument for correctness with > > graph modifications holds. > > What is your suggestion how to modify mirror? It drains so that no new > requests can sneak in and source and target stay in sync while it > switches to the completion code running in the main AioContext. You > can't just drop this. I haven't read the code in enough depth to come up with a solution. > > > Enforcing GS/IO separation at compile time > > ------------------------------------------ > > Right now GS/IO asserts check assumptions at runtime. 
The next step may be > > using the type system to separate GS and IO APIs so it's impossible for IO code > > to accidentally call GS code, for example. > > > > typedef struct { > > BlockDriverState *bs; > > } BlockDriverStateIOPtr; > > > > typedef struct { > > BlockDriverState *bs; > > } BlockDriverStateGSPtr; > > > > Then APIs can be protected as follows: > > > > void bdrv_set_aio_context_ignore(BlockDriverStateGSPtr bs, ...); > > > > A function that only has a BlockDriverStateIOPtr cannot call > > bdrv_set_aio_context_ignore() by mistake since the compiler will complain that > > the first argument must be a BlockDriverStateGSPtr. > > And then you have a set of functions that cast from GS to IO pointers, > but not for the other way around? Or maybe getting as IO pointer would > even be coupled with automatically increasing bs->in_flight? > > The other option that we were looking into for this was static analysis. > I had hacked up a small script to check consistency of these annotations > (without covering function pointers, though) to help me with the review > of Emanuele's patches that introduced them. If I understand correctly, > Paolo has scripts to do the same a little bit better. > > As long as we can integrate such a script in 'make check', we wouldn't > necessarily have to have the churn in the C code to switch everything to > new wrapper types. Yes, that's the problem with the typedef Ptr approach. It would involve a lot of code changes. Maybe static analysis tools are better. > > > Possible steps for AioContext removal > > ------------------------------------- > > I also wanted to share my assumptions about multi-queue and AioContext removal. > > Please let me know if anything seems wrong or questionable: > > > > - IO code can execute in any thread that has an AioContext. > > - Multiple threads may execute a IO code at the same time. > > - GS code only execute under the BQL. > > > > For AioContext removal this means: > > > > - bdrv_get_aio_context() becomes mostly meaningless since there is no need for > > a special "home" AioContext. > > - bdrv_coroutine_enter() becomes mostly meaningless because there is no need to > > run a coroutine in the BDS's AioContext. > > - aio_disable_external(bdrv_get_aio_context(bs)) no longer works because many > > threads/AioContexts may submit new I/O requests. BlockDevOps.drained_begin() > > may be used instead (e.g. to temporarily disable ioeventfds on a multi-queue > > virtio-blk device). > > - AIO_WAIT_WHILE() simplifies to > > > > while ((cond)) { > > aio_poll(qemu_get_current_aio_context(), true); > > ... > > } > > Probably not exactly, because you still need aio_wait_kick() to wake up > the thread. We use AioWait.num_waiters != 0 to decide whether we need to > schedule a BH in the main thread (because doing so unconditionally for > every completed request would be very wasteful). > > If you want to be able to run this loop from any thread instead of just > the main thread, you would have to store somewhere which thread to wake. qemu_cond_broadcast() can be used since the wakeup is a slow path. > > and the distinction between home AioContext and non-home context is > > eliminated. AioContext unlocking is dropped. > > > > Does this make sense? I haven't seen these things in recent patch series. > > Intuitively I would agree with most. There may be details that I'm not > aware of at the moment. 
I'm not surprised that we haven't seen any such > things in recent patch series because there is an awful lot of > preparational work to be done before we can reach this final state. Yes. I'm mostly hoping to find out whether my view is fundamentally flawed or very different from what you and others think. Stefan
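To make the wrapper-type idea from this subthread slightly more concrete, a minimal sketch could look like the following; all names apart from bdrv_set_aio_context_ignore() are invented, and the cast helper only goes from GS to IO, matching the remark quoted above that there would be no conversion in the other direction:

/* Sketch only, with invented names. */
typedef struct BlockDriverState BlockDriverState;

typedef struct { BlockDriverState *bs; } BlockDriverStateGSPtr;
typedef struct { BlockDriverState *bs; } BlockDriverStateIOPtr;

/*
 * GS code (running under the BQL) may hand out IO pointers, but there is
 * no helper in the other direction, so IO code cannot reach GS-only APIs.
 */
static inline BlockDriverStateIOPtr bdrv_gs_to_io(BlockDriverStateGSPtr gs)
{
    return (BlockDriverStateIOPtr) { .bs = gs.bs };
}

/* A GS-only prototype: passing a BlockDriverStateIOPtr is a compile error. */
void bdrv_set_aio_context_ignore(BlockDriverStateGSPtr bs, ...);

Whether bdrv_gs_to_io() should also increase bs->in_flight, as suggested above, is left open here.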
Am 23.05.2022 um 17:13 hat Stefan Hajnoczi geschrieben: > On Mon, May 23, 2022 at 03:02:05PM +0200, Kevin Wolf wrote: > > Am 22.05.2022 um 17:06 hat Stefan Hajnoczi geschrieben: > > > Hi, > > > Sorry for the long email. I've included three topics that may help us discuss > > > draining and AioContext removal further. > > > > > > The shortcomings of drain > > > ------------------------- > > > I wanted to explore the logical argument that making graph modifications within > > > a drained section is correct: > > > - Graph modifications and BB/BDS lookup are Global State (GS). > > > - Graph traversal from a BB/BDS pointer is IO. > > > - Only one thread executes GS code at a time. > > > - IO is quiesced within a drained section. > > > > I think you're mixing two things here and calling both of them IO. > > > > If a function is marked as IO, that means that it is safe to call from > > I/O requests, which may be running in any iothread (they currently only > > run in the home thread of the node's AioContext, but the function > > annotations already promise that any thread is safe). > > > > However, if a function is marked as IO, this doesn't necessarily mean > > that it is always only called in the context of an I/O request. It can > > be called by any other code, too. > > > > So while drain does quiesce all I/O requests, this doesn't mean that > > functions marked as IO won't run any more. > > My model is that between bdrv_drained_begin() and bdrv_drained_end() > only the caller is allowed to invoke BB/BDS functions (including > functions marked IO). Okay, this sounds better. It's not actually related to IO/GS, but to BB/BDS functions, including both IO and GS functions. So graph traversal from a BB/BDS pointer makes it a BB/BDS function, and BB/BDS functions need to be quiesced within a drained section. > The caller isn't strictly one thread and one or no coroutines. The > caller could use threads and coroutines, but the important thing is > that everything else that accesses the BB/BDS is suspended due to > .drained_begin() callbacks (and is_external). > > So when you say a function marked IO can be called from outside an I/O > request, that "outside an I/O request" code must be quiesced too. > Otherwise drain is definitely not safe for graph modifications. If you phrase it as a condition and as a goal to achieve, then I agree that this is required when you want to use drain for locking. My impression was that you were using this to argue that the code is already doing this and is already safe in this scenario, and this isn't true. I think I misunderstood you there and you were really describing the future state that you're envisioning. > > > - Therefore a drained section in GS code suspends graph traversal, other graph > > > modifications, and BB/BDS lookup. > > > - Therefore it is safe to modify the graph from a GS drained section. > > > > So unfortunately, I don't think this conclusion is correct. > > > > In order to make your assumption true, we would have to require that all > > callers of IO functions must have called bdrv_inc_in_flight(). We would > > also have to find a way to enforce this preferable at compile time or > > with static analysis, or at least with runtime assertions, because it > > would be very easy to get wrong. > > Or that they are quiesced by .drained_begin() callbacks or is_external. > > Do you have a concrete example? Yes, you're right, bdrv_inc_in_flight() is only one way to achieve this. 
They just need to make bdrv_drain_poll() return true as long as they are active so that bdrv_drained_begin() waits for them. .drained_poll() is another valid way to achieve this. However, if we want to rely on static analysis to verify that everything is covered, always requiring bdrv_inc_in_flight() might make this easier. So possibly we want to require it even in cases where .drained_poll() or aio_disable_external() would be enough in theory. > > > > > However, I hit on a problem that I think Emanuele and Paolo have already > > > pointed out: draining is GS & IO. This might have worked under the 1 IOThread > > > model but it does not make sense for multi-queue. It is possible to submit I/O > > > requests in drained sections. How can multiple threads be in drained sections > > > simultaneously and possibly submit further I/O requests in their drained > > > sections? Those sections wouldn't be "drained" in any useful sense of the word. > > > > Drains asks other users not to touch the block node. Currently this only > > BTW interesting language choice here: you're using the more general > definition of "other users" and "touch[ing] the block node" rather than > the narrow definition of just "I/O requests". That's exactly how I think > of drain but based on what you wrote above I'm not sure that's how you > think of it too? I hope that my reply above made it clearer: The broader definition is what is needed if we want to use drain to replace the AioContext lock for protecting the graph, but the narrow definition is what is implemented today. > > includes, but the desire to use drain for locking means that we need to > > extend it to pretty much any operation on the node, which would include > > calling drain on that block node. So you should never have two callers > > in a drain section at the same time, it would be a bug in this model. > > Yes! Drain in its current form doesn't make sense for multi-queue since > multiple threads could be in drained sections at the same time and they > would all be allowed to submit new I/O requests. Nobody would be allowed to submit new requests (because someone else has drained the node), but they would do so anyway. ;-) Actually, only half joking, because this shows how weak the protection by drain is if we don't have a way to verify that the whole codebase supports drain correctly. > > Of course, we know that currently drain is not respected by all relevant > > callers (most importantly, the monitor). If you want to use drain as a > > form of locking, you need to solve this first. > > > > > It is necessary to adjust how recursive drained sections work: > > > bdrv_drained_begin() must not return while there are deeper nested drained > > > sections. > > > > > > This is allowed: > > > > > > Monitor command Block job > > > --------------- --------- > > > > bdrv_drained_begin() > > > . > bdrv_drained_begin() > > > . < bdrv_drained_begin() > > > . . > > > . . > > > . > bdrv_drained_end() > > > . < bdrv_drained_end() > > > < bdrv_drained_begin() > > > . > > > . > > > > bdrv_drained_end() > > > < bdrv_drained_end() > > > > Just to make sure I understand the scenario that you have in mind here: > > The monitor command is not what causes the block job to do the draining > > because this is inside bdrv_drained_begin(), so the block job just > > randomly got a callback that caused it to do so (maybe completion)? > > Yes, exactly that completion scenario. 
When the mirror block job > completes exactly when a monitor command calls bdrv_drained_begin(), > mirror_exit_common() deletes a temporary filter BDS node. It involves > drain and modifies the graph. > > > > > Before bdrv_drained_begin() returns, anything is still allowed to run, > > so I agree that this is valid. > > Thanks for confirming! > > > > This is not allowed: > > > > > > Monitor command Block job > > > --------------- --------- > > > > bdrv_drained_begin() > > > . > bdrv_drained_begin() > > > . < bdrv_drained_begin() > > > . . > > > . . > > > < bdrv_drained_begin() . > > > . . > > > . > bdrv_drained_end() > > > . < bdrv_drained_end() > > > > bdrv_drained_end() > > > < bdrv_drained_end() > > > > > > This restriction creates an ordering between bdrv_drained_begin() callers. In > > > this example the block job must not depend on the monitor command completing > > > first. Otherwise there would be a deadlock just like with two threads wait for > > > each other while holding a mutex. > > > > So essentially drain needs to increase bs->in_flight, so that the outer > > drain has to wait for the inner drain section to end before its > > bdrv_drained_begin() can return. > > > > > For sanity I think draining should be GS-only. IO code should not invoke > > > bdrv_drained_begin() to avoid ordering problems when multiple threads drain > > > simultaneously and have dependencies on each other. > > > > > > block/mirror.c needs to be modified because it currently drains from IO when > > > mirroring is about to end. > > > > > > With this change to draining I think the logical argument for correctness with > > > graph modifications holds. > > > > What is your suggestion how to modify mirror? It drains so that no new > > requests can sneak in and source and target stay in sync while it > > switches to the completion code running in the main AioContext. You > > can't just drop this. > > I haven't read the code in enough depth to come up with a solution. So this sounds like it needs more thought before we can assume that we'll have a final state where drain is GS-only. > > > > > Enforcing GS/IO separation at compile time > > > ------------------------------------------ > > > Right now GS/IO asserts check assumptions at runtime. The next step may be > > > using the type system to separate GS and IO APIs so it's impossible for IO code > > > to accidentally call GS code, for example. > > > > > > typedef struct { > > > BlockDriverState *bs; > > > } BlockDriverStateIOPtr; > > > > > > typedef struct { > > > BlockDriverState *bs; > > > } BlockDriverStateGSPtr; > > > > > > Then APIs can be protected as follows: > > > > > > void bdrv_set_aio_context_ignore(BlockDriverStateGSPtr bs, ...); > > > > > > A function that only has a BlockDriverStateIOPtr cannot call > > > bdrv_set_aio_context_ignore() by mistake since the compiler will complain that > > > the first argument must be a BlockDriverStateGSPtr. > > > > And then you have a set of functions that cast from GS to IO pointers, > > but not for the other way around? Or maybe getting as IO pointer would > > even be coupled with automatically increasing bs->in_flight? > > > > The other option that we were looking into for this was static analysis. > > I had hacked up a small script to check consistency of these annotations > > (without covering function pointers, though) to help me with the review > > of Emanuele's patches that introduced them. If I understand correctly, > > Paolo has scripts to do the same a little bit better. 
> > > > As long as we can integrate such a script in 'make check', we wouldn't > > necessarily have to have the churn in the C code to switch everything to > > new wrapper types. > > Yes, that's the problem with the typedef Ptr approach. It would involve > a lot of code changes. Maybe static analysis tools are better. > > > > > > Possible steps for AioContext removal > > > ------------------------------------- > > > I also wanted to share my assumptions about multi-queue and AioContext removal. > > > Please let me know if anything seems wrong or questionable: > > > > > > - IO code can execute in any thread that has an AioContext. > > > - Multiple threads may execute a IO code at the same time. > > > - GS code only execute under the BQL. > > > > > > For AioContext removal this means: > > > > > > - bdrv_get_aio_context() becomes mostly meaningless since there is no need for > > > a special "home" AioContext. > > > - bdrv_coroutine_enter() becomes mostly meaningless because there is no need to > > > run a coroutine in the BDS's AioContext. > > > - aio_disable_external(bdrv_get_aio_context(bs)) no longer works because many > > > threads/AioContexts may submit new I/O requests. BlockDevOps.drained_begin() > > > may be used instead (e.g. to temporarily disable ioeventfds on a multi-queue > > > virtio-blk device). > > > - AIO_WAIT_WHILE() simplifies to > > > > > > while ((cond)) { > > > aio_poll(qemu_get_current_aio_context(), true); > > > ... > > > } > > > > Probably not exactly, because you still need aio_wait_kick() to wake up > > the thread. We use AioWait.num_waiters != 0 to decide whether we need to > > schedule a BH in the main thread (because doing so unconditionally for > > every completed request would be very wasteful). > > > > If you want to be able to run this loop from any thread instead of just > > the main thread, you would have to store somewhere which thread to wake. > > qemu_cond_broadcast() can be used since the wakeup is a slow path. I don't think we have a QemuCond for waking up a blocking aio_poll()? Doesn't this usually involve writing to the event notifier file descriptor of the specific AioContext? > > > and the distinction between home AioContext and non-home context is > > > eliminated. AioContext unlocking is dropped. > > > > > > Does this make sense? I haven't seen these things in recent patch series. > > > > Intuitively I would agree with most. There may be details that I'm not > > aware of at the moment. I'm not surprised that we haven't seen any such > > things in recent patch series because there is an awful lot of > > preparational work to be done before we can reach this final state. > > Yes. I'm mostly hoping to find out whether my view in fundamentally > flawed or very different from what you and others think. No, not fundamentally. The big challenge in my mind is how to verify that the conditions are actually met. I'm fairly sure that using drain this way is by far too complex to rely on review only. Kevin
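A rough sketch of the "drain needs to increase bs->in_flight" idea discussed above, with the bookkeeping deliberately hand-waved (the poll condition below assumes the caller opened exactly one drained section on this node; a real version would need proper accounting):

/*
 * Illustrative only, not the current implementation: an open drained
 * section counts as in-flight work, so an outer bdrv_drained_begin()
 * cannot return while an inner drained section is still open.
 */
void bdrv_drained_begin(BlockDriverState *bs)
{
    bdrv_inc_in_flight(bs);     /* any other drain now has to wait for us */

    /* ... existing logic: quiesce parents/children, .drained_begin()
     * callbacks, aio_disable_external() ... */

    /* Wait until the only remaining "request" is our own section. */
    BDRV_POLL_WHILE(bs, qatomic_read(&bs->in_flight) > 1);
}

void bdrv_drained_end(BlockDriverState *bs)
{
    /* ... existing logic: .drained_end() callbacks, resume externals ... */

    bdrv_dec_in_flight(bs);     /* may let an outer drain make progress */
}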
On Mon, May 23, 2022 at 06:04:55PM +0200, Kevin Wolf wrote: > Am 23.05.2022 um 17:13 hat Stefan Hajnoczi geschrieben: > > On Mon, May 23, 2022 at 03:02:05PM +0200, Kevin Wolf wrote: > > > Am 22.05.2022 um 17:06 hat Stefan Hajnoczi geschrieben: > > > > Hi, > > > > Sorry for the long email. I've included three topics that may help us discuss > > > > draining and AioContext removal further. > > > > > > > > The shortcomings of drain > > > > ------------------------- > > > > I wanted to explore the logical argument that making graph modifications within > > > > a drained section is correct: > > > > - Graph modifications and BB/BDS lookup are Global State (GS). > > > > - Graph traversal from a BB/BDS pointer is IO. > > > > - Only one thread executes GS code at a time. > > > > - IO is quiesced within a drained section. > > > > > > I think you're mixing two things here and calling both of them IO. > > > > > > If a function is marked as IO, that means that it is safe to call from > > > I/O requests, which may be running in any iothread (they currently only > > > run in the home thread of the node's AioContext, but the function > > > annotations already promise that any thread is safe). > > > > > > However, if a function is marked as IO, this doesn't necessarily mean > > > that it is always only called in the context of an I/O request. It can > > > be called by any other code, too. > > > > > > So while drain does quiesce all I/O requests, this doesn't mean that > > > functions marked as IO won't run any more. > > > > My model is that between bdrv_drained_begin() and bdrv_drained_end() > > only the caller is allowed to invoke BB/BDS functions (including > > functions marked IO). > > Okay, this sounds better. It's not actually related to IO/GS, but to > BB/BDS functions, including both IO and GS functions. > > So graph traversal from a BB/BDS pointer makes it a BB/BDS function, and > BB/BDS functions need to be quiesced within a drained section. > > > The caller isn't strictly one thread and one or no coroutines. The > > caller could use threads and coroutines, but the important thing is > > that everything else that accesses the BB/BDS is suspended due to > > .drained_begin() callbacks (and is_external). > > > > So when you say a function marked IO can be called from outside an I/O > > request, that "outside an I/O request" code must be quiesced too. > > Otherwise drain is definitely not safe for graph modifications. > > If you phrase it as a condition and as a goal to achieve, then I agree > that this is required when you want to use drain for locking. > > My impression was that you were using this to argue that the code is > already doing this and is already safe in this scenario, and this isn't > true. I think I misunderstood you there and you were really describing > the future state that you're envisioning. Sorry, I didn't present it clearly. I'm trying to think through the model needed to make graph modifications under drain safe. > > > > - Therefore a drained section in GS code suspends graph traversal, other graph > > > > modifications, and BB/BDS lookup. > > > > - Therefore it is safe to modify the graph from a GS drained section. > > > > > > So unfortunately, I don't think this conclusion is correct. > > > > > > In order to make your assumption true, we would have to require that all > > > callers of IO functions must have called bdrv_inc_in_flight(). 
We would > > > also have to find a way to enforce this preferable at compile time or > > > with static analysis, or at least with runtime assertions, because it > > > would be very easy to get wrong. > > > > Or that they are quiesced by .drained_begin() callbacks or is_external. > > > > Do you have a concrete example? > > Yes, you're right, bdrv_inc_in_flight() is only one way to achieve this. > They just need to make bdrv_drain_poll() return true as long as they are > active so that bdrv_drained_begin() waits for them. .drained_poll() is > another valid way to achieve this. > > However, if we want to rely on static analysis to verify that everything > is covered, always requiring bdrv_inc_in_flight() might make this > easier. So possibly we want to require it even in cases where > .drained_poll() or aio_disable_external() would be enough in theory. > > > > > > > > However, I hit on a problem that I think Emanuele and Paolo have already > > > > pointed out: draining is GS & IO. This might have worked under the 1 IOThread > > > > model but it does not make sense for multi-queue. It is possible to submit I/O > > > > requests in drained sections. How can multiple threads be in drained sections > > > > simultaneously and possibly submit further I/O requests in their drained > > > > sections? Those sections wouldn't be "drained" in any useful sense of the word. > > > > > > Drains asks other users not to touch the block node. Currently this only > > > > BTW interesting language choice here: you're using the more general > > definition of "other users" and "touch[ing] the block node" rather than > > the narrow definition of just "I/O requests". That's exactly how I think > > of drain but based on what you wrote above I'm not sure that's how you > > think of it too? > > I hope that my reply above made it clearer: The broader definition is > what is needed if we want to use drain to replace the AioContext lock > for protecting the graph, but the narrow definition is what is > implemented today. Okay. > > > includes, but the desire to use drain for locking means that we need to > > > extend it to pretty much any operation on the node, which would include > > > calling drain on that block node. So you should never have two callers > > > in a drain section at the same time, it would be a bug in this model. > > > > Yes! Drain in its current form doesn't make sense for multi-queue since > > multiple threads could be in drained sections at the same time and they > > would all be allowed to submit new I/O requests. > > Nobody would be allowed to submit new requests (because someone else has > drained the node), but they would do so anyway. ;-) > > Actually, only half joking, because this shows how weak the protection > by drain is if we don't have a way to verify that the whole codebase > supports drain correctly. > > > > Of course, we know that currently drain is not respected by all relevant > > > callers (most importantly, the monitor). If you want to use drain as a > > > form of locking, you need to solve this first. > > > > > > > It is necessary to adjust how recursive drained sections work: > > > > bdrv_drained_begin() must not return while there are deeper nested drained > > > > sections. > > > > > > > > This is allowed: > > > > > > > > Monitor command Block job > > > > --------------- --------- > > > > > bdrv_drained_begin() > > > > . > bdrv_drained_begin() > > > > . < bdrv_drained_begin() > > > > . . > > > > . . > > > > . > bdrv_drained_end() > > > > . 
< bdrv_drained_end() > > > > < bdrv_drained_begin() > > > > . > > > > . > > > > > bdrv_drained_end() > > > > < bdrv_drained_end() > > > > > > Just to make sure I understand the scenario that you have in mind here: > > > The monitor command is not what causes the block job to do the draining > > > because this is inside bdrv_drained_begin(), so the block job just > > > randomly got a callback that caused it to do so (maybe completion)? > > > > Yes, exactly that completion scenario. When the mirror block job > > completes exactly when a monitor command calls bdrv_drained_begin(), > > mirror_exit_common() deletes a temporary filter BDS node. It involves > > drain and modifies the graph. > > > > > > > > Before bdrv_drained_begin() returns, anything is still allowed to run, > > > so I agree that this is valid. > > > > Thanks for confirming! > > > > > > This is not allowed: > > > > > > > > Monitor command Block job > > > > --------------- --------- > > > > > bdrv_drained_begin() > > > > . > bdrv_drained_begin() > > > > . < bdrv_drained_begin() > > > > . . > > > > . . > > > > < bdrv_drained_begin() . > > > > . . > > > > . > bdrv_drained_end() > > > > . < bdrv_drained_end() > > > > > bdrv_drained_end() > > > > < bdrv_drained_end() > > > > > > > > This restriction creates an ordering between bdrv_drained_begin() callers. In > > > > this example the block job must not depend on the monitor command completing > > > > first. Otherwise there would be a deadlock just like with two threads wait for > > > > each other while holding a mutex. > > > > > > So essentially drain needs to increase bs->in_flight, so that the outer > > > drain has to wait for the inner drain section to end before its > > > bdrv_drained_begin() can return. > > > > > > > For sanity I think draining should be GS-only. IO code should not invoke > > > > bdrv_drained_begin() to avoid ordering problems when multiple threads drain > > > > simultaneously and have dependencies on each other. > > > > > > > > block/mirror.c needs to be modified because it currently drains from IO when > > > > mirroring is about to end. > > > > > > > > With this change to draining I think the logical argument for correctness with > > > > graph modifications holds. > > > > > > What is your suggestion how to modify mirror? It drains so that no new > > > requests can sneak in and source and target stay in sync while it > > > switches to the completion code running in the main AioContext. You > > > can't just drop this. > > > > I haven't read the code in enough depth to come up with a solution. > > So this sounds like it needs more thought before we can assume that > we'll have a final state where drain is GS-only. I will look into solutions for mirror but it might be a few days before I have time. > > > > > > > Enforcing GS/IO separation at compile time > > > > ------------------------------------------ > > > > Right now GS/IO asserts check assumptions at runtime. The next step may be > > > > using the type system to separate GS and IO APIs so it's impossible for IO code > > > > to accidentally call GS code, for example. 
> > > > > > > > typedef struct { > > > > BlockDriverState *bs; > > > > } BlockDriverStateIOPtr; > > > > > > > > typedef struct { > > > > BlockDriverState *bs; > > > > } BlockDriverStateGSPtr; > > > > > > > > Then APIs can be protected as follows: > > > > > > > > void bdrv_set_aio_context_ignore(BlockDriverStateGSPtr bs, ...); > > > > > > > > A function that only has a BlockDriverStateIOPtr cannot call > > > > bdrv_set_aio_context_ignore() by mistake since the compiler will complain that > > > > the first argument must be a BlockDriverStateGSPtr. > > > > > > And then you have a set of functions that cast from GS to IO pointers, > > > but not for the other way around? Or maybe getting as IO pointer would > > > even be coupled with automatically increasing bs->in_flight? > > > > > > The other option that we were looking into for this was static analysis. > > > I had hacked up a small script to check consistency of these annotations > > > (without covering function pointers, though) to help me with the review > > > of Emanuele's patches that introduced them. If I understand correctly, > > > Paolo has scripts to do the same a little bit better. > > > > > > As long as we can integrate such a script in 'make check', we wouldn't > > > necessarily have to have the churn in the C code to switch everything to > > > new wrapper types. > > > > Yes, that's the problem with the typedef Ptr approach. It would involve > > a lot of code changes. Maybe static analysis tools are better. > > > > > > > > > Possible steps for AioContext removal > > > > ------------------------------------- > > > > I also wanted to share my assumptions about multi-queue and AioContext removal. > > > > Please let me know if anything seems wrong or questionable: > > > > > > > > - IO code can execute in any thread that has an AioContext. > > > > - Multiple threads may execute a IO code at the same time. > > > > - GS code only execute under the BQL. > > > > > > > > For AioContext removal this means: > > > > > > > > - bdrv_get_aio_context() becomes mostly meaningless since there is no need for > > > > a special "home" AioContext. > > > > - bdrv_coroutine_enter() becomes mostly meaningless because there is no need to > > > > run a coroutine in the BDS's AioContext. > > > > - aio_disable_external(bdrv_get_aio_context(bs)) no longer works because many > > > > threads/AioContexts may submit new I/O requests. BlockDevOps.drained_begin() > > > > may be used instead (e.g. to temporarily disable ioeventfds on a multi-queue > > > > virtio-blk device). > > > > - AIO_WAIT_WHILE() simplifies to > > > > > > > > while ((cond)) { > > > > aio_poll(qemu_get_current_aio_context(), true); > > > > ... > > > > } > > > > > > Probably not exactly, because you still need aio_wait_kick() to wake up > > > the thread. We use AioWait.num_waiters != 0 to decide whether we need to > > > schedule a BH in the main thread (because doing so unconditionally for > > > every completed request would be very wasteful). > > > > > > If you want to be able to run this loop from any thread instead of just > > > the main thread, you would have to store somewhere which thread to wake. > > > > qemu_cond_broadcast() can be used since the wakeup is a slow path. > > I don't think we have a QemuCond for waking up a blocking aio_poll()? > Doesn't this usually involve writing to the event notifier file > descriptor of the specific AioContext? You're right, I forgot that the waiter is sitting in their own aio_poll() so we cannot use qemu_cond_wait(). 
Another mechanism would be necessary. > > > > and the distinction between home AioContext and non-home context is > > > > eliminated. AioContext unlocking is dropped. > > > > > > > > Does this make sense? I haven't seen these things in recent patch series. > > > > > > Intuitively I would agree with most. There may be details that I'm not > > > aware of at the moment. I'm not surprised that we haven't seen any such > > > things in recent patch series because there is an awful lot of > > > preparational work to be done before we can reach this final state. > > > > Yes. I'm mostly hoping to find out whether my view in fundamentally > > flawed or very different from what you and others think. > > No, not fundamentally. > > The big challenge in my mind is how to verify that the conditions are > actually met. I'm fairly sure that using drain this way is by far too > complex to rely on review only. I agree there needs to be some kind of automated checker because relying on code review won't work. Stefan
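One possible shape of such a mechanism, purely as a sketch with an invented AioWaiter struct: the waiter records the AioContext it polls in, and the kick schedules a no-op bottom half there, which is enough to make a blocking aio_poll() return.

/*
 * Sketch only: per-waiter kick for an AIO_WAIT_WHILE() that may run in any
 * thread.  AioWaiter and its field are invented for illustration.
 */
typedef struct AioWaiter {
    AioContext *ctx;            /* the context this waiter is polling in */
} AioWaiter;

static void aio_waiter_dummy_bh(void *opaque)
{
    /* nothing to do: the point is only that aio_poll() wakes up */
}

static void aio_waiter_kick(AioWaiter *w)
{
    aio_bh_schedule_oneshot(w->ctx, aio_waiter_dummy_bh, NULL);
}

aio_notify(w->ctx) would work as well; the harder part is keeping track of which waiters currently exist.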
On 5/22/22 17:06, Stefan Hajnoczi wrote: > However, I hit on a problem that I think Emanuele and Paolo have already > pointed out: draining is GS & IO. This might have worked under the 1 IOThread > model but it does not make sense for multi-queue. It is possible to submit I/O > requests in drained sections. How can multiple threads be in drained sections > simultaneously and possibly submit further I/O requests in their drained > sections? Those sections wouldn't be "drained" in any useful sense of the word. Yeah, that works only if the drained sections are well-behaved. "External" sources of I/O are fine; they are disabled using is_external, and don't drain themselves I think. Mirror is the only I/O user of drain, and it's fine because it never submits I/O to the drained BDS. Drained sections in the main thread can be special cased to allow I/O (wrlock in this series would also allow I/O). So I think that the "cooperation from all relevant places" that Kevin mentioned is already there, except for coroutine commands in the monitor. Those are a bad idea in my opinion and I'd rather revert commit eb94b81a94 ("block: Convert 'block_resize' to coroutine") until we have a clearer idea of how to handle them. I agree that it's basically impossible to review the change. On the other hand, there's already a substantial amount of faith involved in the correctness of the current code. In particular the AioContext lock does absolutely nothing to protect coroutines in the main thread against graph changes---both from the monitor (including BHs as in "block: Fix BB.root changing across bdrv_next()") and from BDS coroutines. The former are unprotected; the latter are protected by drain only: using drain to protect against graph writes would be a matter of extending *existing* faith to the multi-iothread case. Once the deadlock is broken, we can proceed to remove the AioContext lock and then introduce actual coroutine-based locking. > Possible steps for AioContext removal > ------------------------------------- > I also wanted to share my assumptions about multi-queue and AioContext removal. > Please let me know if anything seems wrong or questionable: > > - IO code can execute in any thread that has an AioContext. > - Multiple threads may execute a IO code at the same time. > - GS code only execute under the BQL. > > For AioContext removal this means: > > - bdrv_get_aio_context() becomes mostly meaningless since there is no need for > a special "home" AioContext. Correct. bdrv_set_aio_context() remains useful as a way to set a home AioContext for sockets. > - bdrv_coroutine_enter() becomes mostly meaningless because there is no need to > run a coroutine in the BDS's AioContext. > - aio_disable_external(bdrv_get_aio_context(bs)) no longer works because many > threads/AioContexts may submit new I/O requests. BlockDevOps.drained_begin() > may be used instead (e.g. to temporarily disable ioeventfds on a multi-queue > virtio-blk device). This is a change that can be done independent of this work. > - AIO_WAIT_WHILE() simplifies to > > while ((cond)) { > aio_poll(qemu_get_current_aio_context(), true); > ... > } > > and the distinction between home AioContext and non-home context is > eliminated. AioContext unlocking is dropped. (I'll reply on this from elsewhere in the thread). > Does this make sense? I haven't seen these things in recent patch series. I agree, and yeah all these are blocked on protecting graph modifications. 
In parallel to the block layer discussions, it's possible to work on introducing a request queue lock in virtio-blk and virtio-scsi. That's the only thing that relies on the AioContext lock outside the block layer. Paolo
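For reference, a BlockDevOps-based replacement for aio_disable_external(), as mentioned earlier in the thread, could look roughly like this; the two *_host_notifiers() helpers are hypothetical, while the BlockDevOps callbacks and blk_set_dev_ops() exist today:

/*
 * Sketch: with multi-queue there is no single AioContext whose external
 * handlers could be disabled, so the device quiesces itself when its
 * BlockBackend is drained.
 */
static void virtio_blk_drained_begin(void *opaque)
{
    VirtIOBlock *s = opaque;

    virtio_blk_detach_host_notifiers(s);    /* hypothetical helper */
}

static void virtio_blk_drained_end(void *opaque)
{
    VirtIOBlock *s = opaque;

    virtio_blk_attach_host_notifiers(s);    /* hypothetical helper */
}

static const BlockDevOps virtio_blk_block_ops = {
    .drained_begin = virtio_blk_drained_begin,
    .drained_end   = virtio_blk_drained_end,
};

/* registered at realize time: blk_set_dev_ops(s->blk, &virtio_blk_block_ops, s); */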
On Tue, May 24, 2022 at 09:55:39AM +0200, Paolo Bonzini wrote: > On 5/22/22 17:06, Stefan Hajnoczi wrote: > > However, I hit on a problem that I think Emanuele and Paolo have already > > pointed out: draining is GS & IO. This might have worked under the 1 IOThread > > model but it does not make sense for multi-queue. It is possible to submit I/O > > requests in drained sections. How can multiple threads be in drained sections > > simultaneously and possibly submit further I/O requests in their drained > > sections? Those sections wouldn't be "drained" in any useful sense of the word. > > Yeah, that works only if the drained sections are well-behaved. > > "External" sources of I/O are fine; they are disabled using is_external, and > don't drain themselves I think. I/O requests for a given BDS may be executing in multiple AioContexts, so how do you call aio_disable_external() on all relevant AioContexts? We covered this below but I wanted to reply here in case someone else reads this part without reading the rest. > Mirror is the only I/O user of drain, and it's fine because it never submits > I/O to the drained BDS. > > Drained sections in the main thread can be special cased to allow I/O > (wrlock in this series would also allow I/O). > > So I think that the "cooperation from all relevant places" that Kevin > mentioned is already there, except for coroutine commands in the monitor. > Those are a bad idea in my opinion and I'd rather revert commit eb94b81a94 > ("block: Convert 'block_resize' to coroutine") until we have a clearer idea > of how to handle them. > > I agree that it's basically impossible to review the change. On the other > hand, there's already a substantial amount of faith involved in the > correctness of the current code. > > In particular the AioContext lock does absolutely nothing to protect > corutines in the main thread against graph changes---both from the monitor > (including BHs as in "block: Fix BB.root changing across bdrv_next()") and > from BDS coroutines. The former are unprotected; the latter are protected > by drain only: using drain to protect against graph writes would be a matter > of extending *existing* faith to the multi-iothread case. > > Once the deadlock is broken, we can proceed to remove the AioContext lock > and then introduce actual coroutine-based locking. > > > Possible steps for AioContext removal > > ------------------------------------- > > I also wanted to share my assumptions about multi-queue and AioContext removal. > > Please let me know if anything seems wrong or questionable: > > > > - IO code can execute in any thread that has an AioContext. > > - Multiple threads may execute a IO code at the same time. > > - GS code only execute under the BQL. > > > > For AioContext removal this means: > > > > - bdrv_get_aio_context() becomes mostly meaningless since there is no need for > > a special "home" AioContext. > > Correct. bdrv_set_aio_context() remains useful as a way to set a home > AioContext for sockets. > > > - bdrv_coroutine_enter() becomes mostly meaningless because there is no need to > > run a coroutine in the BDS's AioContext. > > - aio_disable_external(bdrv_get_aio_context(bs)) no longer works because many > > threads/AioContexts may submit new I/O requests. BlockDevOps.drained_begin() > > may be used instead (e.g. to temporarily disable ioeventfds on a multi-queue > > virtio-blk device). > > This is a change that can be done independent of this work. 
> > > - AIO_WAIT_WHILE() simplifies to > > > > while ((cond)) { > > aio_poll(qemu_get_current_aio_context(), true); > > ... > > } > > > > and the distinction between home AioContext and non-home context is > > eliminated. AioContext unlocking is dropped. > > (I'll reply on this from elsewhere in the thread). > > > Does this make sense? I haven't seen these things in recent patch series. > > I agree, and yeah all these are blocked on protecting graph modifications. > > In parallel to the block layer discussions, it's possible to work on > introducing a request queue lock in virtio-blk and virtio-scsi. That's the > only thing that relies on the AioContext lock outside the block layer. I'm not sure what the request queue lock protects in virtio-blk? In virtio-scsi I guess a lock is needed to protect SCSI target emulation state? Stefan
On 5/24/22 10:08, Stefan Hajnoczi wrote: > On Tue, May 24, 2022 at 09:55:39AM +0200, Paolo Bonzini wrote: >> On 5/22/22 17:06, Stefan Hajnoczi wrote: >>> However, I hit on a problem that I think Emanuele and Paolo have already >>> pointed out: draining is GS & IO. This might have worked under the 1 IOThread >>> model but it does not make sense for multi-queue. It is possible to submit I/O >>> requests in drained sections. How can multiple threads be in drained sections >>> simultaneously and possibly submit further I/O requests in their drained >>> sections? Those sections wouldn't be "drained" in any useful sense of the word. >> >> Yeah, that works only if the drained sections are well-behaved. >> >> "External" sources of I/O are fine; they are disabled using is_external, and >> don't drain themselves I think. > > I/O requests for a given BDS may be executing in multiple AioContexts, > so how do you call aio_disable_external() on all relevant AioContexts? With multiqueue yeah, we have to replace aio_disable_external() with drained_begin/end() callbacks; but I'm not talking about that yet. >> In parallel to the block layer discussions, it's possible to work on >> introducing a request queue lock in virtio-blk and virtio-scsi. That's the >> only thing that relies on the AioContext lock outside the block layer. > > I'm not sure what the request queue lock protects in virtio-blk? In > virtio-scsi I guess a lock is needed to protect SCSI target emulation > state? Yes, but even in virtio-blk there is this code that runs in the main thread and is currently protected by aio_context_acquire/release: blk_drain(s->blk); /* We drop queued requests after blk_drain() because blk_drain() * itself can produce them. */ while (s->rq) { req = s->rq; s->rq = req->next; virtqueue_detach_element(req->vq, &req->elem, 0); virtio_blk_free_request(req); } Maybe it's safe to run it without a lock because it runs after virtio_set_status(vdev, 0) but I'd rather play it safe and protect s->rq with a lock. Paolo
On Tue, May 24, 2022 at 11:17:06AM +0200, Paolo Bonzini wrote: > On 5/24/22 10:08, Stefan Hajnoczi wrote: > > On Tue, May 24, 2022 at 09:55:39AM +0200, Paolo Bonzini wrote: > > > On 5/22/22 17:06, Stefan Hajnoczi wrote: > > > > However, I hit on a problem that I think Emanuele and Paolo have already > > > > pointed out: draining is GS & IO. This might have worked under the 1 IOThread > > > > model but it does not make sense for multi-queue. It is possible to submit I/O > > > > requests in drained sections. How can multiple threads be in drained sections > > > > simultaneously and possibly submit further I/O requests in their drained > > > > sections? Those sections wouldn't be "drained" in any useful sense of the word. > > > > > > Yeah, that works only if the drained sections are well-behaved. > > > > > > "External" sources of I/O are fine; they are disabled using is_external, and > > > don't drain themselves I think. > > > > I/O requests for a given BDS may be executing in multiple AioContexts, > > so how do you call aio_disable_external() on all relevant AioContexts? > > With multiqueue yeah, we have to replace aio_disable_external() with > drained_begin/end() callbacks; but I'm not talking about that yet. > > > > In parallel to the block layer discussions, it's possible to work on > > > introducing a request queue lock in virtio-blk and virtio-scsi. That's the > > > only thing that relies on the AioContext lock outside the block layer. > > > > I'm not sure what the request queue lock protects in virtio-blk? In > > virtio-scsi I guess a lock is needed to protect SCSI target emulation > > state? > > Yes, but even in virtio-blk there is this code that runs in the main thread > and is currently protected by aio_context_acquire/release: > > blk_drain(s->blk); > > /* We drop queued requests after blk_drain() because blk_drain() > * itself can produce them. */ > while (s->rq) { > req = s->rq; > s->rq = req->next; > virtqueue_detach_element(req->vq, &req->elem, 0); > virtio_blk_free_request(req); > } > > Maybe it's safe to run it without a lock because it runs after > virtio_set_status(vdev, 0) but I'd rather play it safe and protect s->rq > with a lock. What does the lock protect? A lock can prevent s->rq or req->vq corruption but it cannot prevent request leaks. This loop's job is to free all requests so there is no leak. If a lock is necessary then this code is already broken in a more fundamental way because it can leak. Stefan
Am 18.05.2022 um 14:43 hat Paolo Bonzini geschrieben: > On 5/18/22 14:28, Emanuele Giuseppe Esposito wrote: > > For example, all callers of bdrv_open() always take the AioContext lock. > > Often it is taken very high in the call stack, but it's always taken. > > I think it's actually not a problem of who takes the AioContext lock or > where; the requirements are contradictory: Okay, now that I have explained which challenges I see with the drain solution, I'd also like to understand the problem that even motivates you to go back to drains. > * IO_OR_GS_CODE() functions, when called from coroutine context, expect to > be called with the AioContext lock taken (example: bdrv_co_yield_to_drain) Am I right to say this is not inherently part of the definition of IO_OR_GS_CODE(), but just a property that these functions have in practice? In practice, the functions that are IO_OR_GS_CODE() are those that call AIO_WAIT_WHILE() - which drops the lock, so it must have been taken first. Of course, when calling from coroutine context, AIO_WAIT_WHILE() is wrong, so these functions all have a different code path for coroutines (or they aren't suitable to be called in coroutines at all). Using a different code path means that the restrictions from AIO_WAIT_WHILE() don't really apply any more and these functions become effectively IO_CODE() when called in a coroutine. (At least I'm not aware of any other piece of code apart from AIO_WAIT_WHILE() that makes a function IO_OR_GS_CODE().) Surrounding IO_CODE() with aio_context_acquire/release() is in the definition of IO_CODE(), so your assumption seems right. (Not sure if it's actually necessary in all cases and whether all callers do this correctly, but with this definition we have declared that expections to this are in fact bugs.) (You mention bdrv_co_yield_to_drain() as an example. I don't think it's a good example. The function isn't annotated, but it seems to me that the correct annotation would be IO_CODE() anyway, i.e. safe to call from any thread, not just IO_OR_GS_CODE().) > * to call these functions with the lock taken, the code has to run in the > BDS's home iothread. Attempts to do otherwise results in deadlocks (the > main loop's AIO_WAIT_WHILEs expect progress from the iothread, that cannot > happen without releasing the aiocontext lock) This problem can't happen in the main thread itself, AIO_WAIT_WHILE() is safe both in the home thread and the main thread (at least as long as you lock only once) because it temporarily drops the lock. It has also become the definition of IO_OR_GS_CODE(): This code has to run in the home thread or the main thread. Of course, above I just concluded that when called from coroutines, in practice IO_OR_GS_CODE() essentially turns into IO_CODE(). This is supposed to allow much more: * I/O API functions. These functions are thread-safe, and therefore * can run in any thread as long as the thread has called * aio_context_acquire/release(). Come to think of it, I believe that many of the functions we declared IO_CODE() are actually just IO_OR_GS_CODE() (at best; for iothreads, they certainly require running in the home thread, but the main thread allowed by IO_OR_GS_CODE() might not work). We have all the coroutine machinery so that the AioContext lock of the current thread is automatically released and reacquired across yield. However, this is the wrong AioContext when called from a different thread, so we need an additional lock - which should be dropped during yield, too, but it doesn't happen. 
Maybe this is really the scenario you mean with this point? Switching to drain for locking doesn't solve the problem, but only possibly defer it. In order to complete the multiqueue work, we need to make IO_CODE() functions actually conform to the definition of IO_CODE(). Do we have a plan what this should look like in the final state when all the multiqueue work is completed? Will it require replacing the more or less automatic AioContext lock handling that we currently have in coroutines with explicit unlock/lock around all yield points? I assume that some kind of lock will still have to be held and it wouldn't just disappear with the removal of the AioContext lock? Or will we only have coroutine locks which don't have this problem? > * running the code in the BDS's home iothread is not possible for > GLOBAL_STATE_CODE() functions (unless the BDS home iothread is the main > thread, but that cannot be guaranteed in general) There is nothing that stops GLOBAL_STATE_CODE() from scheduling work in the home iothread of a BDS and then waiting for it - or if it is a coroutine, even to reschedule itself into the BDS home thread temporarily. This is essentially what all of the generated_co_wrapper functions do, and the coroutine case is what bdrv_co_enter() does. So I'm not sure what you mean by this? Kevin
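The last paragraph above in code form, as a minimal sketch: aio_co_reschedule_self() is the existing helper for hopping between AioContexts, while the surrounding function and do_work_in_home_thread() are placeholders.

/*
 * Sketch: GS code running in a coroutine temporarily moves itself into the
 * home AioContext of bs and back, roughly what bdrv_co_enter()-based
 * wrappers achieve.
 */
static void coroutine_fn example_run_in_home_context(BlockDriverState *bs)
{
    AioContext *old_ctx = qemu_get_current_aio_context();
    AioContext *bs_ctx = bdrv_get_aio_context(bs);

    if (bs_ctx != old_ctx) {
        aio_co_reschedule_self(bs_ctx);     /* continue in bs's home thread */
    }

    do_work_in_home_thread(bs);             /* placeholder for the real work */

    if (bs_ctx != old_ctx) {
        aio_co_reschedule_self(old_ctx);    /* and move back */
    }
}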
Am 18.05.2022 um 14:28 hat Emanuele Giuseppe Esposito geschrieben: > label: // read till the end to see why I wrote this here > > I was hoping someone from the "No" party would answer to your question, > because as you know we reached this same conclusion together. > > We thought about dropping the drain for various reasons, the main one > (at least as far as I understood) is that we are not sure if something > can still happen in between drain_begin/end, and it is a little bit > confusing to use the same mechanism to block I/O and protect the graph. > > We then thought about implementing a rwlock. A rdlock would clarify what > we are protecting and who is using the lock. I had a rwlock draft > implementation sent in this thread, but this also lead to additional > problems. > Main problem was that this new lock would introduce nested event loops, > that together with such locking would just create deadlocks. > If readers are in coroutines and writers are not (because graph > operations are not running in coroutines), we have a lot of deadlocks. > If a writer has to take the lock, it must wait all other readers to > finish. And it does it by internally calling AIO_WAIT_WHILE, creating > nested event loop. We don't know what could execute when polling for > events, and for example another writer could be resumed. Why is this a problem? Your AIO_WAIT_WHILE() condition would be that there are neither readers nor writers, so you would just keep waiting until the other writer is done. > Ideally, we want writers in coroutines too. > > Additionally, many readers are running in what we call "mixed" > functions: usually implemented automatically with generated_co_wrapper > tag, they let a function (usually bdrv callback) run always in a > coroutine, creating one if necessary. For example, bdrv_flush() makes > sure hat bs->bdrv_co_flush() is always run in a coroutine. > Such mixed functions are used in other callbacks too, making it really > difficult to understand if we are in a coroutine or not, and mostly > important make rwlock usage very difficult. How do they make rwlock usage difficult? *goes back to old IRC discussions* Ah, the problem was not the AIO_WAIT_WHILE() while taking the lock, but taking the lock first and then running an AIO_WAIT_WHILE() inside the locked section. This is forbidden because the callbacks that run during AIO_WAIT_WHILE() may in turn wait for the lock that you own, causing a deadlock. This is indeed a problem that running in coroutines would avoid because the inner waiter would just yield and the outer one could complete its job as soon as it's its turn. My conclusion in the IRC discussion was that maybe we need to take the graph locks when we're entering coroutine context, i.e. the "mixed" functions would rdlock the graph when called from non-coroutine context and would assume that it's already locked when called from coroutine context. > Which lead us to stepping back once more and try to convert all > BlockDriverState callbacks in coroutines. This would greatly simplify > rwlock usage, because we could make the rwlock coroutine-frendly > (without any AIO_WAIT_WHILE, allowing a writer to wait for readers by > just yielding and queuing itself in coroutine queues). > > First step was then to convert all callbacks in coroutines, using > generated_coroutine_wrapper (g_c_w). 
> A typical g_c_w is implemented in this way: > if (qemu_in_coroutine()) { > callback(); > } else { // much simplified > co = qemu_coroutine_create(callback); > bdrv_coroutine_enter(bs, co); > BDRV_POLL_WHILE(bs, coroutine_in_progress); > } > Once all callbacks are implemented using g_c_w, we can start splitting > the two sides of the if function to only create a coroutine when we are > outside from a bdrv callback. > > However, we immediately found a problem while starting to convert the > first callbacks: the AioContext lock is taken around some non coroutine > callbacks! For example, bs->bdrv_open() is always called with the > AioContext lock taken. In addition, callbacks like bdrv_open are > graph-modifying functions, which is probably why we are taking the > Aiocontext lock, and they do not like to run in coroutines. > Anyways, the real problem comes when we create a coroutine in such > places where the AioContext lock is taken and we have a graph-modifying > function. > > bdrv_coroutine_enter() calls aio_co_enter(), which in turns first checks > if the coroutine is entering another context from the current (which is > not the case for open) and if we are already in coroutine (for sure > not). Therefore it resorts to the following calls; > aio_context_acquire(ctx); > qemu_aio_coroutine_enter(ctx, co); > aio_context_release(ctx); > Which is clearly a problem, because we are taking the lock twice: once > from the original caller of the callback, and once here due to the > coroutine. This creates a lot of deadlock situations. What are the deadlock situations that are created by locking twice? The only problem I'm aware of is AIO_WAIT_WHILE(), which wants to temporarily unlock the AioContext. It calls aio_context_release() once to achieve this, which obviously isn't enough when the context was locked twice. But AIO_WAIT_WHILE() isn't allowed in coroutines anyway. So how are we running into deadlocks here? Note that we're probably already doing this inside the .bdrv_open implementations: They will usually read something from the image file, calling bdrv_preadv() which is already a generated_coroutine_wrapper today and creates a coroutine internally with the same locking pattern applied that you describe as problematic here. Making .bdrv_open itself a generated_coroutine_wrapper wouldn't really change anything fundamental, it would just pull the existing mechanism one function higher in the call stack. > For example, all callers of bdrv_open() always take the AioContext lock. > Often it is taken very high in the call stack, but it's always taken. > > Getting rid of the lock around qemu_aio_coroutine_enter() is difficult > too, because coroutines expect to have the lock taken. For example, if > we want to drain from a coroutine, bdrv_co_yield_to_drain releases the > lock for us. It's not difficult at all in your case where you know that you're already in the right thread and the lock is taken: You can call qemu_aio_coroutine_enter() directly instead of bdrv_coroutine_enter() in this case. But as I said, I'm not sure why we need to get rid of it at all. Kevin
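Spelled out as a sketch, the suggestion at the end could look like this (names simplified and invented; the point is only that a caller which already holds the AioContext lock and runs in bs's home thread enters the coroutine directly, so the lock is not taken a second time):

/*
 * Sketch of a g_c_w-style wrapper for callers that already hold the
 * AioContext lock of bs and run in its home thread.
 */
typedef struct {
    BlockDriverState *bs;
    int ret;
    bool done;
} ExampleCallbackData;

static void coroutine_fn example_callback_entry(void *opaque)
{
    ExampleCallbackData *data = opaque;

    data->ret = data->bs->drv->bdrv_co_flush(data->bs);
    data->done = true;
}

static int example_call_callback(BlockDriverState *bs)
{
    ExampleCallbackData data = { .bs = bs };

    if (qemu_in_coroutine()) {
        example_callback_entry(&data);
    } else {
        Coroutine *co = qemu_coroutine_create(example_callback_entry, &data);

        /* The lock is already held, so enter directly instead of going
         * through bdrv_coroutine_enter()/aio_co_enter(), which would try
         * to acquire it again. */
        qemu_aio_coroutine_enter(bdrv_get_aio_context(bs), co);
        BDRV_POLL_WHILE(bs, !data.done);
    }

    return data.ret;
}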
On 5/24/22 12:20, Stefan Hajnoczi wrote:
>> Maybe it's safe to run it without a lock because it runs after virtio_set_status(vdev, 0) but I'd rather play it safe and protect s->rq with a lock.
>
> What does the lock protect?
>
> A lock can prevent s->rq or req->vq corruption but it cannot prevent request leaks. This loop's job is to free all requests so there is no leak. If a lock is necessary then this code is already broken in a more fundamental way because it can leak.

Yes, you're right. This particular list is always accessed in the iothread (if any) and blk_drain() is enough. virtio-blk should already not need aio_context_{acquire,release}.

It's worth a comment, though!

Paolo
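
For what it's worth, the comment suggested above could be as short as the one in the following sketch. The loop is a paraphrase of the request-freeing code being discussed, not the literal virtio-blk source; only the comment is the point.

    /*
     * s->rq is only accessed from the BlockBackend's home iothread (if
     * any), and a preceding blk_drain() has already quiesced it, so
     * walking and freeing the list needs no extra locking.
     */
    while (s->rq) {
        VirtIOBlockReq *req = s->rq;

        s->rq = req->next;
        virtqueue_detach_element(req->vq, &req->elem, 0);
        virtio_blk_free_request(req);
    }
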
On 5/24/22 12:36, Kevin Wolf wrote:
>> * IO_OR_GS_CODE() functions, when called from coroutine context, expect to be called with the AioContext lock taken (example: bdrv_co_yield_to_drain)
>
> Am I right to say this is not inherently part of the definition of IO_OR_GS_CODE(), but just a property that these functions have in practice? In practice, the functions that are IO_OR_GS_CODE() are those that call AIO_WAIT_WHILE().

Yes.

> Using a different code path means that the restrictions from AIO_WAIT_WHILE() don't really apply any more and these functions become effectively IO_CODE() when called in a coroutine. (At least I'm not aware of any other piece of code apart from AIO_WAIT_WHILE() that makes a function IO_OR_GS_CODE().)

The second point is correct. The first depends on the definition of the coroutine path. For example, bdrv_co_yield_to_drain() expects to run with bdrv_get_aio_context(bs)'s lock taken. To avoid AB-BA deadlocks, an iothread cannot take another iothread's AioContext lock; therefore, bdrv_co_yield_to_drain() can only run in the main thread or in bs's home iothread.

>> * to call these functions with the lock taken, the code has to run in the BDS's home iothread. Attempts to do otherwise result in deadlocks (the main loop's AIO_WAIT_WHILEs expect progress from the iothread, which cannot happen without releasing the AioContext lock)
>
> This problem can't happen in the main thread itself; AIO_WAIT_WHILE() is safe both in the home thread and the main thread (at least as long as you lock only once) because it temporarily drops the lock. It has also become the definition of IO_OR_GS_CODE(): This code has to run in the home thread or the main thread.

It doesn't work to run bdrv_open_driver() in the home iothread because bdrv_open_driver can change the AioContext of a BDS (causing extreme confusion about what runs where and which AioContext locks have to be taken and released). So in order to have bdrv_open_driver() run in the main thread, Emanuele added a variant of generated_co_wrapper that waits on the main thread. But it didn't work either, because then AIO_WAIT_WHILE() does not release the BDS's AioContext lock. When ->bdrv_open() calls bdrv_replace_child_noperm() and it tries to drain, bdrv_co_yield_to_drain() schedules a bottom half on the iothread and yields; the bottom half can never run, because the AioContext lock is already taken elsewhere.

    main thread                              ctx home thread
    -----------                              ---------------
    aio_context_acquire(ctx)
    bdrv_open()
      drv->bdrv_co_open()
        bdrv_replace_child_noperm()
          bdrv_co_yield_to_drain()
            aio_context_release(ctx)
            aio_schedule_bh_oneshot()  ----> bottom half (never runs)

So, bdrv_open_driver() and bdrv_close() are un-coroutin-izable. I guess we could split the callback into two parts, one doing I/O and one doing graph manipulation (it may even be a good idea, since the ->bdrv_inactivate() callback already exists in the case of bdrv_close(), but I'm not sure it's always applicable).

> Come to think of it, I believe that many of the functions we declared IO_CODE() are actually just IO_OR_GS_CODE() (at best; for iothreads, they certainly require running in the home thread, but the main thread allowed by IO_OR_GS_CODE() might not work). We have all the coroutine machinery so that the AioContext lock of the current thread is automatically released and reacquired across yield. However, this is the wrong AioContext when called from a different thread, so we need an additional lock - which should be dropped during yield, too, but it doesn't happen.

There's no need for additional locks. Drivers have to be protected by individual locks, which are either a QemuMutex that is dropped during yield, or a CoMutex. A QemuMutex used to protect a CoQueue is also dropped safely during yield.

The IO_CODE() distinction is indeed mostly theoretical as long as the file I/O BlockDriver protocols are protected by the AioContext lock and are therefore actually IO_OR_GS_CODE(). But that's the least of our worries; if file-posix AIO is done on qemu_get_current_aio_context() instead of bdrv_get_aio_context(bs), the AioContext lock is not needed anymore for I/O.

Apart from file I/O, all drivers were thread-safe at some point (now it seems blklogwrites.c is not, but that code also does not seem safe against power loss, and I wonder if it should just be deleted).

> Switching to drain for locking doesn't solve the problem, but only possibly defers it. In order to complete the multiqueue work, we need to make IO_CODE() functions actually conform to the definition of IO_CODE(). Do we have a plan for what this should look like in the final state when all the multiqueue work is completed? Or will we only have coroutine locks which don't have this problem?

The latter, apart from the occasional QemuMutex, which is used only where it does not cause the problem.

>> * running the code in the BDS's home iothread is not possible for GLOBAL_STATE_CODE() functions (unless the BDS home iothread is the main thread, but that cannot be guaranteed in general)
>
> There is nothing that stops GLOBAL_STATE_CODE() from scheduling work in the home iothread of a BDS and then waiting for it - or, if it is a coroutine, even rescheduling itself into the BDS home thread temporarily.

There are still things that an I/O thread cannot do, for example bdrv_set_aio_context(). Even if it worked, I don't think it would be a good idea. It would basically mean a "movable" main thread, which is even harder to reason about.

So that's how I got to the point where I don't think it's possible to proceed with the idea of coroutin-izing more code (a prerequisite for a coroutine-based graph rwlock) without first removing the AioContext lock. But removing the AioContext lock requires a way to make the graph operations thread-safe, and then we go back to draining.

Paolo
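
As a concrete illustration of the point that a QemuMutex protecting a CoQueue is dropped safely during yield: qemu_co_queue_wait() accepts any QemuLockable, releases it while the coroutine sleeps and takes it back on wakeup. The following is only a sketch of that per-driver locking pattern; ExampleState and the helper functions are invented names, not existing QEMU code.

    #include "qemu/osdep.h"
    #include "qemu/coroutine.h"
    #include "qemu/thread.h"

    typedef struct ExampleState {
        QemuMutex lock;     /* protects in_flight and waiters; only held
                             * across short, non-yielding sections */
        unsigned in_flight; /* requests currently being processed */
        CoQueue waiters;    /* coroutines waiting for in_flight == 0 */
    } ExampleState;

    static void example_state_init(ExampleState *s)
    {
        qemu_mutex_init(&s->lock);
        qemu_co_queue_init(&s->waiters);
        s->in_flight = 0;
    }

    static void coroutine_fn example_wait_until_idle(ExampleState *s)
    {
        qemu_mutex_lock(&s->lock);
        while (s->in_flight > 0) {
            /* drops s->lock while yielding, re-takes it on wakeup */
            qemu_co_queue_wait(&s->waiters, &s->lock);
        }
        qemu_mutex_unlock(&s->lock);
    }

    static void coroutine_fn example_request_done(ExampleState *s)
    {
        qemu_mutex_lock(&s->lock);
        if (--s->in_flight == 0) {
            qemu_co_queue_restart_all(&s->waiters);
        }
        qemu_mutex_unlock(&s->lock);
    }
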
Am 24/05/2022 um 14:10 schrieb Kevin Wolf:
> Am 18.05.2022 um 14:28 hat Emanuele Giuseppe Esposito geschrieben:
>> label: // read till the end to see why I wrote this here
>>
>> I was hoping someone from the "No" party would answer your question, because as you know we reached this same conclusion together.
>>
>> We thought about dropping the drain for various reasons, the main one (at least as far as I understood) being that we are not sure if something can still happen in between drain_begin/end, and it is a little bit confusing to use the same mechanism to block I/O and protect the graph.
>>
>> We then thought about implementing a rwlock. A rdlock would clarify what we are protecting and who is using the lock. I had sent a draft rwlock implementation in this thread, but this also led to additional problems. The main problem was that this new lock would introduce nested event loops, which together with such locking would just create deadlocks. If readers are in coroutines and writers are not (because graph operations are not running in coroutines), we have a lot of deadlocks. If a writer has to take the lock, it must wait for all other readers to finish. And it does this by internally calling AIO_WAIT_WHILE, creating a nested event loop. We don't know what could execute when polling for events, and for example another writer could be resumed.
>
> Why is this a problem? Your AIO_WAIT_WHILE() condition would be that there are neither readers nor writers, so you would just keep waiting until the other writer is done.

Yes, but by the time we get to the AIO_WAIT_WHILE() condition, the wrlock has already been taken by the current writer. I think that's also what you mean below.

>> Ideally, we want writers in coroutines too.
>>
>> Additionally, many readers are running in what we call "mixed" functions: usually implemented automatically with the generated_co_wrapper tag, they make a function (usually a bdrv callback) always run in a coroutine, creating one if necessary. For example, bdrv_flush() makes sure that bs->bdrv_co_flush() is always run in a coroutine. Such mixed functions are used in other callbacks too, making it really difficult to understand if we are in a coroutine or not, and most importantly making rwlock usage very difficult.
>
> How do they make rwlock usage difficult?
>
> *goes back to old IRC discussions*
>
> Ah, the problem was not the AIO_WAIT_WHILE() while taking the lock, but taking the lock first and then running an AIO_WAIT_WHILE() inside the locked section. This is forbidden because the callbacks that run during AIO_WAIT_WHILE() may in turn wait for the lock that you own, causing a deadlock.

Yes

> This is indeed a problem that running in coroutines would avoid because the inner waiter would just yield and the outer one could complete its job as soon as it's its turn.
>
> My conclusion in the IRC discussion was that maybe we need to take the graph locks when we're entering coroutine context, i.e. the "mixed" functions would rdlock the graph when called from non-coroutine context and would assume that it's already locked when called from coroutine context.

Yes, and that's what I tried to do. But the first step was to transform all callbacks into coroutines. I think you also agree with this, correct? And therefore the easiest step was to convert all callbacks into generated_co_wrapper functions, so that afterwards we could split them between coroutine and non-coroutine logic, as discussed on IRC.

Once split, we add the lock in the way you suggested. However, I didn't even get to that first step, because tests were deadlocking after just transforming 2-3 callbacks. See Paolo's thread for a nice explanation of why they are deadlocking and why converting these callbacks is difficult.

>> Which led us to stepping back once more and trying to convert all BlockDriverState callbacks into coroutines. This would greatly simplify rwlock usage, because we could make the rwlock coroutine-friendly (without any AIO_WAIT_WHILE, allowing a writer to wait for readers by just yielding and queuing itself in coroutine queues).
>>
>> The first step was then to convert all callbacks into coroutines, using generated_coroutine_wrapper (g_c_w).
>> A typical g_c_w is implemented in this way:
>>
>>     if (qemu_in_coroutine()) {
>>         callback();
>>     } else { // much simplified
>>         co = qemu_coroutine_create(callback);
>>         bdrv_coroutine_enter(bs, co);
>>         BDRV_POLL_WHILE(bs, coroutine_in_progress);
>>     }
>>
>> Once all callbacks are implemented using g_c_w, we can start splitting the two sides of the if to only create a coroutine when we are outside of a bdrv callback.
>>
>> However, we immediately found a problem while starting to convert the first callbacks: the AioContext lock is taken around some non-coroutine callbacks! For example, bs->bdrv_open() is always called with the AioContext lock taken. In addition, callbacks like bdrv_open are graph-modifying functions, which is probably why we are taking the AioContext lock, and they do not like to run in coroutines. Anyway, the real problem comes when we create a coroutine in such places where the AioContext lock is taken and we have a graph-modifying function.
>>
>> bdrv_coroutine_enter() calls aio_co_enter(), which in turn first checks whether the coroutine is entering another context than the current one (which is not the case for open) and whether we are already in a coroutine (for sure not). Therefore it resorts to the following calls:
>>
>>     aio_context_acquire(ctx);
>>     qemu_aio_coroutine_enter(ctx, co);
>>     aio_context_release(ctx);
>>
>> This is clearly a problem, because we are taking the lock twice: once from the original caller of the callback, and once here due to the coroutine. This creates a lot of deadlock situations.
>
> What are the deadlock situations that are created by locking twice?

Scratch this, and refer to Paolo's thread.

> The only problem I'm aware of is AIO_WAIT_WHILE(), which wants to temporarily unlock the AioContext. It calls aio_context_release() once to achieve this, which obviously isn't enough when the context was locked twice.
>
> But AIO_WAIT_WHILE() isn't allowed in coroutines anyway. So how are we running into deadlocks here?
>
> Note that we're probably already doing this inside the .bdrv_open implementations: They will usually read something from the image file, calling bdrv_preadv() which is already a generated_coroutine_wrapper today and creates a coroutine internally with the same locking pattern that you describe as problematic here.
>
> Making .bdrv_open itself a generated_coroutine_wrapper wouldn't really change anything fundamental; it would just pull the existing mechanism one function higher in the call stack.
>
>> For example, all callers of bdrv_open() always take the AioContext lock. Often it is taken very high in the call stack, but it's always taken.
>>
>> Getting rid of the lock around qemu_aio_coroutine_enter() is difficult too, because coroutines expect to have the lock taken. For example, if we want to drain from a coroutine, bdrv_co_yield_to_drain releases the lock for us.
>
> It's not difficult at all in your case where you know that you're already in the right thread and the lock is taken: You can call qemu_aio_coroutine_enter() directly instead of bdrv_coroutine_enter() in this case.
>
> But as I said, I'm not sure why we need to get rid of it at all.
>
> Kevin
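
Coming back to the coroutine-friendly rwlock described earlier in this exchange (a writer waits for readers by yielding and queuing itself instead of running AIO_WAIT_WHILE()), a minimal sketch of that idea could look like the following. It is not the lock proposed in the series: BdrvGraphLock and its fields are invented names, both readers and writers are assumed to run in coroutine context, and fairness/writer starvation is ignored.

    #include "qemu/osdep.h"
    #include "qemu/coroutine.h"

    typedef struct BdrvGraphLock {
        CoMutex mutex;  /* protects the fields below */
        int readers;    /* active rdlock holders */
        bool writer;    /* a wrlock is currently held */
        CoQueue queue;  /* coroutines waiting for the lock */
    } BdrvGraphLock;

    static void graph_lock_init(BdrvGraphLock *gl)
    {
        qemu_co_mutex_init(&gl->mutex);
        qemu_co_queue_init(&gl->queue);
        gl->readers = 0;
        gl->writer = false;
    }

    static void coroutine_fn graph_rdlock(BdrvGraphLock *gl)
    {
        qemu_co_mutex_lock(&gl->mutex);
        while (gl->writer) {
            /* yields, dropping gl->mutex until we are woken up */
            qemu_co_queue_wait(&gl->queue, &gl->mutex);
        }
        gl->readers++;
        qemu_co_mutex_unlock(&gl->mutex);
    }

    static void coroutine_fn graph_rdunlock(BdrvGraphLock *gl)
    {
        qemu_co_mutex_lock(&gl->mutex);
        if (--gl->readers == 0) {
            qemu_co_queue_restart_all(&gl->queue);
        }
        qemu_co_mutex_unlock(&gl->mutex);
    }

    static void coroutine_fn graph_wrlock(BdrvGraphLock *gl)
    {
        qemu_co_mutex_lock(&gl->mutex);
        while (gl->writer || gl->readers > 0) {
            /* no nested event loop: just yield until the graph is free */
            qemu_co_queue_wait(&gl->queue, &gl->mutex);
        }
        gl->writer = true;
        qemu_co_mutex_unlock(&gl->mutex);
    }

    static void coroutine_fn graph_wrunlock(BdrvGraphLock *gl)
    {
        qemu_co_mutex_lock(&gl->mutex);
        gl->writer = false;
        qemu_co_queue_restart_all(&gl->queue);
        qemu_co_mutex_unlock(&gl->mutex);
    }

This shape avoids the nested AIO_WAIT_WHILE() event loop discussed at the top of the thread; as the rest of the thread shows, the hard part is getting every reader and writer to actually run in a coroutine in the first place.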