
[PULL,11/33] scsi: only access SCSIDevice->requests from one thread

Message ID 20231221212339.164439-12-kwolf@redhat.com (mailing list archive)
State New, archived
Series [PULL,01/33] nbd/server: avoid per-NBDRequest nbd_client_get/put()

Commit Message

Kevin Wolf Dec. 21, 2023, 9:23 p.m. UTC
From: Stefan Hajnoczi <stefanha@redhat.com>

Stop depending on the AioContext lock and instead access
SCSIDevice->requests from only one thread at a time:
- When the VM is running only the BlockBackend's AioContext may access
  the requests list.
- When the VM is stopped only the main loop may access the requests
  list.

These constraints protect the requests list without the need for locking
in the I/O code path.

Note that multiple IOThreads are not supported yet because the code
assumes all SCSIRequests are executed from a single AioContext. Leave
that as future work.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-ID: <20231204164259.1515217-2-stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 include/hw/scsi/scsi.h |   7 +-
 hw/scsi/scsi-bus.c     | 181 ++++++++++++++++++++++++++++-------------
 2 files changed, 131 insertions(+), 57 deletions(-)

Comments

Hanna Czenczek Jan. 23, 2024, 4:40 p.m. UTC | #1
On 21.12.23 22:23, Kevin Wolf wrote:
> From: Stefan Hajnoczi <stefanha@redhat.com>
>
> Stop depending on the AioContext lock and instead access
> SCSIDevice->requests from only one thread at a time:
> - When the VM is running only the BlockBackend's AioContext may access
>    the requests list.
> - When the VM is stopped only the main loop may access the requests
>    list.
>
> These constraints protect the requests list without the need for locking
> in the I/O code path.
>
> Note that multiple IOThreads are not supported yet because the code
> assumes all SCSIRequests are executed from a single AioContext. Leave
> that as future work.
>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> Reviewed-by: Eric Blake <eblake@redhat.com>
> Message-ID: <20231204164259.1515217-2-stefanha@redhat.com>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   include/hw/scsi/scsi.h |   7 +-
>   hw/scsi/scsi-bus.c     | 181 ++++++++++++++++++++++++++++-------------
>   2 files changed, 131 insertions(+), 57 deletions(-)

My reproducer for https://issues.redhat.com/browse/RHEL-3934 now breaks 
more often because of this commit than because of the original bug, i.e. 
when repeatedly hot-plugging and unplugging a virtio-scsi and a scsi-hd 
device, this tends to happen when unplugging the scsi-hd:

{"execute":"device_del","arguments":{"id":"stg0"}}
{"return": {}}
qemu-system-x86_64: ../block/block-backend.c:2429: blk_get_aio_context: 
Assertion `ctx == blk->ctx' failed.

(gdb) bt
#0  0x00007f32a668d83c in  () at /usr/lib/libc.so.6
#1  0x00007f32a663d668 in raise () at /usr/lib/libc.so.6
#2  0x00007f32a66254b8 in abort () at /usr/lib/libc.so.6
#3  0x00007f32a66253dc in  () at /usr/lib/libc.so.6
#4  0x00007f32a6635d26 in  () at /usr/lib/libc.so.6
#5  0x0000556e6b4880a4 in blk_get_aio_context (blk=0x556e6e89ccf0) at 
../block/block-backend.c:2429
#6  blk_get_aio_context (blk=0x556e6e89ccf0) at 
../block/block-backend.c:2417
#7  0x0000556e6b112d87 in scsi_device_for_each_req_async_bh 
(opaque=0x556e6e2c6d10) at ../hw/scsi/scsi-bus.c:128
#8  0x0000556e6b5d1966 in aio_bh_poll (ctx=ctx@entry=0x556e6d8aa290) at 
../util/async.c:218
#9  0x0000556e6b5bb16a in aio_poll (ctx=0x556e6d8aa290, 
blocking=blocking@entry=true) at ../util/aio-posix.c:722
#10 0x0000556e6b4564b6 in iothread_run 
(opaque=opaque@entry=0x556e6d89d920) at ../iothread.c:63
#11 0x0000556e6b5bde58 in qemu_thread_start (args=0x556e6d8aa9b0) at 
../util/qemu-thread-posix.c:541
#12 0x00007f32a668b9eb in  () at /usr/lib/libc.so.6
#13 0x00007f32a670f7cc in  () at /usr/lib/libc.so.6

I don’t know anything about the problem yet, but as usual, I like 
speculation and discovering how wrong I was later on, so one thing I 
came across that’s funny about virtio-scsi is that requests can happen 
even while a disk is being attached or detached.  That is, Linux seems 
to probe all LUNs when a new virtio-scsi device is being attached, and 
it won’t stop just because a disk is being attached or removed.  So 
maybe that’s part of the problem, that we get a request while the BB is 
being detached, and temporarily in an inconsistent state (BDS context 
differs from BB context).

I’ll look more into it.

Hanna
Kevin Wolf Jan. 23, 2024, 5:10 p.m. UTC | #2
Am 23.01.2024 um 17:40 hat Hanna Czenczek geschrieben:
> On 21.12.23 22:23, Kevin Wolf wrote:
> > [...]
> 
> My reproducer for https://issues.redhat.com/browse/RHEL-3934 now breaks more
> often because of this commit than because of the original bug, i.e. when
> repeatedly hot-plugging and unplugging a virtio-scsi and a scsi-hd device,
> this tends to happen when unplugging the scsi-hd:
> 
> {"execute":"device_del","arguments":{"id":"stg0"}}
> {"return": {}}
> qemu-system-x86_64: ../block/block-backend.c:2429: blk_get_aio_context:
> Assertion `ctx == blk->ctx' failed.
> 
> (gdb) bt
> #0  0x00007f32a668d83c in  () at /usr/lib/libc.so.6
> #1  0x00007f32a663d668 in raise () at /usr/lib/libc.so.6
> #2  0x00007f32a66254b8 in abort () at /usr/lib/libc.so.6
> #3  0x00007f32a66253dc in  () at /usr/lib/libc.so.6
> #4  0x00007f32a6635d26 in  () at /usr/lib/libc.so.6
> #5  0x0000556e6b4880a4 in blk_get_aio_context (blk=0x556e6e89ccf0) at
> ../block/block-backend.c:2429
> #6  blk_get_aio_context (blk=0x556e6e89ccf0) at
> ../block/block-backend.c:2417
> #7  0x0000556e6b112d87 in scsi_device_for_each_req_async_bh
> (opaque=0x556e6e2c6d10) at ../hw/scsi/scsi-bus.c:128
> #8  0x0000556e6b5d1966 in aio_bh_poll (ctx=ctx@entry=0x556e6d8aa290) at
> ../util/async.c:218
> #9  0x0000556e6b5bb16a in aio_poll (ctx=0x556e6d8aa290,
> blocking=blocking@entry=true) at ../util/aio-posix.c:722
> #10 0x0000556e6b4564b6 in iothread_run (opaque=opaque@entry=0x556e6d89d920)
> at ../iothread.c:63
> #11 0x0000556e6b5bde58 in qemu_thread_start (args=0x556e6d8aa9b0) at
> ../util/qemu-thread-posix.c:541
> #12 0x00007f32a668b9eb in  () at /usr/lib/libc.so.6
> #13 0x00007f32a670f7cc in  () at /usr/lib/libc.so.6
> 
> I don’t know anything about the problem yet, but as usual, I like
> speculation and discovering how wrong I was later on, so one thing I came
> across that’s funny about virtio-scsi is that requests can happen even while
> a disk is being attached or detached.  That is, Linux seems to probe all
> LUNs when a new virtio-scsi device is being attached, and it won’t stop just
> because a disk is being attached or removed.  So maybe that’s part of the
> problem, that we get a request while the BB is being detached, and
> temporarily in an inconsistent state (BDS context differs from BB context).

I don't know anything about the problem either, but since you already
speculated about the cause, let me speculate about the solution:
Can we hold the graph writer lock for the tran_commit() call in
bdrv_try_change_aio_context()? And of course take the reader lock for
blk_get_aio_context(), but that should be completely unproblematic.

At the first sight I don't see a reason why this would break something,
but I've learnt not to trust my first impression with the graph locking
work...

Of course, I also didn't check if there are more things inside of the
device emulation that need additional locking in this case, too. But
even if so, blk_get_aio_context() should never see an inconsistent
state.

Kevin
Hanna Czenczek Jan. 23, 2024, 5:21 p.m. UTC | #3
On 23.01.24 17:40, Hanna Czenczek wrote:
> On 21.12.23 22:23, Kevin Wolf wrote:
>> [...]

> I don’t know anything about the problem yet, but as usual, I like 
> speculation and discovering how wrong I was later on, so one thing I 
> came across that’s funny about virtio-scsi is that requests can happen 
> even while a disk is being attached or detached.  That is, Linux seems 
> to probe all LUNs when a new virtio-scsi device is being attached, and 
> it won’t stop just because a disk is being attached or removed.  So 
> maybe that’s part of the problem, that we get a request while the BB 
> is being detached, and temporarily in an inconsistent state (BDS 
> context differs from BB context).
>
> I’ll look more into it.

What I think happens is that scsi_device_purge_requests() runs (perhaps 
through virtio_scsi_hotunplug() -> qdev_simple_device_unplug_cb() -> 
scsi_qdev_unrealize()?), which schedules 
scsi_device_for_each_req_async_bh() to run, but doesn’t await it.  We go 
on, begin to move the BB and its BDS back to the main context (via 
blk_set_aio_context() in virtio_scsi_hotunplug()), but 
scsi_device_for_each_req_async_bh() still runs in the I/O thread, it 
calls blk_get_aio_context() while the contexts are inconsistent, and we 
get the crash.

There is a comment above blk_get_aio_context() in 
scsi_device_for_each_req_async_bh() about the BB potentially being moved 
to a different context prior to the BH running, but it doesn’t consider 
the possibility that that move may occur *concurrently*.

I don’t know how to fix this, though.  The whole idea of anything 
happening to a BB while it is being moved to a different context seems 
so wrong to me that I’d want to slap a big lock on it, but I have the 
feeling that that isn’t what we want.

Hanna
Hanna Czenczek Jan. 23, 2024, 5:23 p.m. UTC | #4
On 23.01.24 18:10, Kevin Wolf wrote:
> Am 23.01.2024 um 17:40 hat Hanna Czenczek geschrieben:
>> [...]
> I don't know anything about the problem either, but since you already
> speculated about the cause, let me speculate about the solution:
> Can we hold the graph writer lock for the tran_commit() call in
> bdrv_try_change_aio_context()? And of course take the reader lock for
> blk_get_aio_context(), but that should be completely unproblematic.
>
> At the first sight I don't see a reason why this would break something,
> but I've learnt not to trust my first impression with the graph locking
> work...
>
> Of course, I also didn't check if there are more things inside of the
> device emulation that need additional locking in this case, too. But
> even if so, blk_get_aio_context() should never see an inconsistent
> state.

Ah, sorry, saw your reply only now that I hit “send”.

I forgot that we do have that big lock that I thought rather to avoid 
:)  Sounds good and very reasonable to me.  Changing the contexts in the 
graph sounds like a graph change operation, and reading and comparing 
contexts in the graph sounds like reading the graph.

Hanna
Hanna Czenczek Jan. 24, 2024, 12:12 p.m. UTC | #5
On 23.01.24 18:10, Kevin Wolf wrote:
> Am 23.01.2024 um 17:40 hat Hanna Czenczek geschrieben:
>> [...]

> I don't know anything about the problem either, but since you already
> speculated about the cause, let me speculate about the solution:
> Can we hold the graph writer lock for the tran_commit() call in
> bdrv_try_change_aio_context()? And of course take the reader lock for
> blk_get_aio_context(), but that should be completely unproblematic.

I tried this, and it’s not easy taking the lock just for tran_commit(), 
because some callers of bdrv_try_change_aio_context() already hold the 
write lock (specifically bdrv_attach_child_common(), 
bdrv_attach_child_common_abort(), and bdrv_root_unref_child()[1]), and 
qmp_x_blockdev_set_iothread() holds the read lock.  Other callers don’t 
hold any lock[2].

So I’m not sure whether we should mark all of 
bdrv_try_change_aio_context() as GRAPH_WRLOCK and then make all callers 
take the lock, or really only take it for tran_commit(), and have 
callers release the lock around bdrv_try_change_aio_context().  The 
former sounds better to naïve me.

(In any case, FWIW, having blk_set_aio_context() take the write lock, 
and scsi_device_for_each_req_async_bh() take the read lock[3], does fix 
the assertion failure.)

Hanna

[1] bdrv_root_unref_child() is not marked as GRAPH_WRLOCK, but its 
callers generally seem to ensure that the lock is taken when calling it.

[2] blk_set_aio_context() (evidently), blk_exp_add(), 
external_snapshot_abort(), {blockdev,drive}_backup_action(), 
qmp_{blockdev,drive}_mirror()

[3] I’ve made the _bh a coroutine (for bdrv_graph_co_rdlock()) and 
replaced the aio_bh_schedule_oneshot() by aio_co_enter() – hope that’s 
right.
Stefan Hajnoczi Jan. 24, 2024, 9:53 p.m. UTC | #6
On Wed, Jan 24, 2024 at 01:12:47PM +0100, Hanna Czenczek wrote:
> On 23.01.24 18:10, Kevin Wolf wrote:
> > Am 23.01.2024 um 17:40 hat Hanna Czenczek geschrieben:
> > > [...]
> 
> > I don't know anything about the problem either, but since you already
> > speculated about the cause, let me speculate about the solution:
> > Can we hold the graph writer lock for the tran_commit() call in
> > bdrv_try_change_aio_context()? And of course take the reader lock for
> > blk_get_aio_context(), but that should be completely unproblematic.
> 
> I tried this, and it’s not easy taking the lock just for tran_commit(),
> because some callers of bdrv_try_change_aio_context() already hold the write
> lock (specifically bdrv_attach_child_common(),
> bdrv_attach_child_common_abort(), and bdrv_root_unref_child()[1]), and
> qmp_x_blockdev_set_iothread() holds the read lock.  Other callers don’t hold
> any lock[2].
> 
> So I’m not sure whether we should mark all of bdrv_try_change_aio_context()
> as GRAPH_WRLOCK and then make all callers take the lock, or really only take
> it for tran_commit(), and have callers release the lock around
> bdrv_try_change_aio_context(). Former sounds better to naïve me.
> 
> (In any case, FWIW, having blk_set_aio_context() take the write lock, and
> scsi_device_for_each_req_async_bh() take the read lock[3], does fix the
> assertion failure.)

I wonder if a simpler solution is blk_inc_in_flight() in
scsi_device_for_each_req_async() and blk_dec_in_flight() in
scsi_device_for_each_req_async_bh() so that drain
waits for the BH.

There is a drain around the AioContext change, so as long as
scsi_device_for_each_req_async() was called before blk_set_aio_context()
and not _during_ aio_poll(), we would prevent inconsistent BB vs BDS
aio_contexts.

Stefan

> 
> Hanna
> 
> [1] bdrv_root_unref_child() is not marked as GRAPH_WRLOCK, but its callers
> generally seem to ensure that the lock is taken when calling it.
> 
> [2] blk_set_aio_context() (evidently), blk_exp_add(),
> external_snapshot_abort(), {blockdev,drive}_backup_action(),
> qmp_{blockdev,drive}_mirror()
> 
> [3] I’ve made the _bh a coroutine (for bdrv_graph_co_rdlock()) and replaced
> the aio_bh_schedule_oneshot() by aio_co_enter() – hope that’s right.
Hanna Czenczek Jan. 25, 2024, 9:06 a.m. UTC | #7
On 24.01.24 22:53, Stefan Hajnoczi wrote:
> On Wed, Jan 24, 2024 at 01:12:47PM +0100, Hanna Czenczek wrote:
>> On 23.01.24 18:10, Kevin Wolf wrote:
>>> Am 23.01.2024 um 17:40 hat Hanna Czenczek geschrieben:
>>>> On 21.12.23 22:23, Kevin Wolf wrote:
>>>>> From: Stefan Hajnoczi<stefanha@redhat.com>
>>>>>
>>>>> Stop depending on the AioContext lock and instead access
>>>>> SCSIDevice->requests from only one thread at a time:
>>>>> - When the VM is running only the BlockBackend's AioContext may access
>>>>>      the requests list.
>>>>> - When the VM is stopped only the main loop may access the requests
>>>>>      list.
>>>>>
>>>>> These constraints protect the requests list without the need for locking
>>>>> in the I/O code path.
>>>>>
>>>>> Note that multiple IOThreads are not supported yet because the code
>>>>> assumes all SCSIRequests are executed from a single AioContext. Leave
>>>>> that as future work.
>>>>>
>>>>> Signed-off-by: Stefan Hajnoczi<stefanha@redhat.com>
>>>>> Reviewed-by: Eric Blake<eblake@redhat.com>
>>>>> Message-ID:<20231204164259.1515217-2-stefanha@redhat.com>
>>>>> Signed-off-by: Kevin Wolf<kwolf@redhat.com>
>>>>> ---
>>>>>     include/hw/scsi/scsi.h |   7 +-
>>>>>     hw/scsi/scsi-bus.c     | 181 ++++++++++++++++++++++++++++-------------
>>>>>     2 files changed, 131 insertions(+), 57 deletions(-)
>>>> My reproducer forhttps://issues.redhat.com/browse/RHEL-3934  now breaks more
>>>> often because of this commit than because of the original bug, i.e. when
>>>> repeatedly hot-plugging and unplugging a virtio-scsi and a scsi-hd device,
>>>> this tends to happen when unplugging the scsi-hd:
>>>>
>>>> {"execute":"device_del","arguments":{"id":"stg0"}}
>>>> {"return": {}}
>>>> qemu-system-x86_64: ../block/block-backend.c:2429: blk_get_aio_context:
>>>> Assertion `ctx == blk->ctx' failed.
>> [...]
>>
>>> I don't know anything about the problem either, but since you already
>>> speculated about the cause, let me speculate about the solution:
>>> Can we hold the graph writer lock for the tran_commit() call in
>>> bdrv_try_change_aio_context()? And of course take the reader lock for
>>> blk_get_aio_context(), but that should be completely unproblematic.
>> I tried this, and it’s not easy taking the lock just for tran_commit(),
>> because some callers of bdrv_try_change_aio_context() already hold the write
>> lock (specifically bdrv_attach_child_common(),
>> bdrv_attach_child_common_abort(), and bdrv_root_unref_child()[1]), and
>> qmp_x_blockdev_set_iothread() holds the read lock.  Other callers don’t hold
>> any lock[2].
>>
>> So I’m not sure whether we should mark all of bdrv_try_change_aio_context()
>> as GRAPH_WRLOCK and then make all callers take the lock, or really only take
>> it for tran_commit(), and have callers release the lock around
>> bdrv_try_change_aio_context(). Former sounds better to naïve me.
>>
>> (In any case, FWIW, having blk_set_aio_context() take the write lock, and
>> scsi_device_for_each_req_async_bh() take the read lock[3], does fix the
>> assertion failure.)
> I wonder if a simpler solution is blk_inc_in_flight() in
> scsi_device_for_each_req_async() and blk_dec_in_flight() in
> scsi_device_for_each_req_async_bh() so that drain
> waits for the BH.
>
> There is a drain around the AioContext change, so as long as
> scsi_device_for_each_req_async() was called before blk_set_aio_context()
> and not _during_ aio_poll(), we would prevent inconsistent BB vs BDS
> aio_contexts.

Actually, Kevin has suggested on IRC that we drop the whole drain. :)

Dropping the write lock in or outside of bdrv_try_change_aio_context() 
for callers that have already taken it seems unsafe, so the only option 
would be to make the whole function write-lock-able.  The drained 
section can cause problems with that if it ends up wanting to reorganize 
the graph, so AFAIU drain should never be done while under a write 
lock.  This is already a problem now because there are three callers 
that do call bdrv_try_change_aio_context() while under a write lock, so 
it seems like we shouldn’t keep the drain as-is.

So, Kevin suggested just dropping that drain, because I/O requests are 
no longer supposed to care about a BDS's native AioContext anyway, so 
it seems like the need for the drain has gone away with multiqueue.  
Then we could make the whole function GRAPH_WRLOCK.

Hanna
Stefan Hajnoczi Jan. 25, 2024, 1:25 p.m. UTC | #8
On Thu, Jan 25, 2024 at 10:06:51AM +0100, Hanna Czenczek wrote:
> On 24.01.24 22:53, Stefan Hajnoczi wrote:
> > On Wed, Jan 24, 2024 at 01:12:47PM +0100, Hanna Czenczek wrote:
> > > On 23.01.24 18:10, Kevin Wolf wrote:
> > > > Am 23.01.2024 um 17:40 hat Hanna Czenczek geschrieben:
> > > > > [...]
> > > 
> > > > I don't know anything about the problem either, but since you already
> > > > speculated about the cause, let me speculate about the solution:
> > > > Can we hold the graph writer lock for the tran_commit() call in
> > > > bdrv_try_change_aio_context()? And of course take the reader lock for
> > > > blk_get_aio_context(), but that should be completely unproblematic.
> > > I tried this, and it’s not easy taking the lock just for tran_commit(),
> > > because some callers of bdrv_try_change_aio_context() already hold the write
> > > lock (specifically bdrv_attach_child_common(),
> > > bdrv_attach_child_common_abort(), and bdrv_root_unref_child()[1]), and
> > > qmp_x_blockdev_set_iothread() holds the read lock.  Other callers don’t hold
> > > any lock[2].
> > > 
> > > So I’m not sure whether we should mark all of bdrv_try_change_aio_context()
> > > as GRAPH_WRLOCK and then make all callers take the lock, or really only take
> > > it for tran_commit(), and have callers release the lock around
> > > bdrv_try_change_aio_context(). Former sounds better to naïve me.
> > > 
> > > (In any case, FWIW, having blk_set_aio_context() take the write lock, and
> > > scsi_device_for_each_req_async_bh() take the read lock[3], does fix the
> > > assertion failure.)
> > I wonder if a simpler solution is blk_inc_in_flight() in
> > scsi_device_for_each_req_async() and blk_dec_in_flight() in
> > scsi_device_for_each_req_async_bh() so that drain
> > waits for the BH.
> > 
> > There is a drain around the AioContext change, so as long as
> > scsi_device_for_each_req_async() was called before blk_set_aio_context()
> > and not _during_ aio_poll(), we would prevent inconsistent BB vs BDS
> > aio_contexts.
> 
> Actually, Kevin has suggested on IRC that we drop the whole drain. :)
> 
> Dropping the write lock in or outside of bdrv_try_change_aio_context() for
> callers that have already taken it seems unsafe, so the only option would be
> to make the whole function write-lock-able.  The drained section can cause
> problems with that if it ends up wanting to reorganize the graph, so AFAIU
> drain should never be done while under a write lock.  This is already a
> problem now because there are three callers that do call
> bdrv_try_change_aio_context() while under a write lock, so it seems like we
> shouldn’t keep the drain as-is.
> 
> So, Kevin suggested just dropping that drain, because I/O requests are no
> longer supposed to care about a BDS’s native AioContext anymore anyway, so
> it seems like the need for the drain has gone away with multiqueue.  Then we
> could make the whole function GRAPH_WRLOCK.

Okay.

Stefan
Hanna Czenczek Jan. 25, 2024, 5:32 p.m. UTC | #9
On 23.01.24 18:10, Kevin Wolf wrote:
> Am 23.01.2024 um 17:40 hat Hanna Czenczek geschrieben:
>> On 21.12.23 22:23, Kevin Wolf wrote:
>>> From: Stefan Hajnoczi <stefanha@redhat.com>
>>>
>>> Stop depending on the AioContext lock and instead access
>>> SCSIDevice->requests from only one thread at a time:
>>> - When the VM is running only the BlockBackend's AioContext may access
>>>     the requests list.
>>> - When the VM is stopped only the main loop may access the requests
>>>     list.
>>>
>>> These constraints protect the requests list without the need for locking
>>> in the I/O code path.
>>>
>>> Note that multiple IOThreads are not supported yet because the code
>>> assumes all SCSIRequests are executed from a single AioContext. Leave
>>> that as future work.
>>>
>>> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
>>> Reviewed-by: Eric Blake <eblake@redhat.com>
>>> Message-ID: <20231204164259.1515217-2-stefanha@redhat.com>
>>> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
>>> ---
>>>    include/hw/scsi/scsi.h |   7 +-
>>>    hw/scsi/scsi-bus.c     | 181 ++++++++++++++++++++++++++++-------------
>>>    2 files changed, 131 insertions(+), 57 deletions(-)
>> My reproducer for https://issues.redhat.com/browse/RHEL-3934 now breaks more
>> often because of this commit than because of the original bug, i.e. when
>> repeatedly hot-plugging and unplugging a virtio-scsi and a scsi-hd device,
>> this tends to happen when unplugging the scsi-hd:
>>
>> {"execute":"device_del","arguments":{"id":"stg0"}}
>> {"return": {}}
>> qemu-system-x86_64: ../block/block-backend.c:2429: blk_get_aio_context:
>> Assertion `ctx == blk->ctx' failed.

[...]

>> I don’t know anything about the problem yet, but as usual, I like
>> speculation and discovering how wrong I was later on, so one thing I came
>> across that’s funny about virtio-scsi is that requests can happen even while
>> a disk is being attached or detached.  That is, Linux seems to probe all
>> LUNs when a new virtio-scsi device is being attached, and it won’t stop just
>> because a disk is being attached or removed.  So maybe that’s part of the
>> problem, that we get a request while the BB is being detached, and
>> temporarily in an inconsistent state (BDS context differs from BB context).
> I don't know anything about the problem either, but since you already
> speculated about the cause, let me speculate about the solution:
> Can we hold the graph writer lock for the tran_commit() call in
> bdrv_try_change_aio_context()? And of course take the reader lock for
> blk_get_aio_context(), but that should be completely unproblematic.

Actually, now that completely unproblematic part is giving me trouble.  
I wanted to just put a graph lock into blk_get_aio_context() (making it 
a coroutine with a wrapper), but callers of blk_get_aio_context() 
generally assume the context is going to stay the BB’s context for as 
long as their AioContext * variable is in scope.  I was tempted to think 
callers know what happens to the BB they pass to blk_get_aio_context(), 
and it won’t change contexts so easily, but then I remembered this is 
exactly what happens in this case; we run 
scsi_device_for_each_req_async_bh() in one thread (which calls 
blk_get_aio_context()), and in the other, we change the BB’s context.

It seems like there are very few blk_* functions right now that require 
taking a graph lock around it, so I’m hesitant to go that route.  But if 
we’re protecting a BB’s context via the graph write lock, I can’t think 
of a way around having to take a read lock whenever reading a BB’s 
context, and holding it for as long as we assume that context to remain 
the BB’s context.  It’s also hard to figure out how long that is, case 
by case; for example, dma_blk_read() schedules an AIO function in the BB 
context; but we probably don’t care that this context remains the BB’s 
context until the request is done.  In the case of 
scsi_device_for_each_req_async_bh(), we already take care to re-schedule 
it when it turns out the context is outdated, so it does seem quite 
important here, and we probably want to keep the lock until after the 
QTAILQ_FOREACH_SAFE() loop.

On a tangent, this TOCTTOU problem makes me wary of other blk_* 
functions that query information.  For example, fuse_read() (in 
block/export/fuse.c) truncates requests to the BB length.  But what if 
the BB length changes concurrently between blk_getlength() and 
blk_pread()?  While we can justify using the graph lock for a BB’s 
AioContext, we can’t use it for other metadata like its length.
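To make that window concrete, the interleaving I have in mind looks roughly 
like this (a hypothetical schedule, not the literal code in 
block/export/fuse.c):

```c
/*
 * Hypothetical interleaving sketching the check-then-act window:
 *
 *   FUSE export thread                  concurrent thread (e.g. monitor)
 *   ------------------                  --------------------------------
 *   len = blk_getlength(blk);
 *                                       blk_truncate(blk, smaller_len, ...);
 *   if (offset + size > len) {
 *       size = len - offset;            // truncated against the OLD length
 *   }
 *   blk_pread(blk, offset, size, ...);  // may now read beyond the new EOF
 */
```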

Hanna
Kevin Wolf Jan. 26, 2024, 1:18 p.m. UTC | #10
Am 25.01.2024 um 18:32 hat Hanna Czenczek geschrieben:
> On 23.01.24 18:10, Kevin Wolf wrote:
> > Am 23.01.2024 um 17:40 hat Hanna Czenczek geschrieben:
> > > On 21.12.23 22:23, Kevin Wolf wrote:
> > > > [...]
> > > My reproducer for https://issues.redhat.com/browse/RHEL-3934 now breaks more
> > > often because of this commit than because of the original bug, i.e. when
> > > repeatedly hot-plugging and unplugging a virtio-scsi and a scsi-hd device,
> > > this tends to happen when unplugging the scsi-hd:
> > > 
> > > {"execute":"device_del","arguments":{"id":"stg0"}}
> > > {"return": {}}
> > > qemu-system-x86_64: ../block/block-backend.c:2429: blk_get_aio_context:
> > > Assertion `ctx == blk->ctx' failed.
> 
> [...]
> 
> > > I don’t know anything about the problem yet, but as usual, I like
> > > speculation and discovering how wrong I was later on, so one thing I came
> > > across that’s funny about virtio-scsi is that requests can happen even while
> > > a disk is being attached or detached.  That is, Linux seems to probe all
> > > LUNs when a new virtio-scsi device is being attached, and it won’t stop just
> > > because a disk is being attached or removed.  So maybe that’s part of the
> > > problem, that we get a request while the BB is being detached, and
> > > temporarily in an inconsistent state (BDS context differs from BB context).
> > I don't know anything about the problem either, but since you already
> > speculated about the cause, let me speculate about the solution:
> > Can we hold the graph writer lock for the tran_commit() call in
> > bdrv_try_change_aio_context()? And of course take the reader lock for
> > blk_get_aio_context(), but that should be completely unproblematic.
> 
> Actually, now that completely unproblematic part is giving me trouble.  I
> wanted to just put a graph lock into blk_get_aio_context() (making it a
> coroutine with a wrapper)

Which is the first thing I neglected and already not great. We have
calls of blk_get_aio_context() in the SCSI I/O path, and creating a
coroutine and doing at least two context switches simply for this call
is a lot of overhead...

> but callers of blk_get_aio_context() generally assume the context is
> going to stay the BB’s context for as long as their AioContext *
> variable is in scope.

I'm not so sure about that. And taking another step back, I'm actually
also not sure how much it still matters now that they can submit I/O
from any thread.

Maybe the correct solution is to remove the assertion from
blk_get_aio_context() and just always return blk->ctx. If it's in the
middle of a change, you'll either get the old one or the new one. Either
one is fine to submit I/O from, and if you care about changes for other
reasons (like SCSI does), then you need explicit code to protect it
anyway (which SCSI apparently has, but it doesn't work).
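Spelled out, that would be something like the following (a sketch only; the
atomic read is my assumption about what a race-tolerant version would need,
not what the current code does):

```c
/*
 * Sketch: blk_get_aio_context() without the root-node cross-check.
 * During a context change the caller may see either the old or the new
 * context; both are fine for submitting I/O. Callers that care about
 * the change itself need their own protection.
 */
AioContext *blk_get_aio_context(BlockBackend *blk)
{
    IO_CODE();

    if (!blk) {
        return qemu_get_aio_context();
    }

    return qatomic_read(&blk->ctx);
}
```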

> I was tempted to think callers know what happens to the BB they pass
> to blk_get_aio_context(), and it won’t change contexts so easily, but
> then I remembered this is exactly what happens in this case; we run
> scsi_device_for_each_req_async_bh() in one thread (which calls
> blk_get_aio_context()), and in the other, we change the BB’s context.

Let's think a bit more about scsi_device_for_each_req_async()
specifically. This is a function that runs in the main thread. Nothing
will change any AioContext assignment if it doesn't call it. It wants to
make sure that scsi_device_for_each_req_async_bh() is called in the
same AioContext where the virtqueue is processed, so it schedules a BH
and waits for it.

Waiting for it means running a nested event loop that could do anything,
including changing AioContexts. So this is what needs the locking, not
the blk_get_aio_context() call in scsi_device_for_each_req_async_bh().
If we lock before the nested event loop and unlock in the BH, the check
in the BH can become an assertion. (It is important that we unlock in
the BH rather than after waiting because if something takes the writer
lock, we need to unlock during the nested event loop of bdrv_wrlock() to
avoid a deadlock.)

And spawning a coroutine for this looks a lot more acceptable because
it's on a slow path anyway.

In fact, we probably don't technically need a coroutine to take the
reader lock here. We can have a new graph lock function that asserts
that there is no writer (we know because we're running in the main loop)
and then atomically increments the reader count. But maybe that already
complicates things again...
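Putting those two ideas together, I imagine something like this (a
hypothetical sketch: bdrv_graph_rdlock_no_writer() and
bdrv_graph_rdunlock_no_writer() stand in for the new primitive and do not
exist today; the scsi-bus.c details are abbreviated):

```c
/* Hypothetical main-loop-only primitive: assert that no writer is
 * active and atomically take a reader reference. */
void bdrv_graph_rdlock_no_writer(void);
void bdrv_graph_rdunlock_no_writer(void);

static void scsi_device_for_each_req_async(SCSIDevice *s, ...)
{
    /* ... allocate data, take a reference on the device ... */

    /* Lock before the nested event loop, so no AioContext change can
     * slip in between scheduling the BH and running it. */
    bdrv_graph_rdlock_no_writer();
    aio_wait_bh_oneshot(blk_get_aio_context(s->conf.blk),
                        scsi_device_for_each_req_async_bh, data);
}

static void scsi_device_for_each_req_async_bh(void *opaque)
{
    /* ... */

    /* The rescheduling fallback becomes an assertion: */
    assert(blk_get_aio_context(s->conf.blk) ==
           qemu_get_current_aio_context());

    /* ... QTAILQ_FOREACH_SAFE() over s->requests ... */

    /* Unlock in the BH, not after the wait, so that a concurrent
     * bdrv_graph_wrlock() can make progress during nested event
     * loops without deadlocking. */
    bdrv_graph_rdunlock_no_writer();
}
```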

> It seems like there are very few blk_* functions right now that
> require taking a graph lock around it, so I’m hesitant to go that
> route.  But if we’re protecting a BB’s context via the graph write
> lock, I can’t think of a way around having to take a read lock
> whenever reading a BB’s context, and holding it for as long as we
> assume that context to remain the BB’s context.  It’s also hard to
> figure out how long that is, case by case; for example, dma_blk_read()
> schedules an AIO function in the BB context; but we probably don’t
> care that this context remains the BB’s context until the request is
> done.  In the case of scsi_device_for_each_req_async_bh(), we already
> take care to re-schedule it when it turns out the context is outdated,
> so it does seem quite important here, and we probably want to keep the
> lock until after the QTAILQ_FOREACH_SAFE() loop.

Maybe we need to audit all callers. Fortunately, there don't seem to be
too many. At least not direct ones...

> On a tangent, this TOCTTOU problem makes me wary of other blk_*
> functions that query information.  For example, fuse_read() (in
> block/export/fuse.c) truncates requests to the BB length.  But what if
> the BB length changes concurrently between blk_getlength() and
> blk_pread()?  While we can justify using the graph lock for a BB’s
> AioContext, we can’t use it for other metadata like its length.

Hm... Is "tough luck" an acceptable answer? ;-)

Kevin
Hanna Czenczek Jan. 26, 2024, 3:24 p.m. UTC | #11
On 26.01.24 14:18, Kevin Wolf wrote:
> Am 25.01.2024 um 18:32 hat Hanna Czenczek geschrieben:
>> On 23.01.24 18:10, Kevin Wolf wrote:
>>> Am 23.01.2024 um 17:40 hat Hanna Czenczek geschrieben:
>>>> On 21.12.23 22:23, Kevin Wolf wrote:
>>>>> [...]
>>>> My reproducer for https://issues.redhat.com/browse/RHEL-3934 now breaks more
>>>> often because of this commit than because of the original bug, i.e. when
>>>> repeatedly hot-plugging and unplugging a virtio-scsi and a scsi-hd device,
>>>> this tends to happen when unplugging the scsi-hd:

Note: We (on issues.redhat.com) have a separate report that seems to 
concern this very problem: https://issues.redhat.com/browse/RHEL-19381

>>>> {"execute":"device_del","arguments":{"id":"stg0"}}
>>>> {"return": {}}
>>>> qemu-system-x86_64: ../block/block-backend.c:2429: blk_get_aio_context:
>>>> Assertion `ctx == blk->ctx' failed.
>> [...]
>>
>>>> I don’t know anything about the problem yet, but as usual, I like
>>>> speculation and discovering how wrong I was later on, so one thing I came
>>>> across that’s funny about virtio-scsi is that requests can happen even while
>>>> a disk is being attached or detached.  That is, Linux seems to probe all
>>>> LUNs when a new virtio-scsi device is being attached, and it won’t stop just
>>>> because a disk is being attached or removed.  So maybe that’s part of the
>>>> problem, that we get a request while the BB is being detached, and
>>>> temporarily in an inconsistent state (BDS context differs from BB context).
>>> I don't know anything about the problem either, but since you already
>>> speculated about the cause, let me speculate about the solution:
>>> Can we hold the graph writer lock for the tran_commit() call in
>>> bdrv_try_change_aio_context()? And of course take the reader lock for
>>> blk_get_aio_context(), but that should be completely unproblematic.
>> Actually, now that completely unproblematic part is giving me trouble.  I
>> wanted to just put a graph lock into blk_get_aio_context() (making it a
>> coroutine with a wrapper)
> Which is the first thing I neglected and already not great. We have
> calls of blk_get_aio_context() in the SCSI I/O path, and creating a
> coroutine and doing at least two context switches simply for this call
> is a lot of overhead...
>
>> but callers of blk_get_aio_context() generally assume the context is
>> going to stay the BB’s context for as long as their AioContext *
>> variable is in scope.
> I'm not so sure about that. And taking another step back, I'm actually
> also not sure how much it still matters now that they can submit I/O
> from any thread.

That’s my impression, too, but “not sure” doesn’t feel great. :) 
scsi_device_for_each_req_async_bh() specifically double-checks whether 
it’s still in the right context before invoking the specified function, 
so it seems there was some intention to continue to run in the context 
associated with the BB.

(Not judging whether that intent makes sense or not, yet.)

> Maybe the correct solution is to remove the assertion from
> blk_get_aio_context() and just always return blk->ctx. If it's in the
> middle of a change, you'll either get the old one or the new one. Either
> one is fine to submit I/O from, and if you care about changes for other
> reasons (like SCSI does), then you need explicit code to protect it
> anyway (which SCSI apparently has, but it doesn't work).

I think most callers do just assume the BB stays in the context they got 
(without any proof, admittedly), but I agree that under re-evaluation, 
it probably doesn’t actually matter to them, really. And yes, basically, 
if the caller doesn’t need to take a lock because it doesn’t really 
matter whether blk->ctx changes while its still using the old value, 
blk_get_aio_context() in turn doesn’t need to double-check blk->ctx 
against the root node’s context either, and nobody needs a lock.

So I agree, it’s on the caller to protect against a potentially changing 
context, blk_get_aio_context() should just return blk->ctx and not check 
against the root node.

(On a tangent: blk_drain() is a caller of blk_get_aio_context(), and it 
polls that context.  Why does it need to poll that context specifically 
when requests may be in any context?  Is it because if there are 
requests in the main thread, we must poll that, but otherwise it’s fine 
to poll any thread, and we can only have requests in the main thread if 
that’s the BB’s context?)

>> I was tempted to think callers know what happens to the BB they pass
>> to blk_get_aio_context(), and it won’t change contexts so easily, but
>> then I remembered this is exactly what happens in this case; we run
>> scsi_device_for_each_req_async_bh() in one thread (which calls
>> blk_get_aio_context()), and in the other, we change the BB’s context.
> Let's think a bit more about scsi_device_for_each_req_async()
> specifically. This is a function that runs in the main thread. Nothing
> will change any AioContext assignment if it doesn't call it. It wants to
> make sure that scsi_device_for_each_req_async_bh() is called in the
> same AioContext where the virtqueue is processed, so it schedules a BH
> and waits for it.

I don’t quite follow: it doesn’t wait for the BH.  It uses 
aio_bh_schedule_oneshot(), not aio_wait_bh_oneshot().  While you’re 
right that if it did wait, the BB context might still change, in 
practice we wouldn’t have the problem at hand because the caller is 
actually the one to change the context, concurrently while the BH is 
running.

> Waiting for it means running a nested event loop that could do anything,
> including changing AioContexts. So this is what needs the locking, not
> the blk_get_aio_context() call in scsi_device_for_each_req_async_bh().
> If we lock before the nested event loop and unlock in the BH, the check
> in the BH can become an assertion. (It is important that we unlock in
> the BH rather than after waiting because if something takes the writer
> lock, we need to unlock during the nested event loop of bdrv_wrlock() to
> avoid a deadlock.)
>
> And spawning a coroutine for this looks a lot more acceptable because
> it's on a slow path anyway.
>
> In fact, we probably don't technically need a coroutine to take the
> reader lock here. We can have a new graph lock function that asserts
> that there is no writer (we know because we're running in the main loop)
> and then atomically increments the reader count. But maybe that already
> complicates things again...

So as far as I understand we can’t just use aio_wait_bh_oneshot() and 
wrap it in bdrv_graph_rd{,un}lock_main_loop(), because that doesn’t 
actually lock the graph.  I feel like adding a new graph lock function 
for this quite highly specific case could be dangerous, because it seems 
easy to use the wrong way.

Just having a trampoline coroutine to call bdrv_graph_co_rd{,un}lock() 
seems simple enough and reasonable here (not a hot path).  Can we have 
that coroutine then use aio_wait_bh_oneshot() with the existing _bh 
function, or should that be made a coroutine, too?
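In rough form, the trampoline I am thinking of would look like this
(hypothetical sketch; whether aio_wait_bh_oneshot() is usable from coroutine
context is exactly the open question):

```c
/* Hypothetical trampoline: enter coroutine context only to take the
 * graph reader lock, then let the existing BH do the actual work. */
static void coroutine_fn scsi_device_for_each_req_async_co(void *opaque)
{
    SCSIDeviceForEachReqAsyncData *data = opaque;

    bdrv_graph_co_rdlock();

    /* Open question: can we wait for the BH from here, or does the BH
     * itself have to become a coroutine, too?  Either way, the BH (or
     * its coroutine replacement) would drop the reader lock after the
     * QTAILQ_FOREACH_SAFE() loop. */
    aio_wait_bh_oneshot(blk_get_aio_context(data->s->conf.blk),
                        scsi_device_for_each_req_async_bh, data);
}
```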

>> It seems like there are very few blk_* functions right now that
>> require taking a graph lock around it, so I’m hesitant to go that
>> route.  But if we’re protecting a BB’s context via the graph write
>> lock, I can’t think of a way around having to take a read lock
>> whenever reading a BB’s context, and holding it for as long as we
>> assume that context to remain the BB’s context.  It’s also hard to
>> figure out how long that is, case by case; for example, dma_blk_read()
>> schedules an AIO function in the BB context; but we probably don’t
>> care that this context remains the BB’s context until the request is
>> done.  In the case of scsi_device_for_each_req_async_bh(), we already
>> take care to re-schedule it when it turns out the context is outdated,
>> so it does seem quite important here, and we probably want to keep the
>> lock until after the QTAILQ_FOREACH_SAFE() loop.
> Maybe we need to audit all callers. Fortunately, there don't seem to be
> too many. At least not direct ones...
>
>> On a tangent, this TOCTTOU problem makes me wary of other blk_*
>> functions that query information.  For example, fuse_read() (in
>> block/export/fuse.c) truncates requests to the BB length.  But what if
>> the BB length changes concurrently between blk_getlength() and
>> blk_pread()?  While we can justify using the graph lock for a BB’s
>> AioContext, we can’t use it for other metadata like its length.
> Hm... Is "tough luck" an acceptable answer? ;-)

Absolutely, if we do it acknowledgingly (great word).  I’m just a bit 
worried not all of these corner cases have been acknowledged, and some 
of them may be looking for a different answer.

Hanna
Hanna Czenczek Jan. 29, 2024, 4:30 p.m. UTC | #12
On 23.01.24 18:10, Kevin Wolf wrote:
> Am 23.01.2024 um 17:40 hat Hanna Czenczek geschrieben:
>> On 21.12.23 22:23, Kevin Wolf wrote:
>>> [...]
>> My reproducer for https://issues.redhat.com/browse/RHEL-3934 now breaks more
>> often because of this commit than because of the original bug, i.e. when
>> repeatedly hot-plugging and unplugging a virtio-scsi and a scsi-hd device,
>> this tends to happen when unplugging the scsi-hd:
>>
>> {"execute":"device_del","arguments":{"id":"stg0"}}
>> {"return": {}}
>> qemu-system-x86_64: ../block/block-backend.c:2429: blk_get_aio_context:
>> Assertion `ctx == blk->ctx' failed.
>>
>> (gdb) bt
>> #0  0x00007f32a668d83c in  () at /usr/lib/libc.so.6
>> #1  0x00007f32a663d668 in raise () at /usr/lib/libc.so.6
>> #2  0x00007f32a66254b8 in abort () at /usr/lib/libc.so.6
>> #3  0x00007f32a66253dc in  () at /usr/lib/libc.so.6
>> #4  0x00007f32a6635d26 in  () at /usr/lib/libc.so.6
>> #5  0x0000556e6b4880a4 in blk_get_aio_context (blk=0x556e6e89ccf0) at
>> ../block/block-backend.c:2429
>> #6  blk_get_aio_context (blk=0x556e6e89ccf0) at
>> ../block/block-backend.c:2417
>> #7  0x0000556e6b112d87 in scsi_device_for_each_req_async_bh
>> (opaque=0x556e6e2c6d10) at ../hw/scsi/scsi-bus.c:128
>> #8  0x0000556e6b5d1966 in aio_bh_poll (ctx=ctx@entry=0x556e6d8aa290) at
>> ../util/async.c:218
>> #9  0x0000556e6b5bb16a in aio_poll (ctx=0x556e6d8aa290,
>> blocking=blocking@entry=true) at ../util/aio-posix.c:722
>> #10 0x0000556e6b4564b6 in iothread_run (opaque=opaque@entry=0x556e6d89d920)
>> at ../iothread.c:63
>> #11 0x0000556e6b5bde58 in qemu_thread_start (args=0x556e6d8aa9b0) at
>> ../util/qemu-thread-posix.c:541
>> #12 0x00007f32a668b9eb in  () at /usr/lib/libc.so.6
>> #13 0x00007f32a670f7cc in  () at /usr/lib/libc.so.6
>>
>> I don’t know anything about the problem yet, but as usual, I like
>> speculation and discovering how wrong I was later on, so one thing I came
>> across that’s funny about virtio-scsi is that requests can happen even while
>> a disk is being attached or detached.  That is, Linux seems to probe all
>> LUNs when a new virtio-scsi device is being attached, and it won’t stop just
>> because a disk is being attached or removed.  So maybe that’s part of the
>> problem, that we get a request while the BB is being detached, and
>> temporarily in an inconsistent state (BDS context differs from BB context).
> I don't know anything about the problem either, but since you already
> speculated about the cause, let me speculate about the solution:
> Can we hold the graph writer lock for the tran_commit() call in
> bdrv_try_change_aio_context()?

I tried removing the drain so that all of bdrv_try_change_aio_context() 
can require GRAPH_WRLOCK, but that broke tests/unit/test-block-iothread, 
because without draining, block jobs would need to switch AioContexts 
while running, and job_set_aio_context() doesn’t like that.  Similarly 
to blk_get_aio_context(), I assume we can in theory just drop the 
assertion there and change the context while the job is running, because 
then the job can just change AioContexts at the next pause point (and in 
the meantime send requests from the old context, which is fine), but 
this does get quite murky.  (One rather theoretical (but present) 
problem is that test-block-iothread itself contains some assertions in 
the job that its AioContext is actually the one it’s running in, and 
these assertions would no longer necessarily hold true.)

I don’t like using drain as a form of lock specifically against 
AioContext changes, but maybe Stefan is right, and we should use it in 
this specific case to get just the single problem fixed.  (Though it’s 
not quite trivial either.  We’d probably still want to remove the 
assertion from blk_get_aio_context(), so we don’t have to require all of 
its callers to hold a count in the in-flight counter.)

Hanna
Kevin Wolf Jan. 31, 2024, 10:17 a.m. UTC | #13
Am 29.01.2024 um 17:30 hat Hanna Czenczek geschrieben:
> I don’t like using drain as a form of lock specifically against AioContext
> changes, but maybe Stefan is right, and we should use it in this specific
> case to get just the single problem fixed.  (Though it’s not quite trivial
> either.  We’d probably still want to remove the assertion from
> blk_get_aio_context(), so we don’t have to require all of its callers to
> hold a count in the in-flight counter.)

Okay, fair, maybe fixing the specific problem is more important than
solving the more generic blk_get_aio_context() race.

In this case, wouldn't it be enough to increase the in-flight counter so
that the drain before switching AioContexts would run the BH before
anything bad can happen? Does the following work?

Kevin

diff --git a/hw/scsi/scsi-bus.c b/hw/scsi/scsi-bus.c
index 0a2eb11c56..dc09eb8024 100644
--- a/hw/scsi/scsi-bus.c
+++ b/hw/scsi/scsi-bus.c
@@ -120,17 +120,11 @@ static void scsi_device_for_each_req_async_bh(void *opaque)
     SCSIRequest *next;
 
     /*
-     * If the AioContext changed before this BH was called then reschedule into
-     * the new AioContext before accessing ->requests. This can happen when
-     * scsi_device_for_each_req_async() is called and then the AioContext is
-     * changed before BHs are run.
+     * The AioContext can't have changed because we increased the in-flight
+     * counter for s->conf.blk.
      */
     ctx = blk_get_aio_context(s->conf.blk);
-    if (ctx != qemu_get_current_aio_context()) {
-        aio_bh_schedule_oneshot(ctx, scsi_device_for_each_req_async_bh,
-                                g_steal_pointer(&data));
-        return;
-    }
+    assert(ctx == qemu_get_current_aio_context());
 
     QTAILQ_FOREACH_SAFE(req, &s->requests, next, next) {
         data->fn(req, data->fn_opaque);
@@ -138,6 +132,7 @@ static void scsi_device_for_each_req_async_bh(void *opaque)
 
     /* Drop the reference taken by scsi_device_for_each_req_async() */
     object_unref(OBJECT(s));
+    blk_dec_in_flight(s->conf.blk);
 }
 
 /*
@@ -163,6 +158,7 @@ static void scsi_device_for_each_req_async(SCSIDevice *s,
      */
     object_ref(OBJECT(s));
 
+    blk_inc_in_flight(s->conf.blk);
     aio_bh_schedule_oneshot(blk_get_aio_context(s->conf.blk),
                             scsi_device_for_each_req_async_bh,
                             data);
Stefan Hajnoczi Jan. 31, 2024, 8:35 p.m. UTC | #14
On Fri, Jan 26, 2024 at 04:24:49PM +0100, Hanna Czenczek wrote:
> On 26.01.24 14:18, Kevin Wolf wrote:
> > Am 25.01.2024 um 18:32 hat Hanna Czenczek geschrieben:
> > > On 23.01.24 18:10, Kevin Wolf wrote:
> > > > Am 23.01.2024 um 17:40 hat Hanna Czenczek geschrieben:
> > > > > On 21.12.23 22:23, Kevin Wolf wrote:
> > > > > > [...]
> > > > > My reproducer for https://issues.redhat.com/browse/RHEL-3934 now breaks more
> > > > > often because of this commit than because of the original bug, i.e. when
> > > > > repeatedly hot-plugging and unplugging a virtio-scsi and a scsi-hd device,
> > > > > this tends to happen when unplugging the scsi-hd:
> 
> Note: We (on issues.redhat.com) have a separate report that seems to
> concern this very problem: https://issues.redhat.com/browse/RHEL-19381
> 
> > > > > {"execute":"device_del","arguments":{"id":"stg0"}}
> > > > > {"return": {}}
> > > > > qemu-system-x86_64: ../block/block-backend.c:2429: blk_get_aio_context:
> > > > > Assertion `ctx == blk->ctx' failed.
> > > [...]
> > > 
> > > > > I don’t know anything about the problem yet, but as usual, I like
> > > > > speculation and discovering how wrong I was later on, so one thing I came
> > > > > across that’s funny about virtio-scsi is that requests can happen even while
> > > > > a disk is being attached or detached.  That is, Linux seems to probe all
> > > > > LUNs when a new virtio-scsi device is being attached, and it won’t stop just
> > > > > because a disk is being attached or removed.  So maybe that’s part of the
> > > > > problem, that we get a request while the BB is being detached, and
> > > > > temporarily in an inconsistent state (BDS context differs from BB context).
> > > > I don't know anything about the problem either, but since you already
> > > > speculated about the cause, let me speculate about the solution:
> > > > Can we hold the graph writer lock for the tran_commit() call in
> > > > bdrv_try_change_aio_context()? And of course take the reader lock for
> > > > blk_get_aio_context(), but that should be completely unproblematic.
> > > Actually, now that completely unproblematic part is giving me trouble.  I
> > > wanted to just put a graph lock into blk_get_aio_context() (making it a
> > > coroutine with a wrapper)
> > Which is the first thing I neglected and already not great. We have
> > calls of blk_get_aio_context() in the SCSI I/O path, and creating a
> > coroutine and doing at least two context switches simply for this call
> > is a lot of overhead...
> > 
> > > but callers of blk_get_aio_context() generally assume the context is
> > > going to stay the BB’s context for as long as their AioContext *
> > > variable is in scope.
> > I'm not so sure about that. And taking another step back, I'm actually
> > also not sure how much it still matters now that they can submit I/O
> > from any thread.
> 
> That’s my impression, too, but “not sure” doesn’t feel great. :)
> scsi_device_for_each_req_async_bh() specifically double-checks whether it’s
> still in the right context before invoking the specified function, so it
> seems there was some intention to continue to run in the context associated
> with the BB.
> 
> (Not judging whether that intent makes sense or not, yet.)
> 
> > Maybe the correct solution is to remove the assertion from
> > blk_get_aio_context() and just always return blk->ctx. If it's in the
> > middle of a change, you'll either get the old one or the new one. Either
> > one is fine to submit I/O from, and if you care about changes for other
> > reasons (like SCSI does), then you need explicit code to protect it
> > anyway (which SCSI apparently has, but it doesn't work).
> 
> I think most callers do just assume the BB stays in the context they got
> (without any proof, admittedly), but I agree that under re-evaluation, it
> probably doesn’t actually matter to them, really. And yes, basically, if the
> caller doesn’t need to take a lock because it doesn’t really matter whether
> blk->ctx changes while it's still using the old value, blk_get_aio_context()
> in turn doesn’t need to double-check blk->ctx against the root node’s
> context either, and nobody needs a lock.
> 
> So I agree, it’s on the caller to protect against a potentially changing
> context, blk_get_aio_context() should just return blk->ctx and not check
> against the root node.
> 
> (On a tangent: blk_drain() is a caller of blk_get_aio_context(), and it
> polls that context.  Why does it need to poll that context specifically when
> requests may be in any context?  Is it because if there are requests in the
> main thread, we must poll that, but otherwise it’s fine to poll any thread,
> and we can only have requests in the main thread if that’s the BB’s
> context?)
> 
> > > I was tempted to think callers know what happens to the BB they pass
> > > to blk_get_aio_context(), and it won’t change contexts so easily, but
> > > then I remembered this is exactly what happens in this case; we run
> > > scsi_device_for_each_req_async_bh() in one thread (which calls
> > > blk_get_aio_context()), and in the other, we change the BB’s context.
> > Let's think a bit more about scsi_device_for_each_req_async()
> > specifically. This is a function that runs in the main thread. Nothing
> > will change any AioContext assignment if it doesn't call it. It wants to
> > make sure that scsi_device_for_each_req_async_bh() is called in the
> > same AioContext where the virtqueue is processed, so it schedules a BH
> > and waits for it.
> 
> I don’t quite follow, it doesn’t wait for the BH.  It uses
> aio_bh_schedule_oneshot(), not aio_wait_bh_oneshot().  While you’re right
> that if it did wait, the BB context might still change, in practice we
> wouldn’t have the problem at hand because the caller is actually the one to
> change the context, concurrently while the BH is running.
> 
> > Waiting for it means running a nested event loop that could do anything,
> > including changing AioContexts. So this is what needs the locking, not
> > the blk_get_aio_context() call in scsi_device_for_each_req_async_bh().
> > If we lock before the nested event loop and unlock in the BH, the check
> > in the BH can become an assertion. (It is important that we unlock in
> > the BH rather than after waiting because if something takes the writer
> > lock, we need to unlock during the nested event loop of bdrv_wrlock() to
> > avoid a deadlock.)
> > 
> > And spawning a coroutine for this looks a lot more acceptable because
> > it's on a slow path anyway.
> > 
> > In fact, we probably don't technically need a coroutine to take the
> > reader lock here. We can have a new graph lock function that asserts
> > that there is no writer (we know because we're running in the main loop)
> > and then atomically increments the reader count. But maybe that already
> > complicates things again...
> 
> So as far as I understand we can’t just use aio_wait_bh_oneshot() and wrap
> it in bdrv_graph_rd{,un}lock_main_loop(), because that doesn’t actually lock
> the graph.  I feel like adding a new graph lock function for this quite
> highly specific case could be dangerous, because it seems easy to use the
> wrong way.
> 
> Just having a trampoline coroutine to call bdrv_graph_co_rd{,un}lock() seems
> simple enough and reasonable here (not a hot path).  Can we have that
> coroutine then use aio_wait_bh_oneshot() with the existing _bh function, or
> should that be made a coroutine, too?

There is a reason for running in the context associated with the BB: the
virtio-scsi code assumes all request processing happens in the BB's
AioContext. The SCSI request list and other SCSI emulation code is not
thread-safe!

The invariant is that SCSI request processing must only happen in one
AioContext. Other parts of QEMU may perform block I/O from other
AioContexts because they don't run SCSI emulation for this device.

Stefan
Hanna Czenczek Feb. 1, 2024, 9:43 a.m. UTC | #15
On 31.01.24 11:17, Kevin Wolf wrote:
> On 29.01.2024 at 17:30, Hanna Czenczek wrote:
>> I don’t like using drain as a form of lock specifically against AioContext
>> changes, but maybe Stefan is right, and we should use it in this specific
>> case to get just the single problem fixed.  (Though it’s not quite trivial
>> either.  We’d probably still want to remove the assertion from
>> blk_get_aio_context(), so we don’t have to require all of its callers to
>> hold a count in the in-flight counter.)
> Okay, fair, maybe fixing the specific problem is more important than
> solving the more generic blk_get_aio_context() race.
>
> In this case, wouldn't it be enough to increase the in-flight counter so
> that the drain before switching AioContexts would run the BH before
> anything bad can happen? Does the following work?

Yes, that’s what I had in mind (Stefan, too, I think), and in testing, 
it looks good.

Hanna

>
> Kevin
>
> diff --git a/hw/scsi/scsi-bus.c b/hw/scsi/scsi-bus.c
> index 0a2eb11c56..dc09eb8024 100644
> --- a/hw/scsi/scsi-bus.c
> +++ b/hw/scsi/scsi-bus.c
> @@ -120,17 +120,11 @@ static void scsi_device_for_each_req_async_bh(void *opaque)
>       SCSIRequest *next;
>   
>       /*
> -     * If the AioContext changed before this BH was called then reschedule into
> -     * the new AioContext before accessing ->requests. This can happen when
> -     * scsi_device_for_each_req_async() is called and then the AioContext is
> -     * changed before BHs are run.
> +     * The AioContext can't have changed because we increased the in-flight
> +     * counter for s->conf.blk.
>        */
>       ctx = blk_get_aio_context(s->conf.blk);
> -    if (ctx != qemu_get_current_aio_context()) {
> -        aio_bh_schedule_oneshot(ctx, scsi_device_for_each_req_async_bh,
> -                                g_steal_pointer(&data));
> -        return;
> -    }
> +    assert(ctx == qemu_get_current_aio_context());
>   
>       QTAILQ_FOREACH_SAFE(req, &s->requests, next, next) {
>           data->fn(req, data->fn_opaque);
> @@ -138,6 +132,7 @@ static void scsi_device_for_each_req_async_bh(void *opaque)
>   
>       /* Drop the reference taken by scsi_device_for_each_req_async() */
>       object_unref(OBJECT(s));
> +    blk_dec_in_flight(s->conf.blk);
>   }
>   
>   /*
> @@ -163,6 +158,7 @@ static void scsi_device_for_each_req_async(SCSIDevice *s,
>        */
>       object_ref(OBJECT(s));
>   
> +    blk_inc_in_flight(s->conf.blk);
>       aio_bh_schedule_oneshot(blk_get_aio_context(s->conf.blk),
>                               scsi_device_for_each_req_async_bh,
>                               data);
>
Kevin Wolf Feb. 1, 2024, 10:21 a.m. UTC | #16
On 01.02.2024 at 10:43, Hanna Czenczek wrote:
> On 31.01.24 11:17, Kevin Wolf wrote:
> > On 29.01.2024 at 17:30, Hanna Czenczek wrote:
> > > I don’t like using drain as a form of lock specifically against AioContext
> > > changes, but maybe Stefan is right, and we should use it in this specific
> > > case to get just the single problem fixed.  (Though it’s not quite trivial
> > > either.  We’d probably still want to remove the assertion from
> > > blk_get_aio_context(), so we don’t have to require all of its callers to
> > > hold a count in the in-flight counter.)
> > Okay, fair, maybe fixing the specific problem is more important than
> > solving the more generic blk_get_aio_context() race.
> > 
> > In this case, wouldn't it be enough to increase the in-flight counter so
> > that the drain before switching AioContexts would run the BH before
> > anything bad can happen? Does the following work?
> 
> Yes, that’s what I had in mind (Stefan, too, I think), and in testing,
> it looks good.

Oh, sorry, I completely misunderstood then. I thought you were talking
about adding a new drained section somewhere and that sounded a bit more
complicated. :-)

If it works, let's do this. Would you like to pick this up and send it
as a formal patch (possibly in a more polished form), or should I do
that?

Kevin
Hanna Czenczek Feb. 1, 2024, 10:35 a.m. UTC | #17
On 01.02.24 11:21, Kevin Wolf wrote:
> On 01.02.2024 at 10:43, Hanna Czenczek wrote:
>> On 31.01.24 11:17, Kevin Wolf wrote:
>>> On 29.01.2024 at 17:30, Hanna Czenczek wrote:
>>>> I don’t like using drain as a form of lock specifically against AioContext
>>>> changes, but maybe Stefan is right, and we should use it in this specific
>>>> case to get just the single problem fixed.  (Though it’s not quite trivial
>>>> either.  We’d probably still want to remove the assertion from
>>>> blk_get_aio_context(), so we don’t have to require all of its callers to
>>>> hold a count in the in-flight counter.)
>>> Okay, fair, maybe fixing the specific problem is more important than
>>> solving the more generic blk_get_aio_context() race.
>>>
>>> In this case, wouldn't it be enough to increase the in-flight counter so
>>> that the drain before switching AioContexts would run the BH before
>>> anything bad can happen? Does the following work?
>> Yes, that’s what I had in mind (Stefan, too, I think), and in testing,
>> it looks good.
> Oh, sorry, I completely misunderstood then. I thought you were talking
> about adding a new drained section somewhere and that sounded a bit more
> complicated. :-)
>
> If it works, let's do this. Would you like to pick this up and send it
> as a formal patch (possibly in a more polished form), or should I do
> that?

Sure, I can do it.

Hanna
Hanna Czenczek Feb. 1, 2024, 2:10 p.m. UTC | #18
On 31.01.24 21:35, Stefan Hajnoczi wrote:
> [...]
> There is a reason for running in the context associated with the BB: the
> virtio-scsi code assumes all request processing happens in the BB's
> AioContext. The SCSI request list and other SCSI emulation code is not
> thread-safe!

One peculiarity about virtio-scsi, as far as I understand, is that its 
context is not necessarily the BB’s context, because one virtio-scsi 
device may have many BBs.  While the BBs are being hot-plugged or 
un-plugged, their context may change (as is happening here), but that 
doesn’t stop SCSI request processing, because SCSI requests happen 
independently of whether there are devices on the SCSI bus.

If SCSI request processing is not thread-safe, doesn’t that mean it 
always must be done in the very same context, i.e. the context the 
virtio-scsi device was configured to use?  Just because a new scsi-hd BB 
is added or removed, and so we temporarily have a main context BB 
associated with the virtio-scsi device, I don’t think we should switch 
to processing requests in the main context.

Hanna
Stefan Hajnoczi Feb. 1, 2024, 2:28 p.m. UTC | #19
On Thu, Feb 01, 2024 at 03:10:12PM +0100, Hanna Czenczek wrote:
> On 31.01.24 21:35, Stefan Hajnoczi wrote:
> > [...]
> > There is a reason for running in the context associated with the BB: the
> > virtio-scsi code assumes all request processing happens in the BB's
> > AioContext. The SCSI request list and other SCSI emulation code is not
> > thread-safe!
> 
> One peculiarity about virtio-scsi, as far as I understand, is that its
> context is not necessarily the BB’s context, because one virtio-scsi device
> may have many BBs.  While the BBs are being hot-plugged or un-plugged, their
> context may change (as is happening here), but that doesn’t stop SCSI
> request processing, because SCSI requests happen independently of whether
> there are devices on the SCSI bus.
> 
> If SCSI request processing is not thread-safe, doesn’t that mean it always
> must be done in the very same context, i.e. the context the virtio-scsi
> device was configured to use?  Just because a new scsi-hd BB is added or
> removed, and so we temporarily have a main context BB associated with the
> virtio-scsi device, I don’t think we should switch to processing requests in
> the main context.

This case is not supposed to happen because virtio_scsi_hotplug()
immediately places the BB into the virtio-scsi device's AioContext by
calling blk_set_aio_context().

The AioContext invariant is checked at several points in the SCSI
request lifecycle by this function:

  static inline void virtio_scsi_ctx_check(VirtIOSCSI *s, SCSIDevice *d)
  {   
      if (s->dataplane_started && d && blk_is_available(d->conf.blk)) {
          assert(blk_get_aio_context(d->conf.blk) == s->ctx);
      } 
  }

Did you find a scenario where the virtio-scsi AioContext is different
from the scsi-hd BB's AioContext?

Stefan
Hanna Czenczek Feb. 1, 2024, 3:25 p.m. UTC | #20
On 01.02.24 15:28, Stefan Hajnoczi wrote:
> On Thu, Feb 01, 2024 at 03:10:12PM +0100, Hanna Czenczek wrote:
>> On 31.01.24 21:35, Stefan Hajnoczi wrote:
>>> On Fri, Jan 26, 2024 at 04:24:49PM +0100, Hanna Czenczek wrote:
>>>> On 26.01.24 14:18, Kevin Wolf wrote:
>>>>> Am 25.01.2024 um 18:32 hat Hanna Czenczek geschrieben:
>>>>>> On 23.01.24 18:10, Kevin Wolf wrote:
>>>>>>> Am 23.01.2024 um 17:40 hat Hanna Czenczek geschrieben:
>>>>>>>> On 21.12.23 22:23, Kevin Wolf wrote:
>>>>>>>>> From: Stefan Hajnoczi <stefanha@redhat.com>
>>>>>>>>>
>>>>>>>>> Stop depending on the AioContext lock and instead access
>>>>>>>>> SCSIDevice->requests from only one thread at a time:
>>>>>>>>> - When the VM is running only the BlockBackend's AioContext may access
>>>>>>>>>        the requests list.
>>>>>>>>> - When the VM is stopped only the main loop may access the requests
>>>>>>>>>        list.
>>>>>>>>>
>>>>>>>>> These constraints protect the requests list without the need for locking
>>>>>>>>> in the I/O code path.
>>>>>>>>>
>>>>>>>>> Note that multiple IOThreads are not supported yet because the code
>>>>>>>>> assumes all SCSIRequests are executed from a single AioContext. Leave
>>>>>>>>> that as future work.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
>>>>>>>>> Reviewed-by: Eric Blake <eblake@redhat.com>
>>>>>>>>> Message-ID: <20231204164259.1515217-2-stefanha@redhat.com>
>>>>>>>>> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
>>>>>>>>> ---
>>>>>>>>>       include/hw/scsi/scsi.h |   7 +-
>>>>>>>>>       hw/scsi/scsi-bus.c     | 181 ++++++++++++++++++++++++++++-------------
>>>>>>>>>       2 files changed, 131 insertions(+), 57 deletions(-)
>>>>>>>> My reproducer for https://issues.redhat.com/browse/RHEL-3934 now breaks more
>>>>>>>> often because of this commit than because of the original bug, i.e. when
>>>>>>>> repeatedly hot-plugging and unplugging a virtio-scsi and a scsi-hd device,
>>>>>>>> this tends to happen when unplugging the scsi-hd:
>>>> Note: We (on issues.redhat.com) have a separate report that seems to be
>>>> concerning this very problem: https://issues.redhat.com/browse/RHEL-19381
>>>>
>>>>>>>> {"execute":"device_del","arguments":{"id":"stg0"}}
>>>>>>>> {"return": {}}
>>>>>>>> qemu-system-x86_64: ../block/block-backend.c:2429: blk_get_aio_context:
>>>>>>>> Assertion `ctx == blk->ctx' failed.
>>>>>> [...]
>>>>>>
>>>>>>>> I don’t know anything about the problem yet, but as usual, I like
>>>>>>>> speculation and discovering how wrong I was later on, so one thing I came
>>>>>>>> across that’s funny about virtio-scsi is that requests can happen even while
>>>>>>>> a disk is being attached or detached.  That is, Linux seems to probe all
>>>>>>>> LUNs when a new virtio-scsi device is being attached, and it won’t stop just
>>>>>>>> because a disk is being attached or removed.  So maybe that’s part of the
>>>>>>>> problem, that we get a request while the BB is being detached, and
>>>>>>>> temporarily in an inconsistent state (BDS context differs from BB context).
>>>>>>> I don't know anything about the problem either, but since you already
>>>>>>> speculated about the cause, let me speculate about the solution:
>>>>>>> Can we hold the graph writer lock for the tran_commit() call in
>>>>>>> bdrv_try_change_aio_context()? And of course take the reader lock for
>>>>>>> blk_get_aio_context(), but that should be completely unproblematic.
>>>>>> Actually, now that completely unproblematic part is giving me trouble.  I
>>>>>> wanted to just put a graph lock into blk_get_aio_context() (making it a
>>>>>> coroutine with a wrapper)
>>>>> Which is the first thing I neglected and already not great. We have
>>>>> calls of blk_get_aio_context() in the SCSI I/O path, and creating a
>>>>> coroutine and doing at least two context switches simply for this call
>>>>> is a lot of overhead...
>>>>>
>>>>>> but callers of blk_get_aio_context() generally assume the context is
>>>>>> going to stay the BB’s context for as long as their AioContext *
>>>>>> variable is in scope.
>>>>> I'm not so sure about that. And taking another step back, I'm actually
>>>>> also not sure how much it still matters now that they can submit I/O
>>>>> from any thread.
>>>> That’s my impression, too, but “not sure” doesn’t feel great. :)
>>>> scsi_device_for_each_req_async_bh() specifically double-checks whether it’s
>>>> still in the right context before invoking the specified function, so it
>>>> seems there was some intention to continue to run in the context associated
>>>> with the BB.
>>>>
>>>> (Not judging whether that intent makes sense or not, yet.)
>>>>
>>>>> Maybe the correct solution is to remove the assertion from
>>>>> blk_get_aio_context() and just always return blk->ctx. If it's in the
>>>>> middle of a change, you'll either get the old one or the new one. Either
>>>>> one is fine to submit I/O from, and if you care about changes for other
>>>>> reasons (like SCSI does), then you need explicit code to protect it
>>>>> anyway (which SCSI apparently has, but it doesn't work).
>>>> I think most callers do just assume the BB stays in the context they got
>>>> (without any proof, admittedly), but I agree that under re-evaluation, it
>>>> probably doesn’t actually matter to them, really. And yes, basically, if the
>>>> caller doesn’t need to take a lock because it doesn’t really matter whether
>>>> blk->ctx changes while its still using the old value, blk_get_aio_context()
>>>> in turn doesn’t need to double-check blk->ctx against the root node’s
>>>> context either, and nobody needs a lock.
>>>>
>>>> So I agree, it’s on the caller to protect against a potentially changing
>>>> context, blk_get_aio_context() should just return blk->ctx and not check
>>>> against the root node.
>>>>
>>>> (On a tangent: blk_drain() is a caller of blk_get_aio_context(), and it
>>>> polls that context.  Why does it need to poll that context specifically when
>>>> requests may be in any context?  Is it because if there are requests in the
>>>> main thread, we must poll that, but otherwise it’s fine to poll any thread,
>>>> and we can only have requests in the main thread if that’s the BB’s
>>>> context?)
>>>>
>>>>>> I was tempted to think callers know what happens to the BB they pass
>>>>>> to blk_get_aio_context(), and it won’t change contexts so easily, but
>>>>>> then I remembered this is exactly what happens in this case; we run
>>>>>> scsi_device_for_each_req_async_bh() in one thread (which calls
>>>>>> blk_get_aio_context()), and in the other, we change the BB’s context.
>>>>> Let's think a bit more about scsi_device_for_each_req_async()
>>>>> specifically. This is a function that runs in the main thread. Nothing
>>>>> will change any AioContext assignment if it doesn't call it. It wants to
>>>>> make sure that scsi_device_for_each_req_async_bh() is called in the
>>>>> same AioContext where the virtqueue is processed, so it schedules a BH
>>>>> and waits for it.
>>>> I don’t quite follow, it doesn’t wait for the BH.  It uses
>>>> aio_bh_schedule_oneshot(), not aio_wait_bh_oneshot().  While you’re right
>>>> that if it did wait, the BB context might still change, in practice we
>>>> wouldn’t have the problem at hand because the caller is actually the one to
>>>> change the context, concurrently while the BH is running.
>>>>
>>>>> Waiting for it means running a nested event loop that could do anything,
>>>>> including changing AioContexts. So this is what needs the locking, not
>>>>> the blk_get_aio_context() call in scsi_device_for_each_req_async_bh().
>>>>> If we lock before the nested event loop and unlock in the BH, the check
>>>>> in the BH can become an assertion. (It is important that we unlock in
>>>>> the BH rather than after waiting because if something takes the writer
>>>>> lock, we need to unlock during the nested event loop of bdrv_wrlock() to
>>>>> avoid a deadlock.)
>>>>>
>>>>> And spawning a coroutine for this looks a lot more acceptable because
>>>>> it's on a slow path anyway.
>>>>>
>>>>> In fact, we probably don't technically need a coroutine to take the
>>>>> reader lock here. We can have a new graph lock function that asserts
>>>>> that there is no writer (we know because we're running in the main loop)
>>>>> and then atomically increments the reader count. But maybe that already
>>>>> complicates things again...
>>>> So as far as I understand we can’t just use aio_wait_bh_oneshot() and wrap
>>>> it in bdrv_graph_rd{,un}lock_main_loop(), because that doesn’t actually lock
>>>> the graph.  I feel like adding a new graph lock function for this quite
>>>> highly specific case could be dangerous, because it seems easy to use the
>>>> wrong way.
>>>>
>>>> Just having a trampoline coroutine to call bdrv_graph_co_rd{,un}lock() seems
>>>> simple enough and reasonable here (not a hot path).  Can we have that
>>>> coroutine then use aio_wait_bh_oneshot() with the existing _bh function, or
>>>> should that be made a coroutine, too?
>>> There is a reason for running in the context associated with the BB: the
>>> virtio-scsi code assumes all request processing happens in the BB's
>>> AioContext. The SCSI request list and other SCSI emulation code is not
>>> thread-safe!
>> One peculiarity about virtio-scsi, as far as I understand, is that its
>> context is not necessarily the BB’s context, because one virtio-scsi device
>> may have many BBs.  While the BBs are being hot-plugged or un-plugged, their
>> context may change (as is happening here), but that doesn’t stop SCSI
>> request processing, because SCSI requests happen independently of whether
>> there are devices on the SCSI bus.
>>
>> If SCSI request processing is not thread-safe, doesn’t that mean it always
>> must be done in the very same context, i.e. the context the virtio-scsi
>> device was configured to use?  Just because a new scsi-hd BB is added or
>> removed, and so we temporarily have a main context BB associated with the
>> virtio-scsi device, I don’t think we should switch to processing requests in
>> the main context.
> This case is not supposed to happen because virtio_scsi_hotplug()
> immediately places the BB into the virtio-scsi device's AioContext by
> calling blk_set_aio_context().
>
> The AioContext invariant is checked at several points in the SCSI
> request lifecycle by this function:
>
>    static inline void virtio_scsi_ctx_check(VirtIOSCSI *s, SCSIDevice *d)
>    {
>        if (s->dataplane_started && d && blk_is_available(d->conf.blk)) {
>            assert(blk_get_aio_context(d->conf.blk) == s->ctx);
>        }
>    }

Yes, in fact, when I looked at other callers of blk_get_aio_context(), 
this was one place that didn’t really make sense to me, exactly because 
I doubt the invariant.

(Other places are scsi_aio_complete() and scsi_read_complete_noio().)

> Did you find a scenario where the virtio-scsi AioContext is different
> from the scsi-hd BB's AioContext?

Technically, that’s the reason for this thread, specifically that 
virtio_scsi_hotunplug() switches the BB back to the main context while 
scsi_device_for_each_req_async_bh() is running.  Yes, we can fix that 
specific case via the in-flight counter, but I’m wondering whether 
there’s really any merit in requiring the BB to always be in 
virtio-scsi’s context, or whether it would make more sense to schedule 
everything in virtio-scsi’s context.  Now that BBs/BDSs can receive 
requests from any context, that is.

The critical path is in hot-plugging and -unplugging, because those 
happen in the main context, concurrently to request processing in 
virtio-scsi’s context.  As for hot-plugging, what I’ve seen is 
https://issues.redhat.com/browse/RHEL-3934#comment-23272702 : The 
scsi-hd device is created before it’s hot-plugged into virtio-scsi, so 
technically, we do then have a scsi-hd device whose context is different 
from virtio-scsi.  The blk_drain() that’s being done to this new scsi-hd 
device does lead into virtio_scsi_drained_begin(), so there is at least 
some connection between the two.

As for hot-unplugging, my worry is that there may be SCSI requests 
ongoing, which are processed in virtio-scsi’s context.  My hope is that 
scsi_device_purge_requests() settles all of them, so that there are no 
requests left after virtio_scsi_hotunplug()’s 
qdev_simple_device_unplug_cb(), before the BB is switched to the main 
context.  Right now, it doesn’t do that, though, because we use 
scsi_device_for_each_req_async(), i.e. don’t wait for those requests to 
be cancelled.  With the in-flight patch, the subsequent blk_drain() in 
scsi_device_purge_requests() would then await it, though.[1]

Even with that, the situation is all but clear to me.  We do run 
scsi_req_cancel_async() for every single request we currently have, and 
then wait until the in-flight counter reaches 0, which seems good[2], 
but with the rest of the unplugging code running in the main context, 
and virtio-scsi continuing to process requests from the guest in a 
different context, I can’t easily figure out why it would be impossible 
for the guest to launch a SCSI request for that SCSI disk that is being 
unplugged.  On one hand, just because the guest has accepted hot unplug 
does not mean it would be impossible to act against supposed protocol 
and submit another request.  On the other, the unplugging and unrealize 
path is, to me, a very complex and opaque state machine that 
makes it difficult to grasp where exactly the ties between a scsi-hd 
device with its BB and the virtio-scsi device are completely severed, 
i.e. until which point it is possible for virtio-scsi code to see the BB 
during the unplugging process, and consider it the target of a request.

It just seems simpler to me to not rely on the BB's context at all.

Hanna


[1] So, fun note: Incrementing the in-flight counter would fix the bug 
even without a drain in bdrv_try_change_aio_context(), because 
scsi_device_purge_requests() has a blk_drain() anyway.

[2] I had to inspect the code, though, so already this is non-obvious.  
There are no comments on either scsi_device_purge_one_req() or 
scsi_device_purge_requests(), so it’s unclear what their guarantees 
are.  scsi_device_purge_one_req() calls scsi_req_cancel_async(), which 
indicates that the request isn’t necessarily deleted after the function 
returns, and so you need to look at the code: Non-I/O requests are 
deleted, but I/O requests are not, they are just cancelled.  However, I 
assume that I/O requests have incremented some in-flight counter, so I 
assume that the blk_drain() in scsi_device_purge_requests() takes care 
of settling them all.
Hanna Czenczek Feb. 1, 2024, 3:49 p.m. UTC | #21
On 01.02.24 16:25, Hanna Czenczek wrote:

[...]

> It just seems simpler to me to not rely on the BB's context at all.

Hm, I now see the problem is that the processing (and scheduling) is 
largely done in generic SCSI code, which doesn’t have access to 
virtio-scsi’s context, only to that of the BB.  That makes my idea quite 
impossible. :/
Hanna Czenczek Feb. 2, 2024, 12:32 p.m. UTC | #22
On 01.02.24 16:25, Hanna Czenczek wrote:
> On 01.02.24 15:28, Stefan Hajnoczi wrote:

[...]

>> Did you find a scenario where the virtio-scsi AioContext is different
>> from the scsi-hd BB's AioContext?
>
> Technically, that’s the reason for this thread, specifically that 
> virtio_scsi_hotunplug() switches the BB back to the main context while 
> scsi_device_for_each_req_async_bh() is running.  Yes, we can fix that 
> specific case via the in-flight counter, but I’m wondering whether 
> there’s really any merit in requiring the BB to always be in 
> virtio-scsi’s context, or whether it would make more sense to schedule 
> everything in virtio-scsi’s context.  Now that BBs/BDSs can receive 
> requests from any context, that is.

Now that I know that wouldn’t be easy, let me turn this around: As far 
as I understand, scsi_device_for_each_req_async_bh() should still run in 
virtio-scsi’s context, but that’s hard, so we take the BB’s context, 
which we therefore require to be the same one. Further, (again AFAIU,) 
virtio-scsi’s context cannot change (only set in 
virtio_scsi_dataplane_setup(), which is run in 
virtio_scsi_device_realize()).  Therefore, why does the 
scsi_device_for_each_req_async() code accommodate BB context changes?

Hanna
Stefan Hajnoczi Feb. 6, 2024, 7:32 p.m. UTC | #23
On Fri, Feb 02, 2024 at 01:32:39PM +0100, Hanna Czenczek wrote:
> On 01.02.24 16:25, Hanna Czenczek wrote:
> > On 01.02.24 15:28, Stefan Hajnoczi wrote:
> 
> [...]
> 
> > > Did you find a scenario where the virtio-scsi AioContext is different
> > > from the scsi-hd BB's AioContext?
> > 
> > Technically, that’s the reason for this thread, specifically that
> > virtio_scsi_hotunplug() switches the BB back to the main context while
> > scsi_device_for_each_req_async_bh() is running.  Yes, we can fix that
> > specific case via the in-flight counter, but I’m wondering whether
> > there’s really any merit in requiring the BB to always be in
> > virtio-scsi’s context, or whether it would make more sense to schedule
> > everything in virtio-scsi’s context.  Now that BBs/BDSs can receive
> > requests from any context, that is.
> 
> Now that I know that wouldn’t be easy, let me turn this around: As far as I
> understand, scsi_device_for_each_req_async_bh() should still run in
> virtio-scsi’s context, but that’s hard, so we take the BB’s context, which
> we therefore require to be the same one. Further, (again AFAIU,)
> virtio-scsi’s context cannot change (only set in
> virtio_scsi_dataplane_setup(), which is run in
> virtio_scsi_device_realize()).  Therefore, why does the
> scsi_device_for_each_req_async() code accommodate for BB context changes?

1. scsi_disk_reset() -> scsi_device_purge_requests() is called without
   in-flight requests.
2. The BH is scheduled by scsi_device_purge_requests() ->
   scsi_device_for_each_req_async().
3. blk_drain() is a nop when there are no in-flight requests and does not
   flush BHs.
4. The AioContext changes when the virtio-scsi device resets.
5. The BH executes.

Kevin and I touched on the idea of flushing BHs in bdrv_drain() even
when there are no requests in flight. This hasn't been implemented as of
today, but would also reduce the chance of scenarios like the one I
mentioned.

I think it's safer to handle the case where the BH runs after an
AioContext change until either everything is thread-safe or the
AioContext never changes.

Stefan
diff mbox series

Patch

diff --git a/include/hw/scsi/scsi.h b/include/hw/scsi/scsi.h
index 3692ca82f3..10c4e8288d 100644
--- a/include/hw/scsi/scsi.h
+++ b/include/hw/scsi/scsi.h
@@ -69,14 +69,19 @@  struct SCSIDevice
 {
     DeviceState qdev;
     VMChangeStateEntry *vmsentry;
-    QEMUBH *bh;
     uint32_t id;
     BlockConf conf;
     SCSISense unit_attention;
     bool sense_is_ua;
     uint8_t sense[SCSI_SENSE_BUF_SIZE];
     uint32_t sense_len;
+
+    /*
+     * The requests list is only accessed from the AioContext that executes
+     * requests or from the main loop when IOThread processing is stopped.
+     */
     QTAILQ_HEAD(, SCSIRequest) requests;
+
     uint32_t channel;
     uint32_t lun;
     int blocksize;
diff --git a/hw/scsi/scsi-bus.c b/hw/scsi/scsi-bus.c
index fc4b77fdb0..b649cdf555 100644
--- a/hw/scsi/scsi-bus.c
+++ b/hw/scsi/scsi-bus.c
@@ -85,6 +85,89 @@  SCSIDevice *scsi_device_get(SCSIBus *bus, int channel, int id, int lun)
     return d;
 }
 
+/*
+ * Invoke @fn() for each enqueued request in device @s. Must be called from the
+ * main loop thread while the guest is stopped. This is only suitable for
+ * vmstate ->put(), use scsi_device_for_each_req_async() for other cases.
+ */
+static void scsi_device_for_each_req_sync(SCSIDevice *s,
+                                          void (*fn)(SCSIRequest *, void *),
+                                          void *opaque)
+{
+    SCSIRequest *req;
+    SCSIRequest *next_req;
+
+    assert(!runstate_is_running());
+    assert(qemu_in_main_thread());
+
+    QTAILQ_FOREACH_SAFE(req, &s->requests, next, next_req) {
+        fn(req, opaque);
+    }
+}
+
+typedef struct {
+    SCSIDevice *s;
+    void (*fn)(SCSIRequest *, void *);
+    void *fn_opaque;
+} SCSIDeviceForEachReqAsyncData;
+
+static void scsi_device_for_each_req_async_bh(void *opaque)
+{
+    g_autofree SCSIDeviceForEachReqAsyncData *data = opaque;
+    SCSIDevice *s = data->s;
+    AioContext *ctx;
+    SCSIRequest *req;
+    SCSIRequest *next;
+
+    /*
+     * If the AioContext changed before this BH was called then reschedule into
+     * the new AioContext before accessing ->requests. This can happen when
+     * scsi_device_for_each_req_async() is called and then the AioContext is
+     * changed before BHs are run.
+     */
+    ctx = blk_get_aio_context(s->conf.blk);
+    if (ctx != qemu_get_current_aio_context()) {
+        aio_bh_schedule_oneshot(ctx, scsi_device_for_each_req_async_bh,
+                                g_steal_pointer(&data));
+        return;
+    }
+
+    QTAILQ_FOREACH_SAFE(req, &s->requests, next, next) {
+        data->fn(req, data->fn_opaque);
+    }
+
+    /* Drop the reference taken by scsi_device_for_each_req_async() */
+    object_unref(OBJECT(s));
+}
+
+/*
+ * Schedule @fn() to be invoked for each enqueued request in device @s. @fn()
+ * runs in the AioContext that is executing the request.
+ */
+static void scsi_device_for_each_req_async(SCSIDevice *s,
+                                           void (*fn)(SCSIRequest *, void *),
+                                           void *opaque)
+{
+    assert(qemu_in_main_thread());
+
+    SCSIDeviceForEachReqAsyncData *data =
+        g_new(SCSIDeviceForEachReqAsyncData, 1);
+
+    data->s = s;
+    data->fn = fn;
+    data->fn_opaque = opaque;
+
+    /*
+     * Hold a reference to the SCSIDevice until
+     * scsi_device_for_each_req_async_bh() finishes.
+     */
+    object_ref(OBJECT(s));
+
+    aio_bh_schedule_oneshot(blk_get_aio_context(s->conf.blk),
+                            scsi_device_for_each_req_async_bh,
+                            data);
+}
+
 static void scsi_device_realize(SCSIDevice *s, Error **errp)
 {
     SCSIDeviceClass *sc = SCSI_DEVICE_GET_CLASS(s);
@@ -144,20 +227,18 @@  void scsi_bus_init_named(SCSIBus *bus, size_t bus_size, DeviceState *host,
     qbus_set_bus_hotplug_handler(BUS(bus));
 }
 
-static void scsi_dma_restart_bh(void *opaque)
+void scsi_req_retry(SCSIRequest *req)
 {
-    SCSIDevice *s = opaque;
-    SCSIRequest *req, *next;
-
-    qemu_bh_delete(s->bh);
-    s->bh = NULL;
+    req->retry = true;
+}
 
-    aio_context_acquire(blk_get_aio_context(s->conf.blk));
-    QTAILQ_FOREACH_SAFE(req, &s->requests, next, next) {
-        scsi_req_ref(req);
-        if (req->retry) {
-            req->retry = false;
-            switch (req->cmd.mode) {
+/* Called in the AioContext that is executing the request */
+static void scsi_dma_restart_req(SCSIRequest *req, void *opaque)
+{
+    scsi_req_ref(req);
+    if (req->retry) {
+        req->retry = false;
+        switch (req->cmd.mode) {
             case SCSI_XFER_FROM_DEV:
             case SCSI_XFER_TO_DEV:
                 scsi_req_continue(req);
@@ -166,37 +247,22 @@  static void scsi_dma_restart_bh(void *opaque)
                 scsi_req_dequeue(req);
                 scsi_req_enqueue(req);
                 break;
-            }
         }
-        scsi_req_unref(req);
     }
-    aio_context_release(blk_get_aio_context(s->conf.blk));
-    /* Drop the reference that was acquired in scsi_dma_restart_cb */
-    object_unref(OBJECT(s));
-}
-
-void scsi_req_retry(SCSIRequest *req)
-{
-    /* No need to save a reference, because scsi_dma_restart_bh just
-     * looks at the request list.  */
-    req->retry = true;
+    scsi_req_unref(req);
 }
 
 static void scsi_dma_restart_cb(void *opaque, bool running, RunState state)
 {
     SCSIDevice *s = opaque;
 
+    assert(qemu_in_main_thread());
+
     if (!running) {
         return;
     }
-    if (!s->bh) {
-        AioContext *ctx = blk_get_aio_context(s->conf.blk);
-        /* The reference is dropped in scsi_dma_restart_bh.*/
-        object_ref(OBJECT(s));
-        s->bh = aio_bh_new_guarded(ctx, scsi_dma_restart_bh, s,
-                                   &DEVICE(s)->mem_reentrancy_guard);
-        qemu_bh_schedule(s->bh);
-    }
+
+    scsi_device_for_each_req_async(s, scsi_dma_restart_req, NULL);
 }
 
 static bool scsi_bus_is_address_free(SCSIBus *bus,
@@ -1657,15 +1723,16 @@  void scsi_device_set_ua(SCSIDevice *sdev, SCSISense sense)
     }
 }
 
+static void scsi_device_purge_one_req(SCSIRequest *req, void *opaque)
+{
+    scsi_req_cancel_async(req, NULL);
+}
+
 void scsi_device_purge_requests(SCSIDevice *sdev, SCSISense sense)
 {
-    SCSIRequest *req;
+    scsi_device_for_each_req_async(sdev, scsi_device_purge_one_req, NULL);
 
     aio_context_acquire(blk_get_aio_context(sdev->conf.blk));
-    while (!QTAILQ_EMPTY(&sdev->requests)) {
-        req = QTAILQ_FIRST(&sdev->requests);
-        scsi_req_cancel_async(req, NULL);
-    }
     blk_drain(sdev->conf.blk);
     aio_context_release(blk_get_aio_context(sdev->conf.blk));
     scsi_device_set_ua(sdev, sense);
@@ -1737,31 +1804,33 @@  static char *scsibus_get_fw_dev_path(DeviceState *dev)
 
 /* SCSI request list.  For simplicity, pv points to the whole device */
 
+static void put_scsi_req(SCSIRequest *req, void *opaque)
+{
+    QEMUFile *f = opaque;
+
+    assert(!req->io_canceled);
+    assert(req->status == -1 && req->host_status == -1);
+    assert(req->enqueued);
+
+    qemu_put_sbyte(f, req->retry ? 1 : 2);
+    qemu_put_buffer(f, req->cmd.buf, sizeof(req->cmd.buf));
+    qemu_put_be32s(f, &req->tag);
+    qemu_put_be32s(f, &req->lun);
+    if (req->bus->info->save_request) {
+        req->bus->info->save_request(f, req);
+    }
+    if (req->ops->save_request) {
+        req->ops->save_request(f, req);
+    }
+}
+
 static int put_scsi_requests(QEMUFile *f, void *pv, size_t size,
                              const VMStateField *field, JSONWriter *vmdesc)
 {
     SCSIDevice *s = pv;
-    SCSIBus *bus = DO_UPCAST(SCSIBus, qbus, s->qdev.parent_bus);
-    SCSIRequest *req;
 
-    QTAILQ_FOREACH(req, &s->requests, next) {
-        assert(!req->io_canceled);
-        assert(req->status == -1 && req->host_status == -1);
-        assert(req->enqueued);
-
-        qemu_put_sbyte(f, req->retry ? 1 : 2);
-        qemu_put_buffer(f, req->cmd.buf, sizeof(req->cmd.buf));
-        qemu_put_be32s(f, &req->tag);
-        qemu_put_be32s(f, &req->lun);
-        if (bus->info->save_request) {
-            bus->info->save_request(f, req);
-        }
-        if (req->ops->save_request) {
-            req->ops->save_request(f, req);
-        }
-    }
+    scsi_device_for_each_req_sync(s, put_scsi_req, f);
     qemu_put_sbyte(f, 0);
-
     return 0;
 }