mbox series

[0/3] monitor: only run coroutine commands in qemu_aio_context

Message ID 20240116190042.1363717-1-stefanha@redhat.com (mailing list archive)
Headers show
Series monitor: only run coroutine commands in qemu_aio_context | expand

Message

Stefan Hajnoczi Jan. 16, 2024, 7 p.m. UTC
Several bugs have been reported related to how QMP commands are rescheduled in
qemu_aio_context:
- https://gitlab.com/qemu-project/qemu/-/issues/1933
- https://issues.redhat.com/browse/RHEL-17369
- https://bugzilla.redhat.com/show_bug.cgi?id=2215192
- https://bugzilla.redhat.com/show_bug.cgi?id=2214985

The first instance of the bug interacted with drain_call_rcu() temporarily
dropping the BQL and resulted in vCPU threads entering device emulation code
simultaneously (something that should never happen). I set out to make
drain_call_rcu() safe to use in this environment, but Paolo and Kevin discussed
the possibility of avoiding rescheduling the monitor_qmp_dispatcher_co()
coroutine for non-coroutine commands. This would prevent monitor commands from
running during vCPU thread aio_poll() entirely and addresses the root cause.

This patch series implements this idea. qemu-iotests is sensitive to the exact
order in which QMP events and responses are emitted. Running QMP handlers in
the iohandler AioContext causes some QMP events to be ordered differently than
before. It is therefore necessary to adjust the reference output in many test
cases. The actual QMP code change is small and everything else is just to make
qemu-iotests happy.

If you have bugs related to the same issue, please retest them with these
patches. Thanks!

Stefan Hajnoczi (3):
  iotests: add filter_qmp_generated_node_ids()
  iotests: port 141 to Python for reliable QMP testing
  monitor: only run coroutine commands in qemu_aio_context

 monitor/qmp.c                                 |  17 -
 qapi/qmp-dispatch.c                           |  24 +-
 tests/qemu-iotests/060.out                    |   4 +-
 tests/qemu-iotests/071.out                    |   4 +-
 tests/qemu-iotests/081.out                    |  16 +-
 tests/qemu-iotests/087.out                    |  12 +-
 tests/qemu-iotests/108.out                    |   2 +-
 tests/qemu-iotests/109                        |   4 +-
 tests/qemu-iotests/109.out                    |  78 ++---
 tests/qemu-iotests/117.out                    |   2 +-
 tests/qemu-iotests/120.out                    |   2 +-
 tests/qemu-iotests/127.out                    |   2 +-
 tests/qemu-iotests/140.out                    |   2 +-
 tests/qemu-iotests/141                        | 297 +++++++-----------
 tests/qemu-iotests/141.out                    | 190 +++--------
 tests/qemu-iotests/143.out                    |   2 +-
 tests/qemu-iotests/156.out                    |   2 +-
 tests/qemu-iotests/176.out                    |  16 +-
 tests/qemu-iotests/182.out                    |   2 +-
 tests/qemu-iotests/183.out                    |   4 +-
 tests/qemu-iotests/184.out                    |  32 +-
 tests/qemu-iotests/185                        |   6 +-
 tests/qemu-iotests/185.out                    |  45 ++-
 tests/qemu-iotests/191.out                    |  16 +-
 tests/qemu-iotests/195.out                    |  16 +-
 tests/qemu-iotests/223.out                    |  16 +-
 tests/qemu-iotests/227.out                    |  32 +-
 tests/qemu-iotests/247.out                    |   2 +-
 tests/qemu-iotests/273.out                    |   8 +-
 tests/qemu-iotests/308                        |   4 +-
 tests/qemu-iotests/308.out                    |   4 +-
 tests/qemu-iotests/iotests.py                 |   7 +
 tests/qemu-iotests/tests/file-io-error        |   5 +-
 tests/qemu-iotests/tests/iothreads-resize.out |   2 +-
 tests/qemu-iotests/tests/qsd-jobs.out         |   4 +-
 35 files changed, 375 insertions(+), 506 deletions(-)

Comments

Michael Tokarev Jan. 18, 2024, 11:20 a.m. UTC | #1
16.01.2024 22:00, Stefan Hajnoczi wrote:
> Several bugs have been reported related to how QMP commands are rescheduled in
> qemu_aio_context:
> - https://gitlab.com/qemu-project/qemu/-/issues/1933
> - https://issues.redhat.com/browse/RHEL-17369
> - https://bugzilla.redhat.com/show_bug.cgi?id=2215192
> - https://bugzilla.redhat.com/show_bug.cgi?id=2214985
> 
> The first instance of the bug interacted with drain_call_rcu() temporarily
> dropping the BQL and resulted in vCPU threads entering device emulation code
> simultaneously (something that should never happen). I set out to make
> drain_call_rcu() safe to use in this environment, but Paolo and Kevin discussed
> the possibility of avoiding rescheduling the monitor_qmp_dispatcher_co()
> coroutine for non-coroutine commands. This would prevent monitor commands from
> running during vCPU thread aio_poll() entirely and addresses the root cause.
> 
> This patch series implements this idea. qemu-iotests is sensitive to the exact
> order in which QMP events and responses are emitted. Running QMP handlers in
> the iohandler AioContext causes some QMP events to be ordered differently than
> before. It is therefore necessary to adjust the reference output in many test
> cases. The actual QMP code change is small and everything else is just to make
> qemu-iotests happy.

This seems to be -stable material too, once it is applied to master.
It's difficult for me to catch the issue (#1933 @gitlab) locally, but I did
have some success there, and it always work after this patch is applied.
It also seem to work fine on 7.2, not only on 8.2, fwiw.

Thanks,

/mjt
Peter Maydell Jan. 29, 2024, 11:38 a.m. UTC | #2
On Tue, 16 Jan 2024 at 19:01, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> Several bugs have been reported related to how QMP commands are rescheduled in
> qemu_aio_context:
> - https://gitlab.com/qemu-project/qemu/-/issues/1933
> - https://issues.redhat.com/browse/RHEL-17369
> - https://bugzilla.redhat.com/show_bug.cgi?id=2215192
> - https://bugzilla.redhat.com/show_bug.cgi?id=2214985
>
> The first instance of the bug interacted with drain_call_rcu() temporarily
> dropping the BQL and resulted in vCPU threads entering device emulation code
> simultaneously (something that should never happen). I set out to make
> drain_call_rcu() safe to use in this environment, but Paolo and Kevin discussed
> the possibility of avoiding rescheduling the monitor_qmp_dispatcher_co()
> coroutine for non-coroutine commands. This would prevent monitor commands from
> running during vCPU thread aio_poll() entirely and addresses the root cause.
>
> This patch series implements this idea. qemu-iotests is sensitive to the exact
> order in which QMP events and responses are emitted. Running QMP handlers in
> the iohandler AioContext causes some QMP events to be ordered differently than
> before. It is therefore necessary to adjust the reference output in many test
> cases. The actual QMP code change is small and everything else is just to make
> qemu-iotests happy.

Hi; we have a suspicion that this change has resulted in a flaky-CI
test: iotest-144 sometimes fails, apparently because a "return"
result from QMP isn't always returned at the same place in relation
to other QMP events. Could you have a look at it?

https://gitlab.com/qemu-project/qemu/-/issues/2126

thanks
-- PMM
YangHang Liu Feb. 3, 2024, 2:19 a.m. UTC | #3
It's easily for me to encounter " ../block/qcow2.c:5263:
ImageInfoSpecific *qcow2_get_specific_info(BlockDriverState *, Error
**): Assertion `false' failed" issue during 1Q vhost-user interface +
RT VM + post-copy migration

After applying this patch, the issue is still not reproduced even if I
repeat the same migration test for 60 times.

Tested-by: Yanghang Liu <yanghliu@redhat.com>


Best Regards,
YangHang Liu

On Mon, Jan 29, 2024 at 7:39 PM Peter Maydell <peter.maydell@linaro.org> wrote:
>
> On Tue, 16 Jan 2024 at 19:01, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > Several bugs have been reported related to how QMP commands are rescheduled in
> > qemu_aio_context:
> > - https://gitlab.com/qemu-project/qemu/-/issues/1933
> > - https://issues.redhat.com/browse/RHEL-17369
> > - https://bugzilla.redhat.com/show_bug.cgi?id=2215192
> > - https://bugzilla.redhat.com/show_bug.cgi?id=2214985
> >
> > The first instance of the bug interacted with drain_call_rcu() temporarily
> > dropping the BQL and resulted in vCPU threads entering device emulation code
> > simultaneously (something that should never happen). I set out to make
> > drain_call_rcu() safe to use in this environment, but Paolo and Kevin discussed
> > the possibility of avoiding rescheduling the monitor_qmp_dispatcher_co()
> > coroutine for non-coroutine commands. This would prevent monitor commands from
> > running during vCPU thread aio_poll() entirely and addresses the root cause.
> >
> > This patch series implements this idea. qemu-iotests is sensitive to the exact
> > order in which QMP events and responses are emitted. Running QMP handlers in
> > the iohandler AioContext causes some QMP events to be ordered differently than
> > before. It is therefore necessary to adjust the reference output in many test
> > cases. The actual QMP code change is small and everything else is just to make
> > qemu-iotests happy.
>
> Hi; we have a suspicion that this change has resulted in a flaky-CI
> test: iotest-144 sometimes fails, apparently because a "return"
> result from QMP isn't always returned at the same place in relation
> to other QMP events. Could you have a look at it?
>
> https://gitlab.com/qemu-project/qemu/-/issues/2126
>
> thanks
> -- PMM
>