mbox series

[0/5] migration: Fix the BDRV_O_INACTIVE assertion

Message ID 20241125144612.16194-1-farosas@suse.de (mailing list archive)
Headers show
Series migration: Fix the BDRV_O_INACTIVE assertion | expand

Message

Fabiano Rosas Nov. 25, 2024, 2:46 p.m. UTC
At this point everyone is probably familiar with the assertion:

  bdrv_inactivate_recurse: Assertion `!(bs->open_flags &
  BDRV_O_INACTIVE)' failed.

The issue is that at the end of migration, the block layer is
deactivated to release the locks on the disks so that in a shared
storage scenario the migration destination can take over. Trying to
deactivate twice leads to the above assertion.

The reason deactivation is happening twice is due to the migration
capability "late-block-activate", which delays activation until the
destination VM receives a qmp_cont command. The reasoning for that cap
is:

    Activating the block devices causes the locks to be taken on
    the backing file.  If we're running with -S and the destination libvirt
    hasn't started the destination with 'cont', it's expecting the locks are
    still untaken.

    Don't activate the block devices if we're not going to autostart the VM;
    'cont' already will do that anyway.   This change is tied to the new
    migration capability 'late-block-activate' that defaults to off, keeping
    the old behaviour by default.

    bz: https://bugzilla.redhat.com/show_bug.cgi?id=1560854

However, the expected qmp_cont command never happens, because the user
may never decide to continue the VM and instead attempt another
migration while the VM is paused.

Fix this by assuming that the qmp_migrate command having been issued
is enough indication that it is now ok to take the locks again
(similarly to qmp_cont) and call bdrv_activate_all() at that
moment. This is the least invasive fix that won't require us to add
new qmp commands and go change libvirt, etc.

I'm also copying the block list (for obvious reasons, but also)
because Vladimir pointed out that asserts of this kind might not only
happen during migration. I know of at least
bdrv_co_write_req_prepare() which will assert if reached when a guest
is paused after a migration that used late-block-activate.

Patches 1-3 are just shuffling code;

Patch 4 is the fix;

Patch 5 is the test. To reproduce the issue revert patch 4 and run:

QTEST_QEMU_BINARY=./qemu-system-x86_64 ./tests/qtest/migration-test -p \
/x86_64/migration/precopy/ping-pong/late-block-activate

References:
https://gitlab.com/qemu-project/qemu/-/issues/686
https://gitlab.com/qemu-project/qemu/-/issues/2395
https://lore.kernel.org/r/20240924125611.664315-1-andrey.drobyshev@virtuozzo.com

CI run: https://gitlab.com/farosas/qemu/-/pipelines/1556902330

Fabiano Rosas (5):
  tests/qtest/migration: Move more code under only_target
  tests/qtest/migration: Don't use hardcoded strings for -serial
  tests/qtest/migration: Support cleaning up only one side of migration
  migration: Activate block devices if VM is paused when migrating
  tests/qtest/migration: Test successive migrations

 migration/migration.c           |  19 +++
 tests/qtest/migration-helpers.c |   8 +
 tests/qtest/migration-helpers.h |   2 +
 tests/qtest/migration-test.c    | 252 +++++++++++++++++++++++++-------
 4 files changed, 226 insertions(+), 55 deletions(-)


base-commit: 134b443512825bed401b6e141447b8cdc22d2efe