mbox series

[0/4] Fix "virtio-gpu: fix scanout migration post-load"

Message ID 20240507111920.1594897-1-marcandre.lureau@redhat.com (mailing list archive)
Headers show
Series Fix "virtio-gpu: fix scanout migration post-load" | expand

Message

Marc-André Lureau May 7, 2024, 11:19 a.m. UTC
From: Marc-André Lureau <marcandre.lureau@redhat.com>

Hi,

The aforementioned patch breaks virtio-gpu device migrations for versions
pre-9.0/9.0, both forwards and backwards. Versioning of `VMS_STRUCT` is more
complex than it may initially appear, as evidenced in the problematic commit
dfcf74fa68c ("virtio-gpu: fix scanout migration post-load").

To resolve this, we need to propagate the `vmstate` `version_id` through the
nested structures. Additionally, we should tie specific machine version to a
corresponding `version_id` to maintain migration compatibility.

`VMS_VSTRUCT` allows specifying the appropriate version of the nested structure
to use.

Marc-André Lureau (4):
  migration: add "exists" info to load-state-field trace
  include/migration: add VMSTATE_VSTRUCT_TEST_VARRAY_UINT32
  virtio-gpu: use a VMState variant for the scanout field
  virtio-gpu: add x-vmstate-version

 include/hw/virtio/virtio-gpu.h |  1 +
 include/migration/vmstate.h    | 12 ++++++++++++
 hw/core/machine.c              |  1 +
 hw/display/virtio-gpu.c        | 28 +++++++++++++++++++++-------
 migration/vmstate.c            |  5 +++--
 migration/trace-events         |  2 +-
 6 files changed, 39 insertions(+), 10 deletions(-)

Comments

Fabiano Rosas May 7, 2024, 8:46 p.m. UTC | #1
marcandre.lureau@redhat.com writes:

> From: Marc-André Lureau <marcandre.lureau@redhat.com>
>
> Hi,
>
> The aforementioned patch breaks virtio-gpu device migrations for versions
> pre-9.0/9.0, both forwards and backwards. Versioning of `VMS_STRUCT` is more
> complex than it may initially appear, as evidenced in the problematic commit
> dfcf74fa68c ("virtio-gpu: fix scanout migration post-load").
>
> To resolve this, we need to propagate the `vmstate` `version_id` through the
> nested structures. Additionally, we should tie specific machine version to a
> corresponding `version_id` to maintain migration compatibility.
>
> `VMS_VSTRUCT` allows specifying the appropriate version of the nested structure
> to use.

This would have been caught by the migration-compat-x86_64 CI job had we
added the virtio-gpu device to it.

$ cd build-8.2
$ QTEST_TRACE='vmstate_*' QTEST_DEVICE_OPTS='-device virtio-gpu' \
QTEST_QEMU_BINARY=./qemu-system-x86_64 \
QTEST_QEMU_BINARY_DST=../build-9.0/qemu-system-x86_64 ./tests/qtest/migration-test
...
vmstate_n_elems fb.offset: 1
vmstate_subsection_load virtio-gpu-one-scanout
vmstate_subsection_load_good virtio-gpu-one-scanout
vmstate_load_state_end virtio-gpu-one-scanout end/0
vmstate_subsection_load virtio-gpu-scanouts
vmstate_subsection_load_good virtio-gpu-scanouts
vmstate_load_state_end virtio-gpu-scanouts end/0
vmstate_subsection_load virtio-gpu
vmstate_subsection_load_good virtio-gpu
vmstate_load_state_end virtio-gpu end/0
vmstate_downtime_load type=non-iterable idstr=0000:00:03.0/virtio-gpu instance_id=0 downtime=32118
qemu-system-x86_64: Missing section footer for 0000:00:03.0/virtio-gpu
vmstate_downtime_checkpoint dst-precopy-loadvm-completed
qemu-system-x86_64: load of migration failed: Invalid argument

Some considerations:

1) Here QTEST_DEVICE_OPTS is a hack I added on top, it doesn't currently
   exist.

2) This only uncovers relatively simple bugs where we don't need the
   guest to access the device, it just needs to be there.

We could take the steps to enable this kind of testing if we think it's
worthwhile. Some downsides are:

a) the item (2) above - situations that depend on guest behavior are out
   of the picture because migration-test runs only a custom program that
   dirties memory;

b) this test only works in CI or in a pre setup environment because it
   needs the previous QEMU version to be built beforehand;

c) the full set of migration tests already runs a few times in CI via
   make check, plus the compat job. We'll probably need to do some
   simplification to avoid taking too much additional time;

d) there's also the obvious maintenance burden of choosing devices and
   doing the eventual upkeep of the QEMU command line for the
   migration-test.
Peter Xu May 7, 2024, 9:24 p.m. UTC | #2
On Tue, May 07, 2024 at 05:46:36PM -0300, Fabiano Rosas wrote:
> marcandre.lureau@redhat.com writes:
> 
> > From: Marc-André Lureau <marcandre.lureau@redhat.com>
> >
> > Hi,
> >
> > The aforementioned patch breaks virtio-gpu device migrations for versions
> > pre-9.0/9.0, both forwards and backwards. Versioning of `VMS_STRUCT` is more
> > complex than it may initially appear, as evidenced in the problematic commit
> > dfcf74fa68c ("virtio-gpu: fix scanout migration post-load").
> >
> > To resolve this, we need to propagate the `vmstate` `version_id` through the
> > nested structures. Additionally, we should tie specific machine version to a
> > corresponding `version_id` to maintain migration compatibility.
> >
> > `VMS_VSTRUCT` allows specifying the appropriate version of the nested structure
> > to use.
> 
> This would have been caught by the migration-compat-x86_64 CI job had we
> added the virtio-gpu device to it.

I had exactly the same thoughts, and actually I added a todo after I
noticed this issue but I forgot to discuss with you today.. actually I have
one more on whether we can allow vmsd versioning to work for bi-direction
(then we can get rid of machine type dependencies), we may need the
handshake work as pre-requisite, so that two qemus need to talk first on
verify the vmsds.  But let's leave that for later.

> 
> $ cd build-8.2
> $ QTEST_TRACE='vmstate_*' QTEST_DEVICE_OPTS='-device virtio-gpu' \
> QTEST_QEMU_BINARY=./qemu-system-x86_64 \
> QTEST_QEMU_BINARY_DST=../build-9.0/qemu-system-x86_64 ./tests/qtest/migration-test
> ...
> vmstate_n_elems fb.offset: 1
> vmstate_subsection_load virtio-gpu-one-scanout
> vmstate_subsection_load_good virtio-gpu-one-scanout
> vmstate_load_state_end virtio-gpu-one-scanout end/0
> vmstate_subsection_load virtio-gpu-scanouts
> vmstate_subsection_load_good virtio-gpu-scanouts
> vmstate_load_state_end virtio-gpu-scanouts end/0
> vmstate_subsection_load virtio-gpu
> vmstate_subsection_load_good virtio-gpu
> vmstate_load_state_end virtio-gpu end/0
> vmstate_downtime_load type=non-iterable idstr=0000:00:03.0/virtio-gpu instance_id=0 downtime=32118
> qemu-system-x86_64: Missing section footer for 0000:00:03.0/virtio-gpu
> vmstate_downtime_checkpoint dst-precopy-loadvm-completed
> qemu-system-x86_64: load of migration failed: Invalid argument
> 
> Some considerations:
> 
> 1) Here QTEST_DEVICE_OPTS is a hack I added on top, it doesn't currently
>    exist.
> 
> 2) This only uncovers relatively simple bugs where we don't need the
>    guest to access the device, it just needs to be there.

Right, but having something to cover the basics would still be nice,
because the rarer the bug the less impact to users too (I hope!), so they
can be with lower priority too when tested.

> 
> We could take the steps to enable this kind of testing if we think it's
> worthwhile. Some downsides are:
> 
> a) the item (2) above - situations that depend on guest behavior are out
>    of the picture because migration-test runs only a custom program that
>    dirties memory;
> 
> b) this test only works in CI or in a pre setup environment because it
>    needs the previous QEMU version to be built beforehand;
> 
> c) the full set of migration tests already runs a few times in CI via
>    make check, plus the compat job. We'll probably need to do some
>    simplification to avoid taking too much additional time;
> 
> d) there's also the obvious maintenance burden of choosing devices and
>    doing the eventual upkeep of the QEMU command line for the
>    migration-test.

The last one should be mostly fine, iiuc - we shouldn't add any device that
can easily break ABI, and the list of devices can start with a minumum; an
extreme case is we add one device only if something broke first with a
stable ABI.

Then if the ABI is stable, we need to go through the deprecation procedure,
and that takes two releases.  We can drop the device in migration-test in
the release of when deprecation starts.

Thanks,
Fiona Ebner May 8, 2024, 9:51 a.m. UTC | #3
Am 07.05.24 um 13:19 schrieb marcandre.lureau@redhat.com:
> From: Marc-André Lureau <marcandre.lureau@redhat.com>
> 
> Hi,
> 
> The aforementioned patch breaks virtio-gpu device migrations for versions
> pre-9.0/9.0, both forwards and backwards. Versioning of `VMS_STRUCT` is more
> complex than it may initially appear, as evidenced in the problematic commit
> dfcf74fa68c ("virtio-gpu: fix scanout migration post-load").
> 
> To resolve this, we need to propagate the `vmstate` `version_id` through the
> nested structures. Additionally, we should tie specific machine version to a
> corresponding `version_id` to maintain migration compatibility.
> 
> `VMS_VSTRUCT` allows specifying the appropriate version of the nested structure
> to use.
> 
> Marc-André Lureau (4):
>   migration: add "exists" info to load-state-field trace
>   include/migration: add VMSTATE_VSTRUCT_TEST_VARRAY_UINT32
>   virtio-gpu: use a VMState variant for the scanout field
>   virtio-gpu: add x-vmstate-version
> 
>  include/hw/virtio/virtio-gpu.h |  1 +
>  include/migration/vmstate.h    | 12 ++++++++++++
>  hw/core/machine.c              |  1 +
>  hw/display/virtio-gpu.c        | 28 +++++++++++++++++++++-------
>  migration/vmstate.c            |  5 +++--
>  migration/trace-events         |  2 +-
>  6 files changed, 39 insertions(+), 10 deletions(-)
> 

Tested-by: Fiona Ebner <f.ebner@proxmox.com>

Thank you for the fix! Tested with an Ubuntu 23.10 VM:

Machine type pc-i440fx-8.2:
1. create snapshot with 8.2, load with patched 9.0
2. create snapshot with patched 9.0, load with patched 9.0 and with 8.2

Machine type pc-i440fx-9.0:
1. create snapshot with patched 9.0, load with patched 9.0

No crashes/failures and didn't notice any other issues :)

Best Regards,
Fiona