mbox

[PULL,00/17] Migration patches for 2024-12-17

Message ID 20241217174855.24971-1-farosas@suse.de (mailing list archive)
State New
Headers show

Pull-request

https://gitlab.com/farosas/qemu.git tags/migration-20241217-pull-request

Message

Fabiano Rosas Dec. 17, 2024, 5:48 p.m. UTC
The following changes since commit 8032c78e556cd0baec111740a6c636863f9bd7c8:

  Merge tag 'firmware-20241216-pull-request' of https://gitlab.com/kraxel/qemu into staging (2024-12-16 14:20:33 -0500)

are available in the Git repository at:

  https://gitlab.com/farosas/qemu.git tags/migration-20241217-pull-request

for you to fetch changes up to 1bed6df0c71d3a74286f53c01fafd21fde8777f4:

  tests/qtest/migration: Fix compile errors when CONFIG_UADK is set (2024-12-17 13:51:19 -0300)

----------------------------------------------------------------
Migration pull request

- Shameer's fixes for CONFIG_UADK code

- Peter's multifd sync cleanups, prereq. for VFIO and postcopy work

- Fabiano's fix for multifd regression in pre-9.0 -> post-9.1
  migrations (#2720)

- Fabiano's fix for s390x migration regression (#2704)

- Peter's fix for assertions during paused migrations; reworks
  late-block-activate logic (#2395, #686)

----------------------------------------------------------------

Fabiano Rosas (2):
  migration/multifd: Fix compat with QEMU < 9.0
  s390x: Fix CSS migration

Peter Xu (13):
  migration/multifd: Further remove the SYNC on complete
  migration/multifd: Allow to sync with sender threads only
  migration/ram: Move RAM_SAVE_FLAG* into ram.h
  migration/multifd: Unify RAM_SAVE_FLAG_MULTIFD_FLUSH messages
  migration/multifd: Remove sync processing on postcopy
  migration/multifd: Cleanup src flushes on condition check
  migration/multifd: Document the reason to sync for save_setup()
  migration: Add helper to get target runstate
  qmp/cont: Only activate disks if migration completed
  migration/block: Make late-block-active the default
  migration/block: Apply late-block-active behavior to postcopy
  migration/block: Fix possible race with block_inactive
  migration/block: Rewrite disk activation

Shameer Kolothum (2):
  migration/multifd: Fix compile error caused by page_size usage
  tests/qtest/migration: Fix compile errors when CONFIG_UADK is set

 hw/s390x/s390-virtio-ccw.c                |   2 +-
 include/migration/misc.h                  |   4 +
 migration/block-active.c                  |  94 +++++++++++++++
 migration/colo.c                          |   2 +-
 migration/meson.build                     |   1 +
 migration/migration.c                     | 136 +++++++++-------------
 migration/migration.h                     |   6 +-
 migration/multifd-nocomp.c                |  74 +++++++++++-
 migration/multifd-uadk.c                  |   2 +-
 migration/multifd.c                       |  32 +++--
 migration/multifd.h                       |  27 ++++-
 migration/ram.c                           |  89 +++++++-------
 migration/ram.h                           |  28 +++++
 migration/rdma.h                          |   7 --
 migration/savevm.c                        |  46 ++++----
 migration/trace-events                    |   3 +
 monitor/qmp-cmds.c                        |  22 ++--
 tests/qtest/migration/compression-tests.c |  54 ---------
 18 files changed, 372 insertions(+), 257 deletions(-)
 create mode 100644 migration/block-active.c

Comments

Stefan Hajnoczi Dec. 19, 2024, 12:32 p.m. UTC | #1
Hi Fabiano,
Please take a look at this CI failure:

>>> MALLOC_PERTURB_=61 QTEST_QEMU_BINARY=./qemu-system-s390x UBSAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1:print_stacktrace=1 QTEST_QEMU_IMG=./qemu-img MESON_TEST_ITERATION=1 MSAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1:print_stacktrace=1 ASAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1 PYTHON=/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/pyvenv/bin/python3 QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon G_TEST_DBUS_DAEMON=/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/tests/dbus-vmstate-daemon.sh /home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/tests/qtest/migration-test --tap -k
――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
stderr:
Traceback (most recent call last):
  File "/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/scripts/analyze-migration.py", line 688, in <module>
    dump.read(dump_memory = args.memory)
  File "/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/scripts/analyze-migration.py", line 625, in read
    section.read()
  File "/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/scripts/analyze-migration.py", line 461, in read
    field['data'] = reader(field, self.file)
  File "/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/scripts/analyze-migration.py", line 434, in __init__
    for field in self.desc['struct']['fields']:
KeyError: 'fields'
warning: fd: migration to a file is deprecated. Use file: instead.
warning: fd: migration to a file is deprecated. Use file: instead.
**
ERROR:../tests/qtest/migration-test.c:36:main: assertion failed (ret == 0): (1 == 0)
(test program exited with status code -6)

https://gitlab.com/qemu-project/qemu/-/jobs/8681858344#L8190

If you find this pull request caused the failure, please send a new
revision. Otherwise please let me know so we can continue to
investigate.

Thanks,
Stefan
Fabiano Rosas Dec. 19, 2024, 6:53 p.m. UTC | #2
Stefan Hajnoczi <stefanha@redhat.com> writes:

> Hi Fabiano,
> Please take a look at this CI failure:
>
>>>> MALLOC_PERTURB_=61 QTEST_QEMU_BINARY=./qemu-system-s390x UBSAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1:print_stacktrace=1 QTEST_QEMU_IMG=./qemu-img MESON_TEST_ITERATION=1 MSAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1:print_stacktrace=1 ASAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1 PYTHON=/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/pyvenv/bin/python3 QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon G_TEST_DBUS_DAEMON=/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/tests/dbus-vmstate-daemon.sh /home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/tests/qtest/migration-test --tap -k
> ――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
> stderr:
> Traceback (most recent call last):
>   File "/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/scripts/analyze-migration.py", line 688, in <module>
>     dump.read(dump_memory = args.memory)
>   File "/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/scripts/analyze-migration.py", line 625, in read
>     section.read()
>   File "/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/scripts/analyze-migration.py", line 461, in read
>     field['data'] = reader(field, self.file)
>   File "/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/scripts/analyze-migration.py", line 434, in __init__
>     for field in self.desc['struct']['fields']:
> KeyError: 'fields'

This is the command line that runs only this specific test:

PYTHON=/usr/bin/python3.11 QTEST_QEMU_BINARY=./qemu-system-s390x
./tests/qtest/migration-test -p /s390x/migration/analyze-script

I cannot reproduce in migration-next nor in the detached HEAD that the
pipeline ran in (had to download the tarball from gitlab).

The only s390 patch in this PR is one that I can test just fine with
TCG, so there shouldn't be any difference from KVM (i.e. there should be
no state being migrated with KVM that is not already migrated with TCG).

> warning: fd: migration to a file is deprecated. Use file: instead.
> warning: fd: migration to a file is deprecated. Use file: instead.

This is harmless.

> **
> ERROR:../tests/qtest/migration-test.c:36:main: assertion failed (ret == 0): (1 == 0)
> (test program exited with status code -6)

This is the assert at the end of the tests, irrelevant.

>
> https://gitlab.com/qemu-project/qemu/-/jobs/8681858344#L8190
>
> If you find this pull request caused the failure, please send a new
> revision. Otherwise please let me know so we can continue to
> investigate.

I don't have an s390x host at hand so the only thing I can to is to drop
that patch and hope that resolves the problem. @Peter, @Thomas, any
other ideas? Can you verify this on your end?
Peter Xu Dec. 20, 2024, 4:28 p.m. UTC | #3
On Thu, Dec 19, 2024 at 03:53:22PM -0300, Fabiano Rosas wrote:
> Stefan Hajnoczi <stefanha@redhat.com> writes:
> 
> > Hi Fabiano,
> > Please take a look at this CI failure:
> >
> >>>> MALLOC_PERTURB_=61 QTEST_QEMU_BINARY=./qemu-system-s390x UBSAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1:print_stacktrace=1 QTEST_QEMU_IMG=./qemu-img MESON_TEST_ITERATION=1 MSAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1:print_stacktrace=1 ASAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1 PYTHON=/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/pyvenv/bin/python3 QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon G_TEST_DBUS_DAEMON=/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/tests/dbus-vmstate-daemon.sh /home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/tests/qtest/migration-test --tap -k
> > ――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
> > stderr:
> > Traceback (most recent call last):
> >   File "/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/scripts/analyze-migration.py", line 688, in <module>
> >     dump.read(dump_memory = args.memory)
> >   File "/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/scripts/analyze-migration.py", line 625, in read
> >     section.read()
> >   File "/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/scripts/analyze-migration.py", line 461, in read
> >     field['data'] = reader(field, self.file)
> >   File "/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/scripts/analyze-migration.py", line 434, in __init__
> >     for field in self.desc['struct']['fields']:
> > KeyError: 'fields'
> 
> This is the command line that runs only this specific test:
> 
> PYTHON=/usr/bin/python3.11 QTEST_QEMU_BINARY=./qemu-system-s390x
> ./tests/qtest/migration-test -p /s390x/migration/analyze-script
> 
> I cannot reproduce in migration-next nor in the detached HEAD that the
> pipeline ran in (had to download the tarball from gitlab).
> 
> The only s390 patch in this PR is one that I can test just fine with
> TCG, so there shouldn't be any difference from KVM (i.e. there should be
> no state being migrated with KVM that is not already migrated with TCG).
> 
> > warning: fd: migration to a file is deprecated. Use file: instead.
> > warning: fd: migration to a file is deprecated. Use file: instead.
> 
> This is harmless.
> 
> > **
> > ERROR:../tests/qtest/migration-test.c:36:main: assertion failed (ret == 0): (1 == 0)
> > (test program exited with status code -6)
> 
> This is the assert at the end of the tests, irrelevant.
> 
> >
> > https://gitlab.com/qemu-project/qemu/-/jobs/8681858344#L8190
> >
> > If you find this pull request caused the failure, please send a new
> > revision. Otherwise please let me know so we can continue to
> > investigate.
> 
> I don't have an s390x host at hand so the only thing I can to is to drop
> that patch and hope that resolves the problem. @Peter, @Thomas, any
> other ideas? Can you verify this on your end?

Cannot reproduce either here, x86_64 host only.  The report was from s390
host, though.  I'm not familiar with the s390 patch, I wonder if any of you
could use plain brain power to figure more things out.

We could wait for 1-2 more days to see whether Thomas can figure it out,
hopefully easily reproduceable on s390.. or we can also leave that for
later.  And if the current issue on such fix is s390-host-only, might be
easier to be picked up by s390 tree, perhaps?

In all cases, the regression is since 9.1, so maybe we don't need to rush
anything in.
Thomas Huth Jan. 2, 2025, 9:32 a.m. UTC | #4
On 20/12/2024 17.28, Peter Xu wrote:
> On Thu, Dec 19, 2024 at 03:53:22PM -0300, Fabiano Rosas wrote:
>> Stefan Hajnoczi <stefanha@redhat.com> writes:
>>
>>> Hi Fabiano,
>>> Please take a look at this CI failure:
>>>
>>>>>> MALLOC_PERTURB_=61 QTEST_QEMU_BINARY=./qemu-system-s390x UBSAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1:print_stacktrace=1 QTEST_QEMU_IMG=./qemu-img MESON_TEST_ITERATION=1 MSAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1:print_stacktrace=1 ASAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1 PYTHON=/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/pyvenv/bin/python3 QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon G_TEST_DBUS_DAEMON=/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/tests/dbus-vmstate-daemon.sh /home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/tests/qtest/migration-test --tap -k
>>> ――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
>>> stderr:
>>> Traceback (most recent call last):
>>>    File "/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/scripts/analyze-migration.py", line 688, in <module>
>>>      dump.read(dump_memory = args.memory)
>>>    File "/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/scripts/analyze-migration.py", line 625, in read
>>>      section.read()
>>>    File "/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/scripts/analyze-migration.py", line 461, in read
>>>      field['data'] = reader(field, self.file)
>>>    File "/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/scripts/analyze-migration.py", line 434, in __init__
>>>      for field in self.desc['struct']['fields']:
>>> KeyError: 'fields'
>>
>> This is the command line that runs only this specific test:
>>
>> PYTHON=/usr/bin/python3.11 QTEST_QEMU_BINARY=./qemu-system-s390x
>> ./tests/qtest/migration-test -p /s390x/migration/analyze-script
>>
>> I cannot reproduce in migration-next nor in the detached HEAD that the
>> pipeline ran in (had to download the tarball from gitlab).
>>
>> The only s390 patch in this PR is one that I can test just fine with
>> TCG, so there shouldn't be any difference from KVM (i.e. there should be
>> no state being migrated with KVM that is not already migrated with TCG).
>>
>>> warning: fd: migration to a file is deprecated. Use file: instead.
>>> warning: fd: migration to a file is deprecated. Use file: instead.
>>
>> This is harmless.
>>
>>> **
>>> ERROR:../tests/qtest/migration-test.c:36:main: assertion failed (ret == 0): (1 == 0)
>>> (test program exited with status code -6)
>>
>> This is the assert at the end of the tests, irrelevant.
>>
>>>
>>> https://gitlab.com/qemu-project/qemu/-/jobs/8681858344#L8190
>>>
>>> If you find this pull request caused the failure, please send a new
>>> revision. Otherwise please let me know so we can continue to
>>> investigate.
>>
>> I don't have an s390x host at hand so the only thing I can to is to drop
>> that patch and hope that resolves the problem. @Peter, @Thomas, any
>> other ideas? Can you verify this on your end?
> 
> Cannot reproduce either here, x86_64 host only.  The report was from s390
> host, though.  I'm not familiar with the s390 patch, I wonder if any of you
> could use plain brain power to figure more things out.
> 
> We could wait for 1-2 more days to see whether Thomas can figure it out,
> hopefully easily reproduceable on s390.. or we can also leave that for
> later.  And if the current issue on such fix is s390-host-only, might be
> easier to be picked up by s390 tree, perhaps?

I tested migration-20241217-pull-request on a s390x (RHEL) host, but I 
cannot reproduce the issue there - make check-qtest works without any 
problems. Is it maybe related to that specific Ubuntu installation?

  Thomas
Fabiano Rosas Jan. 3, 2025, 6:30 p.m. UTC | #5
Thomas Huth <thuth@redhat.com> writes:

> On 20/12/2024 17.28, Peter Xu wrote:
>> On Thu, Dec 19, 2024 at 03:53:22PM -0300, Fabiano Rosas wrote:
>>> Stefan Hajnoczi <stefanha@redhat.com> writes:
>>>
>>>> Hi Fabiano,
>>>> Please take a look at this CI failure:
>>>>
>>>>>>> MALLOC_PERTURB_=61 QTEST_QEMU_BINARY=./qemu-system-s390x UBSAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1:print_stacktrace=1 QTEST_QEMU_IMG=./qemu-img MESON_TEST_ITERATION=1 MSAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1:print_stacktrace=1 ASAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1 PYTHON=/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/pyvenv/bin/python3 QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon G_TEST_DBUS_DAEMON=/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/tests/dbus-vmstate-daemon.sh /home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/tests/qtest/migration-test --tap -k
>>>> ――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
>>>> stderr:
>>>> Traceback (most recent call last):
>>>>    File "/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/scripts/analyze-migration.py", line 688, in <module>
>>>>      dump.read(dump_memory = args.memory)
>>>>    File "/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/scripts/analyze-migration.py", line 625, in read
>>>>      section.read()
>>>>    File "/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/scripts/analyze-migration.py", line 461, in read
>>>>      field['data'] = reader(field, self.file)
>>>>    File "/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/scripts/analyze-migration.py", line 434, in __init__
>>>>      for field in self.desc['struct']['fields']:
>>>> KeyError: 'fields'
>>>
>>> This is the command line that runs only this specific test:
>>>
>>> PYTHON=/usr/bin/python3.11 QTEST_QEMU_BINARY=./qemu-system-s390x
>>> ./tests/qtest/migration-test -p /s390x/migration/analyze-script
>>>
>>> I cannot reproduce in migration-next nor in the detached HEAD that the
>>> pipeline ran in (had to download the tarball from gitlab).
>>>
>>> The only s390 patch in this PR is one that I can test just fine with
>>> TCG, so there shouldn't be any difference from KVM (i.e. there should be
>>> no state being migrated with KVM that is not already migrated with TCG).
>>>
>>>> warning: fd: migration to a file is deprecated. Use file: instead.
>>>> warning: fd: migration to a file is deprecated. Use file: instead.
>>>
>>> This is harmless.
>>>
>>>> **
>>>> ERROR:../tests/qtest/migration-test.c:36:main: assertion failed (ret == 0): (1 == 0)
>>>> (test program exited with status code -6)
>>>
>>> This is the assert at the end of the tests, irrelevant.
>>>
>>>>
>>>> https://gitlab.com/qemu-project/qemu/-/jobs/8681858344#L8190
>>>>
>>>> If you find this pull request caused the failure, please send a new
>>>> revision. Otherwise please let me know so we can continue to
>>>> investigate.
>>>
>>> I don't have an s390x host at hand so the only thing I can to is to drop
>>> that patch and hope that resolves the problem. @Peter, @Thomas, any
>>> other ideas? Can you verify this on your end?
>> 
>> Cannot reproduce either here, x86_64 host only.  The report was from s390
>> host, though.  I'm not familiar with the s390 patch, I wonder if any of you
>> could use plain brain power to figure more things out.
>> 
>> We could wait for 1-2 more days to see whether Thomas can figure it out,
>> hopefully easily reproduceable on s390.. or we can also leave that for
>> later.  And if the current issue on such fix is s390-host-only, might be
>> easier to be picked up by s390 tree, perhaps?
>
> I tested migration-20241217-pull-request on a s390x (RHEL) host, but I 
> cannot reproduce the issue there - make check-qtest works without any 
> problems. Is it maybe related to that specific Ubuntu installation?
>

Since we cannot reproduce outside of the staging CI, could we run that
job again with a diagnostic patch? Here's the rebased PR with the patch:

https://gitlab.com/farosas/qemu/-/commits/migration-next

(fork CI run: https://gitlab.com/farosas/qemu/-/pipelines/1610691202)

Or should I just send a v2 of this PR with the debug patch?
Stefan Hajnoczi Jan. 3, 2025, 8:31 p.m. UTC | #6
On Fri, 3 Jan 2025 at 13:32, Fabiano Rosas <farosas@suse.de> wrote:
>
> Thomas Huth <thuth@redhat.com> writes:
>
> > On 20/12/2024 17.28, Peter Xu wrote:
> >> On Thu, Dec 19, 2024 at 03:53:22PM -0300, Fabiano Rosas wrote:
> >>> Stefan Hajnoczi <stefanha@redhat.com> writes:
> >>>
> >>>> Hi Fabiano,
> >>>> Please take a look at this CI failure:
> >>>>
> >>>>>>> MALLOC_PERTURB_=61 QTEST_QEMU_BINARY=./qemu-system-s390x UBSAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1:print_stacktrace=1 QTEST_QEMU_IMG=./qemu-img MESON_TEST_ITERATION=1 MSAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1:print_stacktrace=1 ASAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1 PYTHON=/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/pyvenv/bin/python3 QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon G_TEST_DBUS_DAEMON=/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/tests/dbus-vmstate-daemon.sh /home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/tests/qtest/migration-test --tap -k
> >>>> ――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
> >>>> stderr:
> >>>> Traceback (most recent call last):
> >>>>    File "/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/scripts/analyze-migration.py", line 688, in <module>
> >>>>      dump.read(dump_memory = args.memory)
> >>>>    File "/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/scripts/analyze-migration.py", line 625, in read
> >>>>      section.read()
> >>>>    File "/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/scripts/analyze-migration.py", line 461, in read
> >>>>      field['data'] = reader(field, self.file)
> >>>>    File "/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/scripts/analyze-migration.py", line 434, in __init__
> >>>>      for field in self.desc['struct']['fields']:
> >>>> KeyError: 'fields'
> >>>
> >>> This is the command line that runs only this specific test:
> >>>
> >>> PYTHON=/usr/bin/python3.11 QTEST_QEMU_BINARY=./qemu-system-s390x
> >>> ./tests/qtest/migration-test -p /s390x/migration/analyze-script
> >>>
> >>> I cannot reproduce in migration-next nor in the detached HEAD that the
> >>> pipeline ran in (had to download the tarball from gitlab).
> >>>
> >>> The only s390 patch in this PR is one that I can test just fine with
> >>> TCG, so there shouldn't be any difference from KVM (i.e. there should be
> >>> no state being migrated with KVM that is not already migrated with TCG).
> >>>
> >>>> warning: fd: migration to a file is deprecated. Use file: instead.
> >>>> warning: fd: migration to a file is deprecated. Use file: instead.
> >>>
> >>> This is harmless.
> >>>
> >>>> **
> >>>> ERROR:../tests/qtest/migration-test.c:36:main: assertion failed (ret == 0): (1 == 0)
> >>>> (test program exited with status code -6)
> >>>
> >>> This is the assert at the end of the tests, irrelevant.
> >>>
> >>>>
> >>>> https://gitlab.com/qemu-project/qemu/-/jobs/8681858344#L8190
> >>>>
> >>>> If you find this pull request caused the failure, please send a new
> >>>> revision. Otherwise please let me know so we can continue to
> >>>> investigate.
> >>>
> >>> I don't have an s390x host at hand so the only thing I can to is to drop
> >>> that patch and hope that resolves the problem. @Peter, @Thomas, any
> >>> other ideas? Can you verify this on your end?
> >>
> >> Cannot reproduce either here, x86_64 host only.  The report was from s390
> >> host, though.  I'm not familiar with the s390 patch, I wonder if any of you
> >> could use plain brain power to figure more things out.
> >>
> >> We could wait for 1-2 more days to see whether Thomas can figure it out,
> >> hopefully easily reproduceable on s390.. or we can also leave that for
> >> later.  And if the current issue on such fix is s390-host-only, might be
> >> easier to be picked up by s390 tree, perhaps?
> >
> > I tested migration-20241217-pull-request on a s390x (RHEL) host, but I
> > cannot reproduce the issue there - make check-qtest works without any
> > problems. Is it maybe related to that specific Ubuntu installation?
> >
>
> Since we cannot reproduce outside of the staging CI, could we run that
> job again with a diagnostic patch? Here's the rebased PR with the patch:
>
> https://gitlab.com/farosas/qemu/-/commits/migration-next
>
> (fork CI run: https://gitlab.com/farosas/qemu/-/pipelines/1610691202)
>
> Or should I just send a v2 of this PR with the debug patch?

Here is the staging CI pipeline for your migration-next tree:
https://gitlab.com/qemu-project/qemu/-/pipelines/1610836485

Stefan
Fabiano Rosas Jan. 3, 2025, 9 p.m. UTC | #7
Stefan Hajnoczi <stefanha@gmail.com> writes:

> On Fri, 3 Jan 2025 at 13:32, Fabiano Rosas <farosas@suse.de> wrote:
>>
>> Thomas Huth <thuth@redhat.com> writes:
>>
>> > On 20/12/2024 17.28, Peter Xu wrote:
>> >> On Thu, Dec 19, 2024 at 03:53:22PM -0300, Fabiano Rosas wrote:
>> >>> Stefan Hajnoczi <stefanha@redhat.com> writes:
>> >>>
>> >>>> Hi Fabiano,
>> >>>> Please take a look at this CI failure:
>> >>>>
>> >>>>>>> MALLOC_PERTURB_=61 QTEST_QEMU_BINARY=./qemu-system-s390x UBSAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1:print_stacktrace=1 QTEST_QEMU_IMG=./qemu-img MESON_TEST_ITERATION=1 MSAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1:print_stacktrace=1 ASAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1 PYTHON=/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/pyvenv/bin/python3 QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon G_TEST_DBUS_DAEMON=/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/tests/dbus-vmstate-daemon.sh /home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/tests/qtest/migration-test --tap -k
>> >>>> ――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
>> >>>> stderr:
>> >>>> Traceback (most recent call last):
>> >>>>    File "/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/scripts/analyze-migration.py", line 688, in <module>
>> >>>>      dump.read(dump_memory = args.memory)
>> >>>>    File "/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/scripts/analyze-migration.py", line 625, in read
>> >>>>      section.read()
>> >>>>    File "/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/scripts/analyze-migration.py", line 461, in read
>> >>>>      field['data'] = reader(field, self.file)
>> >>>>    File "/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/scripts/analyze-migration.py", line 434, in __init__
>> >>>>      for field in self.desc['struct']['fields']:
>> >>>> KeyError: 'fields'
>> >>>
>> >>> This is the command line that runs only this specific test:
>> >>>
>> >>> PYTHON=/usr/bin/python3.11 QTEST_QEMU_BINARY=./qemu-system-s390x
>> >>> ./tests/qtest/migration-test -p /s390x/migration/analyze-script
>> >>>
>> >>> I cannot reproduce in migration-next nor in the detached HEAD that the
>> >>> pipeline ran in (had to download the tarball from gitlab).
>> >>>
>> >>> The only s390 patch in this PR is one that I can test just fine with
>> >>> TCG, so there shouldn't be any difference from KVM (i.e. there should be
>> >>> no state being migrated with KVM that is not already migrated with TCG).
>> >>>
>> >>>> warning: fd: migration to a file is deprecated. Use file: instead.
>> >>>> warning: fd: migration to a file is deprecated. Use file: instead.
>> >>>
>> >>> This is harmless.
>> >>>
>> >>>> **
>> >>>> ERROR:../tests/qtest/migration-test.c:36:main: assertion failed (ret == 0): (1 == 0)
>> >>>> (test program exited with status code -6)
>> >>>
>> >>> This is the assert at the end of the tests, irrelevant.
>> >>>
>> >>>>
>> >>>> https://gitlab.com/qemu-project/qemu/-/jobs/8681858344#L8190
>> >>>>
>> >>>> If you find this pull request caused the failure, please send a new
>> >>>> revision. Otherwise please let me know so we can continue to
>> >>>> investigate.
>> >>>
>> >>> I don't have an s390x host at hand so the only thing I can to is to drop
>> >>> that patch and hope that resolves the problem. @Peter, @Thomas, any
>> >>> other ideas? Can you verify this on your end?
>> >>
>> >> Cannot reproduce either here, x86_64 host only.  The report was from s390
>> >> host, though.  I'm not familiar with the s390 patch, I wonder if any of you
>> >> could use plain brain power to figure more things out.
>> >>
>> >> We could wait for 1-2 more days to see whether Thomas can figure it out,
>> >> hopefully easily reproduceable on s390.. or we can also leave that for
>> >> later.  And if the current issue on such fix is s390-host-only, might be
>> >> easier to be picked up by s390 tree, perhaps?
>> >
>> > I tested migration-20241217-pull-request on a s390x (RHEL) host, but I
>> > cannot reproduce the issue there - make check-qtest works without any
>> > problems. Is it maybe related to that specific Ubuntu installation?
>> >
>>
>> Since we cannot reproduce outside of the staging CI, could we run that
>> job again with a diagnostic patch? Here's the rebased PR with the patch:
>>
>> https://gitlab.com/farosas/qemu/-/commits/migration-next
>>
>> (fork CI run: https://gitlab.com/farosas/qemu/-/pipelines/1610691202)
>>
>> Or should I just send a v2 of this PR with the debug patch?
>
> Here is the staging CI pipeline for your migration-next tree:
> https://gitlab.com/qemu-project/qemu/-/pipelines/1610836485

Great, thanks! Let's find out what is going on...

>
> Stefan
Fabiano Rosas Jan. 3, 2025, 10:34 p.m. UTC | #8
Fabiano Rosas <farosas@suse.de> writes:

> Stefan Hajnoczi <stefanha@gmail.com> writes:
>
>> On Fri, 3 Jan 2025 at 13:32, Fabiano Rosas <farosas@suse.de> wrote:
>>>
>>> Thomas Huth <thuth@redhat.com> writes:
>>>
>>> > On 20/12/2024 17.28, Peter Xu wrote:
>>> >> On Thu, Dec 19, 2024 at 03:53:22PM -0300, Fabiano Rosas wrote:
>>> >>> Stefan Hajnoczi <stefanha@redhat.com> writes:
>>> >>>
>>> >>>> Hi Fabiano,
>>> >>>> Please take a look at this CI failure:
>>> >>>>
>>> >>>>>>> MALLOC_PERTURB_=61 QTEST_QEMU_BINARY=./qemu-system-s390x UBSAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1:print_stacktrace=1 QTEST_QEMU_IMG=./qemu-img MESON_TEST_ITERATION=1 MSAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1:print_stacktrace=1 ASAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1 PYTHON=/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/pyvenv/bin/python3 QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon G_TEST_DBUS_DAEMON=/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/tests/dbus-vmstate-daemon.sh /home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/tests/qtest/migration-test --tap -k
>>> >>>> ――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
>>> >>>> stderr:
>>> >>>> Traceback (most recent call last):
>>> >>>>    File "/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/scripts/analyze-migration.py", line 688, in <module>
>>> >>>>      dump.read(dump_memory = args.memory)
>>> >>>>    File "/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/scripts/analyze-migration.py", line 625, in read
>>> >>>>      section.read()
>>> >>>>    File "/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/scripts/analyze-migration.py", line 461, in read
>>> >>>>      field['data'] = reader(field, self.file)
>>> >>>>    File "/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/scripts/analyze-migration.py", line 434, in __init__
>>> >>>>      for field in self.desc['struct']['fields']:
>>> >>>> KeyError: 'fields'
>>> >>>
>>> >>> This is the command line that runs only this specific test:
>>> >>>
>>> >>> PYTHON=/usr/bin/python3.11 QTEST_QEMU_BINARY=./qemu-system-s390x
>>> >>> ./tests/qtest/migration-test -p /s390x/migration/analyze-script
>>> >>>
>>> >>> I cannot reproduce in migration-next nor in the detached HEAD that the
>>> >>> pipeline ran in (had to download the tarball from gitlab).
>>> >>>
>>> >>> The only s390 patch in this PR is one that I can test just fine with
>>> >>> TCG, so there shouldn't be any difference from KVM (i.e. there should be
>>> >>> no state being migrated with KVM that is not already migrated with TCG).
>>> >>>
>>> >>>> warning: fd: migration to a file is deprecated. Use file: instead.
>>> >>>> warning: fd: migration to a file is deprecated. Use file: instead.
>>> >>>
>>> >>> This is harmless.
>>> >>>
>>> >>>> **
>>> >>>> ERROR:../tests/qtest/migration-test.c:36:main: assertion failed (ret == 0): (1 == 0)
>>> >>>> (test program exited with status code -6)
>>> >>>
>>> >>> This is the assert at the end of the tests, irrelevant.
>>> >>>
>>> >>>>
>>> >>>> https://gitlab.com/qemu-project/qemu/-/jobs/8681858344#L8190
>>> >>>>
>>> >>>> If you find this pull request caused the failure, please send a new
>>> >>>> revision. Otherwise please let me know so we can continue to
>>> >>>> investigate.
>>> >>>
>>> >>> I don't have an s390x host at hand so the only thing I can to is to drop
>>> >>> that patch and hope that resolves the problem. @Peter, @Thomas, any
>>> >>> other ideas? Can you verify this on your end?
>>> >>
>>> >> Cannot reproduce either here, x86_64 host only.  The report was from s390
>>> >> host, though.  I'm not familiar with the s390 patch, I wonder if any of you
>>> >> could use plain brain power to figure more things out.
>>> >>
>>> >> We could wait for 1-2 more days to see whether Thomas can figure it out,
>>> >> hopefully easily reproduceable on s390.. or we can also leave that for
>>> >> later.  And if the current issue on such fix is s390-host-only, might be
>>> >> easier to be picked up by s390 tree, perhaps?
>>> >
>>> > I tested migration-20241217-pull-request on a s390x (RHEL) host, but I
>>> > cannot reproduce the issue there - make check-qtest works without any
>>> > problems. Is it maybe related to that specific Ubuntu installation?
>>> >
>>>
>>> Since we cannot reproduce outside of the staging CI, could we run that
>>> job again with a diagnostic patch? Here's the rebased PR with the patch:
>>>
>>> https://gitlab.com/farosas/qemu/-/commits/migration-next
>>>
>>> (fork CI run: https://gitlab.com/farosas/qemu/-/pipelines/1610691202)
>>>
>>> Or should I just send a v2 of this PR with the debug patch?
>>
>> Here is the staging CI pipeline for your migration-next tree:
>> https://gitlab.com/qemu-project/qemu/-/pipelines/1610836485
>
> Great, thanks! Let's find out what is going on...
>

It seems the issue is here:

{"name": "css", "array_len": 256, "type": "struct", "struct": {}, "size": 1}
                                                              ^
And in QEMU:

static const VMStateDescription vmstate_css = {
    .name = "s390_css",
    ...
->      VMSTATE_ARRAY_OF_POINTER_TO_STRUCT(css, ChannelSubSys, MAX_CSSID + 1,
                0, vmstate_css_img, CssImage),

Is it legal to have an empty array? I would assume so. Are we maybe
missing a .needed?

Comparing with another similar vmstate spapr_llan/rx_pools in ppc
(-device spapr-vlan), what I see is:

{"name": "rx_pool", "array_len": 5, "type": "struct", "struct":
{"vmsd_name": "spapr_llan/rx_buffer_pool", ... }, "size": 32776}

So for CSS I'd expect:

-{"name": "css", "array_len": 256, "type": "struct", "struct": {}, "size": 1}
+{"name": "css", "array_len": 256, "type": "struct", "struct": {"vmsd_name": "s390_css_img", ...}, "size": 1}

What is weird is that in my TCG run it also shows the empty struct and
the script doesn't seem to care. For some reason, in the CI job it
parses further into the JSON.

If anyone spots something, let me know. I'll get back to this on Monday
with a fresh mind.
Peter Xu Jan. 6, 2025, 6:45 p.m. UTC | #9
On Fri, Jan 03, 2025 at 07:34:08PM -0300, Fabiano Rosas wrote:
> Fabiano Rosas <farosas@suse.de> writes:
> 
> > Stefan Hajnoczi <stefanha@gmail.com> writes:
> >
> >> On Fri, 3 Jan 2025 at 13:32, Fabiano Rosas <farosas@suse.de> wrote:
> >>>
> >>> Thomas Huth <thuth@redhat.com> writes:
> >>>
> >>> > On 20/12/2024 17.28, Peter Xu wrote:
> >>> >> On Thu, Dec 19, 2024 at 03:53:22PM -0300, Fabiano Rosas wrote:
> >>> >>> Stefan Hajnoczi <stefanha@redhat.com> writes:
> >>> >>>
> >>> >>>> Hi Fabiano,
> >>> >>>> Please take a look at this CI failure:
> >>> >>>>
> >>> >>>>>>> MALLOC_PERTURB_=61 QTEST_QEMU_BINARY=./qemu-system-s390x UBSAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1:print_stacktrace=1 QTEST_QEMU_IMG=./qemu-img MESON_TEST_ITERATION=1 MSAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1:print_stacktrace=1 ASAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1 PYTHON=/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/pyvenv/bin/python3 QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon G_TEST_DBUS_DAEMON=/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/tests/dbus-vmstate-daemon.sh /home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/tests/qtest/migration-test --tap -k
> >>> >>>> ――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
> >>> >>>> stderr:
> >>> >>>> Traceback (most recent call last):
> >>> >>>>    File "/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/scripts/analyze-migration.py", line 688, in <module>
> >>> >>>>      dump.read(dump_memory = args.memory)
> >>> >>>>    File "/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/scripts/analyze-migration.py", line 625, in read
> >>> >>>>      section.read()
> >>> >>>>    File "/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/scripts/analyze-migration.py", line 461, in read
> >>> >>>>      field['data'] = reader(field, self.file)
> >>> >>>>    File "/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/scripts/analyze-migration.py", line 434, in __init__
> >>> >>>>      for field in self.desc['struct']['fields']:
> >>> >>>> KeyError: 'fields'
> >>> >>>
> >>> >>> This is the command line that runs only this specific test:
> >>> >>>
> >>> >>> PYTHON=/usr/bin/python3.11 QTEST_QEMU_BINARY=./qemu-system-s390x
> >>> >>> ./tests/qtest/migration-test -p /s390x/migration/analyze-script
> >>> >>>
> >>> >>> I cannot reproduce in migration-next nor in the detached HEAD that the
> >>> >>> pipeline ran in (had to download the tarball from gitlab).
> >>> >>>
> >>> >>> The only s390 patch in this PR is one that I can test just fine with
> >>> >>> TCG, so there shouldn't be any difference from KVM (i.e. there should be
> >>> >>> no state being migrated with KVM that is not already migrated with TCG).
> >>> >>>
> >>> >>>> warning: fd: migration to a file is deprecated. Use file: instead.
> >>> >>>> warning: fd: migration to a file is deprecated. Use file: instead.
> >>> >>>
> >>> >>> This is harmless.
> >>> >>>
> >>> >>>> **
> >>> >>>> ERROR:../tests/qtest/migration-test.c:36:main: assertion failed (ret == 0): (1 == 0)
> >>> >>>> (test program exited with status code -6)
> >>> >>>
> >>> >>> This is the assert at the end of the tests, irrelevant.
> >>> >>>
> >>> >>>>
> >>> >>>> https://gitlab.com/qemu-project/qemu/-/jobs/8681858344#L8190
> >>> >>>>
> >>> >>>> If you find this pull request caused the failure, please send a new
> >>> >>>> revision. Otherwise please let me know so we can continue to
> >>> >>>> investigate.
> >>> >>>
> >>> >>> I don't have an s390x host at hand so the only thing I can to is to drop
> >>> >>> that patch and hope that resolves the problem. @Peter, @Thomas, any
> >>> >>> other ideas? Can you verify this on your end?
> >>> >>
> >>> >> Cannot reproduce either here, x86_64 host only.  The report was from s390
> >>> >> host, though.  I'm not familiar with the s390 patch, I wonder if any of you
> >>> >> could use plain brain power to figure more things out.
> >>> >>
> >>> >> We could wait for 1-2 more days to see whether Thomas can figure it out,
> >>> >> hopefully easily reproduceable on s390.. or we can also leave that for
> >>> >> later.  And if the current issue on such fix is s390-host-only, might be
> >>> >> easier to be picked up by s390 tree, perhaps?
> >>> >
> >>> > I tested migration-20241217-pull-request on a s390x (RHEL) host, but I
> >>> > cannot reproduce the issue there - make check-qtest works without any
> >>> > problems. Is it maybe related to that specific Ubuntu installation?
> >>> >
> >>>
> >>> Since we cannot reproduce outside of the staging CI, could we run that
> >>> job again with a diagnostic patch? Here's the rebased PR with the patch:
> >>>
> >>> https://gitlab.com/farosas/qemu/-/commits/migration-next
> >>>
> >>> (fork CI run: https://gitlab.com/farosas/qemu/-/pipelines/1610691202)
> >>>
> >>> Or should I just send a v2 of this PR with the debug patch?
> >>
> >> Here is the staging CI pipeline for your migration-next tree:
> >> https://gitlab.com/qemu-project/qemu/-/pipelines/1610836485
> >
> > Great, thanks! Let's find out what is going on...
> >
> 
> It seems the issue is here:
> 
> {"name": "css", "array_len": 256, "type": "struct", "struct": {}, "size": 1}
>                                                               ^
> And in QEMU:
> 
> static const VMStateDescription vmstate_css = {
>     .name = "s390_css",
>     ...
> ->      VMSTATE_ARRAY_OF_POINTER_TO_STRUCT(css, ChannelSubSys, MAX_CSSID + 1,
>                 0, vmstate_css_img, CssImage),
> 
> Is it legal to have an empty array? I would assume so. Are we maybe
> missing a .needed?

I guess we can always decide to dump things even if it's empty.

When I was looking at this, I saw a trick we played in vmstate dump, see
07d4e69147 ("migration/vmstate: fix array of ptr with nullptrs").  I am
guessing we hit a nullptr (or a bunch of..) here so the JSON part is
ignored.

> 
> Comparing with another similar vmstate spapr_llan/rx_pools in ppc
> (-device spapr-vlan), what I see is:
> 
> {"name": "rx_pool", "array_len": 5, "type": "struct", "struct":
> {"vmsd_name": "spapr_llan/rx_buffer_pool", ... }, "size": 32776}
> 
> So for CSS I'd expect:
> 
> -{"name": "css", "array_len": 256, "type": "struct", "struct": {}, "size": 1}
> +{"name": "css", "array_len": 256, "type": "struct", "struct": {"vmsd_name": "s390_css_img", ...}, "size": 1}
> 
> What is weird is that in my TCG run it also shows the empty struct and
> the script doesn't seem to care. For some reason, in the CI job it
> parses further into the JSON.
> 
> If anyone spots something, let me know. I'll get back to this on Monday
> with a fresh mind.

So I thought about a solution; it's not easy to do it clean in a small
change.  So here it is, not so small but not huge either.  This is the
cleanest I can come up with.. attached at the end.

If it works, we're 100% lucky.  I hope VMSDFieldGeneric in the script will
already work for the nullptrs.  If not, hopefully this provides some
insight so you can move further..

===8<===
From e5339d55f71df2d96d99dbd7eb845f06da0e68aa Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Mon, 6 Jan 2025 13:18:25 -0500
Subject: [PATCH] migration: Dump correct JSON format for nullptr replacement

QEMU plays a trick with null pointers inside an array of pointers in a VMSD
field.  See 07d4e69147 ("migration/vmstate: fix array of ptr with
nullptrs") for more details on why.  The idea makes sense in general, but
it may overlooked the JSON writer where it could write nothing in a
"struct" in the JSON hints section.

We hit some analyze-migration.py issues on s390 recently, showing that some
of the struct field contains nothing, like:

{"name": "css", "array_len": 256, "type": "struct", "struct": {}, "size": 1}

As described in details by Fabiano:

https://lore.kernel.org/r/87pll37cin.fsf@suse.de

It could be that we hit some null pointers there, and JSON was gone when
they're null pointers.

To fix it, instead of hacking around only at VMStateInfo level, do that
from VMStateField level, so that JSON writer can also be involved.  In this
case, JSON writer will replace the pointer array (which used to be a
"struct") to be the real representation of the nullptr field.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/vmstate.c | 118 ++++++++++++++++++++++++++++++++++----------
 1 file changed, 91 insertions(+), 27 deletions(-)

diff --git a/migration/vmstate.c b/migration/vmstate.c
index fa002b24e8..e9321b7846 100644
--- a/migration/vmstate.c
+++ b/migration/vmstate.c
@@ -51,6 +51,36 @@ vmstate_field_exists(const VMStateDescription *vmsd, const VMStateField *field,
     return result;
 }
 
+/*
+ * Create a fake nullptr field when there's a NULL pointer detected in the
+ * array of a VMS_ARRAY_OF_POINTER VMSD field.  It's needed because we
+ * can't dereference the NULL pointer.
+ */
+static const VMStateField *
+vmsd_create_fake_nullptr_field(const VMStateField *field)
+{
+    VMStateField *fake = g_new0(VMStateField, 1);
+
+    /* It can only happen on an array of pointers! */
+    assert(field->flags & VMS_ARRAY_OF_POINTER);
+
+    /* Some of fake's properties should match the original's */
+    fake->name = field->name;
+    fake->version_id = field->version_id;
+
+    /* Do not need "field_exists" check as it always exists (which is null) */
+    fake->field_exists = NULL;
+
+    /* See vmstate_info_nullptr - use 1 byte to represent nullptr */
+    fake->size = 1;
+    fake->info = &vmstate_info_nullptr;
+    fake->flags = VMS_SINGLE;
+
+    /* All the rest fields shouldn't matter.. */
+
+    return (const VMStateField *)fake;
+}
+
 static int vmstate_n_elems(void *opaque, const VMStateField *field)
 {
     int n_elems = 1;
@@ -143,23 +173,39 @@ int vmstate_load_state(QEMUFile *f, const VMStateDescription *vmsd,
             }
             for (i = 0; i < n_elems; i++) {
                 void *curr_elem = first_elem + size * i;
+                const VMStateField *inner_field;
 
                 if (field->flags & VMS_ARRAY_OF_POINTER) {
                     curr_elem = *(void **)curr_elem;
                 }
+
                 if (!curr_elem && size) {
-                    /* if null pointer check placeholder and do not follow */
-                    assert(field->flags & VMS_ARRAY_OF_POINTER);
-                    ret = vmstate_info_nullptr.get(f, curr_elem, size, NULL);
-                } else if (field->flags & VMS_STRUCT) {
-                    ret = vmstate_load_state(f, field->vmsd, curr_elem,
-                                             field->vmsd->version_id);
-                } else if (field->flags & VMS_VSTRUCT) {
-                    ret = vmstate_load_state(f, field->vmsd, curr_elem,
-                                             field->struct_version_id);
+                    /*
+                     * If null pointer found (which should only happen in
+                     * an array of pointers), use null placeholder and do
+                     * not follow.
+                     */
+                    inner_field = vmsd_create_fake_nullptr_field(field);
+                } else {
+                    inner_field = field;
+                }
+
+                if (inner_field->flags & VMS_STRUCT) {
+                    ret = vmstate_load_state(f, inner_field->vmsd, curr_elem,
+                                             inner_field->vmsd->version_id);
+                } else if (inner_field->flags & VMS_VSTRUCT) {
+                    ret = vmstate_load_state(f, inner_field->vmsd, curr_elem,
+                                             inner_field->struct_version_id);
                 } else {
-                    ret = field->info->get(f, curr_elem, size, field);
+                    ret = inner_field->info->get(f, curr_elem, size,
+                                                 inner_field);
                 }
+
+                /* If we used a fake temp field.. free it now */
+                if (inner_field != field) {
+                    g_clear_pointer((gpointer *)&inner_field, g_free);
+                }
+
                 if (ret >= 0) {
                     ret = qemu_file_get_error(f);
                 }
@@ -387,29 +433,50 @@ int vmstate_save_state_v(QEMUFile *f, const VMStateDescription *vmsd,
             }
             for (i = 0; i < n_elems; i++) {
                 void *curr_elem = first_elem + size * i;
+                const VMStateField *inner_field;
 
-                vmsd_desc_field_start(vmsd, vmdesc_loop, field, i, n_elems);
                 old_offset = qemu_file_transferred(f);
                 if (field->flags & VMS_ARRAY_OF_POINTER) {
                     assert(curr_elem);
                     curr_elem = *(void **)curr_elem;
                 }
+
                 if (!curr_elem && size) {
-                    /* if null pointer write placeholder and do not follow */
-                    assert(field->flags & VMS_ARRAY_OF_POINTER);
-                    ret = vmstate_info_nullptr.put(f, curr_elem, size, NULL,
-                                                   NULL);
-                } else if (field->flags & VMS_STRUCT) {
-                    ret = vmstate_save_state(f, field->vmsd, curr_elem,
-                                             vmdesc_loop);
-                } else if (field->flags & VMS_VSTRUCT) {
-                    ret = vmstate_save_state_v(f, field->vmsd, curr_elem,
-                                               vmdesc_loop,
-                                               field->struct_version_id, errp);
+                    /*
+                     * If null pointer found (which should only happen in
+                     * an array of pointers), use null placeholder and do
+                     * not follow.
+                     */
+                    inner_field = vmsd_create_fake_nullptr_field(field);
+                } else {
+                    inner_field = field;
+                }
+
+                vmsd_desc_field_start(vmsd, vmdesc_loop, inner_field,
+                                      i, n_elems);
+
+                if (inner_field->flags & VMS_STRUCT) {
+                    ret = vmstate_save_state(f, inner_field->vmsd,
+                                             curr_elem, vmdesc_loop);
+                } else if (inner_field->flags & VMS_VSTRUCT) {
+                    ret = vmstate_save_state_v(f, inner_field->vmsd,
+                                               curr_elem, vmdesc_loop,
+                                               inner_field->struct_version_id,
+                                               errp);
                 } else {
-                    ret = field->info->put(f, curr_elem, size, field,
-                                     vmdesc_loop);
+                    ret = inner_field->info->put(f, curr_elem, size,
+                                                 inner_field, vmdesc_loop);
                 }
+
+                written_bytes = qemu_file_transferred(f) - old_offset;
+                vmsd_desc_field_end(vmsd, vmdesc_loop, inner_field,
+                                    written_bytes, i);
+
+                /* If we used a fake temp field.. free it now */
+                if (inner_field != field) {
+                    g_clear_pointer((gpointer *)&inner_field, g_free);
+                }
+
                 if (ret) {
                     error_setg(errp, "Save of field %s/%s failed",
                                 vmsd->name, field->name);
@@ -419,9 +486,6 @@ int vmstate_save_state_v(QEMUFile *f, const VMStateDescription *vmsd,
                     return ret;
                 }
 
-                written_bytes = qemu_file_transferred(f) - old_offset;
-                vmsd_desc_field_end(vmsd, vmdesc_loop, field, written_bytes, i);
-
                 /* Compressed arrays only care about the first element */
                 if (vmdesc_loop && vmsd_can_compress(field)) {
                     vmdesc_loop = NULL;
Fabiano Rosas Jan. 6, 2025, 7:24 p.m. UTC | #10
Peter Xu <peterx@redhat.com> writes:

> On Fri, Jan 03, 2025 at 07:34:08PM -0300, Fabiano Rosas wrote:
>> Fabiano Rosas <farosas@suse.de> writes:
>> 
>> > Stefan Hajnoczi <stefanha@gmail.com> writes:
>> >
>> >> On Fri, 3 Jan 2025 at 13:32, Fabiano Rosas <farosas@suse.de> wrote:
>> >>>
>> >>> Thomas Huth <thuth@redhat.com> writes:
>> >>>
>> >>> > On 20/12/2024 17.28, Peter Xu wrote:
>> >>> >> On Thu, Dec 19, 2024 at 03:53:22PM -0300, Fabiano Rosas wrote:
>> >>> >>> Stefan Hajnoczi <stefanha@redhat.com> writes:
>> >>> >>>
>> >>> >>>> Hi Fabiano,
>> >>> >>>> Please take a look at this CI failure:
>> >>> >>>>
>> >>> >>>>>>> MALLOC_PERTURB_=61 QTEST_QEMU_BINARY=./qemu-system-s390x UBSAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1:print_stacktrace=1 QTEST_QEMU_IMG=./qemu-img MESON_TEST_ITERATION=1 MSAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1:print_stacktrace=1 ASAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1 PYTHON=/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/pyvenv/bin/python3 QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon G_TEST_DBUS_DAEMON=/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/tests/dbus-vmstate-daemon.sh /home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/tests/qtest/migration-test --tap -k
>> >>> >>>> ――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
>> >>> >>>> stderr:
>> >>> >>>> Traceback (most recent call last):
>> >>> >>>>    File "/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/scripts/analyze-migration.py", line 688, in <module>
>> >>> >>>>      dump.read(dump_memory = args.memory)
>> >>> >>>>    File "/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/scripts/analyze-migration.py", line 625, in read
>> >>> >>>>      section.read()
>> >>> >>>>    File "/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/scripts/analyze-migration.py", line 461, in read
>> >>> >>>>      field['data'] = reader(field, self.file)
>> >>> >>>>    File "/home/gitlab-runner/builds/4S3awx_3/0/qemu-project/qemu/build/scripts/analyze-migration.py", line 434, in __init__
>> >>> >>>>      for field in self.desc['struct']['fields']:
>> >>> >>>> KeyError: 'fields'
>> >>> >>>
>> >>> >>> This is the command line that runs only this specific test:
>> >>> >>>
>> >>> >>> PYTHON=/usr/bin/python3.11 QTEST_QEMU_BINARY=./qemu-system-s390x
>> >>> >>> ./tests/qtest/migration-test -p /s390x/migration/analyze-script
>> >>> >>>
>> >>> >>> I cannot reproduce in migration-next nor in the detached HEAD that the
>> >>> >>> pipeline ran in (had to download the tarball from gitlab).
>> >>> >>>
>> >>> >>> The only s390 patch in this PR is one that I can test just fine with
>> >>> >>> TCG, so there shouldn't be any difference from KVM (i.e. there should be
>> >>> >>> no state being migrated with KVM that is not already migrated with TCG).
>> >>> >>>
>> >>> >>>> warning: fd: migration to a file is deprecated. Use file: instead.
>> >>> >>>> warning: fd: migration to a file is deprecated. Use file: instead.
>> >>> >>>
>> >>> >>> This is harmless.
>> >>> >>>
>> >>> >>>> **
>> >>> >>>> ERROR:../tests/qtest/migration-test.c:36:main: assertion failed (ret == 0): (1 == 0)
>> >>> >>>> (test program exited with status code -6)
>> >>> >>>
>> >>> >>> This is the assert at the end of the tests, irrelevant.
>> >>> >>>
>> >>> >>>>
>> >>> >>>> https://gitlab.com/qemu-project/qemu/-/jobs/8681858344#L8190
>> >>> >>>>
>> >>> >>>> If you find this pull request caused the failure, please send a new
>> >>> >>>> revision. Otherwise please let me know so we can continue to
>> >>> >>>> investigate.
>> >>> >>>
>> >>> >>> I don't have an s390x host at hand so the only thing I can to is to drop
>> >>> >>> that patch and hope that resolves the problem. @Peter, @Thomas, any
>> >>> >>> other ideas? Can you verify this on your end?
>> >>> >>
>> >>> >> Cannot reproduce either here, x86_64 host only.  The report was from s390
>> >>> >> host, though.  I'm not familiar with the s390 patch, I wonder if any of you
>> >>> >> could use plain brain power to figure more things out.
>> >>> >>
>> >>> >> We could wait for 1-2 more days to see whether Thomas can figure it out,
>> >>> >> hopefully easily reproduceable on s390.. or we can also leave that for
>> >>> >> later.  And if the current issue on such fix is s390-host-only, might be
>> >>> >> easier to be picked up by s390 tree, perhaps?
>> >>> >
>> >>> > I tested migration-20241217-pull-request on a s390x (RHEL) host, but I
>> >>> > cannot reproduce the issue there - make check-qtest works without any
>> >>> > problems. Is it maybe related to that specific Ubuntu installation?
>> >>> >
>> >>>
>> >>> Since we cannot reproduce outside of the staging CI, could we run that
>> >>> job again with a diagnostic patch? Here's the rebased PR with the patch:
>> >>>
>> >>> https://gitlab.com/farosas/qemu/-/commits/migration-next
>> >>>
>> >>> (fork CI run: https://gitlab.com/farosas/qemu/-/pipelines/1610691202)
>> >>>
>> >>> Or should I just send a v2 of this PR with the debug patch?
>> >>
>> >> Here is the staging CI pipeline for your migration-next tree:
>> >> https://gitlab.com/qemu-project/qemu/-/pipelines/1610836485
>> >
>> > Great, thanks! Let's find out what is going on...
>> >
>> 
>> It seems the issue is here:
>> 
>> {"name": "css", "array_len": 256, "type": "struct", "struct": {}, "size": 1}
>>                                                               ^
>> And in QEMU:
>> 
>> static const VMStateDescription vmstate_css = {
>>     .name = "s390_css",
>>     ...
>> ->      VMSTATE_ARRAY_OF_POINTER_TO_STRUCT(css, ChannelSubSys, MAX_CSSID + 1,
>>                 0, vmstate_css_img, CssImage),
>> 
>> Is it legal to have an empty array? I would assume so. Are we maybe
>> missing a .needed?
>
> I guess we can always decide to dump things even if it's empty.
>
> When I was looking at this, I saw a trick we played in vmstate dump, see
> 07d4e69147 ("migration/vmstate: fix array of ptr with nullptrs").  I am
> guessing we hit a nullptr (or a bunch of..) here so the JSON part is
> ignored.
>
>> 
>> Comparing with another similar vmstate spapr_llan/rx_pools in ppc
>> (-device spapr-vlan), what I see is:
>> 
>> {"name": "rx_pool", "array_len": 5, "type": "struct", "struct":
>> {"vmsd_name": "spapr_llan/rx_buffer_pool", ... }, "size": 32776}
>> 
>> So for CSS I'd expect:
>> 
>> -{"name": "css", "array_len": 256, "type": "struct", "struct": {}, "size": 1}
>> +{"name": "css", "array_len": 256, "type": "struct", "struct": {"vmsd_name": "s390_css_img", ...}, "size": 1}
>> 
>> What is weird is that in my TCG run it also shows the empty struct and
>> the script doesn't seem to care. For some reason, in the CI job it
>> parses further into the JSON.
>> 
>> If anyone spots something, let me know. I'll get back to this on Monday
>> with a fresh mind.
>

Hi, Peter

We already spoke on IRC, but so everyone is in the same page:

The analyze-migration.py script is broken for s390x even in
master. That's why* we cannot reproduce this issue in our local
setups. The s390-storage_attributes section is failing to parse the last
STATTR_FLAG_EOS, which is a u64 0x1 that the generic code then reads a
byte from and sees 0x0 == QEMU_VM_EOF.

*- yes, this doesn't account for the s390 host that Thomas used
   which didn't reproduce the issue, but still...

The patch is here and I'll include it at the end of the email as well:

https://gitlab.com/farosas/qemu/-/commit/5bcad03aad85556a7b72f79d3574e246a99432c3.patch

> So I thought about a solution; it's not easy to do it clean in a small
> change.  So here it is, not so small but not huge either.  This is the
> cleanest I can come up with.. attached at the end.
>
> If it works, we're 100% lucky.  I hope VMSDFieldGeneric in the script will
> already work for the nullptrs.  If not, hopefully this provides some
> insight so you can move further..
>
> ===8<===
> From e5339d55f71df2d96d99dbd7eb845f06da0e68aa Mon Sep 17 00:00:00 2001
> From: Peter Xu <peterx@redhat.com>
> Date: Mon, 6 Jan 2025 13:18:25 -0500
> Subject: [PATCH] migration: Dump correct JSON format for nullptr replacement
>
> QEMU plays a trick with null pointers inside an array of pointers in a VMSD
> field.  See 07d4e69147 ("migration/vmstate: fix array of ptr with
> nullptrs") for more details on why.  The idea makes sense in general, but
> it may overlooked the JSON writer where it could write nothing in a
> "struct" in the JSON hints section.

Interesting, I didn't know about that. I'm indeed seeing some stray "48"
('0') now in the stream. I'll give your patch a try.

Here's the fix for the pre-existing issue in the script:

-- 8< --
From 5bcad03aad85556a7b72f79d3574e246a99432c3 Mon Sep 17 00:00:00 2001
From: Fabiano Rosas <farosas@suse.de>
Date: Mon, 6 Jan 2025 15:05:31 -0300
Subject: [PATCH 1/2] migration: Fix parsing of s390 stream

The parsing for the S390StorageAttributes section is currently leaving
an unconsumed token that is later interpreted by the generic code as
QEMU_VM_EOF, cutting the parsing short.

The migration will issue a STATTR_FLAG_DONE between iterations, but
there's a final STATTR_FLAG_EOS at .save_complete.

Fixes: 81c2c9dd5d ("tests/qtest/migration-test: Fix analyze-migration.py for s390x")
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 scripts/analyze-migration.py | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/scripts/analyze-migration.py b/scripts/analyze-migration.py
index f2457b1dde..2a2160cbf7 100755
--- a/scripts/analyze-migration.py
+++ b/scripts/analyze-migration.py
@@ -65,6 +65,9 @@ def readvar(self, size = None):
     def tell(self):
         return self.file.tell()
 
+    def seek(self, a, b):
+        return self.file.seek(a, b)
+
     # The VMSD description is at the end of the file, after EOF. Look for
     # the last NULL byte, then for the beginning brace of JSON.
     def read_migration_debug_json(self):
@@ -272,11 +275,24 @@ def __init__(self, file, version_id, device, section_key):
         self.section_key = section_key
 
     def read(self):
+        pos = 0
         while True:
             addr_flags = self.file.read64()
             flags = addr_flags & 0xfff
-            if (flags & (self.STATTR_FLAG_DONE | self.STATTR_FLAG_EOS)):
+
+            if flags & self.STATTR_FLAG_DONE:
+                pos = self.file.tell()
+                continue
+            elif flags & self.STATTR_FLAG_EOS:
                 return
+            else:
+                # No EOS came after DONE, that's OK, but rewind the
+                # stream because this is not our data.
+                if pos:
+                    self.file.seek(pos, 0)
+                    return
+                raise Exception("Unknown flags %x", flags)
+
             if (flags & self.STATTR_FLAG_ERROR):
                 raise Exception("Error in migration stream")
             count = self.file.read64()
Peter Xu Jan. 6, 2025, 8:22 p.m. UTC | #11
On Mon, Jan 06, 2025 at 04:24:53PM -0300, Fabiano Rosas wrote:
> Here's the fix for the pre-existing issue in the script:

For this patch:

> 
> -- 8< --
> From 5bcad03aad85556a7b72f79d3574e246a99432c3 Mon Sep 17 00:00:00 2001
> From: Fabiano Rosas <farosas@suse.de>
> Date: Mon, 6 Jan 2025 15:05:31 -0300
> Subject: [PATCH 1/2] migration: Fix parsing of s390 stream
> 
> The parsing for the S390StorageAttributes section is currently leaving
> an unconsumed token that is later interpreted by the generic code as
> QEMU_VM_EOF, cutting the parsing short.

Better mention why it can be intepreted as QEMU_VM_EOF, especially s390's
tag is 8 bytes while QEMU's EOF is 1 byte.. that confused me. :)

> 
> The migration will issue a STATTR_FLAG_DONE between iterations, but
> there's a final STATTR_FLAG_EOS at .save_complete.
> 
> Fixes: 81c2c9dd5d ("tests/qtest/migration-test: Fix analyze-migration.py for s390x")
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
>  scripts/analyze-migration.py | 18 +++++++++++++++++-
>  1 file changed, 17 insertions(+), 1 deletion(-)
> 
> diff --git a/scripts/analyze-migration.py b/scripts/analyze-migration.py
> index f2457b1dde..2a2160cbf7 100755
> --- a/scripts/analyze-migration.py
> +++ b/scripts/analyze-migration.py
> @@ -65,6 +65,9 @@ def readvar(self, size = None):
>      def tell(self):
>          return self.file.tell()
>  
> +    def seek(self, a, b):
> +        return self.file.seek(a, b)
> +
>      # The VMSD description is at the end of the file, after EOF. Look for
>      # the last NULL byte, then for the beginning brace of JSON.
>      def read_migration_debug_json(self):
> @@ -272,11 +275,24 @@ def __init__(self, file, version_id, device, section_key):
>          self.section_key = section_key
>  
>      def read(self):
> +        pos = 0
>          while True:
>              addr_flags = self.file.read64()
>              flags = addr_flags & 0xfff
> -            if (flags & (self.STATTR_FLAG_DONE | self.STATTR_FLAG_EOS)):
> +
> +            if flags & self.STATTR_FLAG_DONE:
> +                pos = self.file.tell()
> +                continue
> +            elif flags & self.STATTR_FLAG_EOS:
>                  return
> +            else:
> +                # No EOS came after DONE, that's OK, but rewind the
> +                # stream because this is not our data.
> +                if pos:
> +                    self.file.seek(pos, 0)

Nit: use io.SEEK_SET.

> +                    return
> +                raise Exception("Unknown flags %x", flags)
> +
>              if (flags & self.STATTR_FLAG_ERROR):
>                  raise Exception("Error in migration stream")
>              count = self.file.read64()

With above:

Reviewed-by: Peter Xu <peterx@redhat.com>