
[v9,0/7] Allow to enable multifd and postcopy migration together

Message ID 20250411114534.3370816-1-ppandit@redhat.com (mailing list archive)

Message

Prasad Pandit April 11, 2025, 11:45 a.m. UTC
From: Prasad Pandit <pjp@fedoraproject.org>

 Hello,


* This series (v9) makes minor refactoring and reordering changes as
  suggested in the review of the earlier series (v8). I also tried to
  reproduce/debug a qtest hang issue, but it could not be reproduced.
  From the shared stack traces it looked like the Postcopy thread was
  preparing to finish before all pages had been migrated.
===
67/67 qemu:qtest+qtest-x86_64 / qtest-x86_64/migration-test                 OK             170.50s   81 subtests passed
===


v8: https://lore.kernel.org/qemu-devel/20250318123846.1370312-1-ppandit@redhat.com/T/#t
* This series (v8) splits the earlier patch-2, which enabled the multifd
  and postcopy options together, into two separate patches. The first
  modifies the channel discovery in the migration_ioc_process_incoming()
  function, and the second enables multifd and postcopy migration together.

  It also adds the 'save_postcopy_prepare' savevm_state handler to allow
  different sections to take an action just before the Postcopy phase
  starts. Thank you, Peter, for these patches.
===
67/67 qemu:qtest+qtest-x86_64 / qtest-x86_64/migration-test                 OK             152.66s   81 subtests passed
===


v7: https://lore.kernel.org/qemu-devel/20250228121749.553184-1-ppandit@redhat.com/T/#t
* This series (v7) adds a 'MULTIFD_RECV_SYNC' migration command. It is
  used to notify the destination migration thread to synchronise with the
  Multifd threads. This allows the Multifd ('mig/dst/recv_x') threads on
  the destination to receive all of their data before they are shut down.
  (A rough sketch of the synchronisation idea follows the test results
  below.)

  This series also updates the channel discovery function and qtests as
  suggested in the previous review comments.
===
67/67 qemu:qtest+qtest-x86_64 / qtest-x86_64/migration-test                 OK             147.84s   81 subtests passed
===
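
For illustration, here is a minimal sketch of the synchronisation idea
mentioned above. It is not the actual patch: the structure, fields and
helpers (MultiFDRecvSyncState, sem_sync, multifd_recv_sync, ...) are
assumptions; only QemuSemaphore and the qemu_sem_* primitives are
existing QEMU APIs.

    #include "qemu/osdep.h"
    #include "qemu/thread.h"

    /* Hypothetical per-channel bookkeeping for draining multifd recv
     * threads before the Postcopy phase begins. */
    typedef struct {
        int nchannels;
        QemuSemaphore *sem_sync;    /* one semaphore per recv thread */
    } MultiFDRecvSyncState;

    static MultiFDRecvSyncState recv_state;

    static void multifd_recv_sync_init(int nchannels)
    {
        recv_state.nchannels = nchannels;
        recv_state.sem_sync = g_new0(QemuSemaphore, nchannels);
        for (int i = 0; i < nchannels; i++) {
            qemu_sem_init(&recv_state.sem_sync[i], 0);
        }
    }

    /* Each multifd recv thread calls this once it has read and applied
     * all of the data queued on its channel. */
    static void multifd_recv_thread_drained(int channel_id)
    {
        qemu_sem_post(&recv_state.sem_sync[channel_id]);
    }

    /* The destination migration thread calls this when it sees the
     * MULTIFD_RECV_SYNC command: block until every recv thread has
     * signalled that it is drained. */
    static void multifd_recv_sync(void)
    {
        for (int i = 0; i < recv_state.nchannels; i++) {
            qemu_sem_wait(&recv_state.sem_sync[i]);
        }
    }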


v6: https://lore.kernel.org/qemu-devel/20250215123119.814345-1-ppandit@redhat.com/T/#t
* This series (v6) shuts down the Multifd threads before starting
  Postcopy migration. This avoids an issue where multifd pages arrive
  late at the destination during the Postcopy phase and corrupt the vCPU
  state. It also reorders the qtest patches and makes some refactoring
  changes suggested in the previous review. (A sketch of the intended
  ordering follows the test results below.)
===
67/67 qemu:qtest+qtest-x86_64 / qtest-x86_64/migration-test                 OK             161.35s   73 subtests passed
===
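
For illustration, a sketch of the ordering this change aims for on the
source side. The surrounding function name and its placement are
assumptions; error handling and the real postcopy_start() logic are
omitted.

    static void prepare_postcopy_switchover(void)
    {
        if (migrate_multifd()) {
            /*
             * Flush and join all multifd send threads first, so that no
             * multifd pages are still in flight once the destination
             * starts serving page faults.
             */
            multifd_send_shutdown();
        }
        /* ... then continue with the usual postcopy_start() steps ... */
    }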


v5: https://lore.kernel.org/qemu-devel/20250205122712.229151-1-ppandit@redhat.com/T/#t
* This series (v5) consolidates the setting of migration capabilities
  into a single 'set_migration_capabilities()' function, simplifying the
  test sources. It passes all migration tests. (A sketch of such a helper
  follows the test results below.)
===
66/66 qemu:qtest+qtest-x86_64 / qtest-x86_64/migration-test                 OK             143.66s   71 subtests passed
===
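
For illustration, a minimal sketch of what such a consolidated helper
could look like in the qtests. The exact signature in the series may
differ, and the capability list below is only an example; it assumes the
existing migrate_set_capability() qtest helper.

    /* Sketch of a consolidated capability-setting helper for the
     * migration qtests; the argument list is an assumption. */
    static void set_migration_capabilities(QTestState *from, QTestState *to,
                                           const char * const *caps,
                                           size_t ncaps)
    {
        for (size_t i = 0; i < ncaps; i++) {
            migrate_set_capability(from, caps[i], true);
            migrate_set_capability(to, caps[i], true);
        }
    }

    /* Example: capabilities a multifd+postcopy test would enable. */
    static const char * const multifd_postcopy_caps[] = {
        "multifd", "postcopy-ram",
    };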


v4: https://lore.kernel.org/qemu-devel/20250127120823.144949-1-ppandit@redhat.com/T/#t
* This series (v4) adds more 'multifd+postcopy' qtests, which run Precopy
  migrations with the 'postcopy-ram' capability set and Postcopy
  migrations with 'multifd' channels enabled.
===
$ ../qtest/migration-test --tap -k -r '/x86_64/migration/multifd+postcopy' | grep -i 'slow test'
# slow test /x86_64/migration/multifd+postcopy/plain executed in 1.29 secs
# slow test /x86_64/migration/multifd+postcopy/recovery/tls/psk executed in 2.48 secs
# slow test /x86_64/migration/multifd+postcopy/preempt/plain executed in 1.49 secs
# slow test /x86_64/migration/multifd+postcopy/preempt/recovery/tls/psk executed in 2.52 secs
# slow test /x86_64/migration/multifd+postcopy/tcp/tls/psk/match executed in 3.62 secs
# slow test /x86_64/migration/multifd+postcopy/tcp/plain/zstd executed in 1.34 secs
# slow test /x86_64/migration/multifd+postcopy/tcp/plain/cancel executed in 2.24 secs
...
66/66 qemu:qtest+qtest-x86_64 / qtest-x86_64/migration-test                 OK             148.41s   71 subtests passed
===


v3: https://lore.kernel.org/qemu-devel/20250121131032.1611245-1-ppandit@redhat.com/T/#t
* This series (v3) passes all existing 'tests/qtest/migration/*' tests
  and adds a new one to enable multifd channels with postcopy migration.


v2: https://lore.kernel.org/qemu-devel/20241129122256.96778-1-ppandit@redhat.com/T/#u
* This series (v2) further refactors the 'ram_save_target_page'
  function to make it independent of the multifd & postcopy change.


v1: https://lore.kernel.org/qemu-devel/20241126115748.118683-1-ppandit@redhat.com/T/#u
* This series removes the 4-byte magic value introduced in the previous
  series for the Postcopy channel.


v0: https://lore.kernel.org/qemu-devel/20241029150908.1136894-1-ppandit@redhat.com/T/#u
* Currently, Multifd and Postcopy migration cannot be used together;
  QEMU reports the "Postcopy is not yet compatible with multifd" error.

  When migrating guests with large (hundreds of GB) RAM, Multifd threads
  help to accelerate migration, but the inability to use them with
  Postcopy mode delays guest start-up on the destination side.

* This patch series allows Multifd and Postcopy migration to be enabled
  together. Precopy and Multifd threads work during the initial guest
  (RAM) transfer. When migration moves to the Postcopy phase, the Multifd
  threads are restrained and the Postcopy threads start to request pages
  from the source side.

* This series introduces a 4-byte magic value to be sent on the Postcopy
  channel. It helps to differentiate the channels and properly set up
  incoming connections on the destination side.
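
For illustration, a self-contained sketch of the v0 channel-identification
idea (dropped again in v1): peek at the first four bytes of an incoming
connection without consuming them, and route the connection accordingly.
The magic value and helper below are invented for this example; QEMU's
actual channel discovery lives in migration_ioc_process_incoming().

    #include <stdint.h>
    #include <sys/socket.h>
    #include <arpa/inet.h>

    /* Hypothetical 4-byte marker sent first on the Postcopy channel. */
    #define POSTCOPY_CHANNEL_MAGIC 0x50435059u

    /* Peek (without consuming) the first 4 bytes of an incoming socket
     * and report whether it looks like the Postcopy channel.
     * Returns 1 for Postcopy, 0 for other channels, -1 on error. */
    static int is_postcopy_channel(int sockfd)
    {
        uint32_t magic;
        ssize_t n = recv(sockfd, &magic, sizeof(magic), MSG_PEEK);

        if (n != (ssize_t)sizeof(magic)) {
            return -1;          /* error or short peek; caller retries */
        }
        return ntohl(magic) == POSTCOPY_CHANNEL_MAGIC;
    }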


Thank you.
---
Peter Xu (2):
  migration: Add save_postcopy_prepare() savevm handler
  migration/ram: Implement save_postcopy_prepare()

Prasad Pandit (5):
  migration/multifd: move macros to multifd header
  migration: refactor channel discovery mechanism
  migration: enable multifd and postcopy together
  tests/qtest/migration: consolidate set capabilities
  tests/qtest/migration: add postcopy tests with multifd

 include/migration/register.h              |  15 +++
 migration/migration.c                     | 136 ++++++++++++----------
 migration/multifd-nocomp.c                |   3 +-
 migration/multifd.c                       |  12 +-
 migration/multifd.h                       |   5 +
 migration/options.c                       |   5 -
 migration/ram.c                           |  42 ++++++-
 migration/savevm.c                        |  33 ++++++
 migration/savevm.h                        |   1 +
 tests/qtest/migration/compression-tests.c |  38 +++++-
 tests/qtest/migration/cpr-tests.c         |   6 +-
 tests/qtest/migration/file-tests.c        |  58 +++++----
 tests/qtest/migration/framework.c         |  76 ++++++++----
 tests/qtest/migration/framework.h         |   9 +-
 tests/qtest/migration/misc-tests.c        |   4 +-
 tests/qtest/migration/postcopy-tests.c    |  35 +++++-
 tests/qtest/migration/precopy-tests.c     |  48 +++++---
 tests/qtest/migration/tls-tests.c         |  70 ++++++++++-
 18 files changed, 437 insertions(+), 159 deletions(-)

Comments

Fabiano Rosas April 16, 2025, 12:31 a.m. UTC | #1
Prasad Pandit <ppandit@redhat.com> writes:

> From: Prasad Pandit <pjp@fedoraproject.org>
>
>  Hello,
>
>
> * This series (v9) does minor refactoring and reordering changes as
>   suggested in the review of earlier series (v8). Also tried to
>   reproduce/debug a qtest hang issue, but it could not be reproduced.
>   From the shared stack traces it looked like Postcopy thread was
>   preparing to finish before migrating all the pages.

The issue is that a zero page is being migrated by multifd, but there's
an optimization in place that skips faulting the page in on the
destination. Later, during postcopy, when the page is found to be
missing, postcopy (@migrate_send_rp_req_pages) believes the page is
already present because the receivedmap bit for that pfn is set, so the
code accessing the guest memory just sits there waiting for the page.

It seems your series has a logical conflict with this work that was done
a while back:

https://lore.kernel.org/all/20240401154110.2028453-1-yuan1.liu@intel.com/

The usage of receivedmap for multifd was supposed to be mutually
exclusive with postcopy. Take a look at the description of that series
and at postcopy_place_page_zero(). We need to figure out what needs to
change and how to do that compatibly. It might just be a case of
memsetting the zero page always for postcopy, but I haven't thought too
much about it.
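
A rough sketch of that direction, for illustration only: the function and
its placement in the multifd recv path are assumptions; only
migrate_postcopy_ram() and ramblock_recv_bitmap_set() are existing calls.

    /* Illustrative sketch: when postcopy may follow, materialise the
     * zero page instead of only marking it in receivedmap. */
    static void multifd_recv_zero_page(RAMBlock *rb, void *host_page,
                                       size_t page_size)
    {
        if (migrate_postcopy_ram()) {
            /* Fault the page in and make it really zero, so a later
             * postcopy fault never blocks on a page that receivedmap
             * already claims is present. */
            memset(host_page, 0, page_size);
        }
        /* else: keep the optimisation and let the kernel supply zeros
         * on first access. */

        ramblock_recv_bitmap_set(rb, host_page);   /* mark as received */
    }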

There's also other issues with the series:

https://gitlab.com/farosas/qemu/-/pipelines/1770488059

The CI workers don't support userfaultfd, so the tests need to check for
that properly. We have MigrationTestEnv::has_uffd for that.
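
For illustration, the gating would look something like this in a test
(migration_get_env() and the test name are assumptions; has_uffd is the
flag mentioned above):

    /* Sketch: skip multifd+postcopy tests on hosts without userfaultfd. */
    static void test_multifd_postcopy_plain(void)
    {
        MigrationTestEnv *env = migration_get_env();

        if (!env->has_uffd) {
            g_test_skip("userfaultfd not available on this host");
            return;
        }

        /* ... set capabilities and run the postcopy migration ... */
    }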

Lastly, I have seen some weirdness with TLS channel disconnections
leading to asserts in qio_channel_shutdown() in my testing. I'll get a
better look at those tomorrow.
Fabiano Rosas April 16, 2025, 12:59 p.m. UTC | #2
Fabiano Rosas <farosas@suse.de> writes:

[...]
> Lastly, I have seen some weirdness with TLS channel disconnections
> leading to asserts in qio_channel_shutdown() in my testing. I'll get a
> better look at those tomorrow.

Ok, you can ignore this last paragraph. I was seeing the postcopy
recovery test disconnect messages, those are benign.