mbox series

[v7,0/5] Allow to enable multifd and postcopy migration together

Message ID 20250228121749.553184-1-ppandit@redhat.com (mailing list archive)
Headers show
Series Allow to enable multifd and postcopy migration together | expand

Message

Prasad Pandit Feb. 28, 2025, 12:17 p.m. UTC
From: Prasad Pandit <pjp@fedoraproject.org>

Hello,

* This series (v7) adds 'MULTIFD_RECV_SYNC' migration command. It is used
  to notify the destination migration thread to synchronise with the Multifd
  threads. This allows Multifd ('mig/dst/recv_x') threads on the destination
  to receive all their data, before they are shutdown.

  This series also updates the channel discovery function and qtests as
  suggested in the previous review comments.
===
67/67 qemu:qtest+qtest-x86_64 / qtest-x86_64/migration-test                 OK             147.84s   81 subtests passed
===


v6: https://lore.kernel.org/qemu-devel/20250215123119.814345-1-ppandit@redhat.com/T/#t
* This series (v6) shuts down Multifd threads before starting Postcopy
  migration. It helps to avoid an issue of multifd pages arriving late
  at the destination during Postcopy phase and corrupting the vCPU
  state. It also reorders the qtest patches and does some refactoring
  changes as suggested in previous review.
===
67/67 qemu:qtest+qtest-x86_64 / qtest-x86_64/migration-test                 OK             161.35s   73 subtests passed
===


v5: https://lore.kernel.org/qemu-devel/20250205122712.229151-1-ppandit@redhat.com/T/#t
* This series (v5) consolidates migration capabilities setting in one
  'set_migration_capabilities()' function, thus simplifying test sources.
  It passes all migration tests.
===
66/66 qemu:qtest+qtest-x86_64 / qtest-x86_64/migration-test                 OK             143.66s   71 subtests passed
===


v4: https://lore.kernel.org/qemu-devel/20250127120823.144949-1-ppandit@redhat.com/T/#t
* This series (v4) adds more 'multifd+postcopy' qtests which test
  Precopy migration with 'postcopy-ram' attribute set. And run
  Postcopy migrations with 'multifd' channels enabled.
===
$ ../qtest/migration-test --tap -k -r '/x86_64/migration/multifd+postcopy' | grep -i 'slow test'
# slow test /x86_64/migration/multifd+postcopy/plain executed in 1.29 secs
# slow test /x86_64/migration/multifd+postcopy/recovery/tls/psk executed in 2.48 secs
# slow test /x86_64/migration/multifd+postcopy/preempt/plain executed in 1.49 secs
# slow test /x86_64/migration/multifd+postcopy/preempt/recovery/tls/psk executed in 2.52 secs
# slow test /x86_64/migration/multifd+postcopy/tcp/tls/psk/match executed in 3.62 secs
# slow test /x86_64/migration/multifd+postcopy/tcp/plain/zstd executed in 1.34 secs
# slow test /x86_64/migration/multifd+postcopy/tcp/plain/cancel executed in 2.24 secs
...
66/66 qemu:qtest+qtest-x86_64 / qtest-x86_64/migration-test                 OK             148.41s   71 subtests passed
===


v3: https://lore.kernel.org/qemu-devel/20250121131032.1611245-1-ppandit@redhat.com/T/#t
* This series (v3) passes all existing 'tests/qtest/migration/*' tests
  and adds a new one to enable multifd channels with postcopy migration.


v2: https://lore.kernel.org/qemu-devel/20241129122256.96778-1-ppandit@redhat.com/T/#u
* This series (v2) further refactors the 'ram_save_target_page'
  function to make it independent of the multifd & postcopy change.


v1: https://lore.kernel.org/qemu-devel/20241126115748.118683-1-ppandit@redhat.com/T/#u
* This series removes magic value (4-bytes) introduced in the
  previous series for the Postcopy channel.


v0: https://lore.kernel.org/qemu-devel/20241029150908.1136894-1-ppandit@redhat.com/T/#u
* Currently Multifd and Postcopy migration can not be used together.
  QEMU shows "Postcopy is not yet compatible with multifd" message.

  When migrating guests with large (100's GB) RAM, Multifd threads
  help to accelerate migration, but inability to use it with the
  Postcopy mode delays guest start up on the destination side.

* This patch series allows to enable both Multifd and Postcopy
  migration together. Precopy and Multifd threads work during
  the initial guest (RAM) transfer. When migration moves to the
  Postcopy phase, Multifd threads are restrained and the Postcopy
  threads start to request pages from the source side.

* This series introduces magic value (4-bytes) to be sent on the
  Postcopy channel. It helps to differentiate channels and properly
  setup incoming connections on the destination side.


Thank you.
---
Prasad Pandit (5):
  migration/multifd: move macros to multifd header
  migration: enable multifd and postcopy together
  tests/qtest/migration: consolidate set capabilities
  tests/qtest/migration: add postcopy tests with multifd
  migration: add MULTIFD_RECV_SYNC migration command

 migration/migration.c                     | 136 +++++++++++++---------
 migration/multifd-nocomp.c                |  28 +++--
 migration/multifd.c                       |  17 +--
 migration/multifd.h                       |   6 +
 migration/options.c                       |   5 -
 migration/ram.c                           |   7 +-
 migration/savevm.c                        |  13 +++
 migration/savevm.h                        |   1 +
 tests/qtest/migration/compression-tests.c |  37 +++++-
 tests/qtest/migration/cpr-tests.c         |   6 +-
 tests/qtest/migration/file-tests.c        |  58 +++++----
 tests/qtest/migration/framework.c         |  76 ++++++++----
 tests/qtest/migration/framework.h         |   9 +-
 tests/qtest/migration/misc-tests.c        |   4 +-
 tests/qtest/migration/postcopy-tests.c    |  35 +++++-
 tests/qtest/migration/precopy-tests.c     |  48 +++++---
 tests/qtest/migration/tls-tests.c         |  69 ++++++++++-
 17 files changed, 388 insertions(+), 167 deletions(-)

Comments

Fabiano Rosas Feb. 28, 2025, 2:53 p.m. UTC | #1
Prasad Pandit <ppandit@redhat.com> writes:

> From: Prasad Pandit <pjp@fedoraproject.org>
>
> Hello,
>
> * This series (v7) adds 'MULTIFD_RECV_SYNC' migration command. It is used
>   to notify the destination migration thread to synchronise with the Multifd
>   threads. This allows Multifd ('mig/dst/recv_x') threads on the destination
>   to receive all their data, before they are shutdown.
>
>   This series also updates the channel discovery function and qtests as
>   suggested in the previous review comments.

You forgot to split patch 2. We cannot have a commit that revamps the
channel discovery logic and enables a new feature at the same
time. Changing the channel discovery affects all the migration
use-cases, that change cannot be smuggled along with multifd+postcopy
enablement.

Similarly, the multifd+postcopy enablement is a new feature that needs
to be tested and reasoned upon in isolation, it cannot bring along a
bunch of previously existing code that was shuffled around. We need to
be able to understand clearly what is done _in preparation for_ the
feature and what is done _as part of_ the feature.

Not to mention bisect and backporting. Many people will be looking at
this code in the future without any knowledge of migration, but as part
of a bisect section or when searching for missing backports in the
distros.

I also suggested to move that logic into channel.c. The code is now
well-contained enough that we don't need to be reading it every time
someone is going over the migration flow, it becomes just a helper
function.

> ===
> 67/67 qemu:qtest+qtest-x86_64 / qtest-x86_64/migration-test                 OK             147.84s   81 subtests passed

I see postcopy/multifd/plain hanging from time to time. Probably due to
the changes in patch 5. Please take a look.

> ===
>
>
> v6: https://lore.kernel.org/qemu-devel/20250215123119.814345-1-ppandit@redhat.com/T/#t
> * This series (v6) shuts down Multifd threads before starting Postcopy
>   migration. It helps to avoid an issue of multifd pages arriving late
>   at the destination during Postcopy phase and corrupting the vCPU
>   state. It also reorders the qtest patches and does some refactoring
>   changes as suggested in previous review.
> ===
> 67/67 qemu:qtest+qtest-x86_64 / qtest-x86_64/migration-test                 OK             161.35s   73 subtests passed
> ===
>
>
> v5: https://lore.kernel.org/qemu-devel/20250205122712.229151-1-ppandit@redhat.com/T/#t
> * This series (v5) consolidates migration capabilities setting in one
>   'set_migration_capabilities()' function, thus simplifying test sources.
>   It passes all migration tests.
> ===
> 66/66 qemu:qtest+qtest-x86_64 / qtest-x86_64/migration-test                 OK             143.66s   71 subtests passed
> ===
>
>
> v4: https://lore.kernel.org/qemu-devel/20250127120823.144949-1-ppandit@redhat.com/T/#t
> * This series (v4) adds more 'multifd+postcopy' qtests which test
>   Precopy migration with 'postcopy-ram' attribute set. And run
>   Postcopy migrations with 'multifd' channels enabled.
> ===
> $ ../qtest/migration-test --tap -k -r '/x86_64/migration/multifd+postcopy' | grep -i 'slow test'
> # slow test /x86_64/migration/multifd+postcopy/plain executed in 1.29 secs
> # slow test /x86_64/migration/multifd+postcopy/recovery/tls/psk executed in 2.48 secs
> # slow test /x86_64/migration/multifd+postcopy/preempt/plain executed in 1.49 secs
> # slow test /x86_64/migration/multifd+postcopy/preempt/recovery/tls/psk executed in 2.52 secs
> # slow test /x86_64/migration/multifd+postcopy/tcp/tls/psk/match executed in 3.62 secs
> # slow test /x86_64/migration/multifd+postcopy/tcp/plain/zstd executed in 1.34 secs
> # slow test /x86_64/migration/multifd+postcopy/tcp/plain/cancel executed in 2.24 secs
> ...
> 66/66 qemu:qtest+qtest-x86_64 / qtest-x86_64/migration-test                 OK             148.41s   71 subtests passed
> ===
>
>
> v3: https://lore.kernel.org/qemu-devel/20250121131032.1611245-1-ppandit@redhat.com/T/#t
> * This series (v3) passes all existing 'tests/qtest/migration/*' tests
>   and adds a new one to enable multifd channels with postcopy migration.
>
>
> v2: https://lore.kernel.org/qemu-devel/20241129122256.96778-1-ppandit@redhat.com/T/#u
> * This series (v2) further refactors the 'ram_save_target_page'
>   function to make it independent of the multifd & postcopy change.
>
>
> v1: https://lore.kernel.org/qemu-devel/20241126115748.118683-1-ppandit@redhat.com/T/#u
> * This series removes magic value (4-bytes) introduced in the
>   previous series for the Postcopy channel.
>
>
> v0: https://lore.kernel.org/qemu-devel/20241029150908.1136894-1-ppandit@redhat.com/T/#u
> * Currently Multifd and Postcopy migration can not be used together.
>   QEMU shows "Postcopy is not yet compatible with multifd" message.
>
>   When migrating guests with large (100's GB) RAM, Multifd threads
>   help to accelerate migration, but inability to use it with the
>   Postcopy mode delays guest start up on the destination side.
>
> * This patch series allows to enable both Multifd and Postcopy
>   migration together. Precopy and Multifd threads work during
>   the initial guest (RAM) transfer. When migration moves to the
>   Postcopy phase, Multifd threads are restrained and the Postcopy
>   threads start to request pages from the source side.
>
> * This series introduces magic value (4-bytes) to be sent on the
>   Postcopy channel. It helps to differentiate channels and properly
>   setup incoming connections on the destination side.
>
>
> Thank you.
> ---
> Prasad Pandit (5):
>   migration/multifd: move macros to multifd header
>   migration: enable multifd and postcopy together
>   tests/qtest/migration: consolidate set capabilities
>   tests/qtest/migration: add postcopy tests with multifd
>   migration: add MULTIFD_RECV_SYNC migration command
>
>  migration/migration.c                     | 136 +++++++++++++---------
>  migration/multifd-nocomp.c                |  28 +++--
>  migration/multifd.c                       |  17 +--
>  migration/multifd.h                       |   6 +
>  migration/options.c                       |   5 -
>  migration/ram.c                           |   7 +-
>  migration/savevm.c                        |  13 +++
>  migration/savevm.h                        |   1 +
>  tests/qtest/migration/compression-tests.c |  37 +++++-
>  tests/qtest/migration/cpr-tests.c         |   6 +-
>  tests/qtest/migration/file-tests.c        |  58 +++++----
>  tests/qtest/migration/framework.c         |  76 ++++++++----
>  tests/qtest/migration/framework.h         |   9 +-
>  tests/qtest/migration/misc-tests.c        |   4 +-
>  tests/qtest/migration/postcopy-tests.c    |  35 +++++-
>  tests/qtest/migration/precopy-tests.c     |  48 +++++---
>  tests/qtest/migration/tls-tests.c         |  69 ++++++++++-
>  17 files changed, 388 insertions(+), 167 deletions(-)
Prasad Pandit March 3, 2025, 10:47 a.m. UTC | #2
Hello Fabiano,

On Fri, 28 Feb 2025 at 20:23, Fabiano Rosas <farosas@suse.de> wrote:
> You forgot to split patch 2. We cannot have a commit that revamps the
> channel discovery logic and enables a new feature at the same
> time. Changing the channel discovery affects all the migration
> use-cases, that change cannot be smuggled along with multifd+postcopy
> enablement.

===
>> Continue with this patch and fix the stuff I mentioned. You can ignore
>> the first two paragraphs of that reply.
>>
>> https://lore.kernel.org/r/87y0y4tf5q.fsf@suse.de
>>
>> I still think we need to test that preempt + multifd scenario, but it
>> should be easy to write a test for that once the series is in more of a
>> final shape.
===

* I thought that major overhaul of the channel discovery part by
moving it to channel.c was to be done subsequently, no? As we
discussed, we need to touch the channel discovery parts in
'migration_ioc_process_incoming' because, we need to identify the
Postcopy channel when its connection comes in. So the Patch-2 does
minimal changes that are _required_ to enable the two (multifd &
postcopy) features together.

* Whereas, moving channel discovery parts out to connections.c could
be done separately with its own reasoning that - we are moving it
outside to connection.c because it fits better there.

> Similarly, the multifd+postcopy enablement is a new feature that needs
> to be tested and reasoned upon in isolation, it cannot bring along a
> bunch of previously existing code that was shuffled around. We need to
> be able to understand clearly what is done _in preparation for_ the
> feature and what is done _as part of_ the feature.
...
> Not to mention bisect and backporting. Many people will be looking at
> this code in the future without any knowledge of migration, but as part
> of a bisect section or when searching for missing backports in the
> distros.

* I think we (you, me, Peter) are all looking at things differently.
    - In my view Patch-2 is the minimal change _required_  to enable
multifd & postcopy. In your view we are _revamping_ channel discovery
parts while _sneaking_ in a feature of enabling multifd & postcopy
together.
    - In my view Patch-5 in this series is an isolated change because
it adds a new migration command to allow multifd threads sync from
source side. But Peter thinks without that 'flush and sync' Patch-2 is
incomplete, so we should merge it back there.

* I've also shared my view about not trying to do everything in this
one series. I don't know how do we move forward from here. I'm also
not sure what the final _acceptable_ patch-set should look like. I
thought a tested working patch-set is a good start. At this point I
think I'll just follow your lead.

> I also suggested to move that logic into channel.c. The code is now
> well-contained enough that we don't need to be reading it every time
> someone is going over the migration flow, it becomes just a helper
> function.

* How exactly do we want to divide patch-2 to move channel discovery
parts to channel.c?
===
     if (!migration_has_main_and_multifd_channels()) {
     }
     ...
     } else {
        channel = CH_MAIN;
    }
===
Do we move this entire if - else - else block to channel.c from
migration_ioc_process_incoming() ?

> > ===
> > 67/67 qemu:qtest+qtest-x86_64 / qtest-x86_64/migration-test                 OK             147.84s   81 subtests passed
>
> I see postcopy/multifd/plain hanging from time to time. Probably due to
> the changes in patch 5. Please take a look.

* Okay. Does it hang indefinitely or for brief moments?

Thank you.
---
  - Prasad
Peter Xu March 3, 2025, 2:12 p.m. UTC | #3
On Mon, Mar 03, 2025 at 04:17:53PM +0530, Prasad Pandit wrote:
> * I think we (you, me, Peter) are all looking at things differently.
>     - In my view Patch-2 is the minimal change _required_  to enable
> multifd & postcopy. In your view we are _revamping_ channel discovery
> parts while _sneaking_ in a feature of enabling multifd & postcopy
> together.
>     - In my view Patch-5 in this series is an isolated change because
> it adds a new migration command to allow multifd threads sync from
> source side. But Peter thinks without that 'flush and sync' Patch-2 is
> incomplete, so we should merge it back there.

Just to mention, my suggestion does not conflict with splitting patch 2, as
long as you keep every patch complete on its own.

Patch 5 needs to be squashed to either patch 2 or a split patch out of
patch 2, because current patch 2 (or any possible way to split it into
smaller ones, then one of them which enables the feature) is buggy.

Thanks,
Prasad Pandit March 4, 2025, 9:47 a.m. UTC | #4
Hi,

On Mon, 3 Mar 2025 at 19:42, Peter Xu <peterx@redhat.com> wrote:
> On Mon, Mar 03, 2025 at 04:17:53PM +0530, Prasad Pandit wrote:
> > * I think we (you, me, Peter) are all looking at things differently.
> >     - In my view Patch-2 is the minimal change _required_  to enable
> > multifd & postcopy. In your view we are _revamping_ channel discovery
> > parts while _sneaking_ in a feature of enabling multifd & postcopy
> > together.
> >     - In my view Patch-5 in this series is an isolated change because
> > it adds a new migration command to allow multifd threads sync from
> > source side. But Peter thinks without that 'flush and sync' Patch-2 is
> > incomplete, so we should merge it back there.
>
> Just to mention, my suggestion does not conflict with splitting patch 2, as
> long as you keep every patch complete on its own.
>
> Patch 5 needs to be squashed to either patch 2 or a split patch out of
> patch 2, because current patch 2 (or any possible way to split it into
> smaller ones, then one of them which enables the feature) is buggy.

* I'll try to segregate different parts, then we can discuss how to
split them across patches:

Terminology:
    _requires_  => is without which migration shall not work at all.
    _essential_ => is without which there may be issues.

1. Enable Multifd and Postcopy together
    - It _requires_ removal of the Multifd capability check in
migration/options.c
    - It _requires_ identification of the CH_POSTCOPY connection when it arrives
        - so enum { CH_MAIN, CH_MULTIFD, CH_POSTCOPY } is defined
        - To identify channels, related changes are made to the
channel discovery part (if .. else conditionals) in
migration_ioc_process_incoming() function.
        - These changes help to establish all channel connections.
After that, the migration proceeds as usual until it's time to start
the Postcopy phase.
    - When time comes to start Postcopy phase, we shutdown multifd channels.
        - it _requires_ calling multifd_send_shutdown()
        - It _requires_ moving
file_cleanup_outgoing/socket_cleanup_outgoing calls to
migration_cleanup() function.
    - When Postcopy phase starts, we don't want ram_save_target
_page() function to call ram_save_multifd_page() function, because
migrate_multifd() is still true.
        - It _requires_ adding the !migration_in_postcopy() checks.

* Above changes are minimal _required_ to enable multifd and postcopy
together, while also ensuring that migration continues to work when
those options are not enabled together. With these changes, guest
migration across two machines works without any observed failures.

2. The multifd_ram_flush_and_sync() call/command and the
assert(!migration_in_postcopy()) call in multifd_recv_thread()
    - These calls help to ensure that no multifd data is left behind
when the Postcopy phase starts.
    - And that multifd_recv threads shall not be active when the
Postcopy is running.
    - They protect the guests from potential state corruption.

* It is up to us to decide whether (2) is _required_ OR _essential_
for the feature. Individual opinions can vary here.

3. Revamp of the channel discovery parts by moving those bits to
connection.c or other places.
    - This entails moving the channel discovery parts from
migration_ioc_process_incoming() function to somewhere else because it
makes more sense to move it there and maybe it reduces complexity and
makes the sources easier to understand.*

* It is up to us to decide whether (3) is _required_  OR  _essential_
for the feature. Individual opinions can vary here.

* IMHO (1) is _required_,  (2) is _essential_,  and (3) is neither
_required_ nor _essential_ for the - Enable multifd and postcopy
together - feature. (3) is a completely unrelated change to this
feature.

Since it is an individual opinion, we all can think differently here
and that is perfectly fine. Once we have some consensus, we can decide
how to split or merge patches and move forward.

Hope it helps. Thank you.
---
  - Prasad
Peter Xu March 4, 2025, 2:42 p.m. UTC | #5
On Tue, Mar 04, 2025 at 03:17:14PM +0530, Prasad Pandit wrote:
> Hi,
> 
> On Mon, 3 Mar 2025 at 19:42, Peter Xu <peterx@redhat.com> wrote:
> > On Mon, Mar 03, 2025 at 04:17:53PM +0530, Prasad Pandit wrote:
> > > * I think we (you, me, Peter) are all looking at things differently.
> > >     - In my view Patch-2 is the minimal change _required_  to enable
> > > multifd & postcopy. In your view we are _revamping_ channel discovery
> > > parts while _sneaking_ in a feature of enabling multifd & postcopy
> > > together.
> > >     - In my view Patch-5 in this series is an isolated change because
> > > it adds a new migration command to allow multifd threads sync from
> > > source side. But Peter thinks without that 'flush and sync' Patch-2 is
> > > incomplete, so we should merge it back there.
> >
> > Just to mention, my suggestion does not conflict with splitting patch 2, as
> > long as you keep every patch complete on its own.
> >
> > Patch 5 needs to be squashed to either patch 2 or a split patch out of
> > patch 2, because current patch 2 (or any possible way to split it into
> > smaller ones, then one of them which enables the feature) is buggy.
> 
> * I'll try to segregate different parts, then we can discuss how to
> split them across patches:
> 
> Terminology:
>     _requires_  => is without which migration shall not work at all.
>     _essential_ => is without which there may be issues.
> 
> 1. Enable Multifd and Postcopy together
>     - It _requires_ removal of the Multifd capability check in
> migration/options.c
>     - It _requires_ identification of the CH_POSTCOPY connection when it arrives
>         - so enum { CH_MAIN, CH_MULTIFD, CH_POSTCOPY } is defined
>         - To identify channels, related changes are made to the
> channel discovery part (if .. else conditionals) in
> migration_ioc_process_incoming() function.
>         - These changes help to establish all channel connections.
> After that, the migration proceeds as usual until it's time to start
> the Postcopy phase.
>     - When time comes to start Postcopy phase, we shutdown multifd channels.
>         - it _requires_ calling multifd_send_shutdown()
>         - It _requires_ moving
> file_cleanup_outgoing/socket_cleanup_outgoing calls to
> migration_cleanup() function.
>     - When Postcopy phase starts, we don't want ram_save_target
> _page() function to call ram_save_multifd_page() function, because
> migrate_multifd() is still true.
>         - It _requires_ adding the !migration_in_postcopy() checks.

IIUC Fabiano is not asking you to drop them, but split them.  Split still
"requires" them to be present, as long as before the enablement patch.

For example, if you want you can put the channel changes into a separate
patch, but without enabling the feature.  That single patch (after applied)
should keep migration working as before.

> 
> * Above changes are minimal _required_ to enable multifd and postcopy
> together, while also ensuring that migration continues to work when
> those options are not enabled together. With these changes, guest
> migration across two machines works without any observed failures.
> 
> 2. The multifd_ram_flush_and_sync() call/command and the
> assert(!migration_in_postcopy()) call in multifd_recv_thread()
>     - These calls help to ensure that no multifd data is left behind
> when the Postcopy phase starts.
>     - And that multifd_recv threads shall not be active when the
> Postcopy is running.
>     - They protect the guests from potential state corruption.
> 
> * It is up to us to decide whether (2) is _required_ OR _essential_
> for the feature. Individual opinions can vary here.
> 
> 3. Revamp of the channel discovery parts by moving those bits to
> connection.c or other places.
>     - This entails moving the channel discovery parts from
> migration_ioc_process_incoming() function to somewhere else because it
> makes more sense to move it there and maybe it reduces complexity and
> makes the sources easier to understand.*
> 
> * It is up to us to decide whether (3) is _required_  OR  _essential_
> for the feature. Individual opinions can vary here.
> 
> * IMHO (1) is _required_,  (2) is _essential_,  and (3) is neither
> _required_ nor _essential_ for the - Enable multifd and postcopy
> together - feature. (3) is a completely unrelated change to this
> feature.
> 
> Since it is an individual opinion, we all can think differently here
> and that is perfectly fine. Once we have some consensus, we can decide
> how to split or merge patches and move forward.
> 
> Hope it helps. Thank you.
> ---
>   - Prasad
>
Prasad Pandit March 5, 2025, 7:41 a.m. UTC | #6
Hi,

On Tue, 4 Mar 2025 at 20:12, Peter Xu <peterx@redhat.com> wrote:
> IIUC Fabiano is not asking you to drop them, but split them.  Split still
> "requires" them to be present, as long as before the enablement patch.

* Yes, same here; Even I am not suggesting to drop anything. Fabiano
mentioned the following about the changes being done:

    1.  _in preparation for_ the feature  (multifd+postcopy enablement)
    2.  _as part of_ the feature (multifd+postcopy enablement)
    3. Concerns about git bisect and backporting of patches, not
missing patches when backporting.

* I am thinking:
    1. The _required_ changes are the _as part of_  the feature changes.
    2. The refactoring of ram_save_target_page and moving of
MULTIFD_MAGIC/VERSION macros to multifd.h patch can be termed as _in
preparation for_ the feature. Of these
        - Refactoring of 'ram_save_target_page' function patch is
already pulled upstream.
            -> https://gitlab.com/qemu-project/qemu/-/commit/bc38dc2f5f350310724fd7d4f0a09f8c3a4811fa
        - Similarly moving of MULTIFD_ macros patch can be pulled. It
has not changed for many revisions.
    3. The _essential_ changes are further refinement of the feature
(multifd+postcopy enablement).

* IMO, we can split patches as:
     - One patch for -> _required_ OR _as part of_ the feature changes
     - One patch for -> MULTIFD_ macros as _in preparation for_ the
feature change
     - One patch for -> _essential_  changes : flush and sync related
     - Two patches for qtest cases related changes.
This division also seems helpful for git bisect and backporting
related concerns.

* Currently we are divided over what constitutes:  _required_  OR  _as
part of_ the feature changes.
    - Hence the request for individual opinions towards that.

* If we want to merge _required_/_as part of_ the feature  and
_essential_  changes together in one patch - I'm okay.
* If we want to split the _required_ / _as part of_ the feature patch
further into two, we need to define exactly how we do that division. -
[???]

* I have shared my thoughts above, I won't hold on to it unduly.  We
need to find a way to move forward. I'm willing to go with either way
we decide.

Thank you.
---
  - Prasad
Fabiano Rosas March 5, 2025, 1:56 p.m. UTC | #7
Prasad Pandit <ppandit@redhat.com> writes:

> Hi,
>
> On Tue, 4 Mar 2025 at 20:12, Peter Xu <peterx@redhat.com> wrote:
>> IIUC Fabiano is not asking you to drop them, but split them.  Split still
>> "requires" them to be present, as long as before the enablement patch.
>
> * Yes, same here; Even I am not suggesting to drop anything. Fabiano
> mentioned the following about the changes being done:
>
>     1.  _in preparation for_ the feature  (multifd+postcopy enablement)
>     2.  _as part of_ the feature (multifd+postcopy enablement)

The "channel discovery" part can be considered as a separate patch
because in order to integrate it with the new feature you had to
introduce a new concept of enum CH_* which is not a trivially verifiable
change. That change has the potential to break other use-cases of
migration and although the tests are passing we must be cautions about
it. Simply having that part as a patch is enough separation that an
issue there won't be conflated with the multifd+postcopy feature in
general.

I also called it "in preparation" for the feature because the channel
discovery part is clearly a preexisting issue, in the sense that had we
had proper magic numbers in all channels or some other mechanism to
diffrentiate channels, we wouldn't need this patch. That's a cleanup
you're doing beforehand in order to make the subsequent patches simpler.

Changing that code makes sense regardless of whether there is a new
feature being introduced.

Note that none of this is out of the ordinary, you'll find such
discussions in any thread on this community. It may feel arbitrary to
you because that's tacit knowledge we gathered along the years.

>     3. Concerns about git bisect and backporting of patches, not
> missing patches when backporting.

Yes, whenever you can frame a code change as a pure refactoring, you
should do that. If, say, you had to change 100 lines in order to
implement your feature, but 90 of those lines are just preserving the
existing behavior, that is a good candidate to be a separate patch.

Reviewers will be more inclined to approve the patch, since reading code
and making sure it doesn't change behavior is easier than also
accounting for the new behavior simultaneously.

Maintainers moving patches around, merging, testing, etc will have an
easier time when solving git conflicts because the separate code is more
easily compared with the old one.

Backporters have similar issues with git, but also face questions about
the safety of bringing a patch as a prerequisite into older QEMU
versions. Imagine someone wants to enable multifd+postcopy in QEMU 9.2,
it'll be easier for them to have code more clearly defined as
patches. Another scenario could be if the changes to the channel
discovery actually fix a bug that we're currently not aware of. That
would show up in a bisect session and we wouldn't want to apply a patch
that fixes a bug if the same patch also enables a feature.

>
> * I am thinking:
>     1. The _required_ changes are the _as part of_  the feature changes.
>     2. The refactoring of ram_save_target_page and moving of
> MULTIFD_MAGIC/VERSION macros to multifd.h patch can be termed as _in
> preparation for_ the feature. Of these
>         - Refactoring of 'ram_save_target_page' function patch is
> already pulled upstream.
>             -> https://gitlab.com/qemu-project/qemu/-/commit/bc38dc2f5f350310724fd7d4f0a09f8c3a4811fa
>         - Similarly moving of MULTIFD_ macros patch can be pulled. It
> has not changed for many revisions.
>     3. The _essential_ changes are further refinement of the feature
> (multifd+postcopy enablement).
>
> * IMO, we can split patches as:
>      - One patch for -> _required_ OR _as part of_ the feature changes
>      - One patch for -> MULTIFD_ macros as _in preparation for_ the
> feature change
>      - One patch for -> _essential_  changes : flush and sync related
>      - Two patches for qtest cases related changes.
> This division also seems helpful for git bisect and backporting
> related concerns.
>
> * Currently we are divided over what constitutes:  _required_  OR  _as
> part of_ the feature changes.
>     - Hence the request for individual opinions towards that.
>
> * If we want to merge _required_/_as part of_ the feature  and
> _essential_  changes together in one patch - I'm okay.
> * If we want to split the _required_ / _as part of_ the feature patch
> further into two, we need to define exactly how we do that division. -
> [???]
>

We need an extra patch that reads:

 migration: Refactor channel discovery mechanism
 
 The various logical migration channels don't have a standardized way of
 advertising themselves and their connections may be seen out of order
 by the migration destination. When a new connection arrives, the
 incoming migration currently make use of heuristics to determine which
 channel it belongs to.
 
 The next few patches will need to change how the multifd and postcopy
 capabilities interact and that affects the channel discovery heuristic.
 
 Refactor the channel discovery heuristic to make it less opaque and
 simplify the subsequent patches.
 
 <some description of the new code which might be pertinent>
 ---

You'd move all of the channel discovery code into this patch. Some of it
will be unreacheable because multifd is not yet allowed with postcopy,
but that's fine. You can mention it on the commit message.

About moving the code out of migration.c, it was a suggestion that
you're free to push back. Ideally, doing the work would be faster than
arguing against it on the mailing list. But that's fine.

About the hang in the test. It doesn't reproduce often, but once it
does, it hangs forever (although I haven't waited that long).

> * I have shared my thoughts above, I won't hold on to it unduly.  We
> need to find a way to move forward. I'm willing to go with either way
> we decide.
>
> Thank you.
> ---
>   - Prasad
Prasad Pandit March 6, 2025, 7:51 a.m. UTC | #8
Hello Fabiano,

On Wed, 5 Mar 2025 at 19:26, Fabiano Rosas <farosas@suse.de> wrote:
> Note that none of this is out of the ordinary, you'll find such
> discussions in any thread on this community. It may feel arbitrary to
> you because that's tacit knowledge we gathered along the years.

* I understand. I don't find it arbitrary.

> We need an extra patch that reads:
>
>  migration: Refactor channel discovery mechanism
>
>  The various logical migration channels don't have a standardized way of
>  advertising themselves and their connections may be seen out of order
>  by the migration destination. When a new connection arrives, the
>  incoming migration currently make use of heuristics to determine which
>  channel it belongs to.
>
>  The next few patches will need to change how the multifd and postcopy
>  capabilities interact and that affects the channel discovery heuristic.
>
>  Refactor the channel discovery heuristic to make it less opaque and
>  simplify the subsequent patches.
>
>  <some description of the new code which might be pertinent>
>  ---
>
> You'd move all of the channel discovery code into this patch. Some of it
> will be unreacheable because multifd is not yet allowed with postcopy,
> but that's fine. You can mention it on the commit message.

Please see:
    -> https://privatebin.net/?dad6f052dd986f9f#FULnfrCV29NkQpvsQyvWuU4HdYjDwFbUPbDtvLro7mwi

* Does this division look okay?

> About moving the code out of migration.c, it was a suggestion that
> you're free to push back. Ideally, doing the work would be faster than
> arguing against it on the mailing list. But that's fine.

* Same here, I'm not against moving that code part to connection.c OR
doing the work. My suggestion has been to do that movement in another
series and not try to do everything in this one series.

> About the hang in the test. It doesn't reproduce often, but once it
> does, it hangs forever (although I haven't waited that long).

* Okay, I'm not seeing it or able to reproduce it across 3 different
machines. One is my laptop and the other 2 are servers wherein I'm
testing migrations of guests with 64G/128G of RAM and guest dirtying
memory to the tune of 68M/128M/256M bytes. I'll keep an eye on it if I
find something.

Thank you.
---
  - Prasad
Fabiano Rosas March 6, 2025, 1:48 p.m. UTC | #9
Prasad Pandit <ppandit@redhat.com> writes:

> Hello Fabiano,
>
> On Wed, 5 Mar 2025 at 19:26, Fabiano Rosas <farosas@suse.de> wrote:
>> Note that none of this is out of the ordinary, you'll find such
>> discussions in any thread on this community. It may feel arbitrary to
>> you because that's tacit knowledge we gathered along the years.
>
> * I understand. I don't find it arbitrary.
>
>> We need an extra patch that reads:
>>
>>  migration: Refactor channel discovery mechanism
>>
>>  The various logical migration channels don't have a standardized way of
>>  advertising themselves and their connections may be seen out of order
>>  by the migration destination. When a new connection arrives, the
>>  incoming migration currently make use of heuristics to determine which
>>  channel it belongs to.
>>
>>  The next few patches will need to change how the multifd and postcopy
>>  capabilities interact and that affects the channel discovery heuristic.
>>
>>  Refactor the channel discovery heuristic to make it less opaque and
>>  simplify the subsequent patches.
>>
>>  <some description of the new code which might be pertinent>
>>  ---
>>
>> You'd move all of the channel discovery code into this patch. Some of it
>> will be unreacheable because multifd is not yet allowed with postcopy,
>> but that's fine. You can mention it on the commit message.
>
> Please see:
>     -> https://privatebin.net/?dad6f052dd986f9f#FULnfrCV29NkQpvsQyvWuU4HdYjDwFbUPbDtvLro7mwi
>
> * Does this division look okay?
>

Yes.

>> About moving the code out of migration.c, it was a suggestion that
>> you're free to push back. Ideally, doing the work would be faster than
>> arguing against it on the mailing list. But that's fine.
>
> * Same here, I'm not against moving that code part to connection.c OR
> doing the work. My suggestion has been to do that movement in another
> series and not try to do everything in this one series.
>
>> About the hang in the test. It doesn't reproduce often, but once it
>> does, it hangs forever (although I haven't waited that long).
>
> * Okay, I'm not seeing it or able to reproduce it across 3 different
> machines. One is my laptop and the other 2 are servers wherein I'm
> testing migrations of guests with 64G/128G of RAM and guest dirtying
> memory to the tune of 68M/128M/256M bytes. I'll keep an eye on it if I
> find something.

Usually a loaded (or slow) machine is needed to reproduce multifd
synchronization issues. Sometimes running the test in a loop in parallel
with some other workload helps to uncover them. The CI also tends to
have slower machines that hit these problems.