Message ID | 20250228121749.553184-1-ppandit@redhat.com (mailing list archive) |
---|---|
Headers | show |
Series | Allow to enable multifd and postcopy migration together | expand |
Prasad Pandit <ppandit@redhat.com> writes: > From: Prasad Pandit <pjp@fedoraproject.org> > > Hello, > > * This series (v7) adds 'MULTIFD_RECV_SYNC' migration command. It is used > to notify the destination migration thread to synchronise with the Multifd > threads. This allows Multifd ('mig/dst/recv_x') threads on the destination > to receive all their data, before they are shutdown. > > This series also updates the channel discovery function and qtests as > suggested in the previous review comments. You forgot to split patch 2. We cannot have a commit that revamps the channel discovery logic and enables a new feature at the same time. Changing the channel discovery affects all the migration use-cases, that change cannot be smuggled along with multifd+postcopy enablement. Similarly, the multifd+postcopy enablement is a new feature that needs to be tested and reasoned upon in isolation, it cannot bring along a bunch of previously existing code that was shuffled around. We need to be able to understand clearly what is done _in preparation for_ the feature and what is done _as part of_ the feature. Not to mention bisect and backporting. Many people will be looking at this code in the future without any knowledge of migration, but as part of a bisect section or when searching for missing backports in the distros. I also suggested to move that logic into channel.c. The code is now well-contained enough that we don't need to be reading it every time someone is going over the migration flow, it becomes just a helper function. > === > 67/67 qemu:qtest+qtest-x86_64 / qtest-x86_64/migration-test OK 147.84s 81 subtests passed I see postcopy/multifd/plain hanging from time to time. Probably due to the changes in patch 5. Please take a look. > === > > > v6: https://lore.kernel.org/qemu-devel/20250215123119.814345-1-ppandit@redhat.com/T/#t > * This series (v6) shuts down Multifd threads before starting Postcopy > migration. It helps to avoid an issue of multifd pages arriving late > at the destination during Postcopy phase and corrupting the vCPU > state. It also reorders the qtest patches and does some refactoring > changes as suggested in previous review. > === > 67/67 qemu:qtest+qtest-x86_64 / qtest-x86_64/migration-test OK 161.35s 73 subtests passed > === > > > v5: https://lore.kernel.org/qemu-devel/20250205122712.229151-1-ppandit@redhat.com/T/#t > * This series (v5) consolidates migration capabilities setting in one > 'set_migration_capabilities()' function, thus simplifying test sources. > It passes all migration tests. > === > 66/66 qemu:qtest+qtest-x86_64 / qtest-x86_64/migration-test OK 143.66s 71 subtests passed > === > > > v4: https://lore.kernel.org/qemu-devel/20250127120823.144949-1-ppandit@redhat.com/T/#t > * This series (v4) adds more 'multifd+postcopy' qtests which test > Precopy migration with 'postcopy-ram' attribute set. And run > Postcopy migrations with 'multifd' channels enabled. > === > $ ../qtest/migration-test --tap -k -r '/x86_64/migration/multifd+postcopy' | grep -i 'slow test' > # slow test /x86_64/migration/multifd+postcopy/plain executed in 1.29 secs > # slow test /x86_64/migration/multifd+postcopy/recovery/tls/psk executed in 2.48 secs > # slow test /x86_64/migration/multifd+postcopy/preempt/plain executed in 1.49 secs > # slow test /x86_64/migration/multifd+postcopy/preempt/recovery/tls/psk executed in 2.52 secs > # slow test /x86_64/migration/multifd+postcopy/tcp/tls/psk/match executed in 3.62 secs > # slow test /x86_64/migration/multifd+postcopy/tcp/plain/zstd executed in 1.34 secs > # slow test /x86_64/migration/multifd+postcopy/tcp/plain/cancel executed in 2.24 secs > ... > 66/66 qemu:qtest+qtest-x86_64 / qtest-x86_64/migration-test OK 148.41s 71 subtests passed > === > > > v3: https://lore.kernel.org/qemu-devel/20250121131032.1611245-1-ppandit@redhat.com/T/#t > * This series (v3) passes all existing 'tests/qtest/migration/*' tests > and adds a new one to enable multifd channels with postcopy migration. > > > v2: https://lore.kernel.org/qemu-devel/20241129122256.96778-1-ppandit@redhat.com/T/#u > * This series (v2) further refactors the 'ram_save_target_page' > function to make it independent of the multifd & postcopy change. > > > v1: https://lore.kernel.org/qemu-devel/20241126115748.118683-1-ppandit@redhat.com/T/#u > * This series removes magic value (4-bytes) introduced in the > previous series for the Postcopy channel. > > > v0: https://lore.kernel.org/qemu-devel/20241029150908.1136894-1-ppandit@redhat.com/T/#u > * Currently Multifd and Postcopy migration can not be used together. > QEMU shows "Postcopy is not yet compatible with multifd" message. > > When migrating guests with large (100's GB) RAM, Multifd threads > help to accelerate migration, but inability to use it with the > Postcopy mode delays guest start up on the destination side. > > * This patch series allows to enable both Multifd and Postcopy > migration together. Precopy and Multifd threads work during > the initial guest (RAM) transfer. When migration moves to the > Postcopy phase, Multifd threads are restrained and the Postcopy > threads start to request pages from the source side. > > * This series introduces magic value (4-bytes) to be sent on the > Postcopy channel. It helps to differentiate channels and properly > setup incoming connections on the destination side. > > > Thank you. > --- > Prasad Pandit (5): > migration/multifd: move macros to multifd header > migration: enable multifd and postcopy together > tests/qtest/migration: consolidate set capabilities > tests/qtest/migration: add postcopy tests with multifd > migration: add MULTIFD_RECV_SYNC migration command > > migration/migration.c | 136 +++++++++++++--------- > migration/multifd-nocomp.c | 28 +++-- > migration/multifd.c | 17 +-- > migration/multifd.h | 6 + > migration/options.c | 5 - > migration/ram.c | 7 +- > migration/savevm.c | 13 +++ > migration/savevm.h | 1 + > tests/qtest/migration/compression-tests.c | 37 +++++- > tests/qtest/migration/cpr-tests.c | 6 +- > tests/qtest/migration/file-tests.c | 58 +++++---- > tests/qtest/migration/framework.c | 76 ++++++++---- > tests/qtest/migration/framework.h | 9 +- > tests/qtest/migration/misc-tests.c | 4 +- > tests/qtest/migration/postcopy-tests.c | 35 +++++- > tests/qtest/migration/precopy-tests.c | 48 +++++--- > tests/qtest/migration/tls-tests.c | 69 ++++++++++- > 17 files changed, 388 insertions(+), 167 deletions(-)
Hello Fabiano, On Fri, 28 Feb 2025 at 20:23, Fabiano Rosas <farosas@suse.de> wrote: > You forgot to split patch 2. We cannot have a commit that revamps the > channel discovery logic and enables a new feature at the same > time. Changing the channel discovery affects all the migration > use-cases, that change cannot be smuggled along with multifd+postcopy > enablement. === >> Continue with this patch and fix the stuff I mentioned. You can ignore >> the first two paragraphs of that reply. >> >> https://lore.kernel.org/r/87y0y4tf5q.fsf@suse.de >> >> I still think we need to test that preempt + multifd scenario, but it >> should be easy to write a test for that once the series is in more of a >> final shape. === * I thought that major overhaul of the channel discovery part by moving it to channel.c was to be done subsequently, no? As we discussed, we need to touch the channel discovery parts in 'migration_ioc_process_incoming' because, we need to identify the Postcopy channel when its connection comes in. So the Patch-2 does minimal changes that are _required_ to enable the two (multifd & postcopy) features together. * Whereas, moving channel discovery parts out to connections.c could be done separately with its own reasoning that - we are moving it outside to connection.c because it fits better there. > Similarly, the multifd+postcopy enablement is a new feature that needs > to be tested and reasoned upon in isolation, it cannot bring along a > bunch of previously existing code that was shuffled around. We need to > be able to understand clearly what is done _in preparation for_ the > feature and what is done _as part of_ the feature. ... > Not to mention bisect and backporting. Many people will be looking at > this code in the future without any knowledge of migration, but as part > of a bisect section or when searching for missing backports in the > distros. * I think we (you, me, Peter) are all looking at things differently. - In my view Patch-2 is the minimal change _required_ to enable multifd & postcopy. In your view we are _revamping_ channel discovery parts while _sneaking_ in a feature of enabling multifd & postcopy together. - In my view Patch-5 in this series is an isolated change because it adds a new migration command to allow multifd threads sync from source side. But Peter thinks without that 'flush and sync' Patch-2 is incomplete, so we should merge it back there. * I've also shared my view about not trying to do everything in this one series. I don't know how do we move forward from here. I'm also not sure what the final _acceptable_ patch-set should look like. I thought a tested working patch-set is a good start. At this point I think I'll just follow your lead. > I also suggested to move that logic into channel.c. The code is now > well-contained enough that we don't need to be reading it every time > someone is going over the migration flow, it becomes just a helper > function. * How exactly do we want to divide patch-2 to move channel discovery parts to channel.c? === if (!migration_has_main_and_multifd_channels()) { } ... } else { channel = CH_MAIN; } === Do we move this entire if - else - else block to channel.c from migration_ioc_process_incoming() ? > > === > > 67/67 qemu:qtest+qtest-x86_64 / qtest-x86_64/migration-test OK 147.84s 81 subtests passed > > I see postcopy/multifd/plain hanging from time to time. Probably due to > the changes in patch 5. Please take a look. * Okay. Does it hang indefinitely or for brief moments? Thank you. --- - Prasad
On Mon, Mar 03, 2025 at 04:17:53PM +0530, Prasad Pandit wrote: > * I think we (you, me, Peter) are all looking at things differently. > - In my view Patch-2 is the minimal change _required_ to enable > multifd & postcopy. In your view we are _revamping_ channel discovery > parts while _sneaking_ in a feature of enabling multifd & postcopy > together. > - In my view Patch-5 in this series is an isolated change because > it adds a new migration command to allow multifd threads sync from > source side. But Peter thinks without that 'flush and sync' Patch-2 is > incomplete, so we should merge it back there. Just to mention, my suggestion does not conflict with splitting patch 2, as long as you keep every patch complete on its own. Patch 5 needs to be squashed to either patch 2 or a split patch out of patch 2, because current patch 2 (or any possible way to split it into smaller ones, then one of them which enables the feature) is buggy. Thanks,
Hi, On Mon, 3 Mar 2025 at 19:42, Peter Xu <peterx@redhat.com> wrote: > On Mon, Mar 03, 2025 at 04:17:53PM +0530, Prasad Pandit wrote: > > * I think we (you, me, Peter) are all looking at things differently. > > - In my view Patch-2 is the minimal change _required_ to enable > > multifd & postcopy. In your view we are _revamping_ channel discovery > > parts while _sneaking_ in a feature of enabling multifd & postcopy > > together. > > - In my view Patch-5 in this series is an isolated change because > > it adds a new migration command to allow multifd threads sync from > > source side. But Peter thinks without that 'flush and sync' Patch-2 is > > incomplete, so we should merge it back there. > > Just to mention, my suggestion does not conflict with splitting patch 2, as > long as you keep every patch complete on its own. > > Patch 5 needs to be squashed to either patch 2 or a split patch out of > patch 2, because current patch 2 (or any possible way to split it into > smaller ones, then one of them which enables the feature) is buggy. * I'll try to segregate different parts, then we can discuss how to split them across patches: Terminology: _requires_ => is without which migration shall not work at all. _essential_ => is without which there may be issues. 1. Enable Multifd and Postcopy together - It _requires_ removal of the Multifd capability check in migration/options.c - It _requires_ identification of the CH_POSTCOPY connection when it arrives - so enum { CH_MAIN, CH_MULTIFD, CH_POSTCOPY } is defined - To identify channels, related changes are made to the channel discovery part (if .. else conditionals) in migration_ioc_process_incoming() function. - These changes help to establish all channel connections. After that, the migration proceeds as usual until it's time to start the Postcopy phase. - When time comes to start Postcopy phase, we shutdown multifd channels. - it _requires_ calling multifd_send_shutdown() - It _requires_ moving file_cleanup_outgoing/socket_cleanup_outgoing calls to migration_cleanup() function. - When Postcopy phase starts, we don't want ram_save_target _page() function to call ram_save_multifd_page() function, because migrate_multifd() is still true. - It _requires_ adding the !migration_in_postcopy() checks. * Above changes are minimal _required_ to enable multifd and postcopy together, while also ensuring that migration continues to work when those options are not enabled together. With these changes, guest migration across two machines works without any observed failures. 2. The multifd_ram_flush_and_sync() call/command and the assert(!migration_in_postcopy()) call in multifd_recv_thread() - These calls help to ensure that no multifd data is left behind when the Postcopy phase starts. - And that multifd_recv threads shall not be active when the Postcopy is running. - They protect the guests from potential state corruption. * It is up to us to decide whether (2) is _required_ OR _essential_ for the feature. Individual opinions can vary here. 3. Revamp of the channel discovery parts by moving those bits to connection.c or other places. - This entails moving the channel discovery parts from migration_ioc_process_incoming() function to somewhere else because it makes more sense to move it there and maybe it reduces complexity and makes the sources easier to understand.* * It is up to us to decide whether (3) is _required_ OR _essential_ for the feature. Individual opinions can vary here. * IMHO (1) is _required_, (2) is _essential_, and (3) is neither _required_ nor _essential_ for the - Enable multifd and postcopy together - feature. (3) is a completely unrelated change to this feature. Since it is an individual opinion, we all can think differently here and that is perfectly fine. Once we have some consensus, we can decide how to split or merge patches and move forward. Hope it helps. Thank you. --- - Prasad
On Tue, Mar 04, 2025 at 03:17:14PM +0530, Prasad Pandit wrote: > Hi, > > On Mon, 3 Mar 2025 at 19:42, Peter Xu <peterx@redhat.com> wrote: > > On Mon, Mar 03, 2025 at 04:17:53PM +0530, Prasad Pandit wrote: > > > * I think we (you, me, Peter) are all looking at things differently. > > > - In my view Patch-2 is the minimal change _required_ to enable > > > multifd & postcopy. In your view we are _revamping_ channel discovery > > > parts while _sneaking_ in a feature of enabling multifd & postcopy > > > together. > > > - In my view Patch-5 in this series is an isolated change because > > > it adds a new migration command to allow multifd threads sync from > > > source side. But Peter thinks without that 'flush and sync' Patch-2 is > > > incomplete, so we should merge it back there. > > > > Just to mention, my suggestion does not conflict with splitting patch 2, as > > long as you keep every patch complete on its own. > > > > Patch 5 needs to be squashed to either patch 2 or a split patch out of > > patch 2, because current patch 2 (or any possible way to split it into > > smaller ones, then one of them which enables the feature) is buggy. > > * I'll try to segregate different parts, then we can discuss how to > split them across patches: > > Terminology: > _requires_ => is without which migration shall not work at all. > _essential_ => is without which there may be issues. > > 1. Enable Multifd and Postcopy together > - It _requires_ removal of the Multifd capability check in > migration/options.c > - It _requires_ identification of the CH_POSTCOPY connection when it arrives > - so enum { CH_MAIN, CH_MULTIFD, CH_POSTCOPY } is defined > - To identify channels, related changes are made to the > channel discovery part (if .. else conditionals) in > migration_ioc_process_incoming() function. > - These changes help to establish all channel connections. > After that, the migration proceeds as usual until it's time to start > the Postcopy phase. > - When time comes to start Postcopy phase, we shutdown multifd channels. > - it _requires_ calling multifd_send_shutdown() > - It _requires_ moving > file_cleanup_outgoing/socket_cleanup_outgoing calls to > migration_cleanup() function. > - When Postcopy phase starts, we don't want ram_save_target > _page() function to call ram_save_multifd_page() function, because > migrate_multifd() is still true. > - It _requires_ adding the !migration_in_postcopy() checks. IIUC Fabiano is not asking you to drop them, but split them. Split still "requires" them to be present, as long as before the enablement patch. For example, if you want you can put the channel changes into a separate patch, but without enabling the feature. That single patch (after applied) should keep migration working as before. > > * Above changes are minimal _required_ to enable multifd and postcopy > together, while also ensuring that migration continues to work when > those options are not enabled together. With these changes, guest > migration across two machines works without any observed failures. > > 2. The multifd_ram_flush_and_sync() call/command and the > assert(!migration_in_postcopy()) call in multifd_recv_thread() > - These calls help to ensure that no multifd data is left behind > when the Postcopy phase starts. > - And that multifd_recv threads shall not be active when the > Postcopy is running. > - They protect the guests from potential state corruption. > > * It is up to us to decide whether (2) is _required_ OR _essential_ > for the feature. Individual opinions can vary here. > > 3. Revamp of the channel discovery parts by moving those bits to > connection.c or other places. > - This entails moving the channel discovery parts from > migration_ioc_process_incoming() function to somewhere else because it > makes more sense to move it there and maybe it reduces complexity and > makes the sources easier to understand.* > > * It is up to us to decide whether (3) is _required_ OR _essential_ > for the feature. Individual opinions can vary here. > > * IMHO (1) is _required_, (2) is _essential_, and (3) is neither > _required_ nor _essential_ for the - Enable multifd and postcopy > together - feature. (3) is a completely unrelated change to this > feature. > > Since it is an individual opinion, we all can think differently here > and that is perfectly fine. Once we have some consensus, we can decide > how to split or merge patches and move forward. > > Hope it helps. Thank you. > --- > - Prasad >
Hi, On Tue, 4 Mar 2025 at 20:12, Peter Xu <peterx@redhat.com> wrote: > IIUC Fabiano is not asking you to drop them, but split them. Split still > "requires" them to be present, as long as before the enablement patch. * Yes, same here; Even I am not suggesting to drop anything. Fabiano mentioned the following about the changes being done: 1. _in preparation for_ the feature (multifd+postcopy enablement) 2. _as part of_ the feature (multifd+postcopy enablement) 3. Concerns about git bisect and backporting of patches, not missing patches when backporting. * I am thinking: 1. The _required_ changes are the _as part of_ the feature changes. 2. The refactoring of ram_save_target_page and moving of MULTIFD_MAGIC/VERSION macros to multifd.h patch can be termed as _in preparation for_ the feature. Of these - Refactoring of 'ram_save_target_page' function patch is already pulled upstream. -> https://gitlab.com/qemu-project/qemu/-/commit/bc38dc2f5f350310724fd7d4f0a09f8c3a4811fa - Similarly moving of MULTIFD_ macros patch can be pulled. It has not changed for many revisions. 3. The _essential_ changes are further refinement of the feature (multifd+postcopy enablement). * IMO, we can split patches as: - One patch for -> _required_ OR _as part of_ the feature changes - One patch for -> MULTIFD_ macros as _in preparation for_ the feature change - One patch for -> _essential_ changes : flush and sync related - Two patches for qtest cases related changes. This division also seems helpful for git bisect and backporting related concerns. * Currently we are divided over what constitutes: _required_ OR _as part of_ the feature changes. - Hence the request for individual opinions towards that. * If we want to merge _required_/_as part of_ the feature and _essential_ changes together in one patch - I'm okay. * If we want to split the _required_ / _as part of_ the feature patch further into two, we need to define exactly how we do that division. - [???] * I have shared my thoughts above, I won't hold on to it unduly. We need to find a way to move forward. I'm willing to go with either way we decide. Thank you. --- - Prasad
Prasad Pandit <ppandit@redhat.com> writes: > Hi, > > On Tue, 4 Mar 2025 at 20:12, Peter Xu <peterx@redhat.com> wrote: >> IIUC Fabiano is not asking you to drop them, but split them. Split still >> "requires" them to be present, as long as before the enablement patch. > > * Yes, same here; Even I am not suggesting to drop anything. Fabiano > mentioned the following about the changes being done: > > 1. _in preparation for_ the feature (multifd+postcopy enablement) > 2. _as part of_ the feature (multifd+postcopy enablement) The "channel discovery" part can be considered as a separate patch because in order to integrate it with the new feature you had to introduce a new concept of enum CH_* which is not a trivially verifiable change. That change has the potential to break other use-cases of migration and although the tests are passing we must be cautions about it. Simply having that part as a patch is enough separation that an issue there won't be conflated with the multifd+postcopy feature in general. I also called it "in preparation" for the feature because the channel discovery part is clearly a preexisting issue, in the sense that had we had proper magic numbers in all channels or some other mechanism to diffrentiate channels, we wouldn't need this patch. That's a cleanup you're doing beforehand in order to make the subsequent patches simpler. Changing that code makes sense regardless of whether there is a new feature being introduced. Note that none of this is out of the ordinary, you'll find such discussions in any thread on this community. It may feel arbitrary to you because that's tacit knowledge we gathered along the years. > 3. Concerns about git bisect and backporting of patches, not > missing patches when backporting. Yes, whenever you can frame a code change as a pure refactoring, you should do that. If, say, you had to change 100 lines in order to implement your feature, but 90 of those lines are just preserving the existing behavior, that is a good candidate to be a separate patch. Reviewers will be more inclined to approve the patch, since reading code and making sure it doesn't change behavior is easier than also accounting for the new behavior simultaneously. Maintainers moving patches around, merging, testing, etc will have an easier time when solving git conflicts because the separate code is more easily compared with the old one. Backporters have similar issues with git, but also face questions about the safety of bringing a patch as a prerequisite into older QEMU versions. Imagine someone wants to enable multifd+postcopy in QEMU 9.2, it'll be easier for them to have code more clearly defined as patches. Another scenario could be if the changes to the channel discovery actually fix a bug that we're currently not aware of. That would show up in a bisect session and we wouldn't want to apply a patch that fixes a bug if the same patch also enables a feature. > > * I am thinking: > 1. The _required_ changes are the _as part of_ the feature changes. > 2. The refactoring of ram_save_target_page and moving of > MULTIFD_MAGIC/VERSION macros to multifd.h patch can be termed as _in > preparation for_ the feature. Of these > - Refactoring of 'ram_save_target_page' function patch is > already pulled upstream. > -> https://gitlab.com/qemu-project/qemu/-/commit/bc38dc2f5f350310724fd7d4f0a09f8c3a4811fa > - Similarly moving of MULTIFD_ macros patch can be pulled. It > has not changed for many revisions. > 3. The _essential_ changes are further refinement of the feature > (multifd+postcopy enablement). > > * IMO, we can split patches as: > - One patch for -> _required_ OR _as part of_ the feature changes > - One patch for -> MULTIFD_ macros as _in preparation for_ the > feature change > - One patch for -> _essential_ changes : flush and sync related > - Two patches for qtest cases related changes. > This division also seems helpful for git bisect and backporting > related concerns. > > * Currently we are divided over what constitutes: _required_ OR _as > part of_ the feature changes. > - Hence the request for individual opinions towards that. > > * If we want to merge _required_/_as part of_ the feature and > _essential_ changes together in one patch - I'm okay. > * If we want to split the _required_ / _as part of_ the feature patch > further into two, we need to define exactly how we do that division. - > [???] > We need an extra patch that reads: migration: Refactor channel discovery mechanism The various logical migration channels don't have a standardized way of advertising themselves and their connections may be seen out of order by the migration destination. When a new connection arrives, the incoming migration currently make use of heuristics to determine which channel it belongs to. The next few patches will need to change how the multifd and postcopy capabilities interact and that affects the channel discovery heuristic. Refactor the channel discovery heuristic to make it less opaque and simplify the subsequent patches. <some description of the new code which might be pertinent> --- You'd move all of the channel discovery code into this patch. Some of it will be unreacheable because multifd is not yet allowed with postcopy, but that's fine. You can mention it on the commit message. About moving the code out of migration.c, it was a suggestion that you're free to push back. Ideally, doing the work would be faster than arguing against it on the mailing list. But that's fine. About the hang in the test. It doesn't reproduce often, but once it does, it hangs forever (although I haven't waited that long). > * I have shared my thoughts above, I won't hold on to it unduly. We > need to find a way to move forward. I'm willing to go with either way > we decide. > > Thank you. > --- > - Prasad
Hello Fabiano, On Wed, 5 Mar 2025 at 19:26, Fabiano Rosas <farosas@suse.de> wrote: > Note that none of this is out of the ordinary, you'll find such > discussions in any thread on this community. It may feel arbitrary to > you because that's tacit knowledge we gathered along the years. * I understand. I don't find it arbitrary. > We need an extra patch that reads: > > migration: Refactor channel discovery mechanism > > The various logical migration channels don't have a standardized way of > advertising themselves and their connections may be seen out of order > by the migration destination. When a new connection arrives, the > incoming migration currently make use of heuristics to determine which > channel it belongs to. > > The next few patches will need to change how the multifd and postcopy > capabilities interact and that affects the channel discovery heuristic. > > Refactor the channel discovery heuristic to make it less opaque and > simplify the subsequent patches. > > <some description of the new code which might be pertinent> > --- > > You'd move all of the channel discovery code into this patch. Some of it > will be unreacheable because multifd is not yet allowed with postcopy, > but that's fine. You can mention it on the commit message. Please see: -> https://privatebin.net/?dad6f052dd986f9f#FULnfrCV29NkQpvsQyvWuU4HdYjDwFbUPbDtvLro7mwi * Does this division look okay? > About moving the code out of migration.c, it was a suggestion that > you're free to push back. Ideally, doing the work would be faster than > arguing against it on the mailing list. But that's fine. * Same here, I'm not against moving that code part to connection.c OR doing the work. My suggestion has been to do that movement in another series and not try to do everything in this one series. > About the hang in the test. It doesn't reproduce often, but once it > does, it hangs forever (although I haven't waited that long). * Okay, I'm not seeing it or able to reproduce it across 3 different machines. One is my laptop and the other 2 are servers wherein I'm testing migrations of guests with 64G/128G of RAM and guest dirtying memory to the tune of 68M/128M/256M bytes. I'll keep an eye on it if I find something. Thank you. --- - Prasad
Prasad Pandit <ppandit@redhat.com> writes: > Hello Fabiano, > > On Wed, 5 Mar 2025 at 19:26, Fabiano Rosas <farosas@suse.de> wrote: >> Note that none of this is out of the ordinary, you'll find such >> discussions in any thread on this community. It may feel arbitrary to >> you because that's tacit knowledge we gathered along the years. > > * I understand. I don't find it arbitrary. > >> We need an extra patch that reads: >> >> migration: Refactor channel discovery mechanism >> >> The various logical migration channels don't have a standardized way of >> advertising themselves and their connections may be seen out of order >> by the migration destination. When a new connection arrives, the >> incoming migration currently make use of heuristics to determine which >> channel it belongs to. >> >> The next few patches will need to change how the multifd and postcopy >> capabilities interact and that affects the channel discovery heuristic. >> >> Refactor the channel discovery heuristic to make it less opaque and >> simplify the subsequent patches. >> >> <some description of the new code which might be pertinent> >> --- >> >> You'd move all of the channel discovery code into this patch. Some of it >> will be unreacheable because multifd is not yet allowed with postcopy, >> but that's fine. You can mention it on the commit message. > > Please see: > -> https://privatebin.net/?dad6f052dd986f9f#FULnfrCV29NkQpvsQyvWuU4HdYjDwFbUPbDtvLro7mwi > > * Does this division look okay? > Yes. >> About moving the code out of migration.c, it was a suggestion that >> you're free to push back. Ideally, doing the work would be faster than >> arguing against it on the mailing list. But that's fine. > > * Same here, I'm not against moving that code part to connection.c OR > doing the work. My suggestion has been to do that movement in another > series and not try to do everything in this one series. > >> About the hang in the test. It doesn't reproduce often, but once it >> does, it hangs forever (although I haven't waited that long). > > * Okay, I'm not seeing it or able to reproduce it across 3 different > machines. One is my laptop and the other 2 are servers wherein I'm > testing migrations of guests with 64G/128G of RAM and guest dirtying > memory to the tune of 68M/128M/256M bytes. I'll keep an eye on it if I > find something. Usually a loaded (or slow) machine is needed to reproduce multifd synchronization issues. Sometimes running the test in a loop in parallel with some other workload helps to uncover them. The CI also tends to have slower machines that hit these problems.
From: Prasad Pandit <pjp@fedoraproject.org> Hello, * This series (v7) adds 'MULTIFD_RECV_SYNC' migration command. It is used to notify the destination migration thread to synchronise with the Multifd threads. This allows Multifd ('mig/dst/recv_x') threads on the destination to receive all their data, before they are shutdown. This series also updates the channel discovery function and qtests as suggested in the previous review comments. === 67/67 qemu:qtest+qtest-x86_64 / qtest-x86_64/migration-test OK 147.84s 81 subtests passed === v6: https://lore.kernel.org/qemu-devel/20250215123119.814345-1-ppandit@redhat.com/T/#t * This series (v6) shuts down Multifd threads before starting Postcopy migration. It helps to avoid an issue of multifd pages arriving late at the destination during Postcopy phase and corrupting the vCPU state. It also reorders the qtest patches and does some refactoring changes as suggested in previous review. === 67/67 qemu:qtest+qtest-x86_64 / qtest-x86_64/migration-test OK 161.35s 73 subtests passed === v5: https://lore.kernel.org/qemu-devel/20250205122712.229151-1-ppandit@redhat.com/T/#t * This series (v5) consolidates migration capabilities setting in one 'set_migration_capabilities()' function, thus simplifying test sources. It passes all migration tests. === 66/66 qemu:qtest+qtest-x86_64 / qtest-x86_64/migration-test OK 143.66s 71 subtests passed === v4: https://lore.kernel.org/qemu-devel/20250127120823.144949-1-ppandit@redhat.com/T/#t * This series (v4) adds more 'multifd+postcopy' qtests which test Precopy migration with 'postcopy-ram' attribute set. And run Postcopy migrations with 'multifd' channels enabled. === $ ../qtest/migration-test --tap -k -r '/x86_64/migration/multifd+postcopy' | grep -i 'slow test' # slow test /x86_64/migration/multifd+postcopy/plain executed in 1.29 secs # slow test /x86_64/migration/multifd+postcopy/recovery/tls/psk executed in 2.48 secs # slow test /x86_64/migration/multifd+postcopy/preempt/plain executed in 1.49 secs # slow test /x86_64/migration/multifd+postcopy/preempt/recovery/tls/psk executed in 2.52 secs # slow test /x86_64/migration/multifd+postcopy/tcp/tls/psk/match executed in 3.62 secs # slow test /x86_64/migration/multifd+postcopy/tcp/plain/zstd executed in 1.34 secs # slow test /x86_64/migration/multifd+postcopy/tcp/plain/cancel executed in 2.24 secs ... 66/66 qemu:qtest+qtest-x86_64 / qtest-x86_64/migration-test OK 148.41s 71 subtests passed === v3: https://lore.kernel.org/qemu-devel/20250121131032.1611245-1-ppandit@redhat.com/T/#t * This series (v3) passes all existing 'tests/qtest/migration/*' tests and adds a new one to enable multifd channels with postcopy migration. v2: https://lore.kernel.org/qemu-devel/20241129122256.96778-1-ppandit@redhat.com/T/#u * This series (v2) further refactors the 'ram_save_target_page' function to make it independent of the multifd & postcopy change. v1: https://lore.kernel.org/qemu-devel/20241126115748.118683-1-ppandit@redhat.com/T/#u * This series removes magic value (4-bytes) introduced in the previous series for the Postcopy channel. v0: https://lore.kernel.org/qemu-devel/20241029150908.1136894-1-ppandit@redhat.com/T/#u * Currently Multifd and Postcopy migration can not be used together. QEMU shows "Postcopy is not yet compatible with multifd" message. When migrating guests with large (100's GB) RAM, Multifd threads help to accelerate migration, but inability to use it with the Postcopy mode delays guest start up on the destination side. * This patch series allows to enable both Multifd and Postcopy migration together. Precopy and Multifd threads work during the initial guest (RAM) transfer. When migration moves to the Postcopy phase, Multifd threads are restrained and the Postcopy threads start to request pages from the source side. * This series introduces magic value (4-bytes) to be sent on the Postcopy channel. It helps to differentiate channels and properly setup incoming connections on the destination side. Thank you. --- Prasad Pandit (5): migration/multifd: move macros to multifd header migration: enable multifd and postcopy together tests/qtest/migration: consolidate set capabilities tests/qtest/migration: add postcopy tests with multifd migration: add MULTIFD_RECV_SYNC migration command migration/migration.c | 136 +++++++++++++--------- migration/multifd-nocomp.c | 28 +++-- migration/multifd.c | 17 +-- migration/multifd.h | 6 + migration/options.c | 5 - migration/ram.c | 7 +- migration/savevm.c | 13 +++ migration/savevm.h | 1 + tests/qtest/migration/compression-tests.c | 37 +++++- tests/qtest/migration/cpr-tests.c | 6 +- tests/qtest/migration/file-tests.c | 58 +++++---- tests/qtest/migration/framework.c | 76 ++++++++---- tests/qtest/migration/framework.h | 9 +- tests/qtest/migration/misc-tests.c | 4 +- tests/qtest/migration/postcopy-tests.c | 35 +++++- tests/qtest/migration/precopy-tests.c | 48 +++++--- tests/qtest/migration/tls-tests.c | 69 ++++++++++- 17 files changed, 388 insertions(+), 167 deletions(-)