Message ID | 20230530144821.1557-1-avihaih@nvidia.com (mailing list archive) |
---|---|
Headers | show |
Series | migration: Add switchover ack capability and VFIO precopy support | expand |
Tested-by: YangHang Liu <yanghliu@redhat.com> On Wed, May 31, 2023 at 1:01 AM Avihai Horon <avihaih@nvidia.com> wrote: > > Hello everyone, > > This is v5 of the switchover ack series. > > Changes from v4 [6]: > * Removed superfluous '"' in vfio_save_iterate() trace. (Cedric) > * Removed VFIOMigration->switchover_ack_needed and computed it locally > when needed. (Cedric) > * Added R-bs. > > Changes from v3 [5]: > * Rebased on latest master branch. > * Simplified switchover ack logic (call switchover_ack_needed only in > destination). (Peter) > * Moved caching of VFIO migration flags to a separate patch. (Cedric) > * Moved adding of x-allow-pre-copy property to a separate patch. (Cedric) > * Reset VFIOMigration->precopy_{init,dirty}_size in vfio_query_precopy_size() > and in vfio_save_cleanup(). (Cedric) > * Added a reference to VFIO uAPI in vfio_save_block() ENOMSG comment. (Cedric) > * Added VFIOMigration->precopy_{init,dirty}_size to trace_vfio_save_iterate(). (Cedric) > * Adapted VFIO migration to switchover ack logic simplification: > - Checked migrate_switchover_ack() in vfio_{save,load}_setup() and set > VFIOMigration->switchover_ack_needed accordingly. > - vfio_switchover_ack_needed() doesn't set VFIOMigration->switchover_ack_needed > and only returns its value. > * Move VFIOMigration->switchover_ack_needed = false to vfio_migration_cleanup() > so it will be set to false both in src and dest. > * Fixed a few typos/coding style. (Peter/Cedric) > * Added R-b/A-b (didn't add Cedric's R-b on patch #7 as switchover ack > changes in patch #2 introduced some changes to patch #7 as well). > > Changes from v2 [4]: > * Rebased on latest master branch. > * Changed the capability name to "switchover-ack" and the related > code/docs accordingly. (Peter) > * Added a counter for the number of switchover ack users in the source > and used it to skip switchover ack if there are no users (instead of > setting the switchover acked flag to true). (Peter) > * Added R-bs. > > Changes from v1 [3]: > * Rebased on latest master branch. > * Updated to latest QAPI doc comment conventions and refined > QAPI docs and capability error message. (Markus) > * Followed Peter/Juan suggestion and removed the handshake between > source and destination. > Now the capability must be set on both source and destination. > Compatibility of this feature between different QEMU versions or > different host capabilities (i.e., kernel) is achieved in the regular > way of device properties and hw_comapt_x_y. > * Replaced is_initial_data_active() and initial_data_loaded() > SaveVMHandlers handlers with a notification mechanism. (Peter) > * Set the capability also in destination in the migration test. > * Added VFIO device property x-allow-pre-copy to be able to preserve > compatibility between different QEMU versions or different host > capabilities (i.e., kernel). > * Changed VFIO precopy initial data implementation according to the > above changes. > * Documented VFIO precopy initial data support in VFIO migration > documentation. > * Added R-bs. > > === > > This series adds a new migration capability called "switchover ack". The > purpose of this capability is to reduce migration downtime in cases > where loading of migration data in the destination can take a lot of > time, such as with VFIO migration data. > > The series then moves to add precopy support and switchover ack support > for VFIO migration. > > Switchover ack is used by VFIO migration, but other migrated devices can > add support for it and use it as well. > > === Background === > > Migration downtime estimation is calculated based on bandwidth and > remaining migration data. This assumes that loading of migration data in > the destination takes a negligible amount of time and that downtime > depends only on network speed. > > While this may be true for RAM, it's not necessarily true for other > migrated devices. For example, loading the data of a VFIO device in the > destination might require from the device to allocate resources and > prepare internal data structures which can take a significant amount of > time to do. > > This poses a problem, as the source may think that the remaining > migration data is small enough to meet the downtime limit, so it will > stop the VM and complete the migration, but in fact sending and loading > the data in the destination may take longer than the downtime limit. > > To solve this, VFIO migration uAPI defines "initial bytes" as part of > its precopy stream [1]. Initial bytes can be used in various ways to > improve VFIO migration performance. For example, it can be used to > transfer device metadata to pre-allocate resources in the destination. > However, for this to work we need to make sure that all initial bytes > are sent and loaded in the destination before the source VM is stopped. > > The new switchover ack migration capability helps us achieve this. > It prevents the source from stopping the VM and completing the migration > until an ACK is received from the destination that it's OK to do so. > Thus, a VFIO device can make sure that its initial bytes were sent > and loaded in the destination before the source VM is stopped. > > Note that this relies on the return path capability to communicate from > the destination back to the source. > > === Flow of operation === > > To use switchover ack, the capability must be enabled in both the source > and the destination. > > During migration setup, migration code in the destination calls the > switchover_ack_needed() SaveVMHandlers handler of the migrated devices > to check if switchover ack is used by them. > A "switchover_ack_pending_num" counter is increased for each migrated > device that supports this feature. It will be used later to mark when an > ACK should be sent to the source. > > Migration is active and the source starts to send precopy data as usual. > In the destination, when a migrated device thinks it's OK to do > switchover, it notifies the migration code about it and the > "switchover_ack_pending_num" counter is decreased. For example, for a > VFIO device it's when the device receives and loads its initial bytes. > > When the "switchover_ack_pending_num" counter reaches zero, it means > that all devices agree to do switchover and an ACK is sent to the > source, which will now be able to complete the migration when > appropriate. > > === Test results === > > The below table shows the downtime of two identical migrations. In the > first migration swithcover ack is disabled and in the second it is > enabled. The migrated VM is assigned with a mlx5 VFIO device which has > 300MB of device data to be migrated. > > +----------------------+-----------------------+----------+ > | Switchover ack | VFIO device data size | Downtime | > +----------------------+-----------------------+----------+ > | Disabled | 300MB | 1900ms | > | Enabled | 300MB | 420ms | > +----------------------+-----------------------+----------+ > > Switchover ack gives a roughly 4.5 times improvement in downtime. > The 1480ms difference is time that is used for resource allocation for > the VFIO device in the destination. Without switchover ack, this time is > spent when the source VM is stopped and thus the downtime is much > higher. With switchover ack, the time is spent when the source VM is > still running. > > === Patch breakdown === > > - Patches 1-4 add the switchover ack capability. > - Patches 5-8 add VFIO migration precopy support. Similar version of > them was previously sent here [2]. > - Patch 9 adds switchover ack support for VFIO migration. > > Thanks for reviewing! > > [1] > https://elixir.bootlin.com/linux/latest/source/include/uapi/linux/vfio.h#L1048 > > [2] > https://lore.kernel.org/qemu-devel/20230222174915.5647-3-avihaih@nvidia.com/ > > [3] > https://lore.kernel.org/qemu-devel/20230501140141.11743-1-avihaih@nvidia.com/ > > [4] > https://lore.kernel.org/qemu-devel/20230517155219.10691-1-avihaih@nvidia.com/ > > [5] > https://lore.kernel.org/qemu-devel/20230521151808.24804-1-avihaih@nvidia.com/ > > [6] > https://lore.kernel.org/qemu-devel/20230528140652.8693-1-avihaih@nvidia.com/ > > Avihai Horon (9): > migration: Add switchover ack capability > migration: Implement switchover ack logic > migration: Enable switchover ack capability > tests: Add migration switchover ack capability test > vfio/migration: Refactor vfio_save_block() to return saved data size > vfio/migration: Store VFIO migration flags in VFIOMigration > vfio/migration: Add VFIO migration pre-copy support > vfio/migration: Add x-allow-pre-copy VFIO device property > vfio/migration: Add support for switchover ack capability > > docs/devel/vfio-migration.rst | 45 +++++-- > qapi/migration.json | 12 +- > include/hw/vfio/vfio-common.h | 5 + > include/migration/register.h | 2 + > migration/migration.h | 14 +++ > migration/options.h | 1 + > migration/savevm.h | 1 + > hw/core/machine.c | 1 + > hw/vfio/common.c | 6 +- > hw/vfio/migration.c | 221 +++++++++++++++++++++++++++++++--- > hw/vfio/pci.c | 2 + > migration/migration.c | 32 ++++- > migration/options.c | 17 +++ > migration/savevm.c | 54 +++++++++ > tests/qtest/migration-test.c | 26 ++++ > hw/vfio/trace-events | 4 +- > migration/trace-events | 3 + > 17 files changed, 413 insertions(+), 33 deletions(-) > > -- > 2.26.3 > >