[v3,0/7] migration: Add switchover ack capability and VFIO precopy support

Message ID 20230521151808.24804-1-avihaih@nvidia.com (mailing list archive)

Avihai Horon May 21, 2023, 3:18 p.m. UTC
Hello everyone,

This is v3 of the switchover ack series (previously called precopy
initial data).

Changes from v2 [4]:
* Rebased on latest master branch.
* Changed the capability name to "switchover-ack" and the related
  code/docs accordingly. (Peter)
* Added a counter for the number of switchover ack users in the source
  and used it to skip switchover ack if there are no users (instead of
  setting the switchover acked flag to true). (Peter)
* Added R-bs.

Changes from v1 [3]:
* Rebased on latest master branch.
* Updated to latest QAPI doc comment conventions and refined
  QAPI docs and capability error message. (Markus)
* Followed Peter/Juan suggestion and removed the handshake between
  source and destination.
  Now the capability must be set on both source and destination.
  Compatibility of this feature between different QEMU versions or
  different host capabilities (i.e., kernel) is achieved in the regular
  way of device properties and hw_compat_x_y.
* Replaced is_initial_data_active() and initial_data_loaded()
  SaveVMHandlers handlers with a notification mechanism. (Peter)
* Set the capability also in destination in the migration test.
* Added VFIO device property x-allow-pre-copy to be able to preserve
  compatibility between different QEMU versions or different host
  capabilities (i.e., kernel).
* Changed VFIO precopy initial data implementation according to the
  above changes.
* Documented VFIO precopy initial data support in VFIO migration
  documentation.
* Added R-bs.

===

This series adds a new migration capability called "switchover ack". The
purpose of this capability is to reduce migration downtime in cases
where loading of migration data in the destination can take a lot of
time, such as with VFIO migration data.

The series then moves to add precopy support and switchover ack support
for VFIO migration.

Switchover ack is used by VFIO migration, but other migrated devices can
add support for it and use it as well.

=== Background ===

Migration downtime estimation is calculated based on bandwidth and
remaining migration data. This assumes that loading of migration data in
the destination takes a negligible amount of time and that downtime
depends only on network speed.

While this may be true for RAM, it's not necessarily true for other
migrated devices. For example, loading the data of a VFIO device in the
destination might require the device to allocate resources and prepare
internal data structures, which can take a significant amount of time.

This poses a problem, as the source may think that the remaining
migration data is small enough to meet the downtime limit, so it will
stop the VM and complete the migration, but in fact sending and loading
the data in the destination may take longer than the downtime limit.
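
The gap between the estimated and the actual downtime can be sketched
as follows. This is a simplified model and not QEMU code: the function
names and all numeric values are hypothetical, chosen only to
illustrate the problem.

```python
# Simplified model of the downtime estimate described above.
# All names and numbers are hypothetical; this is not QEMU code.

def estimated_downtime_ms(remaining_bytes: int,
                          bandwidth_bytes_per_ms: float) -> float:
    """Estimate that accounts for transfer time only."""
    return remaining_bytes / bandwidth_bytes_per_ms


def actual_downtime_ms(remaining_bytes: int,
                       bandwidth_bytes_per_ms: float,
                       load_time_ms: float) -> float:
    """Transfer time plus the time the destination needs to load the data."""
    transfer = estimated_downtime_ms(remaining_bytes, bandwidth_bytes_per_ms)
    return transfer + load_time_ms


downtime_limit_ms = 300.0
remaining = 100 * 1024 * 1024   # 100 MiB still to be sent (made up)
bandwidth = 1024 * 1024         # 1 MiB per millisecond (made up)
device_load_ms = 1000.0         # destination-side load cost (made up)

estimate = estimated_downtime_ms(remaining, bandwidth)
actual = actual_downtime_ms(remaining, bandwidth, device_load_ms)

# The source decides based on the estimate alone.
source_decides_to_stop = estimate <= downtime_limit_ms
limit_actually_met = actual <= downtime_limit_ms
print(source_decides_to_stop, limit_actually_met)
```

With these made-up values the source stops the VM because the estimate
is within the limit, while the real downtime exceeds it.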

To solve this, the VFIO migration uAPI defines "initial bytes" as part
of its precopy stream [1]. Initial bytes can be used in various ways to
improve VFIO migration performance. For example, they can be used to
transfer device metadata to pre-allocate resources in the destination.
However, for this to work we need to make sure that all initial bytes
are sent and loaded in the destination before the source VM is stopped.

The new switchover ack migration capability helps us achieve this.
It prevents the source from stopping the VM and completing the migration
until an ACK is received from the destination that it's OK to do so.
Thus, a VFIO device can make sure that its initial bytes were sent
and loaded in the destination before the source VM is stopped.

Note that this relies on the return path capability to communicate from
the destination back to the source.
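
The source-side gate can be sketched roughly as below. This is only an
illustration of the behavior described in this cover letter, including
the "no users" skip mentioned in the v2 changes. The class, field and
function names are hypothetical and do not match the actual patches.

```python
# Illustrative model of the source-side switchover gate.
# Names are hypothetical; this is not the code from the patches.
from dataclasses import dataclass


@dataclass
class SourceState:
    switchover_ack_enabled: bool  # capability set on the source
    ack_users: int                # devices that use switchover ack
    ack_received: bool = False    # set when the destination's ACK arrives


def may_stop_vm(state: SourceState) -> bool:
    """Return True if switchover ack does not block stopping the VM."""
    if not state.switchover_ack_enabled:
        return True               # capability off: behave as before
    if state.ack_users == 0:
        return True               # nobody uses it: skip waiting for an ACK
    return state.ack_received     # otherwise wait for the destination


state = SourceState(switchover_ack_enabled=True, ack_users=1)
before_ack = may_stop_vm(state)
state.ack_received = True         # ACK arrives over the return path
after_ack = may_stop_vm(state)
print(before_ack, after_ack)
```

This models only the ACK gate. The usual downtime limit check still
decides when the source actually completes the migration.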

=== Flow of operation ===

To use switchover ack, the capability must be enabled in both the source
and the destination.
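
As a rough illustration, the QMP command for enabling the capability
could be built as below. "migrate-set-capabilities" and "return-path"
are existing QMP names, and "switchover-ack" is the name used in this
series. The helper function is hypothetical, and sending the command to
the QMP monitors of the source and the destination is left out.

```python
# Builds the QMP command that enables the capabilities on one side.
# The helper name is hypothetical; the command must be sent to both
# the source and the destination QEMU.
import json


def set_capabilities_cmd(*names: str) -> dict:
    """Return a migrate-set-capabilities command enabling `names`."""
    return {
        "execute": "migrate-set-capabilities",
        "arguments": {
            "capabilities": [
                {"capability": name, "state": True} for name in names
            ]
        },
    }


cmd = set_capabilities_cmd("return-path", "switchover-ack")
print(json.dumps(cmd, indent=2))
```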

During migration setup, migration code calls the switchover_ack_needed()
SaveVMHandlers handler of the migrated devices, both in the source and
the destination, to check if switchover ack is used by them. In the
destination, a "switchover_ack_pending_num" counter is incremented for
each migrated device that supports this feature. The counter is used
later to determine when an ACK should be sent to the source.

Once migration is active, the source starts to send precopy data as
usual. In the destination, when a migrated device decides it's OK to do
switchover, it notifies the migration code and the
"switchover_ack_pending_num" counter is decremented. For example, a
VFIO device does so when it has received and loaded its initial bytes.

When the "switchover_ack_pending_num" counter reaches zero, it means
that all devices agree to do switchover and an ACK is sent to the
source, which will now be able to complete the migration when
appropriate.
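
The destination-side counting above can be modeled with the short
sketch below. It is not the code from the patches: apart from
switchover_ack_needed() and switchover_ack_pending_num, all names are
hypothetical, and the case of zero users in the destination is not
modeled.

```python
# Toy model of the destination-side ACK counter described above.
# Not QEMU code; class and method names other than
# switchover_ack_needed/switchover_ack_pending_num are hypothetical.
from typing import Callable, Iterable


class Device:
    def __init__(self, name: str, uses_switchover_ack: bool):
        self.name = name
        self.uses_switchover_ack = uses_switchover_ack

    def switchover_ack_needed(self) -> bool:
        return self.uses_switchover_ack


class DestinationMigration:
    def __init__(self, devices: Iterable[Device],
                 send_ack: Callable[[], None]):
        self._send_ack = send_ack
        # Setup: count the devices that use switchover ack.
        self.switchover_ack_pending_num = sum(
            1 for dev in devices if dev.switchover_ack_needed()
        )

    def approve_switchover(self) -> None:
        """Called by a device once it is OK with switchover."""
        if self.switchover_ack_pending_num == 0:
            raise RuntimeError("no pending switchover approvals")
        self.switchover_ack_pending_num -= 1
        if self.switchover_ack_pending_num == 0:
            self._send_ack()  # all devices agreed: ACK the source


acks = []
dst = DestinationMigration(
    [Device("vfio0", True), Device("vfio1", True), Device("ram", False)],
    send_ack=lambda: acks.append("ACK"),
)
dst.approve_switchover()      # first VFIO device loaded its initial bytes
acks_after_first = list(acks)
dst.approve_switchover()      # second VFIO device loaded its initial bytes
print(acks_after_first, acks)
```

The ACK is sent exactly once, when the last device that uses switchover
ack has approved.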

=== Test results ===

The table below shows the downtime of two identical migrations. In the
first migration switchover ack is disabled and in the second it is
enabled. The migrated VM is assigned an mlx5 VFIO device which has
300MB of device data to be migrated.

+----------------------+-----------------------+----------+
|    Switchover ack    | VFIO device data size | Downtime |
+----------------------+-----------------------+----------+
|       Disabled       |         300MB         |  1900ms  |
|       Enabled        |         300MB         |  420ms   |
+----------------------+-----------------------+----------+

Switchover ack reduces downtime by a factor of roughly 4.5.
The 1480ms difference is time that is used for resource allocation for
the VFIO device in the destination. Without switchover ack, this time is
spent while the source VM is stopped and thus the downtime is much
higher. With switchover ack, the time is spent while the source VM is
still running.
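
For reference, both figures follow directly from the table. The
variable names in this small check are only illustrative.

```python
# Recomputes the figures quoted above from the table values.
downtime_disabled_ms = 1900   # switchover ack disabled
downtime_enabled_ms = 420     # switchover ack enabled

difference_ms = downtime_disabled_ms - downtime_enabled_ms
improvement = downtime_disabled_ms / downtime_enabled_ms

print(f"difference: {difference_ms}ms, improvement: {improvement:.2f}x")
# prints "difference: 1480ms, improvement: 4.52x"
```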

=== Patch breakdown ===

- Patches 1-4 add the switchover ack capability.
- Patches 5-6 add VFIO migration precopy support. A similar version of
  them was previously sent here [2].
- Patch 7 adds switchover ack support for VFIO migration.

Thanks for reviewing!

[1]
https://elixir.bootlin.com/linux/latest/source/include/uapi/linux/vfio.h#L1048

[2]
https://lore.kernel.org/qemu-devel/20230222174915.5647-3-avihaih@nvidia.com/

[3]
https://lore.kernel.org/qemu-devel/20230501140141.11743-1-avihaih@nvidia.com/

[4]
https://lore.kernel.org/qemu-devel/20230517155219.10691-1-avihaih@nvidia.com/

Avihai Horon (7):
  migration: Add switchover ack capability
  migration: Implement switchover ack logic
  migration: Enable switchover ack capability
  tests: Add migration switchover ack capability test
  vfio/migration: Refactor vfio_save_block() to return saved data size
  vfio/migration: Add VFIO migration pre-copy support
  vfio/migration: Add support for switchover ack capability

 docs/devel/vfio-migration.rst |  45 +++++--
 qapi/migration.json           |  12 +-
 include/hw/vfio/vfio-common.h |   6 +
 include/migration/register.h  |   3 +
 migration/migration.h         |  16 +++
 migration/options.h           |   1 +
 migration/savevm.h            |   2 +
 hw/core/machine.c             |   1 +
 hw/vfio/common.c              |   6 +-
 hw/vfio/migration.c           | 220 +++++++++++++++++++++++++++++++---
 hw/vfio/pci.c                 |   2 +
 migration/migration.c         |  42 ++++++-
 migration/options.c           |  17 +++
 migration/savevm.c            |  56 +++++++++
 tests/qtest/migration-test.c  |  26 ++++
 hw/vfio/trace-events          |   4 +-
 migration/trace-events        |   4 +
 17 files changed, 430 insertions(+), 33 deletions(-)