[v2,0/7] migration: Add precopy initial data capability and VFIO precopy support

Message ID 20230517155219.10691-1-avihaih@nvidia.com (mailing list archive)

Avihai Horon May 17, 2023, 3:52 p.m. UTC
Hello everyone,

This is v2 of the precopy initial data series.
I am still wondering about the name -- maybe "explicit switchover", as
suggested by Peter, is better as it's more general?

Anyway,

Changes from v1 [3]:
* Rebased on latest master branch.
* Updated to latest QAPI doc comment conventions and refined
  QAPI docs and capability error message. (Markus)
* Followed Peter's/Juan's suggestion and removed the handshake between
  source and destination.
  Now the capability must be set on both source and destination.
  Compatibility of this feature between different QEMU versions or
  different host capabilities (i.e., kernel) is achieved in the regular
  way of device properties and hw_compat_x_y.
* Replaced is_initial_data_active() and initial_data_loaded()
  SaveVMHandlers handlers with a notification mechanism. (Peter)
* Set the capability also in destination in the migration test.
* Added VFIO device property x-allow-pre-copy to be able to preserve
  compatibility between different QEMU versions or different host
  capabilities (i.e., kernel).
* Changed VFIO precopy initial data implementation according to the
  above changes.
* Documented VFIO precopy initial data support in VFIO migration
  documentation.
* Added R-bs.

===

This series adds a new migration capability called "precopy initial
data". The purpose of this capability is to reduce migration downtime in
cases where loading of migration data in the destination can take a lot
of time, such as with VFIO migration data.

The series then adds precopy support and precopy initial data support
for VFIO migration.

Precopy initial data is used by VFIO migration, but other migration
users can add support for it and use it as well.

=== Background ===

The migration downtime estimate is calculated from the bandwidth and the
remaining migration data. This assumes that loading migration data in
the destination takes a negligible amount of time and that downtime
depends only on network speed.

While this may be true for RAM, it's not necessarily true for other
migration users. For example, loading the data of a VFIO device in the
destination might require the device to allocate resources and prepare
internal data structures, which can take a significant amount of time.

This poses a problem, as the source may think that the remaining
migration data is small enough to meet the downtime limit, so it will
stop the VM and complete the migration, but in fact sending and loading
the data in the destination may take longer than the downtime limit.
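
The mismatch described above can be illustrated with a minimal sketch.
This is not QEMU code: the function names and the separate load_ms term
are assumptions made purely for illustration, and the real estimation
logic in the migration code is more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative only; none of these names exist in QEMU.
 * Estimated downtime if the VM were stopped now, assuming it depends
 * only on the remaining data and the measured bandwidth.
 */
uint64_t estimated_downtime_ms(uint64_t pending_bytes, uint64_t bytes_per_ms)
{
    assert(bytes_per_ms > 0);
    return pending_bytes / bytes_per_ms;
}

/* The source stops the VM once the estimate fits the downtime limit. */
bool source_would_switch_over(uint64_t pending_bytes, uint64_t bytes_per_ms,
                              uint64_t downtime_limit_ms)
{
    return estimated_downtime_ms(pending_bytes, bytes_per_ms)
           <= downtime_limit_ms;
}

/*
 * Downtime actually observed when the destination needs load_ms to load
 * the data (a hypothetical cost that the estimate above does not see).
 */
uint64_t observed_downtime_ms(uint64_t pending_bytes, uint64_t bytes_per_ms,
                              uint64_t load_ms)
{
    return estimated_downtime_ms(pending_bytes, bytes_per_ms) + load_ms;
}
```

With made-up numbers, a source that sees 200000 pending bytes at 1000
bytes/ms and a 300ms limit would switch over, yet a destination that
needs a further 500ms to load the data would push the observed downtime
well past the limit.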

To solve this, the VFIO migration uAPI defines "initial bytes" as part
of its precopy stream [1]. Initial bytes can be used in various ways to
improve VFIO migration performance. For example, they can be used to
transfer device metadata to pre-allocate resources in the destination.
However, for this to work we need to make sure that all initial bytes
are sent and loaded in the destination before the source VM is stopped.

The new precopy initial data migration capability helps us achieve this.
It allows the source to send initial precopy data and the destination to
ACK that this data has been loaded. Migration will not attempt to stop
the source VM and complete the migration until this ACK is received.

Note that this relies on the return path capability to communicate from
the destination back to the source.
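
As a rough sketch of the gating rule described above (not the code in
this series; the struct, its fields, and the function name are
hypothetical), the source-side decision could look like this:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical source-side state; field names are not taken from QEMU. */
struct source_state {
    bool initial_data_enabled;  /* capability is set on this side */
    bool initial_data_acked;    /* ACK arrived over the return path */
};

/*
 * Returns true if the source may stop the VM and complete migration.
 * With the capability enabled, the usual "remaining data is small
 * enough" check is only consulted after the destination has ACKed.
 */
bool may_stop_source_vm(const struct source_state *s,
                        uint64_t pending_bytes, uint64_t threshold_bytes)
{
    if (s->initial_data_enabled && !s->initial_data_acked) {
        return false;
    }
    return pending_bytes <= threshold_bytes;
}
```

When the capability is disabled, the sketch falls back to the plain
threshold check, so behavior is unchanged for existing users.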

=== Flow of operation ===

To use precopy initial data, the capability must be enabled in both the
source and the destination.

During migration setup, migration code calls the initial_data_advise()
SaveVMHandlers handler of the migration users, both in the source and
the destination, to notify them that precopy initial data is used.
In the destination, an "initial_data_pending_num" counter is increased
for each migration user that supports this feature. It will be used
later to mark when an ACK should be sent to the source.

Migration starts sending precopy data, which includes the initial
precopy data. Initial precopy data is just like any other precopy data
and as such, migration code is not aware of it. Therefore, it's the
responsibility of the migration users (such as VFIO devices) to notify
their counterparts in the destination that their initial precopy data
has been sent (for example, VFIO migration does it when its initial
bytes reach zero).
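
A minimal sketch of that per-user bookkeeping in the source follows. It
is an assumption about how a migration user might track this, not the
VFIO implementation in this series; the struct and function names are
hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-user source state; not QEMU's actual VFIO structures. */
struct source_user {
    uint64_t initial_bytes;  /* initial precopy data still to be sent */
    bool notified;           /* destination was already told */
};

/*
 * Account for `sent` bytes of initial data. Returns true exactly once,
 * when the remaining initial bytes reach zero and the user should
 * notify its counterpart in the destination.
 */
bool source_user_account(struct source_user *u, uint64_t sent)
{
    /* Clamp so the counter never wraps below zero. */
    u->initial_bytes -= (sent < u->initial_bytes) ? sent : u->initial_bytes;
    if (u->initial_bytes == 0 && !u->notified) {
        u->notified = true;
        return true;
    }
    return false;
}
```

The notified flag keeps the notification from being repeated if more
precopy data is sent after the initial bytes are exhausted.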

In the destination, when a migration user finishes receiving and loading
its initial data, it notifies the migration code about it and the
"initial_data_pending_num" counter is decreased. When this counter
reaches zero, it means that all initial data has been loaded in the
destination and an ACK is sent to the source, which will now be able to
complete migration when appropriate.
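
The destination-side counting can be sketched as below. Only the
counter name "initial_data_pending_num" comes from this cover letter;
the struct, the function names, and the ack_sent flag (standing in for
the message sent over the return path) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical destination-side state for illustration. */
struct dest_state {
    unsigned int initial_data_pending_num;  /* users still loading */
    bool ack_sent;                          /* stands in for the ACK */
};

/* Setup: called once per migration user that supports the feature. */
void dest_user_advised(struct dest_state *d)
{
    d->initial_data_pending_num++;
}

/*
 * Called when one user has received and loaded its initial data.
 * Returns true if this was the last pending user, i.e. the ACK is due.
 */
bool dest_user_loaded(struct dest_state *d)
{
    assert(d->initial_data_pending_num > 0);
    if (--d->initial_data_pending_num == 0) {
        d->ack_sent = true;
        return true;
    }
    return false;
}
```

The sketch does not cover the case where no migration user supports the
feature; how that case is handled is left to the patches themselves.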

=== Test results ===

The table below shows the downtime of two identical migrations. In the
first migration precopy initial data is disabled and in the second it is
enabled. The migrated VM is assigned an mlx5 VFIO device which has
300MB of device data to be migrated.

+----------------------+-----------------------+----------+
| Precopy initial data | VFIO device data size | Downtime |
+----------------------+-----------------------+----------+
|       Disabled       |         300MB         |  1900ms  |
|       Enabled        |         300MB         |  420ms   |
+----------------------+-----------------------+----------+

Precopy initial data reduces downtime by a factor of roughly 4.5.
The 1480ms difference is time that is used for resource allocation for
the VFIO device in the destination. Without precopy initial data, this
time is spent while the source VM is stopped and thus the downtime is
much higher. With precopy initial data, the time is spent while the
source VM is still running.

=== Patch breakdown ===

- Patches 1-4 add the precopy initial data capability.
- Patches 5-6 add VFIO migration precopy support. A similar version of
  them was previously sent here [2].
- Patch 7 adds precopy initial data support for VFIO migration.

Thanks for reviewing!

[1]
https://elixir.bootlin.com/linux/latest/source/include/uapi/linux/vfio.h#L1048

[2]
https://lore.kernel.org/qemu-devel/20230222174915.5647-3-avihaih@nvidia.com/

[3]
https://lore.kernel.org/qemu-devel/20230501140141.11743-1-avihaih@nvidia.com/

Avihai Horon (7):
  migration: Add precopy initial data capability
  migration: Implement precopy initial data logic
  migration: Enable precopy initial data capability
  tests: Add migration precopy initial data capability test
  vfio/migration: Refactor vfio_save_block() to return saved data size
  vfio/migration: Add VFIO migration pre-copy support
  vfio/migration: Add support for precopy initial data capability

 docs/devel/vfio-migration.rst |  45 +++++--
 qapi/migration.json           |   9 +-
 include/hw/vfio/vfio-common.h |   6 +
 include/migration/register.h  |   6 +
 migration/migration.h         |  14 +++
 migration/options.h           |   1 +
 migration/savevm.h            |   2 +
 hw/core/machine.c             |   1 +
 hw/vfio/common.c              |   6 +-
 hw/vfio/migration.c           | 220 +++++++++++++++++++++++++++++++---
 hw/vfio/pci.c                 |   2 +
 migration/migration.c         |  40 ++++++-
 migration/options.c           |  17 +++
 migration/savevm.c            |  65 ++++++++++
 tests/qtest/migration-test.c  |  26 ++++
 hw/vfio/trace-events          |   4 +-
 migration/trace-events        |   4 +
 17 files changed, 435 insertions(+), 33 deletions(-)