diff mbox series

[v3,6/7] vfio/migration: Add VFIO migration pre-copy support

Message ID 20230521151808.24804-7-avihaih@nvidia.com (mailing list archive)
State New, archived
Headers show
Series migration: Add switchover ack capability and VFIO precopy support | expand

Commit Message

Avihai Horon May 21, 2023, 3:18 p.m. UTC
Pre-copy support allows the VFIO device data to be transferred while the
VM is running. This helps to accommodate VFIO devices that have a large
amount of data that needs to be transferred, and it can reduce migration
downtime.

Pre-copy support is optional in VFIO migration protocol v2.
Implement pre-copy of VFIO migration protocol v2 and use it for devices
that support it. Full description of it can be found here [1].

In addition, add a new VFIO device property x-allow-pre-copy to keep
migration compatibility to/from older QEMU versions that don't have VFIO
pre-copy support.

[1]
https://lore.kernel.org/kvm/20221206083438.37807-3-yishaih@nvidia.com/

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 docs/devel/vfio-migration.rst |  35 +++++---
 include/hw/vfio/vfio-common.h |   4 +
 hw/core/machine.c             |   1 +
 hw/vfio/common.c              |   6 +-
 hw/vfio/migration.c           | 163 ++++++++++++++++++++++++++++++++--
 hw/vfio/pci.c                 |   2 +
 hw/vfio/trace-events          |   4 +-
 7 files changed, 193 insertions(+), 22 deletions(-)

Comments

Cédric Le Goater May 23, 2023, 2:56 p.m. UTC | #1
Hello Avihai,

On 5/21/23 17:18, Avihai Horon wrote:
> Pre-copy support allows the VFIO device data to be transferred while the
> VM is running. This helps to accommodate VFIO devices that have a large
> amount of data that needs to be transferred, and it can reduce migration
> downtime.
> 
> Pre-copy support is optional in VFIO migration protocol v2.
> Implement pre-copy of VFIO migration protocol v2 and use it for devices
> that support it. Full description of it can be found here [1].
> 
> In addition, add a new VFIO device property x-allow-pre-copy to keep
> migration compatibility to/from older QEMU versions that don't have VFIO
> pre-copy support.
> 
> [1]
> https://lore.kernel.org/kvm/20221206083438.37807-3-yishaih@nvidia.com/


May be simply reference Linux commit 4db52602a607 ("vfio: Extend the device
migration protocol with PRE_COPY") instead.

some comments below,


> 
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> ---
>   docs/devel/vfio-migration.rst |  35 +++++---
>   include/hw/vfio/vfio-common.h |   4 +
>   hw/core/machine.c             |   1 +
>   hw/vfio/common.c              |   6 +-
>   hw/vfio/migration.c           | 163 ++++++++++++++++++++++++++++++++--
>   hw/vfio/pci.c                 |   2 +
>   hw/vfio/trace-events          |   4 +-
>   7 files changed, 193 insertions(+), 22 deletions(-)
> 
> diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
> index 1b68ccf115..e896b2a673 100644
> --- a/docs/devel/vfio-migration.rst
> +++ b/docs/devel/vfio-migration.rst
> @@ -7,12 +7,14 @@ the guest is running on source host and restoring this saved state on the
>   destination host. This document details how saving and restoring of VFIO
>   devices is done in QEMU.
>   
> -Migration of VFIO devices currently consists of a single stop-and-copy phase.
> -During the stop-and-copy phase the guest is stopped and the entire VFIO device
> -data is transferred to the destination.
> -
> -The pre-copy phase of migration is currently not supported for VFIO devices.
> -Support for VFIO pre-copy will be added later on.
> +Migration of VFIO devices consists of two phases: the optional pre-copy phase,
> +and the stop-and-copy phase. The pre-copy phase is iterative and allows to
> +accommodate VFIO devices that have a large amount of data that needs to be
> +transferred. The iterative pre-copy phase of migration allows for the guest to
> +continue whilst the VFIO device state is transferred to the destination, this
> +helps to reduce the total downtime of the VM. VFIO devices opt-in to pre-copy
> +support by reporting the VFIO_MIGRATION_PRE_COPY flag in the
> +VFIO_DEVICE_FEATURE_MIGRATION ioctl.
>   
>   Note that currently VFIO migration is supported only for a single device. This
>   is due to VFIO migration's lack of P2P support. However, P2P support is planned
> @@ -29,10 +31,20 @@ VFIO implements the device hooks for the iterative approach as follows:
>   * A ``load_setup`` function that sets the VFIO device on the destination in
>     _RESUMING state.
>   
> +* A ``state_pending_estimate`` function that reports an estimate of the
> +  remaining pre-copy data that the vendor driver has yet to save for the VFIO
> +  device.
> +
>   * A ``state_pending_exact`` function that reads pending_bytes from the vendor
>     driver, which indicates the amount of data that the vendor driver has yet to
>     save for the VFIO device.
>   
> +* An ``is_active_iterate`` function that indicates ``save_live_iterate`` is
> +  active only when the VFIO device is in pre-copy states.
> +
> +* A ``save_live_iterate`` function that reads the VFIO device's data from the
> +  vendor driver during iterative pre-copy phase.
> +
>   * A ``save_state`` function to save the device config space if it is present.
>   
>   * A ``save_live_complete_precopy`` function that sets the VFIO device in
> @@ -111,8 +123,10 @@ Flow of state changes during Live migration
>   ===========================================
>   
>   Below is the flow of state change during live migration.
> -The values in the brackets represent the VM state, the migration state, and
> +The values in the parentheses represent the VM state, the migration state, and
>   the VFIO device state, respectively.
> +The text in the square brackets represents the flow if the VFIO device supports
> +pre-copy.
>   
>   Live migration save path
>   ------------------------
> @@ -124,11 +138,12 @@ Live migration save path
>                                     |
>                        migrate_init spawns migration_thread
>                   Migration thread then calls each device's .save_setup()
> -                       (RUNNING, _SETUP, _RUNNING)
> +                  (RUNNING, _SETUP, _RUNNING [_PRE_COPY])
>                                     |
> -                      (RUNNING, _ACTIVE, _RUNNING)
> -             If device is active, get pending_bytes by .state_pending_exact()
> +                  (RUNNING, _ACTIVE, _RUNNING [_PRE_COPY])
> +      If device is active, get pending_bytes by .state_pending_{estimate,exact}()
>             If total pending_bytes >= threshold_size, call .save_live_iterate()
> +                  [Data of VFIO device for pre-copy phase is copied]
>           Iterate till total pending bytes converge and are less than threshold
>                                     |
>     On migration completion, vCPU stops and calls .save_live_complete_precopy for
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index eed244f25f..5ce7a01d56 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -66,6 +66,9 @@ typedef struct VFIOMigration {
>       int data_fd;
>       void *data_buffer;
>       size_t data_buffer_size;
> +    uint64_t precopy_init_size;
> +    uint64_t precopy_dirty_size;
> +    uint64_t mig_flags;

It would have been cleaner to introduce VFIOMigration::mig_flags and its
update in another patch. This is minor.

>   } VFIOMigration;
>   
>   typedef struct VFIOAddressSpace {
> @@ -143,6 +146,7 @@ typedef struct VFIODevice {
>       VFIOMigration *migration;
>       Error *migration_blocker;
>       OnOffAuto pre_copy_dirty_page_tracking;
> +    bool allow_pre_copy;

same comment for this bool and the associated property, because it would
ease backports.

>       bool dirty_pages_supported;
>       bool dirty_tracking;
>   } VFIODevice;
> diff --git a/hw/core/machine.c b/hw/core/machine.c
> index 07f763eb2e..50439e5cbb 100644
> --- a/hw/core/machine.c
> +++ b/hw/core/machine.c
> @@ -41,6 +41,7 @@
>   
>   GlobalProperty hw_compat_8_0[] = {
>       { "migration", "multifd-flush-after-each-section", "on"},
> +    { "vfio-pci", "x-allow-pre-copy", "false" },
>   };
>   const size_t hw_compat_8_0_len = G_N_ELEMENTS(hw_compat_8_0);
>   
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 78358ede27..b73086e17a 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -492,7 +492,8 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
>               }
>   
>               if (vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF &&
> -                migration->device_state == VFIO_DEVICE_STATE_RUNNING) {
> +                (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
> +                 migration->device_state == VFIO_DEVICE_STATE_PRE_COPY)) {
>                   return false;
>               }
>           }
> @@ -537,7 +538,8 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
>                   return false;
>               }
>   
> -            if (migration->device_state == VFIO_DEVICE_STATE_RUNNING) {
> +            if (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
> +                migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
>                   continue;
>               } else {
>                   return false;
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 235978fd68..418efed019 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -68,6 +68,8 @@ static const char *mig_state_to_str(enum vfio_device_mig_state state)
>           return "STOP_COPY";
>       case VFIO_DEVICE_STATE_RESUMING:
>           return "RESUMING";
> +    case VFIO_DEVICE_STATE_PRE_COPY:
> +        return "PRE_COPY";
>       default:
>           return "UNKNOWN STATE";
>       }
> @@ -241,6 +243,22 @@ static int vfio_query_stop_copy_size(VFIODevice *vbasedev,
>       return 0;
>   }
>   
> +static int vfio_query_precopy_size(VFIOMigration *migration)
> +{
> +    struct vfio_precopy_info precopy = {
> +        .argsz = sizeof(precopy),
> +    };

May be move here  :

         migration->precopy_init_size = 0;
         migration->precopy_dirty_size = 0;

since the values are reset always before calling vfio_query_precopy_size()

> +
> +    if (ioctl(migration->data_fd, VFIO_MIG_GET_PRECOPY_INFO, &precopy)) {
> +        return -errno;
> +    }
> +
> +    migration->precopy_init_size = precopy.initial_bytes;
> +    migration->precopy_dirty_size = precopy.dirty_bytes;
> +
> +    return 0;
> +}
> +
>   /* Returns the size of saved data on success and -errno on error */
>   static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
>   {
> @@ -249,6 +267,11 @@ static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
>       data_size = read(migration->data_fd, migration->data_buffer,
>                        migration->data_buffer_size);
>       if (data_size < 0) {
> +        /* Pre-copy emptied all the device state for now */
> +        if (errno == ENOMSG) {

Could you explain a little more this errno please ? It looks like an API with
the VFIO PCI variant kernel driver.

> +            return 0;
> +        }
> +
>           return -errno;
>       }
>       if (data_size == 0) {
> @@ -265,6 +288,39 @@ static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
>       return qemu_file_get_error(f) ?: data_size;
>   }
>   
> +static void vfio_update_estimated_pending_data(VFIOMigration *migration,
> +                                               uint64_t data_size)
> +{
> +    if (!data_size) {
> +        /*
> +         * Pre-copy emptied all the device state for now, update estimated sizes
> +         * accordingly.
> +         */
> +        migration->precopy_init_size = 0;
> +        migration->precopy_dirty_size = 0;
> +
> +        return;
> +    }
> +
> +    if (migration->precopy_init_size) {
> +        uint64_t init_size = MIN(migration->precopy_init_size, data_size);
> +
> +        migration->precopy_init_size -= init_size;
> +        data_size -= init_size;
> +    }
> +
> +    migration->precopy_dirty_size -= MIN(migration->precopy_dirty_size,
> +                                         data_size);

Do we have a trace event for all this data values ?

> +}
> +
> +static bool vfio_precopy_supported(VFIODevice *vbasedev)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +
> +    return vbasedev->allow_pre_copy &&
> +           migration->mig_flags & VFIO_MIGRATION_PRE_COPY;
> +}
> +
>   /* ---------------------------------------------------------------------- */
>   
>   static int vfio_save_setup(QEMUFile *f, void *opaque)
> @@ -285,6 +341,31 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
>           return -ENOMEM;
>       }
>   
> +    if (vfio_precopy_supported(vbasedev)) {
> +        int ret;
> +
> +        migration->precopy_init_size = 0;
> +        migration->precopy_dirty_size = 0;
> +
> +        switch (migration->device_state) {
> +        case VFIO_DEVICE_STATE_RUNNING:
> +            ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_PRE_COPY,
> +                                           VFIO_DEVICE_STATE_RUNNING);
> +            if (ret) {
> +                return ret;
> +            }
> +
> +            vfio_query_precopy_size(migration);
> +
> +            break;
> +        case VFIO_DEVICE_STATE_STOP:
> +            /* vfio_save_complete_precopy() will go to STOP_COPY */
> +            break;
> +        default:
> +            return -EINVAL;
> +        }
> +    }
> +
>       trace_vfio_save_setup(vbasedev->name, migration->data_buffer_size);
>   
>       qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> @@ -303,22 +384,36 @@ static void vfio_save_cleanup(void *opaque)
>       trace_vfio_save_cleanup(vbasedev->name);
>   }
>   
> +static void vfio_state_pending_estimate(void *opaque, uint64_t *must_precopy,
> +                                        uint64_t *can_postcopy)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +
> +    if (migration->device_state != VFIO_DEVICE_STATE_PRE_COPY) {
> +        return;
> +    }
> +
> +    *must_precopy +=
> +        migration->precopy_init_size + migration->precopy_dirty_size;
> +
> +    trace_vfio_state_pending_estimate(vbasedev->name, *must_precopy,
> +                                      *can_postcopy,
> +                                      migration->precopy_init_size,
> +                                      migration->precopy_dirty_size);


ok we have one :) I wonder if we should not update trace_vfio_save_iterate()
also with some values.

> +}
> +
>   /*
>    * Migration size of VFIO devices can be as little as a few KBs or as big as
>    * many GBs. This value should be big enough to cover the worst case.
>    */
>   #define VFIO_MIG_STOP_COPY_SIZE (100 * GiB)
>   
> -/*
> - * Only exact function is implemented and not estimate function. The reason is
> - * that during pre-copy phase of migration the estimate function is called
> - * repeatedly while pending RAM size is over the threshold, thus migration
> - * can't converge and querying the VFIO device pending data size is useless.
> - */
>   static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
>                                        uint64_t *can_postcopy)
>   {
>       VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
>       uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;
>   
>       /*
> @@ -328,8 +423,49 @@ static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
>       vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
>       *must_precopy += stop_copy_size;
>   
> +    if (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
> +        migration->precopy_init_size = 0;
> +        migration->precopy_dirty_size = 0;
> +        vfio_query_precopy_size(migration);
> +
> +        *must_precopy +=
> +            migration->precopy_init_size + migration->precopy_dirty_size;
> +    }
> +
>       trace_vfio_state_pending_exact(vbasedev->name, *must_precopy, *can_postcopy,
> -                                   stop_copy_size);
> +                                   stop_copy_size, migration->precopy_init_size,
> +                                   migration->precopy_dirty_size);
> +}
> +
> +static bool vfio_is_active_iterate(void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +
> +    return migration->device_state == VFIO_DEVICE_STATE_PRE_COPY;
> +}
> +
> +static int vfio_save_iterate(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    ssize_t data_size;
> +
> +    data_size = vfio_save_block(f, migration);
> +    if (data_size < 0) {
> +        return data_size;
> +    }
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    vfio_update_estimated_pending_data(migration, data_size);
> +
> +    trace_vfio_save_iterate(vbasedev->name);
> +
> +    /*
> +     * A VFIO device's pre-copy dirty_bytes is not guaranteed to reach zero.
> +     * Return 1 so following handlers will not be potentially blocked.

Can this condition be detected to warn the user ?

> +     */
> +    return 1;
>   }
>   
>   static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> @@ -338,7 +474,7 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>       ssize_t data_size;
>       int ret;
>   
> -    /* We reach here with device state STOP only */
> +    /* We reach here with device state STOP or STOP_COPY only */
>       ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
>                                      VFIO_DEVICE_STATE_STOP);
>       if (ret) {
> @@ -457,7 +593,10 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>   static const SaveVMHandlers savevm_vfio_handlers = {
>       .save_setup = vfio_save_setup,
>       .save_cleanup = vfio_save_cleanup,
> +    .state_pending_estimate = vfio_state_pending_estimate,
>       .state_pending_exact = vfio_state_pending_exact,
> +    .is_active_iterate = vfio_is_active_iterate,
> +    .save_live_iterate = vfio_save_iterate,
>       .save_live_complete_precopy = vfio_save_complete_precopy,
>       .save_state = vfio_save_state,
>       .load_setup = vfio_load_setup,
> @@ -470,13 +609,18 @@ static const SaveVMHandlers savevm_vfio_handlers = {
>   static void vfio_vmstate_change(void *opaque, bool running, RunState state)
>   {
>       VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
>       enum vfio_device_mig_state new_state;
>       int ret;
>   
>       if (running) {
>           new_state = VFIO_DEVICE_STATE_RUNNING;
>       } else {
> -        new_state = VFIO_DEVICE_STATE_STOP;
> +        new_state =
> +            (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY &&
> +             (state == RUN_STATE_FINISH_MIGRATE || state == RUN_STATE_PAUSED)) ?
> +                VFIO_DEVICE_STATE_STOP_COPY :
> +                VFIO_DEVICE_STATE_STOP;
>       }
>   
>       /*
> @@ -603,6 +747,7 @@ static int vfio_migration_init(VFIODevice *vbasedev)
>       migration->vbasedev = vbasedev;
>       migration->device_state = VFIO_DEVICE_STATE_RUNNING;
>       migration->data_fd = -1;
> +    migration->mig_flags = mig_flags;
>   
>       vbasedev->dirty_pages_supported = vfio_dma_logging_supported(vbasedev);
>   
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index bf27a39905..72f30ce09f 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -3335,6 +3335,8 @@ static Property vfio_pci_dev_properties[] = {
>       DEFINE_PROP_ON_OFF_AUTO("x-pre-copy-dirty-page-tracking", VFIOPCIDevice,
>                               vbasedev.pre_copy_dirty_page_tracking,
>                               ON_OFF_AUTO_ON),
> +    DEFINE_PROP_BOOL("x-allow-pre-copy", VFIOPCIDevice,
> +                     vbasedev.allow_pre_copy, true),
>       DEFINE_PROP_ON_OFF_AUTO("display", VFIOPCIDevice,
>                               display, ON_OFF_AUTO_OFF),
>       DEFINE_PROP_UINT32("xres", VFIOPCIDevice, display_xres, 0),
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 646e42fd27..fd6893cb43 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -162,6 +162,8 @@ vfio_save_block(const char *name, int data_size) " (%s) data_size %d"
>   vfio_save_cleanup(const char *name) " (%s)"
>   vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d"
>   vfio_save_device_config_state(const char *name) " (%s)"
> +vfio_save_iterate(const char *name) " (%s)"
>   vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size 0x%"PRIx64
> -vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64
> +vfio_state_pending_estimate(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
> +vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
>   vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
Avihai Horon May 24, 2023, 12:49 p.m. UTC | #2
On 23/05/2023 17:56, Cédric Le Goater wrote:
> External email: Use caution opening links or attachments
>
>
> Hello Avihai,
>
> On 5/21/23 17:18, Avihai Horon wrote:
>> Pre-copy support allows the VFIO device data to be transferred while the
>> VM is running. This helps to accommodate VFIO devices that have a large
>> amount of data that needs to be transferred, and it can reduce migration
>> downtime.
>>
>> Pre-copy support is optional in VFIO migration protocol v2.
>> Implement pre-copy of VFIO migration protocol v2 and use it for devices
>> that support it. Full description of it can be found here [1].
>>
>> In addition, add a new VFIO device property x-allow-pre-copy to keep
>> migration compatibility to/from older QEMU versions that don't have VFIO
>> pre-copy support.
>>
>> [1]
>> https://lore.kernel.org/kvm/20221206083438.37807-3-yishaih@nvidia.com/
>
>
> May be simply reference Linux commit 4db52602a607 ("vfio: Extend the 
> device
> migration protocol with PRE_COPY") instead.

Sure, I will change it.

>
> some comments below,
>
>
>>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>> ---
>>   docs/devel/vfio-migration.rst |  35 +++++---
>>   include/hw/vfio/vfio-common.h |   4 +
>>   hw/core/machine.c             |   1 +
>>   hw/vfio/common.c              |   6 +-
>>   hw/vfio/migration.c           | 163 ++++++++++++++++++++++++++++++++--
>>   hw/vfio/pci.c                 |   2 +
>>   hw/vfio/trace-events          |   4 +-
>>   7 files changed, 193 insertions(+), 22 deletions(-)
>>
>> diff --git a/docs/devel/vfio-migration.rst 
>> b/docs/devel/vfio-migration.rst
>> index 1b68ccf115..e896b2a673 100644
>> --- a/docs/devel/vfio-migration.rst
>> +++ b/docs/devel/vfio-migration.rst
>> @@ -7,12 +7,14 @@ the guest is running on source host and restoring 
>> this saved state on the
>>   destination host. This document details how saving and restoring of 
>> VFIO
>>   devices is done in QEMU.
>>
>> -Migration of VFIO devices currently consists of a single 
>> stop-and-copy phase.
>> -During the stop-and-copy phase the guest is stopped and the entire 
>> VFIO device
>> -data is transferred to the destination.
>> -
>> -The pre-copy phase of migration is currently not supported for VFIO 
>> devices.
>> -Support for VFIO pre-copy will be added later on.
>> +Migration of VFIO devices consists of two phases: the optional 
>> pre-copy phase,
>> +and the stop-and-copy phase. The pre-copy phase is iterative and 
>> allows to
>> +accommodate VFIO devices that have a large amount of data that needs 
>> to be
>> +transferred. The iterative pre-copy phase of migration allows for 
>> the guest to
>> +continue whilst the VFIO device state is transferred to the 
>> destination, this
>> +helps to reduce the total downtime of the VM. VFIO devices opt-in to 
>> pre-copy
>> +support by reporting the VFIO_MIGRATION_PRE_COPY flag in the
>> +VFIO_DEVICE_FEATURE_MIGRATION ioctl.
>>
>>   Note that currently VFIO migration is supported only for a single 
>> device. This
>>   is due to VFIO migration's lack of P2P support. However, P2P 
>> support is planned
>> @@ -29,10 +31,20 @@ VFIO implements the device hooks for the 
>> iterative approach as follows:
>>   * A ``load_setup`` function that sets the VFIO device on the 
>> destination in
>>     _RESUMING state.
>>
>> +* A ``state_pending_estimate`` function that reports an estimate of the
>> +  remaining pre-copy data that the vendor driver has yet to save for 
>> the VFIO
>> +  device.
>> +
>>   * A ``state_pending_exact`` function that reads pending_bytes from 
>> the vendor
>>     driver, which indicates the amount of data that the vendor driver 
>> has yet to
>>     save for the VFIO device.
>>
>> +* An ``is_active_iterate`` function that indicates 
>> ``save_live_iterate`` is
>> +  active only when the VFIO device is in pre-copy states.
>> +
>> +* A ``save_live_iterate`` function that reads the VFIO device's data 
>> from the
>> +  vendor driver during iterative pre-copy phase.
>> +
>>   * A ``save_state`` function to save the device config space if it 
>> is present.
>>
>>   * A ``save_live_complete_precopy`` function that sets the VFIO 
>> device in
>> @@ -111,8 +123,10 @@ Flow of state changes during Live migration
>>   ===========================================
>>
>>   Below is the flow of state change during live migration.
>> -The values in the brackets represent the VM state, the migration 
>> state, and
>> +The values in the parentheses represent the VM state, the migration 
>> state, and
>>   the VFIO device state, respectively.
>> +The text in the square brackets represents the flow if the VFIO 
>> device supports
>> +pre-copy.
>>
>>   Live migration save path
>>   ------------------------
>> @@ -124,11 +138,12 @@ Live migration save path
>>                                     |
>>                        migrate_init spawns migration_thread
>>                   Migration thread then calls each device's 
>> .save_setup()
>> -                       (RUNNING, _SETUP, _RUNNING)
>> +                  (RUNNING, _SETUP, _RUNNING [_PRE_COPY])
>>                                     |
>> -                      (RUNNING, _ACTIVE, _RUNNING)
>> -             If device is active, get pending_bytes by 
>> .state_pending_exact()
>> +                  (RUNNING, _ACTIVE, _RUNNING [_PRE_COPY])
>> +      If device is active, get pending_bytes by 
>> .state_pending_{estimate,exact}()
>>             If total pending_bytes >= threshold_size, call 
>> .save_live_iterate()
>> +                  [Data of VFIO device for pre-copy phase is copied]
>>           Iterate till total pending bytes converge and are less than 
>> threshold
>>                                     |
>>     On migration completion, vCPU stops and calls 
>> .save_live_complete_precopy for
>> diff --git a/include/hw/vfio/vfio-common.h 
>> b/include/hw/vfio/vfio-common.h
>> index eed244f25f..5ce7a01d56 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -66,6 +66,9 @@ typedef struct VFIOMigration {
>>       int data_fd;
>>       void *data_buffer;
>>       size_t data_buffer_size;
>> +    uint64_t precopy_init_size;
>> +    uint64_t precopy_dirty_size;
>> +    uint64_t mig_flags;
>
> It would have been cleaner to introduce VFIOMigration::mig_flags and its
> update in another patch. This is minor.

Sure, I will split it.

>
>
>>   } VFIOMigration;
>>
>>   typedef struct VFIOAddressSpace {
>> @@ -143,6 +146,7 @@ typedef struct VFIODevice {
>>       VFIOMigration *migration;
>>       Error *migration_blocker;
>>       OnOffAuto pre_copy_dirty_page_tracking;
>> +    bool allow_pre_copy;
>
> same comment for this bool and the associated property, because it would
> ease backports.

Sure.
Just for general knowledge, can you explain how this could ease backports?

>
>
>>       bool dirty_pages_supported;
>>       bool dirty_tracking;
>>   } VFIODevice;
>> diff --git a/hw/core/machine.c b/hw/core/machine.c
>> index 07f763eb2e..50439e5cbb 100644
>> --- a/hw/core/machine.c
>> +++ b/hw/core/machine.c
>> @@ -41,6 +41,7 @@
>>
>>   GlobalProperty hw_compat_8_0[] = {
>>       { "migration", "multifd-flush-after-each-section", "on"},
>> +    { "vfio-pci", "x-allow-pre-copy", "false" },
>>   };
>>   const size_t hw_compat_8_0_len = G_N_ELEMENTS(hw_compat_8_0);
>>
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index 78358ede27..b73086e17a 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -492,7 +492,8 @@ static bool 
>> vfio_devices_all_dirty_tracking(VFIOContainer *container)
>>               }
>>
>>               if (vbasedev->pre_copy_dirty_page_tracking == 
>> ON_OFF_AUTO_OFF &&
>> -                migration->device_state == VFIO_DEVICE_STATE_RUNNING) {
>> +                (migration->device_state == 
>> VFIO_DEVICE_STATE_RUNNING ||
>> +                 migration->device_state == 
>> VFIO_DEVICE_STATE_PRE_COPY)) {
>>                   return false;
>>               }
>>           }
>> @@ -537,7 +538,8 @@ static bool 
>> vfio_devices_all_running_and_mig_active(VFIOContainer *container)
>>                   return false;
>>               }
>>
>> -            if (migration->device_state == VFIO_DEVICE_STATE_RUNNING) {
>> +            if (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
>> +                migration->device_state == 
>> VFIO_DEVICE_STATE_PRE_COPY) {
>>                   continue;
>>               } else {
>>                   return false;
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index 235978fd68..418efed019 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -68,6 +68,8 @@ static const char *mig_state_to_str(enum 
>> vfio_device_mig_state state)
>>           return "STOP_COPY";
>>       case VFIO_DEVICE_STATE_RESUMING:
>>           return "RESUMING";
>> +    case VFIO_DEVICE_STATE_PRE_COPY:
>> +        return "PRE_COPY";
>>       default:
>>           return "UNKNOWN STATE";
>>       }
>> @@ -241,6 +243,22 @@ static int vfio_query_stop_copy_size(VFIODevice 
>> *vbasedev,
>>       return 0;
>>   }
>>
>> +static int vfio_query_precopy_size(VFIOMigration *migration)
>> +{
>> +    struct vfio_precopy_info precopy = {
>> +        .argsz = sizeof(precopy),
>> +    };
>
> May be move here  :
>
>         migration->precopy_init_size = 0;
>         migration->precopy_dirty_size = 0;
>
> since the values are reset always before calling vfio_query_precopy_size()

OK.
I will also reset these values in vfio_save_cleanup() so there won't be 
stale values in case migration is cancelled or fails.

>
>
>> +
>> +    if (ioctl(migration->data_fd, VFIO_MIG_GET_PRECOPY_INFO, 
>> &precopy)) {
>> +        return -errno;
>> +    }
>> +
>> +    migration->precopy_init_size = precopy.initial_bytes;
>> +    migration->precopy_dirty_size = precopy.dirty_bytes;
>> +
>> +    return 0;
>> +}
>> +
>>   /* Returns the size of saved data on success and -errno on error */
>>   static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
>>   {
>> @@ -249,6 +267,11 @@ static ssize_t vfio_save_block(QEMUFile *f, 
>> VFIOMigration *migration)
>>       data_size = read(migration->data_fd, migration->data_buffer,
>>                        migration->data_buffer_size);
>>       if (data_size < 0) {
>> +        /* Pre-copy emptied all the device state for now */
>> +        if (errno == ENOMSG) {
>
> Could you explain a little more this errno please ? It looks like an 
> API with
> the VFIO PCI variant kernel driver.

Yes, it's explained in the precopy uAPI [1].
Do you want to change the comment to something like the following?
/*
  * ENOMSG indicates that the migration data_fd has reached a temporary
  * "end of stream", i.e. both initial_bytes and dirty_bytes are zero.
  * More data may be available later in future reads.
  */

[1] 
https://elixir.bootlin.com/linux/v6.4-rc1/source/include/uapi/linux/vfio.h#L1084

>
>> +            return 0;
>> +        }
>> +
>>           return -errno;
>>       }
>>       if (data_size == 0) {
>> @@ -265,6 +288,39 @@ static ssize_t vfio_save_block(QEMUFile *f, 
>> VFIOMigration *migration)
>>       return qemu_file_get_error(f) ?: data_size;
>>   }
>>
>> +static void vfio_update_estimated_pending_data(VFIOMigration 
>> *migration,
>> +                                               uint64_t data_size)
>> +{
>> +    if (!data_size) {
>> +        /*
>> +         * Pre-copy emptied all the device state for now, update 
>> estimated sizes
>> +         * accordingly.
>> +         */
>> +        migration->precopy_init_size = 0;
>> +        migration->precopy_dirty_size = 0;
>> +
>> +        return;
>> +    }
>> +
>> +    if (migration->precopy_init_size) {
>> +        uint64_t init_size = MIN(migration->precopy_init_size, 
>> data_size);
>> +
>> +        migration->precopy_init_size -= init_size;
>> +        data_size -= init_size;
>> +    }
>> +
>> +    migration->precopy_dirty_size -= MIN(migration->precopy_dirty_size,
>> +                                         data_size);
>
> Do we have a trace event for all this data values ?
>
>> +}
>> +
>> +static bool vfio_precopy_supported(VFIODevice *vbasedev)
>> +{
>> +    VFIOMigration *migration = vbasedev->migration;
>> +
>> +    return vbasedev->allow_pre_copy &&
>> +           migration->mig_flags & VFIO_MIGRATION_PRE_COPY;
>> +}
>> +
>>   /* 
>> ---------------------------------------------------------------------- 
>> */
>>
>>   static int vfio_save_setup(QEMUFile *f, void *opaque)
>> @@ -285,6 +341,31 @@ static int vfio_save_setup(QEMUFile *f, void 
>> *opaque)
>>           return -ENOMEM;
>>       }
>>
>> +    if (vfio_precopy_supported(vbasedev)) {
>> +        int ret;
>> +
>> +        migration->precopy_init_size = 0;
>> +        migration->precopy_dirty_size = 0;
>> +
>> +        switch (migration->device_state) {
>> +        case VFIO_DEVICE_STATE_RUNNING:
>> +            ret = vfio_migration_set_state(vbasedev, 
>> VFIO_DEVICE_STATE_PRE_COPY,
>> + VFIO_DEVICE_STATE_RUNNING);
>> +            if (ret) {
>> +                return ret;
>> +            }
>> +
>> +            vfio_query_precopy_size(migration);
>> +
>> +            break;
>> +        case VFIO_DEVICE_STATE_STOP:
>> +            /* vfio_save_complete_precopy() will go to STOP_COPY */
>> +            break;
>> +        default:
>> +            return -EINVAL;
>> +        }
>> +    }
>> +
>>       trace_vfio_save_setup(vbasedev->name, 
>> migration->data_buffer_size);
>>
>>       qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>> @@ -303,22 +384,36 @@ static void vfio_save_cleanup(void *opaque)
>>       trace_vfio_save_cleanup(vbasedev->name);
>>   }
>>
>> +static void vfio_state_pending_estimate(void *opaque, uint64_t 
>> *must_precopy,
>> +                                        uint64_t *can_postcopy)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +
>> +    if (migration->device_state != VFIO_DEVICE_STATE_PRE_COPY) {
>> +        return;
>> +    }
>> +
>> +    *must_precopy +=
>> +        migration->precopy_init_size + migration->precopy_dirty_size;
>> +
>> +    trace_vfio_state_pending_estimate(vbasedev->name, *must_precopy,
>> +                                      *can_postcopy,
>> + migration->precopy_init_size,
>> + migration->precopy_dirty_size);
>
>
> ok we have one :) I wonder if we should not update 
> trace_vfio_save_iterate()
> also with some values.

Hmm, yes, I guess it wouldn't hurt.

>
>> +}
>> +
>>   /*
>>    * Migration size of VFIO devices can be as little as a few KBs or 
>> as big as
>>    * many GBs. This value should be big enough to cover the worst case.
>>    */
>>   #define VFIO_MIG_STOP_COPY_SIZE (100 * GiB)
>>
>> -/*
>> - * Only exact function is implemented and not estimate function. The 
>> reason is
>> - * that during pre-copy phase of migration the estimate function is 
>> called
>> - * repeatedly while pending RAM size is over the threshold, thus 
>> migration
>> - * can't converge and querying the VFIO device pending data size is 
>> useless.
>> - */
>>   static void vfio_state_pending_exact(void *opaque, uint64_t 
>> *must_precopy,
>>                                        uint64_t *can_postcopy)
>>   {
>>       VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>>       uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;
>>
>>       /*
>> @@ -328,8 +423,49 @@ static void vfio_state_pending_exact(void 
>> *opaque, uint64_t *must_precopy,
>>       vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
>>       *must_precopy += stop_copy_size;
>>
>> +    if (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
>> +        migration->precopy_init_size = 0;
>> +        migration->precopy_dirty_size = 0;
>> +        vfio_query_precopy_size(migration);
>> +
>> +        *must_precopy +=
>> +            migration->precopy_init_size + 
>> migration->precopy_dirty_size;
>> +    }
>> +
>>       trace_vfio_state_pending_exact(vbasedev->name, *must_precopy, 
>> *can_postcopy,
>> -                                   stop_copy_size);
>> +                                   stop_copy_size, 
>> migration->precopy_init_size,
>> + migration->precopy_dirty_size);
>> +}
>> +
>> +static bool vfio_is_active_iterate(void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +
>> +    return migration->device_state == VFIO_DEVICE_STATE_PRE_COPY;
>> +}
>> +
>> +static int vfio_save_iterate(QEMUFile *f, void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    ssize_t data_size;
>> +
>> +    data_size = vfio_save_block(f, migration);
>> +    if (data_size < 0) {
>> +        return data_size;
>> +    }
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>> +
>> +    vfio_update_estimated_pending_data(migration, data_size);
>> +
>> +    trace_vfio_save_iterate(vbasedev->name);
>> +
>> +    /*
>> +     * A VFIO device's pre-copy dirty_bytes is not guaranteed to 
>> reach zero.
>> +     * Return 1 so following handlers will not be potentially blocked.
>
> Can this condition be detected to warn the user ?

I don't think so, it depends on the kernel driver implementation.

Thanks!

>
>
>> +     */
>> +    return 1;
>>   }
>>
>>   static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>> @@ -338,7 +474,7 @@ static int vfio_save_complete_precopy(QEMUFile 
>> *f, void *opaque)
>>       ssize_t data_size;
>>       int ret;
>>
>> -    /* We reach here with device state STOP only */
>> +    /* We reach here with device state STOP or STOP_COPY only */
>>       ret = vfio_migration_set_state(vbasedev, 
>> VFIO_DEVICE_STATE_STOP_COPY,
>>                                      VFIO_DEVICE_STATE_STOP);
>>       if (ret) {
>> @@ -457,7 +593,10 @@ static int vfio_load_state(QEMUFile *f, void 
>> *opaque, int version_id)
>>   static const SaveVMHandlers savevm_vfio_handlers = {
>>       .save_setup = vfio_save_setup,
>>       .save_cleanup = vfio_save_cleanup,
>> +    .state_pending_estimate = vfio_state_pending_estimate,
>>       .state_pending_exact = vfio_state_pending_exact,
>> +    .is_active_iterate = vfio_is_active_iterate,
>> +    .save_live_iterate = vfio_save_iterate,
>>       .save_live_complete_precopy = vfio_save_complete_precopy,
>>       .save_state = vfio_save_state,
>>       .load_setup = vfio_load_setup,
>> @@ -470,13 +609,18 @@ static const SaveVMHandlers 
>> savevm_vfio_handlers = {
>>   static void vfio_vmstate_change(void *opaque, bool running, 
>> RunState state)
>>   {
>>       VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>>       enum vfio_device_mig_state new_state;
>>       int ret;
>>
>>       if (running) {
>>           new_state = VFIO_DEVICE_STATE_RUNNING;
>>       } else {
>> -        new_state = VFIO_DEVICE_STATE_STOP;
>> +        new_state =
>> +            (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY &&
>> +             (state == RUN_STATE_FINISH_MIGRATE || state == 
>> RUN_STATE_PAUSED)) ?
>> +                VFIO_DEVICE_STATE_STOP_COPY :
>> +                VFIO_DEVICE_STATE_STOP;
>>       }
>>
>>       /*
>> @@ -603,6 +747,7 @@ static int vfio_migration_init(VFIODevice *vbasedev)
>>       migration->vbasedev = vbasedev;
>>       migration->device_state = VFIO_DEVICE_STATE_RUNNING;
>>       migration->data_fd = -1;
>> +    migration->mig_flags = mig_flags;
>>
>>       vbasedev->dirty_pages_supported = 
>> vfio_dma_logging_supported(vbasedev);
>>
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index bf27a39905..72f30ce09f 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -3335,6 +3335,8 @@ static Property vfio_pci_dev_properties[] = {
>>       DEFINE_PROP_ON_OFF_AUTO("x-pre-copy-dirty-page-tracking", 
>> VFIOPCIDevice,
>> vbasedev.pre_copy_dirty_page_tracking,
>>                               ON_OFF_AUTO_ON),
>> +    DEFINE_PROP_BOOL("x-allow-pre-copy", VFIOPCIDevice,
>> +                     vbasedev.allow_pre_copy, true),
>>       DEFINE_PROP_ON_OFF_AUTO("display", VFIOPCIDevice,
>>                               display, ON_OFF_AUTO_OFF),
>>       DEFINE_PROP_UINT32("xres", VFIOPCIDevice, display_xres, 0),
>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>> index 646e42fd27..fd6893cb43 100644
>> --- a/hw/vfio/trace-events
>> +++ b/hw/vfio/trace-events
>> @@ -162,6 +162,8 @@ vfio_save_block(const char *name, int data_size) 
>> " (%s) data_size %d"
>>   vfio_save_cleanup(const char *name) " (%s)"
>>   vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d"
>>   vfio_save_device_config_state(const char *name) " (%s)"
>> +vfio_save_iterate(const char *name) " (%s)"
>>   vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) 
>> data buffer size 0x%"PRIx64
>> -vfio_state_pending_exact(const char *name, uint64_t precopy, 
>> uint64_t postcopy, uint64_t stopcopy_size) " (%s) precopy 0x%"PRIx64" 
>> postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64
>> +vfio_state_pending_estimate(const char *name, uint64_t precopy, 
>> uint64_t postcopy, uint64_t precopy_init_size, uint64_t 
>> precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" 
>> precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
>> +vfio_state_pending_exact(const char *name, uint64_t precopy, 
>> uint64_t postcopy, uint64_t stopcopy_size, uint64_t 
>> precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 
>> 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64" precopy 
>> initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
>>   vfio_vmstate_change(const char *name, int running, const char 
>> *reason, const char *dev_state) " (%s) running %d reason %s device 
>> state %s"
>
Cédric Le Goater May 24, 2023, 3:38 p.m. UTC | #3
Hello Avihai,

On 5/24/23 14:49, Avihai Horon wrote:
> 
> On 23/05/2023 17:56, Cédric Le Goater wrote:
>> External email: Use caution opening links or attachments
>>
>>
>> Hello Avihai,
>>
>> On 5/21/23 17:18, Avihai Horon wrote:
>>> Pre-copy support allows the VFIO device data to be transferred while the
>>> VM is running. This helps to accommodate VFIO devices that have a large
>>> amount of data that needs to be transferred, and it can reduce migration
>>> downtime.
>>>
>>> Pre-copy support is optional in VFIO migration protocol v2.
>>> Implement pre-copy of VFIO migration protocol v2 and use it for devices
>>> that support it. Full description of it can be found here [1].
>>>
>>> In addition, add a new VFIO device property x-allow-pre-copy to keep
>>> migration compatibility to/from older QEMU versions that don't have VFIO
>>> pre-copy support.
>>>
>>> [1]
>>> https://lore.kernel.org/kvm/20221206083438.37807-3-yishaih@nvidia.com/
>>
>>
>> May be simply reference Linux commit 4db52602a607 ("vfio: Extend the device
>> migration protocol with PRE_COPY") instead.
> 
> Sure, I will change it.
> 
>>
>> some comments below,
>>
>>
>>>
>>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>>> ---
>>>   docs/devel/vfio-migration.rst |  35 +++++---
>>>   include/hw/vfio/vfio-common.h |   4 +
>>>   hw/core/machine.c             |   1 +
>>>   hw/vfio/common.c              |   6 +-
>>>   hw/vfio/migration.c           | 163 ++++++++++++++++++++++++++++++++--
>>>   hw/vfio/pci.c                 |   2 +
>>>   hw/vfio/trace-events          |   4 +-
>>>   7 files changed, 193 insertions(+), 22 deletions(-)
>>>
>>> diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
>>> index 1b68ccf115..e896b2a673 100644
>>> --- a/docs/devel/vfio-migration.rst
>>> +++ b/docs/devel/vfio-migration.rst
>>> @@ -7,12 +7,14 @@ the guest is running on source host and restoring this saved state on the
>>>   destination host. This document details how saving and restoring of VFIO
>>>   devices is done in QEMU.
>>>
>>> -Migration of VFIO devices currently consists of a single stop-and-copy phase.
>>> -During the stop-and-copy phase the guest is stopped and the entire VFIO device
>>> -data is transferred to the destination.
>>> -
>>> -The pre-copy phase of migration is currently not supported for VFIO devices.
>>> -Support for VFIO pre-copy will be added later on.
>>> +Migration of VFIO devices consists of two phases: the optional pre-copy phase,
>>> +and the stop-and-copy phase. The pre-copy phase is iterative and allows to
>>> +accommodate VFIO devices that have a large amount of data that needs to be
>>> +transferred. The iterative pre-copy phase of migration allows for the guest to
>>> +continue whilst the VFIO device state is transferred to the destination, this
>>> +helps to reduce the total downtime of the VM. VFIO devices opt-in to pre-copy
>>> +support by reporting the VFIO_MIGRATION_PRE_COPY flag in the
>>> +VFIO_DEVICE_FEATURE_MIGRATION ioctl.
>>>
>>>   Note that currently VFIO migration is supported only for a single device. This
>>>   is due to VFIO migration's lack of P2P support. However, P2P support is planned
>>> @@ -29,10 +31,20 @@ VFIO implements the device hooks for the iterative approach as follows:
>>>   * A ``load_setup`` function that sets the VFIO device on the destination in
>>>     _RESUMING state.
>>>
>>> +* A ``state_pending_estimate`` function that reports an estimate of the
>>> +  remaining pre-copy data that the vendor driver has yet to save for the VFIO
>>> +  device.
>>> +
>>>   * A ``state_pending_exact`` function that reads pending_bytes from the vendor
>>>     driver, which indicates the amount of data that the vendor driver has yet to
>>>     save for the VFIO device.
>>>
>>> +* An ``is_active_iterate`` function that indicates ``save_live_iterate`` is
>>> +  active only when the VFIO device is in pre-copy states.
>>> +
>>> +* A ``save_live_iterate`` function that reads the VFIO device's data from the
>>> +  vendor driver during iterative pre-copy phase.
>>> +
>>>   * A ``save_state`` function to save the device config space if it is present.
>>>
>>>   * A ``save_live_complete_precopy`` function that sets the VFIO device in
>>> @@ -111,8 +123,10 @@ Flow of state changes during Live migration
>>>   ===========================================
>>>
>>>   Below is the flow of state change during live migration.
>>> -The values in the brackets represent the VM state, the migration state, and
>>> +The values in the parentheses represent the VM state, the migration state, and
>>>   the VFIO device state, respectively.
>>> +The text in the square brackets represents the flow if the VFIO device supports
>>> +pre-copy.
>>>
>>>   Live migration save path
>>>   ------------------------
>>> @@ -124,11 +138,12 @@ Live migration save path
>>>                                     |
>>>                        migrate_init spawns migration_thread
>>>                   Migration thread then calls each device's .save_setup()
>>> -                       (RUNNING, _SETUP, _RUNNING)
>>> +                  (RUNNING, _SETUP, _RUNNING [_PRE_COPY])
>>>                                     |
>>> -                      (RUNNING, _ACTIVE, _RUNNING)
>>> -             If device is active, get pending_bytes by .state_pending_exact()
>>> +                  (RUNNING, _ACTIVE, _RUNNING [_PRE_COPY])
>>> +      If device is active, get pending_bytes by .state_pending_{estimate,exact}()
>>>             If total pending_bytes >= threshold_size, call .save_live_iterate()
>>> +                  [Data of VFIO device for pre-copy phase is copied]
>>>           Iterate till total pending bytes converge and are less than threshold
>>>                                     |
>>>     On migration completion, vCPU stops and calls .save_live_complete_precopy for
>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>>> index eed244f25f..5ce7a01d56 100644
>>> --- a/include/hw/vfio/vfio-common.h
>>> +++ b/include/hw/vfio/vfio-common.h
>>> @@ -66,6 +66,9 @@ typedef struct VFIOMigration {
>>>       int data_fd;
>>>       void *data_buffer;
>>>       size_t data_buffer_size;
>>> +    uint64_t precopy_init_size;
>>> +    uint64_t precopy_dirty_size;
>>> +    uint64_t mig_flags;
>>
>> It would have been cleaner to introduce VFIOMigration::mig_flags and its
>> update in another patch. This is minor.
> 
> Sure, I will split it.
> 
>>
>>
>>>   } VFIOMigration;
>>>
>>>   typedef struct VFIOAddressSpace {
>>> @@ -143,6 +146,7 @@ typedef struct VFIODevice {
>>>       VFIOMigration *migration;
>>>       Error *migration_blocker;
>>>       OnOffAuto pre_copy_dirty_page_tracking;
>>> +    bool allow_pre_copy;
>>
>> same comment for this bool and the associated property, because it would
>> ease backports.
> 
> Sure.
> Just for general knowledge, can you explain how this could ease backports?

The downstream machine names are different. Each distro has its own
flavor. So adding a machine option always require some massaging.

That might change in the future though.

Anyhow, it is good pratice to isolate a change adding a restriction
or a hw compatibility in its own patch.

> 
>>
>>
>>>       bool dirty_pages_supported;
>>>       bool dirty_tracking;
>>>   } VFIODevice;
>>> diff --git a/hw/core/machine.c b/hw/core/machine.c
>>> index 07f763eb2e..50439e5cbb 100644
>>> --- a/hw/core/machine.c
>>> +++ b/hw/core/machine.c
>>> @@ -41,6 +41,7 @@
>>>
>>>   GlobalProperty hw_compat_8_0[] = {
>>>       { "migration", "multifd-flush-after-each-section", "on"},
>>> +    { "vfio-pci", "x-allow-pre-copy", "false" },
>>>   };
>>>   const size_t hw_compat_8_0_len = G_N_ELEMENTS(hw_compat_8_0);
>>>
>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>> index 78358ede27..b73086e17a 100644
>>> --- a/hw/vfio/common.c
>>> +++ b/hw/vfio/common.c
>>> @@ -492,7 +492,8 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
>>>               }
>>>
>>>               if (vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF &&
>>> -                migration->device_state == VFIO_DEVICE_STATE_RUNNING) {
>>> +                (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
>>> +                 migration->device_state == VFIO_DEVICE_STATE_PRE_COPY)) {
>>>                   return false;
>>>               }
>>>           }
>>> @@ -537,7 +538,8 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
>>>                   return false;
>>>               }
>>>
>>> -            if (migration->device_state == VFIO_DEVICE_STATE_RUNNING) {
>>> +            if (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
>>> +                migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
>>>                   continue;
>>>               } else {
>>>                   return false;
>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>>> index 235978fd68..418efed019 100644
>>> --- a/hw/vfio/migration.c
>>> +++ b/hw/vfio/migration.c
>>> @@ -68,6 +68,8 @@ static const char *mig_state_to_str(enum vfio_device_mig_state state)
>>>           return "STOP_COPY";
>>>       case VFIO_DEVICE_STATE_RESUMING:
>>>           return "RESUMING";
>>> +    case VFIO_DEVICE_STATE_PRE_COPY:
>>> +        return "PRE_COPY";
>>>       default:
>>>           return "UNKNOWN STATE";
>>>       }
>>> @@ -241,6 +243,22 @@ static int vfio_query_stop_copy_size(VFIODevice *vbasedev,
>>>       return 0;
>>>   }
>>>
>>> +static int vfio_query_precopy_size(VFIOMigration *migration)
>>> +{
>>> +    struct vfio_precopy_info precopy = {
>>> +        .argsz = sizeof(precopy),
>>> +    };
>>
>> May be move here  :
>>
>>         migration->precopy_init_size = 0;
>>         migration->precopy_dirty_size = 0;
>>
>> since the values are reset always before calling vfio_query_precopy_size()
> 
> OK.
> I will also reset these values in vfio_save_cleanup() so there won't be stale values in case migration is cancelled or fails.
> 
>>
>>
>>> +
>>> +    if (ioctl(migration->data_fd, VFIO_MIG_GET_PRECOPY_INFO, &precopy)) {
>>> +        return -errno;
>>> +    }
>>> +
>>> +    migration->precopy_init_size = precopy.initial_bytes;
>>> +    migration->precopy_dirty_size = precopy.dirty_bytes;
>>> +
>>> +    return 0;
>>> +}
>>> +
>>>   /* Returns the size of saved data on success and -errno on error */
>>>   static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
>>>   {
>>> @@ -249,6 +267,11 @@ static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
>>>       data_size = read(migration->data_fd, migration->data_buffer,
>>>                        migration->data_buffer_size);
>>>       if (data_size < 0) {
>>> +        /* Pre-copy emptied all the device state for now */
>>> +        if (errno == ENOMSG) {
>>
>> Could you explain a little more this errno please ? It looks like an API with
>> the VFIO PCI variant kernel driver.
> 
> Yes, it's explained in the precopy uAPI [1].

Oh. I forgot to look at the uAPI file and only checked the driver.
All good then.

> Do you want to change the comment to something like the following?

Yes, please. It would be good for the reader to have a reference on
the kernel uAPI.

> /*
>   * ENOMSG indicates that the migration data_fd has reached a temporary
>   * "end of stream", i.e. both initial_bytes and dirty_bytes are zero.
>   * More data may be available later in future reads.
>   */

Please add something like :

  "For more information, please refer to the Linux kernel VFIO uAPI"

No need for links or file names.


> [1] https://elixir.bootlin.com/linux/v6.4-rc1/source/include/uapi/linux/vfio.h#L1084
> 
>>
>>> +            return 0;
>>> +        }
>>> +
>>>           return -errno;
>>>       }
>>>       if (data_size == 0) {
>>> @@ -265,6 +288,39 @@ static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
>>>       return qemu_file_get_error(f) ?: data_size;
>>>   }
>>>
>>> +static void vfio_update_estimated_pending_data(VFIOMigration *migration,
>>> +                                               uint64_t data_size)
>>> +{
>>> +    if (!data_size) {
>>> +        /*
>>> +         * Pre-copy emptied all the device state for now, update estimated sizes
>>> +         * accordingly.
>>> +         */
>>> +        migration->precopy_init_size = 0;
>>> +        migration->precopy_dirty_size = 0;
>>> +
>>> +        return;
>>> +    }
>>> +
>>> +    if (migration->precopy_init_size) {
>>> +        uint64_t init_size = MIN(migration->precopy_init_size, data_size);
>>> +
>>> +        migration->precopy_init_size -= init_size;
>>> +        data_size -= init_size;
>>> +    }
>>> +
>>> +    migration->precopy_dirty_size -= MIN(migration->precopy_dirty_size,
>>> +                                         data_size);
>>
>> Do we have a trace event for all this data values ?
>>
>>> +}
>>> +
>>> +static bool vfio_precopy_supported(VFIODevice *vbasedev)
>>> +{
>>> +    VFIOMigration *migration = vbasedev->migration;
>>> +
>>> +    return vbasedev->allow_pre_copy &&
>>> +           migration->mig_flags & VFIO_MIGRATION_PRE_COPY;
>>> +}
>>> +
>>>   /* ---------------------------------------------------------------------- */
>>>
>>>   static int vfio_save_setup(QEMUFile *f, void *opaque)
>>> @@ -285,6 +341,31 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
>>>           return -ENOMEM;
>>>       }
>>>
>>> +    if (vfio_precopy_supported(vbasedev)) {
>>> +        int ret;
>>> +
>>> +        migration->precopy_init_size = 0;
>>> +        migration->precopy_dirty_size = 0;
>>> +
>>> +        switch (migration->device_state) {
>>> +        case VFIO_DEVICE_STATE_RUNNING:
>>> +            ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_PRE_COPY,
>>> + VFIO_DEVICE_STATE_RUNNING);
>>> +            if (ret) {
>>> +                return ret;
>>> +            }
>>> +
>>> +            vfio_query_precopy_size(migration);
>>> +
>>> +            break;
>>> +        case VFIO_DEVICE_STATE_STOP:
>>> +            /* vfio_save_complete_precopy() will go to STOP_COPY */
>>> +            break;
>>> +        default:
>>> +            return -EINVAL;
>>> +        }
>>> +    }
>>> +
>>>       trace_vfio_save_setup(vbasedev->name, migration->data_buffer_size);
>>>
>>>       qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>>> @@ -303,22 +384,36 @@ static void vfio_save_cleanup(void *opaque)
>>>       trace_vfio_save_cleanup(vbasedev->name);
>>>   }
>>>
>>> +static void vfio_state_pending_estimate(void *opaque, uint64_t *must_precopy,
>>> +                                        uint64_t *can_postcopy)
>>> +{
>>> +    VFIODevice *vbasedev = opaque;
>>> +    VFIOMigration *migration = vbasedev->migration;
>>> +
>>> +    if (migration->device_state != VFIO_DEVICE_STATE_PRE_COPY) {
>>> +        return;
>>> +    }
>>> +
>>> +    *must_precopy +=
>>> +        migration->precopy_init_size + migration->precopy_dirty_size;
>>> +
>>> +    trace_vfio_state_pending_estimate(vbasedev->name, *must_precopy,
>>> +                                      *can_postcopy,
>>> + migration->precopy_init_size,
>>> + migration->precopy_dirty_size);
>>
>>
>> ok we have one :) I wonder if we should not update trace_vfio_save_iterate()
>> also with some values.
> 
> Hmm, yes, I guess it wouldn't hurt.
> 
>>
>>> +}
>>> +
>>>   /*
>>>    * Migration size of VFIO devices can be as little as a few KBs or as big as
>>>    * many GBs. This value should be big enough to cover the worst case.
>>>    */
>>>   #define VFIO_MIG_STOP_COPY_SIZE (100 * GiB)
>>>
>>> -/*
>>> - * Only exact function is implemented and not estimate function. The reason is
>>> - * that during pre-copy phase of migration the estimate function is called
>>> - * repeatedly while pending RAM size is over the threshold, thus migration
>>> - * can't converge and querying the VFIO device pending data size is useless.
>>> - */
>>>   static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
>>>                                        uint64_t *can_postcopy)
>>>   {
>>>       VFIODevice *vbasedev = opaque;
>>> +    VFIOMigration *migration = vbasedev->migration;
>>>       uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;
>>>
>>>       /*
>>> @@ -328,8 +423,49 @@ static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
>>>       vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
>>>       *must_precopy += stop_copy_size;
>>>
>>> +    if (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
>>> +        migration->precopy_init_size = 0;
>>> +        migration->precopy_dirty_size = 0;
>>> +        vfio_query_precopy_size(migration);
>>> +
>>> +        *must_precopy +=
>>> +            migration->precopy_init_size + migration->precopy_dirty_size;
>>> +    }
>>> +
>>>       trace_vfio_state_pending_exact(vbasedev->name, *must_precopy, *can_postcopy,
>>> -                                   stop_copy_size);
>>> +                                   stop_copy_size, migration->precopy_init_size,
>>> + migration->precopy_dirty_size);
>>> +}
>>> +
>>> +static bool vfio_is_active_iterate(void *opaque)
>>> +{
>>> +    VFIODevice *vbasedev = opaque;
>>> +    VFIOMigration *migration = vbasedev->migration;
>>> +
>>> +    return migration->device_state == VFIO_DEVICE_STATE_PRE_COPY;
>>> +}
>>> +
>>> +static int vfio_save_iterate(QEMUFile *f, void *opaque)
>>> +{
>>> +    VFIODevice *vbasedev = opaque;
>>> +    VFIOMigration *migration = vbasedev->migration;
>>> +    ssize_t data_size;
>>> +
>>> +    data_size = vfio_save_block(f, migration);
>>> +    if (data_size < 0) {
>>> +        return data_size;
>>> +    }
>>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>>> +
>>> +    vfio_update_estimated_pending_data(migration, data_size);
>>> +
>>> +    trace_vfio_save_iterate(vbasedev->name);
>>> +
>>> +    /*
>>> +     * A VFIO device's pre-copy dirty_bytes is not guaranteed to reach zero.
>>> +     * Return 1 so following handlers will not be potentially blocked.
>>
>> Can this condition be detected to warn the user ?
> 
> I don't think so, it depends on the kernel driver implementation.

OK. We will see how it evolves with time. We might need some message
saying pre copy is not converging.

Thanks,

C.


> 
> Thanks!
> 
>>
>>
>>> +     */
>>> +    return 1;
>>>   }
>>>
>>>   static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>>> @@ -338,7 +474,7 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>>>       ssize_t data_size;
>>>       int ret;
>>>
>>> -    /* We reach here with device state STOP only */
>>> +    /* We reach here with device state STOP or STOP_COPY only */
>>>       ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
>>>                                      VFIO_DEVICE_STATE_STOP);
>>>       if (ret) {
>>> @@ -457,7 +593,10 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>>>   static const SaveVMHandlers savevm_vfio_handlers = {
>>>       .save_setup = vfio_save_setup,
>>>       .save_cleanup = vfio_save_cleanup,
>>> +    .state_pending_estimate = vfio_state_pending_estimate,
>>>       .state_pending_exact = vfio_state_pending_exact,
>>> +    .is_active_iterate = vfio_is_active_iterate,
>>> +    .save_live_iterate = vfio_save_iterate,
>>>       .save_live_complete_precopy = vfio_save_complete_precopy,
>>>       .save_state = vfio_save_state,
>>>       .load_setup = vfio_load_setup,
>>> @@ -470,13 +609,18 @@ static const SaveVMHandlers savevm_vfio_handlers = {
>>>   static void vfio_vmstate_change(void *opaque, bool running, RunState state)
>>>   {
>>>       VFIODevice *vbasedev = opaque;
>>> +    VFIOMigration *migration = vbasedev->migration;
>>>       enum vfio_device_mig_state new_state;
>>>       int ret;
>>>
>>>       if (running) {
>>>           new_state = VFIO_DEVICE_STATE_RUNNING;
>>>       } else {
>>> -        new_state = VFIO_DEVICE_STATE_STOP;
>>> +        new_state =
>>> +            (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY &&
>>> +             (state == RUN_STATE_FINISH_MIGRATE || state == RUN_STATE_PAUSED)) ?
>>> +                VFIO_DEVICE_STATE_STOP_COPY :
>>> +                VFIO_DEVICE_STATE_STOP;
>>>       }
>>>
>>>       /*
>>> @@ -603,6 +747,7 @@ static int vfio_migration_init(VFIODevice *vbasedev)
>>>       migration->vbasedev = vbasedev;
>>>       migration->device_state = VFIO_DEVICE_STATE_RUNNING;
>>>       migration->data_fd = -1;
>>> +    migration->mig_flags = mig_flags;
>>>
>>>       vbasedev->dirty_pages_supported = vfio_dma_logging_supported(vbasedev);
>>>
>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>> index bf27a39905..72f30ce09f 100644
>>> --- a/hw/vfio/pci.c
>>> +++ b/hw/vfio/pci.c
>>> @@ -3335,6 +3335,8 @@ static Property vfio_pci_dev_properties[] = {
>>>       DEFINE_PROP_ON_OFF_AUTO("x-pre-copy-dirty-page-tracking", VFIOPCIDevice,
>>> vbasedev.pre_copy_dirty_page_tracking,
>>>                               ON_OFF_AUTO_ON),
>>> +    DEFINE_PROP_BOOL("x-allow-pre-copy", VFIOPCIDevice,
>>> +                     vbasedev.allow_pre_copy, true),
>>>       DEFINE_PROP_ON_OFF_AUTO("display", VFIOPCIDevice,
>>>                               display, ON_OFF_AUTO_OFF),
>>>       DEFINE_PROP_UINT32("xres", VFIOPCIDevice, display_xres, 0),
>>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>>> index 646e42fd27..fd6893cb43 100644
>>> --- a/hw/vfio/trace-events
>>> +++ b/hw/vfio/trace-events
>>> @@ -162,6 +162,8 @@ vfio_save_block(const char *name, int data_size) " (%s) data_size %d"
>>>   vfio_save_cleanup(const char *name) " (%s)"
>>>   vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d"
>>>   vfio_save_device_config_state(const char *name) " (%s)"
>>> +vfio_save_iterate(const char *name) " (%s)"
>>>   vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size 0x%"PRIx64
>>> -vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64
>>> +vfio_state_pending_estimate(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
>>> +vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
>>>   vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
>>
>
Avihai Horon May 25, 2023, 9:24 a.m. UTC | #4
On 24/05/2023 18:38, Cédric Le Goater wrote:
> External email: Use caution opening links or attachments
>
>
> Hello Avihai,
>
> On 5/24/23 14:49, Avihai Horon wrote:
>>
>> On 23/05/2023 17:56, Cédric Le Goater wrote:
>>> External email: Use caution opening links or attachments
>>>
>>>
>>> Hello Avihai,
>>>
>>> On 5/21/23 17:18, Avihai Horon wrote:
>>>> Pre-copy support allows the VFIO device data to be transferred 
>>>> while the
>>>> VM is running. This helps to accommodate VFIO devices that have a 
>>>> large
>>>> amount of data that needs to be transferred, and it can reduce 
>>>> migration
>>>> downtime.
>>>>
>>>> Pre-copy support is optional in VFIO migration protocol v2.
>>>> Implement pre-copy of VFIO migration protocol v2 and use it for 
>>>> devices
>>>> that support it. Full description of it can be found here [1].
>>>>
>>>> In addition, add a new VFIO device property x-allow-pre-copy to keep
>>>> migration compatibility to/from older QEMU versions that don't have 
>>>> VFIO
>>>> pre-copy support.
>>>>
>>>> [1]
>>>> https://lore.kernel.org/kvm/20221206083438.37807-3-yishaih@nvidia.com/
>>>
>>>
>>> May be simply reference Linux commit 4db52602a607 ("vfio: Extend the 
>>> device
>>> migration protocol with PRE_COPY") instead.
>>
>> Sure, I will change it.
>>
>>>
>>> some comments below,
>>>
>>>
>>>>
>>>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>>>> ---
>>>>   docs/devel/vfio-migration.rst |  35 +++++---
>>>>   include/hw/vfio/vfio-common.h |   4 +
>>>>   hw/core/machine.c             |   1 +
>>>>   hw/vfio/common.c              |   6 +-
>>>>   hw/vfio/migration.c           | 163 
>>>> ++++++++++++++++++++++++++++++++--
>>>>   hw/vfio/pci.c                 |   2 +
>>>>   hw/vfio/trace-events          |   4 +-
>>>>   7 files changed, 193 insertions(+), 22 deletions(-)
>>>>
>>>> diff --git a/docs/devel/vfio-migration.rst 
>>>> b/docs/devel/vfio-migration.rst
>>>> index 1b68ccf115..e896b2a673 100644
>>>> --- a/docs/devel/vfio-migration.rst
>>>> +++ b/docs/devel/vfio-migration.rst
>>>> @@ -7,12 +7,14 @@ the guest is running on source host and restoring 
>>>> this saved state on the
>>>>   destination host. This document details how saving and restoring 
>>>> of VFIO
>>>>   devices is done in QEMU.
>>>>
>>>> -Migration of VFIO devices currently consists of a single 
>>>> stop-and-copy phase.
>>>> -During the stop-and-copy phase the guest is stopped and the entire 
>>>> VFIO device
>>>> -data is transferred to the destination.
>>>> -
>>>> -The pre-copy phase of migration is currently not supported for 
>>>> VFIO devices.
>>>> -Support for VFIO pre-copy will be added later on.
>>>> +Migration of VFIO devices consists of two phases: the optional 
>>>> pre-copy phase,
>>>> +and the stop-and-copy phase. The pre-copy phase is iterative and 
>>>> allows to
>>>> +accommodate VFIO devices that have a large amount of data that 
>>>> needs to be
>>>> +transferred. The iterative pre-copy phase of migration allows for 
>>>> the guest to
>>>> +continue whilst the VFIO device state is transferred to the 
>>>> destination, this
>>>> +helps to reduce the total downtime of the VM. VFIO devices opt-in 
>>>> to pre-copy
>>>> +support by reporting the VFIO_MIGRATION_PRE_COPY flag in the
>>>> +VFIO_DEVICE_FEATURE_MIGRATION ioctl.
>>>>
>>>>   Note that currently VFIO migration is supported only for a single 
>>>> device. This
>>>>   is due to VFIO migration's lack of P2P support. However, P2P 
>>>> support is planned
>>>> @@ -29,10 +31,20 @@ VFIO implements the device hooks for the 
>>>> iterative approach as follows:
>>>>   * A ``load_setup`` function that sets the VFIO device on the 
>>>> destination in
>>>>     _RESUMING state.
>>>>
>>>> +* A ``state_pending_estimate`` function that reports an estimate 
>>>> of the
>>>> +  remaining pre-copy data that the vendor driver has yet to save 
>>>> for the VFIO
>>>> +  device.
>>>> +
>>>>   * A ``state_pending_exact`` function that reads pending_bytes 
>>>> from the vendor
>>>>     driver, which indicates the amount of data that the vendor 
>>>> driver has yet to
>>>>     save for the VFIO device.
>>>>
>>>> +* An ``is_active_iterate`` function that indicates 
>>>> ``save_live_iterate`` is
>>>> +  active only when the VFIO device is in pre-copy states.
>>>> +
>>>> +* A ``save_live_iterate`` function that reads the VFIO device's 
>>>> data from the
>>>> +  vendor driver during iterative pre-copy phase.
>>>> +
>>>>   * A ``save_state`` function to save the device config space if it 
>>>> is present.
>>>>
>>>>   * A ``save_live_complete_precopy`` function that sets the VFIO 
>>>> device in
>>>> @@ -111,8 +123,10 @@ Flow of state changes during Live migration
>>>>   ===========================================
>>>>
>>>>   Below is the flow of state change during live migration.
>>>> -The values in the brackets represent the VM state, the migration 
>>>> state, and
>>>> +The values in the parentheses represent the VM state, the 
>>>> migration state, and
>>>>   the VFIO device state, respectively.
>>>> +The text in the square brackets represents the flow if the VFIO 
>>>> device supports
>>>> +pre-copy.
>>>>
>>>>   Live migration save path
>>>>   ------------------------
>>>> @@ -124,11 +138,12 @@ Live migration save path
>>>>                                     |
>>>>                        migrate_init spawns migration_thread
>>>>                   Migration thread then calls each device's 
>>>> .save_setup()
>>>> -                       (RUNNING, _SETUP, _RUNNING)
>>>> +                  (RUNNING, _SETUP, _RUNNING [_PRE_COPY])
>>>>                                     |
>>>> -                      (RUNNING, _ACTIVE, _RUNNING)
>>>> -             If device is active, get pending_bytes by 
>>>> .state_pending_exact()
>>>> +                  (RUNNING, _ACTIVE, _RUNNING [_PRE_COPY])
>>>> +      If device is active, get pending_bytes by 
>>>> .state_pending_{estimate,exact}()
>>>>             If total pending_bytes >= threshold_size, call 
>>>> .save_live_iterate()
>>>> +                  [Data of VFIO device for pre-copy phase is copied]
>>>>           Iterate till total pending bytes converge and are less 
>>>> than threshold
>>>>                                     |
>>>>     On migration completion, vCPU stops and calls 
>>>> .save_live_complete_precopy for
>>>> diff --git a/include/hw/vfio/vfio-common.h 
>>>> b/include/hw/vfio/vfio-common.h
>>>> index eed244f25f..5ce7a01d56 100644
>>>> --- a/include/hw/vfio/vfio-common.h
>>>> +++ b/include/hw/vfio/vfio-common.h
>>>> @@ -66,6 +66,9 @@ typedef struct VFIOMigration {
>>>>       int data_fd;
>>>>       void *data_buffer;
>>>>       size_t data_buffer_size;
>>>> +    uint64_t precopy_init_size;
>>>> +    uint64_t precopy_dirty_size;
>>>> +    uint64_t mig_flags;
>>>
>>> It would have been cleaner to introduce VFIOMigration::mig_flags and 
>>> its
>>> update in another patch. This is minor.
>>
>> Sure, I will split it.
>>
>>>
>>>
>>>>   } VFIOMigration;
>>>>
>>>>   typedef struct VFIOAddressSpace {
>>>> @@ -143,6 +146,7 @@ typedef struct VFIODevice {
>>>>       VFIOMigration *migration;
>>>>       Error *migration_blocker;
>>>>       OnOffAuto pre_copy_dirty_page_tracking;
>>>> +    bool allow_pre_copy;
>>>
>>> same comment for this bool and the associated property, because it 
>>> would
>>> ease backports.
>>
>> Sure.
>> Just for general knowledge, can you explain how this could ease 
>> backports?
>
> The downstream machine names are different. Each distro has its own
> flavor. So adding a machine option always require some massaging.
>
> That might change in the future though.
>
> Anyhow, it is good pratice to isolate a change adding a restriction
> or a hw compatibility in its own patch.

Ah, I see.
Thanks for the explanation.

>
>
>>
>>>
>>>
>>>>       bool dirty_pages_supported;
>>>>       bool dirty_tracking;
>>>>   } VFIODevice;
>>>> diff --git a/hw/core/machine.c b/hw/core/machine.c
>>>> index 07f763eb2e..50439e5cbb 100644
>>>> --- a/hw/core/machine.c
>>>> +++ b/hw/core/machine.c
>>>> @@ -41,6 +41,7 @@
>>>>
>>>>   GlobalProperty hw_compat_8_0[] = {
>>>>       { "migration", "multifd-flush-after-each-section", "on"},
>>>> +    { "vfio-pci", "x-allow-pre-copy", "false" },
>>>>   };
>>>>   const size_t hw_compat_8_0_len = G_N_ELEMENTS(hw_compat_8_0);
>>>>
>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>>> index 78358ede27..b73086e17a 100644
>>>> --- a/hw/vfio/common.c
>>>> +++ b/hw/vfio/common.c
>>>> @@ -492,7 +492,8 @@ static bool 
>>>> vfio_devices_all_dirty_tracking(VFIOContainer *container)
>>>>               }
>>>>
>>>>               if (vbasedev->pre_copy_dirty_page_tracking == 
>>>> ON_OFF_AUTO_OFF &&
>>>> -                migration->device_state == 
>>>> VFIO_DEVICE_STATE_RUNNING) {
>>>> +                (migration->device_state == 
>>>> VFIO_DEVICE_STATE_RUNNING ||
>>>> +                 migration->device_state == 
>>>> VFIO_DEVICE_STATE_PRE_COPY)) {
>>>>                   return false;
>>>>               }
>>>>           }
>>>> @@ -537,7 +538,8 @@ static bool 
>>>> vfio_devices_all_running_and_mig_active(VFIOContainer *container)
>>>>                   return false;
>>>>               }
>>>>
>>>> -            if (migration->device_state == 
>>>> VFIO_DEVICE_STATE_RUNNING) {
>>>> +            if (migration->device_state == 
>>>> VFIO_DEVICE_STATE_RUNNING ||
>>>> +                migration->device_state == 
>>>> VFIO_DEVICE_STATE_PRE_COPY) {
>>>>                   continue;
>>>>               } else {
>>>>                   return false;
>>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>>>> index 235978fd68..418efed019 100644
>>>> --- a/hw/vfio/migration.c
>>>> +++ b/hw/vfio/migration.c
>>>> @@ -68,6 +68,8 @@ static const char *mig_state_to_str(enum 
>>>> vfio_device_mig_state state)
>>>>           return "STOP_COPY";
>>>>       case VFIO_DEVICE_STATE_RESUMING:
>>>>           return "RESUMING";
>>>> +    case VFIO_DEVICE_STATE_PRE_COPY:
>>>> +        return "PRE_COPY";
>>>>       default:
>>>>           return "UNKNOWN STATE";
>>>>       }
>>>> @@ -241,6 +243,22 @@ static int 
>>>> vfio_query_stop_copy_size(VFIODevice *vbasedev,
>>>>       return 0;
>>>>   }
>>>>
>>>> +static int vfio_query_precopy_size(VFIOMigration *migration)
>>>> +{
>>>> +    struct vfio_precopy_info precopy = {
>>>> +        .argsz = sizeof(precopy),
>>>> +    };
>>>
>>> May be move here  :
>>>
>>>         migration->precopy_init_size = 0;
>>>         migration->precopy_dirty_size = 0;
>>>
>>> since the values are reset always before calling 
>>> vfio_query_precopy_size()
>>
>> OK.
>> I will also reset these values in vfio_save_cleanup() so there won't 
>> be stale values in case migration is cancelled or fails.
>>
>>>
>>>
>>>> +
>>>> +    if (ioctl(migration->data_fd, VFIO_MIG_GET_PRECOPY_INFO, 
>>>> &precopy)) {
>>>> +        return -errno;
>>>> +    }
>>>> +
>>>> +    migration->precopy_init_size = precopy.initial_bytes;
>>>> +    migration->precopy_dirty_size = precopy.dirty_bytes;
>>>> +
>>>> +    return 0;
>>>> +}
>>>> +
>>>>   /* Returns the size of saved data on success and -errno on error */
>>>>   static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration 
>>>> *migration)
>>>>   {
>>>> @@ -249,6 +267,11 @@ static ssize_t vfio_save_block(QEMUFile *f, 
>>>> VFIOMigration *migration)
>>>>       data_size = read(migration->data_fd, migration->data_buffer,
>>>>                        migration->data_buffer_size);
>>>>       if (data_size < 0) {
>>>> +        /* Pre-copy emptied all the device state for now */
>>>> +        if (errno == ENOMSG) {
>>>
>>> Could you explain a little more this errno please ? It looks like an 
>>> API with
>>> the VFIO PCI variant kernel driver.
>>
>> Yes, it's explained in the precopy uAPI [1].
>
> Oh. I forgot to look at the uAPI file and only checked the driver.
> All good then.
>
>> Do you want to change the comment to something like the following?
>
> Yes, please. It would be good for the reader to have a reference on
> the kernel uAPI.
>
>> /*
>>   * ENOMSG indicates that the migration data_fd has reached a temporary
>>   * "end of stream", i.e. both initial_bytes and dirty_bytes are zero.
>>   * More data may be available later in future reads.
>>   */
>
> Please add something like :
>
>  "For more information, please refer to the Linux kernel VFIO uAPI"
>
> No need for links or file names.
>
OK, I will add that.

Thanks!

>
>> [1] 
>> https://elixir.bootlin.com/linux/v6.4-rc1/source/include/uapi/linux/vfio.h#L1084
>>
>>>
>>>> +            return 0;
>>>> +        }
>>>> +
>>>>           return -errno;
>>>>       }
>>>>       if (data_size == 0) {
>>>> @@ -265,6 +288,39 @@ static ssize_t vfio_save_block(QEMUFile *f, 
>>>> VFIOMigration *migration)
>>>>       return qemu_file_get_error(f) ?: data_size;
>>>>   }
>>>>
>>>> +static void vfio_update_estimated_pending_data(VFIOMigration 
>>>> *migration,
>>>> +                                               uint64_t data_size)
>>>> +{
>>>> +    if (!data_size) {
>>>> +        /*
>>>> +         * Pre-copy emptied all the device state for now, update 
>>>> estimated sizes
>>>> +         * accordingly.
>>>> +         */
>>>> +        migration->precopy_init_size = 0;
>>>> +        migration->precopy_dirty_size = 0;
>>>> +
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    if (migration->precopy_init_size) {
>>>> +        uint64_t init_size = MIN(migration->precopy_init_size, 
>>>> data_size);
>>>> +
>>>> +        migration->precopy_init_size -= init_size;
>>>> +        data_size -= init_size;
>>>> +    }
>>>> +
>>>> +    migration->precopy_dirty_size -= 
>>>> MIN(migration->precopy_dirty_size,
>>>> +                                         data_size);
>>>
>>> Do we have a trace event for all this data values ?
>>>
>>>> +}
>>>> +
>>>> +static bool vfio_precopy_supported(VFIODevice *vbasedev)
>>>> +{
>>>> +    VFIOMigration *migration = vbasedev->migration;
>>>> +
>>>> +    return vbasedev->allow_pre_copy &&
>>>> +           migration->mig_flags & VFIO_MIGRATION_PRE_COPY;
>>>> +}
>>>> +
>>>>   /* 
>>>> ---------------------------------------------------------------------- 
>>>> */
>>>>
>>>>   static int vfio_save_setup(QEMUFile *f, void *opaque)
>>>> @@ -285,6 +341,31 @@ static int vfio_save_setup(QEMUFile *f, void 
>>>> *opaque)
>>>>           return -ENOMEM;
>>>>       }
>>>>
>>>> +    if (vfio_precopy_supported(vbasedev)) {
>>>> +        int ret;
>>>> +
>>>> +        migration->precopy_init_size = 0;
>>>> +        migration->precopy_dirty_size = 0;
>>>> +
>>>> +        switch (migration->device_state) {
>>>> +        case VFIO_DEVICE_STATE_RUNNING:
>>>> +            ret = vfio_migration_set_state(vbasedev, 
>>>> VFIO_DEVICE_STATE_PRE_COPY,
>>>> + VFIO_DEVICE_STATE_RUNNING);
>>>> +            if (ret) {
>>>> +                return ret;
>>>> +            }
>>>> +
>>>> +            vfio_query_precopy_size(migration);
>>>> +
>>>> +            break;
>>>> +        case VFIO_DEVICE_STATE_STOP:
>>>> +            /* vfio_save_complete_precopy() will go to STOP_COPY */
>>>> +            break;
>>>> +        default:
>>>> +            return -EINVAL;
>>>> +        }
>>>> +    }
>>>> +
>>>>       trace_vfio_save_setup(vbasedev->name, 
>>>> migration->data_buffer_size);
>>>>
>>>>       qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>>>> @@ -303,22 +384,36 @@ static void vfio_save_cleanup(void *opaque)
>>>>       trace_vfio_save_cleanup(vbasedev->name);
>>>>   }
>>>>
>>>> +static void vfio_state_pending_estimate(void *opaque, uint64_t 
>>>> *must_precopy,
>>>> +                                        uint64_t *can_postcopy)
>>>> +{
>>>> +    VFIODevice *vbasedev = opaque;
>>>> +    VFIOMigration *migration = vbasedev->migration;
>>>> +
>>>> +    if (migration->device_state != VFIO_DEVICE_STATE_PRE_COPY) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    *must_precopy +=
>>>> +        migration->precopy_init_size + migration->precopy_dirty_size;
>>>> +
>>>> +    trace_vfio_state_pending_estimate(vbasedev->name, *must_precopy,
>>>> +                                      *can_postcopy,
>>>> + migration->precopy_init_size,
>>>> + migration->precopy_dirty_size);
>>>
>>>
>>> ok we have one :) I wonder if we should not update 
>>> trace_vfio_save_iterate()
>>> also with some values.
>>
>> Hmm, yes, I guess it wouldn't hurt.
>>
>>>
>>>> +}
>>>> +
>>>>   /*
>>>>    * Migration size of VFIO devices can be as little as a few KBs 
>>>> or as big as
>>>>    * many GBs. This value should be big enough to cover the worst 
>>>> case.
>>>>    */
>>>>   #define VFIO_MIG_STOP_COPY_SIZE (100 * GiB)
>>>>
>>>> -/*
>>>> - * Only exact function is implemented and not estimate function. 
>>>> The reason is
>>>> - * that during pre-copy phase of migration the estimate function 
>>>> is called
>>>> - * repeatedly while pending RAM size is over the threshold, thus 
>>>> migration
>>>> - * can't converge and querying the VFIO device pending data size 
>>>> is useless.
>>>> - */
>>>>   static void vfio_state_pending_exact(void *opaque, uint64_t 
>>>> *must_precopy,
>>>>                                        uint64_t *can_postcopy)
>>>>   {
>>>>       VFIODevice *vbasedev = opaque;
>>>> +    VFIOMigration *migration = vbasedev->migration;
>>>>       uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;
>>>>
>>>>       /*
>>>> @@ -328,8 +423,49 @@ static void vfio_state_pending_exact(void 
>>>> *opaque, uint64_t *must_precopy,
>>>>       vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
>>>>       *must_precopy += stop_copy_size;
>>>>
>>>> +    if (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
>>>> +        migration->precopy_init_size = 0;
>>>> +        migration->precopy_dirty_size = 0;
>>>> +        vfio_query_precopy_size(migration);
>>>> +
>>>> +        *must_precopy +=
>>>> +            migration->precopy_init_size + 
>>>> migration->precopy_dirty_size;
>>>> +    }
>>>> +
>>>>       trace_vfio_state_pending_exact(vbasedev->name, *must_precopy, 
>>>> *can_postcopy,
>>>> -                                   stop_copy_size);
>>>> +                                   stop_copy_size, 
>>>> migration->precopy_init_size,
>>>> + migration->precopy_dirty_size);
>>>> +}
>>>> +
>>>> +static bool vfio_is_active_iterate(void *opaque)
>>>> +{
>>>> +    VFIODevice *vbasedev = opaque;
>>>> +    VFIOMigration *migration = vbasedev->migration;
>>>> +
>>>> +    return migration->device_state == VFIO_DEVICE_STATE_PRE_COPY;
>>>> +}
>>>> +
>>>> +static int vfio_save_iterate(QEMUFile *f, void *opaque)
>>>> +{
>>>> +    VFIODevice *vbasedev = opaque;
>>>> +    VFIOMigration *migration = vbasedev->migration;
>>>> +    ssize_t data_size;
>>>> +
>>>> +    data_size = vfio_save_block(f, migration);
>>>> +    if (data_size < 0) {
>>>> +        return data_size;
>>>> +    }
>>>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>>>> +
>>>> +    vfio_update_estimated_pending_data(migration, data_size);
>>>> +
>>>> +    trace_vfio_save_iterate(vbasedev->name);
>>>> +
>>>> +    /*
>>>> +     * A VFIO device's pre-copy dirty_bytes is not guaranteed to 
>>>> reach zero.
>>>> +     * Return 1 so following handlers will not be potentially 
>>>> blocked.
>>>
>>> Can this condition be detected to warn the user ?
>>
>> I don't think so, it depends on the kernel driver implementation.
>
> OK. We will see how it evolves with time. We might need some message
> saying pre copy is not converging.
>
> Thanks,
>
> C.
>
>
>>
>> Thanks!
>>
>>>
>>>
>>>> +     */
>>>> +    return 1;
>>>>   }
>>>>
>>>>   static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>>>> @@ -338,7 +474,7 @@ static int vfio_save_complete_precopy(QEMUFile 
>>>> *f, void *opaque)
>>>>       ssize_t data_size;
>>>>       int ret;
>>>>
>>>> -    /* We reach here with device state STOP only */
>>>> +    /* We reach here with device state STOP or STOP_COPY only */
>>>>       ret = vfio_migration_set_state(vbasedev, 
>>>> VFIO_DEVICE_STATE_STOP_COPY,
>>>> VFIO_DEVICE_STATE_STOP);
>>>>       if (ret) {
>>>> @@ -457,7 +593,10 @@ static int vfio_load_state(QEMUFile *f, void 
>>>> *opaque, int version_id)
>>>>   static const SaveVMHandlers savevm_vfio_handlers = {
>>>>       .save_setup = vfio_save_setup,
>>>>       .save_cleanup = vfio_save_cleanup,
>>>> +    .state_pending_estimate = vfio_state_pending_estimate,
>>>>       .state_pending_exact = vfio_state_pending_exact,
>>>> +    .is_active_iterate = vfio_is_active_iterate,
>>>> +    .save_live_iterate = vfio_save_iterate,
>>>>       .save_live_complete_precopy = vfio_save_complete_precopy,
>>>>       .save_state = vfio_save_state,
>>>>       .load_setup = vfio_load_setup,
>>>> @@ -470,13 +609,18 @@ static const SaveVMHandlers 
>>>> savevm_vfio_handlers = {
>>>>   static void vfio_vmstate_change(void *opaque, bool running, 
>>>> RunState state)
>>>>   {
>>>>       VFIODevice *vbasedev = opaque;
>>>> +    VFIOMigration *migration = vbasedev->migration;
>>>>       enum vfio_device_mig_state new_state;
>>>>       int ret;
>>>>
>>>>       if (running) {
>>>>           new_state = VFIO_DEVICE_STATE_RUNNING;
>>>>       } else {
>>>> -        new_state = VFIO_DEVICE_STATE_STOP;
>>>> +        new_state =
>>>> +            (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY &&
>>>> +             (state == RUN_STATE_FINISH_MIGRATE || state == 
>>>> RUN_STATE_PAUSED)) ?
>>>> +                VFIO_DEVICE_STATE_STOP_COPY :
>>>> +                VFIO_DEVICE_STATE_STOP;
>>>>       }
>>>>
>>>>       /*
>>>> @@ -603,6 +747,7 @@ static int vfio_migration_init(VFIODevice 
>>>> *vbasedev)
>>>>       migration->vbasedev = vbasedev;
>>>>       migration->device_state = VFIO_DEVICE_STATE_RUNNING;
>>>>       migration->data_fd = -1;
>>>> +    migration->mig_flags = mig_flags;
>>>>
>>>>       vbasedev->dirty_pages_supported = 
>>>> vfio_dma_logging_supported(vbasedev);
>>>>
>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>> index bf27a39905..72f30ce09f 100644
>>>> --- a/hw/vfio/pci.c
>>>> +++ b/hw/vfio/pci.c
>>>> @@ -3335,6 +3335,8 @@ static Property vfio_pci_dev_properties[] = {
>>>> DEFINE_PROP_ON_OFF_AUTO("x-pre-copy-dirty-page-tracking", 
>>>> VFIOPCIDevice,
>>>> vbasedev.pre_copy_dirty_page_tracking,
>>>>                               ON_OFF_AUTO_ON),
>>>> +    DEFINE_PROP_BOOL("x-allow-pre-copy", VFIOPCIDevice,
>>>> +                     vbasedev.allow_pre_copy, true),
>>>>       DEFINE_PROP_ON_OFF_AUTO("display", VFIOPCIDevice,
>>>>                               display, ON_OFF_AUTO_OFF),
>>>>       DEFINE_PROP_UINT32("xres", VFIOPCIDevice, display_xres, 0),
>>>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>>>> index 646e42fd27..fd6893cb43 100644
>>>> --- a/hw/vfio/trace-events
>>>> +++ b/hw/vfio/trace-events
>>>> @@ -162,6 +162,8 @@ vfio_save_block(const char *name, int 
>>>> data_size) " (%s) data_size %d"
>>>>   vfio_save_cleanup(const char *name) " (%s)"
>>>>   vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d"
>>>>   vfio_save_device_config_state(const char *name) " (%s)"
>>>> +vfio_save_iterate(const char *name) " (%s)"
>>>>   vfio_save_setup(const char *name, uint64_t data_buffer_size) " 
>>>> (%s) data buffer size 0x%"PRIx64
>>>> -vfio_state_pending_exact(const char *name, uint64_t precopy, 
>>>> uint64_t postcopy, uint64_t stopcopy_size) " (%s) precopy 
>>>> 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64
>>>> +vfio_state_pending_estimate(const char *name, uint64_t precopy, 
>>>> uint64_t postcopy, uint64_t precopy_init_size, uint64_t 
>>>> precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" 
>>>> precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
>>>> +vfio_state_pending_exact(const char *name, uint64_t precopy, 
>>>> uint64_t postcopy, uint64_t stopcopy_size, uint64_t 
>>>> precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 
>>>> 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64" precopy 
>>>> initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
>>>>   vfio_vmstate_change(const char *name, int running, const char 
>>>> *reason, const char *dev_state) " (%s) running %d reason %s device 
>>>> state %s"
>>>
>>
>
diff mbox series

Patch

diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
index 1b68ccf115..e896b2a673 100644
--- a/docs/devel/vfio-migration.rst
+++ b/docs/devel/vfio-migration.rst
@@ -7,12 +7,14 @@  the guest is running on source host and restoring this saved state on the
 destination host. This document details how saving and restoring of VFIO
 devices is done in QEMU.
 
-Migration of VFIO devices currently consists of a single stop-and-copy phase.
-During the stop-and-copy phase the guest is stopped and the entire VFIO device
-data is transferred to the destination.
-
-The pre-copy phase of migration is currently not supported for VFIO devices.
-Support for VFIO pre-copy will be added later on.
+Migration of VFIO devices consists of two phases: the optional pre-copy phase,
+and the stop-and-copy phase. The pre-copy phase is iterative and allows to
+accommodate VFIO devices that have a large amount of data that needs to be
+transferred. The iterative pre-copy phase of migration allows for the guest to
+continue whilst the VFIO device state is transferred to the destination, this
+helps to reduce the total downtime of the VM. VFIO devices opt-in to pre-copy
+support by reporting the VFIO_MIGRATION_PRE_COPY flag in the
+VFIO_DEVICE_FEATURE_MIGRATION ioctl.
 
 Note that currently VFIO migration is supported only for a single device. This
 is due to VFIO migration's lack of P2P support. However, P2P support is planned
@@ -29,10 +31,20 @@  VFIO implements the device hooks for the iterative approach as follows:
 * A ``load_setup`` function that sets the VFIO device on the destination in
   _RESUMING state.
 
+* A ``state_pending_estimate`` function that reports an estimate of the
+  remaining pre-copy data that the vendor driver has yet to save for the VFIO
+  device.
+
 * A ``state_pending_exact`` function that reads pending_bytes from the vendor
   driver, which indicates the amount of data that the vendor driver has yet to
   save for the VFIO device.
 
+* An ``is_active_iterate`` function that indicates ``save_live_iterate`` is
+  active only when the VFIO device is in pre-copy states.
+
+* A ``save_live_iterate`` function that reads the VFIO device's data from the
+  vendor driver during iterative pre-copy phase.
+
 * A ``save_state`` function to save the device config space if it is present.
 
 * A ``save_live_complete_precopy`` function that sets the VFIO device in
@@ -111,8 +123,10 @@  Flow of state changes during Live migration
 ===========================================
 
 Below is the flow of state change during live migration.
-The values in the brackets represent the VM state, the migration state, and
+The values in the parentheses represent the VM state, the migration state, and
 the VFIO device state, respectively.
+The text in the square brackets represents the flow if the VFIO device supports
+pre-copy.
 
 Live migration save path
 ------------------------
@@ -124,11 +138,12 @@  Live migration save path
                                   |
                      migrate_init spawns migration_thread
                 Migration thread then calls each device's .save_setup()
-                       (RUNNING, _SETUP, _RUNNING)
+                  (RUNNING, _SETUP, _RUNNING [_PRE_COPY])
                                   |
-                      (RUNNING, _ACTIVE, _RUNNING)
-             If device is active, get pending_bytes by .state_pending_exact()
+                  (RUNNING, _ACTIVE, _RUNNING [_PRE_COPY])
+      If device is active, get pending_bytes by .state_pending_{estimate,exact}()
           If total pending_bytes >= threshold_size, call .save_live_iterate()
+                  [Data of VFIO device for pre-copy phase is copied]
         Iterate till total pending bytes converge and are less than threshold
                                   |
   On migration completion, vCPU stops and calls .save_live_complete_precopy for
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index eed244f25f..5ce7a01d56 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -66,6 +66,9 @@  typedef struct VFIOMigration {
     int data_fd;
     void *data_buffer;
     size_t data_buffer_size;
+    uint64_t precopy_init_size;
+    uint64_t precopy_dirty_size;
+    uint64_t mig_flags;
 } VFIOMigration;
 
 typedef struct VFIOAddressSpace {
@@ -143,6 +146,7 @@  typedef struct VFIODevice {
     VFIOMigration *migration;
     Error *migration_blocker;
     OnOffAuto pre_copy_dirty_page_tracking;
+    bool allow_pre_copy;
     bool dirty_pages_supported;
     bool dirty_tracking;
 } VFIODevice;
diff --git a/hw/core/machine.c b/hw/core/machine.c
index 07f763eb2e..50439e5cbb 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -41,6 +41,7 @@ 
 
 GlobalProperty hw_compat_8_0[] = {
     { "migration", "multifd-flush-after-each-section", "on"},
+    { "vfio-pci", "x-allow-pre-copy", "false" },
 };
 const size_t hw_compat_8_0_len = G_N_ELEMENTS(hw_compat_8_0);
 
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 78358ede27..b73086e17a 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -492,7 +492,8 @@  static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
             }
 
             if (vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF &&
-                migration->device_state == VFIO_DEVICE_STATE_RUNNING) {
+                (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
+                 migration->device_state == VFIO_DEVICE_STATE_PRE_COPY)) {
                 return false;
             }
         }
@@ -537,7 +538,8 @@  static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
                 return false;
             }
 
-            if (migration->device_state == VFIO_DEVICE_STATE_RUNNING) {
+            if (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
+                migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
                 continue;
             } else {
                 return false;
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 235978fd68..418efed019 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -68,6 +68,8 @@  static const char *mig_state_to_str(enum vfio_device_mig_state state)
         return "STOP_COPY";
     case VFIO_DEVICE_STATE_RESUMING:
         return "RESUMING";
+    case VFIO_DEVICE_STATE_PRE_COPY:
+        return "PRE_COPY";
     default:
         return "UNKNOWN STATE";
     }
@@ -241,6 +243,22 @@  static int vfio_query_stop_copy_size(VFIODevice *vbasedev,
     return 0;
 }
 
+static int vfio_query_precopy_size(VFIOMigration *migration)
+{
+    struct vfio_precopy_info precopy = {
+        .argsz = sizeof(precopy),
+    };
+
+    if (ioctl(migration->data_fd, VFIO_MIG_GET_PRECOPY_INFO, &precopy)) {
+        return -errno;
+    }
+
+    migration->precopy_init_size = precopy.initial_bytes;
+    migration->precopy_dirty_size = precopy.dirty_bytes;
+
+    return 0;
+}
+
 /* Returns the size of saved data on success and -errno on error */
 static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
 {
@@ -249,6 +267,11 @@  static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
     data_size = read(migration->data_fd, migration->data_buffer,
                      migration->data_buffer_size);
     if (data_size < 0) {
+        /* Pre-copy emptied all the device state for now */
+        if (errno == ENOMSG) {
+            return 0;
+        }
+
         return -errno;
     }
     if (data_size == 0) {
@@ -265,6 +288,39 @@  static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
     return qemu_file_get_error(f) ?: data_size;
 }
 
+static void vfio_update_estimated_pending_data(VFIOMigration *migration,
+                                               uint64_t data_size)
+{
+    if (!data_size) {
+        /*
+         * Pre-copy emptied all the device state for now, update estimated sizes
+         * accordingly.
+         */
+        migration->precopy_init_size = 0;
+        migration->precopy_dirty_size = 0;
+
+        return;
+    }
+
+    if (migration->precopy_init_size) {
+        uint64_t init_size = MIN(migration->precopy_init_size, data_size);
+
+        migration->precopy_init_size -= init_size;
+        data_size -= init_size;
+    }
+
+    migration->precopy_dirty_size -= MIN(migration->precopy_dirty_size,
+                                         data_size);
+}
+
+static bool vfio_precopy_supported(VFIODevice *vbasedev)
+{
+    VFIOMigration *migration = vbasedev->migration;
+
+    return vbasedev->allow_pre_copy &&
+           migration->mig_flags & VFIO_MIGRATION_PRE_COPY;
+}
+
 /* ---------------------------------------------------------------------- */
 
 static int vfio_save_setup(QEMUFile *f, void *opaque)
@@ -285,6 +341,31 @@  static int vfio_save_setup(QEMUFile *f, void *opaque)
         return -ENOMEM;
     }
 
+    if (vfio_precopy_supported(vbasedev)) {
+        int ret;
+
+        migration->precopy_init_size = 0;
+        migration->precopy_dirty_size = 0;
+
+        switch (migration->device_state) {
+        case VFIO_DEVICE_STATE_RUNNING:
+            ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_PRE_COPY,
+                                           VFIO_DEVICE_STATE_RUNNING);
+            if (ret) {
+                return ret;
+            }
+
+            vfio_query_precopy_size(migration);
+
+            break;
+        case VFIO_DEVICE_STATE_STOP:
+            /* vfio_save_complete_precopy() will go to STOP_COPY */
+            break;
+        default:
+            return -EINVAL;
+        }
+    }
+
     trace_vfio_save_setup(vbasedev->name, migration->data_buffer_size);
 
     qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
@@ -303,22 +384,36 @@  static void vfio_save_cleanup(void *opaque)
     trace_vfio_save_cleanup(vbasedev->name);
 }
 
+static void vfio_state_pending_estimate(void *opaque, uint64_t *must_precopy,
+                                        uint64_t *can_postcopy)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+
+    if (migration->device_state != VFIO_DEVICE_STATE_PRE_COPY) {
+        return;
+    }
+
+    *must_precopy +=
+        migration->precopy_init_size + migration->precopy_dirty_size;
+
+    trace_vfio_state_pending_estimate(vbasedev->name, *must_precopy,
+                                      *can_postcopy,
+                                      migration->precopy_init_size,
+                                      migration->precopy_dirty_size);
+}
+
 /*
  * Migration size of VFIO devices can be as little as a few KBs or as big as
  * many GBs. This value should be big enough to cover the worst case.
  */
 #define VFIO_MIG_STOP_COPY_SIZE (100 * GiB)
 
-/*
- * Only exact function is implemented and not estimate function. The reason is
- * that during pre-copy phase of migration the estimate function is called
- * repeatedly while pending RAM size is over the threshold, thus migration
- * can't converge and querying the VFIO device pending data size is useless.
- */
 static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
                                      uint64_t *can_postcopy)
 {
     VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
     uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;
 
     /*
@@ -328,8 +423,49 @@  static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
     vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
     *must_precopy += stop_copy_size;
 
+    if (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
+        migration->precopy_init_size = 0;
+        migration->precopy_dirty_size = 0;
+        vfio_query_precopy_size(migration);
+
+        *must_precopy +=
+            migration->precopy_init_size + migration->precopy_dirty_size;
+    }
+
     trace_vfio_state_pending_exact(vbasedev->name, *must_precopy, *can_postcopy,
-                                   stop_copy_size);
+                                   stop_copy_size, migration->precopy_init_size,
+                                   migration->precopy_dirty_size);
+}
+
+static bool vfio_is_active_iterate(void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+
+    return migration->device_state == VFIO_DEVICE_STATE_PRE_COPY;
+}
+
+static int vfio_save_iterate(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    ssize_t data_size;
+
+    data_size = vfio_save_block(f, migration);
+    if (data_size < 0) {
+        return data_size;
+    }
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    vfio_update_estimated_pending_data(migration, data_size);
+
+    trace_vfio_save_iterate(vbasedev->name);
+
+    /*
+     * A VFIO device's pre-copy dirty_bytes is not guaranteed to reach zero.
+     * Return 1 so following handlers will not be potentially blocked.
+     */
+    return 1;
 }
 
 static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
@@ -338,7 +474,7 @@  static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
     ssize_t data_size;
     int ret;
 
-    /* We reach here with device state STOP only */
+    /* We reach here with device state STOP or STOP_COPY only */
     ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
                                    VFIO_DEVICE_STATE_STOP);
     if (ret) {
@@ -457,7 +593,10 @@  static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
 static const SaveVMHandlers savevm_vfio_handlers = {
     .save_setup = vfio_save_setup,
     .save_cleanup = vfio_save_cleanup,
+    .state_pending_estimate = vfio_state_pending_estimate,
     .state_pending_exact = vfio_state_pending_exact,
+    .is_active_iterate = vfio_is_active_iterate,
+    .save_live_iterate = vfio_save_iterate,
     .save_live_complete_precopy = vfio_save_complete_precopy,
     .save_state = vfio_save_state,
     .load_setup = vfio_load_setup,
@@ -470,13 +609,18 @@  static const SaveVMHandlers savevm_vfio_handlers = {
 static void vfio_vmstate_change(void *opaque, bool running, RunState state)
 {
     VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
     enum vfio_device_mig_state new_state;
     int ret;
 
     if (running) {
         new_state = VFIO_DEVICE_STATE_RUNNING;
     } else {
-        new_state = VFIO_DEVICE_STATE_STOP;
+        new_state =
+            (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY &&
+             (state == RUN_STATE_FINISH_MIGRATE || state == RUN_STATE_PAUSED)) ?
+                VFIO_DEVICE_STATE_STOP_COPY :
+                VFIO_DEVICE_STATE_STOP;
     }
 
     /*
@@ -603,6 +747,7 @@  static int vfio_migration_init(VFIODevice *vbasedev)
     migration->vbasedev = vbasedev;
     migration->device_state = VFIO_DEVICE_STATE_RUNNING;
     migration->data_fd = -1;
+    migration->mig_flags = mig_flags;
 
     vbasedev->dirty_pages_supported = vfio_dma_logging_supported(vbasedev);
 
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index bf27a39905..72f30ce09f 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3335,6 +3335,8 @@  static Property vfio_pci_dev_properties[] = {
     DEFINE_PROP_ON_OFF_AUTO("x-pre-copy-dirty-page-tracking", VFIOPCIDevice,
                             vbasedev.pre_copy_dirty_page_tracking,
                             ON_OFF_AUTO_ON),
+    DEFINE_PROP_BOOL("x-allow-pre-copy", VFIOPCIDevice,
+                     vbasedev.allow_pre_copy, true),
     DEFINE_PROP_ON_OFF_AUTO("display", VFIOPCIDevice,
                             display, ON_OFF_AUTO_OFF),
     DEFINE_PROP_UINT32("xres", VFIOPCIDevice, display_xres, 0),
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 646e42fd27..fd6893cb43 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -162,6 +162,8 @@  vfio_save_block(const char *name, int data_size) " (%s) data_size %d"
 vfio_save_cleanup(const char *name) " (%s)"
 vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d"
 vfio_save_device_config_state(const char *name) " (%s)"
+vfio_save_iterate(const char *name) " (%s)"
 vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size 0x%"PRIx64
-vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64
+vfio_state_pending_estimate(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
+vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
 vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"