
[01/18] vfio/migration: Add VFIO migration pre-copy support

Message ID 20230126184948.10478-2-avihaih@nvidia.com (mailing list archive)
State New, archived
Series vfio: Add migration pre-copy support and device dirty tracking

Commit Message

Avihai Horon Jan. 26, 2023, 6:49 p.m. UTC
Pre-copy support allows the VFIO device data to be transferred while the
VM is running. This helps to accommodate VFIO devices that have a large
amount of data that needs to be transferred, and it can reduce migration
downtime.

Pre-copy support is optional in VFIO migration protocol v2.
Implement pre-copy of VFIO migration protocol v2 and use it for devices
that support it. Full description of it can be found here [1].

[1]
https://lore.kernel.org/kvm/20221206083438.37807-3-yishaih@nvidia.com/

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 docs/devel/vfio-migration.rst |  29 ++++++---
 include/hw/vfio/vfio-common.h |   3 +
 hw/vfio/common.c              |   8 ++-
 hw/vfio/migration.c           | 112 ++++++++++++++++++++++++++++++++--
 hw/vfio/trace-events          |   5 +-
 5 files changed, 140 insertions(+), 17 deletions(-)

Comments

Alex Williamson Jan. 26, 2023, 11:52 p.m. UTC | #1
On Thu, 26 Jan 2023 20:49:31 +0200
Avihai Horon <avihaih@nvidia.com> wrote:

> Pre-copy support allows the VFIO device data to be transferred while the
> VM is running. This helps to accommodate VFIO devices that have a large
> amount of data that needs to be transferred, and it can reduce migration
> downtime.
> 
> Pre-copy support is optional in VFIO migration protocol v2.
> Implement pre-copy of VFIO migration protocol v2 and use it for devices
> that support it. Full description of it can be found here [1].
> 
> [1]
> https://lore.kernel.org/kvm/20221206083438.37807-3-yishaih@nvidia.com/
> 
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> ---
>  docs/devel/vfio-migration.rst |  29 ++++++---
>  include/hw/vfio/vfio-common.h |   3 +
>  hw/vfio/common.c              |   8 ++-
>  hw/vfio/migration.c           | 112 ++++++++++++++++++++++++++++++++--
>  hw/vfio/trace-events          |   5 +-
>  5 files changed, 140 insertions(+), 17 deletions(-)
> 
> diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
> index 1d50c2fe5f..51f5e1a537 100644
> --- a/docs/devel/vfio-migration.rst
> +++ b/docs/devel/vfio-migration.rst
> @@ -7,12 +7,14 @@ the guest is running on source host and restoring this saved state on the
>  destination host. This document details how saving and restoring of VFIO
>  devices is done in QEMU.
>  
> -Migration of VFIO devices currently consists of a single stop-and-copy phase.
> -During the stop-and-copy phase the guest is stopped and the entire VFIO device
> -data is transferred to the destination.
> -
> -The pre-copy phase of migration is currently not supported for VFIO devices.
> -Support for VFIO pre-copy will be added later on.
> +Migration of VFIO devices consists of two phases: the optional pre-copy phase,
> +and the stop-and-copy phase. The pre-copy phase is iterative and allows to
> +accommodate VFIO devices that have a large amount of data that needs to be
> +transferred. The iterative pre-copy phase of migration allows for the guest to
> +continue whilst the VFIO device state is transferred to the destination, this
> +helps to reduce the total downtime of the VM. VFIO devices can choose to skip
> +the pre-copy phase of migration by not reporting the VFIO_MIGRATION_PRE_COPY
> +flag in VFIO_DEVICE_FEATURE_MIGRATION ioctl.
>  
>  A detailed description of the UAPI for VFIO device migration can be found in
>  the comment for the ``vfio_device_mig_state`` structure in the header file
> @@ -29,6 +31,12 @@ VFIO implements the device hooks for the iterative approach as follows:
>    driver, which indicates the amount of data that the vendor driver has yet to
>    save for the VFIO device.
>  
> +* An ``is_active_iterate`` function that indicates ``save_live_iterate`` is
> +  active only if the VFIO device is in pre-copy states.
> +
> +* A ``save_live_iterate`` function that reads the VFIO device's data from the
> +  vendor driver during iterative phase.
> +
>  * A ``save_state`` function to save the device config space if it is present.
>  
>  * A ``save_live_complete_precopy`` function that sets the VFIO device in
> @@ -91,8 +99,10 @@ Flow of state changes during Live migration
>  ===========================================
>  
>  Below is the flow of state change during live migration.
> -The values in the brackets represent the VM state, the migration state, and
> +The values in the parentheses represent the VM state, the migration state, and
>  the VFIO device state, respectively.
> +The text in the square brackets represents the flow if the VFIO device supports
> +pre-copy.
>  
>  Live migration save path
>  ------------------------
> @@ -104,11 +114,12 @@ Live migration save path
>                                    |
>                       migrate_init spawns migration_thread
>                  Migration thread then calls each device's .save_setup()
> -                       (RUNNING, _SETUP, _RUNNING)
> +                  (RUNNING, _SETUP, _RUNNING [_PRE_COPY])
>                                    |
> -                      (RUNNING, _ACTIVE, _RUNNING)
> +                  (RUNNING, _ACTIVE, _RUNNING [_PRE_COPY])
>               If device is active, get pending_bytes by .save_live_pending()
>            If total pending_bytes >= threshold_size, call .save_live_iterate()
> +                  [Data of VFIO device for pre-copy phase is copied]
>          Iterate till total pending bytes converge and are less than threshold
>                                    |
>    On migration completion, vCPU stops and calls .save_live_complete_precopy for
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 5f8e7a02fe..88c2194fb9 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -67,7 +67,10 @@ typedef struct VFIOMigration {
>      int data_fd;
>      void *data_buffer;
>      size_t data_buffer_size;
> +    uint64_t mig_flags;
>      uint64_t stop_copy_size;
> +    uint64_t precopy_init_size;
> +    uint64_t precopy_dirty_size;
>  } VFIOMigration;
>  
>  typedef struct VFIOAddressSpace {
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 9a0dbee6b4..93b18c5e3d 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -357,7 +357,9 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
>  
>              if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) &&
>                  (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
> -                 migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P)) {
> +                 migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P ||
> +                 migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ||
> +                 migration->device_state == VFIO_DEVICE_STATE_PRE_COPY_P2P)) {

Should this just turn into a test that we're not in STOP_COPY?

>                  return false;
>              }
>          }
> @@ -387,7 +389,9 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
>              }
>  
>              if (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
> -                migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P) {
> +                migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P ||
> +                migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ||
> +                migration->device_state == VFIO_DEVICE_STATE_PRE_COPY_P2P) {
>                  continue;
>              } else {
>                  return false;

Hmm, this only seems to highlight that between this series and the
previous, we're adding tests for states that we never actually use, i.e.
these _P2P states.

IIRC, the reason we have these _P2P states is so that we can transition
a set of devices, which may have active P2P DMA between them, to STOP,
STOP_COPY, and even RUNNING states safely without losing data, given that
we cannot simultaneously transition all devices.  That suggests that
missing from both these series is support for bringing all devices to
these _P2P states before we move any device to one of the STOP, STOP_COPY,
or RUNNING states (in the case of RESUMING).

Also, I recall discussions that we need to enforce configuration
restrictions when not all devices support the _P2P states?  For example
adding a migration blocker when there are multiple vfio devices and at
least one of them does not support _P2P migration states.  Or perhaps
initially, requiring support for _P2P states.

I think what's implemented here, where we don't make use of the _P2P
states would require adding a migration blocker whenever there are
multiple vfio devices, regardless of the device support for _P2P.
Thanks,

Alex
Avihai Horon Jan. 31, 2023, 12:44 p.m. UTC | #2
On 27/01/2023 1:52, Alex Williamson wrote:
> On Thu, 26 Jan 2023 20:49:31 +0200
> Avihai Horon <avihaih@nvidia.com> wrote:
>
>> [...]
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index 9a0dbee6b4..93b18c5e3d 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -357,7 +357,9 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
>>
>>               if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) &&
>>                   (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
>> -                 migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P)) {
>> +                 migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P ||
>> +                 migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ||
>> +                 migration->device_state == VFIO_DEVICE_STATE_PRE_COPY_P2P)) {
> Should this just turn into a test that we're not in STOP_COPY?

Then we would need to check we are not in STOP_COPY and not in STOP.
The STOP check is for the case where PRE_COPY is not supported, since 
RAM will ask for dirty page sync when the device is in STOP state.
Without the STOP check, the device will skip the final dirty page sync.
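
The two formulations can be contrasted in a standalone sketch. The enum
values below mirror the Linux vfio_device_mig_state UAPI, but the helper
names are invented for illustration and are not QEMU code:

```c
#include <assert.h>
#include <stdbool.h>

/* Standalone copy of the vfio_device_mig_state UAPI values. */
enum vfio_device_mig_state {
    VFIO_DEVICE_STATE_ERROR = 0,
    VFIO_DEVICE_STATE_STOP = 1,
    VFIO_DEVICE_STATE_RUNNING = 2,
    VFIO_DEVICE_STATE_STOP_COPY = 3,
    VFIO_DEVICE_STATE_RESUMING = 4,
    VFIO_DEVICE_STATE_RUNNING_P2P = 5,
    VFIO_DEVICE_STATE_PRE_COPY = 6,
    VFIO_DEVICE_STATE_PRE_COPY_P2P = 7,
};

/* Formulation in the patch: enumerate every "device still running" state. */
static bool is_running_side(enum vfio_device_mig_state s)
{
    return s == VFIO_DEVICE_STATE_RUNNING ||
           s == VFIO_DEVICE_STATE_RUNNING_P2P ||
           s == VFIO_DEVICE_STATE_PRE_COPY ||
           s == VFIO_DEVICE_STATE_PRE_COPY_P2P;
}

/* The negated formulation discussed above: excluding only STOP_COPY is
 * not enough, because STOP must also be excluded so the final dirty page
 * sync requested while the device is stopped is not skipped. */
static bool not_stop_copy_or_stop(enum vfio_device_mig_state s)
{
    return s != VFIO_DEVICE_STATE_STOP_COPY &&
           s != VFIO_DEVICE_STATE_STOP;
}
```

For the states a saving device actually moves through, the two predicates
agree; the enumeration just states the intent positively.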

>
>>                   return false;
>>               }
>>           }
>> @@ -387,7 +389,9 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
>>               }
>>
>>               if (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
>> -                migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P) {
>> +                migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P ||
>> +                migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ||
>> +                migration->device_state == VFIO_DEVICE_STATE_PRE_COPY_P2P) {
>>                   continue;
>>               } else {
>>                   return false;
> Hmm, this only seems to highlight that between this series and the
> previous, we're adding tests for states that we never actually use, ie.
> these _P2P states.
>
> IIRC, the reason we have these _P2P states is so that we can transition
> a set of devices, which may have active P2P DMA between them, to STOP,
> STOP_COPY, and even RUNNING states safely without lost data given that
> we cannot simultaneously transition all devices.  That suggest that
> missing from both these series is support for bringing all devices to
> these _P2P states before we move any device to one of STOP, STOP_COPY,
> or RUNNING states (in the case of RESUMING).
>
> Also, I recall discussions that we need to enforce configuration
> restrictions when not all devices support the _P2P states?  For example
> adding a migration blocker when there are multiple vfio devices and at
> least one of them does not support _P2P migration states.  Or perhaps
> initially, requiring support for _P2P states.
>
> I think what's implemented here, where we don't make use of the _P2P
> states would require adding a migration blocker whenever there are
> multiple vfio devices, regardless of the device support for _P2P.

Yes, I think you are right.

I will add a migration blocker if there are multiple devices as part of 
v9 of the basic series.
When P2P support is added, I will block migration of multiple devices if 
one or more of them doesn't support P2P.
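
The two policies described above can be sketched as a standalone model.
The struct and helper names are invented for illustration; in QEMU the
actual mechanism would be registering a blocker via migrate_add_blocker(),
which is not shown here:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a VFIO device's migration capabilities. */
struct vfio_dev_model {
    bool supports_p2p;   /* device reports the _P2P migration states */
};

/* Interim policy (basic series v9): any multi-device config is blocked. */
static bool migration_blocked_v9(size_t ndevs)
{
    return ndevs > 1;
}

/* Later policy (once P2P support is added): multi-device migration is
 * allowed only if every device supports the _P2P migration states. */
static bool migration_blocked_with_p2p(const struct vfio_dev_model *devs,
                                       size_t ndevs)
{
    if (ndevs <= 1) {
        return false;   /* a single device never needs _P2P states */
    }
    for (size_t i = 0; i < ndevs; i++) {
        if (!devs[i].supports_p2p) {
            return true;
        }
    }
    return false;
}
```

Under the later policy, whether a given set of devices is migratable
depends on every member supporting _P2P, so adding one non-_P2P device to
an otherwise migratable set blocks the whole VM.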

Thanks.
Alex Williamson Jan. 31, 2023, 10:43 p.m. UTC | #3
On Tue, 31 Jan 2023 14:44:54 +0200
Avihai Horon <avihaih@nvidia.com> wrote:

> On 27/01/2023 1:52, Alex Williamson wrote:
> > On Thu, 26 Jan 2023 20:49:31 +0200
> > Avihai Horon <avihaih@nvidia.com> wrote:
> >  
> >> [...]
> >> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >> index 9a0dbee6b4..93b18c5e3d 100644
> >> --- a/hw/vfio/common.c
> >> +++ b/hw/vfio/common.c
> >> @@ -357,7 +357,9 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
> >>
> >>               if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) &&
> >>                   (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
> >> -                 migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P)) {
> >> +                 migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P ||
> >> +                 migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ||
> >> +                 migration->device_state == VFIO_DEVICE_STATE_PRE_COPY_P2P)) {  
> > Should this just turn into a test that we're not in STOP_COPY?  
> 
> Then we would need to check we are not in STOP_COPY and not in STOP.
> The STOP check is for the case where PRE_COPY is not supported, since 
> RAM will ask for dirty page sync when the device is in STOP state.
> Without the STOP check, the device will skip the final dirty page sync.
> 
> >  
> >>                   return false;
> >>               }
> >>           }
> >> @@ -387,7 +389,9 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
> >>               }
> >>
> >>               if (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
> >> -                migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P) {
> >> +                migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P ||
> >> +                migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ||
> >> +                migration->device_state == VFIO_DEVICE_STATE_PRE_COPY_P2P) {
> >>                   continue;
> >>               } else {
> >>                   return false;  
> > Hmm, this only seems to highlight that between this series and the
> > previous, we're adding tests for states that we never actually use, ie.
> > these _P2P states.
> >
> > IIRC, the reason we have these _P2P states is so that we can transition
> > a set of devices, which may have active P2P DMA between them, to STOP,
> > STOP_COPY, and even RUNNING states safely without lost data given that
> > we cannot simultaneously transition all devices.  That suggest that
> > missing from both these series is support for bringing all devices to
> > these _P2P states before we move any device to one of STOP, STOP_COPY,
> > or RUNNING states (in the case of RESUMING).
> >
> > Also, I recall discussions that we need to enforce configuration
> > restrictions when not all devices support the _P2P states?  For example
> > adding a migration blocker when there are multiple vfio devices and at
> > least one of them does not support _P2P migration states.  Or perhaps
> > initially, requiring support for _P2P states.
> >
> > I think what's implemented here, where we don't make use of the _P2P
> > states would require adding a migration blocker whenever there are
> > multiple vfio devices, regardless of the device support for _P2P.  
> 
> Yes, I think you are right.
> 
> I will add a migration blocker if there are multiple devices as part of 
> v9 of the basic series.
> When P2P support is added, I will block migration of multiple devices if 
> one or more of them doesn't support P2P.

How does this affect our path towards supported migration?  I'm
thinking about a user experience where QEMU supports migration if
device A OR device B are attached, but not devices A and B attached to
the same VM.  We might have a device C where QEMU supports migration
with B AND C, but not A AND C, nor A AND B AND C.  This would be the
case if device B and device C both supported P2P states, but device A
did not. The user has no observability of this feature, so all of this
looks effectively random to the user.

Even in the single device case, we need to make an assumption that a
device that does not support P2P migration states (or when QEMU doesn't
make use of P2P states) cannot be a DMA target, or otherwise have its
MMIO space accessed while in a STOP state.  Can we guarantee that when
other devices have not yet transitioned to STOP?

We could disable the direct map MemoryRegions when we move to a STOP
state, which would give QEMU visibility to those accesses, but besides
pulling an abort should such an access occur, could we queue them in
software, add them to the migration stream, and replay them after the
device moves to the RUNNING state?  We'd need to account for the lack of
RESUMING_P2P states as well, trapping and queueing accesses from devices
already RUNNING to those still in RESUMING (not _P2P).

This all looks complicated.  Is it better to start with requiring P2P
state support?  Thanks,

Alex
Jason Gunthorpe Jan. 31, 2023, 11:29 p.m. UTC | #4
On Tue, Jan 31, 2023 at 03:43:01PM -0700, Alex Williamson wrote:

> How does this affect our path towards supported migration?  I'm
> thinking about a user experience where QEMU supports migration if
> device A OR device B are attached, but not devices A and B attached to
> the same VM.  We might have a device C where QEMU supports migration
> with B AND C, but not A AND C, nor A AND B AND C.  This would be the
> case if device B and device C both supported P2P states, but device A
> did not. The user has no observability of this feature, so all of this
> looks effectively random to the user.

I think qemu should just log if it encounters a device without P2P
support.

> Even in the single device case, we need to make an assumption that a
> device that does not support P2P migration states (or when QEMU doesn't
> make use of P2P states) cannot be a DMA target, or otherwise have its
> MMIO space accessed while in a STOP state.  Can we guarantee that when
> other devices have not yet transitioned to STOP?

You mean the software devices created by qemu?

> We could disable the direct map MemoryRegions when we move to a STOP
> state, which would give QEMU visibility to those accesses, but besides
> pulling an abort should such an access occur, could we queue them in
> software, add them to the migration stream, and replay them after the
> device moves to the RUNNING state?  We'd need to account for the lack of
> RESUMING_P2P states as well, trapping and queue accesses from devices
> already RUNNING to those still in RESUMING (not _P2P).

I think any internal SW devices should just fail all accesses to the
P2P space, all the time.

qemu simply acts like a real system that doesn't support P2P.

IMHO this is generally the way forward to do multi-device as well,
remove the MMIO from all the address maps: VFIO, SW access, all of
them. Nothing can touch MMIO except for the vCPU.
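
The "remove MMIO from all the address maps" idea can be modeled as a toy
address-map lookup. All types and names here are invented for
illustration; in QEMU terms this would roughly correspond to leaving the
device BAR MemoryRegions out of every DMA-visible address space:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Who originates the access: the vCPU, or a DMA initiator (a VFIO
 * mapping or one of QEMU's internal software devices). */
enum initiator { INITIATOR_VCPU, INITIATOR_DMA };

struct range { uint64_t base, size; };

static bool in_range(const struct range *r, uint64_t addr)
{
    return addr >= r->base && addr - r->base < r->size;
}

/* Model of a system that does not support P2P: RAM is visible to all
 * initiators, while MMIO resolves only for vCPU accesses; any DMA
 * access targeting MMIO simply fails, as if the range were unmapped. */
static bool access_ok(uint64_t addr,
                      const struct range *ram,
                      const struct range *mmio,
                      enum initiator who)
{
    if (in_range(ram, addr)) {
        return true;                  /* RAM visible to everyone */
    }
    if (in_range(mmio, addr)) {
        return who == INITIATOR_VCPU; /* MMIO reachable only from vCPU */
    }
    return false;                     /* unmapped */
}
```

The migration implications then disappear: since no device can ever be a
DMA target, a device sitting in STOP while its peers still run cannot
receive an access that would be lost.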

> This all looks complicated.  Is it better to start with requiring P2P
> state support?  Thanks,

People have built HW without it, so I don't see this as so good..

Jason
Alex Williamson Feb. 1, 2023, 4:15 a.m. UTC | #5
On Tue, 31 Jan 2023 19:29:48 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Jan 31, 2023 at 03:43:01PM -0700, Alex Williamson wrote:
> 
> > How does this affect our path towards supported migration?  I'm
> > thinking about a user experience where QEMU supports migration if
> > device A OR device B are attached, but not devices A and B attached to
> > the same VM.  We might have a device C where QEMU supports migration
> > with B AND C, but not A AND C, nor A AND B AND C.  This would be the
> > case if device B and device C both supported P2P states, but device A
> > did not. The user has no observability of this feature, so all of this
> > looks effectively random to the user.  
> 
> I think qemu should just log if it encounters a device without P2P
> support.

Better for debugging, but still poor from a VM management perspective.

> > Even in the single device case, we need to make an assumption that a
> > device that does not support P2P migration states (or when QEMU doesn't
> > make use of P2P states) cannot be a DMA target, or otherwise have its
> > MMIO space accessed while in a STOP state.  Can we guarantee that when
> > other devices have not yet transitioned to STOP?  
> 
> You mean the software devices created by qemu?

Other devices, software or otherwise, yes.

> > We could disable the direct map MemoryRegions when we move to a STOP
> > state, which would give QEMU visibility to those accesses, but besides
> > pulling an abort should such an access occur, could we queue them in
> > software, add them to the migration stream, and replay them after the
> > device moves to the RUNNING state?  We'd need to account for the lack of
> > RESUMING_P2P states as well, trapping and queue accesses from devices
> > already RUNNING to those still in RESUMING (not _P2P).  
> 
> I think any internal SW devices should just fail all accesses to the
> P2P space, all the time.
> 
> qemu simply acts like a real system that doesn't support P2P.
> 
> IMHO this is generally the way forward to do multi-device as well,
> remove the MMIO from all the address maps: VFIO, SW access, all of
> them. Nothing can touch MMIO except for the vCPU.

Are you suggesting this relative to migration or in general?  P2P is a
feature with tangible benefits and real use cases.  Real systems seem
to be moving towards making P2P work better, so it would seem
short-sighted to move to and enforce only configurations w/o P2P in QEMU
generally.  Besides, this would require some sort of deprecation, so are
we intending to make users choose between migration and P2P?
 
> > This all looks complicated.  Is it better to start with requiring P2P
> > state support?  Thanks,  
> 
> People have built HW without it, so I don't see this as so good.

Are we obliged to start with that hardware?  I'm just trying to think
about whether a single device restriction is sufficient to prevent any
possible P2P or whether there might be an easier starting point for
more capable hardware.  There's no shortage of hardware that could
support migration given sufficient effort.  Thanks,

Alex
Jason Gunthorpe Feb. 1, 2023, 5:28 p.m. UTC | #6
On Tue, Jan 31, 2023 at 09:15:03PM -0700, Alex Williamson wrote:

> > IMHO this is generally the way forward to do multi-device as well,
> > remove the MMIO from all the address maps: VFIO, SW access, all of
> > them. Nothing can touch MMIO except for the vCPU.
> 
> Are you suggesting this relative to migration or in general?  

I would suggest a general qemu p2p on/off option.

> P2P is a feature with tangible benefits and real use cases.  Real
> systems seem to be moving towards making P2P work better, so it
> would seem short-sighted to move to and enforce only configurations
> w/o P2P in QEMU generally.  

qemu needs to support it, but it should be a user option.

Every system I've been involved with for enabling P2P into a VM has
been a total nightmare. This is not something you just turn on and it
works great :\ The whole thing was carefully engineered right down to
the BIOS to be able to work safely.

P2P in baremetal is much easier compared to P2P inside a VM.

> Besides, this would require some sort of deprecation, so are we
> intending to make users choose between migration and P2P?

Give qemu an option 'p2p on/p2p off' and default it to on for
backwards compatibility.

If p2p is on and the migration devices don't support P2P states then
migration is disabled. The user made this choice when they bought
HW that lacks the capability.

Log warnings to make it more discoverable. I think with the cdev
patches we can make it so libvirt can query the device FD for
capabilities to be even cleaner.

If user sets 'p2p off' then migration works with all HW.

p2p on/off is a global switch. With p2p off nothing, no HW or SW or
hybrid device, can touch the MMIO memory.

'p2p off' is a valuable option in its own right because this stuff
doesn't work reliably and is actively dangerous. Did you know you can
hard crash the bare metal from a guest on some platforms with P2P
operations? Yikes. If you don't need to use it turn it off and don't
take the risk.

Arguably for this reason 'p2p off' should trend toward the default as
it is much safer.

> Are we obliged to start with that hardware?  I'm just trying to think
> about whether a single device restriction is sufficient to prevent any
> possible P2P or whether there might be an easier starting point for
> more capable hardware.  There's no shortage of hardware that could
> support migration given sufficient effort.  Thanks,

I think multi-device will likely have some use cases, so I'd like to
see a path to have support for them. For this series I think it is
probably fine since it is already 18 patches.

Jason
Alex Williamson Feb. 1, 2023, 6:42 p.m. UTC | #7
On Wed, 1 Feb 2023 13:28:40 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Jan 31, 2023 at 09:15:03PM -0700, Alex Williamson wrote:
> 
> > > IMHO this is generally the way forward to do multi-device as well,
> > > remove the MMIO from all the address maps: VFIO, SW access, all of
> > > them. Nothing can touch MMIO except for the vCPU.  
> > 
> > Are you suggesting this relative to migration or in general?    
> 
> I would suggest a general qemu p2p on/off option.
> 
> > P2P is a feature with tangible benefits and real use cases.  Real
> > systems seem to be moving towards making P2P work better, so it
> > would seem short-sighted to move to and enforce only configurations
> > w/o P2P in QEMU generally.    
> 
> qemu needs to support it, but it should be a user option.
> 
> Every system I've been involved with for enabling P2P into a VM has
> been a total nightmare. This is not something you just turn on and it
> works great :\ The whole thing was carefully engineered right down to
> the BIOS to be able to work safely.
> 
> P2P in baremetal is much easier compared to P2P inside a VM.
> 
> > Besides, this would require some sort of deprecation, so are we
> > intending to make users choose between migration and P2P?  
> 
> Give qemu an option 'p2p on/p2p off' and default it to on for
> backwards compatibility.
> 
> If p2p is on and the migration devices don't support P2P states then
> migration is disabled. The user made this choice when they bought
> HW that lacks the capability.
> 
> Log warnings to make it more discoverable. I think with the cdev
> patches we can make it so libvirt can query the device FD for
> capabilities to be even cleaner.
> 
> If user sets 'p2p off' then migration works with all HW.
> 
> p2p on/off is a global switch. With p2p off nothing, no HW or SW or
> hybrid device, can touch the MMIO memory.
> 
> 'p2p off' is a valuable option in its own right because this stuff
> doesn't work reliably and is actively dangerous. Did you know you can
> hard crash the bare metal from a guest on some platforms with P2P
> operations? Yikes. If you don't need to use it turn it off and don't
> take the risk.

If we're honest, there are a number of cases of non-exceptional faults
that an assigned device can generate that the platform might escalate
to fatal errors.

> Arguably for this reason 'p2p off' should trend toward the default as
> it is much safer.

Safety in the hands of the userspace to protect the host though?
Shouldn't the opt-in be at the kernel level whether to allow p2p
mappings?  I don't have an issue if QEMU were to mirror this by
creating a RAM-only AddressSpace for devices which would be used when
p2p is disabled (it'd save us some headaches for various unaligned
devices as well), but we shouldn't pretend that actually protects the
host.  OTOH, QEMU could feel confident supporting migration of devices
w/o support of the migration P2P states with that restriction.

> > Are we obliged to start with that hardware?  I'm just trying to think
> > about whether a single device restriction is sufficient to prevent any
> > possible P2P or whether there might be an easier starting point for
> > more capable hardware.  There's no shortage of hardware that could
> > support migration given sufficient effort.  Thanks,  
> 
> I think multi-device will likely have some use cases, so I'd like to
> see a path to have support for them. For this series I think it is
> probably fine since it is already 18 patches.

It might be fine for this series because it hasn't yet proposed to make
migration non-experimental, but it's unclear where the goal post is
that we can actually make that transition.

If we restrict migration to a single vfio device, is that enough?
Theoretically it's not, but perhaps in practice... maybe?

Therefore, do we depend on QEMU implementing a new RAM-only AddressSpace
for devices?  What's the path to making it the default?  Maybe there
are other aspects of the VM from which we can infer a preference
towards migration support, ex. 'host' CPU type.

Another option, as previously mentioned, is to start with requiring P2P
migration support both at the device and QEMU, where we only restrict
the set of devices that could initially support migration without
modifying existing behavior of the VM.

In any case, it seems we're a bit further from being able to declare
this as supported than some had hoped.  Thanks,

Alex
Jason Gunthorpe Feb. 1, 2023, 8:10 p.m. UTC | #8
On Wed, Feb 01, 2023 at 11:42:46AM -0700, Alex Williamson wrote:

> > 'p2p off' is a valuable option in its own right because this stuff
> > doesn't work reliably and is actively dangerous. Did you know you can
> > hard crash the bare metal from a guest on some platforms with P2P
> > operations? Yikes. If you don't need to use it turn it off and don't
> > take the risk.
> 
> If we're honest, there are a number of cases of non-exceptional faults
> that an assigned device can generate that the platform might escalate
> to fatal errors.

My understanding is that this is true on some commodity hardware, but
systems engineered to run as cloud hypervisors have these problems
solved and VFIO is made safe.

Unfortunately there is no way to know if you have a safe or unsafe
system from the OS.

> > Arguably for this reason 'p2p off' should trend toward the default as
> > it is much safer.
> 
> Safety in the hands of the userspace to protect the host though?
> Shouldn't the opt-in be at the kernel level whether to allow p2p
> mappings?  

I haven't seen anyone interested in doing this kind of work. The
expectation seems to be that places seriously concerned about security
either don't include VFIO at all in their environments or have
engineered their platforms to make it safe.

Where this leaves the enterprise space, I don't know. I think they end
up with systems that functionally work but possibly have DoS problems.

So, given this landscape I think a user option in qemu is the best we
can do at the moment.

> I don't have an issue if QEMU were to mirror this by
> creating a RAM-only AddressSpace for devices which would be used when
> p2p is disable (it'd save us some headaches for various unaligned
> devices as well), but we shouldn't pretend that actually protects the
> host.  OTOH, QEMU could feel confident supporting migration of devices
> w/o support of the migration P2P states with that restriction.

It protects the host from a hostile VM, it does not fully protect the
host from a compromised qemu. That is still an improvement.

> > I think multi-device will likely have some use cases, so I'd like to
> > see a path to have support for them. For this series I think it is
> > probably fine since it is already 18 patches.
> 
> It might be fine for this series because it hasn't yet proposed to make
> migration non-experimental, but it's unclear where the goal post is
> that we can actually make that transition.

IMHO non-experimental just means the solution works with whatever
configuration limitations it comes with. It shouldn't mean every
device or every configuration combination works.

So if you want to do single device, or just hard require P2P for now,
those are both reasonable temporary choices IMHO.

But they are temporary and we should come with a remedy to allow
non-P2P migration devices to work as well.

Given we merged a non-P2P kernel driver I prefer the single device
option as it is more logically consistent with the kernel situation.

Jason
Patch

diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
index 1d50c2fe5f..51f5e1a537 100644
--- a/docs/devel/vfio-migration.rst
+++ b/docs/devel/vfio-migration.rst
@@ -7,12 +7,14 @@  the guest is running on source host and restoring this saved state on the
 destination host. This document details how saving and restoring of VFIO
 devices is done in QEMU.
 
-Migration of VFIO devices currently consists of a single stop-and-copy phase.
-During the stop-and-copy phase the guest is stopped and the entire VFIO device
-data is transferred to the destination.
-
-The pre-copy phase of migration is currently not supported for VFIO devices.
-Support for VFIO pre-copy will be added later on.
+Migration of VFIO devices consists of two phases: the optional pre-copy phase,
+and the stop-and-copy phase. The pre-copy phase is iterative and allows
+accommodating VFIO devices that have a large amount of data that needs to be
+transferred. The iterative pre-copy phase of migration allows the guest to
+continue running while the VFIO device state is transferred to the destination;
+this helps to reduce the total downtime of the VM. VFIO devices can choose to
+skip the pre-copy phase of migration by not reporting the
+VFIO_MIGRATION_PRE_COPY flag in the VFIO_DEVICE_FEATURE_MIGRATION ioctl.
 
 A detailed description of the UAPI for VFIO device migration can be found in
 the comment for the ``vfio_device_mig_state`` structure in the header file
@@ -29,6 +31,12 @@  VFIO implements the device hooks for the iterative approach as follows:
   driver, which indicates the amount of data that the vendor driver has yet to
   save for the VFIO device.
 
+* An ``is_active_iterate`` function that indicates ``save_live_iterate`` is
+  active only if the VFIO device is in one of the pre-copy states.
+
+* A ``save_live_iterate`` function that reads the VFIO device's data from the
+  vendor driver during the iterative phase.
+
 * A ``save_state`` function to save the device config space if it is present.
 
 * A ``save_live_complete_precopy`` function that sets the VFIO device in
@@ -91,8 +99,10 @@  Flow of state changes during Live migration
 ===========================================
 
 Below is the flow of state change during live migration.
-The values in the brackets represent the VM state, the migration state, and
+The values in the parentheses represent the VM state, the migration state, and
 the VFIO device state, respectively.
+The text in the square brackets represents the flow if the VFIO device supports
+pre-copy.
 
 Live migration save path
 ------------------------
@@ -104,11 +114,12 @@  Live migration save path
                                   |
                      migrate_init spawns migration_thread
                 Migration thread then calls each device's .save_setup()
-                       (RUNNING, _SETUP, _RUNNING)
+                  (RUNNING, _SETUP, _RUNNING [_PRE_COPY])
                                   |
-                      (RUNNING, _ACTIVE, _RUNNING)
+                  (RUNNING, _ACTIVE, _RUNNING [_PRE_COPY])
              If device is active, get pending_bytes by .save_live_pending()
           If total pending_bytes >= threshold_size, call .save_live_iterate()
+                  [Data of VFIO device for pre-copy phase is copied]
         Iterate till total pending bytes converge and are less than threshold
                                   |
   On migration completion, vCPU stops and calls .save_live_complete_precopy for
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 5f8e7a02fe..88c2194fb9 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -67,7 +67,10 @@  typedef struct VFIOMigration {
     int data_fd;
     void *data_buffer;
     size_t data_buffer_size;
+    uint64_t mig_flags;
     uint64_t stop_copy_size;
+    uint64_t precopy_init_size;
+    uint64_t precopy_dirty_size;
 } VFIOMigration;
 
 typedef struct VFIOAddressSpace {
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 9a0dbee6b4..93b18c5e3d 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -357,7 +357,9 @@  static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
 
             if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) &&
                 (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
-                 migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P)) {
+                 migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P ||
+                 migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ||
+                 migration->device_state == VFIO_DEVICE_STATE_PRE_COPY_P2P)) {
                 return false;
             }
         }
@@ -387,7 +389,9 @@  static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
             }
 
             if (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
-                migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P) {
+                migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P ||
+                migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ||
+                migration->device_state == VFIO_DEVICE_STATE_PRE_COPY_P2P) {
                 continue;
             } else {
                 return false;
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 760f667e04..2a0a663023 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -69,6 +69,10 @@  static const char *mig_state_to_str(enum vfio_device_mig_state state)
         return "RESUMING";
     case VFIO_DEVICE_STATE_RUNNING_P2P:
         return "RUNNING_P2P";
+    case VFIO_DEVICE_STATE_PRE_COPY:
+        return "PRE_COPY";
+    case VFIO_DEVICE_STATE_PRE_COPY_P2P:
+        return "PRE_COPY_P2P";
     default:
         return "UNKNOWN STATE";
     }
@@ -237,6 +241,11 @@  static int vfio_save_block(QEMUFile *f, VFIOMigration *migration)
     data_size = read(migration->data_fd, migration->data_buffer,
                      migration->data_buffer_size);
     if (data_size < 0) {
+        /* Pre-copy emptied all the device state for now */
+        if (errno == ENOMSG) {
+            return 1;
+        }
+
         return -errno;
     }
     if (data_size == 0) {
@@ -260,6 +269,7 @@  static int vfio_save_setup(QEMUFile *f, void *opaque)
     VFIODevice *vbasedev = opaque;
     VFIOMigration *migration = vbasedev->migration;
     uint64_t stop_copy_size = VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE;
+    int ret;
 
     qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
 
@@ -273,6 +283,23 @@  static int vfio_save_setup(QEMUFile *f, void *opaque)
         return -ENOMEM;
     }
 
+    if (migration->mig_flags & VFIO_MIGRATION_PRE_COPY) {
+        switch (migration->device_state) {
+        case VFIO_DEVICE_STATE_RUNNING:
+            ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_PRE_COPY,
+                                           VFIO_DEVICE_STATE_RUNNING);
+            if (ret) {
+                return ret;
+            }
+            break;
+        case VFIO_DEVICE_STATE_STOP:
+            /* vfio_save_complete_precopy() will go to STOP_COPY */
+            break;
+        default:
+            return -EINVAL;
+        }
+    }
+
     trace_vfio_save_setup(vbasedev->name, migration->data_buffer_size);
 
     qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
@@ -287,6 +314,12 @@  static void vfio_save_cleanup(void *opaque)
 
     g_free(migration->data_buffer);
     migration->data_buffer = NULL;
+
+    if (migration->mig_flags & VFIO_MIGRATION_PRE_COPY) {
+        migration->precopy_init_size = 0;
+        migration->precopy_dirty_size = 0;
+    }
+
     vfio_migration_cleanup(vbasedev);
     trace_vfio_save_cleanup(vbasedev->name);
 }
@@ -301,9 +334,55 @@  static void vfio_save_pending(void *opaque, uint64_t threshold_size,
 
     *res_precopy_only += migration->stop_copy_size;
 
+    if (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ||
+        migration->device_state == VFIO_DEVICE_STATE_PRE_COPY_P2P) {
+        if (migration->precopy_init_size) {
+            /*
+             * Initial size should be transferred during pre-copy phase so
+             * stop-copy phase will not be slowed down. Report threshold_size
+             * to force another pre-copy iteration.
+             */
+            *res_precopy_only += threshold_size;
+        } else {
+            *res_precopy_only += migration->precopy_dirty_size;
+        }
+    }
+
     trace_vfio_save_pending(vbasedev->name, *res_precopy_only,
                             *res_postcopy_only, *res_compatible,
-                            migration->stop_copy_size);
+                            migration->stop_copy_size,
+                            migration->precopy_init_size,
+                            migration->precopy_dirty_size);
+}
+
+static bool vfio_is_active_iterate(void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+
+    return migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ||
+           migration->device_state == VFIO_DEVICE_STATE_PRE_COPY_P2P;
+}
+
+static int vfio_save_iterate(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    int ret;
+
+    ret = vfio_save_block(f, migration);
+    if (ret < 0) {
+        return ret;
+    }
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    trace_vfio_save_iterate(vbasedev->name);
+
+    /*
+     * A VFIO device's pre-copy dirty_bytes is not guaranteed to reach zero.
+     * Return 1 so following handlers will not be potentially blocked.
+     */
+    return 1;
 }
 
 static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
@@ -312,7 +391,7 @@  static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
     enum vfio_device_mig_state recover_state;
     int ret;
 
-    /* We reach here with device state STOP only */
+    /* We reach here with device state STOP or STOP_COPY only */
     recover_state = VFIO_DEVICE_STATE_STOP;
     ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
                                    recover_state);
@@ -430,6 +509,8 @@  static const SaveVMHandlers savevm_vfio_handlers = {
     .save_setup = vfio_save_setup,
     .save_cleanup = vfio_save_cleanup,
     .save_live_pending = vfio_save_pending,
+    .is_active_iterate = vfio_is_active_iterate,
+    .save_live_iterate = vfio_save_iterate,
     .save_live_complete_precopy = vfio_save_complete_precopy,
     .save_state = vfio_save_state,
     .load_setup = vfio_load_setup,
@@ -442,13 +523,19 @@  static const SaveVMHandlers savevm_vfio_handlers = {
 static void vfio_vmstate_change(void *opaque, bool running, RunState state)
 {
     VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
     enum vfio_device_mig_state new_state;
     int ret;
 
     if (running) {
         new_state = VFIO_DEVICE_STATE_RUNNING;
     } else {
-        new_state = VFIO_DEVICE_STATE_STOP;
+        new_state =
+            ((migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ||
+              migration->device_state == VFIO_DEVICE_STATE_PRE_COPY_P2P) &&
+             (state == RUN_STATE_FINISH_MIGRATE || state == RUN_STATE_PAUSED)) ?
+                VFIO_DEVICE_STATE_STOP_COPY :
+                VFIO_DEVICE_STATE_STOP;
     }
 
     ret = vfio_migration_set_state(vbasedev, new_state,
@@ -496,6 +583,9 @@  static int vfio_migration_data_notifier(NotifierWithReturn *n, void *data)
 {
     VFIOMigration *migration = container_of(n, VFIOMigration, migration_data);
     VFIODevice *vbasedev = migration->vbasedev;
+    struct vfio_precopy_info precopy = {
+        .argsz = sizeof(precopy),
+    };
     PrecopyNotifyData *pnd = data;
 
     if (pnd->reason != PRECOPY_NOTIFY_AFTER_BITMAP_SYNC) {
@@ -515,8 +605,21 @@  static int vfio_migration_data_notifier(NotifierWithReturn *n, void *data)
         migration->stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;
     }
 
+    if ((migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ||
+         migration->device_state == VFIO_DEVICE_STATE_PRE_COPY_P2P)) {
+        if (ioctl(migration->data_fd, VFIO_MIG_GET_PRECOPY_INFO, &precopy)) {
+            migration->precopy_init_size = 0;
+            migration->precopy_dirty_size = 0;
+        } else {
+            migration->precopy_init_size = precopy.initial_bytes;
+            migration->precopy_dirty_size = precopy.dirty_bytes;
+        }
+    }
+
     trace_vfio_migration_data_notifier(vbasedev->name,
-                                       migration->stop_copy_size);
+                                       migration->stop_copy_size,
+                                       migration->precopy_init_size,
+                                       migration->precopy_dirty_size);
 
     return 0;
 }
@@ -588,6 +691,7 @@  static int vfio_migration_init(VFIODevice *vbasedev)
     migration->vbasedev = vbasedev;
     migration->device_state = VFIO_DEVICE_STATE_RUNNING;
     migration->data_fd = -1;
+    migration->mig_flags = mig_flags;
 
     oid = vmstate_if_get_id(VMSTATE_IF(DEVICE(obj)));
     if (oid) {
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index db9cb94952..37724579e3 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -154,7 +154,7 @@  vfio_load_cleanup(const char *name) " (%s)"
 vfio_load_device_config_state(const char *name) " (%s)"
 vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
 vfio_load_state_device_data(const char *name, uint64_t data_size, int ret) " (%s) size 0x%"PRIx64" ret %d"
-vfio_migration_data_notifier(const char *name, uint64_t stopcopy_size) " (%s) stopcopy size 0x%"PRIx64
+vfio_migration_data_notifier(const char *name, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) stopcopy size 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
 vfio_migration_probe(const char *name) " (%s)"
 vfio_migration_set_state(const char *name, const char *state) " (%s) state %s"
 vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
@@ -162,6 +162,7 @@  vfio_save_block(const char *name, int data_size) " (%s) data_size %d"
 vfio_save_cleanup(const char *name) " (%s)"
 vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d"
 vfio_save_device_config_state(const char *name) " (%s)"
-vfio_save_pending(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t compatible, uint64_t stopcopy_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" compatible 0x%"PRIx64" stopcopy size 0x%"PRIx64
+vfio_save_iterate(const char *name) " (%s)"
+vfio_save_pending(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t compatible, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" compatible 0x%"PRIx64" stopcopy size 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
 vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size 0x%"PRIx64
 vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"