diff mbox series

[v7,07/10] vfio: Extend the device migration protocol with PRE_COPY

Message ID 20220302172903.1995-8-shameerali.kolothum.thodi@huawei.com (mailing list archive)
State Superseded
Headers show
Series vfio/hisilicon: add ACC live migration driver | expand

Commit Message

Shameerali Kolothum Thodi March 2, 2022, 5:29 p.m. UTC
From: Jason Gunthorpe <jgg@nvidia.com>

The optional PRE_COPY states open the saving data transfer FD before
reaching STOP_COPY and allows the device to dirty track internal state
changes with the general idea to reduce the volume of data transferred
in the STOP_COPY stage.

While in PRE_COPY the device remains RUNNING, but the saving FD is open.

Only if the device also supports RUNNING_P2P can it support PRE_COPY_P2P,
which halts P2P transfers while continuing the saving FD.

PRE_COPY, with P2P support, requires the driver to implement 7 new arcs
and exists as an optional FSM branch between RUNNING and STOP_COPY:
    RUNNING -> PRE_COPY -> PRE_COPY_P2P -> STOP_COPY

A new ioctl VFIO_MIG_GET_PRECOPY_INFO is provided to allow userspace to
query the progress of the precopy operation in the driver with the idea it
will judge to move to STOP_COPY at least once the initial data set is
transferred, and possibly after the dirty size has shrunk appropriately.

This ioctl is valid only in VFIO_DEVICE_STATE_PRE_COPY state and kernel
driver should return -EINVAL from any other migration state.

Compared to the v1 clarification, STOP_COPY -> PRE_COPY is made optional
and to be defined in future. While making the whole PRE_COPY feature
optional eliminates the concern from mlx5, this is still a complicated arc
to implement and seems prudent to leave it closed until a proper use case
is developed. We also split the pending_bytes report into the initial and
sustaining values, and define the protocol to get an event via poll() for
new dirty data during PRE_COPY.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
[Shameer: Renamed ioctl and updated ioctl usage description]
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
---
 drivers/vfio/vfio.c       |  71 +++++++++++++++++++++++-
 include/uapi/linux/vfio.h | 113 ++++++++++++++++++++++++++++++++++++--
 2 files changed, 179 insertions(+), 5 deletions(-)

Comments

Alex Williamson March 2, 2022, 8:31 p.m. UTC | #1
On Wed, 2 Mar 2022 17:29:00 +0000
Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> wrote:

> From: Jason Gunthorpe <jgg@nvidia.com>
> 
> The optional PRE_COPY states open the saving data transfer FD before
> reaching STOP_COPY and allows the device to dirty track internal state
> changes with the general idea to reduce the volume of data transferred
> in the STOP_COPY stage.
> 
> While in PRE_COPY the device remains RUNNING, but the saving FD is open.
> 
> Only if the device also supports RUNNING_P2P can it support PRE_COPY_P2P,
> which halts P2P transfers while continuing the saving FD.
> 
> PRE_COPY, with P2P support, requires the driver to implement 7 new arcs
> and exists as an optional FSM branch between RUNNING and STOP_COPY:
>     RUNNING -> PRE_COPY -> PRE_COPY_P2P -> STOP_COPY
> 
> A new ioctl VFIO_MIG_GET_PRECOPY_INFO is provided to allow userspace to
> query the progress of the precopy operation in the driver with the idea it
> will judge to move to STOP_COPY at least once the initial data set is
> transferred, and possibly after the dirty size has shrunk appropriately.
> 
> This ioctl is valid only in VFIO_DEVICE_STATE_PRE_COPY state and kernel
> driver should return -EINVAL from any other migration state.

Nit, PRE_COPY and PRE_COPY_P2P
 
> Compared to the v1 clarification, STOP_COPY -> PRE_COPY is made optional
> and to be defined in future.

It's not really optional, it's explicitly unsupported in this extension.

> While making the whole PRE_COPY feature
> optional eliminates the concern from mlx5, this is still a complicated arc
> to implement and seems prudent to leave it closed until a proper use case
> is developed.

This seems like a leftover from the RFC that could be dropped.

> We also split the pending_bytes report into the initial and
> sustaining values, and define the protocol to get an event via poll() for
> new dirty data during PRE_COPY.
> 
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> [Shameer: Renamed ioctl and updated ioctl usage description]
> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> ---
>  drivers/vfio/vfio.c       |  71 +++++++++++++++++++++++-
>  include/uapi/linux/vfio.h | 113 ++++++++++++++++++++++++++++++++++++--
>  2 files changed, 179 insertions(+), 5 deletions(-)
> 
...
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index fea86061b44e..440dbfaabdb3 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -819,12 +819,20 @@ struct vfio_device_feature {
>   * VFIO_MIGRATION_STOP_COPY | VFIO_MIGRATION_P2P means that RUNNING_P2P
>   * is supported in addition to the STOP_COPY states.
>   *
> + * VFIO_MIGRATION_STOP_COPY | VFIO_MIGRATION_PRE_COPY means that
> + * PRE_COPY is supported in addition to the STOP_COPY states.
> + *
> + * VFIO_MIGRATION_STOP_COPY | VFIO_MIGRATION_P2P | VFIO_MIGRATION_PRE_COPY
> + * means that RUNNING_P2P, PRE_COPY and PRE_COPY_P2P are supported
> + * in addition to the STOP_COPY states.
> + *
>   * Other combinations of flags have behavior to be defined in the future.
>   */
>  struct vfio_device_feature_migration {
>  	__aligned_u64 flags;
>  #define VFIO_MIGRATION_STOP_COPY	(1 << 0)
>  #define VFIO_MIGRATION_P2P		(1 << 1)
> +#define VFIO_MIGRATION_PRE_COPY		(1 << 2)
>  };
>  #define VFIO_DEVICE_FEATURE_MIGRATION 1
>  
> @@ -875,8 +883,13 @@ struct vfio_device_feature_mig_state {
>   *  RESUMING - The device is stopped and is loading a new internal state
>   *  ERROR - The device has failed and must be reset
>   *
> - * And 1 optional state to support VFIO_MIGRATION_P2P:
> + * And optional states to support VFIO_MIGRATION_P2P:
>   *  RUNNING_P2P - RUNNING, except the device cannot do peer to peer DMA
> + * And VFIO_MIGRATION_PRE_COPY:
> + *  PRE_COPY - The device is running normally but tracking internal state
> + *             changes
> + * And VFIO_MIGRATION_P2P | VFIO_MIGRATION_PRE_COPY:
> + *  PRE_COPY_P2P - PRE_COPY, except the device cannot do peer to peer DMA
>   *
>   * The FSM takes actions on the arcs between FSM states. The driver implements
>   * the following behavior for the FSM arcs:
> @@ -908,20 +921,48 @@ struct vfio_device_feature_mig_state {
>   *
>   *   To abort a RESUMING session the device must be reset.
>   *
> + * PRE_COPY -> RUNNING
>   * RUNNING_P2P -> RUNNING
>   *   While in RUNNING the device is fully operational, the device may generate
>   *   interrupts, DMA, respond to MMIO, all vfio device regions are functional,
>   *   and the device may advance its internal state.
>   *
> + *   The PRE_COPY arc will terminate a data transfer session.
> + *
> + * PRE_COPY_P2P -> RUNNING_P2P
>   * RUNNING -> RUNNING_P2P
>   * STOP -> RUNNING_P2P
>   *   While in RUNNING_P2P the device is partially running in the P2P quiescent
>   *   state defined below.
>   *
> + *   The PRE_COPY arc will terminate a data transfer session.
> + *
> + * RUNNING -> PRE_COPY
> + * RUNNING_P2P -> PRE_COPY_P2P
>   * STOP -> STOP_COPY
> - *   This arc begin the process of saving the device state and will return a
> - *   new data_fd.
> + *   PRE_COPY, PRE_COPY_P2P and STOP_COPY form the "saving group" of states
> + *   which share a data transfer session. Moving between these states alters
> + *   what is streamed in session, but does not terminate or otherwise effect
> + *   the associated fd.
> + *
> + *   These arcs begin the process of saving the device state and will return a
> + *   new data_fd. The migration driver may perform actions such as enabling
> + *   dirty logging of device state when entering PRE_COPY or PER_COPY_P2P.
> + *
> + *   Each arc does not change the device operation, the device remains
> + *   RUNNING, P2P quiesced or in STOP. The STOP_COPY state is described below
> + *   in PRE_COPY_P2P -> STOP_COPY.
> + *
> + * PRE_COPY -> PRE_COPY_P2P
> + *   Entering PRE_COPY_P2P continues all the behaviors of PRE_COPY above.
> + *   However, while in the PRE_COPY_P2P state, the device is partially running
> + *   in the P2P quiescent state defined below, like RUNNING_P2P.
>   *
> + * PRE_COPY_P2P -> PRE_COPY
> + *   This arc allows returning the device to a full RUNNING behavior while
> + *   continuing all the behaviors of PRE_COPY.
> + *
> + * PRE_COPY_P2P -> STOP_COPY
>   *   While in the STOP_COPY state the device has the same behavior as STOP
>   *   with the addition that the data transfers session continues to stream the
>   *   migration state. End of stream on the FD indicates the entire device
> @@ -939,6 +980,13 @@ struct vfio_device_feature_mig_state {
>   *   device state for this arc if required to prepare the device to receive the
>   *   migration data.
>   *
> + * STOP_COPY -> PRE_COPY
> + * STOP_COPY -> PRE_COPY_P2P
> + *   These arcs are not permitted and return error if requested. Future
> + *   revisions of this API may define behaviors for these arcs, in this case
> + *   support will be discoverable by a new flag in
> + *   VFIO_DEVICE_FEATURE_MIGRATION.
> + *
>   * any -> ERROR
>   *   ERROR cannot be specified as a device state, however any transition request
>   *   can be failed with an errno return and may then move the device_state into
> @@ -950,7 +998,7 @@ struct vfio_device_feature_mig_state {
>   * The optional peer to peer (P2P) quiescent state is intended to be a quiescent
>   * state for the device for the purposes of managing multiple devices within a
>   * user context where peer-to-peer DMA between devices may be active. The
> - * RUNNING_P2P states must prevent the device from initiating
> + * RUNNING_P2P and PRE_COPY_P2P states must prevent the device from initiating
>   * any new P2P DMA transactions. If the device can identify P2P transactions
>   * then it can stop only P2P DMA, otherwise it must stop all DMA. The migration
>   * driver must complete any such outstanding operations prior to completing the
> @@ -963,6 +1011,8 @@ struct vfio_device_feature_mig_state {
>   * above FSM arcs. As there are multiple paths through the FSM arcs the path
>   * should be selected based on the following rules:
>   *   - Select the shortest path.
> + *   - The path cannot have saving group states as interior arcs, only
> + *     starting/end states.
>   * Refer to vfio_mig_get_next_state() for the result of the algorithm.
>   *
>   * The automatic transit through the FSM arcs that make up the combination
> @@ -976,6 +1026,9 @@ struct vfio_device_feature_mig_state {
>   * support them. The user can discover if these states are supported by using
>   * VFIO_DEVICE_FEATURE_MIGRATION. By using combination transitions the user can
>   * avoid knowing about these optional states if the kernel driver supports them.
> + *
> + * Arcs touching PRE_COPY and PRE_COPY_P2P are removed if support for PRE_COPY
> + * is not present.
>   */
>  enum vfio_device_mig_state {
>  	VFIO_DEVICE_STATE_ERROR = 0,
> @@ -984,8 +1037,60 @@ enum vfio_device_mig_state {
>  	VFIO_DEVICE_STATE_STOP_COPY = 3,
>  	VFIO_DEVICE_STATE_RESUMING = 4,
>  	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
> +	VFIO_DEVICE_STATE_PRE_COPY = 6,
> +	VFIO_DEVICE_STATE_PRE_COPY_P2P = 7,
> +};
> +
> +/**
> + * VFIO_MIG_GET_PRECOPY_INFO - _IO(VFIO_TYPE, VFIO_BASE + 21)
> + *
> + * This ioctl is used on the migration data FD in the precopy phase of the
> + * migration data transfer. It returns an estimate of the current data sizes
> + * remaining to be transferred. It allows the user to judge when it is
> + * appropriate to leave PRE_COPY for STOP_COPY.
> + *
> + * This ioctl is valid only in VFIO_DEVICE_STATE_PRE_COPY state and kernel

PRE_COPY and PRE_COPY_P2P states, maybe we can generally refer to these
as "PRE_COPY states".

> + * driver should return -EINVAL from any other migration state.
> + *
> + * initial_bytes reflects the estimated remaining size of any initial mandatory
> + * precopy data transfer. When initial_bytes returns as zero then the initial
> + * phase of the precopy data is completed. Generally initial_bytes should start
> + * out as approximately the entire device state.

What is "mandatory" intended to mean here?  The user isn't required to
collect any data from the device in the PRE_COPY states.

> + *
> + * dirty_bytes reflects an estimate for how much more data needs to be
> + * transferred to complete the migration. Generally it should start as zero
> + * and increase as internal state is dirtied.

Maybe let's try to combine the descriptions, I'll give it a shot:

"The vfio_precopy_info data structure returned by this ioctl provides
 estimates of data available from the device during the PRE_COPY states.
 This estimate is split into two categories, initial_bytes and
 dirty_bytes.

 The initial_bytes field indicates the amount of static data available
 from the device.  This field should have a non-zero initial value and
 decrease as migration data is read from the device.

 The dirty_bytes field tracks device state changes relative to data
 previously retrieved.  This field starts at zero and may increase as
 the internal device state is modified or decrease as that modified
 state is read from the device.

 Userspace may use the combination of these fields to estimate the
 potential data size available during the PRE_COPY phases, as well as
 trends relative to the rate the device is dirtying it's internal
 state, but these fields are not required to have any bearing relative
 to the data size available during the STOP_COPY phase."

> + *
> + * Drivers should attempt to return estimates so that initial_bytes +
> + * dirty_bytes matches the amount of data an immediate transition to STOP_COPY
> + * will require to be streamed.

I think previous discussions have proven this false, we expect trailing
data that is only available in STOP_COPY, we cannot bound the size of
that data, and dirty_bytes is not intended to expose data that cannot
be retrieved during the PRE_COPY phases.  Thanks,

Alex

> + *
> + * Drivers have a lot of flexibility in when and what they transfer during the
> + * PRE_COPY phase, and how they report this from VFIO_MIG_GET_PRECOPY_INFO.
> + *
> + * During pre-copy the migration data FD has a temporary "end of stream" that is
> + * reached when both initial_bytes and dirty_byte are zero. For instance, this
> + * may indicate that the device is idle and not currently dirtying any internal
> + * state. When read() is done on this temporary end of stream the kernel driver
> + * should return ENOMSG from read(). Userspace can wait for more data (which may
> + * never come) by using poll.
> + *
> + * Once in STOP_COPY the migration data FD has a permanent end of stream
> + * signaled in the usual way by read() always returning 0 and poll always
> + * returning readable. ENOMSG may not be returned in STOP_COPY. Support
> + * for this ioctl is optional.
> + *
> + * Return: 0 on success, -1 and errno set on failure.
> + */
> +struct vfio_precopy_info {
> +	__u32 argsz;
> +	__u32 flags;
> +	__aligned_u64 initial_bytes;
> +	__aligned_u64 dirty_bytes;
>  };
>  
> +#define VFIO_MIG_GET_PRECOPY_INFO _IO(VFIO_TYPE, VFIO_BASE + 21)
> +
>  /* -------- API for Type1 VFIO IOMMU -------- */
>  
>  /**
Jason Gunthorpe March 3, 2022, 12:05 a.m. UTC | #2
On Wed, Mar 02, 2022 at 01:31:59PM -0700, Alex Williamson wrote:
> > + * initial_bytes reflects the estimated remaining size of any initial mandatory
> > + * precopy data transfer. When initial_bytes returns as zero then the initial
> > + * phase of the precopy data is completed. Generally initial_bytes should start
> > + * out as approximately the entire device state.
> 
> What is "mandatory" intended to mean here?  The user isn't required to
> collect any data from the device in the PRE_COPY states.

If the data is split into initial,dirty,trailer then mandatory means
that first chunk.

> "The vfio_precopy_info data structure returned by this ioctl provides
>  estimates of data available from the device during the PRE_COPY states.
>  This estimate is split into two categories, initial_bytes and
>  dirty_bytes.
> 
>  The initial_bytes field indicates the amount of static data available
>  from the device.  This field should have a non-zero initial value and
>  decrease as migration data is read from the device.

static isn't great either, how about just say 'minimum data available'

>  Userspace may use the combination of these fields to estimate the
>  potential data size available during the PRE_COPY phases, as well as
>  trends relative to the rate the device is dirtying it's internal
>  state, but these fields are not required to have any bearing relative
>  to the data size available during the STOP_COPY phase."

That last is too strong. I would just drop starting at but.

The message to communicate is the device should allow dirty_bytes to
reach 0 during the PRE_COPY phases if everything is is idle. Which
tells alot about how to calculate it.

It is all better otherwise

> > + * Drivers should attempt to return estimates so that initial_bytes +
> > + * dirty_bytes matches the amount of data an immediate transition to STOP_COPY
> > + * will require to be streamed.
>
> I think previous discussions have proven this false, we expect trailing
> data that is only available in STOP_COPY, we cannot bound the size of
> that data, and dirty_bytes is not intended to expose data that cannot
> be retrieved during the PRE_COPY phases.  Thanks,

It was written assuming the stop_copy trailer is small.

Thanks,
Jason
Alex Williamson March 3, 2022, 3:47 a.m. UTC | #3
On Wed, 2 Mar 2022 20:05:28 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Mar 02, 2022 at 01:31:59PM -0700, Alex Williamson wrote:
> > > + * initial_bytes reflects the estimated remaining size of any initial mandatory
> > > + * precopy data transfer. When initial_bytes returns as zero then the initial
> > > + * phase of the precopy data is completed. Generally initial_bytes should start
> > > + * out as approximately the entire device state.  
> > 
> > What is "mandatory" intended to mean here?  The user isn't required to
> > collect any data from the device in the PRE_COPY states.  
> 
> If the data is split into initial,dirty,trailer then mandatory means
> that first chunk.

But there's no requirement to read anything in PRE_COPY, so initial
becomes indistinguishable from trailer and dirty doesn't exist.

> > "The vfio_precopy_info data structure returned by this ioctl provides
> >  estimates of data available from the device during the PRE_COPY states.
> >  This estimate is split into two categories, initial_bytes and
> >  dirty_bytes.
> > 
> >  The initial_bytes field indicates the amount of static data available
> >  from the device.  This field should have a non-zero initial value and
> >  decrease as migration data is read from the device.  
> 
> static isn't great either, how about just say 'minimum data available'

'initial precopy data-set'?

> >  Userspace may use the combination of these fields to estimate the
> >  potential data size available during the PRE_COPY phases, as well as
> >  trends relative to the rate the device is dirtying it's internal
> >  state, but these fields are not required to have any bearing relative
> >  to the data size available during the STOP_COPY phase."  
> 
> That last is too strong. I would just drop starting at but.
> 
> The message to communicate is the device should allow dirty_bytes to
> reach 0 during the PRE_COPY phases if everything is is idle. Which
> tells alot about how to calculate it.
> 
> It is all better otherwise
> 
> > > + * Drivers should attempt to return estimates so that initial_bytes +
> > > + * dirty_bytes matches the amount of data an immediate transition to STOP_COPY
> > > + * will require to be streamed.  
> >
> > I think previous discussions have proven this false, we expect trailing
> > data that is only available in STOP_COPY, we cannot bound the size of
> > that data, and dirty_bytes is not intended to expose data that cannot
> > be retrieved during the PRE_COPY phases.  Thanks,  
> 
> It was written assuming the stop_copy trailer is small.

We have no basis to make that assertion.  We've agreed that precopy can
be used for nothing more than a compatibility test, so we could have a
vGPU with a massive framebuffer and no ability to provide dirty
tracking implement precopy only to include the entire framebuffer in
the trailing STOP_COPY data set.  Per my understanding and the fact
that we cannot enforce any heuristics regarding the size of the tailer
relative to the pre-copy data set, I think the above strongly phrased
sentence is necessary to understand the limitations of what this ioctl
is meant to convey.  Thanks,

Alex
Jason Gunthorpe March 3, 2022, 1:01 p.m. UTC | #4
On Wed, Mar 02, 2022 at 08:47:52PM -0700, Alex Williamson wrote:
> On Wed, 2 Mar 2022 20:05:28 -0400
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Wed, Mar 02, 2022 at 01:31:59PM -0700, Alex Williamson wrote:
> > > > + * initial_bytes reflects the estimated remaining size of any initial mandatory
> > > > + * precopy data transfer. When initial_bytes returns as zero then the initial
> > > > + * phase of the precopy data is completed. Generally initial_bytes should start
> > > > + * out as approximately the entire device state.  
> > > 
> > > What is "mandatory" intended to mean here?  The user isn't required to
> > > collect any data from the device in the PRE_COPY states.  
> > 
> > If the data is split into initial,dirty,trailer then mandatory means
> > that first chunk.
> 
> But there's no requirement to read anything in PRE_COPY, so initial
> becomes indistinguishable from trailer and dirty doesn't exist.

It is still mandatory to read that data out, it doesn't matter if it
is read during PRE_COPY or STOP_COPY.

> > > "The vfio_precopy_info data structure returned by this ioctl provides
> > >  estimates of data available from the device during the PRE_COPY states.
> > >  This estimate is split into two categories, initial_bytes and
> > >  dirty_bytes.
> > > 
> > >  The initial_bytes field indicates the amount of static data available
> > >  from the device.  This field should have a non-zero initial value and
> > >  decrease as migration data is read from the device.  
> > 
> > static isn't great either, how about just say 'minimum data available'
> 
> 'initial precopy data-set'?

Sure

> We have no basis to make that assertion.  We've agreed that precopy can
> be used for nothing more than a compatibility test, so we could have a
> vGPU with a massive framebuffer and no ability to provide dirty
> tracking implement precopy only to include the entire framebuffer in
> the trailing STOP_COPY data set.  Per my understanding and the fact
> that we cannot enforce any heuristics regarding the size of the tailer
> relative to the pre-copy data set, I think the above strongly phrased
> sentence is necessary to understand the limitations of what this ioctl
> is meant to convey.  Thanks,

This is why abusing precopy for compatability is not a great idea. It
is OK for acc because its total state is tiny, but I would not agree
to a vGPU driver being merged working like you describe. It distorts
the entire purpose of PRE_COPY and this whole estimation mechanism.

The ioctl is intended to convey when to switch to STOP_COPY, and the
driver should provide a semantic where the closer the reported length
is to 0 then the faster the STOP_COPY will go.

Jason
Alex Williamson March 3, 2022, 3:20 p.m. UTC | #5
On Thu, 3 Mar 2022 09:01:24 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Mar 02, 2022 at 08:47:52PM -0700, Alex Williamson wrote:
> > On Wed, 2 Mar 2022 20:05:28 -0400
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > On Wed, Mar 02, 2022 at 01:31:59PM -0700, Alex Williamson wrote:  
> > > > > + * initial_bytes reflects the estimated remaining size of any initial mandatory
> > > > > + * precopy data transfer. When initial_bytes returns as zero then the initial
> > > > > + * phase of the precopy data is completed. Generally initial_bytes should start
> > > > > + * out as approximately the entire device state.    
> > > > 
> > > > What is "mandatory" intended to mean here?  The user isn't required to
> > > > collect any data from the device in the PRE_COPY states.    
> > > 
> > > If the data is split into initial,dirty,trailer then mandatory means
> > > that first chunk.  
> > 
> > But there's no requirement to read anything in PRE_COPY, so initial
> > becomes indistinguishable from trailer and dirty doesn't exist.  
> 
> It is still mandatory to read that data out, it doesn't matter if it
> is read during PRE_COPY or STOP_COPY.

Not really, PRE_COPY -> RUNNING is a valid arc.
 
> > > > "The vfio_precopy_info data structure returned by this ioctl provides
> > > >  estimates of data available from the device during the PRE_COPY states.
> > > >  This estimate is split into two categories, initial_bytes and
> > > >  dirty_bytes.
> > > > 
> > > >  The initial_bytes field indicates the amount of static data available
> > > >  from the device.  This field should have a non-zero initial value and
> > > >  decrease as migration data is read from the device.    
> > > 
> > > static isn't great either, how about just say 'minimum data available'  
> > 
> > 'initial precopy data-set'?  
> 
> Sure
> 
> > We have no basis to make that assertion.  We've agreed that precopy can
> > be used for nothing more than a compatibility test, so we could have a
> > vGPU with a massive framebuffer and no ability to provide dirty
> > tracking implement precopy only to include the entire framebuffer in
> > the trailing STOP_COPY data set.  Per my understanding and the fact
> > that we cannot enforce any heuristics regarding the size of the tailer
> > relative to the pre-copy data set, I think the above strongly phrased
> > sentence is necessary to understand the limitations of what this ioctl
> > is meant to convey.  Thanks,  
> 
> This is why abusing precopy for compatability is not a great idea. It
> is OK for acc because its total state is tiny, but I would not agree
> to a vGPU driver being merged working like you describe. It distorts
> the entire purpose of PRE_COPY and this whole estimation mechanism.
> 
> The ioctl is intended to convey when to switch to STOP_COPY, and the
> driver should provide a semantic where the closer the reported length
> is to 0 then the faster the STOP_COPY will go.

If it's an abuse, then let's not do it.  It was never my impression or
intention that this was ok for acc only due to the minimal trailing
data size.  My statement was that use of PRE_COPY for compatibility
testing only had been a previously agreed valid use case of the
original migration interface.

Furthermore the acc driver was explicitly directed not to indicate any
degree of trailing data size in dirty_bytes, so while trailing data may
be small for acc, this interface is explicitly not intended to provide
any indication of trailing data size.  Thanks,

Alex
Shameerali Kolothum Thodi March 3, 2022, 6:05 p.m. UTC | #6
> -----Original Message-----
> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: 03 March 2022 15:21
> To: Jason Gunthorpe <jgg@nvidia.com>
> Cc: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
> kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> linux-crypto@vger.kernel.org; linux-pci@vger.kernel.org; cohuck@redhat.com;
> mgurtovoy@nvidia.com; yishaih@nvidia.com; Linuxarm
> <linuxarm@huawei.com>; liulongfang <liulongfang@huawei.com>; Zengtao (B)
> <prime.zeng@hisilicon.com>; Jonathan Cameron
> <jonathan.cameron@huawei.com>; Wangzhou (B) <wangzhou1@hisilicon.com>
> Subject: Re: [PATCH v7 07/10] vfio: Extend the device migration protocol with
> PRE_COPY
> 
> On Thu, 3 Mar 2022 09:01:24 -0400
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Wed, Mar 02, 2022 at 08:47:52PM -0700, Alex Williamson wrote:
> > > On Wed, 2 Mar 2022 20:05:28 -0400
> > > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >
> > > > On Wed, Mar 02, 2022 at 01:31:59PM -0700, Alex Williamson wrote:
> > > > > > + * initial_bytes reflects the estimated remaining size of any
> > > > > > + initial mandatory
> > > > > > + * precopy data transfer. When initial_bytes returns as zero
> > > > > > + then the initial
> > > > > > + * phase of the precopy data is completed. Generally initial_bytes
> should start
> > > > > > + * out as approximately the entire device state.
> > > > >
> > > > > What is "mandatory" intended to mean here?  The user isn't required
> to
> > > > > collect any data from the device in the PRE_COPY states.
> > > >
> > > > If the data is split into initial,dirty,trailer then mandatory
> > > > means that first chunk.
> > >
> > > But there's no requirement to read anything in PRE_COPY, so initial
> > > becomes indistinguishable from trailer and dirty doesn't exist.
> >
> > It is still mandatory to read that data out, it doesn't matter if it
> > is read during PRE_COPY or STOP_COPY.
> 
> Not really, PRE_COPY -> RUNNING is a valid arc.
> 
> > > > > "The vfio_precopy_info data structure returned by this ioctl
> > > > > provides  estimates of data available from the device during the
> PRE_COPY states.
> > > > >  This estimate is split into two categories, initial_bytes and
> > > > > dirty_bytes.
> > > > >
> > > > >  The initial_bytes field indicates the amount of static data
> > > > > available  from the device.  This field should have a non-zero initial
> value and
> > > > >  decrease as migration data is read from the device.
> > > >
> > > > static isn't great either, how about just say 'minimum data available'
> > >
> > > 'initial precopy data-set'?
> >
> > Sure
> >
> > > We have no basis to make that assertion.  We've agreed that precopy
> > > can be used for nothing more than a compatibility test, so we could
> > > have a vGPU with a massive framebuffer and no ability to provide
> > > dirty tracking implement precopy only to include the entire
> > > framebuffer in the trailing STOP_COPY data set.  Per my
> > > understanding and the fact that we cannot enforce any heuristics
> > > regarding the size of the tailer relative to the pre-copy data set,
> > > I think the above strongly phrased sentence is necessary to
> > > understand the limitations of what this ioctl is meant to convey.
> > > Thanks,
> >
> > This is why abusing precopy for compatability is not a great idea. It
> > is OK for acc because its total state is tiny, but I would not agree
> > to a vGPU driver being merged working like you describe. It distorts
> > the entire purpose of PRE_COPY and this whole estimation mechanism.
> >
> > The ioctl is intended to convey when to switch to STOP_COPY, and the
> > driver should provide a semantic where the closer the reported length
> > is to 0 then the faster the STOP_COPY will go.
> 
> If it's an abuse, then let's not do it.  It was never my impression or intention
> that this was ok for acc only due to the minimal trailing data size.  My
> statement was that use of PRE_COPY for compatibility testing only had been a
> previously agreed valid use case of the original migration interface.
> 
> Furthermore the acc driver was explicitly directed not to indicate any degree
> of trailing data size in dirty_bytes, so while trailing data may be small for acc,
> this interface is explicitly not intended to provide any indication of trailing
> data size.  Thanks,

Just to clarify, so the suggestion here is not to use PRE_COPY for compatibility
check at all and have a different proper infrastructure for that later as Jason
suggested?

If so, I will remove this patch from this series and go back to the old revision
where we only have STOP_COPY and do the compatibility check during the final
load data operation.

Please let me know.

Thanks,
Shameer
Alex Williamson March 3, 2022, 7:59 p.m. UTC | #7
On Thu, 3 Mar 2022 18:05:53 +0000
Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> wrote:

> > -----Original Message-----
> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: 03 March 2022 15:21
> > To: Jason Gunthorpe <jgg@nvidia.com>
> > Cc: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
> > kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> > linux-crypto@vger.kernel.org; linux-pci@vger.kernel.org; cohuck@redhat.com;
> > mgurtovoy@nvidia.com; yishaih@nvidia.com; Linuxarm
> > <linuxarm@huawei.com>; liulongfang <liulongfang@huawei.com>; Zengtao (B)
> > <prime.zeng@hisilicon.com>; Jonathan Cameron
> > <jonathan.cameron@huawei.com>; Wangzhou (B) <wangzhou1@hisilicon.com>
> > Subject: Re: [PATCH v7 07/10] vfio: Extend the device migration protocol with
> > PRE_COPY
> > 
> > On Thu, 3 Mar 2022 09:01:24 -0400
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > On Wed, Mar 02, 2022 at 08:47:52PM -0700, Alex Williamson wrote:  
> > > > On Wed, 2 Mar 2022 20:05:28 -0400
> > > > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > > >  
> > > > > On Wed, Mar 02, 2022 at 01:31:59PM -0700, Alex Williamson wrote:  
> > > > > > > + * initial_bytes reflects the estimated remaining size of any
> > > > > > > + initial mandatory
> > > > > > > + * precopy data transfer. When initial_bytes returns as zero
> > > > > > > + then the initial
> > > > > > > + * phase of the precopy data is completed. Generally initial_bytes  
> > should start  
> > > > > > > + * out as approximately the entire device state.  
> > > > > >
> > > > > > What is "mandatory" intended to mean here?  The user isn't required  
> > to  
> > > > > > collect any data from the device in the PRE_COPY states.  
> > > > >
> > > > > If the data is split into initial,dirty,trailer then mandatory
> > > > > means that first chunk.  
> > > >
> > > > But there's no requirement to read anything in PRE_COPY, so initial
> > > > becomes indistinguishable from trailer and dirty doesn't exist.  
> > >
> > > It is still mandatory to read that data out, it doesn't matter if it
> > > is read during PRE_COPY or STOP_COPY.  
> > 
> > Not really, PRE_COPY -> RUNNING is a valid arc.
> >   
> > > > > > "The vfio_precopy_info data structure returned by this ioctl
> > > > > > provides  estimates of data available from the device during the  
> > PRE_COPY states.  
> > > > > >  This estimate is split into two categories, initial_bytes and
> > > > > > dirty_bytes.
> > > > > >
> > > > > >  The initial_bytes field indicates the amount of static data
> > > > > > available  from the device.  This field should have a non-zero initial  
> > value and  
> > > > > >  decrease as migration data is read from the device.  
> > > > >
> > > > > static isn't great either, how about just say 'minimum data available'  
> > > >
> > > > 'initial precopy data-set'?  
> > >
> > > Sure
> > >  
> > > > We have no basis to make that assertion.  We've agreed that precopy
> > > > can be used for nothing more than a compatibility test, so we could
> > > > have a vGPU with a massive framebuffer and no ability to provide
> > > > dirty tracking implement precopy only to include the entire
> > > > framebuffer in the trailing STOP_COPY data set.  Per my
> > > > understanding and the fact that we cannot enforce any heuristics
> > > > regarding the size of the tailer relative to the pre-copy data set,
> > > > I think the above strongly phrased sentence is necessary to
> > > > understand the limitations of what this ioctl is meant to convey.
> > > > Thanks,  
> > >
> > > This is why abusing precopy for compatability is not a great idea. It
> > > is OK for acc because its total state is tiny, but I would not agree
> > > to a vGPU driver being merged working like you describe. It distorts
> > > the entire purpose of PRE_COPY and this whole estimation mechanism.
> > >
> > > The ioctl is intended to convey when to switch to STOP_COPY, and the
> > > driver should provide a semantic where the closer the reported length
> > > is to 0 then the faster the STOP_COPY will go.  
> > 
> > If it's an abuse, then let's not do it.  It was never my impression or intention
> > that this was ok for acc only due to the minimal trailing data size.  My
> > statement was that use of PRE_COPY for compatibility testing only had been a
> > previously agreed valid use case of the original migration interface.
> > 
> > Furthermore the acc driver was explicitly directed not to indicate any degree
> > of trailing data size in dirty_bytes, so while trailing data may be small for acc,
> > this interface is explicitly not intended to provide any indication of trailing
> > data size.  Thanks,  
> 
> Just to clarify, so the suggestion here is not to use PRE_COPY for compatibility
> check at all and have a different proper infrastructure for that later as Jason
> suggested?
> 
> If so, I will remove this patch from this series and go back to the old revision
> where we only have STOP_COPY and do the compatibility check during the final
> load data operation.

Hi Shameer,

I think NVIDIA has a company long weekend, so I'm not sure how quickly
we'll hear a rebuttal from Jason, but at this point I'd rather not move
forward with using PRE_COPY exclusively for compatibility testing if
that is seen as an abuse of the interface, regardless of the size of
the remaining STOP_COPY data.  It might be most expedient to respin
without PRE_COPY and we'll revisit methods to perform early
compatibility testing in the future.  Thanks,

Alex
Jason Gunthorpe March 3, 2022, 11:49 p.m. UTC | #8
On Thu, Mar 03, 2022 at 12:59:30PM -0700, Alex Williamson wrote:

> > > If it's an abuse, then let's not do it.  It was never my
> > > impression or intention

So maybe abuse is the wrong word, but I don't want to mess up this
interface, which is intended to support real pre-copy devices, just
because devices that don't actually implement true precopy might do
silly things.

The vGPU case you imagine will still work and qemu will switch to
STOP_COPY with a huge trailer and be slow. That is unavoidable and I
think it is fine.

> > > Furthermore the acc driver was explicitly directed not to indicate any degree
> > > of trailing data size in dirty_bytes, so while trailing data may be small for acc,
> > > this interface is explicitly not intended to provide any indication of trailing
> > > data size.  Thanks, 

Yes, trailing data is not what this is for. This is only to help
decide when to switch from PRE_COPY to STOP_COPY. If the device can
execute STOP_COPY in the right time is a completely different
discussion/interface.

> > Just to clarify, so the suggestion here is not to use PRE_COPY for compatibility
> > check at all and have a different proper infrastructure for that later as Jason
> > suggested?
> > 
> > If so, I will remove this patch from this series and go back to the old revision
> > where we only have STOP_COPY and do the compatibility check during the final
> > load data operation.
> 
> Hi Shameer,
> 
> I think NVIDIA has a company long weekend, so I'm not sure how quickly
> we'll hear a rebuttal from Jason, but at this point I'd rather not
> move

Yes, company long weekend.

> forward with using PRE_COPY exclusively for compatibility testing if
> that is seen as an abuse of the interface, regardless of the size of
> the remaining STOP_COPY data.  It might be most expedient to respin
> without PRE_COPY and we'll revisit methods to perform early
> compatibility testing in the future.  Thanks,

Shameerali has talked about wanting this compat check early from the
start, and done all the work to implement it. I think it is pretty
extreme to blow up his series over trailing_data.

To me acc is fine to use it this way until we get a better solution
for compatability. We all need this, but I expect it to be complicated
to define.

Jason
Alex Williamson March 4, 2022, 7:56 p.m. UTC | #9
On Thu, 3 Mar 2022 19:49:51 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Thu, Mar 03, 2022 at 12:59:30PM -0700, Alex Williamson wrote:
> 
> > > > If it's an abuse, then let's not do it.  It was never my
> > > > impression or intention  
> 
> So maybe abuse is the wrong word, but I don't want to mess up this
> interface, which is intended to support real pre-copy devices, just
> because devices that don't actually implement true precopy might do
> silly things.

Abuse... silly... either way, you're clearly not comfortable misusing
PRE_COPY for this purpose.  

> The vGPU case you imagine will still work and qemu will switch to
> STOP_COPY with a huge trailer and be slow. That is unavoidable and I
> think it is fine.

It's not really fine, but I think it will require some better defined
interfaces and userspace support to give a clear picture of how data is
partitioned.

> > > > Furthermore the acc driver was explicitly directed not to indicate any degree
> > > > of trailing data size in dirty_bytes, so while trailing data may be small for acc,
> > > > this interface is explicitly not intended to provide any indication of trailing
> > > > data size.  Thanks,   
> 
> Yes, trailing data is not what this is for. This is only to help
> decide when to switch from PRE_COPY to STOP_COPY. If the device can
> execute STOP_COPY in the right time is a completely different
> discussion/interface.
> 
> > > Just to clarify, so the suggestion here is not to use PRE_COPY for compatibility
> > > check at all and have a different proper infrastructure for that later as Jason
> > > suggested?
> > > 
> > > If so, I will remove this patch from this series and go back to the old revision
> > > where we only have STOP_COPY and do the compatibility check during the final
> > > load data operation.  
> > 
> > Hi Shameer,
> > 
> > I think NVIDIA has a company long weekend, so I'm not sure how quickly
> > we'll hear a rebuttal from Jason, but at this point I'd rather not
> > move  
> 
> Yes, company long weekend.
> 
> > forward with using PRE_COPY exclusively for compatibility testing if
> > that is seen as an abuse of the interface, regardless of the size of
> > the remaining STOP_COPY data.  It might be most expedient to respin
> > without PRE_COPY and we'll revisit methods to perform early
> > compatibility testing in the future.  Thanks,  
> 
> Shameerali has talked about wanting this compat check early from the
> start, and done all the work to implement it. I think it is pretty
> extreme to blow up his series over trailing_data.
> 
> To me acc is fine to use it this way until we get a better solution
> for compatability. We all need this, but I expect it to be complicated
> to define.

It was only in v7 that we made this switch to use PRE_COPY for this
purpose, I wouldn't call it blowing up his series to step back and
decide that was a poor choice and clearly v8 exists without this.  This
isn't the end of the discussion regarding early compatibility testing,
but I'm not going to rush a PRE_COPY interface to support that early
compatibility testing if we're not agreed that it's a valid use case,
and not just a marginally acceptable one due to the trailing data being
inconsequential.  Let's focus on v8 and we can talk about further
extensions later.  Thanks,

Alex
diff mbox series

Patch

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index bdb5205bb358..a14b86913593 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1577,7 +1577,7 @@  int vfio_mig_get_next_state(struct vfio_device *device,
 			    enum vfio_device_mig_state new_fsm,
 			    enum vfio_device_mig_state *next_fsm)
 {
-	enum { VFIO_DEVICE_NUM_STATES = VFIO_DEVICE_STATE_RUNNING_P2P + 1 };
+	enum { VFIO_DEVICE_NUM_STATES = VFIO_DEVICE_STATE_PRE_COPY_P2P + 1 };
 	/*
 	 * The coding in this table requires the driver to implement
 	 * FSM arcs:
@@ -1596,25 +1596,59 @@  int vfio_mig_get_next_state(struct vfio_device *device,
 	 *         RUNNING -> STOP
 	 *         STOP -> RUNNING
 	 *
+	 * If precopy is supported then the driver must support these additional
+	 * FSM arcs:
+	 *         RUNNING -> PRE_COPY
+	 *         PRE_COPY -> RUNNING
+	 *         PRE_COPY -> STOP_COPY
+	 * However, if precopy and P2P are supported together then the driver
+	 * must support these additional arcs beyond the P2P arcs above:
+	 *         PRE_COPY -> RUNNING
+	 *         PRE_COPY -> PRE_COPY_P2P
+	 *         PRE_COPY_P2P -> PRE_COPY
+	 *         PRE_COPY_P2P -> RUNNING_P2P
+	 *         PRE_COPY_P2P -> STOP_COPY
+	 *         RUNNING -> PRE_COPY
+	 *         RUNNING_P2P -> PRE_COPY_P2P
+	 *
 	 * If all optional features are supported then the coding will step
 	 * through multiple states for these combination transitions:
+	 *         PRE_COPY -> PRE_COPY_P2P -> STOP_COPY
+	 *         PRE_COPY -> RUNNING -> RUNNING_P2P
+	 *         PRE_COPY -> RUNNING -> RUNNING_P2P -> STOP
+	 *         PRE_COPY -> RUNNING -> RUNNING_P2P -> STOP -> RESUMING
+	 *         PRE_COPY_P2P -> RUNNING_P2P -> RUNNING
+	 *         PRE_COPY_P2P -> RUNNING_P2P -> STOP
+	 *         PRE_COPY_P2P -> RUNNING_P2P -> STOP -> RESUMING
 	 *         RESUMING -> STOP -> RUNNING_P2P
+	 *         RESUMING -> STOP -> RUNNING_P2P -> PRE_COPY_P2P
 	 *         RESUMING -> STOP -> RUNNING_P2P -> RUNNING
+	 *         RESUMING -> STOP -> RUNNING_P2P -> RUNNING -> PRE_COPY
 	 *         RESUMING -> STOP -> STOP_COPY
+	 *         RUNNING -> RUNNING_P2P -> PRE_COPY_P2P
 	 *         RUNNING -> RUNNING_P2P -> STOP
 	 *         RUNNING -> RUNNING_P2P -> STOP -> RESUMING
 	 *         RUNNING -> RUNNING_P2P -> STOP -> STOP_COPY
+	 *         RUNNING_P2P -> RUNNING -> PRE_COPY
 	 *         RUNNING_P2P -> STOP -> RESUMING
 	 *         RUNNING_P2P -> STOP -> STOP_COPY
+	 *         STOP -> RUNNING_P2P -> PRE_COPY_P2P
 	 *         STOP -> RUNNING_P2P -> RUNNING
+	 *         STOP -> RUNNING_P2P -> RUNNING -> PRE_COPY
 	 *         STOP_COPY -> STOP -> RESUMING
 	 *         STOP_COPY -> STOP -> RUNNING_P2P
 	 *         STOP_COPY -> STOP -> RUNNING_P2P -> RUNNING
+	 *
+	 *  The following transitions are blocked:
+	 *         STOP_COPY -> PRE_COPY
+	 *         STOP_COPY -> PRE_COPY_P2P
 	 */
 	static const u8 vfio_from_fsm_table[VFIO_DEVICE_NUM_STATES][VFIO_DEVICE_NUM_STATES] = {
 		[VFIO_DEVICE_STATE_STOP] = {
 			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY,
 			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RESUMING,
 			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
@@ -1623,14 +1657,38 @@  int vfio_mig_get_next_state(struct vfio_device *device,
 		[VFIO_DEVICE_STATE_RUNNING] = {
 			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_PRE_COPY,
+			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
 		},
+		[VFIO_DEVICE_STATE_PRE_COPY] = {
+			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_PRE_COPY,
+			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_PRE_COPY_P2P,
+			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_PRE_COPY_P2P,
+			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
+		},
+		[VFIO_DEVICE_STATE_PRE_COPY_P2P] = {
+			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_PRE_COPY,
+			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_PRE_COPY_P2P,
+			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY,
+			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
+		},
 		[VFIO_DEVICE_STATE_STOP_COPY] = {
 			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_ERROR,
+			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_ERROR,
 			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY,
 			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_STOP,
@@ -1639,6 +1697,8 @@  int vfio_mig_get_next_state(struct vfio_device *device,
 		[VFIO_DEVICE_STATE_RESUMING] = {
 			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RESUMING,
 			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_STOP,
@@ -1647,6 +1707,8 @@  int vfio_mig_get_next_state(struct vfio_device *device,
 		[VFIO_DEVICE_STATE_RUNNING_P2P] = {
 			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_PRE_COPY_P2P,
 			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
@@ -1655,6 +1717,8 @@  int vfio_mig_get_next_state(struct vfio_device *device,
 		[VFIO_DEVICE_STATE_ERROR] = {
 			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_ERROR,
 			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_ERROR,
+			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_ERROR,
+			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_ERROR,
 			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_ERROR,
 			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_ERROR,
 			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_ERROR,
@@ -1665,6 +1729,11 @@  int vfio_mig_get_next_state(struct vfio_device *device,
 	static const unsigned int state_flags_table[VFIO_DEVICE_NUM_STATES] = {
 		[VFIO_DEVICE_STATE_STOP] = VFIO_MIGRATION_STOP_COPY,
 		[VFIO_DEVICE_STATE_RUNNING] = VFIO_MIGRATION_STOP_COPY,
+		[VFIO_DEVICE_STATE_PRE_COPY] =
+			VFIO_MIGRATION_STOP_COPY | VFIO_MIGRATION_PRE_COPY,
+		[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_MIGRATION_STOP_COPY |
+						   VFIO_MIGRATION_P2P |
+						   VFIO_MIGRATION_PRE_COPY,
 		[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_MIGRATION_STOP_COPY,
 		[VFIO_DEVICE_STATE_RESUMING] = VFIO_MIGRATION_STOP_COPY,
 		[VFIO_DEVICE_STATE_RUNNING_P2P] =
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index fea86061b44e..440dbfaabdb3 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -819,12 +819,20 @@  struct vfio_device_feature {
  * VFIO_MIGRATION_STOP_COPY | VFIO_MIGRATION_P2P means that RUNNING_P2P
  * is supported in addition to the STOP_COPY states.
  *
+ * VFIO_MIGRATION_STOP_COPY | VFIO_MIGRATION_PRE_COPY means that
+ * PRE_COPY is supported in addition to the STOP_COPY states.
+ *
+ * VFIO_MIGRATION_STOP_COPY | VFIO_MIGRATION_P2P | VFIO_MIGRATION_PRE_COPY
+ * means that RUNNING_P2P, PRE_COPY and PRE_COPY_P2P are supported
+ * in addition to the STOP_COPY states.
+ *
  * Other combinations of flags have behavior to be defined in the future.
  */
 struct vfio_device_feature_migration {
 	__aligned_u64 flags;
 #define VFIO_MIGRATION_STOP_COPY	(1 << 0)
 #define VFIO_MIGRATION_P2P		(1 << 1)
+#define VFIO_MIGRATION_PRE_COPY		(1 << 2)
 };
 #define VFIO_DEVICE_FEATURE_MIGRATION 1
 
@@ -875,8 +883,13 @@  struct vfio_device_feature_mig_state {
  *  RESUMING - The device is stopped and is loading a new internal state
  *  ERROR - The device has failed and must be reset
  *
- * And 1 optional state to support VFIO_MIGRATION_P2P:
+ * And optional states to support VFIO_MIGRATION_P2P:
  *  RUNNING_P2P - RUNNING, except the device cannot do peer to peer DMA
+ * And VFIO_MIGRATION_PRE_COPY:
+ *  PRE_COPY - The device is running normally but tracking internal state
+ *             changes
+ * And VFIO_MIGRATION_P2P | VFIO_MIGRATION_PRE_COPY:
+ *  PRE_COPY_P2P - PRE_COPY, except the device cannot do peer to peer DMA
  *
  * The FSM takes actions on the arcs between FSM states. The driver implements
  * the following behavior for the FSM arcs:
@@ -908,20 +921,48 @@  struct vfio_device_feature_mig_state {
  *
  *   To abort a RESUMING session the device must be reset.
  *
+ * PRE_COPY -> RUNNING
  * RUNNING_P2P -> RUNNING
  *   While in RUNNING the device is fully operational, the device may generate
  *   interrupts, DMA, respond to MMIO, all vfio device regions are functional,
  *   and the device may advance its internal state.
  *
+ *   The PRE_COPY arc will terminate a data transfer session.
+ *
+ * PRE_COPY_P2P -> RUNNING_P2P
  * RUNNING -> RUNNING_P2P
  * STOP -> RUNNING_P2P
  *   While in RUNNING_P2P the device is partially running in the P2P quiescent
  *   state defined below.
  *
+ *   The PRE_COPY arc will terminate a data transfer session.
+ *
+ * RUNNING -> PRE_COPY
+ * RUNNING_P2P -> PRE_COPY_P2P
  * STOP -> STOP_COPY
- *   This arc begin the process of saving the device state and will return a
- *   new data_fd.
+ *   PRE_COPY, PRE_COPY_P2P and STOP_COPY form the "saving group" of states
+ *   which share a data transfer session. Moving between these states alters
+ *   what is streamed in session, but does not terminate or otherwise effect
+ *   the associated fd.
+ *
+ *   These arcs begin the process of saving the device state and will return a
+ *   new data_fd. The migration driver may perform actions such as enabling
+ *   dirty logging of device state when entering PRE_COPY or PER_COPY_P2P.
+ *
+ *   Each arc does not change the device operation, the device remains
+ *   RUNNING, P2P quiesced or in STOP. The STOP_COPY state is described below
+ *   in PRE_COPY_P2P -> STOP_COPY.
+ *
+ * PRE_COPY -> PRE_COPY_P2P
+ *   Entering PRE_COPY_P2P continues all the behaviors of PRE_COPY above.
+ *   However, while in the PRE_COPY_P2P state, the device is partially running
+ *   in the P2P quiescent state defined below, like RUNNING_P2P.
  *
+ * PRE_COPY_P2P -> PRE_COPY
+ *   This arc allows returning the device to a full RUNNING behavior while
+ *   continuing all the behaviors of PRE_COPY.
+ *
+ * PRE_COPY_P2P -> STOP_COPY
  *   While in the STOP_COPY state the device has the same behavior as STOP
  *   with the addition that the data transfers session continues to stream the
  *   migration state. End of stream on the FD indicates the entire device
@@ -939,6 +980,13 @@  struct vfio_device_feature_mig_state {
  *   device state for this arc if required to prepare the device to receive the
  *   migration data.
  *
+ * STOP_COPY -> PRE_COPY
+ * STOP_COPY -> PRE_COPY_P2P
+ *   These arcs are not permitted and return error if requested. Future
+ *   revisions of this API may define behaviors for these arcs, in this case
+ *   support will be discoverable by a new flag in
+ *   VFIO_DEVICE_FEATURE_MIGRATION.
+ *
  * any -> ERROR
  *   ERROR cannot be specified as a device state, however any transition request
  *   can be failed with an errno return and may then move the device_state into
@@ -950,7 +998,7 @@  struct vfio_device_feature_mig_state {
  * The optional peer to peer (P2P) quiescent state is intended to be a quiescent
  * state for the device for the purposes of managing multiple devices within a
  * user context where peer-to-peer DMA between devices may be active. The
- * RUNNING_P2P states must prevent the device from initiating
+ * RUNNING_P2P and PRE_COPY_P2P states must prevent the device from initiating
  * any new P2P DMA transactions. If the device can identify P2P transactions
  * then it can stop only P2P DMA, otherwise it must stop all DMA. The migration
  * driver must complete any such outstanding operations prior to completing the
@@ -963,6 +1011,8 @@  struct vfio_device_feature_mig_state {
  * above FSM arcs. As there are multiple paths through the FSM arcs the path
  * should be selected based on the following rules:
  *   - Select the shortest path.
+ *   - The path cannot have saving group states as interior arcs, only
+ *     starting/end states.
  * Refer to vfio_mig_get_next_state() for the result of the algorithm.
  *
  * The automatic transit through the FSM arcs that make up the combination
@@ -976,6 +1026,9 @@  struct vfio_device_feature_mig_state {
  * support them. The user can discover if these states are supported by using
  * VFIO_DEVICE_FEATURE_MIGRATION. By using combination transitions the user can
  * avoid knowing about these optional states if the kernel driver supports them.
+ *
+ * Arcs touching PRE_COPY and PRE_COPY_P2P are removed if support for PRE_COPY
+ * is not present.
  */
 enum vfio_device_mig_state {
 	VFIO_DEVICE_STATE_ERROR = 0,
@@ -984,8 +1037,60 @@  enum vfio_device_mig_state {
 	VFIO_DEVICE_STATE_STOP_COPY = 3,
 	VFIO_DEVICE_STATE_RESUMING = 4,
 	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
+	VFIO_DEVICE_STATE_PRE_COPY = 6,
+	VFIO_DEVICE_STATE_PRE_COPY_P2P = 7,
+};
+
+/**
+ * VFIO_MIG_GET_PRECOPY_INFO - _IO(VFIO_TYPE, VFIO_BASE + 21)
+ *
+ * This ioctl is used on the migration data FD in the precopy phase of the
+ * migration data transfer. It returns an estimate of the current data sizes
+ * remaining to be transferred. It allows the user to judge when it is
+ * appropriate to leave PRE_COPY for STOP_COPY.
+ *
+ * This ioctl is valid only in VFIO_DEVICE_STATE_PRE_COPY state and kernel
+ * driver should return -EINVAL from any other migration state.
+ *
+ * initial_bytes reflects the estimated remaining size of any initial mandatory
+ * precopy data transfer. When initial_bytes returns as zero then the initial
+ * phase of the precopy data is completed. Generally initial_bytes should start
+ * out as approximately the entire device state.
+ *
+ * dirty_bytes reflects an estimate for how much more data needs to be
+ * transferred to complete the migration. Generally it should start as zero
+ * and increase as internal state is dirtied.
+ *
+ * Drivers should attempt to return estimates so that initial_bytes +
+ * dirty_bytes matches the amount of data an immediate transition to STOP_COPY
+ * will require to be streamed.
+ *
+ * Drivers have a lot of flexibility in when and what they transfer during the
+ * PRE_COPY phase, and how they report this from VFIO_MIG_GET_PRECOPY_INFO.
+ *
+ * During pre-copy the migration data FD has a temporary "end of stream" that is
+ * reached when both initial_bytes and dirty_byte are zero. For instance, this
+ * may indicate that the device is idle and not currently dirtying any internal
+ * state. When read() is done on this temporary end of stream the kernel driver
+ * should return ENOMSG from read(). Userspace can wait for more data (which may
+ * never come) by using poll.
+ *
+ * Once in STOP_COPY the migration data FD has a permanent end of stream
+ * signaled in the usual way by read() always returning 0 and poll always
+ * returning readable. ENOMSG may not be returned in STOP_COPY. Support
+ * for this ioctl is optional.
+ *
+ * Return: 0 on success, -1 and errno set on failure.
+ */
+struct vfio_precopy_info {
+	__u32 argsz;
+	__u32 flags;
+	__aligned_u64 initial_bytes;
+	__aligned_u64 dirty_bytes;
 };
 
+#define VFIO_MIG_GET_PRECOPY_INFO _IO(VFIO_TYPE, VFIO_BASE + 21)
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**