diff mbox series

[V7,mlx5-next,08/15] vfio: Define device migration protocol v2

Message ID 20220207172216.206415-9-yishaih@nvidia.com (mailing list archive)
State New, archived
Headers show
Series Add mlx5 live migration driver and v2 migration protocol | expand

Commit Message

Yishai Hadas Feb. 7, 2022, 5:22 p.m. UTC
From: Jason Gunthorpe <jgg@nvidia.com>

Replace the existing region based migration protocol with an ioctl based
protocol. The two protocols have the same general semantic behaviors, but
the way the data is transported is changed.

This is the STOP_COPY portion of the new protocol, it defines the 5 states
for basic stop and copy migration and the protocol to move the migration
data in/out of the kernel.

Compared to the clarification of the v1 protocol Alex proposed:

https://lore.kernel.org/r/163909282574.728533.7460416142511440919.stgit@omen

This has a few deliberate functional differences:

 - ERROR arcs allow the device function to remain unchanged.

 - The protocol is not required to return to the original state on
   transition failure. Instead userspace can execute an unwind back to
   the original state, reset, or do something else without needing kernel
   support. This simplifies the kernel design and should userspace choose
   a policy like always reset, avoids doing useless work in the kernel
   on error handling paths.

 - PRE_COPY is made optional, userspace must discover it before using it.
   This reflects the fact that the majority of drivers we are aware of
   right now will not implement PRE_COPY.

 - segmentation is not part of the data stream protocol, the receiver
   does not have to reproduce the framing boundaries.

The hybrid FSM for the device_state is described as a Mealy machine by
documenting each of the arcs the driver is required to implement. Defining
the remaining set of old/new device_state transitions as 'combination
transitions' which are naturally defined as taking multiple FSM arcs along
the shortest path within the FSM's digraph allows a complete matrix of
transitions.

A new VFIO_DEVICE_FEATURE of VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE is
defined to replace writing to the device_state field in the region. This
allows returning a brand new FD whenever the requested transition opens
a data transfer session.

The VFIO core code implements the new feature and provides a helper
function to the driver. Using the helper the driver only has to
implement 6 of the FSM arcs and the other combination transitions are
elaborated consistently from those arcs.

A new VFIO_DEVICE_FEATURE of VFIO_DEVICE_FEATURE_MIGRATION is defined to
report the capability for migration and indicate which set of states and
arcs are supported by the device. The FSM provides a lot of flexibility to
make backwards compatible extensions but the VFIO_DEVICE_FEATURE also
allows for future breaking extensions for scenarios that cannot support
even the basic STOP_COPY requirements.

The VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE with the GET option (i.e.
VFIO_DEVICE_FEATURE_GET) can be used to read the current migration state
of the VFIO device.

Data transfer sessions are now carried over a file descriptor, instead of
the region. The FD functions for the lifetime of the data transfer
session. read() and write() transfer the data with normal Linux stream FD
semantics. This design allows future expansion to support poll(),
io_uring, and other performance optimizations.

The complicated mmap mode for data transfer is discarded as current qemu
doesn't take meaningful advantage of it, and the new qemu implementation
avoids substantially all the performance penalty of using a read() on the
region.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 drivers/vfio/vfio.c       | 198 ++++++++++++++++++++++++++++++++++++++
 include/linux/vfio.h      |  17 ++++
 include/uapi/linux/vfio.h | 172 ++++++++++++++++++++++++++++++---
 3 files changed, 374 insertions(+), 13 deletions(-)

Comments

Alex Williamson Feb. 9, 2022, 12:07 a.m. UTC | #1
On Mon, 7 Feb 2022 19:22:09 +0200
Yishai Hadas <yishaih@nvidia.com> wrote:
> +static int
> +vfio_ioctl_device_feature_mig_device_state(struct vfio_device *device,
> +					   u32 flags, void __user *arg,
> +					   size_t argsz)
> +{
> +	size_t minsz =
> +		offsetofend(struct vfio_device_feature_mig_state, data_fd);
> +	struct vfio_device_feature_mig_state mig;

Perhaps set default data_fd here?  ie.

  struct vfio_device_feature_mig_state mig = { .data_fd = -1 };

> +	struct file *filp = NULL;
> +	int ret;
> +
> +	if (!device->ops->migration_set_state ||
> +	    !device->ops->migration_get_state)
> +		return -ENOTTY;
> +
> +	ret = vfio_check_feature(flags, argsz,
> +				 VFIO_DEVICE_FEATURE_SET |
> +				 VFIO_DEVICE_FEATURE_GET,
> +				 sizeof(mig));
> +	if (ret != 1)
> +		return ret;
> +
> +	if (copy_from_user(&mig, arg, minsz))
> +		return -EFAULT;
> +
> +	if (flags & VFIO_DEVICE_FEATURE_GET) {
> +		enum vfio_device_mig_state curr_state;
> +
> +		ret = device->ops->migration_get_state(device, &curr_state);
> +		if (ret)
> +			return ret;
> +		mig.device_state = curr_state;
> +		goto out_copy;
> +	}
> +
> +	/* Handle the VFIO_DEVICE_FEATURE_SET */
> +	filp = device->ops->migration_set_state(device, mig.device_state);
> +	if (IS_ERR(filp) || !filp)
> +		goto out_copy;
> +
> +	return vfio_ioct_mig_return_fd(filp, arg, &mig);
> +out_copy:
> +	mig.data_fd = -1;
> +	if (copy_to_user(arg, &mig, sizeof(mig)))
> +		return -EFAULT;
> +	if (IS_ERR(filp))
> +		return PTR_ERR(filp);
> +	return 0;
> +}
...
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index ca69516f869d..3f4a1a7c2277 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -56,6 +56,13 @@ struct vfio_device {
>   *         match, -errno for abort (ex. match with insufficient or incorrect
>   *         additional args)
>   * @device_feature: Fill in the VFIO_DEVICE_FEATURE ioctl
> + * @migration_set_state: Optional callback to change the migration state for
> + *         devices that support migration. The returned FD is used for data
> + *         transfer according to the FSM definition. The driver is responsible
> + *         to ensure that FD is isolated whenever the migration FSM leaves a
> + *         data transfer state or before close_device() returns.
> + @migration_get_state: Optional callback to get the migration state for

Fix formatting, " * @mig..."

> + *         devices that support migration.
>   */
>  struct vfio_device_ops {
>  	char	*name;
...
> +/*
> + * Indicates the device can support the migration API. See enum
> + * vfio_device_mig_state for details. If present flags must be non-zero and
> + * VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE is supported.
> + *
> + * VFIO_MIGRATION_STOP_COPY means that RUNNING, STOP, STOP_COPY and
> + * RESUMING are supported.
> + */
> +struct vfio_device_feature_migration {
> +	__aligned_u64 flags;
> +#define VFIO_MIGRATION_STOP_COPY	(1 << 0)
> +};
> +#define VFIO_DEVICE_FEATURE_MIGRATION 1
> +
> +/*
> + * Upon VFIO_DEVICE_FEATURE_SET,
> + * Execute a migration state change on the VFIO device.
> + * The new state is supplied in device_state.
> + *
> + * The kernel migration driver must fully transition the device to the new state
> + * value before the write(2) operation returns to the user.

Stale comment, there's no write(2) anymore.

> + *
> + * The kernel migration driver must not generate asynchronous device state
> + * transitions outside of manipulation by the user or the VFIO_DEVICE_RESET
> + * ioctl as described above.
> + *
> + * If this function fails then current device_state may be the original
> + * operating state or some other state along the combination transition path.
> + * The user can then decide if it should execute a VFIO_DEVICE_RESET, attempt
> + * to return to the original state, or attempt to return to some other state
> + * such as RUNNING or STOP.
> + *
> + * If the new_state starts a new data transfer session then the FD associated
> + * with that session is returned in data_fd. The user is responsible to close
> + * this FD when it is finished. The user must consider the migration data
> + * segments carried over the FD to be opaque and non-fungible. During RESUMING,
> + * the data segments must be written in the same order they came out of the
> + * saving side FD.
> + *
> + * Upon VFIO_DEVICE_FEATURE_GET,
> + * Get the current migration state of the VFIO device, data_fd will be -1.
> + */
> +struct vfio_device_feature_mig_state {
> +	__u32 device_state; /* From enum vfio_device_mig_state */
> +	__s32 data_fd;
> +};
> +#define VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE 2
> +
> +/*
> + * The device migration Finite State Machine is described by the enum
> + * vfio_device_mig_state. Some of the FSM arcs will create a migration data
> + * transfer session by returning a FD, in this case the migration data will
> + * flow over the FD using read() and write() as discussed below.
> + *
> + * There are 5 states to support VFIO_MIGRATION_STOP_COPY:
> + *  RUNNING - The device is running normally
> + *  STOP - The device does not change the internal or external state
> + *  STOP_COPY - The device internal state can be read out
> + *  RESUMING - The device is stopped and is loading a new internal state
> + *  ERROR - The device has failed and must be reset
> + *
> + * The FSM takes actions on the arcs between FSM states. The driver implements
> + * the following behavior for the FSM arcs:
> + *
> + * RUNNING -> STOP
> + * STOP_COPY -> STOP
> + *   While in STOP the device must stop the operation of the device. The
> + *   device must not generate interrupts, DMA, or advance its internal
> + *   state. When stopped the device and kernel migration driver must accept
> + *   and respond to interaction to support external subsystems in the STOP
> + *   state, for example PCI MSI-X and PCI config pace. Failure by the user to
> + *   restrict device access while in STOP must not result in error conditions
> + *   outside the user context (ex. host system faults).
> + *
> + *   The STOP_COPY arc will terminate a data transfer session.
> + *
> + * RESUMING -> STOP
> + *   Leaving RESUMING terminates a data transfer session and indicates the
> + *   device should complete processing of the data delivered by write(). The
> + *   kernel migration driver should complete the incorporation of data written
> + *   to the data transfer FD into the device internal state and perform
> + *   final validity and consistency checking of the new device state. If the
> + *   user provided data is found to be incomplete, inconsistent, or otherwise
> + *   invalid, the migration driver must fail the SET_STATE ioctl and
> + *   optionally go to the ERROR state as described below.
> + *
> + *   While in STOP the device has the same behavior as other STOP states
> + *   described above.
> + *
> + *   To abort a RESUMING session the device must be reset.
> + *
> + * STOP -> RUNNING
> + *   While in RUNNING the device is fully operational, the device may generate
> + *   interrupts, DMA, respond to MMIO, all vfio device regions are functional,
> + *   and the device may advance its internal state.
> + *
> + * STOP -> STOP_COPY
> + *   This arc begin the process of saving the device state and will return a
> + *   new data_fd.
> + *
> + *   While in the STOP_COPY state the device has the same behavior as STOP
> + *   with the addition that the data transfers session continues to stream the
> + *   migration state. End of stream on the FD indicates the entire device
> + *   state has been transferred.
> + *
> + *   The user should take steps to restrict access to vfio device regions while
> + *   the device is in STOP_COPY or risk corruption of the device migration data
> + *   stream.
> + *
> + * STOP -> RESUMING
> + *   Entering the RESUMING state starts a process of restoring the device
> + *   state and will return a new data_fd. The data stream fed into the data_fd
> + *   should be taken from the data transfer output of the saving group states
> + *   from a compatible device. The migration driver may alter/reset the
> + *   internal device state for this arc if required to prepare the device to
> + *   receive the migration data.
> + *
> + * any -> ERROR
> + *   ERROR cannot be specified as a device state, however any transition request
> + *   can be failed with an errno return and may then move the device_state into
> + *   ERROR. In this case the device was unable to execute the requested arc and
> + *   was also unable to restore the device to any valid device_state.
> + *   To recover from ERROR VFIO_DEVICE_RESET must be used to return the
> + *   device_state back to RUNNING.
> + *
> + * The remaining possible transitions are interpreted as combinations of the
> + * above FSM arcs. As there are multiple paths through the FSM arcs the path
> + * should be selected based on the following rules:
> + *   - Select the shortest path.
> + * Refer to vfio_mig_get_next_state() for the result of the algorithm.
> + *
> + * The automatic transit through the FSM arcs that make up the combination
> + * transition is invisible to the user. When working with combination arcs the
> + * user may see any step along the path in the device_state if SET_STATE
> + * fails. When handling these types of errors users should anticipate future
> + * revisions of this protocol using new states and those states becoming
> + * visible in this case.
> + */
> +enum vfio_device_mig_state {
> +	VFIO_DEVICE_STATE_ERROR = 0,
> +	VFIO_DEVICE_STATE_STOP = 1,
> +	VFIO_DEVICE_STATE_RUNNING = 2,

I'm a little surprised we're not using RUNNING = 0 given all the
objection in the v1 protocol that the default state was non-zero.

> +	VFIO_DEVICE_STATE_STOP_COPY = 3,
> +	VFIO_DEVICE_STATE_RESUMING = 4,
> +};
> +
>  /* -------- API for Type1 VFIO IOMMU -------- */
>  
>  /**

Otherwise, I'm still not sure how userspace handles the fact that it
can't know how much data will be read from the device and how important
that is.  There's no replacement of that feature from the v1 protocol
here.

As you noted, it's not only the size of the migration data, but also
the rate the device can generate it.  However, I also expect that it's
generally the external rate that's the limiting factor.  I've not
previously seen any evidence that the device rate is taken into account.

I also think we're still waiting for confirmation from owners of
devices with extremely large device states (vGPUs) whether they consider
the stream FD sufficient versus their ability to directly mmap regions
of the device in the previous protocol.  Thanks,

Alex
Jason Gunthorpe Feb. 9, 2022, 2:36 a.m. UTC | #2
On Tue, Feb 08, 2022 at 05:07:54PM -0700, Alex Williamson wrote:
> On Mon, 7 Feb 2022 19:22:09 +0200
> Yishai Hadas <yishaih@nvidia.com> wrote:
> > +static int
> > +vfio_ioctl_device_feature_mig_device_state(struct vfio_device *device,
> > +					   u32 flags, void __user *arg,
> > +					   size_t argsz)
> > +{
> > +	size_t minsz =
> > +		offsetofend(struct vfio_device_feature_mig_state, data_fd);
> > +	struct vfio_device_feature_mig_state mig;
> 
> Perhaps set default data_fd here?  ie.
> 
>   struct vfio_device_feature_mig_state mig = { .data_fd = -1 };

Why? there is no path where this variable is read before set.

> > +	struct file *filp = NULL;
> > +	int ret;
> > +
> > +	if (!device->ops->migration_set_state ||
> > +	    !device->ops->migration_get_state)
> > +		return -ENOTTY;
> > +
> > +	ret = vfio_check_feature(flags, argsz,
> > +				 VFIO_DEVICE_FEATURE_SET |
> > +				 VFIO_DEVICE_FEATURE_GET,
> > +				 sizeof(mig));
> > +	if (ret != 1)
> > +		return ret;
> > +
> > +	if (copy_from_user(&mig, arg, minsz))
> > +		return -EFAULT;

                   ^^^^^^^^^^^^^^

Is before all gotos.

> > +enum vfio_device_mig_state {
> > +	VFIO_DEVICE_STATE_ERROR = 0,
> > +	VFIO_DEVICE_STATE_STOP = 1,
> > +	VFIO_DEVICE_STATE_RUNNING = 2,
> 
> I'm a little surprised we're not using RUNNING = 0 given all the
> objection in the v1 protocol that the default state was non-zero.

Making ERROR 0 ensures that errors, eg in the FSM table due to a
backport or something still work properly.

I think we corrected that confusion by explicitly calling out RUNNING
as the default and removing the register-like region API.

> >  /* -------- API for Type1 VFIO IOMMU -------- */
> >  
> >  /**
> 
> Otherwise, I'm still not sure how userspace handles the fact that it
> can't know how much data will be read from the device and how important
> that is.  There's no replacement of that feature from the v1 protocol
> here.

I'm not sure this was part of the v1 protocol either. Yes it had a
pending_bytes, but I don't think it was actually expected to be 100%
accurate. Computing this value accurately is potentially quite
expensive, I would prefer we not enforce this on an implementation
without a reason, and qemu currently doesn't make use of it.

The ioctl from the precopy patch is probably the best approach, I
think it would be fine to allow that for stop copy as well, but also
don't see a usage right now.

It is not something that needs decision now, it is very easy to detect
if an ioctl is supported on the data_fd at runtime to add new things
here when needed.

> I also think we're still waiting for confirmation from owners of
> devices with extremely large device states (vGPUs) whether they consider
> the stream FD sufficient versus their ability to directly mmap regions
> of the device in the previous protocol.  Thanks,

As is this.

I think the mlx5 and huwaei patches show that without a doubt the
stream fd is the correct choice for these drivers.

Jason
Tian, Kevin Feb. 15, 2022, 8:04 a.m. UTC | #3
> From: Yishai Hadas <yishaih@nvidia.com>
> Sent: Tuesday, February 8, 2022 1:22 AM
> 
> +static int vfio_ioctl_device_feature_migration(struct vfio_device *device,
> +					       u32 flags, void __user *arg,
> +					       size_t argsz)
> +{
> +	struct vfio_device_feature_migration mig = {
> +		.flags = VFIO_MIGRATION_STOP_COPY,
> +	};
> +	int ret;
> +
> +	if (!device->ops->migration_set_state)
> +		return -ENOTTY;

Miss a check on migration_get_state, as done in last function.

> + * @migration_set_state: Optional callback to change the migration state for
> + *         devices that support migration. The returned FD is used for data
> + *         transfer according to the FSM definition. The driver is responsible
> + *         to ensure that FD is isolated whenever the migration FSM leaves a
> + *         data transfer state or before close_device() returns.

didn't understand the meaning of 'isolated' here.

> +#define VFIO_DEVICE_STATE_V1_STOP      (0)
> +#define VFIO_DEVICE_STATE_V1_RUNNING   (1 << 0)
> +#define VFIO_DEVICE_STATE_V1_SAVING    (1 << 1)
> +#define VFIO_DEVICE_STATE_V1_RESUMING  (1 << 2)
> +#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_V1_RUNNING
> | \
> +				     VFIO_DEVICE_STATE_V1_SAVING |  \
> +				     VFIO_DEVICE_STATE_V1_RESUMING)

Does it make sense to also add 'V1' to MASK and also following macros
given their names are general?

  #define VFIO_DEVICE_STATE_VALID(state) \
  #define VFIO_DEVICE_STATE_IS_ERROR(state) \
  #define VFIO_DEVICE_STATE_SET_ERROR(state) \

It certainly implies more changes to v1 code but readability can be
slightly improved.

> +/*
> + * Indicates the device can support the migration API. See enum

call it V2? Not necessary to add V2 in code but worthy of a clarification
in comment.

> + * vfio_device_mig_state for details. If present flags must be non-zero and
> + * VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE is supported.
> + *
> + * VFIO_MIGRATION_STOP_COPY means that RUNNING, STOP, STOP_COPY
> and
> + * RESUMING are supported.
> + */

Not aligned with other places where 5 states are mentioned. Better add
ERROR here.

> + *
> + * RUNNING -> STOP
> + * STOP_COPY -> STOP
> + *   While in STOP the device must stop the operation of the device. The
> + *   device must not generate interrupts, DMA, or advance its internal
> + *   state. When stopped the device and kernel migration driver must accept
> + *   and respond to interaction to support external subsystems in the STOP
> + *   state, for example PCI MSI-X and PCI config pace. Failure by the user to
> + *   restrict device access while in STOP must not result in error conditions
> + *   outside the user context (ex. host system faults).

Right above the STOP state is defined as:

       *  STOP - The device does not change the internal or external state

'external state' I assume means P2P activities. For consistency it is clearer
to also say something about external state in above paragraph.

> + *
> + *   The STOP_COPY arc will terminate a data transfer session.

remove 'will'

> + *
> + * STOP -> STOP_COPY
> + *   This arc begin the process of saving the device state and will return a
> + *   new data_fd.
> + *
> + *   While in the STOP_COPY state the device has the same behavior as STOP
> + *   with the addition that the data transfers session continues to stream the
> + *   migration state. End of stream on the FD indicates the entire device
> + *   state has been transferred.
> + *
> + *   The user should take steps to restrict access to vfio device regions while
> + *   the device is in STOP_COPY or risk corruption of the device migration
> data
> + *   stream.

Restricting access has been explained in the to-STOP arcs and it is stated 
that while in STOP_COPY the device has the same behavior as STOP. So 
I think the last paragraph is possibly not required.

> + *
> + * STOP -> RESUMING
> + *   Entering the RESUMING state starts a process of restoring the device
> + *   state and will return a new data_fd. The data stream fed into the
> data_fd
> + *   should be taken from the data transfer output of the saving group states

No definition of 'group state' (maybe introduced in a later patch?)

> + *   from a compatible device. The migration driver may alter/reset the
> + *   internal device state for this arc if required to prepare the device to
> + *   receive the migration data.
> + *

Thanks
Kevin
Tian, Kevin Feb. 15, 2022, 10:41 a.m. UTC | #4
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, February 9, 2022 10:37 AM
> 
> > >  /* -------- API for Type1 VFIO IOMMU -------- */
> > >
> > >  /**
> >
> > Otherwise, I'm still not sure how userspace handles the fact that it
> > can't know how much data will be read from the device and how important
> > that is.  There's no replacement of that feature from the v1 protocol
> > here.
> 
> I'm not sure this was part of the v1 protocol either. Yes it had a
> pending_bytes, but I don't think it was actually expected to be 100%
> accurate. Computing this value accurately is potentially quite
> expensive, I would prefer we not enforce this on an implementation
> without a reason, and qemu currently doesn't make use of it.
> 
> The ioctl from the precopy patch is probably the best approach, I
> think it would be fine to allow that for stop copy as well, but also
> don't see a usage right now.
> 
> It is not something that needs decision now, it is very easy to detect
> if an ioctl is supported on the data_fd at runtime to add new things
> here when needed.
> 

Another interesting thing (not an immediate concern on this series)
is how to handle devices which may have long time (e.g. due to 
draining outstanding requests, even w/o vPRI) to enter the STOP 
state. that time is not as deterministic as pending bytes thus cannot
be reported back to the user before the operation is actually done.

Similarly to what we discussed for vPRI an eventfd will be beneficial 
so the user can timeout-wait on it, but it also needs an arc to create 
the eventfd between RUNNING->STOP...

Thanks
Kevin
Tian, Kevin Feb. 15, 2022, 10:58 a.m. UTC | #5
> From: Tian, Kevin
> Sent: Tuesday, February 15, 2022 6:42 PM
> 
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, February 9, 2022 10:37 AM
> >
> > > >  /* -------- API for Type1 VFIO IOMMU -------- */
> > > >
> > > >  /**
> > >
> > > Otherwise, I'm still not sure how userspace handles the fact that it
> > > can't know how much data will be read from the device and how
> important
> > > that is.  There's no replacement of that feature from the v1 protocol
> > > here.
> >
> > I'm not sure this was part of the v1 protocol either. Yes it had a
> > pending_bytes, but I don't think it was actually expected to be 100%
> > accurate. Computing this value accurately is potentially quite
> > expensive, I would prefer we not enforce this on an implementation
> > without a reason, and qemu currently doesn't make use of it.
> >
> > The ioctl from the precopy patch is probably the best approach, I
> > think it would be fine to allow that for stop copy as well, but also
> > don't see a usage right now.
> >
> > It is not something that needs decision now, it is very easy to detect
> > if an ioctl is supported on the data_fd at runtime to add new things
> > here when needed.
> >
> 
> Another interesting thing (not an immediate concern on this series)
> is how to handle devices which may have long time (e.g. due to
> draining outstanding requests, even w/o vPRI) to enter the STOP
> state. that time is not as deterministic as pending bytes thus cannot
> be reported back to the user before the operation is actually done.
> 
> Similarly to what we discussed for vPRI an eventfd will be beneficial
> so the user can timeout-wait on it, but it also needs an arc to create
> the eventfd between RUNNING->STOP...
> 

type too fast. it doesn’t need a new arc. Just a new capability to say
that STOP returns an event fd for the user to wait for completion,
when supporting such devices is required. 
Jason Gunthorpe Feb. 15, 2022, 1:13 p.m. UTC | #6
On Tue, Feb 15, 2022 at 10:58:58AM +0000, Tian, Kevin wrote:

> > Another interesting thing (not an immediate concern on this series)
> > is how to handle devices which may have long time (e.g. due to
> > draining outstanding requests, even w/o vPRI) to enter the STOP
> > state. that time is not as deterministic as pending bytes thus cannot
> > be reported back to the user before the operation is actually done.
> > 
> > Similarly to what we discussed for vPRI an eventfd will be beneficial
> > so the user can timeout-wait on it, but it also needs an arc to create
> > the eventfd between RUNNING->STOP...
> > 
> 
> type too fast. it doesn’t need a new arc. Just a new capability to say
> that STOP returns an event fd for the user to wait for completion,
> when supporting such devices is required. 
Jason Gunthorpe Feb. 15, 2022, 3:33 p.m. UTC | #7
On Tue, Feb 15, 2022 at 08:04:27AM +0000, Tian, Kevin wrote:
> > From: Yishai Hadas <yishaih@nvidia.com>
> > Sent: Tuesday, February 8, 2022 1:22 AM
> > 
> > +static int vfio_ioctl_device_feature_migration(struct vfio_device *device,
> > +					       u32 flags, void __user *arg,
> > +					       size_t argsz)
> > +{
> > +	struct vfio_device_feature_migration mig = {
> > +		.flags = VFIO_MIGRATION_STOP_COPY,
> > +	};
> > +	int ret;
> > +
> > +	if (!device->ops->migration_set_state)
> > +		return -ENOTTY;
> 
> Miss a check on migration_get_state, as done in last function.

Yep

> > + * @migration_set_state: Optional callback to change the migration state for
> > + *         devices that support migration. The returned FD is used for data
> > + *         transfer according to the FSM definition. The driver is responsible
> > + *         to ensure that FD is isolated whenever the migration FSM leaves a
> > + *         data transfer state or before close_device() returns.
> 
> didn't understand the meaning of 'isolated' here.

It is not a good word. Lets say 'that FD reaches end of stream or
error whenever'
 
> > +#define VFIO_DEVICE_STATE_V1_STOP      (0)
> > +#define VFIO_DEVICE_STATE_V1_RUNNING   (1 << 0)
> > +#define VFIO_DEVICE_STATE_V1_SAVING    (1 << 1)
> > +#define VFIO_DEVICE_STATE_V1_RESUMING  (1 << 2)
> > +#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_V1_RUNNING
> > | \
> > +				     VFIO_DEVICE_STATE_V1_SAVING |  \
> > +				     VFIO_DEVICE_STATE_V1_RESUMING)
> 
> Does it make sense to also add 'V1' to MASK and also following macros
> given their names are general?

No, the point of this exercise is to avoid trouble for qemu - the
fewest changes we can get away with the better.

Once qemu is updated we'll delete this old stuff from the kernel.

> > +/*
> > + * Indicates the device can support the migration API. See enum
> 
> call it V2? Not necessary to add V2 in code but worthy of a clarification
> in comment.

We've only called it 'v2' for discussions.

If you think it is unclear lets say 'support the migration API through
VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE'

> 
> > + * vfio_device_mig_state for details. If present flags must be non-zero and
> > + * VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE is supported.
> > + *
> > + * VFIO_MIGRATION_STOP_COPY means that RUNNING, STOP, STOP_COPY
> > and
> > + * RESUMING are supported.
> > + */
> 
> Not aligned with other places where 5 states are mentioned. Better add
> ERROR here.

ERROR is not a state that is 'supported'. It could be clarified that
ERROR and RUNNING are always supported.

> 
> > + *
> > + * RUNNING -> STOP
> > + * STOP_COPY -> STOP
> > + *   While in STOP the device must stop the operation of the device. The
> > + *   device must not generate interrupts, DMA, or advance its internal
> > + *   state. When stopped the device and kernel migration driver must accept
> > + *   and respond to interaction to support external subsystems in the STOP
> > + *   state, for example PCI MSI-X and PCI config pace. Failure by the user to
> > + *   restrict device access while in STOP must not result in error conditions
> > + *   outside the user context (ex. host system faults).
> 
> Right above the STOP state is defined as:
> 
>        *  STOP - The device does not change the internal or external state
> 
> 'external state' I assume means P2P activities. For consistency it is clearer
> to also say something about external state in above paragraph.

No, STOP is defined to halt all DMA. I tidied it a bit like this:

 *   While in STOP the device must stop the operation of the device. The device
 *   must not generate interrupts, DMA, or any other change to external state.
 *   It must not change its internal state.


> > + *
> > + *   The STOP_COPY arc will terminate a data transfer session.
>
> remove 'will'

will is correct grammar. It could be 'arc terminates'

> > + *
> > + * STOP -> STOP_COPY
> > + *   This arc begin the process of saving the device state and will return a
> > + *   new data_fd.
> > + *
> > + *   While in the STOP_COPY state the device has the same behavior as STOP
> > + *   with the addition that the data transfers session continues to stream the
> > + *   migration state. End of stream on the FD indicates the entire device
> > + *   state has been transferred.
> > + *
> > + *   The user should take steps to restrict access to vfio device regions while
> > + *   the device is in STOP_COPY or risk corruption of the device migration
> > data
> > + *   stream.
> 
> Restricting access has been explained in the to-STOP arcs and it is stated 
> that while in STOP_COPY the device has the same behavior as STOP. So 
> I think the last paragraph is possibly not required.

It is not the same, the language in STOP is saying that the device
must tolerate external touches without breaking the kernel

This language is saying if external touches happen then the device is
free to corrupt the migration stream.

In both cases we expect good userspace to not have device
touches, the guidance here is for driver authors about what kind of
steps they need to take to protect against hostile userspace.

> > + * STOP -> RESUMING
> > + *   Entering the RESUMING state starts a process of restoring the device
> > + *   state and will return a new data_fd. The data stream fed into the
> > data_fd
> > + *   should be taken from the data transfer output of the saving group states
> 
> No definition of 'group state' (maybe introduced in a later patch?)

Yes, it was added in the P2P patch

We can avoid talking about saving group here entirely, it really just
means a single FD.

 *    The data stream fed into the data_fd should
 *   be taken from the data transfer output of a single FD during saving on a
 *   from a compatible device.

Thanks,
Jason
Jason Gunthorpe Feb. 15, 2022, 4:04 p.m. UTC | #8
On Tue, Feb 15, 2022 at 10:41:56AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, February 9, 2022 10:37 AM
> > 
> > > >  /* -------- API for Type1 VFIO IOMMU -------- */
> > > >
> > > >  /**
> > >
> > > Otherwise, I'm still not sure how userspace handles the fact that it
> > > can't know how much data will be read from the device and how important
> > > that is.  There's no replacement of that feature from the v1 protocol
> > > here.
> > 
> > I'm not sure this was part of the v1 protocol either. Yes it had a
> > pending_bytes, but I don't think it was actually expected to be 100%
> > accurate. Computing this value accurately is potentially quite
> > expensive, I would prefer we not enforce this on an implementation
> > without a reason, and qemu currently doesn't make use of it.
> > 
> > The ioctl from the precopy patch is probably the best approach, I
> > think it would be fine to allow that for stop copy as well, but also
> > don't see a usage right now.
> > 
> > It is not something that needs decision now, it is very easy to detect
> > if an ioctl is supported on the data_fd at runtime to add new things
> > here when needed.
> > 
> 
> Another interesting thing (not an immediate concern on this series)
> is how to handle devices which may have long time (e.g. due to 
> draining outstanding requests, even w/o vPRI) to enter the STOP 
> state. that time is not as deterministic as pending bytes thus cannot
> be reported back to the user before the operation is actually done.

Well, it is not deterministic at all..

I suppose you have to do as Alex says and try to estimate how much
time the stop phase of migration will take and grant only the
remaining time from the SLA to the guest to finish its PRI flushing,
otherwise go back to PRE_COPY and try again later if the timer hits.

This suggests to me the right interface from the driver is some
estimate of time to enter STOP_COPY and resulting required transfer
size.

Still, I just don't see how SLAs can really be feasible with this kind
of HW that requires guest co-operation..

Jason
Alex Williamson Feb. 15, 2022, 11:32 p.m. UTC | #9
On Tue, 15 Feb 2022 12:04:19 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Feb 15, 2022 at 10:41:56AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Wednesday, February 9, 2022 10:37 AM
> > >   
> > > > >  /* -------- API for Type1 VFIO IOMMU -------- */
> > > > >
> > > > >  /**  
> > > >
> > > > Otherwise, I'm still not sure how userspace handles the fact that it
> > > > can't know how much data will be read from the device and how important
> > > > that is.  There's no replacement of that feature from the v1 protocol
> > > > here.  
> > > 
> > > I'm not sure this was part of the v1 protocol either. Yes it had a
> > > pending_bytes, but I don't think it was actually expected to be 100%
> > > accurate. Computing this value accurately is potentially quite
> > > expensive, I would prefer we not enforce this on an implementation
> > > without a reason, and qemu currently doesn't make use of it.
> > > 
> > > The ioctl from the precopy patch is probably the best approach, I
> > > think it would be fine to allow that for stop copy as well, but also
> > > don't see a usage right now.
> > > 
> > > It is not something that needs decision now, it is very easy to detect
> > > if an ioctl is supported on the data_fd at runtime to add new things
> > > here when needed.
> > >   
> > 
> > Another interesting thing (not an immediate concern on this series)
> > is how to handle devices which may have long time (e.g. due to 
> > draining outstanding requests, even w/o vPRI) to enter the STOP 
> > state. that time is not as deterministic as pending bytes thus cannot
> > be reported back to the user before the operation is actually done.  
> 
> Well, it is not deterministic at all..
> 
> I suppose you have to do as Alex says and try to estimate how much
> time the stop phase of migration will take and grant only the
> remaining time from the SLA to the guest to finish its PRI flushing,
> otherwise go back to PRE_COPY and try again later if the timer hits.
> 
> This suggests to me the right interface from the driver is some
> estimate of time to enter STOP_COPY and resulting required transfer
> size.
> 
> Still, I just don't see how SLAs can really be feasible with this kind
> of HW that requires guest co-operation..

Devil's advocate, does this discussion raise any concerns whether a
synchronous vs asynchronous arc transition ioctl is still the right
solution here?  I can imagine for instance that posting a state change
and being able to poll for pending transactions or completion of the
saved state generation and ultimate size could be very useful for
managing migration SLAs, not to mention trivial userspace support to
parallel'ize state changes.

Reporting a maximum device state size hint also seems relatively
trivial since this should just be the sum of on-device memory, asics,
and processors.  The mlx5 driver already places an upper bound on
migration data size internally.

Maybe some of these can come as DEVICE_FEATURES as we go, but for any
sort of cloud vendor SLA, I'm afraid we're only enabling migration of
devices with negligible transition latencies and negligible device
states, with some hand waving how to determine that either of those are
the case without device specific knowledge in the orchestration.
Thanks,

Alex
Jason Gunthorpe Feb. 16, 2022, 1:17 a.m. UTC | #10
On Tue, Feb 15, 2022 at 04:32:31PM -0700, Alex Williamson wrote:

> > I suppose you have to do as Alex says and try to estimate how much
> > time the stop phase of migration will take and grant only the
> > remaining time from the SLA to the guest to finish its PRI flushing,
> > otherwise go back to PRE_COPY and try again later if the timer hits.
> > 
> > This suggests to me the right interface from the driver is some
> > estimate of time to enter STOP_COPY and resulting required transfer
> > size.
> > 
> > Still, I just don't see how SLAs can really be feasible with this kind
> > of HW that requires guest co-operation..
> 
> Devil's advocate, does this discussion raise any concerns whether a
> synchronous vs asynchronous arc transition ioctl is still the right
> solution here?  

v2 switched to the data_fd which allows almost everything important to
be async, assuming someone wants to implement it in qemu and a driver.

It allows RUNNING -> STOP_COPY to be made async because the driver can
return SET_STATE immediately, backround the state save and indicate
completion/progress/error via poll(readable) on the data_fd. However
the device does still have to suspend DMA synchronously.

RESUMING -> STOP can also be async. The driver will make the data_fd
not writable before the last byte using its internal knowledge of the
data framing. Once the driver allows the last byte to be delivered
qemu will immediately do SET_STATE which will be low latency.

The entire data transfer flow itself is now async event driven and can
be run in parallel across devices with an epoll or iouring type
scheme.

STOP->RUNNING should be low latency for any reasonable device design.

For the P2P extension the RUNNING -> RUNNING_P2P has stopped vCPUs,
but I think a reasonable implementation must make this low latency,
just like suspending DMA to get to STOP_COPY must be low latency.
Making it async won't make it faster, though I would like to see it
run in parallel for all P2P devices.

The other arcs have the vCPU running, so don't matter to this.

In essence, compared to v1, we already made it almost fully async.

Also, at least with the mlx5 design, we can run all the commands async
(though there is a blocker preventing this right now) however we
cannot abort commands in progress. So as far as a SLA is concerned I
don't think async necessarily helps much.

I also think acc and several other drivers we are looking at would not
implement, or gain any advantage from async arcs.

Are there more arcs that benefit from async? PRI draining has come
up.

Keep in mind, qemu can still userspace thread SET_STATE. There has
also been talk about a generic iouring based kernel threaded
ioctl: https://lwn.net/Articles/844875/

What I suggested to Kevin is also something to look at, userspace
provides an event FD to SET_STATE and the event FD is triggered when
the background action is done.

So, I'm not worried about this. There are more than enough options to
address any async requirements down the road.

> and processors.  The mlx5 driver already places an upper bound on
> migration data size internally.

We did that because it seemed unreasonable to allow userspace to
allocate unlimited kernel memory during resuming. Ideally we'd limit
it to the device's max capability but the device doesn't know how to
do that today.

> Maybe some of these can come as DEVICE_FEATURES as we go, but for any
> sort of cloud vendor SLA, I'm afraid we're only enabling migration of
> devices with negligible transition latencies and negligible device
> states

Even if this is true, it is not a failure! Most of the migration
drivers we foresee are of this class.

My feeling is that more complex devices would benefit from some stuff,
eg like estimating times, but I'd rather collect actual field data and
understand where things lie, and what device changes are needed,
before we design something.

> with some hand waving how to determine that either of those are
> the case without device specific knowledge in the orchestration.

I don't think the orchestration necessarily needs special
knowledge. Certainly when the cloud operator designs the VMs and sets
the SLA parameters they need to do it with understanding of what the
mix of devices are and what kind of migration performance they get out
of the entire system.

More than anything system migration performance is going to be
impacted by the network for devices like mlx5 that have a non-trivial
STOP_COPY data blob.

Basically, I think it is worth thinking about, but not worth acting on
right now.

Jason
Tian, Kevin Feb. 16, 2022, 3:04 a.m. UTC | #11
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, February 15, 2022 11:34 PM
> 
> > > +#define VFIO_DEVICE_STATE_V1_STOP      (0)
> > > +#define VFIO_DEVICE_STATE_V1_RUNNING   (1 << 0)
> > > +#define VFIO_DEVICE_STATE_V1_SAVING    (1 << 1)
> > > +#define VFIO_DEVICE_STATE_V1_RESUMING  (1 << 2)
> > > +#define VFIO_DEVICE_STATE_MASK
> (VFIO_DEVICE_STATE_V1_RUNNING
> > > | \
> > > +				     VFIO_DEVICE_STATE_V1_SAVING |  \
> > > +				     VFIO_DEVICE_STATE_V1_RESUMING)
> >
> > Does it make sense to also add 'V1' to MASK and also following macros
> > given their names are general?
> 
> No, the point of this exercise is to avoid trouble for qemu - the
> fewest changes we can get away with the better.
> 
> Once qemu is updated we'll delete this old stuff from the kernel.

sounds good.

> 
> > > +/*
> > > + * Indicates the device can support the migration API. See enum
> >
> > call it V2? Not necessary to add V2 in code but worthy of a clarification
> > in comment.
> 
> We've only called it 'v2' for discussions.
> 
> If you think it is unclear lets say 'support the migration API through
> VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE'

yes, that's clearer.

> > > + *
> > > + * STOP -> STOP_COPY
> > > + *   This arc begin the process of saving the device state and will return a
> > > + *   new data_fd.
> > > + *
> > > + *   While in the STOP_COPY state the device has the same behavior as
> STOP
> > > + *   with the addition that the data transfers session continues to stream
> the
> > > + *   migration state. End of stream on the FD indicates the entire device
> > > + *   state has been transferred.
> > > + *
> > > + *   The user should take steps to restrict access to vfio device regions
> while
> > > + *   the device is in STOP_COPY or risk corruption of the device migration
> > > data
> > > + *   stream.
> >
> > Restricting access has been explained in the to-STOP arcs and it is stated
> > that while in STOP_COPY the device has the same behavior as STOP. So
> > I think the last paragraph is possibly not required.
> 
> It is not the same, the language in STOP is saying that the device
> must tolerate external touches without breaking the kernel
> 
> This language is saying if external touches happen then the device is
> free to corrupt the migration stream.
> 
> In both cases we expect good userspace to not have device
> touches, the guidance here is for driver authors about what kind of
> steps they need to take to protect against hostile userspace.

fair enough.

> 
> > > + * STOP -> RESUMING
> > > + *   Entering the RESUMING state starts a process of restoring the device
> > > + *   state and will return a new data_fd. The data stream fed into the
> > > data_fd
> > > + *   should be taken from the data transfer output of the saving group
> states
> >
> > No definition of 'group state' (maybe introduced in a later patch?)
> 
> Yes, it was added in the P2P patch
> 
> We can avoid talking about saving group here entirely, it really just
> means a single FD.
> 
>  *    The data stream fed into the data_fd should
>  *   be taken from the data transfer output of a single FD during saving on a
>  *   from a compatible device.
> 

Yes.

Thanks
Kevin
Tian, Kevin Feb. 16, 2022, 3:17 a.m. UTC | #12
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, February 16, 2022 12:04 AM
> 
> On Tue, Feb 15, 2022 at 10:41:56AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Wednesday, February 9, 2022 10:37 AM
> > >
> > > > >  /* -------- API for Type1 VFIO IOMMU -------- */
> > > > >
> > > > >  /**
> > > >
> > > > Otherwise, I'm still not sure how userspace handles the fact that it
> > > > can't know how much data will be read from the device and how
> important
> > > > that is.  There's no replacement of that feature from the v1 protocol
> > > > here.
> > >
> > > I'm not sure this was part of the v1 protocol either. Yes it had a
> > > pending_bytes, but I don't think it was actually expected to be 100%
> > > accurate. Computing this value accurately is potentially quite
> > > expensive, I would prefer we not enforce this on an implementation
> > > without a reason, and qemu currently doesn't make use of it.
> > >
> > > The ioctl from the precopy patch is probably the best approach, I
> > > think it would be fine to allow that for stop copy as well, but also
> > > don't see a usage right now.
> > >
> > > It is not something that needs decision now, it is very easy to detect
> > > if an ioctl is supported on the data_fd at runtime to add new things
> > > here when needed.
> > >
> >
> > Another interesting thing (not an immediate concern on this series)
> > is how to handle devices which may have long time (e.g. due to
> > draining outstanding requests, even w/o vPRI) to enter the STOP
> > state. that time is not as deterministic as pending bytes thus cannot
> > be reported back to the user before the operation is actually done.
> 
> Well, it is not deterministic at all..
> 
> I suppose you have to do as Alex says and try to estimate how much
> time the stop phase of migration will take and grant only the
> remaining time from the SLA to the guest to finish its PRI flushing,

Let's separate it from PRI stuff thus no guest operation.

It's a simple story that vCPUs have been stopped and Qemu requests
state transition from RUNNING to STOP on a device which needs
migration driver to drain outstanding requests before being stopped.

those requests don't rely on vCPUs but still take time to complete
(thus may break SLA) and are invisible to migration driver (directly
submitted by the guest thus cannot be estimated). So the only means 
is for user to wait on a fd with a timeout (based on whatever SLA) and
if expires then aborts migration (may retry later).

I'm not sure whether we want to leverage the new arc for vPRI or
just allow changing the STOP behavior to return a eventfd for an 
async transition.

Thanks
Kevin
Jason Gunthorpe Feb. 16, 2022, 12:14 p.m. UTC | #13
On Wed, Feb 16, 2022 at 03:17:36AM +0000, Tian, Kevin wrote:

> those requests don't rely on vCPUs but still take time to complete
> (thus may break SLA) and are invisible to migration driver (directly
> submitted by the guest thus cannot be estimated). So the only means 
> is for user to wait on a fd with a timeout (based on whatever SLA) and
> if expires then aborts migration (may retry later).

I think I explained in my other email how this can be implemented
today with v2 for STOP_COPY without an event fd.

Such a device might even be able to implement an abort..

Jason
Tian, Kevin Feb. 17, 2022, 2:29 a.m. UTC | #14
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, February 16, 2022 8:14 PM
> 
> On Wed, Feb 16, 2022 at 03:17:36AM +0000, Tian, Kevin wrote:
> 
> > those requests don't rely on vCPUs but still take time to complete
> > (thus may break SLA) and are invisible to migration driver (directly
> > submitted by the guest thus cannot be estimated). So the only means
> > is for user to wait on a fd with a timeout (based on whatever SLA) and
> > if expires then aborts migration (may retry later).
> 
> I think I explained in my other email how this can be implemented
> today with v2 for STOP_COPY without an event fd.
> 

I suppose you meant this part:

"It allows RUNNING -> STOP_COPY to be made async because the driver can
return SET_STATE immediately, backround the state save and indicate
completion/progress/error via poll(readable) on the data_fd."

Yes it could work if the user directly request STOP_COPY as the end state
(with STOP as an implicit/immediate step). In that case polling on data_fd
with timeout can cover the requirement described for STOP here.

Thanks
Kevin
diff mbox series

Patch

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 71763e2ac561..e7ab9f2048cd 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1557,6 +1557,196 @@  static int vfio_device_fops_release(struct inode *inode, struct file *filep)
 	return 0;
 }
 
+/*
+ * vfio_mig_get_next_state - Compute the next step in the FSM
+ * @cur_fsm - The current state the device is in
+ * @new_fsm - The target state to reach
+ * @next_fsm - Pointer to the next step to get to new_fsm
+ *
+ * Return 0 upon success, otherwise -errno
+ * Upon success the next step in the state progression between cur_fsm and
+ * new_fsm will be set in next_fsm.
+ *
+ * This breaks down requests for combination transitions into smaller steps and
+ * returns the next step to get to new_fsm. The function may need to be called
+ * multiple times before reaching new_fsm.
+ *
+ */
+int vfio_mig_get_next_state(struct vfio_device *device,
+			    enum vfio_device_mig_state cur_fsm,
+			    enum vfio_device_mig_state new_fsm,
+			    enum vfio_device_mig_state *next_fsm)
+{
+	enum { VFIO_DEVICE_NUM_STATES = VFIO_DEVICE_STATE_RESUMING + 1 };
+	/*
+	 * The coding in this table requires the driver to implement 6
+	 * FSM arcs:
+	 *         RESUMING -> STOP
+	 *         RUNNING -> STOP
+	 *         STOP -> RESUMING
+	 *         STOP -> RUNNING
+	 *         STOP -> STOP_COPY
+	 *         STOP_COPY -> STOP
+	 *
+	 * The coding will step through multiple states for these combination
+	 * transitions:
+	 *         RESUMING -> STOP -> RUNNING
+	 *         RESUMING -> STOP -> STOP_COPY
+	 *         RUNNING -> STOP -> RESUMING
+	 *         RUNNING -> STOP -> STOP_COPY
+	 *         STOP_COPY -> STOP -> RESUMING
+	 *         STOP_COPY -> STOP -> RUNNING
+	 */
+	static const u8 vfio_from_fsm_table[VFIO_DEVICE_NUM_STATES][VFIO_DEVICE_NUM_STATES] = {
+		[VFIO_DEVICE_STATE_STOP] = {
+			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY,
+			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RESUMING,
+			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
+		},
+		[VFIO_DEVICE_STATE_RUNNING] = {
+			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
+		},
+		[VFIO_DEVICE_STATE_STOP_COPY] = {
+			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY,
+			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
+		},
+		[VFIO_DEVICE_STATE_RESUMING] = {
+			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RESUMING,
+			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
+		},
+		[VFIO_DEVICE_STATE_ERROR] = {
+			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_ERROR,
+			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_ERROR,
+			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_ERROR,
+			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_ERROR,
+			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
+		},
+	};
+
+	if (WARN_ON(cur_fsm >= ARRAY_SIZE(vfio_from_fsm_table)))
+		return -EINVAL;
+
+	if (new_fsm >= ARRAY_SIZE(vfio_from_fsm_table))
+		return -EINVAL;
+
+	*next_fsm = vfio_from_fsm_table[cur_fsm][new_fsm];
+	return (*next_fsm != VFIO_DEVICE_STATE_ERROR) ? 0 : -EINVAL;
+}
+EXPORT_SYMBOL_GPL(vfio_mig_get_next_state);
+
+/*
+ * Convert the drivers's struct file into a FD number and return it to userspace
+ */
+static int vfio_ioct_mig_return_fd(struct file *filp, void __user *arg,
+				   struct vfio_device_feature_mig_state *mig)
+{
+	int ret;
+	int fd;
+
+	fd = get_unused_fd_flags(O_CLOEXEC);
+	if (fd < 0) {
+		ret = fd;
+		goto out_fput;
+	}
+
+	mig->data_fd = fd;
+	if (copy_to_user(arg, mig, sizeof(*mig))) {
+		ret = -EFAULT;
+		goto out_put_unused;
+	}
+	fd_install(fd, filp);
+	return 0;
+
+out_put_unused:
+	put_unused_fd(fd);
+out_fput:
+	fput(filp);
+	return ret;
+}
+
+static int
+vfio_ioctl_device_feature_mig_device_state(struct vfio_device *device,
+					   u32 flags, void __user *arg,
+					   size_t argsz)
+{
+	size_t minsz =
+		offsetofend(struct vfio_device_feature_mig_state, data_fd);
+	struct vfio_device_feature_mig_state mig;
+	struct file *filp = NULL;
+	int ret;
+
+	if (!device->ops->migration_set_state ||
+	    !device->ops->migration_get_state)
+		return -ENOTTY;
+
+	ret = vfio_check_feature(flags, argsz,
+				 VFIO_DEVICE_FEATURE_SET |
+				 VFIO_DEVICE_FEATURE_GET,
+				 sizeof(mig));
+	if (ret != 1)
+		return ret;
+
+	if (copy_from_user(&mig, arg, minsz))
+		return -EFAULT;
+
+	if (flags & VFIO_DEVICE_FEATURE_GET) {
+		enum vfio_device_mig_state curr_state;
+
+		ret = device->ops->migration_get_state(device, &curr_state);
+		if (ret)
+			return ret;
+		mig.device_state = curr_state;
+		goto out_copy;
+	}
+
+	/* Handle the VFIO_DEVICE_FEATURE_SET */
+	filp = device->ops->migration_set_state(device, mig.device_state);
+	if (IS_ERR(filp) || !filp)
+		goto out_copy;
+
+	return vfio_ioct_mig_return_fd(filp, arg, &mig);
+out_copy:
+	mig.data_fd = -1;
+	if (copy_to_user(arg, &mig, sizeof(mig)))
+		return -EFAULT;
+	if (IS_ERR(filp))
+		return PTR_ERR(filp);
+	return 0;
+}
+
+static int vfio_ioctl_device_feature_migration(struct vfio_device *device,
+					       u32 flags, void __user *arg,
+					       size_t argsz)
+{
+	struct vfio_device_feature_migration mig = {
+		.flags = VFIO_MIGRATION_STOP_COPY,
+	};
+	int ret;
+
+	if (!device->ops->migration_set_state)
+		return -ENOTTY;
+
+	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET,
+				 sizeof(mig));
+	if (ret != 1)
+		return ret;
+	if (copy_to_user(arg, &mig, sizeof(mig)))
+		return -EFAULT;
+	return 0;
+}
+
 static int vfio_ioctl_device_feature(struct vfio_device *device,
 				     struct vfio_device_feature __user *arg)
 {
@@ -1582,6 +1772,14 @@  static int vfio_ioctl_device_feature(struct vfio_device *device,
 		return -EINVAL;
 
 	switch (feature.flags & VFIO_DEVICE_FEATURE_MASK) {
+	case VFIO_DEVICE_FEATURE_MIGRATION:
+		return vfio_ioctl_device_feature_migration(
+			device, feature.flags, arg->data,
+			feature.argsz - minsz);
+	case VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE:
+		return vfio_ioctl_device_feature_mig_device_state(
+			device, feature.flags, arg->data,
+			feature.argsz - minsz);
 	default:
 		if (unlikely(!device->ops->device_feature))
 			return -EINVAL;
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index ca69516f869d..3f4a1a7c2277 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -56,6 +56,13 @@  struct vfio_device {
  *         match, -errno for abort (ex. match with insufficient or incorrect
  *         additional args)
  * @device_feature: Fill in the VFIO_DEVICE_FEATURE ioctl
+ * @migration_set_state: Optional callback to change the migration state for
+ *         devices that support migration. The returned FD is used for data
+ *         transfer according to the FSM definition. The driver is responsible
+ *         to ensure that FD is isolated whenever the migration FSM leaves a
+ *         data transfer state or before close_device() returns.
+ @migration_get_state: Optional callback to get the migration state for
+ *         devices that support migration.
  */
 struct vfio_device_ops {
 	char	*name;
@@ -72,6 +79,11 @@  struct vfio_device_ops {
 	int	(*match)(struct vfio_device *vdev, char *buf);
 	int	(*device_feature)(struct vfio_device *device, u32 flags,
 				  void __user *arg, size_t argsz);
+	struct file *(*migration_set_state)(
+		struct vfio_device *device,
+		enum vfio_device_mig_state new_state);
+	int (*migration_get_state)(struct vfio_device *device,
+				   enum vfio_device_mig_state *curr_state);
 };
 
 /**
@@ -114,6 +126,11 @@  extern void vfio_device_put(struct vfio_device *device);
 
 int vfio_assign_device_set(struct vfio_device *device, void *set_id);
 
+int vfio_mig_get_next_state(struct vfio_device *device,
+			    enum vfio_device_mig_state cur_fsm,
+			    enum vfio_device_mig_state new_fsm,
+			    enum vfio_device_mig_state *next_fsm);
+
 /*
  * External user API
  */
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index ef33ea002b0b..89012bc01663 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -605,25 +605,25 @@  struct vfio_region_gfx_edid {
 
 struct vfio_device_migration_info {
 	__u32 device_state;         /* VFIO device state */
-#define VFIO_DEVICE_STATE_STOP      (0)
-#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
-#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
-#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
-#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
-				     VFIO_DEVICE_STATE_SAVING |  \
-				     VFIO_DEVICE_STATE_RESUMING)
+#define VFIO_DEVICE_STATE_V1_STOP      (0)
+#define VFIO_DEVICE_STATE_V1_RUNNING   (1 << 0)
+#define VFIO_DEVICE_STATE_V1_SAVING    (1 << 1)
+#define VFIO_DEVICE_STATE_V1_RESUMING  (1 << 2)
+#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_V1_RUNNING | \
+				     VFIO_DEVICE_STATE_V1_SAVING |  \
+				     VFIO_DEVICE_STATE_V1_RESUMING)
 
 #define VFIO_DEVICE_STATE_VALID(state) \
-	(state & VFIO_DEVICE_STATE_RESUMING ? \
-	(state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_RESUMING : 1)
+	(state & VFIO_DEVICE_STATE_V1_RESUMING ? \
+	(state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_V1_RESUMING : 1)
 
 #define VFIO_DEVICE_STATE_IS_ERROR(state) \
-	((state & VFIO_DEVICE_STATE_MASK) == (VFIO_DEVICE_STATE_SAVING | \
-					      VFIO_DEVICE_STATE_RESUMING))
+	((state & VFIO_DEVICE_STATE_MASK) == (VFIO_DEVICE_STATE_V1_SAVING | \
+					      VFIO_DEVICE_STATE_V1_RESUMING))
 
 #define VFIO_DEVICE_STATE_SET_ERROR(state) \
-	((state & ~VFIO_DEVICE_STATE_MASK) | VFIO_DEVICE_SATE_SAVING | \
-					     VFIO_DEVICE_STATE_RESUMING)
+	((state & ~VFIO_DEVICE_STATE_MASK) | VFIO_DEVICE_STATE_V1_SAVING | \
+					     VFIO_DEVICE_STATE_V1_RESUMING)
 
 	__u32 reserved;
 	__u64 pending_bytes;
@@ -1002,6 +1002,152 @@  struct vfio_device_feature {
  */
 #define VFIO_DEVICE_FEATURE_PCI_VF_TOKEN	(0)
 
+/*
+ * Indicates the device can support the migration API. See enum
+ * vfio_device_mig_state for details. If present flags must be non-zero and
+ * VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE is supported.
+ *
+ * VFIO_MIGRATION_STOP_COPY means that RUNNING, STOP, STOP_COPY and
+ * RESUMING are supported.
+ */
+struct vfio_device_feature_migration {
+	__aligned_u64 flags;
+#define VFIO_MIGRATION_STOP_COPY	(1 << 0)
+};
+#define VFIO_DEVICE_FEATURE_MIGRATION 1
+
+/*
+ * Upon VFIO_DEVICE_FEATURE_SET,
+ * Execute a migration state change on the VFIO device.
+ * The new state is supplied in device_state.
+ *
+ * The kernel migration driver must fully transition the device to the new state
+ * value before the write(2) operation returns to the user.
+ *
+ * The kernel migration driver must not generate asynchronous device state
+ * transitions outside of manipulation by the user or the VFIO_DEVICE_RESET
+ * ioctl as described above.
+ *
+ * If this function fails then current device_state may be the original
+ * operating state or some other state along the combination transition path.
+ * The user can then decide if it should execute a VFIO_DEVICE_RESET, attempt
+ * to return to the original state, or attempt to return to some other state
+ * such as RUNNING or STOP.
+ *
+ * If the new_state starts a new data transfer session then the FD associated
+ * with that session is returned in data_fd. The user is responsible to close
+ * this FD when it is finished. The user must consider the migration data
+ * segments carried over the FD to be opaque and non-fungible. During RESUMING,
+ * the data segments must be written in the same order they came out of the
+ * saving side FD.
+ *
+ * Upon VFIO_DEVICE_FEATURE_GET,
+ * Get the current migration state of the VFIO device, data_fd will be -1.
+ */
+struct vfio_device_feature_mig_state {
+	__u32 device_state; /* From enum vfio_device_mig_state */
+	__s32 data_fd;
+};
+#define VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE 2
+
+/*
+ * The device migration Finite State Machine is described by the enum
+ * vfio_device_mig_state. Some of the FSM arcs will create a migration data
+ * transfer session by returning a FD, in this case the migration data will
+ * flow over the FD using read() and write() as discussed below.
+ *
+ * There are 5 states to support VFIO_MIGRATION_STOP_COPY:
+ *  RUNNING - The device is running normally
+ *  STOP - The device does not change the internal or external state
+ *  STOP_COPY - The device internal state can be read out
+ *  RESUMING - The device is stopped and is loading a new internal state
+ *  ERROR - The device has failed and must be reset
+ *
+ * The FSM takes actions on the arcs between FSM states. The driver implements
+ * the following behavior for the FSM arcs:
+ *
+ * RUNNING -> STOP
+ * STOP_COPY -> STOP
+ *   While in STOP the device must stop the operation of the device. The
+ *   device must not generate interrupts, DMA, or advance its internal
+ *   state. When stopped the device and kernel migration driver must accept
+ *   and respond to interaction to support external subsystems in the STOP
+ *   state, for example PCI MSI-X and PCI config pace. Failure by the user to
+ *   restrict device access while in STOP must not result in error conditions
+ *   outside the user context (ex. host system faults).
+ *
+ *   The STOP_COPY arc will terminate a data transfer session.
+ *
+ * RESUMING -> STOP
+ *   Leaving RESUMING terminates a data transfer session and indicates the
+ *   device should complete processing of the data delivered by write(). The
+ *   kernel migration driver should complete the incorporation of data written
+ *   to the data transfer FD into the device internal state and perform
+ *   final validity and consistency checking of the new device state. If the
+ *   user provided data is found to be incomplete, inconsistent, or otherwise
+ *   invalid, the migration driver must fail the SET_STATE ioctl and
+ *   optionally go to the ERROR state as described below.
+ *
+ *   While in STOP the device has the same behavior as other STOP states
+ *   described above.
+ *
+ *   To abort a RESUMING session the device must be reset.
+ *
+ * STOP -> RUNNING
+ *   While in RUNNING the device is fully operational, the device may generate
+ *   interrupts, DMA, respond to MMIO, all vfio device regions are functional,
+ *   and the device may advance its internal state.
+ *
+ * STOP -> STOP_COPY
+ *   This arc begin the process of saving the device state and will return a
+ *   new data_fd.
+ *
+ *   While in the STOP_COPY state the device has the same behavior as STOP
+ *   with the addition that the data transfers session continues to stream the
+ *   migration state. End of stream on the FD indicates the entire device
+ *   state has been transferred.
+ *
+ *   The user should take steps to restrict access to vfio device regions while
+ *   the device is in STOP_COPY or risk corruption of the device migration data
+ *   stream.
+ *
+ * STOP -> RESUMING
+ *   Entering the RESUMING state starts a process of restoring the device
+ *   state and will return a new data_fd. The data stream fed into the data_fd
+ *   should be taken from the data transfer output of the saving group states
+ *   from a compatible device. The migration driver may alter/reset the
+ *   internal device state for this arc if required to prepare the device to
+ *   receive the migration data.
+ *
+ * any -> ERROR
+ *   ERROR cannot be specified as a device state, however any transition request
+ *   can be failed with an errno return and may then move the device_state into
+ *   ERROR. In this case the device was unable to execute the requested arc and
+ *   was also unable to restore the device to any valid device_state.
+ *   To recover from ERROR VFIO_DEVICE_RESET must be used to return the
+ *   device_state back to RUNNING.
+ *
+ * The remaining possible transitions are interpreted as combinations of the
+ * above FSM arcs. As there are multiple paths through the FSM arcs the path
+ * should be selected based on the following rules:
+ *   - Select the shortest path.
+ * Refer to vfio_mig_get_next_state() for the result of the algorithm.
+ *
+ * The automatic transit through the FSM arcs that make up the combination
+ * transition is invisible to the user. When working with combination arcs the
+ * user may see any step along the path in the device_state if SET_STATE
+ * fails. When handling these types of errors users should anticipate future
+ * revisions of this protocol using new states and those states becoming
+ * visible in this case.
+ */
+enum vfio_device_mig_state {
+	VFIO_DEVICE_STATE_ERROR = 0,
+	VFIO_DEVICE_STATE_STOP = 1,
+	VFIO_DEVICE_STATE_RUNNING = 2,
+	VFIO_DEVICE_STATE_STOP_COPY = 3,
+	VFIO_DEVICE_STATE_RESUMING = 4,
+};
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**