diff mbox series

[RFC] vfio: Revise and update the migration uAPI description

Message ID 0-v1-a4f7cab64938+3f-vfio_mig_states_jgg@nvidia.com (mailing list archive)
State New, archived
Headers show
Series [RFC] vfio: Revise and update the migration uAPI description | expand

Commit Message

Jason Gunthorpe Jan. 14, 2022, 7:35 p.m. UTC
Clarify how the migration API works by recasting its specification from a
bunch of bits into a formal FSM. This describes the same functional
outcome, with no practical ABI change.

Compared to the clarification Alex proposed:

https://lore.kernel.org/r/163909282574.728533.7460416142511440919.stgit@omen

this has a few deliberate functional differences:

 - STOP_COPY -> PRE_COPY is made optional and to be defined in
   future. This is motivated by the realization that mlx5 and other
   devices that don't implement pre-copy cannot really achieve this
   without a lot of work. We know of no user space that issues this
   transition, so there is no practical ABI break to fail it.

 - ERROR arcs allow the device function to remain unchanged. Reviewing the
   driver design in this series makes it clear there is no advantage to
   putting more complexity in the kernel. If the device has a double
   fault, then userspace already has all the necessary tools to clean it
   up.

 - NDMA is redefined into two FSM states and is described in terms of the
   desired P2P quiescent behavior, noting that halting all DMA is an
   acceptable implementation.

First, each of the 6 existing valid values of device_state are recast into
FSM states using the names from the current comment. The two new states
for P2P are inserted in the two invalid numbers giving a packed definition
of all 8 states. As the numbers match the existing ABI there is no change.

Next, we define a set of FSM arcs which the driver will directly
implement. Each of the 15 FSM arcs is listed and Alex's clarification
wording is substantially reused to describe how they should operate.

Finally, we define the remaining set of old/new device_state transitions
as 'combination transitions' which are naturally defined as taking
multiple FSM arcs along the shortest path within the FSM's digraph. Two
rules are defined which result in an unambiguous specification.

The kernel will fully support all possible transitions automatically by
either executing the FSM arc directly, or by following the defined path
and executing each step's FSM arc. It allows all drivers to automatically
and consistently support all device_state writes in the original 5 states,
except for STOP_COPY -> PRE_COPY. This fulfills the original concept of
the ABI of allowing all combinations of device_state transitions.

In terms of code, the driver is responsible to implement all of the FSM
arcs and the core code implements the combination transitions. This
ensures that every driver implements the combination transitions in the
same way, and insulates the driver from this complexity. The shortest
paths, with the ambiguity resolved, are all precomputed and stored in a 64
byte lookup table.

Further, the complicated error unwind in the combination transitions is
solved for all drivers by having the core code attempt to return back to
the starting state through a natural sequence of FSM arcs. Should this be
impossible, or a double fault happens, the ERROR state is invoked.

The FSM approach proves extensible in the future as new additional states
and arcs can be added to the FSM. The combination transitions logic will
automatically walk through these new arcs providing backwards
compatibility, and it is easy for the core to insulate drivers that do not
support new states from ever seeing them while still using a single FSM
digraph as the behavioral specification. A separate patch, to be included
with the first driver that requires it, demonstrates how this works for
the optional P2P states.

A following patch in the series introduces a simple query IOCTL that
identifies what FSM arcs the device supports allowing extensible
discoverability.

The old uAPI definitions are removed from the header file. As per Linus's
past remarks we do not have a hard requirement to retain compilation
compatibility in uapi headers and qemu is already following Linus's
preferred model of copying the kernel headers. As the ABI functionality is
preserved this is only a compilation break.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/vfio/vfio.c       | 236 +++++++++++++++++++
 include/linux/vfio.h      |   5 +
 include/uapi/linux/vfio.h | 461 +++++++++++++++++++++-----------------
 3 files changed, 499 insertions(+), 203 deletions(-)

This is RFC because the full series is not fully tested yet, that should be
done next week. The series can be previewed here:

  https://github.com/jgunthorpe/linux/commits/mlx5_vfio_pci

The mlx5 implementation of the state change:

  https://github.com/jgunthorpe/linux/blob/0a6416da226fe8ee888aa8026f1e363698e137a8/drivers/vfio/pci/mlx5/main.c#L264

Has turned out very clean. Compare this to the v5 version, which is full of
subtle bugs:

  https://lore.kernel.org/kvm/20211027095658.144468-12-yishaih@nvidia.com/

This patch adds the VFIO_DEVICE_MIG_ARC_SUPPORTED ioctl:

  https://github.com/jgunthorpe/linux/commit/c92eff6c2afd1ecc9ed5c67a1f81c7f270f6e940

And this shows how the Huawei driver should opt out of P2P arcs:

  https://github.com/jgunthorpe/linux/commit/dd2571c481d27546a33ff4583ce8ad49847fe300

Finally, this helper script was used to audit the FSM construction, the commit
is informative to help understand the details:

  https://github.com/jgunthorpe/linux/commit/0a6416da226fe8ee888aa8026f1e363698e137a8

I think this should resolve all the disconnect. It keeps the uAPI practically
unchanged while strongly defining it in precise terms of what arcs a driver
must implement. The core code deals with the troublesome precedence and error
issue in a way that is intuitive to understand. It is now actually easy to
audit a driver for correct implementation.


base-commit: 40404a6ce71339d7bc0f0a0e185ad557bf421cec

Comments

Yishai Hadas Jan. 18, 2022, 2:04 p.m. UTC | #1
On 1/14/2022 9:35 PM, Jason Gunthorpe wrote:
> Clarify how the migration API works by recasting its specification from a
> bunch of bits into a formal FSM. This describes the same functional
> outcome, with no practical ABI change.
>
> Compared to the clarification Alex proposed:
>
> https://lore.kernel.org/r/163909282574.728533.7460416142511440919.stgit@omen
>
> this has a few deliberate functional differences:
>
>   - STOP_COPY -> PRE_COPY is made optional and to be defined in
>     future. This is motivated by the realization that mlx5 and other
>     devices that don't implement pre-copy cannot really achieve this
>     without a lot of work. We know of no user space that issues this
>     transition, so there is no practical ABI break to fail it.
>
>   - ERROR arcs allow the device function to remain unchanged. Reviewing the
>     driver design in this series makes it clear there is no advantage to
>     putting more complexity in the kernel. If the device has a double
>     fault, then userspace already has all the necessary tools to clean it
>     up.
>
>   - NDMA is redefined into two FSM states and is described in terms of the
>     desired P2P quiescent behavior, noting that halting all DMA is an
>     acceptable implementation.
>
> First, each of the 6 existing valid values of device_state are recast into
> FSM states using the names from the current comment. The two new states
> for P2P are inserted in the two invalid numbers giving a packed definition
> of all 8 states. As the numbers match the existing ABI there is no change.
>
> Next, we define a set of FSM arcs which the driver will directly
> implement. Each of the 15 FSM arcs is listed and Alex's clarification
> wording is substantially reused to describe how they should operate.
>
> Finally, we define the remaining set of old/new device_state transitions
> as 'combination transitions' which are naturally defined as taking
> multiple FSM arcs along the shortest path within the FSM's digraph. Two
> rules are defined which result in an unambiguous specification.
>
> The kernel will fully support all possible transitions automatically by
> either executing the FSM arc directly, or by following the defined path
> and executing each step's FSM arc. It allows all drivers to automatically
> and consistently support all device_state writes in the original 5 states,
> except for STOP_COPY -> PRE_COPY. This fulfills the original concept of
> the ABI of allowing all combinations of device_state transitions.
>
> In terms of code, the driver is responsible to implement all of the FSM
> arcs and the core code implements the combination transitions. This
> ensures that every driver implements the combination transitions in the
> same way, and insulates the driver from this complexity. The shortest
> paths, with the ambiguity resolved, are all precomputed and stored in a 64
> byte lookup table.
>
> Further, the complicated error unwind in the combination transitions is
> solved for all drivers by having the core code attempt to return back to
> the starting state through a natural sequence of FSM arcs. Should this be
> impossible, or a double fault happens, the ERROR state is invoked.
>
> The FSM approach proves extensible in the future as new additional states
> and arcs can be added to the FSM. The combination transitions logic will
> automatically walk through these new arcs providing backwards
> compatibility, and it is easy for the core to insulate drivers that do not
> support new states from ever seeing them while still using a single FSM
> digraph as the behavioral specification. A separate patch, to be included
> with the first driver that requires it, demonstrates how this works for
> the optional P2P states.
>
> A following patch in the series introduces a simple query IOCTL that
> identifies what FSM arcs the device supports allowing extensible
> discoverability.
>
> The old uAPI definitions are removed from the header file. As per Linus's
> past remarks we do not have a hard requirement to retain compilation
> compatibility in uapi headers and qemu is already following Linus's
> preferred model of copying the kernel headers. As the ABI functionality is
> preserved this is only a compilation break.
>
> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>   drivers/vfio/vfio.c       | 236 +++++++++++++++++++
>   include/linux/vfio.h      |   5 +
>   include/uapi/linux/vfio.h | 461 +++++++++++++++++++++-----------------
>   3 files changed, 499 insertions(+), 203 deletions(-)
>
> This is RFC because the full series is not fully tested yet, that should be
> done next week. The series can be previewed here:


The testing with the full series seems OK, see few typos on the 
documentation part inline.


>    https://github.com/jgunthorpe/linux/commits/mlx5_vfio_pci
>
> The mlx5 implementation of the state change:
>
>    https://github.com/jgunthorpe/linux/blob/0a6416da226fe8ee888aa8026f1e363698e137a8/drivers/vfio/pci/mlx5/main.c#L264
>
> Has turned out very clean. Compare this to the v5 version, which is full of
> subtle bugs:
>
>    https://lore.kernel.org/kvm/20211027095658.144468-12-yishaih@nvidia.com/
>
> This patch adds the VFIO_DEVICE_MIG_ARC_SUPPORTED ioctl:
>
>    https://github.com/jgunthorpe/linux/commit/c92eff6c2afd1ecc9ed5c67a1f81c7f270f6e940
>
> And this shows how the Huawei driver should opt out of P2P arcs:
>
>    https://github.com/jgunthorpe/linux/commit/dd2571c481d27546a33ff4583ce8ad49847fe300
>
> Finally, this helper script was used to audit the FSM construction, the commit
> is informative to help understand the details:
>
>    https://github.com/jgunthorpe/linux/commit/0a6416da226fe8ee888aa8026f1e363698e137a8
>
> I think this should resolve all the disconnect. It keeps the uAPI practically
> unchanged while strongly defining it in precise terms of what arcs a driver
> must implement. The core code deals with the troublesome precedence and error
> issue in a way that is intuitive to understand. It is now actually easy to
> audit a driver for correct implementation.
>
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index 735d1d344af9d4..96001f03bc39f1 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -1557,6 +1557,242 @@ static int vfio_device_fops_release(struct inode *inode, struct file *filep)
>   	return 0;
>   }
>   
> +enum {
> +	VFIO_MIG_FSM_MAX_STATE = VFIO_DEVICE_STATE_RUNNING_P2P + 1,
> +};
> +
> +/*
> + * vfio_mig_get_next_state - Compute the next step in the FSM
> + * @cur_fsm - The current state the device is in
> + * @new_fsm - The target state to reach
> + *
> + * Return the next step in the state progression between cur_fsm and
> + * new_fsm. This breaks down requests for complex transitions into
> + * smaller steps and returns the next step to get to new_fsm. The
> + * function may need to be called up to four times before reaching new_fsm.
> + *
> + * VFIO_DEVICE_STATE_ERROR is returned if the state transition is not allowed.
> + */
> +static u32 vfio_mig_get_next_state(u32 cur_fsm, u32 new_fsm)
> +{
> +	/*
> +	 * The coding in this table requires the driver to implement 15
> +	 * FSM arcs:
> +	 *        PRE_COPY -> PRE_COPY_P2P
> +	 *        PRE_COPY -> RUNNING
> +	 *        PRE_COPY_P2P -> PRE_COPY
> +	 *        PRE_COPY_P2P -> RUNNING_P2P
> +	 *        PRE_COPY_P2P -> STOP_COPY
> +	 *        RESUMING -> STOP
> +	 *        RUNNING -> PRE_COPY
> +	 *        RUNNING -> RUNNING_P2P
> +	 *        RUNNING_P2P -> PRE_COPY_P2P
> +	 *        RUNNING_P2P -> RUNNING
> +	 *        RUNNING_P2P -> STOP
> +	 *        STOP -> RESUMING
> +	 *        STOP -> RUNNING_P2P
> +	 *        STOP -> STOP_COPY
> +	 *        STOP_COPY -> STOP
> +	 *
> +	 * The coding will step through multiple states for these combination
> +	 * transitions:
> +	 *        PRE_COPY -> PRE_COPY_P2P -> STOP_COPY
> +	 *        PRE_COPY -> RUNNING -> RUNNING_P2P
> +	 *        PRE_COPY -> RUNNING -> RUNNING_P2P -> STOP
> +	 *        PRE_COPY -> RUNNING -> RUNNING_P2P -> STOP -> RESUMING
> +	 *        PRE_COPY_P2P -> RUNNING_P2P -> RUNNING
> +	 *        PRE_COPY_P2P -> RUNNING_P2P -> STOP
> +	 *        PRE_COPY_P2P -> RUNNING_P2P -> STOP -> RESUMING
> +	 *        RESUMING -> STOP -> RUNNING_P2P
> +	 *        RESUMING -> STOP -> RUNNING_P2P -> PRE_COPY_P2P
> +	 *        RESUMING -> STOP -> RUNNING_P2P -> RUNNING
> +	 *        RESUMING -> STOP -> RUNNING_P2P -> RUNNING -> PRE_COPY
> +	 *        RESUMING -> STOP -> STOP_COPY
> +	 *        RUNNING -> RUNNING_P2P -> PRE_COPY_P2P
> +	 *        RUNNING -> RUNNING_P2P -> STOP
> +	 *        RUNNING -> RUNNING_P2P -> STOP -> RESUMING
> +	 *        RUNNING -> RUNNING_P2P -> STOP -> STOP_COPY
> +	 *        RUNNING_P2P -> RUNNING -> PRE_COPY
> +	 *        RUNNING_P2P -> STOP -> RESUMING
> +	 *        RUNNING_P2P -> STOP -> STOP_COPY
> +	 *        STOP -> RUNNING_P2P -> PRE_COPY_P2P
> +	 *        STOP -> RUNNING_P2P -> RUNNING
> +	 *        STOP -> RUNNING_P2P -> RUNNING -> PRE_COPY
> +	 *        STOP_COPY -> STOP -> RESUMING
> +	 *        STOP_COPY -> STOP -> RUNNING_P2P
> +	 *        STOP_COPY -> STOP -> RUNNING_P2P -> RUNNING
> +	 *
> +	 *  The following transitions are blocked:
> +	 *        STOP_COPY -> PRE_COPY
> +	 *        STOP_COPY -> PRE_COPY_P2P
> +	 */
> +	static const u8 vfio_from_fsm_table[VFIO_MIG_FSM_MAX_STATE][VFIO_MIG_FSM_MAX_STATE] = {
> +		[VFIO_DEVICE_STATE_STOP] = {
> +			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
> +			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING_P2P,
> +			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_RUNNING_P2P,
> +			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
> +			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY,
> +			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RESUMING,
> +			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
> +			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
> +		},
> +		[VFIO_DEVICE_STATE_RUNNING] = {
> +			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_RUNNING_P2P,
> +			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
> +			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_PRE_COPY,
> +			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
> +			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_RUNNING_P2P,
> +			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RUNNING_P2P,
> +			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
> +			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
> +		},
> +		[VFIO_DEVICE_STATE_PRE_COPY] = {
> +			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_RUNNING,
> +			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
> +			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_PRE_COPY,
> +			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_PRE_COPY_P2P,
> +			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_PRE_COPY_P2P,
> +			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RUNNING,
> +			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING,
> +			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
> +		},
> +		[VFIO_DEVICE_STATE_PRE_COPY_P2P] = {
> +			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_RUNNING_P2P,
> +			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING_P2P,
> +			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_PRE_COPY,
> +			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_PRE_COPY_P2P,
> +			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY,
> +			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RUNNING_P2P,
> +			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
> +			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
> +		},
> +		[VFIO_DEVICE_STATE_STOP_COPY] = {
> +			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
> +			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_STOP,
> +			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_ERROR,
> +			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_ERROR,
> +			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY,
> +			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP,
> +			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_STOP,
> +			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
> +		},
> +		[VFIO_DEVICE_STATE_RESUMING] = {
> +			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
> +			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_STOP,
> +			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_STOP,
> +			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_STOP,
> +			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP,
> +			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RESUMING,
> +			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_STOP,
> +			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
> +		},
> +		[VFIO_DEVICE_STATE_RUNNING_P2P] = {
> +			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
> +			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
> +			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_RUNNING,
> +			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_PRE_COPY_P2P,
> +			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP,
> +			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP,
> +			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
> +			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
> +		},
> +		[VFIO_DEVICE_STATE_ERROR] = {
> +			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_ERROR,
> +			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_ERROR,
> +			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_ERROR,
> +			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_ERROR,
> +			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_ERROR,
> +			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_ERROR,
> +			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_ERROR,
> +			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
> +		},
> +	};
> +	return vfio_from_fsm_table[cur_fsm][new_fsm];
> +}
> +
> +static bool vfio_mig_in_saving_group(u32 state)
> +{
> +	return state == VFIO_DEVICE_STATE_PRE_COPY ||
> +	       state == VFIO_DEVICE_STATE_PRE_COPY_P2P ||
> +	       state == VFIO_DEVICE_STATE_STOP_COPY;
> +}
> +
> +/*
> + * vfio_mig_set_device_state - Change the migration device_state
> + * @device - The VFIO device to act on
> + * @target_device_state - The new state from the uAPI
> + * @cur_state - Pointer to the drivers current migration FSM state
> + *
> + * This validates target_device_state and then calls
> + * ops->migration_step_device_state() enough times to achieve the target state.
> + * See vfio_mig_get_next_state() for the required arcs.
> + *
> + * If the op callback fails then the driver should leave the device state
> + * unchanged and return errno, should this not be possible then it should set
> + * cur_state to VFIO_DEVICE_STATE_ERROR and return errno.
> + *
> + * If a step fails then this attempts to reverse the FSM back to the original
> + * state, should that fail it is set to VFIO_DEVICE_STATE_ERROR and error is
> + * returned.
> + */
> +int vfio_mig_set_device_state(struct vfio_device *device, u32 target_state,
> +			      u32 *cur_state)
> +{
> +	u32 starting_state = *cur_state;
> +	u32 next_state;
> +	int ret;
> +
> +	if (target_state >= VFIO_MIG_FSM_MAX_STATE)
> +		return -EINVAL;
> +
> +	while (*cur_state != target_state) {
> +		next_state = vfio_mig_get_next_state(*cur_state, target_state);
> +		if (next_state == VFIO_DEVICE_STATE_ERROR) {
> +			ret = -EINVAL;
> +			goto out_restore_state;
> +		}
> +		ret = device->ops->migration_step_device_state(device,
> +							       next_state);
> +		if (ret)
> +			goto out_restore_state;
> +		*cur_state = next_state;
> +	}
> +	return 0;
> +
> +out_restore_state:
> +	if (*cur_state == VFIO_DEVICE_STATE_ERROR)
> +		return ret;
> +	/*
> +	 * If the window as initialized, and we closed the window, then we
> +	 * cannot recover the old state.
> +	 */
> +	if ((vfio_mig_in_saving_group(starting_state) &&
> +	     !vfio_mig_in_saving_group(*cur_state)) ||
> +	    (starting_state == VFIO_DEVICE_STATE_RESUMING &&
> +	     *cur_state != VFIO_DEVICE_STATE_RESUMING)) {
> +		*cur_state = VFIO_DEVICE_STATE_ERROR;
> +		return ret;
> +	}
> +
> +	/*
> +	 * Make a best effort to restore things back to where we started.
> +	 */
> +	while (*cur_state != starting_state) {
> +		next_state =
> +			vfio_mig_get_next_state(*cur_state, starting_state);
> +		if (next_state == VFIO_DEVICE_STATE_ERROR ||
> +		    device->ops->migration_step_device_state(device,
> +							     next_state)) {
> +			*cur_state = VFIO_DEVICE_STATE_ERROR;
> +			break;
> +		}
> +		*cur_state = next_state;
> +	}
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(vfio_mig_set_device_state);
> +
>   static long vfio_device_fops_unl_ioctl(struct file *filep,
>   				       unsigned int cmd, unsigned long arg)
>   {
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 76191d7abed185..5c96ef5e8d5202 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -69,6 +69,8 @@ struct vfio_device_ops {
>   	int	(*mmap)(struct vfio_device *vdev, struct vm_area_struct *vma);
>   	void	(*request)(struct vfio_device *vdev, unsigned int count);
>   	int	(*match)(struct vfio_device *vdev, char *buf);
> +	int     (*migration_step_device_state)(struct vfio_device *device,
> +					       u32 next_state);
>   };
>   
>   void vfio_init_group_dev(struct vfio_device *device, struct device *dev,
> @@ -82,6 +84,9 @@ extern void vfio_device_put(struct vfio_device *device);
>   
>   int vfio_assign_device_set(struct vfio_device *device, void *set_id);
>   
> +int vfio_mig_set_device_state(struct vfio_device *device, u32 target_state,
> +			      u32 *cur_state);
> +
>   /*
>    * External user API
>    */
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index ef33ea002b0b31..42e0ab905d2df7 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -408,223 +408,278 @@ struct vfio_region_gfx_edid {
>   #define VFIO_REGION_SUBTYPE_MIGRATION           (1)
>   
>   /*
> - * The structure vfio_device_migration_info is placed at the 0th offset of
> - * the VFIO_REGION_SUBTYPE_MIGRATION region to get and set VFIO device related
> - * migration information. Field accesses from this structure are only supported
> - * at their native width and alignment. Otherwise, the result is undefined and
> - * vendor drivers should return an error.
> + * The structure vfio_device_migration_info is placed at the immediate start
> + * of the per-device VFIO_REGION_SUBTYPE_MIGRATION region to manage the device
> + * state and migration information for the device.  Field accesses for this
> + * structure are only supported using their native width and alignment,
> + * accesses otherwise are undefined, and the kernel migration driver should
> + * return an error.
>    *
>    * device_state: (read/write)
> - *      - The user application writes to this field to inform the vendor driver
> - *        about the device state to be transitioned to.
> - *      - The vendor driver should take the necessary actions to change the
> - *        device state. After successful transition to a given state, the
> - *        vendor driver should return success on write(device_state, state)
> - *        system call. If the device state transition fails, the vendor driver
> - *        should return an appropriate -errno for the fault condition.
> - *      - On the user application side, if the device state transition fails,
> - *	  that is, if write(device_state, state) returns an error, read
> - *	  device_state again to determine the current state of the device from
> - *	  the vendor driver.
> - *      - The vendor driver should return previous state of the device unless
> - *        the vendor driver has encountered an internal error, in which case
> - *        the vendor driver may report the device_state VFIO_DEVICE_STATE_ERROR.
> - *      - The user application must use the device reset ioctl to recover the
> - *        device from VFIO_DEVICE_STATE_ERROR state. If the device is
> - *        indicated to be in a valid device state by reading device_state, the
> - *        user application may attempt to transition the device to any valid
> - *        state reachable from the current state or terminate itself.
> - *
> - *      device_state consists of 3 bits:
> - *      - If bit 0 is set, it indicates the _RUNNING state. If bit 0 is clear,
> - *        it indicates the _STOP state. When the device state is changed to
> - *        _STOP, driver should stop the device before write() returns.
> - *      - If bit 1 is set, it indicates the _SAVING state, which means that the
> - *        driver should start gathering device state information that will be
> - *        provided to the VFIO user application to save the device's state.
> - *      - If bit 2 is set, it indicates the _RESUMING state, which means that
> - *        the driver should prepare to resume the device. Data provided through
> - *        the migration region should be used to resume the device.
> - *      Bits 3 - 31 are reserved for future use. To preserve them, the user
> - *      application should perform a read-modify-write operation on this
> - *      field when modifying the specified bits.
> - *
> - *  +------- _RESUMING
> - *  |+------ _SAVING
> - *  ||+----- _RUNNING
> - *  |||
> - *  000b => Device Stopped, not saving or resuming
> - *  001b => Device running, which is the default state
> - *  010b => Stop the device & save the device state, stop-and-copy state
> - *  011b => Device running and save the device state, pre-copy state
> - *  100b => Device stopped and the device state is resuming
> - *  101b => Invalid state
> - *  110b => Error state
> - *  111b => Invalid state
> - *
> - * State transitions:
> - *
> - *              _RESUMING  _RUNNING    Pre-copy    Stop-and-copy   _STOP
> - *                (100b)     (001b)     (011b)        (010b)       (000b)
> - * 0. Running or default state
> - *                             |
> - *
> - * 1. Normal Shutdown (optional)
> - *                             |------------------------------------->|
> - *
> - * 2. Save the state or suspend
> - *                             |------------------------->|---------->|
> - *
> - * 3. Save the state during live migration
> - *                             |----------->|------------>|---------->|
> - *
> - * 4. Resuming
> - *                  |<---------|
> - *
> - * 5. Resumed
> - *                  |--------->|
> - *
> - * 0. Default state of VFIO device is _RUNNING when the user application starts.
> - * 1. During normal shutdown of the user application, the user application may
> - *    optionally change the VFIO device state from _RUNNING to _STOP. This
> - *    transition is optional. The vendor driver must support this transition but
> - *    must not require it.
> - * 2. When the user application saves state or suspends the application, the
> - *    device state transitions from _RUNNING to stop-and-copy and then to _STOP.
> - *    On state transition from _RUNNING to stop-and-copy, driver must stop the
> - *    device, save the device state and send it to the application through the
> - *    migration region. The sequence to be followed for such transition is given
> - *    below.
> - * 3. In live migration of user application, the state transitions from _RUNNING
> - *    to pre-copy, to stop-and-copy, and to _STOP.
> - *    On state transition from _RUNNING to pre-copy, the driver should start
> - *    gathering the device state while the application is still running and send
> - *    the device state data to application through the migration region.
> - *    On state transition from pre-copy to stop-and-copy, the driver must stop
> - *    the device, save the device state and send it to the user application
> - *    through the migration region.
> - *    Vendor drivers must support the pre-copy state even for implementations
> - *    where no data is provided to the user before the stop-and-copy state. The
> - *    user must not be required to consume all migration data before the device
> - *    transitions to a new state, including the stop-and-copy state.
> - *    The sequence to be followed for above two transitions is given below.
> - * 4. To start the resuming phase, the device state should be transitioned from
> - *    the _RUNNING to the _RESUMING state.
> - *    In the _RESUMING state, the driver should use the device state data
> - *    received through the migration region to resume the device.
> - * 5. After providing saved device data to the driver, the application should
> - *    change the state from _RESUMING to _RUNNING.
> + *   The device_state field is the current state in a finite state machine
> + *   (FSM) that controls the migration function of the device. A write by the
> + *   user requests the FSM move to the new state. Some general rules govern
> + *   the FSM:
> + *     - The user may read or write the device state register at any time.
> + *     - The kernel migration driver must fully transition the device to the
> + *       new state value before the write(2) operation returns to the user.
> + *     - The kernel migration driver must not generate asynchronous device
> + *       state transitions outside of manipulation by the user or the
> + *       VFIO_DEVICE_RESET ioctl as described below.
> + *     - In the event of a device state transition failure, the kernel
> + *       migration driver must return a write(2) error with appropriate errno
> + *       to the user. Refer to the ERROR arc section below for additional detail
> + *       on error handling.
> + *     - Devices supporting migration via this specification must support the
> + *       VFIO_DEVICE_RESET ioctl and any use of that ioctl must return the
> + *       device migration state to the RUNNING state.
> + *
> + *   There are 6 mandatory states defined as STOP, RUNNING, STOP_COPY,
> + *   PRE_COPY, RESUMING and ERROR plus two optional states PRE_COPY_P2P and
> + *   RESUMING_P2P.

Typo, no RESUMING_P2P, this should be RUNNING_P2P.


> + *
> + *   When the user writes to the device_state it triggers a transition from
> + *   the current value of device_state to the value written by the user. Some
> + *   transitions are called FSM arcs, and when requested, have the driver
> + *   behavior defined as follows:
> + *
> + *   RUNNING_P2P -> STOP
> + *   STOP_COPY -> STOP
> + *     While in STOP the device must stop the operation of the device.  The
> + *     device must not generate interrupts, DMA, or advance its internal
> + *     state. When stopped the device and kernel migration driver must accept
> + *     and respond to interaction to support external subsystems in the
> + *     STOP state, for example PCI MSI-X and PCI config pace. Failure by
> + *     the user to restrict device access while in STOP must not result in
> + *     error conditions outside the user context (ex. host system faults).
> + *
> + *     The STOP_COPY arc will close the data window.
> + *
> + *   RESUMING -> STOP
> + *     Leaving RESUMING closes the migration region data window and indicates
> + *     the end of a resuming session for the device.  The kernel migration
> + *     driver should complete the incorporation of data written to the
> + *     migration data window into the device internal state and perform final
> + *     validity and consistency checking of the new device state.  If the user
> + *     provided data is found to be incomplete, inconsistent, or otherwise
> + *     invalid, the migration driver must indicate a write(2) error and
> + *     optionally go to the ERROR state as described below. The data window
> + *     is closed.
> + *
> + *     While in STOP the device has the same behavior as other STOP states
> + *     described above.
> + *
> + *     To abort a RESUMING session the device must be reset.
> + *
> + *   PRE_COPY -> RUNNING
> + *   RUNNING_P2P -> RUNNING
> + *     While in RUNNING the device is fully operational, the device may
> + *     generate interrupts, DMA, respond to MMIO, all vfio device regions are
> + *     functional, and the device may advance its internal state.
> + *
> + *     The PRE_COPY arc will close the data window.
> + *
> + *   PRE_COPY_P2P -> RUNNING_P2P
> + *   RUNNING -> RUNNING_P2P
> + *   STOP -> RUNNING_P2P
> + *     While in RUNNING_P2P the device is partially running in the P2P quiescent
> + *     state defined below.
> + *
> + *     The PRE_COPY arc will close the data window.
> + *
> + *   RUNNING -> PRE_COPY
> + *   RUNNING_P2P -> PRE_COPY_P2P
> + *   STOP -> STOP_COPY
> + *     PRE_COPY, PRE_COPY_P2P and STOP_COPY form the "saving group" of states
> + *     which share the data window. Moving between these states alters what is
> + *     streamed in the data window, but does not close or otherwise effect the
> + *     state of the data window.
> + *
> + *     These arcs begin the process of saving the device state. The device
> + *     initializes the migration region data window and associated fields
> + *     within vfio_device_migration_info for capturing the migration data
> + *     stream for the device. The migration driver may perform actions such as
> + *     enabling dirty logging of device state when entering PRE_COPY or
> + *     PER_COPY_P2P.
> + *
> + *     Each arc does not change the device operation, the device remains RUNNING,
> + *     P2P quiesced or in STOP. The STOP_COPY state is described below in
> + *     PRE_COPY_P2P -> STOP_COPY.
> + *
> + *   PRE_COPY -> PRE_COPY_P2P
> + *     Entering PRE_COPY_P2P continues all the behaviors of PRE_COPY above.
> + *     However, while in the PRE_COPY_P2P state, the device is partially
> + *     running in the P2P quiescent state defined below, like RUNNING_P2P.
> + *
> + *   PRE_COPY_P2P -> PRE_COPY
> + *     This arc allows returning the device to a full RUNNING behavior while
> + *     continuing all the behaviors of PRE_COPY.
> + *
> + *   PRE_COPY_P2P -> STOP_COPY
> + *     While in the STOP_COPY state the device has the same behavior as STOP
> + *     with the addition that the data window continues to stream the
> + *     migration state. End of stream indicates the entire device state has
> + *     been transferred as detailed in the data window protocol below.
> + *
> + *     The user should take steps to restrict access to vfio device regions
> + *     other than the migration region while the device is in STOP_COPY or
> + *     risk corruption of the device migration data stream.
> + *
> + *   STOP -> RESUMING
> + *     Entering the RESUMING state enables and initializes the migration
> + *     region data window and associated fields within struct
> + *     vfio_device_migration_info for restoring the device from a migration
> + *     data stream captured from a COPY session with a compatible device. The
> + *     migration driver may alter/reset the internal device state for this arc
> + *     if required to prepare the device to receive the migration data.
> + *
> + *   STOP_COPY -> PRE_COPY
> + *   STOP_COPY -> PRE_COPY_P2P
> + *     These arcs are not permitted and return error if requested. Future
> + *     revisions of this API may define behaviors for these arcs.
> + *
> + *   any -> ERROR
> + *     ERROR cannot be specified as a device state, however any transition
> + *     request can be failed with an errno return and may then move the
> + *     device_state into ERROR. In this case the device was unable to execute
> + *     the requested arc and was also unable to restore the device to the
> + *     original device_state. The device behavior in ERROR can be any
> + *     combination of allowed behaviors from the FSM path between the current
> + *     and requested device_state. The user can detect the ERROR state by
> + *     reading the device_state after receiving an errno failure to a
> + *     device_state write. To recover from ERROR VFIO_DEVICE_RESET must be
> + *     used to return the device_state back to RUNNING.
> + *
> + *   The peer to peer (P2P) quiescent state is intended to be a quiescent
> + *   state for the device for the purposes of managing multiple devices within
> + *   a user context where peer-to-peer DMA between devices may be active. The
> + *   PRE_COPY_P2P and RUNNING_P2P states must prevent the device from
> + *   initiating any new P2P DMA transactions. If the device can identify P2P
> + *   transactions then it can stop only P2P DMA, otherwise it must stop all
> + *   DMA.  The migration driver must complete any such outstanding operations
> + *   prior to completing the FSM arc into either P2P state.
> + *
> + *   The remaining possible transitions are interpreted as combinations of the
> + *   above FSM arcs. As there are multiple paths through the FSM arcs the path
> + *   should be selected based on the following rules:
> + *     - Selet the shortest path
> + *     - The path can not have saving group states as interior arcs, only
> + *       starting/end states.
> + *   Refer to vfio_mig_get_next_state() for the result of the algorithm.
> + *
> + *   The automatic transit through the FSM arcs that make up the combination
> + *   transition is invisible to the user. The operation either succeeds and
> + *   sets device_state to the new value or fails and leaves it at the original
> + *   value or ERROR.
> + *
> + *   The PRE_COPY_P2P and RESUMING_P2P device_states are optional.

As of above, RESUMING_P2P should be RUNNING_P2P.

>   If a device
> + *   does not support it then the state cannot be written or read from
> + *   device_state. For the purposes of specification, the state continues to
> + *   exist but modified so that the device is fully operational while in
> + *   either state. When not supported the user will have to request
> + *   combination transitions (ie PRE_COPY -> STOP_COPY / RESUMING -> RUNNING)
> + *   to avoid writing the unsupported device_state value.
>    *
>    * reserved:
> - *      Reads on this field return zero and writes are ignored.
> + *   Reads on this field return zero and writes are ignored.
>    *
>    * pending_bytes: (read only)
> - *      The number of pending bytes still to be migrated from the vendor driver.
> + *   The kernel migration driver uses this field to indicate an estimate of
> + *   the remaining data size (in bytes) for the user to copy while the data
> + *   window is used by the saving group of states. The value should be
> + *   considered volatile, especially while in the PRE_COPY[_P2P] states.
> + *   Userspace uses this field to test whether data is available to be read
> + *   from the data section described below.  Userspace should only consider
> + *   whether the value read is zero or non-zero for the purposes of the
> + *   protocol below.  The user may only consider the migration data stream to
> + *   be end of stream when pending_bytes reports a zero value while the device
> + *   is in STOP_COPY.  The kernel migration driver must not consider the state
> + *   of pending_bytes or the data window when executing arcs, and cannot fail
> + *   any arc because the data window has not reached end of stream. The value
> + *   of this field is undefined outside the saving group of states.
>    *
>    * data_offset: (read only)
> - *      The user application should read data_offset field from the migration
> - *      region. The user application should read the device data from this
> - *      offset within the migration region during the _SAVING state or write
> - *      the device data during the _RESUMING state. See below for details of
> - *      sequence to be followed.
> + *   This field indicates the offset relative to the start of the device
> + *   migration region for the user to collect (COPY) or store (RESUMING)
> + *   migration data for the device following the protocol described below. The
> + *   migration driver may provide sparse mmap support for the migration region
> + *   and use the data_offset field to direct user accesses as appropriate, but
> + *   must not require mmap access when provided.  The value of this field is
> + *   undefined when device_state is not in the saving group of states.
>    *
>    * data_size: (read/write)
> - *      The user application should read data_size to get the size in bytes of
> - *      the data copied in the migration region during the _SAVING state and
> - *      write the size in bytes of the data copied in the migration region
> - *      during the _RESUMING state.
> - *
> - * The format of the migration region is as follows:
> - *  ------------------------------------------------------------------
> - * |vfio_device_migration_info|    data section                      |
> - * |                          |     ///////////////////////////////  |
> - * ------------------------------------------------------------------
> - *   ^                              ^
> - *  offset 0-trapped part        data_offset
> - *
> - * The structure vfio_device_migration_info is always followed by the data
> - * section in the region, so data_offset will always be nonzero. The offset
> - * from where the data is copied is decided by the kernel driver. The data
> - * section can be trapped, mmapped, or partitioned, depending on how the kernel
> - * driver defines the data section. The data section partition can be defined
> - * as mapped by the sparse mmap capability. If mmapped, data_offset must be
> - * page aligned, whereas initial section which contains the
> - * vfio_device_migration_info structure, might not end at the offset, which is
> - * page aligned. The user is not required to access through mmap regardless
> - * of the capabilities of the region mmap.
> - * The vendor driver should determine whether and how to partition the data
> - * section. The vendor driver should return data_offset accordingly.
> - *
> - * The sequence to be followed while in pre-copy state and stop-and-copy state
> - * is as follows:
> - * a. Read pending_bytes, indicating the start of a new iteration to get device
> - *    data. Repeated read on pending_bytes at this stage should have no side
> - *    effects.
> - *    If pending_bytes == 0, the user application should not iterate to get data
> - *    for that device.
> - *    If pending_bytes > 0, perform the following steps.
> - * b. Read data_offset, indicating that the vendor driver should make data
> - *    available through the data section. The vendor driver should return this
> - *    read operation only after data is available from (region + data_offset)
> - *    to (region + data_offset + data_size).
> - * c. Read data_size, which is the amount of data in bytes available through
> - *    the migration region.
> - *    Read on data_offset and data_size should return the offset and size of
> - *    the current buffer if the user application reads data_offset and
> - *    data_size more than once here.
> - * d. Read data_size bytes of data from (region + data_offset) from the
> - *    migration region.
> - * e. Process the data.
> - * f. Read pending_bytes, which indicates that the data from the previous
> - *    iteration has been read. If pending_bytes > 0, go to step b.
> - *
> - * The user application can transition from the _SAVING|_RUNNING
> - * (pre-copy state) to the _SAVING (stop-and-copy) state regardless of the
> - * number of pending bytes. The user application should iterate in _SAVING
> - * (stop-and-copy) until pending_bytes is 0.
> - *
> - * The sequence to be followed while _RESUMING device state is as follows:
> - * While data for this device is available, repeat the following steps:
> - * a. Read data_offset from where the user application should write data.
> - * b. Write migration data starting at the migration region + data_offset for
> - *    the length determined by data_size from the migration source.
> - * c. Write data_size, which indicates to the vendor driver that data is
> - *    written in the migration region. Vendor driver must return this write
> - *    operations on consuming data. Vendor driver should apply the
> - *    user-provided migration region data to the device resume state.
> - *
> - * If an error occurs during the above sequences, the vendor driver can return
> - * an error code for next read() or write() operation, which will terminate the
> - * loop. The user application should then take the next necessary action, for
> - * example, failing migration or terminating the user application.
> - *
> - * For the user application, data is opaque. The user application should write
> - * data in the same order as the data is received and the data should be of
> - * same transaction size at the source.
> + *   This field indicates the length of the current data segment in
> + *   bytes. While COPY, the kernel migration driver uses this field to
> + *   indicate to the user the length of the migration data stream available at
> + *   data_offset. When RESUMING, the user writes this field with the length of
> + *   the data segment written at the migration driver provided
> + *   data_offset. The value of this field is undefined when device_state is
> + *   not in the saving group of states.
> + *
> + * The following protocol is used while the device is in the saving group of
> + * states:
> + *
> + * a) The user reads pending_bytes.  If the read value is zero, no data is
> + *    currently available for the device.  If the device is in STOP_COPY and a
> + *    zero value is read, this indicates the end of the device migration
> + *    stream and the device must not generate any new migration data.  If the
> + *    read value is non-zero, the user may proceed to collect device migration
> + *    data in step b).  Repeated reads of pending_bytes is allowed and must
> + *    not compromise the migration data stream provided the user does not
> + *    proceed to the following step.
> + * b) The user reads data_offset, which indicates to the migration driver to
> + *    make a segment of device migration data available to the user at the
> + *    provided offset.  This action commits the user to collect the data
> + *    segment.
> + * c) The user reads data_size to determine the extent of the currently
> + *    available migration data segment.
> + * d) The user collects the data_size segment of device migration data at the
> + *    previously provided data_offset using access methods compatible to those
> + *    for the migration region.  The user must not be required to collect the
> + *    data in a single operation.
> + * e) The user re-reads pending_bytes to indicate to the migration driver that
> + *    the provided data has been collected.  Provided the read pending_bytes
> + *    value is non-zero, the user may proceed directly to step b) for another
> + *    iteration.
> + *
> + * The following protocol is used while the device is in the RESUMING
> + * device_state:
> + *
> + * a) The user reads data_offset, which directs the user to the location
> + *    within the migration region to store the migration data segment.
> + * b) The user writes the migration data segment starting at the provided
> + *    data_offset.  The user must preserve the data segment size as used when
> + *    the segment was collected from the device when COPY.
> + * c) The user writes the data_size field with the number of bytes written to
> + *    the migration region in step b).  The kernel migration driver may use
> + *    this write to indicate the end of the current iteration.
> + * d) User proceeds to step a) so long as the migration data stream is not
> + *    complete.
> + *
> + * The kernel migration driver may indicate an error condition by returning a
> + * fault on read(2) or write(2) for any operation most approximate to the
> + * detection of the error.  Field accesses are provided within the protocol
> + * such that an opportunity exists to return a fault regardless of whether the
> + * data section is directly accessed via an mmap.
> + *
> + * The user must consider the migration data segments to be opaque and
> + * non-fungible. During RESUMING, the data segments must be written in the
> + * size and order as provided during any part of the saving group of states.
>    */
>   
> +enum {
> +	VFIO_DEVICE_STATE_STOP = 0,
> +	VFIO_DEVICE_STATE_RUNNING = 1,
> +	VFIO_DEVICE_STATE_STOP_COPY = 2,
> +	VFIO_DEVICE_STATE_PRE_COPY = 3,
> +	VFIO_DEVICE_STATE_RESUMING = 4,
> +	VFIO_DEVICE_STATE_PRE_COPY_P2P = 5,
> +	VFIO_DEVICE_STATE_ERROR = 6,
> +	VFIO_DEVICE_STATE_RUNNING_P2P = 7,
> +};
> +
>   struct vfio_device_migration_info {
>   	__u32 device_state;         /* VFIO device state */
> -#define VFIO_DEVICE_STATE_STOP      (0)
> -#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
> -#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
> -#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
> -#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
> -				     VFIO_DEVICE_STATE_SAVING |  \
> -				     VFIO_DEVICE_STATE_RESUMING)
> -
> -#define VFIO_DEVICE_STATE_VALID(state) \
> -	(state & VFIO_DEVICE_STATE_RESUMING ? \
> -	(state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_RESUMING : 1)
> -
> -#define VFIO_DEVICE_STATE_IS_ERROR(state) \
> -	((state & VFIO_DEVICE_STATE_MASK) == (VFIO_DEVICE_STATE_SAVING | \
> -					      VFIO_DEVICE_STATE_RESUMING))
> -
> -#define VFIO_DEVICE_STATE_SET_ERROR(state) \
> -	((state & ~VFIO_DEVICE_STATE_MASK) | VFIO_DEVICE_SATE_SAVING | \
> -					     VFIO_DEVICE_STATE_RESUMING)
> -
>   	__u32 reserved;
>   	__u64 pending_bytes;
>   	__u64 data_offset;
>
> base-commit: 40404a6ce71339d7bc0f0a0e185ad557bf421cec
Alex Williamson Jan. 18, 2022, 7:55 p.m. UTC | #2
On Fri, 14 Jan 2022 15:35:14 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> Clarify how the migration API works by recasting its specification from a
> bunch of bits into a formal FSM. This describes the same functional
> outcome, with no practical ABI change.

I don't understand why we're so reluctant to drop the previous
specification and call this v2.  Yes, it's clever that the enum for the
FSM matches the bit definitions, but we're also inserting previously
invalid states as a standard part of the device flow... (see below)

> Compared to the clarification Alex proposed:
> 
> https://lore.kernel.org/r/163909282574.728533.7460416142511440919.stgit@omen
> 
> this has a few deliberate functional differences:
> 
>  - STOP_COPY -> PRE_COPY is made optional and to be defined in
>    future. This is motivated by the realization that mlx5 and other
>    devices that don't implement pre-copy cannot really achieve this
>    without a lot of work. We know of no user space that issues this
>    transition, so there is no practical ABI break to fail it.
> 
>  - ERROR arcs allow the device function to remain unchanged. Reviewing the
>    driver design in this series makes it clear there is no advantage to
>    putting more complexity in the kernel. If the device has a double
>    fault, then userspace already has all the necessary tools to clean it
>    up.
> 
>  - NDMA is redefined into two FSM states and is described in terms of the
>    desired P2P quiescent behavior, noting that halting all DMA is an
>    acceptable implementation.
> 
> First, each of the 6 existing valid values of device_state are recast into
> FSM states using the names from the current comment. The two new states
> for P2P are inserted in the two invalid numbers giving a packed definition
> of all 8 states. As the numbers match the existing ABI there is no change.
> 
> Next, we define a set of FSM arcs which the driver will directly
> implement. Each of the 15 FSM arcs is listed and Alex's clarification
> wording is substantially reused to describe how they should operate.
> 
> Finally, we define the remaining set of old/new device_state transitions
> as 'combination transitions' which are naturally defined as taking
> multiple FSM arcs along the shortest path within the FSM's digraph. Two
> rules are defined which result in an unambiguous specification.
> 
> The kernel will fully support all possible transitions automatically by
> either executing the FSM arc directly, or by following the defined path
> and executing each step's FSM arc. It allows all drivers to automatically
> and consistently support all device_state writes in the original 5 states,
> except for STOP_COPY -> PRE_COPY. This fulfills the original concept of
> the ABI of allowing all combinations of device_state transitions.
> 
> In terms of code, the driver is responsible to implement all of the FSM
> arcs and the core code implements the combination transitions. This
> ensures that every driver implements the combination transitions in the
> same way, and insulates the driver from this complexity. The shortest
> paths, with the ambiguity resolved, are all precomputed and stored in a 64
> byte lookup table.
> 
> Further, the complicated error unwind in the combination transitions is
> solved for all drivers by having the core code attempt to return back to
> the starting state through a natural sequence of FSM arcs. Should this be
> impossible, or a double fault happens, the ERROR state is invoked.
> 
> The FSM approach proves extensible in the future as new additional states
> and arcs can be added to the FSM. The combination transitions logic will
> automatically walk through these new arcs providing backwards
> compatibility, and it is easy for the core to insulate drivers that do not
> support new states from ever seeing them while still using a single FSM
> digraph as the behavioral specification. A separate patch, to be included
> with the first driver that requires it, demonstrates how this works for
> the optional P2P states.
> 
> A following patch in the series introduces a simple query IOCTL that
> identifies what FSM arcs the device supports allowing extensible
> discoverability.
> 
> The old uAPI definitions are removed from the header file. As per Linus's
> past remarks we do not have a hard requirement to retain compilation
> compatibility in uapi headers and qemu is already following Linus's
> preferred model of copying the kernel headers. As the ABI functionality is
> preserved this is only a compilation break.
> 
> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  drivers/vfio/vfio.c       | 236 +++++++++++++++++++
>  include/linux/vfio.h      |   5 +
>  include/uapi/linux/vfio.h | 461 +++++++++++++++++++++-----------------
>  3 files changed, 499 insertions(+), 203 deletions(-)
> 
> This is RFC because the full series is not fully tested yet, that should be
> done next week. The series can be previewed here:
> 
>   https://github.com/jgunthorpe/linux/commits/mlx5_vfio_pci
> 
> The mlx5 implementation of the state change:
> 
>   https://github.com/jgunthorpe/linux/blob/0a6416da226fe8ee888aa8026f1e363698e137a8/drivers/vfio/pci/mlx5/main.c#L264
> 
> Has turned out very clean. Compare this to the v5 version, which is full of
> subtle bugs:
> 
>   https://lore.kernel.org/kvm/20211027095658.144468-12-yishaih@nvidia.com/
> 
> This patch adds the VFIO_DEVICE_MIG_ARC_SUPPORTED ioctl:
> 
>   https://github.com/jgunthorpe/linux/commit/c92eff6c2afd1ecc9ed5c67a1f81c7f270f6e940
> 
> And this shows how the Huawei driver should opt out of P2P arcs:
> 
>   https://github.com/jgunthorpe/linux/commit/dd2571c481d27546a33ff4583ce8ad49847fe300

We've been bitten several times by device support that didn't come to
be in this whole vfio migration effort.  Let's imagine that all devices
initially support these P2P states and adding any sort of opt-out would
be dead code.  The specification below doesn't propose the
ARC_SUPPORTED ioctl and userspace is implemented with some vague notion
that these states may not be implemented, but has no means to test.
Userspace is written to support the P2P arcs and fail on state
transition failures.

At some point later hns support is ready, it supports the migration
region, but migration fails with all existing userspace written to the
below spec.  I can't imagine that a device advertising migration, but it
being essentially guaranteed to fail is a viable condition and we can't
retroactively add this proposed ioctl to existing userspace binaries.
I think our recourse here would be to rev the migration sub-type again
so that userspace that doesn't know about devices that lack P2P won't
enable migration support.

So I think this ends up being a poor example of how to extend the uAPI.
An opt-out for part of the base specification is hard, it's much easier
to opt-in P2P as a feature.

 
> Finally, this helper script was used to audit the FSM construction, the commit
> is informative to help understand the details:
> 
>   https://github.com/jgunthorpe/linux/commit/0a6416da226fe8ee888aa8026f1e363698e137a8
> 
> I think this should resolve all the disconnect. It keeps the uAPI practically
> unchanged while strongly defining it in precise terms of what arcs a driver
> must implement. The core code deals with the troublesome precedence and error
> issue in a way that is intuitive to understand. It is now actually easy to
> audit a driver for correct implementation.
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index 735d1d344af9d4..96001f03bc39f1 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -1557,6 +1557,242 @@ static int vfio_device_fops_release(struct inode *inode, struct file *filep)
>  	return 0;
>  }
>  
> +enum {
> +	VFIO_MIG_FSM_MAX_STATE = VFIO_DEVICE_STATE_RUNNING_P2P + 1,
> +};

We have existing examples of using VFIO_foo_NUM_bar in the vfio uAPI
already, let's follow that lead.

> +
> +/*
> + * vfio_mig_get_next_state - Compute the next step in the FSM
> + * @cur_fsm - The current state the device is in
> + * @new_fsm - The target state to reach
> + *
> + * Return the next step in the state progression between cur_fsm and
> + * new_fsm. This breaks down requests for complex transitions into
> + * smaller steps and returns the next step to get to new_fsm. The
> + * function may need to be called up to four times before reaching new_fsm.
> + *
> + * VFIO_DEVICE_STATE_ERROR is returned if the state transition is not allowed.
> + */
> +static u32 vfio_mig_get_next_state(u32 cur_fsm, u32 new_fsm)

Can we use a typedef somewhere for type checking?

> +{
> +	/*
> +	 * The coding in this table requires the driver to implement 15
> +	 * FSM arcs:
> +	 *        PRE_COPY -> PRE_COPY_P2P
> +	 *        PRE_COPY -> RUNNING
> +	 *        PRE_COPY_P2P -> PRE_COPY
> +	 *        PRE_COPY_P2P -> RUNNING_P2P
> +	 *        PRE_COPY_P2P -> STOP_COPY
> +	 *        RESUMING -> STOP
> +	 *        RUNNING -> PRE_COPY
> +	 *        RUNNING -> RUNNING_P2P
> +	 *        RUNNING_P2P -> PRE_COPY_P2P
> +	 *        RUNNING_P2P -> RUNNING
> +	 *        RUNNING_P2P -> STOP
> +	 *        STOP -> RESUMING
> +	 *        STOP -> RUNNING_P2P
> +	 *        STOP -> STOP_COPY
> +	 *        STOP_COPY -> STOP
> +	 *
> +	 * The coding will step through multiple states for these combination
> +	 * transitions:
> +	 *        PRE_COPY -> PRE_COPY_P2P -> STOP_COPY
> +	 *        PRE_COPY -> RUNNING -> RUNNING_P2P
> +	 *        PRE_COPY -> RUNNING -> RUNNING_P2P -> STOP
> +	 *        PRE_COPY -> RUNNING -> RUNNING_P2P -> STOP -> RESUMING
> +	 *        PRE_COPY_P2P -> RUNNING_P2P -> RUNNING
> +	 *        PRE_COPY_P2P -> RUNNING_P2P -> STOP
> +	 *        PRE_COPY_P2P -> RUNNING_P2P -> STOP -> RESUMING
> +	 *        RESUMING -> STOP -> RUNNING_P2P
> +	 *        RESUMING -> STOP -> RUNNING_P2P -> PRE_COPY_P2P
> +	 *        RESUMING -> STOP -> RUNNING_P2P -> RUNNING
> +	 *        RESUMING -> STOP -> RUNNING_P2P -> RUNNING -> PRE_COPY
> +	 *        RESUMING -> STOP -> STOP_COPY
> +	 *        RUNNING -> RUNNING_P2P -> PRE_COPY_P2P
> +	 *        RUNNING -> RUNNING_P2P -> STOP
> +	 *        RUNNING -> RUNNING_P2P -> STOP -> RESUMING
> +	 *        RUNNING -> RUNNING_P2P -> STOP -> STOP_COPY
> +	 *        RUNNING_P2P -> RUNNING -> PRE_COPY
> +	 *        RUNNING_P2P -> STOP -> RESUMING
> +	 *        RUNNING_P2P -> STOP -> STOP_COPY
> +	 *        STOP -> RUNNING_P2P -> PRE_COPY_P2P
> +	 *        STOP -> RUNNING_P2P -> RUNNING
> +	 *        STOP -> RUNNING_P2P -> RUNNING -> PRE_COPY
> +	 *        STOP_COPY -> STOP -> RESUMING
> +	 *        STOP_COPY -> STOP -> RUNNING_P2P
> +	 *        STOP_COPY -> STOP -> RUNNING_P2P -> RUNNING
> +	 *
> +	 *  The following transitions are blocked:
> +	 *        STOP_COPY -> PRE_COPY
> +	 *        STOP_COPY -> PRE_COPY_P2P
> +	 */
> +	static const u8 vfio_from_fsm_table[VFIO_MIG_FSM_MAX_STATE][VFIO_MIG_FSM_MAX_STATE] = {
> +		[VFIO_DEVICE_STATE_STOP] = {
> +			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
> +			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING_P2P,
> +			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_RUNNING_P2P,
> +			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
> +			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY,
> +			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RESUMING,
> +			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
> +			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
> +		},
> +		[VFIO_DEVICE_STATE_RUNNING] = {
> +			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_RUNNING_P2P,
> +			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
> +			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_PRE_COPY,
> +			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
> +			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_RUNNING_P2P,
> +			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RUNNING_P2P,
> +			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
> +			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
> +		},
> +		[VFIO_DEVICE_STATE_PRE_COPY] = {
> +			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_RUNNING,
> +			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
> +			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_PRE_COPY,
> +			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_PRE_COPY_P2P,
> +			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_PRE_COPY_P2P,
> +			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RUNNING,
> +			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING,
> +			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
> +		},
> +		[VFIO_DEVICE_STATE_PRE_COPY_P2P] = {
> +			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_RUNNING_P2P,
> +			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING_P2P,
> +			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_PRE_COPY,
> +			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_PRE_COPY_P2P,
> +			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY,
> +			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RUNNING_P2P,
> +			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
> +			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
> +		},
> +		[VFIO_DEVICE_STATE_STOP_COPY] = {
> +			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
> +			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_STOP,
> +			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_ERROR,
> +			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_ERROR,
> +			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY,
> +			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP,
> +			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_STOP,
> +			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
> +		},
> +		[VFIO_DEVICE_STATE_RESUMING] = {
> +			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
> +			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_STOP,
> +			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_STOP,
> +			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_STOP,
> +			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP,
> +			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RESUMING,
> +			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_STOP,
> +			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
> +		},
> +		[VFIO_DEVICE_STATE_RUNNING_P2P] = {
> +			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
> +			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
> +			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_RUNNING,
> +			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_PRE_COPY_P2P,
> +			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP,
> +			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP,
> +			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
> +			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
> +		},
> +		[VFIO_DEVICE_STATE_ERROR] = {
> +			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_ERROR,
> +			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_ERROR,
> +			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_ERROR,
> +			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_ERROR,
> +			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_ERROR,
> +			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_ERROR,
> +			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_ERROR,
> +			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
> +		},
> +	};
> +	return vfio_from_fsm_table[cur_fsm][new_fsm];
> +}
> +
> +static bool vfio_mig_in_saving_group(u32 state)
> +{
> +	return state == VFIO_DEVICE_STATE_PRE_COPY ||
> +	       state == VFIO_DEVICE_STATE_PRE_COPY_P2P ||
> +	       state == VFIO_DEVICE_STATE_STOP_COPY;
> +}
> +
> +/*
> + * vfio_mig_set_device_state - Change the migration device_state
> + * @device - The VFIO device to act on
> + * @target_device_state - The new state from the uAPI
> + * @cur_state - Pointer to the drivers current migration FSM state
> + *
> + * This validates target_device_state and then calls
> + * ops->migration_step_device_state() enough times to achieve the target state.
> + * See vfio_mig_get_next_state() for the required arcs.
> + *
> + * If the op callback fails then the driver should leave the device state
> + * unchanged and return errno, should this not be possible then it should set
> + * cur_state to VFIO_DEVICE_STATE_ERROR and return errno.
> + *
> + * If a step fails then this attempts to reverse the FSM back to the original
> + * state, should that fail it is set to VFIO_DEVICE_STATE_ERROR and error is
> + * returned.
> + */
> +int vfio_mig_set_device_state(struct vfio_device *device, u32 target_state,
> +			      u32 *cur_state)
> +{
> +	u32 starting_state = *cur_state;
> +	u32 next_state;
> +	int ret;
> +
> +	if (target_state >= VFIO_MIG_FSM_MAX_STATE)
> +		return -EINVAL;
> +
> +	while (*cur_state != target_state) {
> +		next_state = vfio_mig_get_next_state(*cur_state, target_state);
> +		if (next_state == VFIO_DEVICE_STATE_ERROR) {
> +			ret = -EINVAL;
> +			goto out_restore_state;
> +		}
> +		ret = device->ops->migration_step_device_state(device,
> +							       next_state);
> +		if (ret)
> +			goto out_restore_state;
> +		*cur_state = next_state;
> +	}
> +	return 0;
> +
> +out_restore_state:
> +	if (*cur_state == VFIO_DEVICE_STATE_ERROR)
> +		return ret;
> +	/*
> +	 * If the window as initialized, and we closed the window, then we
> +	 * cannot recover the old state.
> +	 */
> +	if ((vfio_mig_in_saving_group(starting_state) &&
> +	     !vfio_mig_in_saving_group(*cur_state)) ||
> +	    (starting_state == VFIO_DEVICE_STATE_RESUMING &&
> +	     *cur_state != VFIO_DEVICE_STATE_RESUMING)) {
> +		*cur_state = VFIO_DEVICE_STATE_ERROR;
> +		return ret;
> +	}
> +
> +	/*
> +	 * Make a best effort to restore things back to where we started.
> +	 */
> +	while (*cur_state != starting_state) {
> +		next_state =
> +			vfio_mig_get_next_state(*cur_state, starting_state);
> +		if (next_state == VFIO_DEVICE_STATE_ERROR ||
> +		    device->ops->migration_step_device_state(device,
> +							     next_state)) {
> +			*cur_state = VFIO_DEVICE_STATE_ERROR;
> +			break;
> +		}
> +		*cur_state = next_state;
> +	}

Core code "transitioning" the device to ERROR seems a little suspect
here.  I guess you're imagining that driver code calling this with an
pointer to internal state that it also tests on a non-zero return.
Should we just step-device-state to ERROR to directly inform the driver?


> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(vfio_mig_set_device_state);
> +
>  static long vfio_device_fops_unl_ioctl(struct file *filep,
>  				       unsigned int cmd, unsigned long arg)
>  {
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 76191d7abed185..5c96ef5e8d5202 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -69,6 +69,8 @@ struct vfio_device_ops {
>  	int	(*mmap)(struct vfio_device *vdev, struct vm_area_struct *vma);
>  	void	(*request)(struct vfio_device *vdev, unsigned int count);
>  	int	(*match)(struct vfio_device *vdev, char *buf);
> +	int     (*migration_step_device_state)(struct vfio_device *device,
> +					       u32 next_state);
>  };
>  
>  void vfio_init_group_dev(struct vfio_device *device, struct device *dev,
> @@ -82,6 +84,9 @@ extern void vfio_device_put(struct vfio_device *device);
>  
>  int vfio_assign_device_set(struct vfio_device *device, void *set_id);
>  
> +int vfio_mig_set_device_state(struct vfio_device *device, u32 target_state,
> +			      u32 *cur_state);
> +
>  /*
>   * External user API
>   */
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index ef33ea002b0b31..42e0ab905d2df7 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -408,223 +408,278 @@ struct vfio_region_gfx_edid {
>  #define VFIO_REGION_SUBTYPE_MIGRATION           (1)
>  
>  /*
> - * The structure vfio_device_migration_info is placed at the 0th offset of
> - * the VFIO_REGION_SUBTYPE_MIGRATION region to get and set VFIO device related
> - * migration information. Field accesses from this structure are only supported
> - * at their native width and alignment. Otherwise, the result is undefined and
> - * vendor drivers should return an error.
> + * The structure vfio_device_migration_info is placed at the immediate start
> + * of the per-device VFIO_REGION_SUBTYPE_MIGRATION region to manage the device
> + * state and migration information for the device.  Field accesses for this
> + * structure are only supported using their native width and alignment,
> + * accesses otherwise are undefined, and the kernel migration driver should
> + * return an error.
>   *
>   * device_state: (read/write)
> - *      - The user application writes to this field to inform the vendor driver
> - *        about the device state to be transitioned to.
> - *      - The vendor driver should take the necessary actions to change the
> - *        device state. After successful transition to a given state, the
> - *        vendor driver should return success on write(device_state, state)
> - *        system call. If the device state transition fails, the vendor driver
> - *        should return an appropriate -errno for the fault condition.
> - *      - On the user application side, if the device state transition fails,
> - *	  that is, if write(device_state, state) returns an error, read
> - *	  device_state again to determine the current state of the device from
> - *	  the vendor driver.
> - *      - The vendor driver should return previous state of the device unless
> - *        the vendor driver has encountered an internal error, in which case
> - *        the vendor driver may report the device_state VFIO_DEVICE_STATE_ERROR.
> - *      - The user application must use the device reset ioctl to recover the
> - *        device from VFIO_DEVICE_STATE_ERROR state. If the device is
> - *        indicated to be in a valid device state by reading device_state, the
> - *        user application may attempt to transition the device to any valid
> - *        state reachable from the current state or terminate itself.
> - *
> - *      device_state consists of 3 bits:
> - *      - If bit 0 is set, it indicates the _RUNNING state. If bit 0 is clear,
> - *        it indicates the _STOP state. When the device state is changed to
> - *        _STOP, driver should stop the device before write() returns.
> - *      - If bit 1 is set, it indicates the _SAVING state, which means that the
> - *        driver should start gathering device state information that will be
> - *        provided to the VFIO user application to save the device's state.
> - *      - If bit 2 is set, it indicates the _RESUMING state, which means that
> - *        the driver should prepare to resume the device. Data provided through
> - *        the migration region should be used to resume the device.
> - *      Bits 3 - 31 are reserved for future use. To preserve them, the user
> - *      application should perform a read-modify-write operation on this
> - *      field when modifying the specified bits.
> - *
> - *  +------- _RESUMING
> - *  |+------ _SAVING
> - *  ||+----- _RUNNING
> - *  |||
> - *  000b => Device Stopped, not saving or resuming
> - *  001b => Device running, which is the default state
> - *  010b => Stop the device & save the device state, stop-and-copy state
> - *  011b => Device running and save the device state, pre-copy state
> - *  100b => Device stopped and the device state is resuming
> - *  101b => Invalid state
> - *  110b => Error state
> - *  111b => Invalid state
> - *
> - * State transitions:
> - *
> - *              _RESUMING  _RUNNING    Pre-copy    Stop-and-copy   _STOP
> - *                (100b)     (001b)     (011b)        (010b)       (000b)
> - * 0. Running or default state
> - *                             |
> - *
> - * 1. Normal Shutdown (optional)
> - *                             |------------------------------------->|
> - *
> - * 2. Save the state or suspend
> - *                             |------------------------->|---------->|
> - *
> - * 3. Save the state during live migration
> - *                             |----------->|------------>|---------->|
> - *
> - * 4. Resuming
> - *                  |<---------|
> - *
> - * 5. Resumed
> - *                  |--------->|
> - *
> - * 0. Default state of VFIO device is _RUNNING when the user application starts.
> - * 1. During normal shutdown of the user application, the user application may
> - *    optionally change the VFIO device state from _RUNNING to _STOP. This
> - *    transition is optional. The vendor driver must support this transition but
> - *    must not require it.
> - * 2. When the user application saves state or suspends the application, the
> - *    device state transitions from _RUNNING to stop-and-copy and then to _STOP.
> - *    On state transition from _RUNNING to stop-and-copy, driver must stop the
> - *    device, save the device state and send it to the application through the
> - *    migration region. The sequence to be followed for such transition is given
> - *    below.
> - * 3. In live migration of user application, the state transitions from _RUNNING
> - *    to pre-copy, to stop-and-copy, and to _STOP.
> - *    On state transition from _RUNNING to pre-copy, the driver should start
> - *    gathering the device state while the application is still running and send
> - *    the device state data to application through the migration region.
> - *    On state transition from pre-copy to stop-and-copy, the driver must stop
> - *    the device, save the device state and send it to the user application
> - *    through the migration region.
> - *    Vendor drivers must support the pre-copy state even for implementations
> - *    where no data is provided to the user before the stop-and-copy state. The
> - *    user must not be required to consume all migration data before the device
> - *    transitions to a new state, including the stop-and-copy state.
> - *    The sequence to be followed for above two transitions is given below.
> - * 4. To start the resuming phase, the device state should be transitioned from
> - *    the _RUNNING to the _RESUMING state.
> - *    In the _RESUMING state, the driver should use the device state data
> - *    received through the migration region to resume the device.
> - * 5. After providing saved device data to the driver, the application should
> - *    change the state from _RESUMING to _RUNNING.
> + *   The device_state field is the current state in a finite state machine
> + *   (FSM) that controls the migration function of the device. A write by the
> + *   user requests the FSM move to the new state. Some general rules govern
> + *   the FSM:
> + *     - The user may read or write the device state register at any time.
> + *     - The kernel migration driver must fully transition the device to the
> + *       new state value before the write(2) operation returns to the user.
> + *     - The kernel migration driver must not generate asynchronous device
> + *       state transitions outside of manipulation by the user or the
> + *       VFIO_DEVICE_RESET ioctl as described below.
> + *     - In the event of a device state transition failure, the kernel
> + *       migration driver must return a write(2) error with appropriate errno
> + *       to the user. Refer to the ERROR arc section below for additional detail
> + *       on error handling.
> + *     - Devices supporting migration via this specification must support the
> + *       VFIO_DEVICE_RESET ioctl and any use of that ioctl must return the
> + *       device migration state to the RUNNING state.
> + *
> + *   There are 6 mandatory states defined as STOP, RUNNING, STOP_COPY,
> + *   PRE_COPY, RESUMING and ERROR plus two optional states PRE_COPY_P2P and
> + *   RESUMING_P2P.
> + *
> + *   When the user writes to the device_state it triggers a transition from
> + *   the current value of device_state to the value written by the user. Some
> + *   transitions are called FSM arcs, and when requested, have the driver
> + *   behavior defined as follows:
> + *
> + *   RUNNING_P2P -> STOP
> + *   STOP_COPY -> STOP
> + *     While in STOP the device must stop the operation of the device.  The
> + *     device must not generate interrupts, DMA, or advance its internal
> + *     state. When stopped the device and kernel migration driver must accept
> + *     and respond to interaction to support external subsystems in the
> + *     STOP state, for example PCI MSI-X and PCI config pace. Failure by
> + *     the user to restrict device access while in STOP must not result in
> + *     error conditions outside the user context (ex. host system faults).
> + *
> + *     The STOP_COPY arc will close the data window.

These data window "sessions" should probably be described as a concept
first.

> + *
> + *   RESUMING -> STOP
> + *     Leaving RESUMING closes the migration region data window and indicates
> + *     the end of a resuming session for the device.  The kernel migration
> + *     driver should complete the incorporation of data written to the
> + *     migration data window into the device internal state and perform final
> + *     validity and consistency checking of the new device state.  If the user
> + *     provided data is found to be incomplete, inconsistent, or otherwise
> + *     invalid, the migration driver must indicate a write(2) error and
> + *     optionally go to the ERROR state as described below. The data window
> + *     is closed.
> + *
> + *     While in STOP the device has the same behavior as other STOP states
> + *     described above.

There is one STOP state, invariant of how the device arrives to it.
The reader can infer from the above statement that the behavior of a
state may be flexible to how the device arrives at the state unless
otherwise specified.

It seems like it would be more clear to describe each state
separately, describe the concept of data window sessions, then
present the valid state transitions.  Thanks,

Alex

> + *
> + *     To abort a RESUMING session the device must be reset.
> + *
> + *   PRE_COPY -> RUNNING
> + *   RUNNING_P2P -> RUNNING
> + *     While in RUNNING the device is fully operational, the device may
> + *     generate interrupts, DMA, respond to MMIO, all vfio device regions are
> + *     functional, and the device may advance its internal state.
> + *
> + *     The PRE_COPY arc will close the data window.
> + *
> + *   PRE_COPY_P2P -> RUNNING_P2P
> + *   RUNNING -> RUNNING_P2P
> + *   STOP -> RUNNING_P2P
> + *     While in RUNNING_P2P the device is partially running in the P2P quiescent
> + *     state defined below.
> + *
> + *     The PRE_COPY arc will close the data window.
> + *
> + *   RUNNING -> PRE_COPY
> + *   RUNNING_P2P -> PRE_COPY_P2P
> + *   STOP -> STOP_COPY
> + *     PRE_COPY, PRE_COPY_P2P and STOP_COPY form the "saving group" of states
> + *     which share the data window. Moving between these states alters what is
> + *     streamed in the data window, but does not close or otherwise effect the
> + *     state of the data window.
> + *
> + *     These arcs begin the process of saving the device state. The device
> + *     initializes the migration region data window and associated fields
> + *     within vfio_device_migration_info for capturing the migration data
> + *     stream for the device. The migration driver may perform actions such as
> + *     enabling dirty logging of device state when entering PRE_COPY or
> + *     PER_COPY_P2P.
> + *
> + *     Each arc does not change the device operation, the device remains RUNNING,
> + *     P2P quiesced or in STOP. The STOP_COPY state is described below in
> + *     PRE_COPY_P2P -> STOP_COPY.
> + *
> + *   PRE_COPY -> PRE_COPY_P2P
> + *     Entering PRE_COPY_P2P continues all the behaviors of PRE_COPY above.
> + *     However, while in the PRE_COPY_P2P state, the device is partially
> + *     running in the P2P quiescent state defined below, like RUNNING_P2P.
> + *
> + *   PRE_COPY_P2P -> PRE_COPY
> + *     This arc allows returning the device to a full RUNNING behavior while
> + *     continuing all the behaviors of PRE_COPY.
> + *
> + *   PRE_COPY_P2P -> STOP_COPY
> + *     While in the STOP_COPY state the device has the same behavior as STOP
> + *     with the addition that the data window continues to stream the
> + *     migration state. End of stream indicates the entire device state has
> + *     been transferred as detailed in the data window protocol below.
> + *
> + *     The user should take steps to restrict access to vfio device regions
> + *     other than the migration region while the device is in STOP_COPY or
> + *     risk corruption of the device migration data stream.
> + *
> + *   STOP -> RESUMING
> + *     Entering the RESUMING state enables and initializes the migration
> + *     region data window and associated fields within struct
> + *     vfio_device_migration_info for restoring the device from a migration
> + *     data stream captured from a COPY session with a compatible device. The
> + *     migration driver may alter/reset the internal device state for this arc
> + *     if required to prepare the device to receive the migration data.
> + *
> + *   STOP_COPY -> PRE_COPY
> + *   STOP_COPY -> PRE_COPY_P2P
> + *     These arcs are not permitted and return error if requested. Future
> + *     revisions of this API may define behaviors for these arcs.
> + *
> + *   any -> ERROR
> + *     ERROR cannot be specified as a device state, however any transition
> + *     request can be failed with an errno return and may then move the
> + *     device_state into ERROR. In this case the device was unable to execute
> + *     the requested arc and was also unable to restore the device to the
> + *     original device_state. The device behavior in ERROR can be any
> + *     combination of allowed behaviors from the FSM path between the current
> + *     and requested device_state. The user can detect the ERROR state by
> + *     reading the device_state after receiving an errno failure to a
> + *     device_state write. To recover from ERROR VFIO_DEVICE_RESET must be
> + *     used to return the device_state back to RUNNING.
> + *
> + *   The peer to peer (P2P) quiescent state is intended to be a quiescent
> + *   state for the device for the purposes of managing multiple devices within
> + *   a user context where peer-to-peer DMA between devices may be active. The
> + *   PRE_COPY_P2P and RUNNING_P2P states must prevent the device from
> + *   initiating any new P2P DMA transactions. If the device can identify P2P
> + *   transactions then it can stop only P2P DMA, otherwise it must stop all
> + *   DMA.  The migration driver must complete any such outstanding operations
> + *   prior to completing the FSM arc into either P2P state.
> + *
> + *   The remaining possible transitions are interpreted as combinations of the
> + *   above FSM arcs. As there are multiple paths through the FSM arcs the path
> + *   should be selected based on the following rules:
> + *     - Selet the shortest path
> + *     - The path can not have saving group states as interior arcs, only
> + *       starting/end states.
> + *   Refer to vfio_mig_get_next_state() for the result of the algorithm.
> + *
> + *   The automatic transit through the FSM arcs that make up the combination
> + *   transition is invisible to the user. The operation either succeeds and
> + *   sets device_state to the new value or fails and leaves it at the original
> + *   value or ERROR.
> + *
> + *   The PRE_COPY_P2P and RESUMING_P2P device_states are optional. If a device
> + *   does not support it then the state cannot be written or read from
> + *   device_state. For the purposes of specification, the state continues to
> + *   exist but modified so that the device is fully operational while in
> + *   either state. When not supported the user will have to request
> + *   combination transitions (ie PRE_COPY -> STOP_COPY / RESUMING -> RUNNING)
> + *   to avoid writing the unsupported device_state value.
>   *
>   * reserved:
> - *      Reads on this field return zero and writes are ignored.
> + *   Reads on this field return zero and writes are ignored.
>   *
>   * pending_bytes: (read only)
> - *      The number of pending bytes still to be migrated from the vendor driver.
> + *   The kernel migration driver uses this field to indicate an estimate of
> + *   the remaining data size (in bytes) for the user to copy while the data
> + *   window is used by the saving group of states. The value should be
> + *   considered volatile, especially while in the PRE_COPY[_P2P] states.
> + *   Userspace uses this field to test whether data is available to be read
> + *   from the data section described below.  Userspace should only consider
> + *   whether the value read is zero or non-zero for the purposes of the
> + *   protocol below.  The user may only consider the migration data stream to
> + *   be end of stream when pending_bytes reports a zero value while the device
> + *   is in STOP_COPY.  The kernel migration driver must not consider the state
> + *   of pending_bytes or the data window when executing arcs, and cannot fail
> + *   any arc because the data window has not reached end of stream. The value
> + *   of this field is undefined outside the saving group of states.
>   *
>   * data_offset: (read only)
> - *      The user application should read data_offset field from the migration
> - *      region. The user application should read the device data from this
> - *      offset within the migration region during the _SAVING state or write
> - *      the device data during the _RESUMING state. See below for details of
> - *      sequence to be followed.
> + *   This field indicates the offset relative to the start of the device
> + *   migration region for the user to collect (COPY) or store (RESUMING)
> + *   migration data for the device following the protocol described below. The
> + *   migration driver may provide sparse mmap support for the migration region
> + *   and use the data_offset field to direct user accesses as appropriate, but
> + *   must not require mmap access when provided.  The value of this field is
> + *   undefined when device_state is not in the saving group of states.
>   *
>   * data_size: (read/write)
> - *      The user application should read data_size to get the size in bytes of
> - *      the data copied in the migration region during the _SAVING state and
> - *      write the size in bytes of the data copied in the migration region
> - *      during the _RESUMING state.
> - *
> - * The format of the migration region is as follows:
> - *  ------------------------------------------------------------------
> - * |vfio_device_migration_info|    data section                      |
> - * |                          |     ///////////////////////////////  |
> - * ------------------------------------------------------------------
> - *   ^                              ^
> - *  offset 0-trapped part        data_offset
> - *
> - * The structure vfio_device_migration_info is always followed by the data
> - * section in the region, so data_offset will always be nonzero. The offset
> - * from where the data is copied is decided by the kernel driver. The data
> - * section can be trapped, mmapped, or partitioned, depending on how the kernel
> - * driver defines the data section. The data section partition can be defined
> - * as mapped by the sparse mmap capability. If mmapped, data_offset must be
> - * page aligned, whereas initial section which contains the
> - * vfio_device_migration_info structure, might not end at the offset, which is
> - * page aligned. The user is not required to access through mmap regardless
> - * of the capabilities of the region mmap.
> - * The vendor driver should determine whether and how to partition the data
> - * section. The vendor driver should return data_offset accordingly.
> - *
> - * The sequence to be followed while in pre-copy state and stop-and-copy state
> - * is as follows:
> - * a. Read pending_bytes, indicating the start of a new iteration to get device
> - *    data. Repeated read on pending_bytes at this stage should have no side
> - *    effects.
> - *    If pending_bytes == 0, the user application should not iterate to get data
> - *    for that device.
> - *    If pending_bytes > 0, perform the following steps.
> - * b. Read data_offset, indicating that the vendor driver should make data
> - *    available through the data section. The vendor driver should return this
> - *    read operation only after data is available from (region + data_offset)
> - *    to (region + data_offset + data_size).
> - * c. Read data_size, which is the amount of data in bytes available through
> - *    the migration region.
> - *    Read on data_offset and data_size should return the offset and size of
> - *    the current buffer if the user application reads data_offset and
> - *    data_size more than once here.
> - * d. Read data_size bytes of data from (region + data_offset) from the
> - *    migration region.
> - * e. Process the data.
> - * f. Read pending_bytes, which indicates that the data from the previous
> - *    iteration has been read. If pending_bytes > 0, go to step b.
> - *
> - * The user application can transition from the _SAVING|_RUNNING
> - * (pre-copy state) to the _SAVING (stop-and-copy) state regardless of the
> - * number of pending bytes. The user application should iterate in _SAVING
> - * (stop-and-copy) until pending_bytes is 0.
> - *
> - * The sequence to be followed while _RESUMING device state is as follows:
> - * While data for this device is available, repeat the following steps:
> - * a. Read data_offset from where the user application should write data.
> - * b. Write migration data starting at the migration region + data_offset for
> - *    the length determined by data_size from the migration source.
> - * c. Write data_size, which indicates to the vendor driver that data is
> - *    written in the migration region. Vendor driver must return this write
> - *    operations on consuming data. Vendor driver should apply the
> - *    user-provided migration region data to the device resume state.
> - *
> - * If an error occurs during the above sequences, the vendor driver can return
> - * an error code for next read() or write() operation, which will terminate the
> - * loop. The user application should then take the next necessary action, for
> - * example, failing migration or terminating the user application.
> - *
> - * For the user application, data is opaque. The user application should write
> - * data in the same order as the data is received and the data should be of
> - * same transaction size at the source.
> + *   This field indicates the length of the current data segment in
> + *   bytes. While COPY, the kernel migration driver uses this field to
> + *   indicate to the user the length of the migration data stream available at
> + *   data_offset. When RESUMING, the user writes this field with the length of
> + *   the data segment written at the migration driver provided
> + *   data_offset. The value of this field is undefined when device_state is
> + *   not in the saving group of states.
> + *
> + * The following protocol is used while the device is in the saving group of
> + * states:
> + *
> + * a) The user reads pending_bytes.  If the read value is zero, no data is
> + *    currently available for the device.  If the device is in STOP_COPY and a
> + *    zero value is read, this indicates the end of the device migration
> + *    stream and the device must not generate any new migration data.  If the
> + *    read value is non-zero, the user may proceed to collect device migration
> + *    data in step b).  Repeated reads of pending_bytes is allowed and must
> + *    not compromise the migration data stream provided the user does not
> + *    proceed to the following step.
> + * b) The user reads data_offset, which indicates to the migration driver to
> + *    make a segment of device migration data available to the user at the
> + *    provided offset.  This action commits the user to collect the data
> + *    segment.
> + * c) The user reads data_size to determine the extent of the currently
> + *    available migration data segment.
> + * d) The user collects the data_size segment of device migration data at the
> + *    previously provided data_offset using access methods compatible to those
> + *    for the migration region.  The user must not be required to collect the
> + *    data in a single operation.
> + * e) The user re-reads pending_bytes to indicate to the migration driver that
> + *    the provided data has been collected.  Provided the read pending_bytes
> + *    value is non-zero, the user may proceed directly to step b) for another
> + *    iteration.
> + *
> + * The following protocol is used while the device is in the RESUMING
> + * device_state:
> + *
> + * a) The user reads data_offset, which directs the user to the location
> + *    within the migration region to store the migration data segment.
> + * b) The user writes the migration data segment starting at the provided
> + *    data_offset.  The user must preserve the data segment size as used when
> + *    the segment was collected from the device when COPY.
> + * c) The user writes the data_size field with the number of bytes written to
> + *    the migration region in step b).  The kernel migration driver may use
> + *    this write to indicate the end of the current iteration.
> + * d) User proceeds to step a) so long as the migration data stream is not
> + *    complete.
> + *
> + * The kernel migration driver may indicate an error condition by returning a
> + * fault on read(2) or write(2) for any operation most approximate to the
> + * detection of the error.  Field accesses are provided within the protocol
> + * such that an opportunity exists to return a fault regardless of whether the
> + * data section is directly accessed via an mmap.
> + *
> + * The user must consider the migration data segments to be opaque and
> + * non-fungible. During RESUMING, the data segments must be written in the
> + * size and order as provided during any part of the saving group of states.
>   */
>  
> +enum {
> +	VFIO_DEVICE_STATE_STOP = 0,
> +	VFIO_DEVICE_STATE_RUNNING = 1,
> +	VFIO_DEVICE_STATE_STOP_COPY = 2,
> +	VFIO_DEVICE_STATE_PRE_COPY = 3,
> +	VFIO_DEVICE_STATE_RESUMING = 4,
> +	VFIO_DEVICE_STATE_PRE_COPY_P2P = 5,
> +	VFIO_DEVICE_STATE_ERROR = 6,
> +	VFIO_DEVICE_STATE_RUNNING_P2P = 7,
> +};
> +
>  struct vfio_device_migration_info {
>  	__u32 device_state;         /* VFIO device state */
> -#define VFIO_DEVICE_STATE_STOP      (0)
> -#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
> -#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
> -#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
> -#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
> -				     VFIO_DEVICE_STATE_SAVING |  \
> -				     VFIO_DEVICE_STATE_RESUMING)
> -
> -#define VFIO_DEVICE_STATE_VALID(state) \
> -	(state & VFIO_DEVICE_STATE_RESUMING ? \
> -	(state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_RESUMING : 1)
> -
> -#define VFIO_DEVICE_STATE_IS_ERROR(state) \
> -	((state & VFIO_DEVICE_STATE_MASK) == (VFIO_DEVICE_STATE_SAVING | \
> -					      VFIO_DEVICE_STATE_RESUMING))
> -
> -#define VFIO_DEVICE_STATE_SET_ERROR(state) \
> -	((state & ~VFIO_DEVICE_STATE_MASK) | VFIO_DEVICE_SATE_SAVING | \
> -					     VFIO_DEVICE_STATE_RESUMING)
> -
>  	__u32 reserved;
>  	__u64 pending_bytes;
>  	__u64 data_offset;
> 
> base-commit: 40404a6ce71339d7bc0f0a0e185ad557bf421cec
Jason Gunthorpe Jan. 18, 2022, 9 p.m. UTC | #3
On Tue, Jan 18, 2022 at 12:55:22PM -0700, Alex Williamson wrote:
> On Fri, 14 Jan 2022 15:35:14 -0400
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > Clarify how the migration API works by recasting its specification from a
> > bunch of bits into a formal FSM. This describes the same functional
> > outcome, with no practical ABI change.
> 
> I don't understand why we're so reluctant to drop the previous
> specification and call this v2. 

I won't object - but I can't say it is really necessary.

> Yes, it's clever that the enum for the FSM matches the bit
> definitions, but we're also inserting previously invalid states as a
> standard part of the device flow... (see below)

This is completely invisible to userspace, if userspace never writes
the new states to device_state it can never read them back.

> > This is RFC because the full series is not fully tested yet, that should be
> > done next week. The series can be previewed here:
> > 
> >   https://github.com/jgunthorpe/linux/commits/mlx5_vfio_pci
> > 
> > The mlx5 implementation of the state change:
> > 
> >   https://github.com/jgunthorpe/linux/blob/0a6416da226fe8ee888aa8026f1e363698e137a8/drivers/vfio/pci/mlx5/main.c#L264
> > 
> > Has turned out very clean. Compare this to the v5 version, which is full of
> > subtle bugs:
> > 
> >   https://lore.kernel.org/kvm/20211027095658.144468-12-yishaih@nvidia.com/
> > 
> > This patch adds the VFIO_DEVICE_MIG_ARC_SUPPORTED ioctl:
> > 
> >   https://github.com/jgunthorpe/linux/commit/c92eff6c2afd1ecc9ed5c67a1f81c7f270f6e940
> > 
> > And this shows how the Huawei driver should opt out of P2P arcs:
> > 
> >   https://github.com/jgunthorpe/linux/commit/dd2571c481d27546a33ff4583ce8ad49847fe300
> 
> We've been bitten several times by device support that didn't come to
> be in this whole vfio migration effort.

Which is why this patch is for Huawei not mlx5..

> At some point later hns support is ready, it supports the migration
> region, but migration fails with all existing userspace written to the
> below spec.  I can't imagine that a device advertising migration, but it
> being essentially guaranteed to fail is a viable condition and we can't
> retroactively add this proposed ioctl to existing userspace binaries.
> I think our recourse here would be to rev the migration sub-type again
> so that userspace that doesn't know about devices that lack P2P won't
> enable migration support.

Global versions are rarely a good idea. What happens if we have three
optional things, what do you set the version to in order to get
maximum compatibility?

For the scenario you describe it is much better for qemu to call
VFIO_DEVICE_MIG_ARC_SUPPORTED on every single transition it intends to
use when it first opens the device. If any fail then it can deem the
device as having some future ABI and refuse to use it with migration.

> So I think this ends up being a poor example of how to extend the uAPI.
> An opt-out for part of the base specification is hard, it's much easier
> to opt-in P2P as a feature.

I'm not sure I understand this 'base specification'. 

My remark was how we took current qemu as an ABI added P2P to the
specification and defined it in a way that is naturally backwards
compatible and is still well specified.

> > +enum {
> > +	VFIO_MIG_FSM_MAX_STATE = VFIO_DEVICE_STATE_RUNNING_P2P + 1,
> > +};
> 
> We have existing examples of using VFIO_foo_NUM_bar in the vfio uAPI
> already, let's follow that lead.

OK
 
> > +/*
> > + * vfio_mig_get_next_state - Compute the next step in the FSM
> > + * @cur_fsm - The current state the device is in
> > + * @new_fsm - The target state to reach
> > + *
> > + * Return the next step in the state progression between cur_fsm and
> > + * new_fsm. This breaks down requests for complex transitions into
> > + * smaller steps and returns the next step to get to new_fsm. The
> > + * function may need to be called up to four times before reaching new_fsm.
> > + *
> > + * VFIO_DEVICE_STATE_ERROR is returned if the state transition is not allowed.
> > + */
> > +static u32 vfio_mig_get_next_state(u32 cur_fsm, u32 new_fsm)
> 
> Can we use a typedef somewhere for type checking?

An earlier draft of this used a typed enum, lets go back to that
then. It just seemed odd because the uapi struct comes in as u32 and
at some point it has to get casted to the enum.

> > +	/*
> > +	 * Make a best effort to restore things back to where we started.
> > +	 */
> > +	while (*cur_state != starting_state) {
> > +		next_state =
> > +			vfio_mig_get_next_state(*cur_state, starting_state);
> > +		if (next_state == VFIO_DEVICE_STATE_ERROR ||
> > +		    device->ops->migration_step_device_state(device,
> > +							     next_state)) {
> > +			*cur_state = VFIO_DEVICE_STATE_ERROR;
> > +			break;
> > +		}
> > +		*cur_state = next_state;
> > +	}
> 
> Core code "transitioning" the device to ERROR seems a little suspect
> here.  I guess you're imagining that driver code calling this with an
> pointer to internal state that it also tests on a non-zero return.

Unfortunately, ideally the migration state would be stored in struct
vfio_device, and ideally the way to transition states would be a core
ioctl, not a region thingy.

If it was an ioctl then I'd return a 'needs reset' and exact current
device state.

> Should we just step-device-state to ERROR to directly inform the driver?

That is certainly a valid choice, it may eliminate the very ugly
pointer argument too. I will try it out.

> > + *   RUNNING_P2P -> STOP
> > + *   STOP_COPY -> STOP
> > + *     While in STOP the device must stop the operation of the device.  The
> > + *     device must not generate interrupts, DMA, or advance its internal
> > + *     state. When stopped the device and kernel migration driver must accept
> > + *     and respond to interaction to support external subsystems in the
> > + *     STOP state, for example PCI MSI-X and PCI config pace. Failure by
> > + *     the user to restrict device access while in STOP must not result in
> > + *     error conditions outside the user context (ex. host system faults).
> > + *
> > + *     The STOP_COPY arc will close the data window.
> 
> These data window "sessions" should probably be described as a concept
> first.

Er, OK.. I think it is all well described now, your new data window
language is much clearer, but lets try a small paragraph at the start
linking the data window to the FSM..

> > + *   RESUMING -> STOP
> > + *     Leaving RESUMING closes the migration region data window and indicates
> > + *     the end of a resuming session for the device.  The kernel migration
> > + *     driver should complete the incorporation of data written to the
> > + *     migration data window into the device internal state and perform final
> > + *     validity and consistency checking of the new device state.  If the user
> > + *     provided data is found to be incomplete, inconsistent, or otherwise
> > + *     invalid, the migration driver must indicate a write(2) error and
> > + *     optionally go to the ERROR state as described below. The data window
> > + *     is closed.
> > + *
> > + *     While in STOP the device has the same behavior as other STOP states
> > + *     described above.
> 
> There is one STOP state, invariant of how the device arrives to it.

In FSM theory we have the idea of Moore and Mealy FSM designs, this
description is following Mealy principles because that best fits the
actual driver implementation.

Paraphrashed, the distinction is that Moore is strictly about the
current state, while Mealy considers the arc by witch the state was
reached.

What I expect is that the device will exhibit a Moore kind of behavior
- eg STOP is STOP, but the driver implementation must be fully
Mealy. Look at how mlx5 is setup, each and every test is looking at an
arc.

This is because the arc encodes important information,
PRE_COPY -> RUNNING has a very different driver implementation than
RUNNING_P2P -> RUNNING even though they both end up in RUNNING.

Yes, the device should exhibit the same continuous behavior while in
the RUNNING state, but we can still describe that behavior by only
documenting arcs.

If you recall I tried to mix Moore and Mealy descriptions in my last
effort and everyone thought that was really confusing, I don't think
this will fare any better.

In this case, RESUMING -> STOP is different than any other way to
reach STOP, and it does trigger special driver behavior that needed a
paragraph to describe. The difference only exists while executing this
arc, and then STOP is STOP.

I think sticking to one FSM style for the documentation is the right
answer here. How about instead of just listing the states at the very
start we have short list with a one line summary of the state?

> The reader can infer from the above statement that the behavior of a
> state may be flexible to how the device arrives at the state unless
> otherwise specified.

There are two different things, the continuous behavior while in a
state, which is not flexible, and the actions the device takes, which
is. Go back to my last document to see how this looks when we try to
split them and spell them out separately.

Anyhow, it seems you are fine with the approach. Yishai can make the
changes you noted and lets go forward next week.

Thanks,
Jason
Cornelia Huck Jan. 19, 2022, 11:40 a.m. UTC | #4
On Tue, Jan 18 2022, Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Jan 18, 2022 at 12:55:22PM -0700, Alex Williamson wrote:
>> At some point later hns support is ready, it supports the migration
>> region, but migration fails with all existing userspace written to the
>> below spec.  I can't imagine that a device advertising migration, but it
>> being essentially guaranteed to fail is a viable condition and we can't
>> retroactively add this proposed ioctl to existing userspace binaries.
>> I think our recourse here would be to rev the migration sub-type again
>> so that userspace that doesn't know about devices that lack P2P won't
>> enable migration support.
>
> Global versions are rarely a good idea. What happens if we have three
> optional things, what do you set the version to in order to get
> maximum compatibility?
>
> For the scenario you describe it is much better for qemu to call
> VFIO_DEVICE_MIG_ARC_SUPPORTED on every single transition it intends to
> use when it first opens the device. If any fail then it can deem the
> device as having some future ABI and refuse to use it with migration.

Userspace having to discover piecemeal what is and what isn't supported
does not sound like a very good idea. It should be able to figure that
out in one go.

>
>> So I think this ends up being a poor example of how to extend the uAPI.
>> An opt-out for part of the base specification is hard, it's much easier
>> to opt-in P2P as a feature.
>
> I'm not sure I understand this 'base specification'. 
>
> My remark was how we took current qemu as an ABI added P2P to the
> specification and defined it in a way that is naturally backwards
> compatible and is still well specified.

I agree with Alex that this approach, while clever, is not a good way to
extend the uapi.

What about leaving the existing migration region alone (in order to not
break whatever exists out there) and add a v2 migration region that
defines a base specification (the mandatory part that everyone must
support) and a capability mechanism to allow for extensions like P2P?
The base specification should really only contain what everybody can and
will need to implement; if we know that mlx5 will need more, we simply
need to define those additional features right from the start.

(I do not object to using a FSM for describing the state transitions; I
have not reviewed it so far.)
Jason Gunthorpe Jan. 19, 2022, 12:44 p.m. UTC | #5
On Wed, Jan 19, 2022 at 12:40:43PM +0100, Cornelia Huck wrote:
> On Tue, Jan 18 2022, Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Tue, Jan 18, 2022 at 12:55:22PM -0700, Alex Williamson wrote:
> >> At some point later hns support is ready, it supports the migration
> >> region, but migration fails with all existing userspace written to the
> >> below spec.  I can't imagine that a device advertising migration, but it
> >> being essentially guaranteed to fail is a viable condition and we can't
> >> retroactively add this proposed ioctl to existing userspace binaries.
> >> I think our recourse here would be to rev the migration sub-type again
> >> so that userspace that doesn't know about devices that lack P2P won't
> >> enable migration support.
> >
> > Global versions are rarely a good idea. What happens if we have three
> > optional things, what do you set the version to in order to get
> > maximum compatibility?
> >
> > For the scenario you describe it is much better for qemu to call
> > VFIO_DEVICE_MIG_ARC_SUPPORTED on every single transition it intends to
> > use when it first opens the device. If any fail then it can deem the
> > device as having some future ABI and refuse to use it with migration.
> 
> Userspace having to discover piecemeal what is and what isn't supported
> does not sound like a very good idea. It should be able to figure that
> out in one go.

Why? Are you worried about performance?

> >> So I think this ends up being a poor example of how to extend the uAPI.
> >> An opt-out for part of the base specification is hard, it's much easier
> >> to opt-in P2P as a feature.
> >
> > I'm not sure I understand this 'base specification'. 
> >
> > My remark was how we took current qemu as an ABI added P2P to the
> > specification and defined it in a way that is naturally backwards
> > compatible and is still well specified.
> 
> I agree with Alex that this approach, while clever, is not a good way to
> extend the uapi.

Again, why? It is fully backwards and forwards compatable, what
exactly is not a good idea here?
 
> What about leaving the existing migration region alone (in order to not
> break whatever exists out there) and add a v2 migration region that
> defines a base specification (the mandatory part that everyone must
> support) and a capability mechanism to allow for extensions like
> P2P?

No, that is a huge mess, now we have to support multiple regions in
every driver, and we still don't have any clean way forward if we want
to make more changes.

Replacing everything is a *failure* from a uABI compatability
perspective, those kinds of suggestions are completely out of sync
with how Linux Kenerl approaches this.

> The base specification should really only contain what everybody can and
> will need to implement; if we know that mlx5 will need more, we simply
> need to define those additional features right from the start.

Again, why? And what is a "base specification" anyhow?
 
Jason
Jason Gunthorpe Jan. 19, 2022, 1:18 p.m. UTC | #6
On Tue, Jan 18, 2022 at 12:55:22PM -0700, Alex Williamson wrote:

> be dead code.  The specification below doesn't propose the
> ARC_SUPPORTED ioctl and userspace is implemented with some vague notion
> that these states may not be implemented, but has no means to test.

BTW, to be clear, we are proposing to include the ARC_SUPPORTED ioctl
immediately, and adopt it in qemu when we implement the P2P support
there. It was just split into a seperate patch for review.

If qemu will attempt to support the optional non-P2P mode (without a
supporting driver yet existing) or simply fail to support migration if
P2P arcs are not supported is up to the qemu team.

It doesn't matter which choice qemu makes to the kernel series.

I also don't care if we write this patch with P2P described as
non-optional - so long as we all understand that when hns's driver
comes along and cannot implement the P2P stuff we have a fully
compatible upgrade path that does not change the migration region
version.

Jason
Jason Gunthorpe Jan. 19, 2022, 1:42 p.m. UTC | #7
On Wed, Jan 19, 2022 at 08:44:32AM -0400, Jason Gunthorpe wrote:

> > What about leaving the existing migration region alone (in order to not
> > break whatever exists out there) and add a v2 migration region that
> > defines a base specification (the mandatory part that everyone must
> > support) and a capability mechanism to allow for extensions like
> > P2P?

Actually, I misunderstood your remark, I think.

The ARC_SUPPORTED *is* the capability mechanism you are asking for. 

It is naturally defined in terms of the thing we are querying instead
of being an 'capability bit'.

It would be reasonable to define bundles of arcs, eg if any of these
are supported then all of them must be supported:

        PRE_COPY -> RUNNING
        PRE_COPY -> STOP_COPY
        RESUMING -> STOP
        RUNNING -> PRE_COPY
        RUNNING -> STOP
        STOP -> RESUMING
        STOP -> RUNNING
        STOP -> STOP_COPY
        STOP_COPY -> STOP

(And since we already defined this as mandatory already, it must
succeed)

And similar for P2P, if any are supported all must be supported

        PRE_COPY -> PRE_COPY_P2P
        PRE_COPY_P2P -> PRE_COPY
        PRE_COPY_P2P -> RUNNING_P2P
        PRE_COPY_P2P -> STOP_COPY
        RUNNING -> RUNNING_P2P
        RUNNING_P2P -> PRE_COPY_P2P
        RUNNING_P2P -> RUNNING
        RUNNING_P2P -> STOP
        STOP -> RUNNING_P2P

        [Plus the frst group]

This is pretty much the intention anyhow, even if it is was not
fully written down.

Which means that qemu needs to do one ARC_SUPPORTED call to determine
if it should use the P2P arcs or not.

We also have the possible STOP_COPY -> PRE_COPY scenario Alex thought
about which fits nicely here as well.

I can't see a good reason to use capability flags to represent the
same thing, that is less precise and a bit more obfuscated, IMHO. But
it doesn't really matter either way - it expresses the same idea. We
used a cap flag in the prior attempt for NDMA already, it isn't a big
change.

Please lets just pick the colour of this bike shed and move on.

Thanks,
Jason
Jason Gunthorpe Jan. 19, 2022, 2:59 p.m. UTC | #8
On Tue, Jan 18, 2022 at 05:00:48PM -0400, Jason Gunthorpe wrote:
> > Core code "transitioning" the device to ERROR seems a little suspect
> > here.  I guess you're imagining that driver code calling this with an
> > pointer to internal state that it also tests on a non-zero return.
> 
> Unfortunately, ideally the migration state would be stored in struct
> vfio_device, and ideally the way to transition states would be a core
> ioctl, not a region thingy.
> 
> If it was an ioctl then I'd return a 'needs reset' and exact current
> device state.
> 
> > Should we just step-device-state to ERROR to directly inform the driver?
> 
> That is certainly a valid choice, it may eliminate the very ugly
> pointer argument too. I will try it out.

It looks more poor unfortunately.

The pointer has the second purpose of allowing the core code to know
when the driver goes to ERROR if it returned -errno. If we get rid of
this then the core code's idea of device state becomes desync'd with
the driver's version and it will start doing nonsense things like
invoking cur_state = ERROR. Currently the core code protects the
driver from seeing those kinds of arcs.

Second, the pointer is consolidating the code to update new state only
upon success of the driver's function - without this we have to open
code this in every driver, it increases the LOC of the mlx5
migration_step_device_state() by about 30% - though it is not
complicated new code.

(realy the whole pointer is some hacky approach to avoid putting the
device_state enum in struct vfio_device, and maybe we should just do
that in the first place)

Since the scheme has the core code update the current state's storage,
and I don't really want to undo that, adding a call to ERROR is just
dead core code at this point as mlx5 doesn't do anything with it.

This is still a reasonable idea, but lets do it when a driver comes
along to implement something for the ERROR arc. It can include this:

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index c02e057b87cd3c..913bf02946b832 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1800,6 +1800,7 @@ int vfio_mig_set_device_state(struct vfio_device *device, u32 target_state,
 	     !vfio_mig_in_saving_group(*cur_state)) ||
 	    (starting_state == VFIO_DEVICE_STATE_RESUMING &&
 	     *cur_state != VFIO_DEVICE_STATE_RESUMING)) {
+		device->ops->migration_step_device_state(device, VFIO_DEVICE_STATE_ERROR);
 		*cur_state = VFIO_DEVICE_STATE_ERROR;
 		return ret;
 	}
@@ -1813,7 +1814,11 @@ int vfio_mig_set_device_state(struct vfio_device *device, u32 target_state,
 		if (next_state == VFIO_DEVICE_STATE_ERROR ||
 		    device->ops->migration_step_device_state(device,
 							     next_state)) {
-			*cur_state = VFIO_DEVICE_STATE_ERROR;
+			if (*cur_state != VFIO_DEVICE_STATE_ERROR) {
+				device->ops->migration_step_device_state(
+					device, VFIO_DEVICE_STATE_ERROR);
+				*cur_state = VFIO_DEVICE_STATE_ERROR;
+			}
 			break;
 		}
 		*cur_state = next_state;

I have a feeling this might make sense for mdev based drivers that
implement a SW reset.

Jason
Alex Williamson Jan. 19, 2022, 3:32 p.m. UTC | #9
On Tue, 18 Jan 2022 17:00:48 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Jan 18, 2022 at 12:55:22PM -0700, Alex Williamson wrote:
> > On Fri, 14 Jan 2022 15:35:14 -0400
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > Clarify how the migration API works by recasting its specification from a
> > > bunch of bits into a formal FSM. This describes the same functional
> > > outcome, with no practical ABI change.  
> > 
> > I don't understand why we're so reluctant to drop the previous
> > specification and call this v2.   
> 
> I won't object - but I can't say it is really necessary.
> 
> > Yes, it's clever that the enum for the FSM matches the bit
> > definitions, but we're also inserting previously invalid states as a
> > standard part of the device flow... (see below)  
> 
> This is completely invisible to userspace, if userspace never writes
> the new states to device_state it can never read them back.
> 
> > > This is RFC because the full series is not fully tested yet, that should be
> > > done next week. The series can be previewed here:
> > > 
> > >   https://github.com/jgunthorpe/linux/commits/mlx5_vfio_pci
> > > 
> > > The mlx5 implementation of the state change:
> > > 
> > >   https://github.com/jgunthorpe/linux/blob/0a6416da226fe8ee888aa8026f1e363698e137a8/drivers/vfio/pci/mlx5/main.c#L264
> > > 
> > > Has turned out very clean. Compare this to the v5 version, which is full of
> > > subtle bugs:
> > > 
> > >   https://lore.kernel.org/kvm/20211027095658.144468-12-yishaih@nvidia.com/
> > > 
> > > This patch adds the VFIO_DEVICE_MIG_ARC_SUPPORTED ioctl:
> > > 
> > >   https://github.com/jgunthorpe/linux/commit/c92eff6c2afd1ecc9ed5c67a1f81c7f270f6e940
> > > 
> > > And this shows how the Huawei driver should opt out of P2P arcs:
> > > 
> > >   https://github.com/jgunthorpe/linux/commit/dd2571c481d27546a33ff4583ce8ad49847fe300  
> > 
> > We've been bitten several times by device support that didn't come to
> > be in this whole vfio migration effort.  
> 
> Which is why this patch is for Huawei not mlx5..
> 
> > At some point later hns support is ready, it supports the migration
> > region, but migration fails with all existing userspace written to the
> > below spec.  I can't imagine that a device advertising migration, but it
> > being essentially guaranteed to fail is a viable condition and we can't
> > retroactively add this proposed ioctl to existing userspace binaries.
> > I think our recourse here would be to rev the migration sub-type again
> > so that userspace that doesn't know about devices that lack P2P won't
> > enable migration support.  
> 
> Global versions are rarely a good idea. What happens if we have three
> optional things, what do you set the version to in order to get
> maximum compatibility?
> 
> For the scenario you describe it is much better for qemu to call
> VFIO_DEVICE_MIG_ARC_SUPPORTED on every single transition it intends to
> use when it first opens the device. If any fail then it can deem the
> device as having some future ABI and refuse to use it with migration.

This misses the point of the thought experiment,
VFIO_DEVICE_MIG_ARC_SUPPORTED is not defined here, it's defined in some
adjacent commit.  The migration region sub-type is not a global
version, it's one of the means we have built in to the device specific
region API for allowing compatibility breaks.  Userspace that only
knows about v1 migration sub-types will not look at v2 sub-types.

> > So I think this ends up being a poor example of how to extend the uAPI.
> > An opt-out for part of the base specification is hard, it's much easier
> > to opt-in P2P as a feature.  
> 
> I'm not sure I understand this 'base specification'. 
> 
> My remark was how we took current qemu as an ABI added P2P to the
> specification and defined it in a way that is naturally backwards
> compatible and is still well specified.

Ok, this is backwards.  In the patch proposed here, ie. what I'm
referring to as the base specification, we're adding new states that
effectively userspace isn't allowed to use because we're trying to
subtly maintain backwards compatibility with existing userspace, but
then in a follow-on patch add a new ioctl that userspace is required to
use to validate state transitions.  Isn't the requirement to use the
new ioctl enough to demand a compatibility break?

If the order was to propose a new FSM uAPI compatible to the existing
bit definitions without the P2P states, then add a new ioctl and P2P
states, and require userspace to use the ioctl to validate support for
those new P2P states, I might be able to swallow that.

The value of maintaining compatibility with the v1 migration sub-type
is essentially nil afaict.  If we consider proprietary drivers with
existing releases that include v1 migration support and hypervisor
vendors that might ignore the experimental nature of QEMU support, I'd
just rather avoid any potential headaches.  In-kernel drivers
should expose a v2 migration sub-type based on the FSM uAPI and probing
ioctl, v1 is deprecated and dropped from QEMU.  Thanks,

Alex
Jason Gunthorpe Jan. 19, 2022, 3:40 p.m. UTC | #10
On Wed, Jan 19, 2022 at 08:32:22AM -0700, Alex Williamson wrote:

> If the order was to propose a new FSM uAPI compatible to the existing
> bit definitions without the P2P states, then add a new ioctl and P2P
> states, and require userspace to use the ioctl to validate support for
> those new P2P states, I might be able to swallow that.

That is what this achieves!

Are you really asking that we have to redo all the docs/etc again just
to split them slightly differently into patches? What benefit is this
make work to anyone?

Jason
Alex Williamson Jan. 19, 2022, 4:06 p.m. UTC | #11
On Wed, 19 Jan 2022 11:40:28 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Jan 19, 2022 at 08:32:22AM -0700, Alex Williamson wrote:
> 
> > If the order was to propose a new FSM uAPI compatible to the existing
> > bit definitions without the P2P states, then add a new ioctl and P2P
> > states, and require userspace to use the ioctl to validate support for
> > those new P2P states, I might be able to swallow that.  
> 
> That is what this achieves!
> 
> Are you really asking that we have to redo all the docs/etc again just
> to split them slightly differently into patches? What benefit is this
> make work to anyone?

Only if you're really set on trying to claim compatibility with the
existing migration sub-type.  The simpler solution is to roll the
arc-supported ioctl into this proposal, bump the sub-type to v2 and
define the v2 uAPI to require this ioctl.  The proposal presented here
does not stand on it's own, it requires the new ioctl.  Those new p2p
states are not really usable without the ioctl.  Seems like we're just
expecting well behaved userspace to ignore them as presented in this
stand alone RFC.  Thanks,

Alex
Jason Gunthorpe Jan. 19, 2022, 4:38 p.m. UTC | #12
On Wed, Jan 19, 2022 at 09:06:14AM -0700, Alex Williamson wrote:
> On Wed, 19 Jan 2022 11:40:28 -0400
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Wed, Jan 19, 2022 at 08:32:22AM -0700, Alex Williamson wrote:
> > 
> > > If the order was to propose a new FSM uAPI compatible to the existing
> > > bit definitions without the P2P states, then add a new ioctl and P2P
> > > states, and require userspace to use the ioctl to validate support for
> > > those new P2P states, I might be able to swallow that.  
> > 
> > That is what this achieves!
> > 
> > Are you really asking that we have to redo all the docs/etc again just
> > to split them slightly differently into patches? What benefit is this
> > make work to anyone?
> 
> Only if you're really set on trying to claim compatibility with the
> existing migration sub-type.  The simpler solution is to roll the
> arc-supported ioctl into this proposal, bump the sub-type to v2 and

How about we just order the arc-supported ioctl patch first, then the
spec revision and include the language about how to use arc-supported
that is currently in the arc-supported ioctl?

I'm still completely mystified why you think we need to bump the
sub-type at all??

If you insist, but I'd like a good reason because I know it is going
to hurt a bunch of people out there. ie can you point at something
that is actually practically incompatible?

Jason
Alex Williamson Jan. 19, 2022, 5:02 p.m. UTC | #13
On Wed, 19 Jan 2022 12:38:21 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Jan 19, 2022 at 09:06:14AM -0700, Alex Williamson wrote:
> > On Wed, 19 Jan 2022 11:40:28 -0400
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > On Wed, Jan 19, 2022 at 08:32:22AM -0700, Alex Williamson wrote:
> > >   
> > > > If the order was to propose a new FSM uAPI compatible to the existing
> > > > bit definitions without the P2P states, then add a new ioctl and P2P
> > > > states, and require userspace to use the ioctl to validate support for
> > > > those new P2P states, I might be able to swallow that.    
> > > 
> > > That is what this achieves!
> > > 
> > > Are you really asking that we have to redo all the docs/etc again just
> > > to split them slightly differently into patches? What benefit is this
> > > make work to anyone?  
> > 
> > Only if you're really set on trying to claim compatibility with the
> > existing migration sub-type.  The simpler solution is to roll the
> > arc-supported ioctl into this proposal, bump the sub-type to v2 and  
> 
> How about we just order the arc-supported ioctl patch first, then the
> spec revision and include the language about how to use arc-supported
> that is currently in the arc-supported ioctl?
> 
> I'm still completely mystified why you think we need to bump the
> sub-type at all??
> 
> If you insist, but I'd like a good reason because I know it is going
> to hurt a bunch of people out there. ie can you point at something
> that is actually practically incompatible?

I'm equally as mystified who is going to break by bumping the sub-type.
QEMU support is experimental and does not properly handle multiple
devices.  I'm only aware of one proprietary driver that includes
migration code, but afaik it's not supported due to the status of QEMU.

Using a new sub-type allows us an opportunity to update QEMU to fully
support this new uAPI without any baggage to maintain support for the
v1 uAPI or risk breaking unknown users.

Minimally QEMU support needs to be marked non-experimental before I
feel like we're really going to "hurt a bunch of people", so it really
ought not to be an issue to revise support to the new uAPI at the same
time.

If a hypervisor vendor has chosen to run with experimental QEMU
support, it's on them to handle long term support and a transition plan
and I think that's also easier to do when it's clear whether the device
is exposing the original migration uAPI or the updated FSM model with
p2p states and an arc-supported ioctl.  Thanks,

Alex
Jason Gunthorpe Jan. 20, 2022, 12:19 a.m. UTC | #14
On Wed, Jan 19, 2022 at 10:02:17AM -0700, Alex Williamson wrote:

> > If you insist, but I'd like a good reason because I know it is going
> > to hurt a bunch of people out there. ie can you point at something
> > that is actually practically incompatible?
> 
> I'm equally as mystified who is going to break by bumping the sub-type.
> QEMU support is experimental and does not properly handle multiple
> devices.  I'm only aware of one proprietary driver that includes
> migration code, but afaik it's not supported due to the status of QEMU.

I do not think "not supported" is accurate

> If a hypervisor vendor has chosen to run with experimental QEMU
> support, it's on them to handle long term support and a transition plan
> and I think that's also easier to do when it's clear whether the device
> is exposing the original migration uAPI or the updated FSM model with
> p2p states and an arc-supported ioctl.  Thanks,

I'm not sure I agree with you on this, but I don't want to get into
qemu politics.

So, OK, I drafted a new series that just replaces the whole v1
protocol. If we are agreed on breaking everything then I'd like to
clean the other troublesome bits too, already we have some future
topics on our radar that will benefit from doing this.

The net result is a fairly stunning removal of ~300 lines of ugly
kernel driver code, which is significant considering the whole mlx5
project is only about 1000 lines.

The general gist is to stop abusing a migration region as a system
call interface and instead define two new migration specific ioctls
(set_state and arc_supported). Data transfer flows over a dedicated FD
created for each transfer session with a clear lifecycle instead of
through the region. qemu will discover the new protocol by issuing the
arc_supported ioctl. (or if we prefer the other shed colour, using the
VFIO_DEVICE_FEATURE ioctl instead of arc_supported)

Aside from being a more unixy interface, an FD can be used with
poll/io_uring/splice/etc and opens up better avenues to optimize for
operating migrations of multiple devices in parallel. It kills a wack
of goofy tricky driver code too.

If you know some reason to be set on the using a region for this then
please share, otherwise we'll look at the qemu work required to update
to this and if it is managable we'll send a RFC.

Thanks,
Jason
Cornelia Huck Jan. 24, 2022, 10:24 a.m. UTC | #15
On Wed, Jan 19 2022, Jason Gunthorpe <jgg@nvidia.com> wrote:

> So, OK, I drafted a new series that just replaces the whole v1
> protocol. If we are agreed on breaking everything then I'd like to
> clean the other troublesome bits too, already we have some future
> topics on our radar that will benefit from doing this.

Can you share something about those "future topics"? It will help us
understand what you are trying to do, and maybe others might be going
into that direction as well.

>
> The net result is a fairly stunning removal of ~300 lines of ugly
> kernel driver code, which is significant considering the whole mlx5
> project is only about 1000 lines.
>
> The general gist is to stop abusing a migration region as a system
> call interface and instead define two new migration specific ioctls
> (set_state and arc_supported). Data transfer flows over a dedicated FD
> created for each transfer session with a clear lifecycle instead of
> through the region. qemu will discover the new protocol by issuing the
> arc_supported ioctl. (or if we prefer the other shed colour, using the
> VFIO_DEVICE_FEATURE ioctl instead of arc_supported)
>
> Aside from being a more unixy interface, an FD can be used with
> poll/io_uring/splice/etc and opens up better avenues to optimize for
> operating migrations of multiple devices in parallel. It kills a wack
> of goofy tricky driver code too.

Cleaner code certainly sounds compelling. It will be easier to review a
more concrete proposal, though, so I'll reserve judgment until then.
Jason Gunthorpe Jan. 24, 2022, 5:57 p.m. UTC | #16
On Mon, Jan 24, 2022 at 11:24:32AM +0100, Cornelia Huck wrote:
> On Wed, Jan 19 2022, Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > So, OK, I drafted a new series that just replaces the whole v1
> > protocol. If we are agreed on breaking everything then I'd like to
> > clean the other troublesome bits too, already we have some future
> > topics on our radar that will benefit from doing this.
> 
> Can you share something about those "future topics"? It will help us
> understand what you are trying to do, and maybe others might be going
> into that direction as well.

We are concerned that the region API has no way to notify userspace
that it has data ready. We discussed this before and Alex was thinking
qemu should be busy looping, but we are expecting to have many devices
in a VM at any time and this seems inefficient.

eg currently it looks like qemu will enter STOP_COPY serially on every
device, and for something like mlx5 this means it sits around doing
nothing while the snapshot is prepared.

It would be better if qemu put all the devices into STOP_COPY and let
them run their work in the background then use poll() to wait for data
to come out. Then we can parallelize all the device steps and support
a model where we the device is streaming the STOP_COPY data slower
than the CPU can consume it, which we are also thinking about for a
future mlx5 revision.

Basically all of this is to speed up migration in for cases with
multiple STOP_COPY type devices.

From what I can see qemu doesn't have the event loop infrastructure to
support this in migration, but we can get the kernel side setup as
part of the simplification process.

> > Aside from being a more unixy interface, an FD can be used with
> > poll/io_uring/splice/etc and opens up better avenues to optimize for
> > operating migrations of multiple devices in parallel. It kills a wack
> > of goofy tricky driver code too.
> 
> Cleaner code certainly sounds compelling. It will be easier to review a
> more concrete proposal, though, so I'll reserve judgment until then.

Sure, we have patches now, just going through testing steps. 

A full series should be posted in the next few days, but if you want
to look ahead:

https://github.com/jgunthorpe/linux/commits/for-yishai

We have also made the matching qemu changes.

Thanks,
Jason
Tian, Kevin Jan. 25, 2022, 3:55 a.m. UTC | #17
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Saturday, January 15, 2022 3:35 AM
> + *
> + *   The peer to peer (P2P) quiescent state is intended to be a quiescent
> + *   state for the device for the purposes of managing multiple devices
> within
> + *   a user context where peer-to-peer DMA between devices may be active.
> The
> + *   PRE_COPY_P2P and RUNNING_P2P states must prevent the device from
> + *   initiating any new P2P DMA transactions. If the device can identify P2P
> + *   transactions then it can stop only P2P DMA, otherwise it must stop all
> + *   DMA.  The migration driver must complete any such outstanding
> operations
> + *   prior to completing the FSM arc into either P2P state.
> + *

Now NDMA is renamed to P2P... but we did discuss the potential
usage of using this state on devices which cannot stop DMA quickly 
thus needs to drain pending page requests which further requires 
running vCPUs if the fault is on guest I/O page table.

If you didn't change the mind isn't NDMA a more general name
giving multiple usages around it? Or do you intend the vPRI case
to introduce another new state now?

Thanks
Kevin
Jason Gunthorpe Jan. 25, 2022, 1:11 p.m. UTC | #18
On Tue, Jan 25, 2022 at 03:55:31AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Saturday, January 15, 2022 3:35 AM
> > + *
> > + *   The peer to peer (P2P) quiescent state is intended to be a quiescent
> > + *   state for the device for the purposes of managing multiple devices
> > within
> > + *   a user context where peer-to-peer DMA between devices may be active.
> > The
> > + *   PRE_COPY_P2P and RUNNING_P2P states must prevent the device from
> > + *   initiating any new P2P DMA transactions. If the device can identify P2P
> > + *   transactions then it can stop only P2P DMA, otherwise it must stop all
> > + *   DMA.  The migration driver must complete any such outstanding
> > operations
> > + *   prior to completing the FSM arc into either P2P state.
> > + *
> 
> Now NDMA is renamed to P2P... but we did discuss the potential
> usage of using this state on devices which cannot stop DMA quickly 
> thus needs to drain pending page requests which further requires 
> running vCPUs if the fault is on guest I/O page table.

I think this needs to be fleshed out more before we can add it,
ideally along with a driver and some qemu implementation

It looks like the qemu part for this will not be so easy..

Jason
Tian, Kevin Jan. 26, 2022, 1:17 a.m. UTC | #19
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, January 25, 2022 9:12 PM
> 
> On Tue, Jan 25, 2022 at 03:55:31AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Saturday, January 15, 2022 3:35 AM
> > > + *
> > > + *   The peer to peer (P2P) quiescent state is intended to be a quiescent
> > > + *   state for the device for the purposes of managing multiple devices
> > > within
> > > + *   a user context where peer-to-peer DMA between devices may be
> active.
> > > The
> > > + *   PRE_COPY_P2P and RUNNING_P2P states must prevent the device
> from
> > > + *   initiating any new P2P DMA transactions. If the device can identify
> P2P
> > > + *   transactions then it can stop only P2P DMA, otherwise it must stop
> all
> > > + *   DMA.  The migration driver must complete any such outstanding
> > > operations
> > > + *   prior to completing the FSM arc into either P2P state.
> > > + *
> >
> > Now NDMA is renamed to P2P... but we did discuss the potential
> > usage of using this state on devices which cannot stop DMA quickly
> > thus needs to drain pending page requests which further requires
> > running vCPUs if the fault is on guest I/O page table.
> 
> I think this needs to be fleshed out more before we can add it,
> ideally along with a driver and some qemu implementation

Yes. We have internal implementation but it has to be cleaned up
based on this new proposal.

> 
> It looks like the qemu part for this will not be so easy..
> 

My point is that we know that usage in the radar (though it needs more
discussion with real example) then does it make sense to make the 
current name more general? I'm not sure how many devices can figure
out P2P from normal DMAs. If most devices have to stop all DMAs to
meet the requirement, calling it a name about stopping all DMAs doesn't
hurt the current P2P requirement and is more extensible to cover other
stop-dma requirements.

Thanks
Kevin
Jason Gunthorpe Jan. 26, 2022, 1:32 a.m. UTC | #20
On Wed, Jan 26, 2022 at 01:17:26AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Tuesday, January 25, 2022 9:12 PM
> > 
> > On Tue, Jan 25, 2022 at 03:55:31AM +0000, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > Sent: Saturday, January 15, 2022 3:35 AM
> > > > + *
> > > > + *   The peer to peer (P2P) quiescent state is intended to be a quiescent
> > > > + *   state for the device for the purposes of managing multiple devices
> > > > within
> > > > + *   a user context where peer-to-peer DMA between devices may be
> > active.
> > > > The
> > > > + *   PRE_COPY_P2P and RUNNING_P2P states must prevent the device
> > from
> > > > + *   initiating any new P2P DMA transactions. If the device can identify
> > P2P
> > > > + *   transactions then it can stop only P2P DMA, otherwise it must stop
> > all
> > > > + *   DMA.  The migration driver must complete any such outstanding
> > > > operations
> > > > + *   prior to completing the FSM arc into either P2P state.
> > > > + *
> > >
> > > Now NDMA is renamed to P2P... but we did discuss the potential
> > > usage of using this state on devices which cannot stop DMA quickly
> > > thus needs to drain pending page requests which further requires
> > > running vCPUs if the fault is on guest I/O page table.
> > 
> > I think this needs to be fleshed out more before we can add it,
> > ideally along with a driver and some qemu implementation
> 
> Yes. We have internal implementation but it has to be cleaned up
> based on this new proposal.
> 
> > 
> > It looks like the qemu part for this will not be so easy..
> > 
> 
> My point is that we know that usage in the radar (though it needs more
> discussion with real example) then does it make sense to make the 
> current name more general? I'm not sure how many devices can figure
> out P2P from normal DMAs. If most devices have to stop all DMAs to
> meet the requirement, calling it a name about stopping all DMAs doesn't
> hurt the current P2P requirement and is more extensible to cover other
> stop-dma requirements.

Except you are not talking about stopping all DMAs, you are talking
about a state that might hang indefinately waiting for a vPRI to
complete

In my mind this is completely different, and may motivate another
state in the graph

  PRE_COPY -> PRE_COPY_STOP_PRI -> PRE_COPY_STOP_P2P -> STOP_COPY

As STOP_PRI can be defined as halting any new PRIs and always return
immediately.

STOP_P2P can hang if PRI's are open

This affords a pretty clean approach for userspace to conclude the
open PRIs or decide it has to give up the migration.

Theoretical future devices that can support aborting PRI would not use
this state and would have STOP_P2P as also being NO_PRI. On this
device userspace would somehow abort the PRIs when it reaches
STOP_COPY.

Or at least that is one possibility.

In any event, the v2 is built as Alex and Cornelia were suggesting
with a minimal base feature set and two optional extensions for P2P
and PRE_COPY. Adding a 3rd extension for vPRI is completely
reasonable.

Further, from what I can understand devices doing PRI are incompatible
with the base feature set anyhow, as they can not support a RUNNING ->
STOP_COPY transition without, minimally, completing all the open
vPRIs. As VMMs implementing the base protocol should stop the vCPU and
then move the device to STOP_COPY, it is inherently incompatible with
what you are proposing.

The new vPRI enabled protocol would have to superceed the base
protocol and eliminate implicit transitions through the VPRI
maintenance states as these are non-transparent.

It is all stuff we can do in the FSM model, but it all needs a careful
think and a FSM design.

(there is also the interesting question how to even detect this as
vPRI special cases should only even exist if the device was bound to a
PRI capable io page table, so a single device may or may not use this
depending, and at least right now things are assuming these flags are
static at device setup time, so hurm)

Jason
Jason Gunthorpe Jan. 26, 2022, 1:35 a.m. UTC | #21
On Wed, Jan 26, 2022 at 01:17:26AM +0000, Tian, Kevin wrote:

> Yes. We have internal implementation but it has to be cleaned up
> based on this new proposal.

Can you talk more about what this will be?

Does it use precopy?

Can it do NDMA?

Thanks,
Jason
Tian, Kevin Jan. 26, 2022, 1:49 a.m. UTC | #22
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, January 26, 2022 9:33 AM
> 
> On Wed, Jan 26, 2022 at 01:17:26AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Tuesday, January 25, 2022 9:12 PM
> > >
> > > On Tue, Jan 25, 2022 at 03:55:31AM +0000, Tian, Kevin wrote:
> > > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > > Sent: Saturday, January 15, 2022 3:35 AM
> > > > > + *
> > > > > + *   The peer to peer (P2P) quiescent state is intended to be a
> quiescent
> > > > > + *   state for the device for the purposes of managing multiple
> devices
> > > > > within
> > > > > + *   a user context where peer-to-peer DMA between devices may be
> > > active.
> > > > > The
> > > > > + *   PRE_COPY_P2P and RUNNING_P2P states must prevent the device
> > > from
> > > > > + *   initiating any new P2P DMA transactions. If the device can
> identify
> > > P2P
> > > > > + *   transactions then it can stop only P2P DMA, otherwise it must
> stop
> > > all
> > > > > + *   DMA.  The migration driver must complete any such outstanding
> > > > > operations
> > > > > + *   prior to completing the FSM arc into either P2P state.
> > > > > + *
> > > >
> > > > Now NDMA is renamed to P2P... but we did discuss the potential
> > > > usage of using this state on devices which cannot stop DMA quickly
> > > > thus needs to drain pending page requests which further requires
> > > > running vCPUs if the fault is on guest I/O page table.
> > >
> > > I think this needs to be fleshed out more before we can add it,
> > > ideally along with a driver and some qemu implementation
> >
> > Yes. We have internal implementation but it has to be cleaned up
> > based on this new proposal.
> >
> > >
> > > It looks like the qemu part for this will not be so easy..
> > >
> >
> > My point is that we know that usage in the radar (though it needs more
> > discussion with real example) then does it make sense to make the
> > current name more general? I'm not sure how many devices can figure
> > out P2P from normal DMAs. If most devices have to stop all DMAs to
> > meet the requirement, calling it a name about stopping all DMAs doesn't
> > hurt the current P2P requirement and is more extensible to cover other
> > stop-dma requirements.
> 
> Except you are not talking about stopping all DMAs, you are talking
> about a state that might hang indefinately waiting for a vPRI to
> complete
> 
> In my mind this is completely different, and may motivate another
> state in the graph
> 
>   PRE_COPY -> PRE_COPY_STOP_PRI -> PRE_COPY_STOP_P2P -> STOP_COPY
> 
> As STOP_PRI can be defined as halting any new PRIs and always return
> immediately.

The problem is that on such devices PRIs are continuously triggered
when the driver tries to drain the in-fly requests to enter STOP_P2P
or STOP_COPY. If we simply halt any new PRIs in STOP_PRI, it
essentially implies no migration support for such device.

> 
> STOP_P2P can hang if PRI's are open

In earlier discussions we agreed on a timeout mechanism to avoid such
hang issue.

> 
> This affords a pretty clean approach for userspace to conclude the
> open PRIs or decide it has to give up the migration.
> 
> Theoretical future devices that can support aborting PRI would not use
> this state and would have STOP_P2P as also being NO_PRI. On this
> device userspace would somehow abort the PRIs when it reaches
> STOP_COPY.
> 
> Or at least that is one possibility.
> 
> In any event, the v2 is built as Alex and Cornelia were suggesting
> with a minimal base feature set and two optional extensions for P2P
> and PRE_COPY. Adding a 3rd extension for vPRI is completely
> reasonable.

I'm fine if adding a 3rd extension works. But here imho the requirement
can be translated into that the user expects to stop all DMAs while 
vCPU is running. If PRIs are triggered in that operation, then it will be 
handled by the running vCPU. If any corner case blocks it, the timeout
mechanism allows aborting the migration process.

> 
> Further, from what I can understand devices doing PRI are incompatible
> with the base feature set anyhow, as they can not support a RUNNING ->
> STOP_COPY transition without, minimally, completing all the open
> vPRIs. As VMMs implementing the base protocol should stop the vCPU and
> then move the device to STOP_COPY, it is inherently incompatible with
> what you are proposing.

My understanding is that STOP_P2P is entered before stopping vCPU.
If that state can be extended for STOP_DMA, then it's compatible.

> 
> The new vPRI enabled protocol would have to superceed the base
> protocol and eliminate implicit transitions through the VPRI
> maintenance states as these are non-transparent.
> 
> It is all stuff we can do in the FSM model, but it all needs a careful
> think and a FSM design.
> 
> (there is also the interesting question how to even detect this as
> vPRI special cases should only even exist if the device was bound to a
> PRI capable io page table, so a single device may or may not use this
> depending, and at least right now things are assuming these flags are
> static at device setup time, so hurm)
> 

Need more thinking on this part.

Thanks
Kevin
Tian, Kevin Jan. 26, 2022, 1:58 a.m. UTC | #23
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, January 26, 2022 9:36 AM
> 
> On Wed, Jan 26, 2022 at 01:17:26AM +0000, Tian, Kevin wrote:
> 
> > Yes. We have internal implementation but it has to be cleaned up
> > based on this new proposal.
> 
> Can you talk more about what this will be?
> 
> Does it use precopy?
> 
> Can it do NDMA?
> 

Current implementation is more like a hack.

No precopy as the state of that device is small.

It simply switches the order of stopping vcpu and clearing the device
RUNNING bit. In existing migration protocol DMA is stopped when 
the RUNNING bit is cleared. As long as vCPU is running all PRIs generated
when draining in-fly requests can be completed. That prototype doesn't
consider hostile guest or P2P deadlock.

With the new state machine this can be addressed by a general NDMA
state (with vCPU running) plus a timeout mechanism.

Thanks
Kevin
Jason Gunthorpe Jan. 26, 2022, 12:14 p.m. UTC | #24
On Wed, Jan 26, 2022 at 01:49:09AM +0000, Tian, Kevin wrote:

> > As STOP_PRI can be defined as halting any new PRIs and always return
> > immediately.
> 
> The problem is that on such devices PRIs are continuously triggered
> when the driver tries to drain the in-fly requests to enter STOP_P2P
> or STOP_COPY. If we simply halt any new PRIs in STOP_PRI, it
> essentially implies no migration support for such device.
 
So what can this HW even do? It can't immediately stop and disable its
queues?

Are you sure it can support migration?
 
> > STOP_P2P can hang if PRI's are open
> 
> In earlier discussions we agreed on a timeout mechanism to avoid such
> hang issue.

It is very ugly, ideally I'd prefer the userspace to handle the
timeout policy..

> > with the base feature set anyhow, as they can not support a RUNNING ->
> > STOP_COPY transition without, minimally, completing all the open
> > vPRIs. As VMMs implementing the base protocol should stop the vCPU and
> > then move the device to STOP_COPY, it is inherently incompatible with
> > what you are proposing.
> 
> My understanding is that STOP_P2P is entered before stopping vCPU.
> If that state can be extended for STOP_DMA, then it's compatible.

Well, it hasn't been coded yet, but this isn't strictly required to
achieve its purpose..

Jason
Jason Gunthorpe Jan. 26, 2022, 3:33 p.m. UTC | #25
On Wed, Jan 26, 2022 at 08:14:47AM -0400, Jason Gunthorpe wrote:
> > > with the base feature set anyhow, as they can not support a RUNNING ->
> > > STOP_COPY transition without, minimally, completing all the open
> > > vPRIs. As VMMs implementing the base protocol should stop the vCPU and
> > > then move the device to STOP_COPY, it is inherently incompatible with
> > > what you are proposing.
> > 
> > My understanding is that STOP_P2P is entered before stopping vCPU.
> > If that state can be extended for STOP_DMA, then it's compatible.
> 
> Well, it hasn't been coded yet, but this isn't strictly required to
> achieve its purpose..

Sorry, this isn't quite clear

The base v2 protocol specified RUNNING -> STOP(_COPY) as the only FSM
arc, and due to the definition of STOP this must be done with the vCPU
suspended.

So, this vPRI you are talking about simply cannot implement this, and
is incompatible with the base protocol that requires it.

Thus we have a device in this mode unable to do certain FSM
transitions, like RUNNING -> STOP and should block them.

On the other hand, the P2P stuff is deliberately compatible and due to
this there will be cases where RUNNING_P2P/etc can logically occur
with vCPU halted.

So.. this vPRI requirement is quite a big deviation. We can certainly
handle it inside the FSM framework, but it doesn't seem backward
compatible. I wouldn't worry too much about defining it now at least

Jason
Tian, Kevin Jan. 27, 2022, 12:38 a.m. UTC | #26
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, January 26, 2022 11:33 PM
> 
> On Wed, Jan 26, 2022 at 08:14:47AM -0400, Jason Gunthorpe wrote:
> > > > with the base feature set anyhow, as they can not support a RUNNING ->
> > > > STOP_COPY transition without, minimally, completing all the open
> > > > vPRIs. As VMMs implementing the base protocol should stop the vCPU
> and
> > > > then move the device to STOP_COPY, it is inherently incompatible with
> > > > what you are proposing.
> > >
> > > My understanding is that STOP_P2P is entered before stopping vCPU.
> > > If that state can be extended for STOP_DMA, then it's compatible.
> >
> > Well, it hasn't been coded yet, but this isn't strictly required to
> > achieve its purpose..
> 
> Sorry, this isn't quite clear
> 
> The base v2 protocol specified RUNNING -> STOP(_COPY) as the only FSM
> arc, and due to the definition of STOP this must be done with the vCPU
> suspended.
> 
> So, this vPRI you are talking about simply cannot implement this, and
> is incompatible with the base protocol that requires it.
> 
> Thus we have a device in this mode unable to do certain FSM
> transitions, like RUNNING -> STOP and should block them.

From this angle, yes.

> 
> On the other hand, the P2P stuff is deliberately compatible and due to
> this there will be cases where RUNNING_P2P/etc can logically occur
> with vCPU halted.
> 
> So.. this vPRI requirement is quite a big deviation. We can certainly
> handle it inside the FSM framework, but it doesn't seem backward
> compatible. I wouldn't worry too much about defining it now at least
> 

Now I see your point. Yes, we have to do some incompatible way to
support vPRI. In the end we need part of arc in FSM can run with 
active vCPUs.

Thanks
Kevin
Jason Gunthorpe Jan. 27, 2022, 12:48 a.m. UTC | #27
On Thu, Jan 27, 2022 at 12:38:24AM +0000, Tian, Kevin wrote:
> > So.. this vPRI requirement is quite a big deviation. We can certainly
> > handle it inside the FSM framework, but it doesn't seem backward
> > compatible. I wouldn't worry too much about defining it now at least
> 
> Now I see your point. Yes, we have to do some incompatible way to
> support vPRI. In the end we need part of arc in FSM can run with 
> active vCPUs.

I still think the right answer is a new state that stops new PRIs from
coming, I'm just not sure what that means. If the device can't
actually stop PRIs until it has completed PRIs - that is pretty messed
up - but it means at least we have this weird PRI state that might
timeout on some devices, and better devices might immediately return.

This is even quite possibly the same HW function as NDMA..

Anyhow, due to this discussion I redid our v2 draft to use cap bits
instead of the arc_supported ioctl, as it seems like it is more robust
against the notion in future some devices won't even support the basic
mandatory transitions. So we'd turn off old bits and turn on new bits
for these devices..

Jason
Tian, Kevin Jan. 27, 2022, 12:53 a.m. UTC | #28
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, January 26, 2022 8:15 PM
> 
> On Wed, Jan 26, 2022 at 01:49:09AM +0000, Tian, Kevin wrote:
> 
> > > As STOP_PRI can be defined as halting any new PRIs and always return
> > > immediately.
> >
> > The problem is that on such devices PRIs are continuously triggered
> > when the driver tries to drain the in-fly requests to enter STOP_P2P
> > or STOP_COPY. If we simply halt any new PRIs in STOP_PRI, it
> > essentially implies no migration support for such device.
> 
> So what can this HW even do? It can't immediately stop and disable its
> queues?
> 
> Are you sure it can support migration?

It's a draining model thus cannot immediately stop. Instead it has to
wait for in-fly requests to be completed (even not talking about vPRI).

It cannot support mandatory migration, but definitely can support
optional migration per our earlier discussions. Due to unbound time
of completing in-fly requests there is higher likelihood of breaking SLA.
For this case having a way allowing user to specify a timeout would 
be beneficial even for the base arc <RUNNING -> STOP>

> 
> > > STOP_P2P can hang if PRI's are open
> >
> > In earlier discussions we agreed on a timeout mechanism to avoid such
> > hang issue.
> 
> It is very ugly, ideally I'd prefer the userspace to handle the
> timeout policy..

timeout policy is always in userspace. We just need an interface for the user
to communicate it to the kernel. 

Thanks
Kevin
Tian, Kevin Jan. 27, 2022, 1:03 a.m. UTC | #29
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, January 27, 2022 8:48 AM
> 
> On Thu, Jan 27, 2022 at 12:38:24AM +0000, Tian, Kevin wrote:
> > > So.. this vPRI requirement is quite a big deviation. We can certainly
> > > handle it inside the FSM framework, but it doesn't seem backward
> > > compatible. I wouldn't worry too much about defining it now at least
> >
> > Now I see your point. Yes, we have to do some incompatible way to
> > support vPRI. In the end we need part of arc in FSM can run with
> > active vCPUs.
> 
> I still think the right answer is a new state that stops new PRIs from
> coming, I'm just not sure what that means. If the device can't
> actually stop PRIs until it has completed PRIs - that is pretty messed
> up - but it means at least we have this weird PRI state that might
> timeout on some devices, and better devices might immediately return.
> 
> This is even quite possibly the same HW function as NDMA..

We can discuss in detail after the base FSM is merged. With actual code
it'd be easier to identify the right approach of supporting that usage. 
The output from current discussion is that I know what the compatibility
means in current FSM. 
Jason Gunthorpe Jan. 27, 2022, 1:10 a.m. UTC | #30
On Thu, Jan 27, 2022 at 12:53:54AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, January 26, 2022 8:15 PM
> > 
> > On Wed, Jan 26, 2022 at 01:49:09AM +0000, Tian, Kevin wrote:
> > 
> > > > As STOP_PRI can be defined as halting any new PRIs and always return
> > > > immediately.
> > >
> > > The problem is that on such devices PRIs are continuously triggered
> > > when the driver tries to drain the in-fly requests to enter STOP_P2P
> > > or STOP_COPY. If we simply halt any new PRIs in STOP_PRI, it
> > > essentially implies no migration support for such device.
> > 
> > So what can this HW even do? It can't immediately stop and disable its
> > queues?
> > 
> > Are you sure it can support migration?
> 
> It's a draining model thus cannot immediately stop. Instead it has to
> wait for in-fly requests to be completed (even not talking about vPRI).

So, it can't complete draining without completing an unknown number of
vPRIs?

> timeout policy is always in userspace. We just need an interface for the user
> to communicate it to the kernel. 

Can the HW tell if the draining is completed somehow? Ie can it
trigger and eventfd or something?

The v2 API has this nice feature where it can return an FD, so we
could possibly go into a 'stopping PRI' state and that can return an
eventfd for the user to poll on to know when it is OK to move onwards.

That was the sticking point before, we want completing RUNNING_P2P to
mean the device is halted, but vPRI ideally wants to do a background
halting - now we have a way to do that..

Returning to running would abort the draining.

Userspace does the timeout with poll on the event fd..

This also logically justifies why this is not backwards compatabile as
one of the rules in the FSM construction is any arc that can return a
FD must be the final arc.

So, if the FSM seqeunce is

   RUNNING -> RUNNING_STOP_PRI -> RUNNING_STOP_P2P_AND_PRI -> STOP_COPY

Then by the design rules we cannot pass through RUNNING_STOP_PRI
automatically, it must be explicit.

A cap like "running_p2p returns an event fd, doesn't finish until the
VCPU does stuff, and stops pri as well as p2p" might be all that is
required here (and not an actual new state)

It is somewhat bizzaro from a wording perspective, but does
potentially allow qemu to be almost unchanged for the two cases..

Jason
Tian, Kevin Jan. 27, 2022, 1:21 a.m. UTC | #31
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, January 27, 2022 9:11 AM
> 
> On Thu, Jan 27, 2022 at 12:53:54AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Wednesday, January 26, 2022 8:15 PM
> > >
> > > On Wed, Jan 26, 2022 at 01:49:09AM +0000, Tian, Kevin wrote:
> > >
> > > > > As STOP_PRI can be defined as halting any new PRIs and always return
> > > > > immediately.
> > > >
> > > > The problem is that on such devices PRIs are continuously triggered
> > > > when the driver tries to drain the in-fly requests to enter STOP_P2P
> > > > or STOP_COPY. If we simply halt any new PRIs in STOP_PRI, it
> > > > essentially implies no migration support for such device.
> > >
> > > So what can this HW even do? It can't immediately stop and disable its
> > > queues?
> > >
> > > Are you sure it can support migration?
> >
> > It's a draining model thus cannot immediately stop. Instead it has to
> > wait for in-fly requests to be completed (even not talking about vPRI).
> 
> So, it can't complete draining without completing an unknown number of
> vPRIs?

Right.

> 
> > timeout policy is always in userspace. We just need an interface for the
> user
> > to communicate it to the kernel.
> 
> Can the HW tell if the draining is completed somehow? Ie can it
> trigger and eventfd or something?

Yes. Software can specify an interrupt to be triggered when the draining
command is completed.

> 
> The v2 API has this nice feature where it can return an FD, so we
> could possibly go into a 'stopping PRI' state and that can return an
> eventfd for the user to poll on to know when it is OK to move onwards.
> 
> That was the sticking point before, we want completing RUNNING_P2P to
> mean the device is halted, but vPRI ideally wants to do a background
> halting - now we have a way to do that..

this is nice.

> 
> Returning to running would abort the draining.
> 
> Userspace does the timeout with poll on the event fd..

Yes.

> 
> This also logically justifies why this is not backwards compatabile as
> one of the rules in the FSM construction is any arc that can return a
> FD must be the final arc.
> 
> So, if the FSM seqeunce is
> 
>    RUNNING -> RUNNING_STOP_PRI -> RUNNING_STOP_P2P_AND_PRI ->
> STOP_COPY
> 
> Then by the design rules we cannot pass through RUNNING_STOP_PRI
> automatically, it must be explicit.
> 
> A cap like "running_p2p returns an event fd, doesn't finish until the
> VCPU does stuff, and stops pri as well as p2p" might be all that is
> required here (and not an actual new state)
> 
> It is somewhat bizzaro from a wording perspective, but does
> potentially allow qemu to be almost unchanged for the two cases..
> 

let me have more thinking on this part. I need better understanding
of existing design rules before concluding agreement here, though it does
sound like a good signal. 
diff mbox series

Patch

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 735d1d344af9d4..96001f03bc39f1 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1557,6 +1557,242 @@  static int vfio_device_fops_release(struct inode *inode, struct file *filep)
 	return 0;
 }
 
+enum {
+	VFIO_MIG_FSM_MAX_STATE = VFIO_DEVICE_STATE_RUNNING_P2P + 1,
+};
+
+/*
+ * vfio_mig_get_next_state - Compute the next step in the FSM
+ * @cur_fsm - The current state the device is in
+ * @new_fsm - The target state to reach
+ *
+ * Return the next step in the state progression between cur_fsm and
+ * new_fsm. This breaks down requests for complex transitions into
+ * smaller steps and returns the next step to get to new_fsm. The
+ * function may need to be called up to four times before reaching new_fsm.
+ *
+ * VFIO_DEVICE_STATE_ERROR is returned if the state transition is not allowed.
+ */
+static u32 vfio_mig_get_next_state(u32 cur_fsm, u32 new_fsm)
+{
+	/*
+	 * The coding in this table requires the driver to implement 15
+	 * FSM arcs:
+	 *        PRE_COPY -> PRE_COPY_P2P
+	 *        PRE_COPY -> RUNNING
+	 *        PRE_COPY_P2P -> PRE_COPY
+	 *        PRE_COPY_P2P -> RUNNING_P2P
+	 *        PRE_COPY_P2P -> STOP_COPY
+	 *        RESUMING -> STOP
+	 *        RUNNING -> PRE_COPY
+	 *        RUNNING -> RUNNING_P2P
+	 *        RUNNING_P2P -> PRE_COPY_P2P
+	 *        RUNNING_P2P -> RUNNING
+	 *        RUNNING_P2P -> STOP
+	 *        STOP -> RESUMING
+	 *        STOP -> RUNNING_P2P
+	 *        STOP -> STOP_COPY
+	 *        STOP_COPY -> STOP
+	 *
+	 * The coding will step through multiple states for these combination
+	 * transitions:
+	 *        PRE_COPY -> PRE_COPY_P2P -> STOP_COPY
+	 *        PRE_COPY -> RUNNING -> RUNNING_P2P
+	 *        PRE_COPY -> RUNNING -> RUNNING_P2P -> STOP
+	 *        PRE_COPY -> RUNNING -> RUNNING_P2P -> STOP -> RESUMING
+	 *        PRE_COPY_P2P -> RUNNING_P2P -> RUNNING
+	 *        PRE_COPY_P2P -> RUNNING_P2P -> STOP
+	 *        PRE_COPY_P2P -> RUNNING_P2P -> STOP -> RESUMING
+	 *        RESUMING -> STOP -> RUNNING_P2P
+	 *        RESUMING -> STOP -> RUNNING_P2P -> PRE_COPY_P2P
+	 *        RESUMING -> STOP -> RUNNING_P2P -> RUNNING
+	 *        RESUMING -> STOP -> RUNNING_P2P -> RUNNING -> PRE_COPY
+	 *        RESUMING -> STOP -> STOP_COPY
+	 *        RUNNING -> RUNNING_P2P -> PRE_COPY_P2P
+	 *        RUNNING -> RUNNING_P2P -> STOP
+	 *        RUNNING -> RUNNING_P2P -> STOP -> RESUMING
+	 *        RUNNING -> RUNNING_P2P -> STOP -> STOP_COPY
+	 *        RUNNING_P2P -> RUNNING -> PRE_COPY
+	 *        RUNNING_P2P -> STOP -> RESUMING
+	 *        RUNNING_P2P -> STOP -> STOP_COPY
+	 *        STOP -> RUNNING_P2P -> PRE_COPY_P2P
+	 *        STOP -> RUNNING_P2P -> RUNNING
+	 *        STOP -> RUNNING_P2P -> RUNNING -> PRE_COPY
+	 *        STOP_COPY -> STOP -> RESUMING
+	 *        STOP_COPY -> STOP -> RUNNING_P2P
+	 *        STOP_COPY -> STOP -> RUNNING_P2P -> RUNNING
+	 *
+	 *  The following transitions are blocked:
+	 *        STOP_COPY -> PRE_COPY
+	 *        STOP_COPY -> PRE_COPY_P2P
+	 */
+	static const u8 vfio_from_fsm_table[VFIO_MIG_FSM_MAX_STATE][VFIO_MIG_FSM_MAX_STATE] = {
+		[VFIO_DEVICE_STATE_STOP] = {
+			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY,
+			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RESUMING,
+			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
+		},
+		[VFIO_DEVICE_STATE_RUNNING] = {
+			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_PRE_COPY,
+			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
+		},
+		[VFIO_DEVICE_STATE_PRE_COPY] = {
+			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_PRE_COPY,
+			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_PRE_COPY_P2P,
+			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_PRE_COPY_P2P,
+			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
+		},
+		[VFIO_DEVICE_STATE_PRE_COPY_P2P] = {
+			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_PRE_COPY,
+			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_PRE_COPY_P2P,
+			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY,
+			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
+		},
+		[VFIO_DEVICE_STATE_STOP_COPY] = {
+			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_ERROR,
+			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_ERROR,
+			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY,
+			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
+		},
+		[VFIO_DEVICE_STATE_RESUMING] = {
+			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RESUMING,
+			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
+		},
+		[VFIO_DEVICE_STATE_RUNNING_P2P] = {
+			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_PRE_COPY_P2P,
+			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
+		},
+		[VFIO_DEVICE_STATE_ERROR] = {
+			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_ERROR,
+			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_ERROR,
+			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_ERROR,
+			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_ERROR,
+			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_ERROR,
+			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_ERROR,
+			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_ERROR,
+			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
+		},
+	};
+	return vfio_from_fsm_table[cur_fsm][new_fsm];
+}
+
+static bool vfio_mig_in_saving_group(u32 state)
+{
+	return state == VFIO_DEVICE_STATE_PRE_COPY ||
+	       state == VFIO_DEVICE_STATE_PRE_COPY_P2P ||
+	       state == VFIO_DEVICE_STATE_STOP_COPY;
+}
+
+/*
+ * vfio_mig_set_device_state - Change the migration device_state
+ * @device - The VFIO device to act on
+ * @target_device_state - The new state from the uAPI
+ * @cur_state - Pointer to the drivers current migration FSM state
+ *
+ * This validates target_device_state and then calls
+ * ops->migration_step_device_state() enough times to achieve the target state.
+ * See vfio_mig_get_next_state() for the required arcs.
+ *
+ * If the op callback fails then the driver should leave the device state
+ * unchanged and return errno, should this not be possible then it should set
+ * cur_state to VFIO_DEVICE_STATE_ERROR and return errno.
+ *
+ * If a step fails then this attempts to reverse the FSM back to the original
+ * state, should that fail it is set to VFIO_DEVICE_STATE_ERROR and error is
+ * returned.
+ */
+int vfio_mig_set_device_state(struct vfio_device *device, u32 target_state,
+			      u32 *cur_state)
+{
+	u32 starting_state = *cur_state;
+	u32 next_state;
+	int ret;
+
+	if (target_state >= VFIO_MIG_FSM_MAX_STATE)
+		return -EINVAL;
+
+	while (*cur_state != target_state) {
+		next_state = vfio_mig_get_next_state(*cur_state, target_state);
+		if (next_state == VFIO_DEVICE_STATE_ERROR) {
+			ret = -EINVAL;
+			goto out_restore_state;
+		}
+		ret = device->ops->migration_step_device_state(device,
+							       next_state);
+		if (ret)
+			goto out_restore_state;
+		*cur_state = next_state;
+	}
+	return 0;
+
+out_restore_state:
+	if (*cur_state == VFIO_DEVICE_STATE_ERROR)
+		return ret;
+	/*
+	 * If the window as initialized, and we closed the window, then we
+	 * cannot recover the old state.
+	 */
+	if ((vfio_mig_in_saving_group(starting_state) &&
+	     !vfio_mig_in_saving_group(*cur_state)) ||
+	    (starting_state == VFIO_DEVICE_STATE_RESUMING &&
+	     *cur_state != VFIO_DEVICE_STATE_RESUMING)) {
+		*cur_state = VFIO_DEVICE_STATE_ERROR;
+		return ret;
+	}
+
+	/*
+	 * Make a best effort to restore things back to where we started.
+	 */
+	while (*cur_state != starting_state) {
+		next_state =
+			vfio_mig_get_next_state(*cur_state, starting_state);
+		if (next_state == VFIO_DEVICE_STATE_ERROR ||
+		    device->ops->migration_step_device_state(device,
+							     next_state)) {
+			*cur_state = VFIO_DEVICE_STATE_ERROR;
+			break;
+		}
+		*cur_state = next_state;
+	}
+	return ret;
+}
+EXPORT_SYMBOL_GPL(vfio_mig_set_device_state);
+
 static long vfio_device_fops_unl_ioctl(struct file *filep,
 				       unsigned int cmd, unsigned long arg)
 {
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 76191d7abed185..5c96ef5e8d5202 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -69,6 +69,8 @@  struct vfio_device_ops {
 	int	(*mmap)(struct vfio_device *vdev, struct vm_area_struct *vma);
 	void	(*request)(struct vfio_device *vdev, unsigned int count);
 	int	(*match)(struct vfio_device *vdev, char *buf);
+	int     (*migration_step_device_state)(struct vfio_device *device,
+					       u32 next_state);
 };
 
 void vfio_init_group_dev(struct vfio_device *device, struct device *dev,
@@ -82,6 +84,9 @@  extern void vfio_device_put(struct vfio_device *device);
 
 int vfio_assign_device_set(struct vfio_device *device, void *set_id);
 
+int vfio_mig_set_device_state(struct vfio_device *device, u32 target_state,
+			      u32 *cur_state);
+
 /*
  * External user API
  */
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index ef33ea002b0b31..42e0ab905d2df7 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -408,223 +408,278 @@  struct vfio_region_gfx_edid {
 #define VFIO_REGION_SUBTYPE_MIGRATION           (1)
 
 /*
- * The structure vfio_device_migration_info is placed at the 0th offset of
- * the VFIO_REGION_SUBTYPE_MIGRATION region to get and set VFIO device related
- * migration information. Field accesses from this structure are only supported
- * at their native width and alignment. Otherwise, the result is undefined and
- * vendor drivers should return an error.
+ * The structure vfio_device_migration_info is placed at the immediate start
+ * of the per-device VFIO_REGION_SUBTYPE_MIGRATION region to manage the device
+ * state and migration information for the device.  Field accesses for this
+ * structure are only supported using their native width and alignment,
+ * accesses otherwise are undefined, and the kernel migration driver should
+ * return an error.
  *
  * device_state: (read/write)
- *      - The user application writes to this field to inform the vendor driver
- *        about the device state to be transitioned to.
- *      - The vendor driver should take the necessary actions to change the
- *        device state. After successful transition to a given state, the
- *        vendor driver should return success on write(device_state, state)
- *        system call. If the device state transition fails, the vendor driver
- *        should return an appropriate -errno for the fault condition.
- *      - On the user application side, if the device state transition fails,
- *	  that is, if write(device_state, state) returns an error, read
- *	  device_state again to determine the current state of the device from
- *	  the vendor driver.
- *      - The vendor driver should return previous state of the device unless
- *        the vendor driver has encountered an internal error, in which case
- *        the vendor driver may report the device_state VFIO_DEVICE_STATE_ERROR.
- *      - The user application must use the device reset ioctl to recover the
- *        device from VFIO_DEVICE_STATE_ERROR state. If the device is
- *        indicated to be in a valid device state by reading device_state, the
- *        user application may attempt to transition the device to any valid
- *        state reachable from the current state or terminate itself.
- *
- *      device_state consists of 3 bits:
- *      - If bit 0 is set, it indicates the _RUNNING state. If bit 0 is clear,
- *        it indicates the _STOP state. When the device state is changed to
- *        _STOP, driver should stop the device before write() returns.
- *      - If bit 1 is set, it indicates the _SAVING state, which means that the
- *        driver should start gathering device state information that will be
- *        provided to the VFIO user application to save the device's state.
- *      - If bit 2 is set, it indicates the _RESUMING state, which means that
- *        the driver should prepare to resume the device. Data provided through
- *        the migration region should be used to resume the device.
- *      Bits 3 - 31 are reserved for future use. To preserve them, the user
- *      application should perform a read-modify-write operation on this
- *      field when modifying the specified bits.
- *
- *  +------- _RESUMING
- *  |+------ _SAVING
- *  ||+----- _RUNNING
- *  |||
- *  000b => Device Stopped, not saving or resuming
- *  001b => Device running, which is the default state
- *  010b => Stop the device & save the device state, stop-and-copy state
- *  011b => Device running and save the device state, pre-copy state
- *  100b => Device stopped and the device state is resuming
- *  101b => Invalid state
- *  110b => Error state
- *  111b => Invalid state
- *
- * State transitions:
- *
- *              _RESUMING  _RUNNING    Pre-copy    Stop-and-copy   _STOP
- *                (100b)     (001b)     (011b)        (010b)       (000b)
- * 0. Running or default state
- *                             |
- *
- * 1. Normal Shutdown (optional)
- *                             |------------------------------------->|
- *
- * 2. Save the state or suspend
- *                             |------------------------->|---------->|
- *
- * 3. Save the state during live migration
- *                             |----------->|------------>|---------->|
- *
- * 4. Resuming
- *                  |<---------|
- *
- * 5. Resumed
- *                  |--------->|
- *
- * 0. Default state of VFIO device is _RUNNING when the user application starts.
- * 1. During normal shutdown of the user application, the user application may
- *    optionally change the VFIO device state from _RUNNING to _STOP. This
- *    transition is optional. The vendor driver must support this transition but
- *    must not require it.
- * 2. When the user application saves state or suspends the application, the
- *    device state transitions from _RUNNING to stop-and-copy and then to _STOP.
- *    On state transition from _RUNNING to stop-and-copy, driver must stop the
- *    device, save the device state and send it to the application through the
- *    migration region. The sequence to be followed for such transition is given
- *    below.
- * 3. In live migration of user application, the state transitions from _RUNNING
- *    to pre-copy, to stop-and-copy, and to _STOP.
- *    On state transition from _RUNNING to pre-copy, the driver should start
- *    gathering the device state while the application is still running and send
- *    the device state data to application through the migration region.
- *    On state transition from pre-copy to stop-and-copy, the driver must stop
- *    the device, save the device state and send it to the user application
- *    through the migration region.
- *    Vendor drivers must support the pre-copy state even for implementations
- *    where no data is provided to the user before the stop-and-copy state. The
- *    user must not be required to consume all migration data before the device
- *    transitions to a new state, including the stop-and-copy state.
- *    The sequence to be followed for above two transitions is given below.
- * 4. To start the resuming phase, the device state should be transitioned from
- *    the _RUNNING to the _RESUMING state.
- *    In the _RESUMING state, the driver should use the device state data
- *    received through the migration region to resume the device.
- * 5. After providing saved device data to the driver, the application should
- *    change the state from _RESUMING to _RUNNING.
+ *   The device_state field is the current state in a finite state machine
+ *   (FSM) that controls the migration function of the device. A write by the
+ *   user requests the FSM move to the new state. Some general rules govern
+ *   the FSM:
+ *     - The user may read or write the device state register at any time.
+ *     - The kernel migration driver must fully transition the device to the
+ *       new state value before the write(2) operation returns to the user.
+ *     - The kernel migration driver must not generate asynchronous device
+ *       state transitions outside of manipulation by the user or the
+ *       VFIO_DEVICE_RESET ioctl as described below.
+ *     - In the event of a device state transition failure, the kernel
+ *       migration driver must return a write(2) error with appropriate errno
+ *       to the user. Refer to the ERROR arc section below for additional detail
+ *       on error handling.
+ *     - Devices supporting migration via this specification must support the
+ *       VFIO_DEVICE_RESET ioctl and any use of that ioctl must return the
+ *       device migration state to the RUNNING state.
+ *
+ *   There are 6 mandatory states defined as STOP, RUNNING, STOP_COPY,
+ *   PRE_COPY, RESUMING and ERROR plus two optional states PRE_COPY_P2P and
+ *   RESUMING_P2P.
+ *
+ *   When the user writes to the device_state it triggers a transition from
+ *   the current value of device_state to the value written by the user. Some
+ *   transitions are called FSM arcs, and when requested, have the driver
+ *   behavior defined as follows:
+ *
+ *   RUNNING_P2P -> STOP
+ *   STOP_COPY -> STOP
+ *     While in STOP the device must stop the operation of the device.  The
+ *     device must not generate interrupts, DMA, or advance its internal
+ *     state. When stopped the device and kernel migration driver must accept
+ *     and respond to interaction to support external subsystems in the
+ *     STOP state, for example PCI MSI-X and PCI config pace. Failure by
+ *     the user to restrict device access while in STOP must not result in
+ *     error conditions outside the user context (ex. host system faults).
+ *
+ *     The STOP_COPY arc will close the data window.
+ *
+ *   RESUMING -> STOP
+ *     Leaving RESUMING closes the migration region data window and indicates
+ *     the end of a resuming session for the device.  The kernel migration
+ *     driver should complete the incorporation of data written to the
+ *     migration data window into the device internal state and perform final
+ *     validity and consistency checking of the new device state.  If the user
+ *     provided data is found to be incomplete, inconsistent, or otherwise
+ *     invalid, the migration driver must indicate a write(2) error and
+ *     optionally go to the ERROR state as described below. The data window
+ *     is closed.
+ *
+ *     While in STOP the device has the same behavior as other STOP states
+ *     described above.
+ *
+ *     To abort a RESUMING session the device must be reset.
+ *
+ *   PRE_COPY -> RUNNING
+ *   RUNNING_P2P -> RUNNING
+ *     While in RUNNING the device is fully operational, the device may
+ *     generate interrupts, DMA, respond to MMIO, all vfio device regions are
+ *     functional, and the device may advance its internal state.
+ *
+ *     The PRE_COPY arc will close the data window.
+ *
+ *   PRE_COPY_P2P -> RUNNING_P2P
+ *   RUNNING -> RUNNING_P2P
+ *   STOP -> RUNNING_P2P
+ *     While in RUNNING_P2P the device is partially running in the P2P quiescent
+ *     state defined below.
+ *
+ *     The PRE_COPY arc will close the data window.
+ *
+ *   RUNNING -> PRE_COPY
+ *   RUNNING_P2P -> PRE_COPY_P2P
+ *   STOP -> STOP_COPY
+ *     PRE_COPY, PRE_COPY_P2P and STOP_COPY form the "saving group" of states
+ *     which share the data window. Moving between these states alters what is
+ *     streamed in the data window, but does not close or otherwise effect the
+ *     state of the data window.
+ *
+ *     These arcs begin the process of saving the device state. The device
+ *     initializes the migration region data window and associated fields
+ *     within vfio_device_migration_info for capturing the migration data
+ *     stream for the device. The migration driver may perform actions such as
+ *     enabling dirty logging of device state when entering PRE_COPY or
+ *     PER_COPY_P2P.
+ *
+ *     Each arc does not change the device operation, the device remains RUNNING,
+ *     P2P quiesced or in STOP. The STOP_COPY state is described below in
+ *     PRE_COPY_P2P -> STOP_COPY.
+ *
+ *   PRE_COPY -> PRE_COPY_P2P
+ *     Entering PRE_COPY_P2P continues all the behaviors of PRE_COPY above.
+ *     However, while in the PRE_COPY_P2P state, the device is partially
+ *     running in the P2P quiescent state defined below, like RUNNING_P2P.
+ *
+ *   PRE_COPY_P2P -> PRE_COPY
+ *     This arc allows returning the device to a full RUNNING behavior while
+ *     continuing all the behaviors of PRE_COPY.
+ *
+ *   PRE_COPY_P2P -> STOP_COPY
+ *     While in the STOP_COPY state the device has the same behavior as STOP
+ *     with the addition that the data window continues to stream the
+ *     migration state. End of stream indicates the entire device state has
+ *     been transferred as detailed in the data window protocol below.
+ *
+ *     The user should take steps to restrict access to vfio device regions
+ *     other than the migration region while the device is in STOP_COPY or
+ *     risk corruption of the device migration data stream.
+ *
+ *   STOP -> RESUMING
+ *     Entering the RESUMING state enables and initializes the migration
+ *     region data window and associated fields within struct
+ *     vfio_device_migration_info for restoring the device from a migration
+ *     data stream captured from a COPY session with a compatible device. The
+ *     migration driver may alter/reset the internal device state for this arc
+ *     if required to prepare the device to receive the migration data.
+ *
+ *   STOP_COPY -> PRE_COPY
+ *   STOP_COPY -> PRE_COPY_P2P
+ *     These arcs are not permitted and return error if requested. Future
+ *     revisions of this API may define behaviors for these arcs.
+ *
+ *   any -> ERROR
+ *     ERROR cannot be specified as a device state, however any transition
+ *     request can be failed with an errno return and may then move the
+ *     device_state into ERROR. In this case the device was unable to execute
+ *     the requested arc and was also unable to restore the device to the
+ *     original device_state. The device behavior in ERROR can be any
+ *     combination of allowed behaviors from the FSM path between the current
+ *     and requested device_state. The user can detect the ERROR state by
+ *     reading the device_state after receiving an errno failure to a
+ *     device_state write. To recover from ERROR VFIO_DEVICE_RESET must be
+ *     used to return the device_state back to RUNNING.
+ *
+ *   The peer to peer (P2P) quiescent state is intended to be a quiescent
+ *   state for the device for the purposes of managing multiple devices within
+ *   a user context where peer-to-peer DMA between devices may be active. The
+ *   PRE_COPY_P2P and RUNNING_P2P states must prevent the device from
+ *   initiating any new P2P DMA transactions. If the device can identify P2P
+ *   transactions then it can stop only P2P DMA, otherwise it must stop all
+ *   DMA.  The migration driver must complete any such outstanding operations
+ *   prior to completing the FSM arc into either P2P state.
+ *
+ *   The remaining possible transitions are interpreted as combinations of the
+ *   above FSM arcs. As there are multiple paths through the FSM arcs the path
+ *   should be selected based on the following rules:
+ *     - Selet the shortest path
+ *     - The path can not have saving group states as interior arcs, only
+ *       starting/end states.
+ *   Refer to vfio_mig_get_next_state() for the result of the algorithm.
+ *
+ *   The automatic transit through the FSM arcs that make up the combination
+ *   transition is invisible to the user. The operation either succeeds and
+ *   sets device_state to the new value or fails and leaves it at the original
+ *   value or ERROR.
+ *
+ *   The PRE_COPY_P2P and RESUMING_P2P device_states are optional. If a device
+ *   does not support it then the state cannot be written or read from
+ *   device_state. For the purposes of specification, the state continues to
+ *   exist but modified so that the device is fully operational while in
+ *   either state. When not supported the user will have to request
+ *   combination transitions (ie PRE_COPY -> STOP_COPY / RESUMING -> RUNNING)
+ *   to avoid writing the unsupported device_state value.
  *
  * reserved:
- *      Reads on this field return zero and writes are ignored.
+ *   Reads on this field return zero and writes are ignored.
  *
  * pending_bytes: (read only)
- *      The number of pending bytes still to be migrated from the vendor driver.
+ *   The kernel migration driver uses this field to indicate an estimate of
+ *   the remaining data size (in bytes) for the user to copy while the data
+ *   window is used by the saving group of states. The value should be
+ *   considered volatile, especially while in the PRE_COPY[_P2P] states.
+ *   Userspace uses this field to test whether data is available to be read
+ *   from the data section described below.  Userspace should only consider
+ *   whether the value read is zero or non-zero for the purposes of the
+ *   protocol below.  The user may only consider the migration data stream to
+ *   be end of stream when pending_bytes reports a zero value while the device
+ *   is in STOP_COPY.  The kernel migration driver must not consider the state
+ *   of pending_bytes or the data window when executing arcs, and cannot fail
+ *   any arc because the data window has not reached end of stream. The value
+ *   of this field is undefined outside the saving group of states.
  *
  * data_offset: (read only)
- *      The user application should read data_offset field from the migration
- *      region. The user application should read the device data from this
- *      offset within the migration region during the _SAVING state or write
- *      the device data during the _RESUMING state. See below for details of
- *      sequence to be followed.
+ *   This field indicates the offset relative to the start of the device
+ *   migration region for the user to collect (COPY) or store (RESUMING)
+ *   migration data for the device following the protocol described below. The
+ *   migration driver may provide sparse mmap support for the migration region
+ *   and use the data_offset field to direct user accesses as appropriate, but
+ *   must not require mmap access when provided.  The value of this field is
+ *   undefined when device_state is not in the saving group of states.
  *
  * data_size: (read/write)
- *      The user application should read data_size to get the size in bytes of
- *      the data copied in the migration region during the _SAVING state and
- *      write the size in bytes of the data copied in the migration region
- *      during the _RESUMING state.
- *
- * The format of the migration region is as follows:
- *  ------------------------------------------------------------------
- * |vfio_device_migration_info|    data section                      |
- * |                          |     ///////////////////////////////  |
- * ------------------------------------------------------------------
- *   ^                              ^
- *  offset 0-trapped part        data_offset
- *
- * The structure vfio_device_migration_info is always followed by the data
- * section in the region, so data_offset will always be nonzero. The offset
- * from where the data is copied is decided by the kernel driver. The data
- * section can be trapped, mmapped, or partitioned, depending on how the kernel
- * driver defines the data section. The data section partition can be defined
- * as mapped by the sparse mmap capability. If mmapped, data_offset must be
- * page aligned, whereas initial section which contains the
- * vfio_device_migration_info structure, might not end at the offset, which is
- * page aligned. The user is not required to access through mmap regardless
- * of the capabilities of the region mmap.
- * The vendor driver should determine whether and how to partition the data
- * section. The vendor driver should return data_offset accordingly.
- *
- * The sequence to be followed while in pre-copy state and stop-and-copy state
- * is as follows:
- * a. Read pending_bytes, indicating the start of a new iteration to get device
- *    data. Repeated read on pending_bytes at this stage should have no side
- *    effects.
- *    If pending_bytes == 0, the user application should not iterate to get data
- *    for that device.
- *    If pending_bytes > 0, perform the following steps.
- * b. Read data_offset, indicating that the vendor driver should make data
- *    available through the data section. The vendor driver should return this
- *    read operation only after data is available from (region + data_offset)
- *    to (region + data_offset + data_size).
- * c. Read data_size, which is the amount of data in bytes available through
- *    the migration region.
- *    Read on data_offset and data_size should return the offset and size of
- *    the current buffer if the user application reads data_offset and
- *    data_size more than once here.
- * d. Read data_size bytes of data from (region + data_offset) from the
- *    migration region.
- * e. Process the data.
- * f. Read pending_bytes, which indicates that the data from the previous
- *    iteration has been read. If pending_bytes > 0, go to step b.
- *
- * The user application can transition from the _SAVING|_RUNNING
- * (pre-copy state) to the _SAVING (stop-and-copy) state regardless of the
- * number of pending bytes. The user application should iterate in _SAVING
- * (stop-and-copy) until pending_bytes is 0.
- *
- * The sequence to be followed while _RESUMING device state is as follows:
- * While data for this device is available, repeat the following steps:
- * a. Read data_offset from where the user application should write data.
- * b. Write migration data starting at the migration region + data_offset for
- *    the length determined by data_size from the migration source.
- * c. Write data_size, which indicates to the vendor driver that data is
- *    written in the migration region. Vendor driver must return this write
- *    operations on consuming data. Vendor driver should apply the
- *    user-provided migration region data to the device resume state.
- *
- * If an error occurs during the above sequences, the vendor driver can return
- * an error code for next read() or write() operation, which will terminate the
- * loop. The user application should then take the next necessary action, for
- * example, failing migration or terminating the user application.
- *
- * For the user application, data is opaque. The user application should write
- * data in the same order as the data is received and the data should be of
- * same transaction size at the source.
+ *   This field indicates the length of the current data segment in
+ *   bytes. While COPY, the kernel migration driver uses this field to
+ *   indicate to the user the length of the migration data stream available at
+ *   data_offset. When RESUMING, the user writes this field with the length of
+ *   the data segment written at the migration driver provided
+ *   data_offset. The value of this field is undefined when device_state is
+ *   not in the saving group of states.
+ *
+ * The following protocol is used while the device is in the saving group of
+ * states:
+ *
+ * a) The user reads pending_bytes.  If the read value is zero, no data is
+ *    currently available for the device.  If the device is in STOP_COPY and a
+ *    zero value is read, this indicates the end of the device migration
+ *    stream and the device must not generate any new migration data.  If the
+ *    read value is non-zero, the user may proceed to collect device migration
+ *    data in step b).  Repeated reads of pending_bytes is allowed and must
+ *    not compromise the migration data stream provided the user does not
+ *    proceed to the following step.
+ * b) The user reads data_offset, which indicates to the migration driver to
+ *    make a segment of device migration data available to the user at the
+ *    provided offset.  This action commits the user to collect the data
+ *    segment.
+ * c) The user reads data_size to determine the extent of the currently
+ *    available migration data segment.
+ * d) The user collects the data_size segment of device migration data at the
+ *    previously provided data_offset using access methods compatible to those
+ *    for the migration region.  The user must not be required to collect the
+ *    data in a single operation.
+ * e) The user re-reads pending_bytes to indicate to the migration driver that
+ *    the provided data has been collected.  Provided the read pending_bytes
+ *    value is non-zero, the user may proceed directly to step b) for another
+ *    iteration.
+ *
+ * The following protocol is used while the device is in the RESUMING
+ * device_state:
+ *
+ * a) The user reads data_offset, which directs the user to the location
+ *    within the migration region to store the migration data segment.
+ * b) The user writes the migration data segment starting at the provided
+ *    data_offset.  The user must preserve the data segment size as used when
+ *    the segment was collected from the device when COPY.
+ * c) The user writes the data_size field with the number of bytes written to
+ *    the migration region in step b).  The kernel migration driver may use
+ *    this write to indicate the end of the current iteration.
+ * d) User proceeds to step a) so long as the migration data stream is not
+ *    complete.
+ *
+ * The kernel migration driver may indicate an error condition by returning a
+ * fault on read(2) or write(2) for any operation most approximate to the
+ * detection of the error.  Field accesses are provided within the protocol
+ * such that an opportunity exists to return a fault regardless of whether the
+ * data section is directly accessed via an mmap.
+ *
+ * The user must consider the migration data segments to be opaque and
+ * non-fungible. During RESUMING, the data segments must be written in the
+ * size and order as provided during any part of the saving group of states.
  */
 
+enum {
+	VFIO_DEVICE_STATE_STOP = 0,
+	VFIO_DEVICE_STATE_RUNNING = 1,
+	VFIO_DEVICE_STATE_STOP_COPY = 2,
+	VFIO_DEVICE_STATE_PRE_COPY = 3,
+	VFIO_DEVICE_STATE_RESUMING = 4,
+	VFIO_DEVICE_STATE_PRE_COPY_P2P = 5,
+	VFIO_DEVICE_STATE_ERROR = 6,
+	VFIO_DEVICE_STATE_RUNNING_P2P = 7,
+};
+
 struct vfio_device_migration_info {
 	__u32 device_state;         /* VFIO device state */
-#define VFIO_DEVICE_STATE_STOP      (0)
-#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
-#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
-#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
-#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
-				     VFIO_DEVICE_STATE_SAVING |  \
-				     VFIO_DEVICE_STATE_RESUMING)
-
-#define VFIO_DEVICE_STATE_VALID(state) \
-	(state & VFIO_DEVICE_STATE_RESUMING ? \
-	(state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_RESUMING : 1)
-
-#define VFIO_DEVICE_STATE_IS_ERROR(state) \
-	((state & VFIO_DEVICE_STATE_MASK) == (VFIO_DEVICE_STATE_SAVING | \
-					      VFIO_DEVICE_STATE_RESUMING))
-
-#define VFIO_DEVICE_STATE_SET_ERROR(state) \
-	((state & ~VFIO_DEVICE_STATE_MASK) | VFIO_DEVICE_SATE_SAVING | \
-					     VFIO_DEVICE_STATE_RESUMING)
-
 	__u32 reserved;
 	__u64 pending_bytes;
 	__u64 data_offset;