[V7,mlx5-next,09/15] vfio: Extend the device migration protocol with RUNNING_P2P

Message ID 20220207172216.206415-10-yishaih@nvidia.com (mailing list archive)
State Superseded
Series Add mlx5 live migration driver and v2 migration protocol

Commit Message

Yishai Hadas Feb. 7, 2022, 5:22 p.m. UTC
From: Jason Gunthorpe <jgg@nvidia.com>

The RUNNING_P2P state is designed to support multiple devices in the same
VM that are doing P2P transactions between themselves. When in RUNNING_P2P
the device must be able to accept incoming P2P transactions but should not
generate outgoing transactions.

As an optional extension to the mandatory states it is defined as
in between STOP and RUNNING:
   STOP -> RUNNING_P2P -> RUNNING -> RUNNING_P2P -> STOP

For drivers that are unable to support RUNNING_P2P the core code silently
merges RUNNING_P2P and RUNNING together. Drivers that support this will be
required to implement 4 FSM arcs beyond the basic FSM. 2 of the basic FSM
arcs become combination transitions.

Compared to the v1 clarification, NDMA is redefined into FSM states and is
described in terms of the desired P2P quiescent behavior, noting that
halting all DMA is an acceptable implementation.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 drivers/vfio/vfio.c       | 79 ++++++++++++++++++++++++++++++---------
 include/linux/vfio.h      |  1 +
 include/uapi/linux/vfio.h | 34 ++++++++++++++++-
 3 files changed, 95 insertions(+), 19 deletions(-)

Comments

Tian, Kevin Feb. 15, 2022, 10:18 a.m. UTC | #1
> From: Yishai Hadas <yishaih@nvidia.com>
> Sent: Tuesday, February 8, 2022 1:22 AM
> 
> From: Jason Gunthorpe <jgg@nvidia.com>
> 
> The RUNNING_P2P state is designed to support multiple devices in the same
> VM that are doing P2P transactions between themselves. When in
> RUNNING_P2P
> the device must be able to accept incoming P2P transactions but should not
> generate outgoing transactions.

outgoing 'P2P' transactions.

> 
> As an optional extension to the mandatory states it is defined as
> in between STOP and RUNNING:
>    STOP -> RUNNING_P2P -> RUNNING -> RUNNING_P2P -> STOP
> 
> For drivers that are unable to support RUNNING_P2P the core code silently
> merges RUNNING_P2P and RUNNING together. Drivers that support this will

It would be clearer if following message could be also reflected here:

  + * The optional states cannot be used with SET_STATE if the device does not
  + * support them. The user can discover if these states are supported by using
  + * VFIO_DEVICE_FEATURE_MIGRATION. 

Otherwise the original context reads as if RUNNING_P2P can be used as an
end state even if the underlying driver doesn't support it, which makes me
wonder what the point of the new capability bit is.

> be
> required to implement 4 FSM arcs beyond the basic FSM. 2 of the basic FSM
> arcs become combination transitions.
> 
> Compared to the v1 clarification, NDMA is redefined into FSM states and is
> described in terms of the desired P2P quiescent behavior, noting that
> halting all DMA is an acceptable implementation.
> 
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> ---
>  drivers/vfio/vfio.c       | 79 ++++++++++++++++++++++++++++++---------
>  include/linux/vfio.h      |  1 +
>  include/uapi/linux/vfio.h | 34 ++++++++++++++++-
>  3 files changed, 95 insertions(+), 19 deletions(-)
> 
> @@ -1631,17 +1657,36 @@ int vfio_mig_get_next_state(struct vfio_device

[...]

>  	*next_fsm = vfio_from_fsm_table[cur_fsm][new_fsm];
> +	while ((state_flags_table[*next_fsm] & device->migration_flags) !=
> +			state_flags_table[*next_fsm])
> +		*next_fsm = vfio_from_fsm_table[*next_fsm][new_fsm];
> +

A comment highlighting the silent merging of unsupported states would
be informative here.

and I am puzzled by the following messages:

>   *
> + * And 1 optional state to support VFIO_MIGRATION_P2P:
> + *  RUNNING_P2P - RUNNING, except the device cannot do peer to peer
> DMA
>   *

and

> + * RUNNING_P2P -> RUNNING
>   *   While in RUNNING the device is fully operational, the device may
> generate
>   *   interrupts, DMA, respond to MMIO, all vfio device regions are functional,
>   *   and the device may advance its internal state.
>   *

and below

> + * The optional peer to peer (P2P) quiescent state is intended to be a
> quiescent
> + * state for the device for the purposes of managing multiple devices within
> a
> + * user context where peer-to-peer DMA between devices may be active.
> The
> + * RUNNING_P2P states must prevent the device from initiating
> + * any new P2P DMA transactions. If the device can identify P2P transactions
> + * then it can stop only P2P DMA, otherwise it must stop all DMA. The
> migration
> + * driver must complete any such outstanding operations prior to
> completing the
> + * FSM arc into a P2P state. For the purpose of specification the states
> + * behave as though the device was fully running if not supported.

Defining RUNNING_P2P in above way implies that RUNNING_P2P inherits 
all behaviors in RUNNING except blocking outbound P2P:
	* generate interrupts and DMAs
	* respond to MMIO
	* all vfio regions are functional
	* device may advance its internal state
	* drain and block outstanding P2P requests

I think this is not the intended behavior when NDMA was being discussed
in previous threads, as above definition suggests the user could continue
to submit new requests after outstanding P2P requests are completed given
all vfio regions are functional when the device is in RUNNING_P2P.

Though just a naming thing, possibly what we really require is a STOPPING_P2P
state which indicates the device is moving to the STOP (or STOPPED) state.
In this state the device is functional but vfio regions are not so the user still
needs to restrict device access:
	* generate interrupts and DMAs
	* respond to MMIO
	* all vfio regions are NOT functional (no user access)
	* device may advance its internal state
	* drain and block outstanding P2P requests

In virtualization this means Qemu must stop vCPU first before entering
STOPPING_P2P for a device.

Back to your earlier suggestion on reusing RUNNING_P2P to cover vPRI 
usage via a new capability bit [1]:

    "A cap like "running_p2p returns an event fd, doesn't finish until the
    VCPU does stuff, and stops pri as well as p2p" might be all that is
    required here (and not an actual new state)"

vPRI requires a RUNNING semantics. A new capability bit can change 
the behaviors listed above for STOPPING_P2P to below:
	* both P2P and vPRI requests should be drained and blocked;
	* all vfio regions are functional (with a RUNNING behavior) so
	  vCPUs can continue running to help drain vPRI requests;
	* an eventfd is returned for the user to poll-wait the completion
	  of state transition;

and in this regard possibly it makes more sense to call this state 
as STOPPING to encapsulate any optional preparation work before 
the device can be transitioned to STOP (with default as defined for
STOPPING_P2P above and actual behavior changeable by future
capability bits)? 

One additional requirement on the driver side is to dynamically mediate the
fast path and queue any new request which may trigger vPRI or P2P
before moving out of RUNNING_P2P. If moving to STOP_COPY, then
queued requests will also be included as device state to be replayed
in the resuming path.

Does the above sound like a reasonable understanding of this FSM mechanism?

> + *
> + * The optional states cannot be used with SET_STATE if the device does not
> + * support them. The user can disocver if these states are supported by

'disocver' -> 'discover'

Thanks
Kevin
Jason Gunthorpe Feb. 15, 2022, 3:56 p.m. UTC | #2
On Tue, Feb 15, 2022 at 10:18:11AM +0000, Tian, Kevin wrote:
> > From: Yishai Hadas <yishaih@nvidia.com>
> > Sent: Tuesday, February 8, 2022 1:22 AM
> > 
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > 
> > The RUNNING_P2P state is designed to support multiple devices in the same
> > VM that are doing P2P transactions between themselves. When in
> > RUNNING_P2P
> > the device must be able to accept incoming P2P transactions but should not
> > generate outgoing transactions.
> 
> outgoing 'P2P' transactions.

Yes

> > As an optional extension to the mandatory states it is defined as
> > in between STOP and RUNNING:
> >    STOP -> RUNNING_P2P -> RUNNING -> RUNNING_P2P -> STOP
> > 
> > For drivers that are unable to support RUNNING_P2P the core code silently
> > merges RUNNING_P2P and RUNNING together. Drivers that support this will
> 
> It would be clearer if following message could be also reflected here:
> 
>   + * The optional states cannot be used with SET_STATE if the device does not
>   + * support them. The user can discover if these states are supported by using
>   + * VFIO_DEVICE_FEATURE_MIGRATION. 
> 
> Otherwise the original context reads like RUNNING_P2P can be used as
> end state even if the underlying driver doesn't support it then makes me
> wonder what is the point of the new capability bit.

You've read it right. Let's just add a simple "Unless driver support is
present the new state cannot be used in SET_STATE"
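
A minimal sketch of how userspace might interpret the flags reported by
VFIO_DEVICE_FEATURE_MIGRATION before using RUNNING_P2P in SET_STATE,
following the flag semantics documented in this patch. The EX_ names and
both helpers are hypothetical; only the bit values mirror the uapi header,
and the ioctl that fetches the flags is not shown. Treating every other
flag combination as undefined is a conservative reading of "behavior to be
defined in the future".

```c
#include <stdint.h>

/*
 * Bit values mirror VFIO_MIGRATION_STOP_COPY and VFIO_MIGRATION_P2P
 * from this patch; the EX_ names are illustrative only.
 */
#define EX_MIG_STOP_COPY	(UINT64_C(1) << 0)
#define EX_MIG_P2P		(UINT64_C(1) << 1)

enum ex_mig_support {
	EX_MIG_UNDEFINED = 0,	/* combination not defined by this patch */
	EX_MIG_BASIC,		/* RUNNING, STOP, STOP_COPY, RESUMING */
	EX_MIG_WITH_P2P,	/* the above plus RUNNING_P2P */
};

/* Map the reported flags onto the combinations the patch documents. */
enum ex_mig_support ex_classify(uint64_t flags)
{
	if (flags == EX_MIG_STOP_COPY)
		return EX_MIG_BASIC;
	if (flags == (EX_MIG_STOP_COPY | EX_MIG_P2P))
		return EX_MIG_WITH_P2P;
	return EX_MIG_UNDEFINED;
}

/* RUNNING_P2P may only be named in SET_STATE when the driver reports it. */
int ex_may_request_running_p2p(uint64_t flags)
{
	return ex_classify(flags) == EX_MIG_WITH_P2P;
}
```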

> >  	*next_fsm = vfio_from_fsm_table[cur_fsm][new_fsm];
> > +	while ((state_flags_table[*next_fsm] & device->migration_flags) !=
> > +			state_flags_table[*next_fsm])
> > +		*next_fsm = vfio_from_fsm_table[*next_fsm][new_fsm];
> > +
> 
> A comment highlighting the silent merging of unsupported states would
> be informative here.

	/*
	 * Arcs touching optional and unsupported states are skipped over. The
	 * driver will instead see an arc from the original state to the next
	 * logical state, as per the above comment.
	 */
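
The skipping can be sketched as a standalone userspace model of the lookup
in vfio_mig_get_next_state(). This is not the kernel code: the EX_ names
and both functions are hypothetical, the table is transcribed from the
patch, and the [EX_STOP] entries of the STOP_COPY and RESUMING rows, which
the diff context does not show, are assumed from the arcs
STOP_COPY -> STOP and RESUMING -> STOP.

```c
#include <assert.h>

/* State numbers copied from enum vfio_device_mig_state in this patch. */
enum { EX_ERROR, EX_STOP, EX_RUNNING, EX_STOP_COPY, EX_RESUMING,
       EX_RUNNING_P2P, EX_NUM_STATES };

#define EX_MIG_STOP_COPY	(1u << 0)
#define EX_MIG_P2P		(1u << 1)

/*
 * next_tbl[cur][target] is the next state to step to. Entries left out
 * default to 0, which is EX_ERROR, so the ERROR row and column are
 * implicit.
 */
static const unsigned char next_tbl[EX_NUM_STATES][EX_NUM_STATES] = {
	[EX_STOP] = {
		[EX_STOP] = EX_STOP, [EX_RUNNING] = EX_RUNNING_P2P,
		[EX_STOP_COPY] = EX_STOP_COPY, [EX_RESUMING] = EX_RESUMING,
		[EX_RUNNING_P2P] = EX_RUNNING_P2P },
	[EX_RUNNING] = {
		[EX_STOP] = EX_RUNNING_P2P, [EX_RUNNING] = EX_RUNNING,
		[EX_STOP_COPY] = EX_RUNNING_P2P, [EX_RESUMING] = EX_RUNNING_P2P,
		[EX_RUNNING_P2P] = EX_RUNNING_P2P },
	[EX_STOP_COPY] = {
		[EX_STOP] = EX_STOP, [EX_RUNNING] = EX_STOP,
		[EX_STOP_COPY] = EX_STOP_COPY, [EX_RESUMING] = EX_STOP,
		[EX_RUNNING_P2P] = EX_STOP },
	[EX_RESUMING] = {
		[EX_STOP] = EX_STOP, [EX_RUNNING] = EX_STOP,
		[EX_STOP_COPY] = EX_STOP, [EX_RESUMING] = EX_RESUMING,
		[EX_RUNNING_P2P] = EX_STOP },
	[EX_RUNNING_P2P] = {
		[EX_STOP] = EX_STOP, [EX_RUNNING] = EX_RUNNING,
		[EX_STOP_COPY] = EX_STOP, [EX_RESUMING] = EX_STOP,
		[EX_RUNNING_P2P] = EX_RUNNING_P2P },
};

/* Flags a device must report for each state to be usable. */
static const unsigned int need_flags[EX_NUM_STATES] = {
	[EX_ERROR] = ~0u,
	[EX_STOP] = EX_MIG_STOP_COPY,
	[EX_RUNNING] = EX_MIG_STOP_COPY,
	[EX_STOP_COPY] = EX_MIG_STOP_COPY,
	[EX_RESUMING] = EX_MIG_STOP_COPY,
	[EX_RUNNING_P2P] = EX_MIG_STOP_COPY | EX_MIG_P2P,
};

static int ex_supported(unsigned int dev_flags, unsigned int state)
{
	return (need_flags[state] & dev_flags) == need_flags[state];
}

/* Returns the next state the driver is asked to enter, or -1. */
int ex_next_state(unsigned int dev_flags, unsigned int cur,
		  unsigned int target)
{
	unsigned int next;

	if (cur >= EX_NUM_STATES || !ex_supported(dev_flags, cur))
		return -1;
	if (target >= EX_NUM_STATES || !ex_supported(dev_flags, target))
		return -1;

	next = next_tbl[cur][target];
	/* Skip over optional states the device does not support. */
	while (!ex_supported(dev_flags, next))
		next = next_tbl[next][target];
	return next != EX_ERROR ? (int)next : -1;
}

/* Counts the driver-visible arcs needed to get from cur to target. */
int ex_count_arcs(unsigned int dev_flags, unsigned int cur,
		  unsigned int target)
{
	int arcs = 0, next;

	while ((next = ex_next_state(dev_flags, cur, target)) >= 0 &&
	       (unsigned int)next != cur) {
		cur = (unsigned int)next;
		arcs++;
		assert(arcs < EX_NUM_STATES);	/* the table has no cycles */
	}
	return next < 0 ? -1 : arcs;
}
```

With EX_MIG_STOP_COPY | EX_MIG_P2P a request to go from RUNNING to
STOP_COPY walks RUNNING -> RUNNING_P2P -> STOP -> STOP_COPY. With
EX_MIG_STOP_COPY alone the RUNNING_P2P step is skipped and the driver
sees RUNNING -> STOP -> STOP_COPY.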

> Defining RUNNING_P2P in above way implies that RUNNING_P2P inherits 
> all behaviors in RUNNING except blocking outbound P2P:
> 	* generate interrupts and DMAs
> 	* respond to MMIO
> 	* all vfio regions are functional
> 	* device may advance its internal state
> 	* drain and block outstanding P2P requests

Correct.

The device must be able to receive and process any MMIO P2P
transaction during this state.

We discussed and left interrupts as allowed behavior.

> I think this is not the intended behavior when NDMA was being discussed
> in previous threads, as above definition suggests the user could continue
> to submit new requests after outstanding P2P requests are completed given
> all vfio regions are functional when the device is in RUNNING_P2P.

It is the desired behavior. The device must internally stop generating
DMA from new work, it cannot rely on external things not poking it
with MMIO, because the whole point of the state is that MMIO P2P is
still allowed to happen.

What gets confusing is that in normal cases I wouldn't expect any P2P
activity to trigger a new work submission.

Probably, since many devices can't implement this, we will end up with
devices providing a weaker version where they do RUNNING_P2P but this
relies on the VM operating the device "sanely" without programming P2P
work submission. It is similar to your notion that migration requires
guest co-operation in the vPRI case.

I don't like it, and better devices really should avoid requiring
guest co-operation, but it seems like where things are going.

> Though just a naming thing, possibly what we really require is a STOPPING_P2P
> state which indicates the device is moving to the STOP (or STOPPED)
> state.

No, I've deliberately avoided STOP because this isn't anything like
STOP. It is RUNNING with one restriction.

> In this state the device is functional but vfio regions are not so the user still
> needs to restrict device access:

The device is not functional in STOP. STOP means the device does not
provide working MMIO. I.e. mlx5 devices will discard all writes and
read all 0's when in STOP.

The point of RUNNING_P2P is to allow the device to continue to receive
all MMIO while halting generation of MMIO to other devices.
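
As a toy illustration of why this matters with several devices, the model
below counts the intermediate steps at which one device may still initiate
P2P (RUNNING) while a peer no longer provides working MMIO (STOP). The
model, the EX_ names and the two stop orders are assumptions made for
illustration, not code from this series or from any VMM.

```c
#include <assert.h>

/* State numbers copied from enum vfio_device_mig_state in this patch. */
enum { EX_STOP = 1, EX_RUNNING = 2, EX_RUNNING_P2P = 5 };

#define EX_MAX_DEVS 8

/*
 * A hazard exists when some device may still initiate P2P (RUNNING)
 * while a peer no longer provides working MMIO (STOP).
 */
int ex_hazard(const int *state, int n)
{
	int running = 0, stopped = 0;

	for (int i = 0; i < n; i++) {
		running |= (state[i] == EX_RUNNING);
		stopped |= (state[i] == EX_STOP);
	}
	return running && stopped;
}

/* Stop devices one at a time, straight from RUNNING to STOP. */
int ex_stop_naive(int n)
{
	int state[EX_MAX_DEVS], hazards = 0;

	assert(n >= 0 && n <= EX_MAX_DEVS);
	for (int i = 0; i < n; i++)
		state[i] = EX_RUNNING;
	for (int i = 0; i < n; i++) {
		state[i] = EX_STOP;
		hazards += ex_hazard(state, n);
	}
	return hazards;
}

/* Move every device to RUNNING_P2P first, and only then to STOP. */
int ex_stop_two_phase(int n)
{
	int state[EX_MAX_DEVS], hazards = 0;

	assert(n >= 0 && n <= EX_MAX_DEVS);
	for (int i = 0; i < n; i++)
		state[i] = EX_RUNNING;
	/* Phase 1: stop initiating P2P, keep accepting it. */
	for (int i = 0; i < n; i++) {
		state[i] = EX_RUNNING_P2P;
		hazards += ex_hazard(state, n);
	}
	/* Phase 2: no device can target a stopped peer any more. */
	for (int i = 0; i < n; i++) {
		state[i] = EX_STOP;
		hazards += ex_hazard(state, n);
	}
	return hazards;
}
```

In this model the one-at-a-time order leaves a hazard after every step but
the last, while the two-phase order never does.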

> In virtualization this means Qemu must stop vCPU first before entering
> STOPPING_P2P for a device.

This is already the case. RUNNING/STOP here does not refer to the
vCPU, it refers to this device.

> Back to your earlier suggestion on reusing RUNNING_P2P to cover vPRI 
> usage via a new capability bit [1]:
> 
>     "A cap like "running_p2p returns an event fd, doesn't finish until the
>     VCPU does stuff, and stops pri as well as p2p" might be all that is
>     required here (and not an actual new state)"
> 
> vPRI requires a RUNNING semantics. A new capability bit can change 
> the behaviors listed above for STOPPING_P2P to below:
> 	* both P2P and vPRI requests should be drained and blocked;
> 	* all vfio regions are functional (with a RUNNING behavior) so
> 	  vCPUs can continue running to help drain vPRI requests;
> 	* an eventfd is returned for the user to poll-wait the completion
> 	  of state transition;

vPRI draining is not STOP either. If the device is expected to provide
working MMIO it is not STOP by definition.

> One additional requirement in driver side is to dynamically mediate the 
> fast path and queue any new request which may trigger vPRI or P2P
> before moving out of RUNNING_P2P. If moving to STOP_COPY, then
> queued requests will also be included as device state to be replayed
> in the resuming path.

This could make sense. I don't know how you dynamically mediate
though, or how you will trap ENQCMD..

> Does above sound a reasonable understanding of this FSM mechanism? 

Other than mis-using the STOP label, it is close yes.

> > + * The optional states cannot be used with SET_STATE if the device does not
> > + * support them. The user can disocver if these states are supported by
> 
> 'disocver' -> 'discover'

Yep, thanks

Jason
Tian, Kevin Feb. 16, 2022, 2:52 a.m. UTC | #3
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, February 15, 2022 11:56 PM
> 
> > Defining RUNNING_P2P in above way implies that RUNNING_P2P inherits
> > all behaviors in RUNNING except blocking outbound P2P:
> > 	* generate interrupts and DMAs
> > 	* respond to MMIO
> > 	* all vfio regions are functional
> > 	* device may advance its internal state
> > 	* drain and block outstanding P2P requests
> 
> Correct.
> 
> The device must be able to receive and process any MMIO P2P
> transaction during this state.
> 
> We discussed and left interrupts as allowed behavior.
> 
> > I think this is not the intended behavior when NDMA was being discussed
> > in previous threads, as above definition suggests the user could continue
> > to submit new requests after outstanding P2P requests are completed
> given
> > all vfio regions are functional when the device is in RUNNING_P2P.
> 
> It is the desired behavior. The device must internally stop generating
> DMA from new work, it cannot rely on external things not poking it
> with MMIO, because the whole point of the state is that MMIO P2P is
> still allowed to happen.
> 
> What gets confusing is that in normal cases I wouldn't expect any P2P
> activity to trigger a new work submission.
> 
> Probably, since many devices can't implement this, we will end up with
> devices providing a weaker version where they do RUNNING_P2P but this
> relies on the VM operating the device "sanely" without programming P2P
> work submission. It is similar to your notion that migration requires
> guest co-operation in the vPRI case.
> 
> I don't like it, and better devices really should avoid requiring
> guest co-operation, but it seems like where things are going.

Makes sense to me now.

btw can disabling PCI bus master be a general means for devices which
don't have a way of blocking P2P to implement RUNNING_P2P? 

> 
> > Though just a naming thing, possibly what we really require is a
> STOPPING_P2P
> > state which indicates the device is moving to the STOP (or STOPPED)
> > state.
> 
> No, I've deliberately avoided STOP because this isn't anything like
> STOP. It is RUNNING with one restriction.

With above explanation I'm fine with it.

> 
> > In this state the device is functional but vfio regions are not so the user still
> > needs to restrict device access:
> 
> The device is not functional in STOP. STOP means the device does not
> provide working MMIO. I.e. mlx5 devices will discard all writes and
> read all 0's when in STOP.

btw I used 'STOPPING' to indicate a transitional state between RUNNING
and STOP, so it could be defined separately from STOP. But
it doesn't matter now.

> 
> The point of RUNNING_P2P is to allow the device to continue to receive
> all MMIO while halting generation of MMIO to other devices.
> 
> > In virtualization this means Qemu must stop vCPU first before entering
> > STOPPING_P2P for a device.
> 
> This is already the case. RUNNING/STOP here does not refer to the
> vCPU, it refers to this device.

I know that point. Originally I thought that having 'RUNNING' in RUNNING_P2P
implies that vCPU doesn't need to be stopped first given all vfio regions are
functional. But now I think the rationale is clear. If guest co-operation exists
then vCPU can be active when entering RUNNING_P2P since the guest will
guarantee no P2P submission (via vCPU or via P2P). Otherwise vCPU must be 
stopped first to block potential P2P work submissions as a brute-force operation.

> 
> > Back to your earlier suggestion on reusing RUNNING_P2P to cover vPRI
> > usage via a new capability bit [1]:
> >
> >     "A cap like "running_p2p returns an event fd, doesn't finish until the
> >     VCPU does stuff, and stops pri as well as p2p" might be all that is
> >     required here (and not an actual new state)"
> >
> > vPRI requires a RUNNING semantics. A new capability bit can change
> > the behaviors listed above for STOPPING_P2P to below:
> > 	* both P2P and vPRI requests should be drained and blocked;
> > 	* all vfio regions are functional (with a RUNNING behavior) so
> > 	  vCPUs can continue running to help drain vPRI requests;
> > 	* an eventfd is returned for the user to poll-wait the completion
> > 	  of state transition;
> 
> vPRI draining is not STOP either. If the device is expected to provide
> working MMIO it is not STOP by definition.
> 
> > One additional requirement in driver side is to dynamically mediate the
> > fast path and queue any new request which may trigger vPRI or P2P
> > before moving out of RUNNING_P2P. If moving to STOP_COPY, then
> > queued requests will also be included as device state to be replayed
> > in the resuming path.
> 
> This could make sense. I don't know how you dynamically mediate
> though, or how you will trap ENQCMD..

Qemu can ask KVM to temporarily clear EPT mapping of the cmd portal 
to enable mediation on src and then restore the mapping before resuming
vCPU on dest. In our internal POC the cmd portal address is hard coded
in Qemu, which is not good. Possibly we need a general mechanism so that
a migration driver which supports vPRI and extended RUNNING_P2P behavior
can report to the user a list of pages which must be accessed via read()/
write() instead of mmap when the device is in RUNNING_P2P and vCPUs
are active. Based on that information Qemu can zap related EPT mappings 
before moving the device to RUNNING_P2P.

> 
> > Does above sound a reasonable understanding of this FSM mechanism?
> 
> Other than mis-using the STOP label, it is close yes.
> 

Thanks
Kevin
Jason Gunthorpe Feb. 16, 2022, 12:11 p.m. UTC | #4
On Wed, Feb 16, 2022 at 02:52:55AM +0000, Tian, Kevin wrote:

> btw can disabling PCI bus master be a general means for devices which
> don't have a way of blocking P2P to implement RUNNING_P2P? 

I think if it works for a specific device then that device's driver
can use it.

I wouldn't make something general, too likely a device will blow up if
you do this to it.

Jason
Patch

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index e7ab9f2048cd..8c484593dfe0 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1577,39 +1577,55 @@  int vfio_mig_get_next_state(struct vfio_device *device,
 			    enum vfio_device_mig_state new_fsm,
 			    enum vfio_device_mig_state *next_fsm)
 {
-	enum { VFIO_DEVICE_NUM_STATES = VFIO_DEVICE_STATE_RESUMING + 1 };
+	enum { VFIO_DEVICE_NUM_STATES = VFIO_DEVICE_STATE_RUNNING_P2P + 1 };
 	/*
-	 * The coding in this table requires the driver to implement 6
+	 * The coding in this table requires the driver to implement
 	 * FSM arcs:
 	 *         RESUMING -> STOP
-	 *         RUNNING -> STOP
 	 *         STOP -> RESUMING
-	 *         STOP -> RUNNING
 	 *         STOP -> STOP_COPY
 	 *         STOP_COPY -> STOP
 	 *
-	 * The coding will step through multiple states for these combination
-	 * transitions:
-	 *         RESUMING -> STOP -> RUNNING
+	 * If P2P is supported then the driver must also implement these FSM
+	 * arcs:
+	 *         RUNNING -> RUNNING_P2P
+	 *         RUNNING_P2P -> RUNNING
+	 *         RUNNING_P2P -> STOP
+	 *         STOP -> RUNNING_P2P
+	 * Without P2P the driver must implement:
+	 *         RUNNING -> STOP
+	 *         STOP -> RUNNING
+	 *
+	 * If all optional features are supported then the coding will step
+	 * through multiple states for these combination transitions:
+	 *         RESUMING -> STOP -> RUNNING_P2P
+	 *         RESUMING -> STOP -> RUNNING_P2P -> RUNNING
 	 *         RESUMING -> STOP -> STOP_COPY
-	 *         RUNNING -> STOP -> RESUMING
-	 *         RUNNING -> STOP -> STOP_COPY
+	 *         RUNNING -> RUNNING_P2P -> STOP
+	 *         RUNNING -> RUNNING_P2P -> STOP -> RESUMING
+	 *         RUNNING -> RUNNING_P2P -> STOP -> STOP_COPY
+	 *         RUNNING_P2P -> STOP -> RESUMING
+	 *         RUNNING_P2P -> STOP -> STOP_COPY
+	 *         STOP -> RUNNING_P2P -> RUNNING
 	 *         STOP_COPY -> STOP -> RESUMING
-	 *         STOP_COPY -> STOP -> RUNNING
+	 *         STOP_COPY -> STOP -> RUNNING_P2P
+	 *         STOP_COPY -> STOP -> RUNNING_P2P -> RUNNING
 	 */
 	static const u8 vfio_from_fsm_table[VFIO_DEVICE_NUM_STATES][VFIO_DEVICE_NUM_STATES] = {
 		[VFIO_DEVICE_STATE_STOP] = {
 			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
-			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY,
 			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RESUMING,
+			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
 		},
 		[VFIO_DEVICE_STATE_RUNNING] = {
-			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
-			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP,
-			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
 		},
 		[VFIO_DEVICE_STATE_STOP_COPY] = {
@@ -1617,6 +1633,7 @@  int vfio_mig_get_next_state(struct vfio_device *device,
 			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY,
 			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
 		},
 		[VFIO_DEVICE_STATE_RESUMING] = {
@@ -1624,6 +1641,15 @@  int vfio_mig_get_next_state(struct vfio_device *device,
 			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RESUMING,
+			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
+		},
+		[VFIO_DEVICE_STATE_RUNNING_P2P] = {
+			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
 		},
 		[VFIO_DEVICE_STATE_ERROR] = {
@@ -1631,17 +1657,36 @@  int vfio_mig_get_next_state(struct vfio_device *device,
 			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_ERROR,
 			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_ERROR,
 			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_ERROR,
+			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_ERROR,
 			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
 		},
 	};
 
-	if (WARN_ON(cur_fsm >= ARRAY_SIZE(vfio_from_fsm_table)))
+	static const unsigned int state_flags_table[VFIO_DEVICE_NUM_STATES] = {
+		[VFIO_DEVICE_STATE_STOP] = VFIO_MIGRATION_STOP_COPY,
+		[VFIO_DEVICE_STATE_RUNNING] = VFIO_MIGRATION_STOP_COPY,
+		[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_MIGRATION_STOP_COPY,
+		[VFIO_DEVICE_STATE_RESUMING] = VFIO_MIGRATION_STOP_COPY,
+		[VFIO_DEVICE_STATE_RUNNING_P2P] =
+			VFIO_MIGRATION_STOP_COPY | VFIO_MIGRATION_P2P,
+		[VFIO_DEVICE_STATE_ERROR] = ~0U,
+	};
+
+	if (WARN_ON(cur_fsm >= ARRAY_SIZE(vfio_from_fsm_table) ||
+		    (state_flags_table[cur_fsm] & device->migration_flags) !=
+			state_flags_table[cur_fsm]))
 		return -EINVAL;
 
-	if (new_fsm >= ARRAY_SIZE(vfio_from_fsm_table))
+	if (new_fsm >= ARRAY_SIZE(vfio_from_fsm_table) ||
+	   (state_flags_table[new_fsm] & device->migration_flags) !=
+			state_flags_table[new_fsm])
 		return -EINVAL;
 
 	*next_fsm = vfio_from_fsm_table[cur_fsm][new_fsm];
+	while ((state_flags_table[*next_fsm] & device->migration_flags) !=
+			state_flags_table[*next_fsm])
+		*next_fsm = vfio_from_fsm_table[*next_fsm][new_fsm];
+
 	return (*next_fsm != VFIO_DEVICE_STATE_ERROR) ? 0 : -EINVAL;
 }
 EXPORT_SYMBOL_GPL(vfio_mig_get_next_state);
@@ -1731,7 +1776,7 @@  static int vfio_ioctl_device_feature_migration(struct vfio_device *device,
 					       size_t argsz)
 {
 	struct vfio_device_feature_migration mig = {
-		.flags = VFIO_MIGRATION_STOP_COPY,
+		.flags = device->migration_flags,
 	};
 	int ret;
 
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 3f4a1a7c2277..a173718d2a1b 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -33,6 +33,7 @@  struct vfio_device {
 	struct vfio_group *group;
 	struct vfio_device_set *dev_set;
 	struct list_head dev_set_list;
+	unsigned int migration_flags;
 
 	/* Members below here are private, not for driver use */
 	refcount_t refcount;
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 89012bc01663..773895988cf1 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1009,10 +1009,16 @@  struct vfio_device_feature {
  *
  * VFIO_MIGRATION_STOP_COPY means that RUNNING, STOP, STOP_COPY and
  * RESUMING are supported.
+ *
+ * VFIO_MIGRATION_STOP_COPY | VFIO_MIGRATION_P2P means that RUNNING_P2P
+ * is supported in addition to the STOP_COPY states.
+ *
+ * Other combinations of flags have behavior to be defined in the future.
  */
 struct vfio_device_feature_migration {
 	__aligned_u64 flags;
 #define VFIO_MIGRATION_STOP_COPY	(1 << 0)
+#define VFIO_MIGRATION_P2P		(1 << 1)
 };
 #define VFIO_DEVICE_FEATURE_MIGRATION 1
 
@@ -1063,10 +1069,13 @@  struct vfio_device_feature_mig_state {
  *  RESUMING - The device is stopped and is loading a new internal state
  *  ERROR - The device has failed and must be reset
  *
+ * And 1 optional state to support VFIO_MIGRATION_P2P:
+ *  RUNNING_P2P - RUNNING, except the device cannot do peer to peer DMA
+ *
  * The FSM takes actions on the arcs between FSM states. The driver implements
  * the following behavior for the FSM arcs:
  *
- * RUNNING -> STOP
+ * RUNNING_P2P -> STOP
  * STOP_COPY -> STOP
  *   While in STOP the device must stop the operation of the device. The
  *   device must not generate interrupts, DMA, or advance its internal
@@ -1093,11 +1102,16 @@  struct vfio_device_feature_mig_state {
  *
  *   To abort a RESUMING session the device must be reset.
  *
- * STOP -> RUNNING
+ * RUNNING_P2P -> RUNNING
  *   While in RUNNING the device is fully operational, the device may generate
  *   interrupts, DMA, respond to MMIO, all vfio device regions are functional,
  *   and the device may advance its internal state.
  *
+ * RUNNING -> RUNNING_P2P
+ * STOP -> RUNNING_P2P
+ *   While in RUNNING_P2P the device is partially running in the P2P quiescent
+ *   state defined below.
+ *
  * STOP -> STOP_COPY
  *   This arc begin the process of saving the device state and will return a
  *   new data_fd.
@@ -1127,6 +1141,16 @@  struct vfio_device_feature_mig_state {
  *   To recover from ERROR VFIO_DEVICE_RESET must be used to return the
  *   device_state back to RUNNING.
  *
+ * The optional peer to peer (P2P) quiescent state is intended to be a quiescent
+ * state for the device for the purposes of managing multiple devices within a
+ * user context where peer-to-peer DMA between devices may be active. The
+ * RUNNING_P2P states must prevent the device from initiating
+ * any new P2P DMA transactions. If the device can identify P2P transactions
+ * then it can stop only P2P DMA, otherwise it must stop all DMA. The migration
+ * driver must complete any such outstanding operations prior to completing the
+ * FSM arc into a P2P state. For the purpose of specification the states
+ * behave as though the device was fully running if not supported.
+ *
  * The remaining possible transitions are interpreted as combinations of the
  * above FSM arcs. As there are multiple paths through the FSM arcs the path
  * should be selected based on the following rules:
@@ -1139,6 +1163,11 @@  struct vfio_device_feature_mig_state {
  * fails. When handling these types of errors users should anticipate future
  * revisions of this protocol using new states and those states becoming
  * visible in this case.
+ *
+ * The optional states cannot be used with SET_STATE if the device does not
+ * support them. The user can disocver if these states are supported by using
+ * VFIO_DEVICE_FEATURE_MIGRATION. By using combination transitions the user can
+ * avoid knowing about these optional states if the kernel driver supports them.
  */
 enum vfio_device_mig_state {
 	VFIO_DEVICE_STATE_ERROR = 0,
@@ -1146,6 +1175,7 @@  enum vfio_device_mig_state {
 	VFIO_DEVICE_STATE_RUNNING = 2,
 	VFIO_DEVICE_STATE_STOP_COPY = 3,
 	VFIO_DEVICE_STATE_RESUMING = 4,
+	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
 };
 
 /* -------- API for Type1 VFIO IOMMU -------- */