diff mbox series

[V6,mlx5-next,09/15] vfio: Extend the device migration protocol with RUNNING_P2P

Message ID 20220130160826.32449-10-yishaih@nvidia.com (mailing list archive)
State Superseded
Headers show
Series Add mlx5 live migration driver and v2 migration protocol | expand

Commit Message

Yishai Hadas Jan. 30, 2022, 4:08 p.m. UTC
From: Jason Gunthorpe <jgg@nvidia.com>

The RUNNING_P2P state is designed to support multiple devices in the same
VM that are doing P2P transactions between themselves. When in RUNNING_P2P
the device must be able to accept incoming P2P transactions but should not
generate outgoing transactions.

As an optional extension to the mandatory states it is defined as
in between STOP and RUNNING:
   STOP -> RUNNING_P2P -> RUNNING -> RUNNING_P2P -> STOP

For drivers that are unable to support RUNNING_P2P the core code silently
merges RUNNING_P2P and RUNNING together. Drivers that support this will be
required to implement 4 FSM arcs beyond the basic FSM. 2 of the basic FSM
arcs become combination transitions.

Compared to the v1 clarification, NDMA is redefined into FSM states and is
described in terms of the desired P2P quiescent behavior, noting that
halting all DMA is an acceptable implementation.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 drivers/vfio/vfio.c       | 70 ++++++++++++++++++++++++++++++---------
 include/linux/vfio.h      |  2 ++
 include/uapi/linux/vfio.h | 34 +++++++++++++++++--
 3 files changed, 88 insertions(+), 18 deletions(-)
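The merge/step behavior the commit message describes can be modeled in a short user-space sketch (illustrative enum values and names, not the kernel definitions): walking the table one hop at a time reproduces the combination transitions, and skipping RUNNING_P2P when the device lacks P2P support silently merges it into the adjacent arc.

```c
/* User-space model of the next-state table described above; the enum
 * values and names are illustrative, not the kernel definitions. */
enum mig_state { STOP, RUNNING, STOP_COPY, RESUMING, RUNNING_P2P, NUM_STATES };

static const unsigned char next_tbl[NUM_STATES][NUM_STATES] = {
	[STOP] = {
		[STOP] = STOP,           [RUNNING] = RUNNING_P2P,
		[STOP_COPY] = STOP_COPY, [RESUMING] = RESUMING,
		[RUNNING_P2P] = RUNNING_P2P,
	},
	[RUNNING] = {
		[STOP] = RUNNING_P2P,      [RUNNING] = RUNNING,
		[STOP_COPY] = RUNNING_P2P, [RESUMING] = RUNNING_P2P,
		[RUNNING_P2P] = RUNNING_P2P,
	},
	[STOP_COPY] = {
		[STOP] = STOP,           [RUNNING] = STOP,
		[STOP_COPY] = STOP_COPY, [RESUMING] = STOP,
		[RUNNING_P2P] = STOP,
	},
	[RESUMING] = {
		[STOP] = STOP,      [RUNNING] = STOP,
		[STOP_COPY] = STOP, [RESUMING] = RESUMING,
		[RUNNING_P2P] = STOP,
	},
	[RUNNING_P2P] = {
		[STOP] = STOP,      [RUNNING] = RUNNING,
		[STOP_COPY] = STOP, [RESUMING] = STOP,
		[RUNNING_P2P] = RUNNING_P2P,
	},
};

/* Count the driver-visible arcs from cur to target.  Without P2P support
 * every RUNNING_P2P hop is folded into the adjacent arc, so the driver
 * only ever sees RUNNING <-> STOP. */
int count_arcs(enum mig_state cur, enum mig_state target, int have_p2p)
{
	int arcs = 0;

	while (cur != target) {
		cur = next_tbl[cur][target];
		if (!have_p2p && cur == RUNNING_P2P)
			cur = next_tbl[cur][target];
		arcs++;
	}
	return arcs;
}
```

With P2P, RUNNING -> STOP takes two driver arcs (via RUNNING_P2P); without it, the same request is a single RUNNING -> STOP arc, which is the "silent merge" described above.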

Comments

Cornelia Huck Feb. 1, 2022, 11:54 a.m. UTC | #1
On Sun, Jan 30 2022, Yishai Hadas <yishaih@nvidia.com> wrote:

> @@ -44,6 +45,7 @@ struct vfio_device {
>  /**
>   * struct vfio_device_ops - VFIO bus driver device callbacks
>   *
> + * @flags: Global flags from enum vfio_device_ops_flags

You add this here, only to remove it in patch 15 again. Leftover from
some refactoring?

>   * @open_device: Called when the first file descriptor is opened for this device
>   * @close_device: Opposite of open_device
>   * @read: Perform read(2) on device file descriptor
Jason Gunthorpe Feb. 1, 2022, 12:13 p.m. UTC | #2
On Tue, Feb 01, 2022 at 12:54:10PM +0100, Cornelia Huck wrote:
> On Sun, Jan 30 2022, Yishai Hadas <yishaih@nvidia.com> wrote:
> 
> > @@ -44,6 +45,7 @@ struct vfio_device {
> >  /**
> >   * struct vfio_device_ops - VFIO bus driver device callbacks
> >   *
> > + * @flags: Global flags from enum vfio_device_ops_flags
> 
> You add this here, only to remove it in patch 15 again. Leftover from
> some refactoring?

Yes, thanks, it is a rebasing error :\

Jason
Alex Williamson Feb. 1, 2022, 6:31 p.m. UTC | #3
On Sun, 30 Jan 2022 18:08:20 +0200
Yishai Hadas <yishaih@nvidia.com> wrote:

> From: Jason Gunthorpe <jgg@nvidia.com>
> 
> The RUNNING_P2P state is designed to support multiple devices in the same
> VM that are doing P2P transactions between themselves. When in RUNNING_P2P
> the device must be able to accept incoming P2P transactions but should not
> generate outgoing transactions.
> 
> As an optional extension to the mandatory states it is defined as
> in between STOP and RUNNING:
>    STOP -> RUNNING_P2P -> RUNNING -> RUNNING_P2P -> STOP
> 
> For drivers that are unable to support RUNNING_P2P the core code silently
> merges RUNNING_P2P and RUNNING together. Drivers that support this will be
> required to implement 4 FSM arcs beyond the basic FSM. 2 of the basic FSM
> arcs become combination transitions.
> 
> Compared to the v1 clarification, NDMA is redefined into FSM states and is
> described in terms of the desired P2P quiescent behavior, noting that
> halting all DMA is an acceptable implementation.
> 
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> ---
>  drivers/vfio/vfio.c       | 70 ++++++++++++++++++++++++++++++---------
>  include/linux/vfio.h      |  2 ++
>  include/uapi/linux/vfio.h | 34 +++++++++++++++++--
>  3 files changed, 88 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index b12be212d048..a722a1a8a48a 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -1573,39 +1573,55 @@ u32 vfio_mig_get_next_state(struct vfio_device *device,
>  			    enum vfio_device_mig_state cur_fsm,
>  			    enum vfio_device_mig_state new_fsm)
>  {
> -	enum { VFIO_DEVICE_NUM_STATES = VFIO_DEVICE_STATE_RESUMING + 1 };
> +	enum { VFIO_DEVICE_NUM_STATES = VFIO_DEVICE_STATE_RUNNING_P2P + 1 };
>  	/*
> -	 * The coding in this table requires the driver to implement 6
> +	 * The coding in this table requires the driver to implement
>  	 * FSM arcs:
>  	 *         RESUMING -> STOP
> -	 *         RUNNING -> STOP
>  	 *         STOP -> RESUMING
> -	 *         STOP -> RUNNING
>  	 *         STOP -> STOP_COPY
>  	 *         STOP_COPY -> STOP
>  	 *
> -	 * The coding will step through multiple states for these combination
> -	 * transitions:
> -	 *         RESUMING -> STOP -> RUNNING
> +	 * If P2P is supported then the driver must also implement these FSM
> +	 * arcs:
> +	 *         RUNNING -> RUNNING_P2P
> +	 *         RUNNING_P2P -> RUNNING
> +	 *         RUNNING_P2P -> STOP
> +	 *         STOP -> RUNNING_P2P
> +	 * Without P2P the driver must implement:
> +	 *         RUNNING -> STOP
> +	 *         STOP -> RUNNING
> +	 *
> +	 * If all optional features are supported then the coding will step
> +	 * through multiple states for these combination transitions:
> +	 *         RESUMING -> STOP -> RUNNING_P2P
> +	 *         RESUMING -> STOP -> RUNNING_P2P -> RUNNING
>  	 *         RESUMING -> STOP -> STOP_COPY
> -	 *         RUNNING -> STOP -> RESUMING
> -	 *         RUNNING -> STOP -> STOP_COPY
> +	 *         RUNNING -> RUNNING_P2P -> STOP
> +	 *         RUNNING -> RUNNING_P2P -> STOP -> RESUMING
> +	 *         RUNNING -> RUNNING_P2P -> STOP -> STOP_COPY
> +	 *         RUNNING_P2P -> STOP -> RESUMING
> +	 *         RUNNING_P2P -> STOP -> STOP_COPY
> +	 *         STOP -> RUNNING_P2P -> RUNNING
>  	 *         STOP_COPY -> STOP -> RESUMING
> -	 *         STOP_COPY -> STOP -> RUNNING
> +	 *         STOP_COPY -> STOP -> RUNNING_P2P
> +	 *         STOP_COPY -> STOP -> RUNNING_P2P -> RUNNING
>  	 */
>  	static const u8 vfio_from_fsm_table[VFIO_DEVICE_NUM_STATES][VFIO_DEVICE_NUM_STATES] = {
>  		[VFIO_DEVICE_STATE_STOP] = {
>  			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
> -			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
> +			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING_P2P,
>  			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY,
>  			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RESUMING,
> +			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
>  			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
>  		},
>  		[VFIO_DEVICE_STATE_RUNNING] = {
> -			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
> +			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_RUNNING_P2P,
>  			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
> -			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP,
> -			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP,
> +			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_RUNNING_P2P,
> +			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RUNNING_P2P,
> +			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
>  			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
>  		},
>  		[VFIO_DEVICE_STATE_STOP_COPY] = {
> @@ -1613,6 +1629,7 @@ u32 vfio_mig_get_next_state(struct vfio_device *device,
>  			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_STOP,
>  			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY,
>  			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP,
> +			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_STOP,
>  			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
>  		},
>  		[VFIO_DEVICE_STATE_RESUMING] = {
> @@ -1620,6 +1637,15 @@ u32 vfio_mig_get_next_state(struct vfio_device *device,
>  			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_STOP,
>  			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP,
>  			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RESUMING,
> +			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_STOP,
> +			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
> +		},
> +		[VFIO_DEVICE_STATE_RUNNING_P2P] = {
> +			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
> +			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
> +			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP,
> +			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP,
> +			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
>  			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
>  		},
>  		[VFIO_DEVICE_STATE_ERROR] = {
> @@ -1627,14 +1653,26 @@ u32 vfio_mig_get_next_state(struct vfio_device *device,
>  			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_ERROR,
>  			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_ERROR,
>  			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_ERROR,
> +			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_ERROR,
>  			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
>  		},
>  	};
> +	bool have_p2p = device->migration_flags & VFIO_MIGRATION_P2P;
> +
>  	if (cur_fsm >= ARRAY_SIZE(vfio_from_fsm_table) ||
>  	    new_fsm >= ARRAY_SIZE(vfio_from_fsm_table))
>  		return VFIO_DEVICE_STATE_ERROR;
>  
> -	return vfio_from_fsm_table[cur_fsm][new_fsm];
> +	if (!have_p2p && (new_fsm == VFIO_DEVICE_STATE_RUNNING_P2P ||
> +			  cur_fsm == VFIO_DEVICE_STATE_RUNNING_P2P))
> +		return VFIO_DEVICE_STATE_ERROR;

new_fsm is provided by the user, we pass set_state.device_state
directly to .migration_set_state.  We should do bounds checking and
compatibility testing on the end state in the core so that we can
return an appropriate -EINVAL and -ENOSUPP respectively, otherwise
we're giving userspace a path to put the device into ERROR state, which
we claim is not allowed.

Testing cur_fsm is more an internal consistency check, maybe those
should be WARN_ON.

> +
> +	cur_fsm = vfio_from_fsm_table[cur_fsm][new_fsm];
> +	if (!have_p2p) {
> +		while (cur_fsm == VFIO_DEVICE_STATE_RUNNING_P2P)
> +			cur_fsm = vfio_from_fsm_table[cur_fsm][new_fsm];
> +	}

Perhaps this could be generalized with something like:

	static const unsigned int state_flags_table[VFIO_DEVICE_NUM_STATES] = {
		[VFIO_DEVICE_STATE_STOP] = VFIO_MIGRATION_STOP_COPY,
		[VFIO_DEVICE_STATE_RUNNING] = VFIO_MIGRATION_STOP_COPY,
		[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_MIGRATION_STOP_COPY,
		[VFIO_DEVICE_STATE_RESUMING] = VFIO_MIGRATION_STOP_COPY,
		[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_MIGRATION_P2P,
		[VFIO_DEVICE_STATE_ERROR] = ~0U,
	};

	while (!(state_flags_table[cur_fsm] & device->migration_flags))
		cur_fsm = vfio_from_fsm_table[cur_fsm][new_fsm];

Thanks,
Alex
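Alex's suggested generalization can be exercised in the same kind of user-space model (the flag bits, table contents, and names here are illustrative, not the kernel UAPI):

```c
/* User-space model of the state_flags_table generalization suggested
 * above; flag bits and names are illustrative, not the kernel UAPI. */
enum mig_state { STOP, RUNNING, STOP_COPY, RESUMING, RUNNING_P2P, NUM_STATES };

#define MIG_STOP_COPY	(1u << 0)
#define MIG_P2P		(1u << 1)

static const unsigned int state_flags_table[NUM_STATES] = {
	[STOP]        = MIG_STOP_COPY,
	[RUNNING]     = MIG_STOP_COPY,
	[STOP_COPY]   = MIG_STOP_COPY,
	[RESUMING]    = MIG_STOP_COPY,
	[RUNNING_P2P] = MIG_P2P,
};

/* Rows indexed by cur state, columns by target state, in enum order. */
static const unsigned char next_tbl[NUM_STATES][NUM_STATES] = {
	[STOP]        = { STOP, RUNNING_P2P, STOP_COPY, RESUMING, RUNNING_P2P },
	[RUNNING]     = { RUNNING_P2P, RUNNING, RUNNING_P2P, RUNNING_P2P,
			  RUNNING_P2P },
	[STOP_COPY]   = { STOP, STOP, STOP_COPY, STOP, STOP },
	[RESUMING]    = { STOP, STOP, STOP, RESUMING, STOP },
	[RUNNING_P2P] = { STOP, RUNNING, STOP, STOP, RUNNING_P2P },
};

/* One hop of next-state lookup, skipping any state whose feature flag
 * the device does not advertise (the loop Alex proposes). */
enum mig_state next_state(enum mig_state cur, enum mig_state target,
			  unsigned int dev_flags)
{
	cur = next_tbl[cur][target];
	while (!(state_flags_table[cur] & dev_flags))
		cur = next_tbl[cur][target];
	return cur;
}
```

The per-state flag table replaces the open-coded `have_p2p` test, so adding a future optional state only needs a new flag entry rather than another special case.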
Jason Gunthorpe Feb. 1, 2022, 6:53 p.m. UTC | #4
On Tue, Feb 01, 2022 at 11:31:44AM -0700, Alex Williamson wrote:
> > +	bool have_p2p = device->migration_flags & VFIO_MIGRATION_P2P;
> > +
> >  	if (cur_fsm >= ARRAY_SIZE(vfio_from_fsm_table) ||
> >  	    new_fsm >= ARRAY_SIZE(vfio_from_fsm_table))
> >  		return VFIO_DEVICE_STATE_ERROR;
> >  
> > -	return vfio_from_fsm_table[cur_fsm][new_fsm];
> > +	if (!have_p2p && (new_fsm == VFIO_DEVICE_STATE_RUNNING_P2P ||
> > +			  cur_fsm == VFIO_DEVICE_STATE_RUNNING_P2P))
> > +		return VFIO_DEVICE_STATE_ERROR;
> 
> new_fsm is provided by the user, we pass set_state.device_state
> directly to .migration_set_state.  We should do bounds checking and
> compatibility testing on the end state in the core so that we can

This is the core :)

> return an appropriate -EINVAL and -ENOSUPP respectively, otherwise
> we're giving userspace a path to put the device into ERROR state, which
> we claim is not allowed.

Userspace can never put the device into error. As the function comment
says VFIO_DEVICE_STATE_ERROR is returned to indicate the arc is not
permitted. The driver is required to reflect that back as an errno
like mlx5 shows:

+		next_state = vfio_mig_get_next_state(vdev, mvdev->mig_state,
+						     new_state);
+		if (next_state == VFIO_DEVICE_STATE_ERROR) {
+			res = ERR_PTR(-EINVAL);
+			break;
+		}

We never get the driver into error, userspace gets an EINVAL and no
change to the device state.

It is organized this way because the driver controls the locking for
its current state and thus the core code caller along the ioctl path
cannot validate the arc before passing it to the driver. The code is
shared by having the driver callback to the core to validate the
entire fsm arc under its lock.

The driver ends up with a small while loop that will probably be copy
and pasted to each driver. As I said, I'm interested to lift this up
as well but I need to better understand the locking needs of the other
driver implementations first, or we need your patch series to use the
inode for zap to land to eliminate the complicated locking in the
first place.

> Testing cur_fsm is more an internal consistency check, maybe those
> should be WARN_ON.

Sure
 
> > +
> > +	cur_fsm = vfio_from_fsm_table[cur_fsm][new_fsm];
> > +	if (!have_p2p) {
> > +		while (cur_fsm == VFIO_DEVICE_STATE_RUNNING_P2P)
> > +			cur_fsm = vfio_from_fsm_table[cur_fsm][new_fsm];
> > +	}
> 
> Perhaps this could be generalized with something like:

Oh, that table could probably do both tests, if the bit isn't set it
is an invalid cur/next_fsm as well..

Thanks,
Jason
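The driver-side pattern Jason describes — validating the whole arc under the driver's own lock and reflecting an ERROR result back as an errno — can be sketched as a tiny mock (hypothetical names, no real locking, and a reduced state set for brevity):

```c
#include <errno.h>

/* Mock of the driver loop described above: the driver steps one FSM hop
 * at a time and turns an ERROR result from the core helper into -EINVAL
 * with no device state change.  Names are illustrative. */
enum mig_state { ERROR, STOP, RUNNING, NUM_STATES };

static const unsigned char next_tbl[NUM_STATES][NUM_STATES] = {
	[ERROR]   = { [ERROR] = ERROR, [STOP] = ERROR, [RUNNING] = ERROR },
	[STOP]    = { [ERROR] = ERROR, [STOP] = STOP,  [RUNNING] = RUNNING },
	[RUNNING] = { [ERROR] = ERROR, [STOP] = STOP,  [RUNNING] = RUNNING },
};

struct mock_dev { enum mig_state state; };

/* "Core" helper: next hop, or ERROR for a blocked/out-of-bounds arc. */
enum mig_state mig_get_next(enum mig_state cur, enum mig_state target)
{
	if (cur >= NUM_STATES || target >= NUM_STATES)
		return ERROR;
	return next_tbl[cur][target];
}

/* "Driver" callback: would run under the driver's state lock, which is
 * why the core cannot validate the arc before calling the driver. */
int mock_set_state(struct mock_dev *dev, enum mig_state target)
{
	while (dev->state != target) {
		enum mig_state nxt = mig_get_next(dev->state, target);

		if (nxt == ERROR)
			return -EINVAL;	/* userspace sees EINVAL, no change */
		dev->state = nxt;	/* perform the single-hop arc here */
	}
	return 0;
}
```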
Alex Williamson Feb. 1, 2022, 7:13 p.m. UTC | #5
On Tue, 1 Feb 2022 14:53:21 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Feb 01, 2022 at 11:31:44AM -0700, Alex Williamson wrote:
> > > +	bool have_p2p = device->migration_flags & VFIO_MIGRATION_P2P;
> > > +
> > >  	if (cur_fsm >= ARRAY_SIZE(vfio_from_fsm_table) ||
> > >  	    new_fsm >= ARRAY_SIZE(vfio_from_fsm_table))
> > >  		return VFIO_DEVICE_STATE_ERROR;
> > >  
> > > -	return vfio_from_fsm_table[cur_fsm][new_fsm];
> > > +	if (!have_p2p && (new_fsm == VFIO_DEVICE_STATE_RUNNING_P2P ||
> > > +			  cur_fsm == VFIO_DEVICE_STATE_RUNNING_P2P))
> > > +		return VFIO_DEVICE_STATE_ERROR;  
> > 
> > new_fsm is provided by the user, we pass set_state.device_state
> > directly to .migration_set_state.  We should do bounds checking and
> > compatibility testing on the end state in the core so that we can  
> 
> This is the core :)

But this is the wrong place; we need to do it earlier rather than when
we're already iterating next states.  I only mention the core to avoid any
implication that I'm suggesting a per-driver responsibility.

> 
> > return an appropriate -EINVAL and -ENOSUPP respectively, otherwise
> > we're giving userspace a path to put the device into ERROR state, which
> > we claim is not allowed.  
> 
> Userspace can never put the device into error. As the function comment
> says VFIO_DEVICE_STATE_ERROR is returned to indicate the arc is not
> permitted. The driver is required to reflect that back as an errno
> like mlx5 shows:
> 
> +		next_state = vfio_mig_get_next_state(vdev, mvdev->mig_state,
> +						     new_state);
> +		if (next_state == VFIO_DEVICE_STATE_ERROR) {
> +			res = ERR_PTR(-EINVAL);
> +			break;
> +		}
> 
> We never get the driver into error, userspace gets an EINVAL and no
> change to the device state.

Hmm, subtle.  I'd argue that if we do a bounds and support check of the
end state in vfio_ioctl_mig_set_state() before calling
.migration_set_state() then we could remove ERROR from
vfio_from_fsm_table[] altogether and simply begin
vfio_mig_get_next_state() with:

	if (cur_fsm == ERROR)
		return ERROR;

Then we only get to ERROR by the driver placing us in ERROR and things
feel a bit more sane to me.

> It is organized this way because the driver controls the locking for
> its current state and thus the core code caller along the ioctl path
> cannot validate the arc before passing it to the driver. The code is
> shared by having the driver callback to the core to validate the
> entire fsm arc under its lock.

P2P is defined in a way that if the endpoint is valid then the arc is
valid.  We skip intermediate unsupported states.  We need to do that
for compatibility.  So why do we care about driver locking to do that?
Thanks,

Alex
Jason Gunthorpe Feb. 1, 2022, 7:50 p.m. UTC | #6
On Tue, Feb 01, 2022 at 12:13:22PM -0700, Alex Williamson wrote:
> On Tue, 1 Feb 2022 14:53:21 -0400
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Tue, Feb 01, 2022 at 11:31:44AM -0700, Alex Williamson wrote:
> > > > +	bool have_p2p = device->migration_flags & VFIO_MIGRATION_P2P;
> > > > +
> > > >  	if (cur_fsm >= ARRAY_SIZE(vfio_from_fsm_table) ||
> > > >  	    new_fsm >= ARRAY_SIZE(vfio_from_fsm_table))
> > > >  		return VFIO_DEVICE_STATE_ERROR;
> > > >  
> > > > -	return vfio_from_fsm_table[cur_fsm][new_fsm];
> > > > +	if (!have_p2p && (new_fsm == VFIO_DEVICE_STATE_RUNNING_P2P ||
> > > > +			  cur_fsm == VFIO_DEVICE_STATE_RUNNING_P2P))
> > > > +		return VFIO_DEVICE_STATE_ERROR;  
> > > 
> > > new_fsm is provided by the user, we pass set_state.device_state
> > > directly to .migration_set_state.  We should do bounds checking and
> > > compatibility testing on the end state in the core so that we can  
> > 
> > This is the core :)
> 
> But this is the wrong place, we need to do it earlier rather than when
> we're already iterating next states.  I only mention core to avoid that
> I'm suggesting a per driver responsibility.

Only the first vfio_mig_get_next_state() can return ERROR, once it
succeeds the subsequent ones must also succeed.

This is the earliest it can be. It is done directly after taking the lock
that allows us to read the current state to call this function to
determine if the requested transition is acceptable.

> > Userspace can never put the device into error. As the function comment
> > says VFIO_DEVICE_STATE_ERROR is returned to indicate the arc is not
> > permitted. The driver is required to reflect that back as an errno
> > like mlx5 shows:
> > 
> > +		next_state = vfio_mig_get_next_state(vdev, mvdev->mig_state,
> > +						     new_state);
> > +		if (next_state == VFIO_DEVICE_STATE_ERROR) {
> > +			res = ERR_PTR(-EINVAL);
> > +			break;
> > +		}
> > 
> > We never get the driver into error, userspace gets an EINVAL and no
> > change to the device state.
> 
> Hmm, subtle.  I'd argue that if we do a bounds and support check of the
> end state in vfio_ioctl_mig_set_state() before calling
> .migration_set_state() then we could remove ERROR from
> vfio_from_fsm_table[] altogether and simply begin
> vfio_mig_get_next_state() with:

Then we can't reject blocked arcs like STOP_COPY -> PRE_COPY.

It is setup this way to allow the core code to assert all policy, not
just a simple validation of the next_fsm.

> Then we only get to ERROR by the driver placing us in ERROR and things
> feel a bit more sane to me.

This is already true.

Perhaps it is confusing using ERROR to indicate that
vfio_mig_get_next_state() failed. Would you be happier with a -errno
return?

> > It is organized this way because the driver controls the locking for
> > its current state and thus the core code caller along the ioctl path
> > cannot validate the arc before passing it to the driver. The code is
> > shared by having the driver callback to the core to validate the
> > entire fsm arc under its lock.
> 
> P2P is defined in a way that if the endpoint is valid then the arc is
> valid.  We skip intermediate unsupported states.  We need to do that
> for compatibility.  So why do we care about driver locking to do
> that?

Without the driver locking we can't identify the arc because we don't
know the current state the driver is in. We only know the target
state.

Jason
Alex Williamson Feb. 2, 2022, 11:54 p.m. UTC | #7
On Tue, 1 Feb 2022 15:50:03 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Feb 01, 2022 at 12:13:22PM -0700, Alex Williamson wrote:
> > On Tue, 1 Feb 2022 14:53:21 -0400
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > On Tue, Feb 01, 2022 at 11:31:44AM -0700, Alex Williamson wrote:  
> > > > > +	bool have_p2p = device->migration_flags & VFIO_MIGRATION_P2P;
> > > > > +
> > > > >  	if (cur_fsm >= ARRAY_SIZE(vfio_from_fsm_table) ||
> > > > >  	    new_fsm >= ARRAY_SIZE(vfio_from_fsm_table))
> > > > >  		return VFIO_DEVICE_STATE_ERROR;
> > > > >  
> > > > > -	return vfio_from_fsm_table[cur_fsm][new_fsm];
> > > > > +	if (!have_p2p && (new_fsm == VFIO_DEVICE_STATE_RUNNING_P2P ||
> > > > > +			  cur_fsm == VFIO_DEVICE_STATE_RUNNING_P2P))
> > > > > +		return VFIO_DEVICE_STATE_ERROR;    
> > > > 
> > > > new_fsm is provided by the user, we pass set_state.device_state
> > > > directly to .migration_set_state.  We should do bounds checking and
> > > > compatibility testing on the end state in the core so that we can    
> > > 
> > > This is the core :)  
> > 
> > But this is the wrong place, we need to do it earlier rather than when
> > we're already iterating next states.  I only mention core to avoid that
> > I'm suggesting a per driver responsibility.  
> 
> Only the first vfio_mig_get_next_state() can return ERROR, once it
> succeeds the subsequent ones must also succeed.

Yes, I see that.

> This is the earliest can be. It is done directly after taking the lock
> that allows us to read the current state to call this function to
> determine if the requested transition is acceptable.

I think the argument here is that there's no value to validating or
bounds checking the end state, which could be done in the core ioctl
before calling the driver if the first iteration will already fail for
both the end state and the full path validation.

> > > Userspace can never put the device into error. As the function comment
> > > says VFIO_DEVICE_STATE_ERROR is returned to indicate the arc is not
> > > permitted. The driver is required to reflect that back as an errno
> > > like mlx5 shows:
> > > 
> > > +		next_state = vfio_mig_get_next_state(vdev, mvdev->mig_state,
> > > +						     new_state);
> > > +		if (next_state == VFIO_DEVICE_STATE_ERROR) {
> > > +			res = ERR_PTR(-EINVAL);
> > > +			break;
> > > +		}
> > > 
> > > We never get the driver into error, userspace gets an EINVAL and no
> > > change to the device state.  
> > 
> > Hmm, subtle.  I'd argue that if we do a bounds and support check of the
> > end state in vfio_ioctl_mig_set_state() before calling
> > .migration_set_state() then we could remove ERROR from
> > vfio_from_fsm_table[] altogether and simply begin
> > vfio_mig_get_next_state() with:  
> 
> Then we can't reject blocked arcs like STOP_COPY -> PRE_COPY.

Right, I hadn't made it through to 15/, which helps to clarify how the
cur_fsm + new_fsm validate the full arc.
 
> It is setup this way to allow the core code to assert all policy, not
> just a simple validation of the next_fsm.
> 
> > Then we only get to ERROR by the driver placing us in ERROR and things
> > feel a bit more sane to me.  
> 
> This is already true.
> 
> Perhaps it is confusing using ERROR to indicate that
> vfio_mig_get_next_state() failed. Would you be happier with a -errno
> return?

Yes, it's confusing to me that next_state() returns states that don't
become the device_state.  Stuffing the next step back into cur_fsm and
using an errno for a bounds/validity/blocked-arc test would be a better
API.  Thanks,

Alex
Jason Gunthorpe Feb. 3, 2022, 2:22 p.m. UTC | #8
On Wed, Feb 02, 2022 at 04:54:44PM -0700, Alex Williamson wrote:

> I think the argument here is that there's no value to validating or
> bounds checking the end state, which could be done in the core ioctl
> before calling the driver if the first iteration will already fail for
> both the end state and the full path validation.

Yes, I had a version like this in an internal draft, it was something
like this:

if (vfio_mig_get_next_state(vdev, set_state.device_state, 
     set_state.device_state) == VFIO_DEVICE_STATE_ERROR)
    return -EINVAL;

Which is fully redundant with the driver, only does half the check and
looks weird.

> > Perhaps it is confusing using ERROR to indicate that
> > vfio_mig_get_next_state() failed. Would you be happier with a -errno
> > return?
> 
> Yes, it's confusing to me that next_state() returns states that don't
> become the device_state.  Stuffing the next step back into cur_fsm and
> using an errno for a bounds/validity/blocked-arc test would be a better
> API.  Thanks,

OK

Jason
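The errno-style interface agreed on at the end of the thread might look like this sketch (an assumption about the shape only; the signature eventually merged may differ, and the state set is reduced for brevity):

```c
#include <errno.h>

/* Sketch of the errno-returning variant: the next hop comes back through
 * an out-parameter, and a blocked or out-of-bounds arc yields -EINVAL,
 * so ERROR never doubles as a failure code.  Names are illustrative. */
enum mig_state { ERROR, STOP, RUNNING, NUM_STATES };

static const unsigned char next_tbl[NUM_STATES][NUM_STATES] = {
	[ERROR]   = { [ERROR] = ERROR, [STOP] = ERROR, [RUNNING] = ERROR },
	[STOP]    = { [ERROR] = ERROR, [STOP] = STOP,  [RUNNING] = RUNNING },
	[RUNNING] = { [ERROR] = ERROR, [STOP] = STOP,  [RUNNING] = RUNNING },
};

int mig_get_next_state(enum mig_state cur, enum mig_state target,
		       enum mig_state *next_fsm)
{
	if (cur >= NUM_STATES || target >= NUM_STATES)
		return -EINVAL;
	if (next_tbl[cur][target] == ERROR)
		return -EINVAL;
	*next_fsm = next_tbl[cur][target];
	return 0;
}
```

With this shape the caller stuffs the next step back into its current-state variable on success, and ERROR is reachable only when the driver itself places the device there.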

Patch

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index b12be212d048..a722a1a8a48a 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1573,39 +1573,55 @@  u32 vfio_mig_get_next_state(struct vfio_device *device,
 			    enum vfio_device_mig_state cur_fsm,
 			    enum vfio_device_mig_state new_fsm)
 {
-	enum { VFIO_DEVICE_NUM_STATES = VFIO_DEVICE_STATE_RESUMING + 1 };
+	enum { VFIO_DEVICE_NUM_STATES = VFIO_DEVICE_STATE_RUNNING_P2P + 1 };
 	/*
-	 * The coding in this table requires the driver to implement 6
+	 * The coding in this table requires the driver to implement
 	 * FSM arcs:
 	 *         RESUMING -> STOP
-	 *         RUNNING -> STOP
 	 *         STOP -> RESUMING
-	 *         STOP -> RUNNING
 	 *         STOP -> STOP_COPY
 	 *         STOP_COPY -> STOP
 	 *
-	 * The coding will step through multiple states for these combination
-	 * transitions:
-	 *         RESUMING -> STOP -> RUNNING
+	 * If P2P is supported then the driver must also implement these FSM
+	 * arcs:
+	 *         RUNNING -> RUNNING_P2P
+	 *         RUNNING_P2P -> RUNNING
+	 *         RUNNING_P2P -> STOP
+	 *         STOP -> RUNNING_P2P
+	 * Without P2P the driver must implement:
+	 *         RUNNING -> STOP
+	 *         STOP -> RUNNING
+	 *
+	 * If all optional features are supported then the coding will step
+	 * through multiple states for these combination transitions:
+	 *         RESUMING -> STOP -> RUNNING_P2P
+	 *         RESUMING -> STOP -> RUNNING_P2P -> RUNNING
 	 *         RESUMING -> STOP -> STOP_COPY
-	 *         RUNNING -> STOP -> RESUMING
-	 *         RUNNING -> STOP -> STOP_COPY
+	 *         RUNNING -> RUNNING_P2P -> STOP
+	 *         RUNNING -> RUNNING_P2P -> STOP -> RESUMING
+	 *         RUNNING -> RUNNING_P2P -> STOP -> STOP_COPY
+	 *         RUNNING_P2P -> STOP -> RESUMING
+	 *         RUNNING_P2P -> STOP -> STOP_COPY
+	 *         STOP -> RUNNING_P2P -> RUNNING
 	 *         STOP_COPY -> STOP -> RESUMING
-	 *         STOP_COPY -> STOP -> RUNNING
+	 *         STOP_COPY -> STOP -> RUNNING_P2P
+	 *         STOP_COPY -> STOP -> RUNNING_P2P -> RUNNING
 	 */
 	static const u8 vfio_from_fsm_table[VFIO_DEVICE_NUM_STATES][VFIO_DEVICE_NUM_STATES] = {
 		[VFIO_DEVICE_STATE_STOP] = {
 			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
-			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY,
 			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RESUMING,
+			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
 		},
 		[VFIO_DEVICE_STATE_RUNNING] = {
-			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
-			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP,
-			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
 		},
 		[VFIO_DEVICE_STATE_STOP_COPY] = {
@@ -1613,6 +1629,7 @@  u32 vfio_mig_get_next_state(struct vfio_device *device,
 			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY,
 			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
 		},
 		[VFIO_DEVICE_STATE_RESUMING] = {
@@ -1620,6 +1637,15 @@  u32 vfio_mig_get_next_state(struct vfio_device *device,
 			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RESUMING,
+			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
+		},
+		[VFIO_DEVICE_STATE_RUNNING_P2P] = {
+			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
 		},
 		[VFIO_DEVICE_STATE_ERROR] = {
@@ -1627,14 +1653,26 @@  u32 vfio_mig_get_next_state(struct vfio_device *device,
 			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_ERROR,
 			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_ERROR,
 			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_ERROR,
+			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_ERROR,
 			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
 		},
 	};
+	bool have_p2p = device->migration_flags & VFIO_MIGRATION_P2P;
+
 	if (cur_fsm >= ARRAY_SIZE(vfio_from_fsm_table) ||
 	    new_fsm >= ARRAY_SIZE(vfio_from_fsm_table))
 		return VFIO_DEVICE_STATE_ERROR;
 
-	return vfio_from_fsm_table[cur_fsm][new_fsm];
+	if (!have_p2p && (new_fsm == VFIO_DEVICE_STATE_RUNNING_P2P ||
+			  cur_fsm == VFIO_DEVICE_STATE_RUNNING_P2P))
+		return VFIO_DEVICE_STATE_ERROR;
+
+	cur_fsm = vfio_from_fsm_table[cur_fsm][new_fsm];
+	if (!have_p2p) {
+		while (cur_fsm == VFIO_DEVICE_STATE_RUNNING_P2P)
+			cur_fsm = vfio_from_fsm_table[cur_fsm][new_fsm];
+	}
+	return cur_fsm;
 }
 EXPORT_SYMBOL_GPL(vfio_mig_get_next_state);
 
@@ -1719,7 +1757,7 @@  static int vfio_ioctl_device_feature_migration(struct vfio_device *device,
 					       size_t argsz)
 {
 	struct vfio_device_feature_migration mig = {
-		.flags = VFIO_MIGRATION_STOP_COPY,
+		.flags = device->migration_flags,
 	};
 	int ret;
 
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 697790ec4065..69a574ba085e 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -33,6 +33,7 @@  struct vfio_device {
 	struct vfio_group *group;
 	struct vfio_device_set *dev_set;
 	struct list_head dev_set_list;
+	unsigned int migration_flags;
 
 	/* Members below here are private, not for driver use */
 	refcount_t refcount;
@@ -44,6 +45,7 @@  struct vfio_device {
 /**
  * struct vfio_device_ops - VFIO bus driver device callbacks
  *
+ * @flags: Global flags from enum vfio_device_ops_flags
  * @open_device: Called when the first file descriptor is opened for this device
  * @close_device: Opposite of open_device
  * @read: Perform read(2) on device file descriptor
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index d9162702973a..9efc35535b29 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1009,10 +1009,16 @@  struct vfio_device_feature {
  *
  * VFIO_MIGRATION_STOP_COPY means that RUNNING, STOP, STOP_COPY and
  * RESUMING are supported.
+ *
+ * VFIO_MIGRATION_STOP_COPY | VFIO_MIGRATION_P2P means that RUNNING_P2P
+ * is supported in addition to the STOP_COPY states.
+ *
+ * Other combinations of flags have behavior to be defined in the future.
  */
 struct vfio_device_feature_migration {
 	__aligned_u64 flags;
 #define VFIO_MIGRATION_STOP_COPY	(1 << 0)
+#define VFIO_MIGRATION_P2P		(1 << 1)
 };
 #define VFIO_DEVICE_FEATURE_MIGRATION 1
 
@@ -1029,10 +1035,13 @@  struct vfio_device_feature_migration {
  *  RESUMING - The device is stopped and is loading a new internal state
  *  ERROR - The device has failed and must be reset
  *
+ * And 1 optional state to support VFIO_MIGRATION_P2P:
+ *  RUNNING_P2P - RUNNING, except the device cannot do peer to peer DMA
+ *
  * The FSM takes actions on the arcs between FSM states. The driver implements
  * the following behavior for the FSM arcs:
  *
- * RUNNING -> STOP
+ * RUNNING_P2P -> STOP
  * STOP_COPY -> STOP
  *   While in STOP the device must stop the operation of the device. The
  *   device must not generate interrupts, DMA, or advance its internal
@@ -1059,11 +1068,16 @@  struct vfio_device_feature_migration {
  *
  *   To abort a RESUMING session the device must be reset.
  *
- * STOP -> RUNNING
+ * RUNNING_P2P -> RUNNING
  *   While in RUNNING the device is fully operational, the device may generate
  *   interrupts, DMA, respond to MMIO, all vfio device regions are functional,
  *   and the device may advance its internal state.
  *
+ * RUNNING -> RUNNING_P2P
+ * STOP -> RUNNING_P2P
+ *   While in RUNNING_P2P the device is partially running in the P2P quiescent
+ *   state defined below.
+ *
  * STOP -> STOP_COPY
+ *   This arc begins the process of saving the device state and will return a
  *   new data_fd.
@@ -1094,6 +1108,16 @@  struct vfio_device_feature_migration {
  *   recover from ERROR VFIO_DEVICE_RESET must be used to return the
  *   device_state back to RUNNING.
  *
+ * The optional peer to peer (P2P) quiescent state is intended to be a quiescent
+ * state for the device for the purposes of managing multiple devices within a
+ * user context where peer-to-peer DMA between devices may be active. The
+ * RUNNING_P2P state must prevent the device from initiating
+ * any new P2P DMA transactions. If the device can identify P2P transactions
+ * then it can stop only P2P DMA, otherwise it must stop all DMA. The migration
+ * driver must complete any such outstanding operations prior to completing the
+ * FSM arc into a P2P state. For the purposes of this specification, an
+ * unsupported optional state behaves as though the device is fully running.
+ *
  * The remaining possible transitions are interpreted as combinations of the
  * above FSM arcs. As there are multiple paths through the FSM arcs the path
  * should be selected based on the following rules:
@@ -1106,6 +1130,11 @@  struct vfio_device_feature_migration {
  * fails. When handling these types of errors users should anticipate future
  * revisions of this protocol using new states and those states becoming
  * visible in this case.
+ *
+ * The optional states cannot be used with SET_STATE if the device does not
+ * support them. The user can discover if these states are supported by using
+ * VFIO_DEVICE_FEATURE_MIGRATION. By using combination transitions the user can
+ * avoid knowing about these optional states if the kernel driver supports them.
  */
 enum vfio_device_mig_state {
 	VFIO_DEVICE_STATE_ERROR = 0,
@@ -1113,6 +1142,7 @@  enum vfio_device_mig_state {
 	VFIO_DEVICE_STATE_RUNNING = 2,
 	VFIO_DEVICE_STATE_STOP_COPY = 3,
 	VFIO_DEVICE_STATE_RESUMING = 4,
+	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
 };
 
 /**