Message ID | 20220220095716.153757-10-yishaih@nvidia.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Add mlx5 live migration driver and v2 migration protocol |
> From: Yishai Hadas <yishaih@nvidia.com>
> Sent: Sunday, February 20, 2022 5:57 PM
>
> From: Jason Gunthorpe <jgg@nvidia.com>
>
> Replace the existing region based migration protocol with an ioctl based
> protocol. The two protocols have the same general semantic behaviors, but
> the way the data is transported is changed.
>
> This is the STOP_COPY portion of the new protocol, it defines the 5 states
> for basic stop and copy migration and the protocol to move the migration
> data in/out of the kernel.
>
> Compared to the clarification of the v1 protocol Alex proposed:
>
> https://lore.kernel.org/r/163909282574.728533.7460416142511440919.stgit@omen
>
> This has a few deliberate functional differences:
>
>  - ERROR arcs allow the device function to remain unchanged.
>
>  - The protocol is not required to return to the original state on
>    transition failure. Instead userspace can execute an unwind back to
>    the original state, reset, or do something else without needing kernel
>    support. This simplifies the kernel design and should userspace choose
>    a policy like always reset, avoids doing useless work in the kernel
>    on error handling paths.
>
>  - PRE_COPY is made optional, userspace must discover it before using it.
>    This reflects the fact that the majority of drivers we are aware of
>    right now will not implement PRE_COPY.
>
>  - segmentation is not part of the data stream protocol, the receiver
>    does not have to reproduce the framing boundaries.
>
> The hybrid FSM for the device_state is described as a Mealy machine by
> documenting each of the arcs the driver is required to implement. Defining
> the remaining set of old/new device_state transitions as 'combination
> transitions' which are naturally defined as taking multiple FSM arcs along
> the shortest path within the FSM's digraph allows a complete matrix of
> transitions.
>
> A new VFIO_DEVICE_FEATURE of VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE is
> defined to replace writing to the device_state field in the region. This
> allows returning a brand new FD whenever the requested transition opens
> a data transfer session.
>
> The VFIO core code implements the new feature and provides a helper
> function to the driver. Using the helper the driver only has to
> implement 6 of the FSM arcs and the other combination transitions are
> elaborated consistently from those arcs.
>
> A new VFIO_DEVICE_FEATURE of VFIO_DEVICE_FEATURE_MIGRATION is defined to
> report the capability for migration and indicate which set of states and
> arcs are supported by the device. The FSM provides a lot of flexibility to
> make backwards compatible extensions but the VFIO_DEVICE_FEATURE also
> allows for future breaking extensions for scenarios that cannot support
> even the basic STOP_COPY requirements.
>
> The VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE with the GET option (i.e.
> VFIO_DEVICE_FEATURE_GET) can be used to read the current migration state
> of the VFIO device.
>
> Data transfer sessions are now carried over a file descriptor, instead of
> the region. The FD functions for the lifetime of the data transfer
> session. read() and write() transfer the data with normal Linux stream FD
> semantics. This design allows future expansion to support poll(),
> io_uring, and other performance optimizations.
>
> The complicated mmap mode for data transfer is discarded as current qemu
> doesn't take meaningful advantage of it, and the new qemu implementation
> avoids substantially all the performance penalty of using a read() on the
> region.
>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

> ---
>  drivers/vfio/vfio.c       | 199 ++++++++++++++++++++++++++++++++++++++
>  include/linux/vfio.h      |  18 ++++
>  include/uapi/linux/vfio.h | 173 ++++++++++++++++++++++++++++++---
>  3 files changed, 377 insertions(+), 13 deletions(-)
>
> [...]
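To make the new flow concrete, the following is a rough, untested userspace sketch of a STOP_COPY save using the structures this patch adds. The buffer layout for the VFIO_DEVICE_FEATURE ioctl, the save_device_state() helper name, the chunk size and the out_fd destination are illustrative assumptions, and error handling is abbreviated:

#include <linux/vfio.h>
#include <sys/ioctl.h>
#include <unistd.h>

static int save_device_state(int device_fd, int out_fd)
{
	/* vfio_device_feature header followed by the mig_state payload */
	__u64 buf[(sizeof(struct vfio_device_feature) +
		   sizeof(struct vfio_device_feature_mig_state) + 7) / 8] = {};
	struct vfio_device_feature *feature = (void *)buf;
	struct vfio_device_feature_mig_state *mig = (void *)feature->data;
	char chunk[65536];
	ssize_t n;

	feature->argsz = sizeof(buf);
	feature->flags = VFIO_DEVICE_FEATURE_SET |
			 VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE;
	/* The core elaborates RUNNING -> STOP -> STOP_COPY as needed */
	mig->device_state = VFIO_DEVICE_STATE_STOP_COPY;

	if (ioctl(device_fd, VFIO_DEVICE_FEATURE, feature))
		return -1;

	/* The migration stream is opaque; read until end of stream */
	while ((n = read(mig->data_fd, chunk, sizeof(chunk))) > 0)
		if (write(out_fd, chunk, n) != n) {
			n = -1;
			break;
		}

	close(mig->data_fd);
	return n == 0 ? 0 : -1;
}

The RESUMING direction is symmetrical: a SET_STATE to RESUMING returns a data_fd that the saved bytes are written into, and a final SET_STATE to STOP ends the session.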
On Sun, 20 Feb 2022 11:57:10 +0200 Yishai Hadas <yishaih@nvidia.com> wrote: > From: Jason Gunthorpe <jgg@nvidia.com> > > Replace the existing region based migration protocol with an ioctl based > protocol. The two protocols have the same general semantic behaviors, but > the way the data is transported is changed. > > This is the STOP_COPY portion of the new protocol, it defines the 5 states > for basic stop and copy migration and the protocol to move the migration > data in/out of the kernel. > > Compared to the clarification of the v1 protocol Alex proposed: > > https://lore.kernel.org/r/163909282574.728533.7460416142511440919.stgit@omen > > This has a few deliberate functional differences: > > - ERROR arcs allow the device function to remain unchanged. > > - The protocol is not required to return to the original state on > transition failure. Instead userspace can execute an unwind back to > the original state, reset, or do something else without needing kernel > support. This simplifies the kernel design and should userspace choose > a policy like always reset, avoids doing useless work in the kernel > on error handling paths. > > - PRE_COPY is made optional, userspace must discover it before using it. > This reflects the fact that the majority of drivers we are aware of > right now will not implement PRE_COPY. > > - segmentation is not part of the data stream protocol, the receiver > does not have to reproduce the framing boundaries. I'm not sure how to reconcile the statement above with: "The user must consider the migration data segments carried over the FD to be opaque and non-fungible. During RESUMING, the data segments must be written in the same order they came out of the saving side FD." This is subtly conflicting that it's not segmented, but segments must be written in order. We'll naturally have some segmentation due to buffering in kernel and userspace, but I think referring to it as a stream suggests that the user can cut and join segments arbitrarily so long as byte order is preserved, right? I suspect the commit log comment is referring to the driver imposed segmentation and framing relative to region offsets. Maybe something like: "The user must consider the migration data stream carried over the FD to be opaque and must preserve the byte order of the stream. The user is not required to preserve buffer segmentation when writing the data stream during the RESUMING operation." This statement also gives me pause relative to Jason's comments regarding async support: > + * The kernel migration driver must fully transition the device to the new state > + * value before the operation returns to the user. The above statement certainly doesn't preclude asynchronous availability of data on the stream FD, but it does demand that the device state transition itself is synchronous and can cannot be shortcut. If the state transition itself exceeds migration SLAs, we're in a pickle. Thanks, Alex
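If the intended semantics are what Alex's rewording suggests (byte order preserved, source-side segmentation irrelevant), a resuming-side writer could be as simple as the hedged sketch below; restore_stream() and the 4 KiB buffer size are arbitrary illustrative choices with no relation to any source-side framing:

#include <unistd.h>

/* Copy a saved migration stream into the RESUMING data_fd. The chunking
 * below is unrelated to how the saving side's read() calls were sized;
 * only the byte order of the stream is preserved. */
static int restore_stream(int saved_fd, int data_fd)
{
	char buf[4096];
	ssize_t n;

	while ((n = read(saved_fd, buf, sizeof(buf))) > 0) {
		char *p = buf;

		while (n > 0) {
			ssize_t w = write(data_fd, p, n);

			if (w < 0)
				return -1;
			p += w;
			n -= w;
		}
	}
	return n < 0 ? -1 : 0;
}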
On Tue, Feb 22, 2022 at 04:53:00PM -0700, Alex Williamson wrote: > On Sun, 20 Feb 2022 11:57:10 +0200 > Yishai Hadas <yishaih@nvidia.com> wrote: > > > From: Jason Gunthorpe <jgg@nvidia.com> > > > > Replace the existing region based migration protocol with an ioctl based > > protocol. The two protocols have the same general semantic behaviors, but > > the way the data is transported is changed. > > > > This is the STOP_COPY portion of the new protocol, it defines the 5 states > > for basic stop and copy migration and the protocol to move the migration > > data in/out of the kernel. > > > > Compared to the clarification of the v1 protocol Alex proposed: > > > > https://lore.kernel.org/r/163909282574.728533.7460416142511440919.stgit@omen > > > > This has a few deliberate functional differences: > > > > - ERROR arcs allow the device function to remain unchanged. > > > > - The protocol is not required to return to the original state on > > transition failure. Instead userspace can execute an unwind back to > > the original state, reset, or do something else without needing kernel > > support. This simplifies the kernel design and should userspace choose > > a policy like always reset, avoids doing useless work in the kernel > > on error handling paths. > > > > - PRE_COPY is made optional, userspace must discover it before using it. > > This reflects the fact that the majority of drivers we are aware of > > right now will not implement PRE_COPY. > > > > - segmentation is not part of the data stream protocol, the receiver > > does not have to reproduce the framing boundaries. > > I'm not sure how to reconcile the statement above with: > > "The user must consider the migration data segments carried > over the FD to be opaque and non-fungible. During RESUMING, the > data segments must be written in the same order they came out > of the saving side FD." > > This is subtly conflicting that it's not segmented, but segments must > be written in order. We'll naturally have some segmentation due to > buffering in kernel and userspace, but I think referring to it as a > stream suggests that the user can cut and join segments arbitrarily so > long as byte order is preserved, right? Yes, it is just some odd language that carried over from the v1 language > I suspect the commit log comment is referring to the driver imposed > segmentation and framing relative to region offsets. v1 had some special behavior where qemu would carry each data_size as a single unit to the other side present it whole to the migration region. We couldn't find any use case for this, and it wasn't clear if this was deliberate or just a quirk of qemu's implementation. We tossed it because doing an extra ioctl or something to learn this framing would hurt a zero-copy async iouring data mover scheme. > Maybe something like: > > "The user must consider the migration data stream carried over > the FD to be opaque and must preserve the byte order of the > stream. The user is not required to preserve buffer > segmentation when writing the data stream during the RESUMING > operation." Yes > > + * The kernel migration driver must fully transition the device to the new state > > + * value before the operation returns to the user. > > The above statement certainly doesn't preclude asynchronous > availability of data on the stream FD, but it does demand that the > device state transition itself is synchronous and can cannot be > shortcut. If the state transition itself exceeds migration SLAs, we're > in a pickle. 
> Thanks,

Even if the commands were async, it is not easy to believe a device can instantaneously abort an arc when a timer hits and return to full operation. For instance, mlx5 can't do this. The vCPU cannot be restarted to try to meet the SLA until a command going back to RUNNING returns.

If we want to have a SLA feature it feels better to pass in the deadline time as part of the set state ioctl and the driver can then internally do something appropriate and not have to figure out how to juggle an external abort. The driver would be expected to return fully completed from STOP or return back to RUNNING before the deadline. For instance mlx5 could possibly implement this by checking the migration size and doing some maths before deciding if it should commit to its unabortable device command.

I have a feeling supporting SLA means devices are going to have to report latencies for various arcs and work in a more classical realtime deadline oriented way overall. Estimating the transfer latency and size is another factor too.

Overall, this SLA topic looks quite big to me, and I think a full solution will come with many facets. We are also quite interested in dirty rate limiting, for instance.

Thanks, Jason
On Tue, 22 Feb 2022 20:21:36 -0400 Jason Gunthorpe <jgg@nvidia.com> wrote: > On Tue, Feb 22, 2022 at 04:53:00PM -0700, Alex Williamson wrote: > > On Sun, 20 Feb 2022 11:57:10 +0200 > > Yishai Hadas <yishaih@nvidia.com> wrote: > > > > > From: Jason Gunthorpe <jgg@nvidia.com> > > > > > > Replace the existing region based migration protocol with an ioctl based > > > protocol. The two protocols have the same general semantic behaviors, but > > > the way the data is transported is changed. > > > > > > This is the STOP_COPY portion of the new protocol, it defines the 5 states > > > for basic stop and copy migration and the protocol to move the migration > > > data in/out of the kernel. > > > > > > Compared to the clarification of the v1 protocol Alex proposed: > > > > > > https://lore.kernel.org/r/163909282574.728533.7460416142511440919.stgit@omen > > > > > > This has a few deliberate functional differences: > > > > > > - ERROR arcs allow the device function to remain unchanged. > > > > > > - The protocol is not required to return to the original state on > > > transition failure. Instead userspace can execute an unwind back to > > > the original state, reset, or do something else without needing kernel > > > support. This simplifies the kernel design and should userspace choose > > > a policy like always reset, avoids doing useless work in the kernel > > > on error handling paths. > > > > > > - PRE_COPY is made optional, userspace must discover it before using it. > > > This reflects the fact that the majority of drivers we are aware of > > > right now will not implement PRE_COPY. > > > > > > - segmentation is not part of the data stream protocol, the receiver > > > does not have to reproduce the framing boundaries. > > > > I'm not sure how to reconcile the statement above with: > > > > "The user must consider the migration data segments carried > > over the FD to be opaque and non-fungible. During RESUMING, the > > data segments must be written in the same order they came out > > of the saving side FD." > > > > This is subtly conflicting that it's not segmented, but segments must > > be written in order. We'll naturally have some segmentation due to > > buffering in kernel and userspace, but I think referring to it as a > > stream suggests that the user can cut and join segments arbitrarily so > > long as byte order is preserved, right? > > Yes, it is just some odd language that carried over from the v1 language > > > I suspect the commit log comment is referring to the driver imposed > > segmentation and framing relative to region offsets. > > v1 had some special behavior where qemu would carry each data_size as > a single unit to the other side present it whole to the migration > region. We couldn't find any use case for this, and it wasn't clear if > this was deliberate or just a quirk of qemu's implementation. > > We tossed it because doing an extra ioctl or something to learn this > framing would hurt a zero-copy async iouring data mover scheme. It was deliberate in the v1 because the data region might cover both emulated and direct mapped ranges and might do so in combinations. For instance the driver could create a "frame" where the header lands in emulated space to validate sequencing and setup the fault address for mmap access. A driver might use a windowing scheme to iterate across a giant framebuffer, for example. 
> > Maybe something like: > > > > "The user must consider the migration data stream carried over > > the FD to be opaque and must preserve the byte order of the > > stream. The user is not required to preserve buffer > > segmentation when writing the data stream during the RESUMING > > operation." > > Yes > > > > + * The kernel migration driver must fully transition the device to the new state > > > + * value before the operation returns to the user. > > > > The above statement certainly doesn't preclude asynchronous > > availability of data on the stream FD, but it does demand that the > > device state transition itself is synchronous and can cannot be > > shortcut. If the state transition itself exceeds migration SLAs, we're > > in a pickle. Thanks, > > Even if the commands were async, it is not easy to believe a device > can instantaneously abort an arc when a timer hits and return to full > operation. For instance, mlx5 can't do this. > > The vCPU cannot be restarted to try to meet the SLA until a command > going back to RUNNING returns. > > If we want to have a SLA feature it feels better to pass in the > deadline time as part of the set state ioctl and the driver can then > internally do something appropriate and not have to figure out how to > juggle an external abort. The driver would be expected to return fully > completed from STOP or return back to RUNNING before the deadline. > > For instance mlx5 could possibly implement this by checking the > migration size and doing some maths before deciding if it should > commit to its unabortable device command. > > I have a feeling supporting SLA means devices are going to have to > report latencies for various arcs and work in a more classical > realtime deadline oriented way overall. Estimating the transfer > latency and size is another factor too. > > Overall, this SLA topic looks quite big to me, and I think a full > solution will come with many facets. We are also quite interested in > dirty rate limiting, for instance. So if/when we were to support this, we might use a different SET_STATE feature ioctl that allows the user to specify a deadline and we'd use feature probing or a flag on the migration feature for userspace to discover this? I'd be ok with that, I just want to make sure we have agreeable options to support it. Thanks, Alex
> From: Alex Williamson <alex.williamson@redhat.com> > Sent: Wednesday, February 23, 2022 9:10 AM > > > > + * The kernel migration driver must fully transition the device to the > new state > > > > + * value before the operation returns to the user. > > > > > > The above statement certainly doesn't preclude asynchronous > > > availability of data on the stream FD, but it does demand that the > > > device state transition itself is synchronous and can cannot be > > > shortcut. If the state transition itself exceeds migration SLAs, we're > > > in a pickle. Thanks, > > > > Even if the commands were async, it is not easy to believe a device > > can instantaneously abort an arc when a timer hits and return to full > > operation. For instance, mlx5 can't do this. > > > > The vCPU cannot be restarted to try to meet the SLA until a command > > going back to RUNNING returns. > > > > If we want to have a SLA feature it feels better to pass in the > > deadline time as part of the set state ioctl and the driver can then > > internally do something appropriate and not have to figure out how to > > juggle an external abort. The driver would be expected to return fully > > completed from STOP or return back to RUNNING before the deadline. > > > > For instance mlx5 could possibly implement this by checking the > > migration size and doing some maths before deciding if it should > > commit to its unabortable device command. > > > > I have a feeling supporting SLA means devices are going to have to > > report latencies for various arcs and work in a more classical > > realtime deadline oriented way overall. Estimating the transfer > > latency and size is another factor too. > > > > Overall, this SLA topic looks quite big to me, and I think a full > > solution will come with many facets. We are also quite interested in > > dirty rate limiting, for instance. > > So if/when we were to support this, we might use a different SET_STATE > feature ioctl that allows the user to specify a deadline and we'd use > feature probing or a flag on the migration feature for userspace to > discover this? I'd be ok with that, I just want to make sure we have > agreeable options to support it. Thanks, > Or use a different device_feature ioctl to allow setting deadline for different arcs before changing device state and then reuse existing SET_STATE semantics with the migration driver doing estimation underlyingly based on pre-configured constraints... Thanks Kevin
On Tue, Feb 22, 2022 at 06:09:34PM -0700, Alex Williamson wrote: > So if/when we were to support this, we might use a different SET_STATE > feature ioctl that allows the user to specify a deadline and we'd use > feature probing or a flag on the migration feature for userspace to > discover this? I'd be ok with that, I just want to make sure we have > agreeable options to support it. Thanks, I think we'd just make the set_state struct longer and add a cap flag for deadline? Jason
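Purely as an illustration of that idea, and not part of this series, the extension Jason sketches might look roughly like the following; VFIO_MIGRATION_DEADLINE, deadline_ns and the struct name are invented for this sketch:

#include <linux/types.h>

/* Hypothetical only: a deadline-aware variant of the SET_STATE payload.
 * None of these names exist in the proposed uapi. */
struct vfio_device_feature_mig_state_deadline_sketch {
	__u32 device_state;	/* From enum vfio_device_mig_state */
	__s32 data_fd;
	__aligned_u64 deadline_ns; /* only meaningful if the device reported a
				    * hypothetical VFIO_MIGRATION_DEADLINE flag
				    * in vfio_device_feature_migration.flags */
};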
On Sun, Feb 20 2022, Yishai Hadas <yishaih@nvidia.com> wrote: > diff --git a/include/linux/vfio.h b/include/linux/vfio.h > index ca69516f869d..3bbadcdbc9c8 100644 > --- a/include/linux/vfio.h > +++ b/include/linux/vfio.h > @@ -56,6 +56,14 @@ struct vfio_device { > * match, -errno for abort (ex. match with insufficient or incorrect > * additional args) > * @device_feature: Fill in the VFIO_DEVICE_FEATURE ioctl > + * @migration_set_state: Optional callback to change the migration state for > + * devices that support migration. The returned FD is used for data > + * transfer according to the FSM definition. The driver is responsible > + * to ensure that FD reaches end of stream or error whenever the > + * migration FSM leaves a data transfer state or before close_device() > + * returns. > + * @migration_get_state: Optional callback to get the migration state for > + * devices that support migration. Nit: I'd add "mandatory for VFIO_DEVICE_FEATURE_MIGRATION migration support" to both descriptions to be a bit more explicit. (...) > +/* > + * Indicates the device can support the migration API through > + * VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE. If present flags must be non-zero and > + * VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE is supported. The RUNNING and I'm having trouble parsing this. I think what it tries to say is that at least one of the flags defined below must be set? > + * ERROR states are always supported if this GET succeeds. What about the following instead: "Indicates device support for the migration API through VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE. If present, the RUNNING and ERROR states are always supported. Support for additional states is indicated via the flags field; at least one of the flags defined below must be set." > + * > + * VFIO_MIGRATION_STOP_COPY means that STOP, STOP_COPY and > + * RESUMING are supported. > + */ > +struct vfio_device_feature_migration { > + __aligned_u64 flags; > +#define VFIO_MIGRATION_STOP_COPY (1 << 0) > +};
On Wed, Feb 23, 2022 at 06:06:13PM +0100, Cornelia Huck wrote: > On Sun, Feb 20 2022, Yishai Hadas <yishaih@nvidia.com> wrote: > > > diff --git a/include/linux/vfio.h b/include/linux/vfio.h > > index ca69516f869d..3bbadcdbc9c8 100644 > > +++ b/include/linux/vfio.h > > @@ -56,6 +56,14 @@ struct vfio_device { > > * match, -errno for abort (ex. match with insufficient or incorrect > > * additional args) > > * @device_feature: Fill in the VFIO_DEVICE_FEATURE ioctl > > + * @migration_set_state: Optional callback to change the migration state for > > + * devices that support migration. The returned FD is used for data > > + * transfer according to the FSM definition. The driver is responsible > > + * to ensure that FD reaches end of stream or error whenever the > > + * migration FSM leaves a data transfer state or before close_device() > > + * returns. > > + * @migration_get_state: Optional callback to get the migration state for > > + * devices that support migration. > > Nit: I'd add "mandatory for VFIO_DEVICE_FEATURE_MIGRATION migration > support" to both descriptions to be a bit more explicit. Ok > > +/* > > + * Indicates the device can support the migration API through > > + * VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE. If present flags must be non-zero and > > + * VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE is supported. The RUNNING and > > I'm having trouble parsing this. I think what it tries to say is that at > least one of the flags defined below must be set? > > > + * ERROR states are always supported if this GET succeeds. > > What about the following instead: > > "Indicates device support for the migration API through > VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE. If present, the RUNNING and ERROR > states are always supported. Support for additional states is indicated > via the flags field; at least one of the flags defined below must be > set." Almost, 'at least VFIO_MIGRATION_STOP_COPY must be set' Thanks, Jason
On Wed, Feb 23 2022, Jason Gunthorpe <jgg@nvidia.com> wrote: > On Wed, Feb 23, 2022 at 06:06:13PM +0100, Cornelia Huck wrote: >> On Sun, Feb 20 2022, Yishai Hadas <yishaih@nvidia.com> wrote: >> > +/* >> > + * Indicates the device can support the migration API through >> > + * VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE. If present flags must be non-zero and >> > + * VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE is supported. The RUNNING and >> >> I'm having trouble parsing this. I think what it tries to say is that at >> least one of the flags defined below must be set? >> >> > + * ERROR states are always supported if this GET succeeds. >> >> What about the following instead: >> >> "Indicates device support for the migration API through >> VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE. If present, the RUNNING and ERROR >> states are always supported. Support for additional states is indicated >> via the flags field; at least one of the flags defined below must be >> set." > > Almost, 'at least VFIO_MIGRATION_STOP_COPY must be set' It feels a bit odd to split the mandatory states between a base layer (RUNNING/ERROR) and the ones governed by VFIO_MIGRATION_STOP_COPY. Do we want to keep the possibility of a future implementation that does not use the semantics indicated by VFIO_MIGRATION_STOP_COPY? If yes, it should be "one of the flags" and the flags that require VFIO_MIGRATION_STOP_COPY to be set as well need to note that dependency. If not, we should explicitly tag VFIO_MIGRATION_STOP_COPY as mandatory (so that the flag's special status is obvious.)
On Thu, Feb 24, 2022 at 11:41:44AM +0100, Cornelia Huck wrote: > On Wed, Feb 23 2022, Jason Gunthorpe <jgg@nvidia.com> wrote: > > > On Wed, Feb 23, 2022 at 06:06:13PM +0100, Cornelia Huck wrote: > >> On Sun, Feb 20 2022, Yishai Hadas <yishaih@nvidia.com> wrote: > > >> > +/* > >> > + * Indicates the device can support the migration API through > >> > + * VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE. If present flags must be non-zero and > >> > + * VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE is supported. The RUNNING and > >> > >> I'm having trouble parsing this. I think what it tries to say is that at > >> least one of the flags defined below must be set? > >> > >> > + * ERROR states are always supported if this GET succeeds. > >> > >> What about the following instead: > >> > >> "Indicates device support for the migration API through > >> VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE. If present, the RUNNING and ERROR > >> states are always supported. Support for additional states is indicated > >> via the flags field; at least one of the flags defined below must be > >> set." > > > > Almost, 'at least VFIO_MIGRATION_STOP_COPY must be set' > > It feels a bit odd to split the mandatory states between a base layer > (RUNNING/ERROR) and the ones governed by VFIO_MIGRATION_STOP_COPY. Do we > want to keep the possibility of a future implementation that does not > use the semantics indicated by VFIO_MIGRATION_STOP_COPY? Yes we do, and when we do that the documentation can reflect that world. Today, as is, it is mandatory. Jason
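For completeness, the way userspace would consume the feature as it stands today (VFIO_MIGRATION_STOP_COPY mandatory, RUNNING and ERROR implicit) might look like this rough sketch; query_migration_flags() is an invented helper and error handling is abbreviated:

#include <linux/vfio.h>
#include <sys/ioctl.h>

static int query_migration_flags(int device_fd, __u64 *flags)
{
	__u64 buf[(sizeof(struct vfio_device_feature) +
		   sizeof(struct vfio_device_feature_migration) + 7) / 8] = {};
	struct vfio_device_feature *feature = (void *)buf;
	struct vfio_device_feature_migration *mig = (void *)feature->data;

	feature->argsz = sizeof(buf);
	feature->flags = VFIO_DEVICE_FEATURE_GET |
			 VFIO_DEVICE_FEATURE_MIGRATION;

	/* GET fails entirely if the device has no v2 migration support */
	if (ioctl(device_fd, VFIO_DEVICE_FEATURE, feature))
		return -1;

	*flags = mig->flags;
	return (mig->flags & VFIO_MIGRATION_STOP_COPY) ? 0 : -1;
}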
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 71763e2ac561..b37ab27b511f 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1557,6 +1557,197 @@ static int vfio_device_fops_release(struct inode *inode, struct file *filep)
 	return 0;
 }
 
+/*
+ * vfio_mig_get_next_state - Compute the next step in the FSM
+ * @cur_fsm - The current state the device is in
+ * @new_fsm - The target state to reach
+ * @next_fsm - Pointer to the next step to get to new_fsm
+ *
+ * Return 0 upon success, otherwise -errno
+ * Upon success the next step in the state progression between cur_fsm and
+ * new_fsm will be set in next_fsm.
+ *
+ * This breaks down requests for combination transitions into smaller steps and
+ * returns the next step to get to new_fsm. The function may need to be called
+ * multiple times before reaching new_fsm.
+ *
+ */
+int vfio_mig_get_next_state(struct vfio_device *device,
+			    enum vfio_device_mig_state cur_fsm,
+			    enum vfio_device_mig_state new_fsm,
+			    enum vfio_device_mig_state *next_fsm)
+{
+	enum { VFIO_DEVICE_NUM_STATES = VFIO_DEVICE_STATE_RESUMING + 1 };
+	/*
+	 * The coding in this table requires the driver to implement 6
+	 * FSM arcs:
+	 *         RESUMING -> STOP
+	 *         RUNNING -> STOP
+	 *         STOP -> RESUMING
+	 *         STOP -> RUNNING
+	 *         STOP -> STOP_COPY
+	 *         STOP_COPY -> STOP
+	 *
+	 * The coding will step through multiple states for these combination
+	 * transitions:
+	 *         RESUMING -> STOP -> RUNNING
+	 *         RESUMING -> STOP -> STOP_COPY
+	 *         RUNNING -> STOP -> RESUMING
+	 *         RUNNING -> STOP -> STOP_COPY
+	 *         STOP_COPY -> STOP -> RESUMING
+	 *         STOP_COPY -> STOP -> RUNNING
+	 */
+	static const u8 vfio_from_fsm_table[VFIO_DEVICE_NUM_STATES][VFIO_DEVICE_NUM_STATES] = {
+		[VFIO_DEVICE_STATE_STOP] = {
+			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY,
+			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RESUMING,
+			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
+		},
+		[VFIO_DEVICE_STATE_RUNNING] = {
+			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
+		},
+		[VFIO_DEVICE_STATE_STOP_COPY] = {
+			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY,
+			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
+		},
+		[VFIO_DEVICE_STATE_RESUMING] = {
+			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RESUMING,
+			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
+		},
+		[VFIO_DEVICE_STATE_ERROR] = {
+			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_ERROR,
+			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_ERROR,
+			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_ERROR,
+			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_ERROR,
+			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
+		},
+	};
+
+	if (WARN_ON(cur_fsm >= ARRAY_SIZE(vfio_from_fsm_table)))
+		return -EINVAL;
+
+	if (new_fsm >= ARRAY_SIZE(vfio_from_fsm_table))
+		return -EINVAL;
+
+	*next_fsm = vfio_from_fsm_table[cur_fsm][new_fsm];
+	return (*next_fsm != VFIO_DEVICE_STATE_ERROR) ? 0 : -EINVAL;
+}
+EXPORT_SYMBOL_GPL(vfio_mig_get_next_state);
+
+/*
+ * Convert the driver's struct file into a FD number and return it to userspace
+ */
+static int vfio_ioct_mig_return_fd(struct file *filp, void __user *arg,
+				   struct vfio_device_feature_mig_state *mig)
+{
+	int ret;
+	int fd;
+
+	fd = get_unused_fd_flags(O_CLOEXEC);
+	if (fd < 0) {
+		ret = fd;
+		goto out_fput;
+	}
+
+	mig->data_fd = fd;
+	if (copy_to_user(arg, mig, sizeof(*mig))) {
+		ret = -EFAULT;
+		goto out_put_unused;
+	}
+	fd_install(fd, filp);
+	return 0;
+
+out_put_unused:
+	put_unused_fd(fd);
+out_fput:
+	fput(filp);
+	return ret;
+}
+
+static int
+vfio_ioctl_device_feature_mig_device_state(struct vfio_device *device,
+					   u32 flags, void __user *arg,
+					   size_t argsz)
+{
+	size_t minsz =
+		offsetofend(struct vfio_device_feature_mig_state, data_fd);
+	struct vfio_device_feature_mig_state mig;
+	struct file *filp = NULL;
+	int ret;
+
+	if (!device->ops->migration_set_state ||
+	    !device->ops->migration_get_state)
+		return -ENOTTY;
+
+	ret = vfio_check_feature(flags, argsz,
+				 VFIO_DEVICE_FEATURE_SET |
+				 VFIO_DEVICE_FEATURE_GET,
+				 sizeof(mig));
+	if (ret != 1)
+		return ret;
+
+	if (copy_from_user(&mig, arg, minsz))
+		return -EFAULT;
+
+	if (flags & VFIO_DEVICE_FEATURE_GET) {
+		enum vfio_device_mig_state curr_state;
+
+		ret = device->ops->migration_get_state(device, &curr_state);
+		if (ret)
+			return ret;
+		mig.device_state = curr_state;
+		goto out_copy;
+	}
+
+	/* Handle the VFIO_DEVICE_FEATURE_SET */
+	filp = device->ops->migration_set_state(device, mig.device_state);
+	if (IS_ERR(filp) || !filp)
+		goto out_copy;
+
+	return vfio_ioct_mig_return_fd(filp, arg, &mig);
+out_copy:
+	mig.data_fd = -1;
+	if (copy_to_user(arg, &mig, sizeof(mig)))
+		return -EFAULT;
+	if (IS_ERR(filp))
+		return PTR_ERR(filp);
+	return 0;
+}
+
+static int vfio_ioctl_device_feature_migration(struct vfio_device *device,
+					       u32 flags, void __user *arg,
+					       size_t argsz)
+{
+	struct vfio_device_feature_migration mig = {
+		.flags = VFIO_MIGRATION_STOP_COPY,
+	};
+	int ret;
+
+	if (!device->ops->migration_set_state ||
+	    !device->ops->migration_get_state)
+		return -ENOTTY;
+
+	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET,
+				 sizeof(mig));
+	if (ret != 1)
+		return ret;
+	if (copy_to_user(arg, &mig, sizeof(mig)))
+		return -EFAULT;
+	return 0;
+}
+
 static int vfio_ioctl_device_feature(struct vfio_device *device,
 				     struct vfio_device_feature __user *arg)
 {
@@ -1582,6 +1773,14 @@ static int vfio_ioctl_device_feature(struct vfio_device *device,
 		return -EINVAL;
 
 	switch (feature.flags & VFIO_DEVICE_FEATURE_MASK) {
+	case VFIO_DEVICE_FEATURE_MIGRATION:
+		return vfio_ioctl_device_feature_migration(
+			device, feature.flags, arg->data,
+			feature.argsz - minsz);
+	case VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE:
+		return vfio_ioctl_device_feature_mig_device_state(
+			device, feature.flags, arg->data,
+			feature.argsz - minsz);
 	default:
 		if (unlikely(!device->ops->device_feature))
 			return -EINVAL;
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index ca69516f869d..3bbadcdbc9c8 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -56,6 +56,14 @@ struct vfio_device {
  *         match, -errno for abort (ex. match with insufficient or incorrect
  *         additional args)
  * @device_feature: Fill in the VFIO_DEVICE_FEATURE ioctl
+ * @migration_set_state: Optional callback to change the migration state for
+ *         devices that support migration. The returned FD is used for data
+ *         transfer according to the FSM definition. The driver is responsible
+ *         to ensure that FD reaches end of stream or error whenever the
+ *         migration FSM leaves a data transfer state or before close_device()
+ *         returns.
+ * @migration_get_state: Optional callback to get the migration state for
+ *         devices that support migration.
  */
 struct vfio_device_ops {
 	char	*name;
@@ -72,6 +80,11 @@ struct vfio_device_ops {
 	int	(*match)(struct vfio_device *vdev, char *buf);
 	int	(*device_feature)(struct vfio_device *device, u32 flags,
 				  void __user *arg, size_t argsz);
+	struct file *(*migration_set_state)(
+		struct vfio_device *device,
+		enum vfio_device_mig_state new_state);
+	int (*migration_get_state)(struct vfio_device *device,
+				   enum vfio_device_mig_state *curr_state);
 };
 
 /**
@@ -114,6 +127,11 @@ extern void vfio_device_put(struct vfio_device *device);
 
 int vfio_assign_device_set(struct vfio_device *device, void *set_id);
 
+int vfio_mig_get_next_state(struct vfio_device *device,
+			    enum vfio_device_mig_state cur_fsm,
+			    enum vfio_device_mig_state new_fsm,
+			    enum vfio_device_mig_state *next_fsm);
+
 /*
  * External user API
  */
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index ef33ea002b0b..02b836ea8f46 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -605,25 +605,25 @@ struct vfio_region_gfx_edid {
 
 struct vfio_device_migration_info {
 	__u32 device_state;         /* VFIO device state */
-#define VFIO_DEVICE_STATE_STOP      (0)
-#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
-#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
-#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
-#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
-				     VFIO_DEVICE_STATE_SAVING |  \
-				     VFIO_DEVICE_STATE_RESUMING)
+#define VFIO_DEVICE_STATE_V1_STOP      (0)
+#define VFIO_DEVICE_STATE_V1_RUNNING   (1 << 0)
+#define VFIO_DEVICE_STATE_V1_SAVING    (1 << 1)
+#define VFIO_DEVICE_STATE_V1_RESUMING  (1 << 2)
+#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_V1_RUNNING | \
+				     VFIO_DEVICE_STATE_V1_SAVING |  \
+				     VFIO_DEVICE_STATE_V1_RESUMING)
 
 #define VFIO_DEVICE_STATE_VALID(state) \
-	(state & VFIO_DEVICE_STATE_RESUMING ? \
-	(state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_RESUMING : 1)
+	(state & VFIO_DEVICE_STATE_V1_RESUMING ? \
+	(state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_V1_RESUMING : 1)
 
 #define VFIO_DEVICE_STATE_IS_ERROR(state) \
-	((state & VFIO_DEVICE_STATE_MASK) == (VFIO_DEVICE_STATE_SAVING | \
-					      VFIO_DEVICE_STATE_RESUMING))
+	((state & VFIO_DEVICE_STATE_MASK) == (VFIO_DEVICE_STATE_V1_SAVING | \
+					      VFIO_DEVICE_STATE_V1_RESUMING))
 
 #define VFIO_DEVICE_STATE_SET_ERROR(state) \
-	((state & ~VFIO_DEVICE_STATE_MASK) | VFIO_DEVICE_SATE_SAVING | \
-					     VFIO_DEVICE_STATE_RESUMING)
+	((state & ~VFIO_DEVICE_STATE_MASK) | VFIO_DEVICE_STATE_V1_SAVING | \
+					     VFIO_DEVICE_STATE_V1_RESUMING)
 
 	__u32 reserved;
 	__u64 pending_bytes;
@@ -1002,6 +1002,153 @@ struct vfio_device_feature {
  */
 #define VFIO_DEVICE_FEATURE_PCI_VF_TOKEN	(0)
 
+/*
+ * Indicates the device can support the migration API through
+ * VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE. If present flags must be non-zero and
+ * VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE is supported. The RUNNING and
+ * ERROR states are always supported if this GET succeeds.
+ *
+ * VFIO_MIGRATION_STOP_COPY means that STOP, STOP_COPY and
+ * RESUMING are supported.
+ */
+struct vfio_device_feature_migration {
+	__aligned_u64 flags;
+#define VFIO_MIGRATION_STOP_COPY	(1 << 0)
+};
+#define VFIO_DEVICE_FEATURE_MIGRATION 1
+
+/*
+ * Upon VFIO_DEVICE_FEATURE_SET, execute a migration state change on the VFIO
+ * device. The new state is supplied in device_state, see enum
+ * vfio_device_mig_state for details
+ *
+ * The kernel migration driver must fully transition the device to the new state
+ * value before the operation returns to the user.
+ *
+ * The kernel migration driver must not generate asynchronous device state
+ * transitions outside of manipulation by the user or the VFIO_DEVICE_RESET
+ * ioctl as described above.
+ *
+ * If this function fails then current device_state may be the original
+ * operating state or some other state along the combination transition path.
+ * The user can then decide if it should execute a VFIO_DEVICE_RESET, attempt
+ * to return to the original state, or attempt to return to some other state
+ * such as RUNNING or STOP.
+ *
+ * If the new_state starts a new data transfer session then the FD associated
+ * with that session is returned in data_fd. The user is responsible to close
+ * this FD when it is finished. The user must consider the migration data
+ * segments carried over the FD to be opaque and non-fungible. During RESUMING,
+ * the data segments must be written in the same order they came out of the
+ * saving side FD.
+ *
+ * Upon VFIO_DEVICE_FEATURE_GET, get the current migration state of the VFIO
+ * device, data_fd will be -1.
+ */
+struct vfio_device_feature_mig_state {
+	__u32 device_state; /* From enum vfio_device_mig_state */
+	__s32 data_fd;
+};
+#define VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE 2
+
+/*
+ * The device migration Finite State Machine is described by the enum
+ * vfio_device_mig_state. Some of the FSM arcs will create a migration data
+ * transfer session by returning a FD, in this case the migration data will
+ * flow over the FD using read() and write() as discussed below.
+ *
+ * There are 5 states to support VFIO_MIGRATION_STOP_COPY:
+ *  RUNNING - The device is running normally
+ *  STOP - The device does not change the internal or external state
+ *  STOP_COPY - The device internal state can be read out
+ *  RESUMING - The device is stopped and is loading a new internal state
+ *  ERROR - The device has failed and must be reset
+ *
+ * The FSM takes actions on the arcs between FSM states. The driver implements
+ * the following behavior for the FSM arcs:
+ *
+ * RUNNING -> STOP
+ * STOP_COPY -> STOP
+ *   While in STOP the device must stop the operation of the device. The device
+ *   must not generate interrupts, DMA, or any other change to external state.
+ *   It must not change its internal state. When stopped the device and kernel
+ *   migration driver must accept and respond to interaction to support external
+ *   subsystems in the STOP state, for example PCI MSI-X and PCI config space.
+ *   Failure by the user to restrict device access while in STOP must not result
+ *   in error conditions outside the user context (ex. host system faults).
+ *
+ *   The STOP_COPY arc will terminate a data transfer session.
+ *
+ * RESUMING -> STOP
+ *   Leaving RESUMING terminates a data transfer session and indicates the
+ *   device should complete processing of the data delivered by write(). The
+ *   kernel migration driver should complete the incorporation of data written
+ *   to the data transfer FD into the device internal state and perform
+ *   final validity and consistency checking of the new device state. If the
+ *   user provided data is found to be incomplete, inconsistent, or otherwise
+ *   invalid, the migration driver must fail the SET_STATE ioctl and
+ *   optionally go to the ERROR state as described below.
+ *
+ *   While in STOP the device has the same behavior as other STOP states
+ *   described above.
+ *
+ *   To abort a RESUMING session the device must be reset.
+ *
+ * STOP -> RUNNING
+ *   While in RUNNING the device is fully operational, the device may generate
+ *   interrupts, DMA, respond to MMIO, all vfio device regions are functional,
+ *   and the device may advance its internal state.
+ *
+ * STOP -> STOP_COPY
+ *   This arc begins the process of saving the device state and will return a
+ *   new data_fd.
+ *
+ *   While in the STOP_COPY state the device has the same behavior as STOP
+ *   with the addition that the data transfer session continues to stream the
+ *   migration state. End of stream on the FD indicates the entire device
+ *   state has been transferred.
+ *
+ *   The user should take steps to restrict access to vfio device regions while
+ *   the device is in STOP_COPY or risk corruption of the device migration data
+ *   stream.
+ *
+ * STOP -> RESUMING
+ *   Entering the RESUMING state starts a process of restoring the device state
+ *   and will return a new data_fd. The data stream fed into the data_fd should
+ *   be taken from the data transfer output of a single FD during saving from
+ *   a compatible device. The migration driver may alter/reset the internal
+ *   device state for this arc if required to prepare the device to receive the
+ *   migration data.
+ *
+ * any -> ERROR
+ *   ERROR cannot be specified as a device state, however any transition request
+ *   can be failed with an errno return and may then move the device_state into
+ *   ERROR. In this case the device was unable to execute the requested arc and
+ *   was also unable to restore the device to any valid device_state.
+ *   To recover from ERROR VFIO_DEVICE_RESET must be used to return the
+ *   device_state back to RUNNING.
+ *
+ * The remaining possible transitions are interpreted as combinations of the
+ * above FSM arcs. As there are multiple paths through the FSM arcs the path
+ * should be selected based on the following rules:
+ *   - Select the shortest path.
+ * Refer to vfio_mig_get_next_state() for the result of the algorithm.
+ *
+ * The automatic transit through the FSM arcs that make up the combination
+ * transition is invisible to the user. When working with combination arcs the
+ * user may see any step along the path in the device_state if SET_STATE
+ * fails. When handling these types of errors users should anticipate future
+ * revisions of this protocol using new states and those states becoming
+ * visible in this case.
+ */
+enum vfio_device_mig_state {
+	VFIO_DEVICE_STATE_ERROR = 0,
+	VFIO_DEVICE_STATE_STOP = 1,
+	VFIO_DEVICE_STATE_RUNNING = 2,
+	VFIO_DEVICE_STATE_STOP_COPY = 3,
+	VFIO_DEVICE_STATE_RESUMING = 4,
+};
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**
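Finally, a hedged sketch of how a variant driver might wire the two new ops on top of vfio_mig_get_next_state(); every foo_* name is invented, foo_step_device_state() is left as a stub for the device-specific work, and locking plus error unwinding are omitted:

#include <linux/err.h>
#include <linux/vfio.h>

struct foo_device {
	struct vfio_device vdev;
	enum vfio_device_mig_state mig_state;
};

/* Device-specific stub: performs exactly one of the six required arcs and
 * returns a data-transfer struct file for STOP -> STOP_COPY and
 * STOP -> RESUMING, NULL otherwise. */
static struct file *foo_step_device_state(struct foo_device *foo,
					  enum vfio_device_mig_state next);

static struct file *
foo_migration_set_state(struct vfio_device *vdev,
			enum vfio_device_mig_state new_state)
{
	struct foo_device *foo = container_of(vdev, struct foo_device, vdev);
	enum vfio_device_mig_state next_state;
	struct file *filp = NULL;
	int ret;

	while (foo->mig_state != new_state) {
		/* The core computes the next single-step arc on the path */
		ret = vfio_mig_get_next_state(vdev, foo->mig_state, new_state,
					      &next_state);
		if (ret)
			return ERR_PTR(ret);

		filp = foo_step_device_state(foo, next_state);
		if (IS_ERR(filp))
			return filp;

		foo->mig_state = next_state;
	}
	return filp;	/* non-NULL only when the last arc opened a data FD */
}

static int foo_migration_get_state(struct vfio_device *vdev,
				   enum vfio_device_mig_state *curr_state)
{
	struct foo_device *foo = container_of(vdev, struct foo_device, vdev);

	*curr_state = foo->mig_state;
	return 0;
}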