diff mbox series

[V7,mlx5-next,15/15] vfio: Extend the device migration protocol with PRE_COPY

Message ID 20220207172216.206415-16-yishaih@nvidia.com (mailing list archive)
State New, archived
Headers show
Series Add mlx5 live migration driver and v2 migration protocol | expand

Commit Message

Yishai Hadas Feb. 7, 2022, 5:22 p.m. UTC
From: Jason Gunthorpe <jgg@nvidia.com>

The optional PRE_COPY states open the saving data transfer FD before
reaching STOP_COPY and allows the device to dirty track internal state
changes with the general idea to reduce the volume of data transferred
in the STOP_COPY stage.

While in PRE_COPY the device remains RUNNING, but the saving FD is open.

Only if the device also supports RUNNING_P2P can it support PRE_COPY_P2P,
which halts P2P transfers while continuing the saving FD.

PRE_COPY, with P2P support, requires the driver to implement 7 new arcs
and exists as an optional FSM branch between RUNNING and STOP_COPY:
    RUNNING -> PRE_COPY -> PRE_COPY_P2P -> STOP_COPY

A new ioctl VFIO_DEVICE_MIG_PRECOPY is provided to allow userspace to
query the progress of the precopy operation in the driver with the idea it
will judge to move to STOP_COPY at least once the initial data set is
transferred, and possibly after the dirty size has shrunk appropriately.

We think there may also be merit in future extensions to the
VFIO_DEVICE_MIG_PRECOPY ioctl to also command the device to throttle the
rate it generates internal dirty state.

Compared to the v1 clarification, STOP_COPY -> PRE_COPY is made optional
and to be defined in future. While making the whole PRE_COPY feature
optional eliminates the concern from mlx5, this is still a complicated arc
to implement and seems prudent to leave it closed until a proper use case
is developed. We also split the pending_bytes report into the initial and
sustaining values, and define the protocol to get an event via poll() for
new dirty data during PRE_COPY.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 drivers/vfio/vfio.c       |  71 +++++++++++++++++++++++-
 include/uapi/linux/vfio.h | 110 ++++++++++++++++++++++++++++++++++++--
 2 files changed, 176 insertions(+), 5 deletions(-)

Comments

Alex Williamson Feb. 17, 2022, 5:15 p.m. UTC | #1
On Mon, 7 Feb 2022 19:22:16 +0200
Yishai Hadas <yishaih@nvidia.com> wrote:

> From: Jason Gunthorpe <jgg@nvidia.com>
> 
> The optional PRE_COPY states open the saving data transfer FD before
> reaching STOP_COPY and allows the device to dirty track internal state
> changes with the general idea to reduce the volume of data transferred
> in the STOP_COPY stage.
> 
> While in PRE_COPY the device remains RUNNING, but the saving FD is open.
> 
> Only if the device also supports RUNNING_P2P can it support PRE_COPY_P2P,
> which halts P2P transfers while continuing the saving FD.
> 
> PRE_COPY, with P2P support, requires the driver to implement 7 new arcs
> and exists as an optional FSM branch between RUNNING and STOP_COPY:
>     RUNNING -> PRE_COPY -> PRE_COPY_P2P -> STOP_COPY
> 
> A new ioctl VFIO_DEVICE_MIG_PRECOPY is provided to allow userspace to
> query the progress of the precopy operation in the driver with the idea it
> will judge to move to STOP_COPY at least once the initial data set is
> transferred, and possibly after the dirty size has shrunk appropriately.
> 
> We think there may also be merit in future extensions to the
> VFIO_DEVICE_MIG_PRECOPY ioctl to also command the device to throttle the
> rate it generates internal dirty state.
> 
> Compared to the v1 clarification, STOP_COPY -> PRE_COPY is made optional
> and to be defined in future. While making the whole PRE_COPY feature
> optional eliminates the concern from mlx5, this is still a complicated arc
> to implement and seems prudent to leave it closed until a proper use case
> is developed. We also split the pending_bytes report into the initial and
> sustaining values, and define the protocol to get an event via poll() for
> new dirty data during PRE_COPY.

I feel obligated to ask, is PRE_COPY support essentially RFC at this
point since we have no proposed in-kernel users?

It seems like we're winding down comments on the remainder of the
series and I feel ok with where it's headed and the options we have
available for future extensions.  Pre-copy seems like an important gap
to fill and I think this patch shows that a future extension could
allow it, but with the scrutiny not to add unused code to the kernel,
I'm not sure there's a valid justification to add it now.  Thanks,

Alex

PS - Why is this a stand-alone ioctl rather than a DEVICE_FEATURE?
Jason Gunthorpe Feb. 18, 2022, 12:03 a.m. UTC | #2
On Thu, Feb 17, 2022 at 10:15:54AM -0700, Alex Williamson wrote:

> I feel obligated to ask, is PRE_COPY support essentially RFC at this
> point since we have no proposed in-kernel users?

Yes, it is included here because the kernel in v1 had PRE_COPY, so it
seemed essential to show how this could continue to look to evaluate
v2.

NVIDIA has an out of tree driver that implemented PRE_COPY in the v1
protocol, and we have some future plan to use it in a in-tree driver.

> It seems like we're winding down comments on the remainder of the
> series and I feel ok with where it's headed and the options we have
> available for future extensions.  

Thanks, it was a lot of work for everyone to get here!

Yishai has all the revisions from Kevin included, he will sent it on
Sunday. Based on this Leon will make a formal PR next week so it can
go into linux-next through your tree. We have to stay co-ordinated
with our netdev driver branch..

I will ping the acc team and make it priority to review their next
vresion. Let's try to include their driver as well.

We'll start to make a more review ready qemu series.

> PS - Why is this a stand-alone ioctl rather than a DEVICE_FEATURE?

You asked for the ioctl to be on the data_fd, so there is no
DEVICE_FEATURE infrastructure and I think it doesn't make sense to put
a multiplexor there. We have lots of ioctl numbers and don't want this
to be complicated for performance.

Thanks,
Jason
Tian, Kevin Feb. 18, 2022, 8:01 a.m. UTC | #3
Some comments though this may not be in next version per Alex's suggestion.

> From: Yishai Hadas <yishaih@nvidia.com>
> Sent: Tuesday, February 8, 2022 1:22 AM
> 
> From: Jason Gunthorpe <jgg@nvidia.com>
> 
> The optional PRE_COPY states open the saving data transfer FD before
> reaching STOP_COPY and allows the device to dirty track internal state
> changes with the general idea to reduce the volume of data transferred
> in the STOP_COPY stage.
> 
> While in PRE_COPY the device remains RUNNING, but the saving FD is open.
> 
> Only if the device also supports RUNNING_P2P can it support PRE_COPY_P2P,
> which halts P2P transfers while continuing the saving FD.
> 
> PRE_COPY, with P2P support, requires the driver to implement 7 new arcs
> and exists as an optional FSM branch between RUNNING and STOP_COPY:
>     RUNNING -> PRE_COPY -> PRE_COPY_P2P -> STOP_COPY
> 

that branch includes only 5 new arcs. The other two are between RUNNING_P2P
and PRECOPY_P2P.

I draw a figure to help me understand the final FSM. Put it here in case
others are interested in 
Jason Gunthorpe Feb. 18, 2022, 2:06 p.m. UTC | #4
On Fri, Feb 18, 2022 at 08:01:47AM +0000, Tian, Kevin wrote:
 
> > A new ioctl VFIO_DEVICE_MIG_PRECOPY is provided to allow userspace to
> > query the progress of the precopy operation in the driver with the idea it
> > will judge to move to STOP_COPY at least once the initial data set is
> > transferred, and possibly after the dirty size has shrunk appropriately.
> > 
> > We think there may also be merit in future extensions to the
> > VFIO_DEVICE_MIG_PRECOPY ioctl to also command the device to throttle the
> > rate it generates internal dirty state.
> > 
> > Compared to the v1 clarification, STOP_COPY -> PRE_COPY is made optional
> 
> essentially it's *BLOCKED* per following context.

Yes I suppose now that we have the cap bits not the arc discovery this
isn't worded well

> > and to be defined in future. While making the whole PRE_COPY feature
> > optional eliminates the concern from mlx5, this is still a complicated arc
> > to implement and seems prudent to leave it closed until a proper use case
> 
> Can you shed some light on the complexity here?

It is with the data_fd, once a driver enters STOP_COPY it should stuff
its final state into the data_fd. If this is aborted back to PRE_COPY
then the data_fd needs to return to streaming changes. Managing this
transition is not trivial - it is something that has to be signaled to
the receiver.

There is also something of a race here where the data_fd can reach
end-of-stream and then the user can do STOP_COPY->PRE_COPY and
continue stuffing data. This makes the construction of the data stream
framing "interesting" as there is no longer a possible in-band end of
stream marker. See the other discussion about async operation why this
is not ideal.

Basically, it is behavior current qemu doesn't trigger that requires
significant complexity and testing in any driver to support
properly. No driver proposed

> Could a driver pretend supporting PRE_COPY by simply returning both 
> initial_bytes and dirty_bytes as ZERO?

I think so, yes.

> and even if the driver doesn't support the base arc (STOP_COPY->
> PRE_COPY_P2P) what about the combination arc (STOP_COPY->STOP->
> RUNNING_P2P->PRE_COPY_P2P)?

Userspace can walk through this sequence on its own, but it cannot be
part of the FSM because it violates the construction rules. The
data_fd is open in two places.

> current FSM already allows STOP->RUNNING_P2P->PRE_COPY_P2P and in
> concept STOP_COPY and STOP have exact same device behavior.

This is allowed because it follows the FSM rules. The data_fd is the
key difference.

> with that combination arc the interim transition from STOP_COPY to
> STOP will terminate the current data stream and RUNNING_P2P to
> PRE_COPY_P2P will return a new data fd. This does violate the definition
> about transition between three 'saving group' of states, which says
> moving between them does not terminate or otherwise affect the
> associated fd.

Right, and because this happens the VMM wuld have to terminate the
resuming session as well. Remember the output of a single saving
data_fd can be sent to a single receiving resuming data_fd - they
cannot be spliced.

> > is developed. We also split the pending_bytes report into the initial and
> > sustaining values, and define the protocol to get an event via poll() for
> 
> I guess this split must have been aligned in earlier discussion but it's still
> useful if some words can be put here for the motivation. Otherwise one 
> could easily ask why not treating the 1st read of pending_bytes as the 
> initial size.

As everything is estimates the approach allows the estimate to be
refined as we go along. PRE_COPY can stop at any time, but knowing
some initial mandatory stage has passed is somewhat consistent with
how qemu seems to treat this.

> > @@ -1596,25 +1596,59 @@ int vfio_mig_get_next_state(struct vfio_device
> > *device,
> >  	 *         RUNNING -> STOP
> >  	 *         STOP -> RUNNING
> 
> The comment for above should be updated too, which currently says:
> 
> 	 * Without P2P the driver must implement:
> 
> and also move it to the end as it talks about the arcs when neither
> P2P nor PRECOPY is supported.

Yes

> > + * PRE_COPY_P2P -> RUNNING_P2P
> >   * RUNNING -> RUNNING_P2P
> >   * STOP -> RUNNING_P2P
> >   *   While in RUNNING_P2P the device is partially running in the P2P
> > quiescent
> >   *   state defined below.
> >   *
> > + *   The PRE_COPY arc will terminate a data transfer session.
> 
> PRE_COPY_P2P

Yes

> 
> > + *
> > + * RUNNING -> PRE_COPY
> > + * RUNNING_P2P -> PRE_COPY_P2P
> >   * STOP -> STOP_COPY
> > - *   This arc begin the process of saving the device state and will return a
> > - *   new data_fd.
> > + *   PRE_COPY, PRE_COPY_P2P and STOP_COPY form the "saving group" of
> > states
> > + *   which share a data transfer session. Moving between these states alters
> > + *   what is streamed in session, but does not terminate or otherwise effect
> 
> 'effect' -> 'affect'?

yes

> > @@ -959,6 +1007,8 @@ struct vfio_device_feature_mig_state {
> >   * above FSM arcs. As there are multiple paths through the FSM arcs the
> > path
> >   * should be selected based on the following rules:
> >   *   - Select the shortest path.
> > + *   - The path cannot have saving group states as interior arcs, only
> > + *     starting/end states.
> 
> what about PRECOPY->PRECOPY_P2P->STOP_COPY? In this case
> PRECOPY_P2P is used as interior arc.

It isn't an interior arc because there are only two arcs :) But yes,
it is bit unclear.

> and if we disallow a non-saving-group state as interior arc when both 
> start and end states are saving-group states (e.g. 
> STOP_COPY->STOP->RUNNING_P2P->PRE_COPY_P2P as I asked in
> the start) then it might be another rule to be specified...

This isn't a shortest path.

> > @@ -972,6 +1022,9 @@ struct vfio_device_feature_mig_state {
> >   * support them. The user can disocver if these states are supported by using
> >   * VFIO_DEVICE_FEATURE_MIGRATION. By using combination transitions the
> > user can
> >   * avoid knowing about these optional states if the kernel driver supports
> > them.
> > + *
> > + * Arcs touching PRE_COPY and PRE_COPY_P2P are removed if support for
> > PRE_COPY
> > + * is not present.
> 
> why adding this sentence particularly for PRE_COPY? Isn't it already
> explained by last paragraph for optional states?

Well, I thought it was clarifying about how the optionality is
constructed.

> > + * Drivers should attempt to return estimates so that initial_bytes +
> > + * dirty_bytes matches the amount of data an immediate transition to
> > STOP_COPY
> > + * will require to be streamed.
> 
> I didn't understand this requirement. In an immediate transition to
> STOP_COPY I expect the amount of data covers the entire device
> state, i.e. initial_bytes. dirty_bytes are dynamic and iteratively returned
> then why we need set some expectation on the sum of 
> initial+round1_dity+round2_dirty+... 

"will require to be streamed" means additional data from this point
forward, not including anything already sent.

It turns into the estimate of how long STOP_COPY will take.

Jason
Tian, Kevin Feb. 22, 2022, 1:43 a.m. UTC | #5
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, February 18, 2022 10:06 PM
> 
> 
> > > and to be defined in future. While making the whole PRE_COPY feature
> > > optional eliminates the concern from mlx5, this is still a complicated arc
> > > to implement and seems prudent to leave it closed until a proper use
> case
> >
> > Can you shed some light on the complexity here?
> 
> It is with the data_fd, once a driver enters STOP_COPY it should stuff
> its final state into the data_fd. If this is aborted back to PRE_COPY
> then the data_fd needs to return to streaming changes. Managing this
> transition is not trivial - it is something that has to be signaled to
> the receiver.
> 
> There is also something of a race here where the data_fd can reach
> end-of-stream and then the user can do STOP_COPY->PRE_COPY and
> continue stuffing data. This makes the construction of the data stream
> framing "interesting" as there is no longer a possible in-band end of
> stream marker. See the other discussion about async operation why this
> is not ideal.
> 
> Basically, it is behavior current qemu doesn't trigger that requires
> significant complexity and testing in any driver to support
> properly. No driver proposed

Make sense.

> 
> > > @@ -959,6 +1007,8 @@ struct vfio_device_feature_mig_state {
> > >   * above FSM arcs. As there are multiple paths through the FSM arcs the
> > > path
> > >   * should be selected based on the following rules:
> > >   *   - Select the shortest path.
> > > + *   - The path cannot have saving group states as interior arcs, only
> > > + *     starting/end states.
> >
> > what about PRECOPY->PRECOPY_P2P->STOP_COPY? In this case
> > PRECOPY_P2P is used as interior arc.
> 
> It isn't an interior arc because there are only two arcs :) But yes,
> it is bit unclear.
> 
> > and if we disallow a non-saving-group state as interior arc when both
> > start and end states are saving-group states (e.g.
> > STOP_COPY->STOP->RUNNING_P2P->PRE_COPY_P2P as I asked in
> > the start) then it might be another rule to be specified...
> 
> This isn't a shortest path.

it is the shortest path when STOP_COPY->PRE_COPY_P2P base arc is
not supported. I guess your earlier explanation about data_fd
should be the 3rd rule for why that combination arc is not allowed
in FSM.

> 
> > > @@ -972,6 +1022,9 @@ struct vfio_device_feature_mig_state {
> > >   * support them. The user can disocver if these states are supported by
> using
> > >   * VFIO_DEVICE_FEATURE_MIGRATION. By using combination transitions
> the
> > > user can
> > >   * avoid knowing about these optional states if the kernel driver supports
> > > them.
> > > + *
> > > + * Arcs touching PRE_COPY and PRE_COPY_P2P are removed if support
> for
> > > PRE_COPY
> > > + * is not present.
> >
> > why adding this sentence particularly for PRE_COPY? Isn't it already
> > explained by last paragraph for optional states?
> 
> Well, I thought it was clarifying about how the optionality is
> constructed.

The last paragraph already says:

+ * The optional states cannot be used with SET_STATE if the device does not
+ * support them. The user can disocver if these states are supported by using
+ * VFIO_DEVICE_FEATURE_MIGRATION. By using combination transitions the user can
+ * avoid knowing about these optional states if the kernel driver supports them.

> 
> > > + * Drivers should attempt to return estimates so that initial_bytes +
> > > + * dirty_bytes matches the amount of data an immediate transition to
> > > STOP_COPY
> > > + * will require to be streamed.
> >
> > I didn't understand this requirement. In an immediate transition to
> > STOP_COPY I expect the amount of data covers the entire device
> > state, i.e. initial_bytes. dirty_bytes are dynamic and iteratively returned
> > then why we need set some expectation on the sum of
> > initial+round1_dity+round2_dirty+...
> 
> "will require to be streamed" means additional data from this point
> forward, not including anything already sent.
> 
> It turns into the estimate of how long STOP_COPY will take.
> 

I still didn't get the 'match' part. Why should the amount of data which
has already been sent match the additional data to be sent in STOP_COPY?

Thanks
Keivn
Jason Gunthorpe Feb. 22, 2022, 3:50 p.m. UTC | #6
On Tue, Feb 22, 2022 at 01:43:13AM +0000, Tian, Kevin wrote:

> > > > + * Drivers should attempt to return estimates so that initial_bytes +
> > > > + * dirty_bytes matches the amount of data an immediate transition to
> > > > STOP_COPY
> > > > + * will require to be streamed.
> > >
> > > I didn't understand this requirement. In an immediate transition to
> > > STOP_COPY I expect the amount of data covers the entire device
> > > state, i.e. initial_bytes. dirty_bytes are dynamic and iteratively returned
> > > then why we need set some expectation on the sum of
> > > initial+round1_dity+round2_dirty+...
> > 
> > "will require to be streamed" means additional data from this point
> > forward, not including anything already sent.
> > 
> > It turns into the estimate of how long STOP_COPY will take.
> 
> I still didn't get the 'match' part. Why should the amount of data which
> has already been sent match the additional data to be sent in STOP_COPY?

None of it is 'already been sent' the return values are always 'still
to be sent'

Jason
Tian, Kevin Feb. 23, 2022, 12:40 a.m. UTC | #7
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, February 22, 2022 11:51 PM
> 
> On Tue, Feb 22, 2022 at 01:43:13AM +0000, Tian, Kevin wrote:
> 
> > > > > + * Drivers should attempt to return estimates so that initial_bytes +
> > > > > + * dirty_bytes matches the amount of data an immediate transition
> to
> > > > > STOP_COPY
> > > > > + * will require to be streamed.
> > > >
> > > > I didn't understand this requirement. In an immediate transition to
> > > > STOP_COPY I expect the amount of data covers the entire device
> > > > state, i.e. initial_bytes. dirty_bytes are dynamic and iteratively returned
> > > > then why we need set some expectation on the sum of
> > > > initial+round1_dity+round2_dirty+...
> > >
> > > "will require to be streamed" means additional data from this point
> > > forward, not including anything already sent.
> > >
> > > It turns into the estimate of how long STOP_COPY will take.
> >
> > I still didn't get the 'match' part. Why should the amount of data which
> > has already been sent match the additional data to be sent in STOP_COPY?
> 
> None of it is 'already been sent' the return values are always 'still
> to be sent'
> 

Reread the description:

+ * Drivers should attempt to return estimates so that initial_bytes +
+ * dirty_bytes matches the amount of data an immediate transition to STOP_COPY
+ * will require to be streamed.

I guess you intended to mean that when EITHER initial_bytes OR
dirty_bytes is read the returned value should match the amount 
of data as described above. It is "+" which confused me to think 
it as a sum of both numbers...

Thanks
Kevin
Jason Gunthorpe Feb. 23, 2022, 12:44 a.m. UTC | #8
On Wed, Feb 23, 2022 at 12:40:58AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Tuesday, February 22, 2022 11:51 PM
> > 
> > On Tue, Feb 22, 2022 at 01:43:13AM +0000, Tian, Kevin wrote:
> > 
> > > > > > + * Drivers should attempt to return estimates so that initial_bytes +
> > > > > > + * dirty_bytes matches the amount of data an immediate transition
> > to
> > > > > > STOP_COPY
> > > > > > + * will require to be streamed.
> > > > >
> > > > > I didn't understand this requirement. In an immediate transition to
> > > > > STOP_COPY I expect the amount of data covers the entire device
> > > > > state, i.e. initial_bytes. dirty_bytes are dynamic and iteratively returned
> > > > > then why we need set some expectation on the sum of
> > > > > initial+round1_dity+round2_dirty+...
> > > >
> > > > "will require to be streamed" means additional data from this point
> > > > forward, not including anything already sent.
> > > >
> > > > It turns into the estimate of how long STOP_COPY will take.
> > >
> > > I still didn't get the 'match' part. Why should the amount of data which
> > > has already been sent match the additional data to be sent in STOP_COPY?
> > 
> > None of it is 'already been sent' the return values are always 'still
> > to be sent'
> > 
> 
> Reread the description:
> 
> + * Drivers should attempt to return estimates so that initial_bytes +
> + * dirty_bytes matches the amount of data an immediate transition to STOP_COPY
> + * will require to be streamed.
> 
> I guess you intended to mean that when EITHER initial_bytes OR
> dirty_bytes is read the returned value should match the amount 
> of data as described above. It is "+" which confused me to think 
> it as a sum of both numbers...

It is the sum

initial_bytes declines as the data is transferred. Once everything is
read out the sum will be 0.

Jason
Tian, Kevin Feb. 23, 2022, 1:46 a.m. UTC | #9
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, February 23, 2022 8:45 AM
> 
> On Wed, Feb 23, 2022 at 12:40:58AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Tuesday, February 22, 2022 11:51 PM
> > >
> > > On Tue, Feb 22, 2022 at 01:43:13AM +0000, Tian, Kevin wrote:
> > >
> > > > > > > + * Drivers should attempt to return estimates so that initial_bytes
> +
> > > > > > > + * dirty_bytes matches the amount of data an immediate
> transition
> > > to
> > > > > > > STOP_COPY
> > > > > > > + * will require to be streamed.
> > > > > >
> > > > > > I didn't understand this requirement. In an immediate transition to
> > > > > > STOP_COPY I expect the amount of data covers the entire device
> > > > > > state, i.e. initial_bytes. dirty_bytes are dynamic and iteratively
> returned
> > > > > > then why we need set some expectation on the sum of
> > > > > > initial+round1_dity+round2_dirty+...
> > > > >
> > > > > "will require to be streamed" means additional data from this point
> > > > > forward, not including anything already sent.
> > > > >
> > > > > It turns into the estimate of how long STOP_COPY will take.
> > > >
> > > > I still didn't get the 'match' part. Why should the amount of data which
> > > > has already been sent match the additional data to be sent in
> STOP_COPY?
> > >
> > > None of it is 'already been sent' the return values are always 'still
> > > to be sent'
> > >
> >
> > Reread the description:
> >
> > + * Drivers should attempt to return estimates so that initial_bytes +
> > + * dirty_bytes matches the amount of data an immediate transition to
> STOP_COPY
> > + * will require to be streamed.
> >
> > I guess you intended to mean that when EITHER initial_bytes OR
> > dirty_bytes is read the returned value should match the amount
> > of data as described above. It is "+" which confused me to think
> > it as a sum of both numbers...
> 
> It is the sum
> 
> initial_bytes declines as the data is transferred. Once everything is
> read out the sum will be 0.
> 

That is the point which I overlooked (with the impression that initial_bytes
is static). As explained in the code comment 'initial' here means the
initial phase of precopy instead of a static number for the entire 
device state. During the initial precopy phase dirty_bytes should not
count any state which hasn't been transmitted then the sum of both
numbers can reflect the accurate size of remaining bytes to be
transmitted. Once initial phase is completed initial_bytes is always 
ZERO then dirty_bytes alone represents the remaining bytes. 
diff mbox series

Patch

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 8c484593dfe0..b4c585114ef3 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1577,7 +1577,7 @@  int vfio_mig_get_next_state(struct vfio_device *device,
 			    enum vfio_device_mig_state new_fsm,
 			    enum vfio_device_mig_state *next_fsm)
 {
-	enum { VFIO_DEVICE_NUM_STATES = VFIO_DEVICE_STATE_RUNNING_P2P + 1 };
+	enum { VFIO_DEVICE_NUM_STATES = VFIO_DEVICE_STATE_PRE_COPY_P2P + 1 };
 	/*
 	 * The coding in this table requires the driver to implement
 	 * FSM arcs:
@@ -1596,25 +1596,59 @@  int vfio_mig_get_next_state(struct vfio_device *device,
 	 *         RUNNING -> STOP
 	 *         STOP -> RUNNING
 	 *
+	 * If precopy is supported then the driver must support these additional
+	 * FSM arcs:
+	 *         RUNNING -> PRE_COPY
+	 *         PRE_COPY -> RUNNING
+	 *         PRE_COPY -> STOP_COPY
+	 * However, if precopy and P2P are supported together then the driver
+	 * must support these additional arcs beyond the P2P arcs above:
+	 *         PRE_COPY -> RUNNING
+	 *         PRE_COPY -> PRE_COPY_P2P
+	 *         PRE_COPY_P2P -> PRE_COPY
+	 *         PRE_COPY_P2P -> RUNNING_P2P
+	 *         PRE_COPY_P2P -> STOP_COPY
+	 *         RUNNING -> PRE_COPY
+	 *         RUNNING_P2P -> PRE_COPY_P2P
+	 *
 	 * If all optional features are supported then the coding will step
 	 * through multiple states for these combination transitions:
+	 *         PRE_COPY -> PRE_COPY_P2P -> STOP_COPY
+	 *         PRE_COPY -> RUNNING -> RUNNING_P2P
+	 *         PRE_COPY -> RUNNING -> RUNNING_P2P -> STOP
+	 *         PRE_COPY -> RUNNING -> RUNNING_P2P -> STOP -> RESUMING
+	 *         PRE_COPY_P2P -> RUNNING_P2P -> RUNNING
+	 *         PRE_COPY_P2P -> RUNNING_P2P -> STOP
+	 *         PRE_COPY_P2P -> RUNNING_P2P -> STOP -> RESUMING
 	 *         RESUMING -> STOP -> RUNNING_P2P
+	 *         RESUMING -> STOP -> RUNNING_P2P -> PRE_COPY_P2P
 	 *         RESUMING -> STOP -> RUNNING_P2P -> RUNNING
+	 *         RESUMING -> STOP -> RUNNING_P2P -> RUNNING -> PRE_COPY
 	 *         RESUMING -> STOP -> STOP_COPY
+	 *         RUNNING -> RUNNING_P2P -> PRE_COPY_P2P
 	 *         RUNNING -> RUNNING_P2P -> STOP
 	 *         RUNNING -> RUNNING_P2P -> STOP -> RESUMING
 	 *         RUNNING -> RUNNING_P2P -> STOP -> STOP_COPY
+	 *         RUNNING_P2P -> RUNNING -> PRE_COPY
 	 *         RUNNING_P2P -> STOP -> RESUMING
 	 *         RUNNING_P2P -> STOP -> STOP_COPY
+	 *         STOP -> RUNNING_P2P -> PRE_COPY_P2P
 	 *         STOP -> RUNNING_P2P -> RUNNING
+	 *         STOP -> RUNNING_P2P -> RUNNING -> PRE_COPY
 	 *         STOP_COPY -> STOP -> RESUMING
 	 *         STOP_COPY -> STOP -> RUNNING_P2P
 	 *         STOP_COPY -> STOP -> RUNNING_P2P -> RUNNING
+	 *
+	 *  The following transitions are blocked:
+	 *         STOP_COPY -> PRE_COPY
+	 *         STOP_COPY -> PRE_COPY_P2P
 	 */
 	static const u8 vfio_from_fsm_table[VFIO_DEVICE_NUM_STATES][VFIO_DEVICE_NUM_STATES] = {
 		[VFIO_DEVICE_STATE_STOP] = {
 			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY,
 			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RESUMING,
 			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
@@ -1623,14 +1657,38 @@  int vfio_mig_get_next_state(struct vfio_device *device,
 		[VFIO_DEVICE_STATE_RUNNING] = {
 			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_PRE_COPY,
+			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
 			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
 		},
+		[VFIO_DEVICE_STATE_PRE_COPY] = {
+			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_PRE_COPY,
+			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_PRE_COPY_P2P,
+			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_PRE_COPY_P2P,
+			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
+		},
+		[VFIO_DEVICE_STATE_PRE_COPY_P2P] = {
+			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_PRE_COPY,
+			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_PRE_COPY_P2P,
+			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY,
+			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
+			[VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR,
+		},
 		[VFIO_DEVICE_STATE_STOP_COPY] = {
 			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_ERROR,
+			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_ERROR,
 			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY,
 			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_STOP,
@@ -1639,6 +1697,8 @@  int vfio_mig_get_next_state(struct vfio_device *device,
 		[VFIO_DEVICE_STATE_RESUMING] = {
 			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_STOP,
+			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RESUMING,
 			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_STOP,
@@ -1647,6 +1707,8 @@  int vfio_mig_get_next_state(struct vfio_device *device,
 		[VFIO_DEVICE_STATE_RUNNING_P2P] = {
 			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_RUNNING,
+			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_PRE_COPY_P2P,
 			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP,
 			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P,
@@ -1655,6 +1717,8 @@  int vfio_mig_get_next_state(struct vfio_device *device,
 		[VFIO_DEVICE_STATE_ERROR] = {
 			[VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_ERROR,
 			[VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_ERROR,
+			[VFIO_DEVICE_STATE_PRE_COPY] = VFIO_DEVICE_STATE_ERROR,
+			[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_DEVICE_STATE_ERROR,
 			[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_ERROR,
 			[VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_ERROR,
 			[VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_ERROR,
@@ -1665,6 +1729,11 @@  int vfio_mig_get_next_state(struct vfio_device *device,
 	static const unsigned int state_flags_table[VFIO_DEVICE_NUM_STATES] = {
 		[VFIO_DEVICE_STATE_STOP] = VFIO_MIGRATION_STOP_COPY,
 		[VFIO_DEVICE_STATE_RUNNING] = VFIO_MIGRATION_STOP_COPY,
+		[VFIO_DEVICE_STATE_PRE_COPY] =
+			VFIO_MIGRATION_STOP_COPY | VFIO_MIGRATION_PRE_COPY,
+		[VFIO_DEVICE_STATE_PRE_COPY_P2P] = VFIO_MIGRATION_STOP_COPY |
+						   VFIO_MIGRATION_P2P |
+						   VFIO_MIGRATION_PRE_COPY,
 		[VFIO_DEVICE_STATE_STOP_COPY] = VFIO_MIGRATION_STOP_COPY,
 		[VFIO_DEVICE_STATE_RESUMING] = VFIO_MIGRATION_STOP_COPY,
 		[VFIO_DEVICE_STATE_RUNNING_P2P] =
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 227f55d57e06..6424c5b3415b 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -817,12 +817,20 @@  struct vfio_device_feature {
  * VFIO_MIGRATION_STOP_COPY | VFIO_MIGRATION_P2P means that RUNNING_P2P
  * is supported in addition to the STOP_COPY states.
  *
+ * VFIO_MIGRATION_STOP_COPY | VFIO_MIGRATION_PRE_COPY means that
+ * PRE_COPY is supported in addition to the STOP_COPY states.
+ *
+ * VFIO_MIGRATION_STOP_COPY | VFIO_MIGRATION_P2P | VFIO_MIGRATION_PRE_COPY
+ * means that RUNNING_P2P, PRE_COPY and PRE_COPY_P2P are supported
+ * in addition to the STOP_COPY states.
+ *
  * Other combinations of flags have behavior to be defined in the future.
  */
 struct vfio_device_feature_migration {
 	__aligned_u64 flags;
 #define VFIO_MIGRATION_STOP_COPY	(1 << 0)
 #define VFIO_MIGRATION_P2P		(1 << 1)
+#define VFIO_MIGRATION_PRE_COPY		(1 << 2)
 };
 #define VFIO_DEVICE_FEATURE_MIGRATION 1
 
@@ -873,8 +881,13 @@  struct vfio_device_feature_mig_state {
  *  RESUMING - The device is stopped and is loading a new internal state
  *  ERROR - The device has failed and must be reset
  *
- * And 1 optional state to support VFIO_MIGRATION_P2P:
+ * And optional states to support VFIO_MIGRATION_P2P:
  *  RUNNING_P2P - RUNNING, except the device cannot do peer to peer DMA
+ * And VFIO_MIGRATION_PRE_COPY:
+ *  PRE_COPY - The device is running normally but tracking internal state
+ *             changes
+ * And VFIO_MIGRATION_P2P | VFIO_MIGRATION_PRE_COPY:
+ *  PRE_COPY_P2P - PRE_COPY, except the device cannot do peer to peer DMA
  *
  * The FSM takes actions on the arcs between FSM states. The driver implements
  * the following behavior for the FSM arcs:
@@ -906,20 +919,48 @@  struct vfio_device_feature_mig_state {
  *
  *   To abort a RESUMING session the device must be reset.
  *
+ * PRE_COPY -> RUNNING
  * RUNNING_P2P -> RUNNING
  *   While in RUNNING the device is fully operational, the device may generate
  *   interrupts, DMA, respond to MMIO, all vfio device regions are functional,
  *   and the device may advance its internal state.
  *
+ *   The PRE_COPY arc will terminate a data transfer session.
+ *
+ * PRE_COPY_P2P -> RUNNING_P2P
  * RUNNING -> RUNNING_P2P
  * STOP -> RUNNING_P2P
  *   While in RUNNING_P2P the device is partially running in the P2P quiescent
  *   state defined below.
  *
+ *   The PRE_COPY arc will terminate a data transfer session.
+ *
+ * RUNNING -> PRE_COPY
+ * RUNNING_P2P -> PRE_COPY_P2P
  * STOP -> STOP_COPY
- *   This arc begin the process of saving the device state and will return a
- *   new data_fd.
+ *   PRE_COPY, PRE_COPY_P2P and STOP_COPY form the "saving group" of states
+ *   which share a data transfer session. Moving between these states alters
+ *   what is streamed in session, but does not terminate or otherwise effect
+ *   the associated fd.
+ *
+ *   These arcs begin the process of saving the device state and will return a
+ *   new data_fd. The migration driver may perform actions such as enabling
+ *   dirty logging of device state when entering PRE_COPY or PER_COPY_P2P.
  *
+ *   Each arc does not change the device operation, the device remains
+ *   RUNNING, P2P quiesced or in STOP. The STOP_COPY state is described below
+ *   in PRE_COPY_P2P -> STOP_COPY.
+ *
+ * PRE_COPY -> PRE_COPY_P2P
+ *   Entering PRE_COPY_P2P continues all the behaviors of PRE_COPY above.
+ *   However, while in the PRE_COPY_P2P state, the device is partially running
+ *   in the P2P quiescent state defined below, like RUNNING_P2P.
+ *
+ * PRE_COPY_P2P -> PRE_COPY
+ *   This arc allows returning the device to a full RUNNING behavior while
+ *   continuing all the behaviors of PRE_COPY.
+ *
+ * PRE_COPY_P2P -> STOP_COPY
  *   While in the STOP_COPY state the device has the same behavior as STOP
  *   with the addition that the data transfers session continues to stream the
  *   migration state. End of stream on the FD indicates the entire device
@@ -937,6 +978,13 @@  struct vfio_device_feature_mig_state {
  *   internal device state for this arc if required to prepare the device to
  *   receive the migration data.
  *
+ * STOP_COPY -> PRE_COPY
+ * STOP_COPY -> PRE_COPY_P2P
+ *   These arcs are not permitted and return error if requested. Future
+ *   revisions of this API may define behaviors for these arcs, in this case
+ *   support will be discoverable by a new flag in
+ *   VFIO_DEVICE_FEATURE_MIGRATION.
+ *
  * any -> ERROR
  *   ERROR cannot be specified as a device state, however any transition request
  *   can be failed with an errno return and may then move the device_state into
@@ -948,7 +996,7 @@  struct vfio_device_feature_mig_state {
  * The optional peer to peer (P2P) quiescent state is intended to be a quiescent
  * state for the device for the purposes of managing multiple devices within a
  * user context where peer-to-peer DMA between devices may be active. The
- * RUNNING_P2P states must prevent the device from initiating
+ * RUNNING_P2P and PRE_COPY_P2P states must prevent the device from initiating
  * any new P2P DMA transactions. If the device can identify P2P transactions
  * then it can stop only P2P DMA, otherwise it must stop all DMA. The migration
  * driver must complete any such outstanding operations prior to completing the
@@ -959,6 +1007,8 @@  struct vfio_device_feature_mig_state {
  * above FSM arcs. As there are multiple paths through the FSM arcs the path
  * should be selected based on the following rules:
  *   - Select the shortest path.
+ *   - The path cannot have saving group states as interior arcs, only
+ *     starting/end states.
  * Refer to vfio_mig_get_next_state() for the result of the algorithm.
  *
  * The automatic transit through the FSM arcs that make up the combination
@@ -972,6 +1022,9 @@  struct vfio_device_feature_mig_state {
  * support them. The user can disocver if these states are supported by using
  * VFIO_DEVICE_FEATURE_MIGRATION. By using combination transitions the user can
  * avoid knowing about these optional states if the kernel driver supports them.
+ *
+ * Arcs touching PRE_COPY and PRE_COPY_P2P are removed if support for PRE_COPY
+ * is not present.
  */
 enum vfio_device_mig_state {
 	VFIO_DEVICE_STATE_ERROR = 0,
@@ -980,8 +1033,57 @@  enum vfio_device_mig_state {
 	VFIO_DEVICE_STATE_STOP_COPY = 3,
 	VFIO_DEVICE_STATE_RESUMING = 4,
 	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
+	VFIO_DEVICE_STATE_PRE_COPY = 6,
+	VFIO_DEVICE_STATE_PRE_COPY_P2P = 7,
+};
+
+/**
+ * VFIO_DEVICE_MIG_PRECOPY - _IO(VFIO_TYPE, VFIO_BASE + 21)
+ *
+ * This ioctl is used on the migration data FD in the precopy phase of the
+ * migration data transfer. It returns an estimate of the current data sizes
+ * remaining to be transferred. It allows the user to judge when it is
+ * appropriate to leave PRE_COPY for STOP_COPY.
+ *
+ * initial_bytes reflects the estimated remaining size of any initial mandatory
+ * precopy data transfer. When initial_bytes returns as zero then the initial
+ * phase of the precopy data is completed. Generally initial_bytes should start
+ * out as approximately the entire device state.
+ *
+ * dirty_bytes reflects an estimate for how much more data needs to be
+ * transferred to complete the migration. Generally it should start as zero
+ * and increase as internal state is dirtied.
+ *
+ * Drivers should attempt to return estimates so that initial_bytes +
+ * dirty_bytes matches the amount of data an immediate transition to STOP_COPY
+ * will require to be streamed.
+ *
+ * Drivers have alot of flexibility in when and what they transfer during the
+ * PRE_COPY phase, and how they report this from VFIO_DEVICE_MIG_PRECOPY.
+ *
+ * During pre-copy the migration data FD has a temporary "end of stream" that is
+ * reached when both initial_bytes and dirty_byte are zero. For instance, this
+ * may indicate that the device is idle and not currently dirtying any internal
+ * state. When read() is done on this temporary end of stream the kernel driver
+ * should return ENOMSG from read(). Userspace can wait for more data (which may
+ * never come) by using poll.
+ *
+ * Once in STOP_COPY the migration data FD has a permanent end of stream
+ * signaled in the usual way by read() always returning 0 and poll always
+ * returning readable. ENOMSG may not be returned in STOP_COPY. Support
+ * for this ioctl is optional.
+ *
+ * Return: 0 on success, -1 and errno set on failure.
+ */
+struct vfio_device_mig_precopy {
+	__u32 argsz;
+	__u32 flags;
+	__aligned_u64 initial_bytes;
+	__aligned_u64 dirty_bytes;
 };
 
+#define VFIO_DEVICE_MIG_PRECOPY _IO(VFIO_TYPE, VFIO_BASE + 21)
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**