[V4,vfio,00/14] Add migration PRE_COPY support for mlx5 driver

Message ID 20221206083438.37807-1-yishaih@nvidia.com (mailing list archive)

Message

Yishai Hadas Dec. 6, 2022, 8:34 a.m. UTC
This series adds the migration PRE_COPY uAPIs and their implementation in the
mlx5 driver.

The uAPIs follow an earlier discussion of this area on the mailing list [1].

When those patches were sent there was no driver implementation of the uAPIs;
now there is one for the mlx5 driver.

The optional PRE_COPY state opens the saving data transfer FD before reaching
STOP_COPY and allows the device to dirty-track its internal state changes, with
the general aim of reducing the volume of data transferred in the STOP_COPY
stage.

While in PRE_COPY the device remains RUNNING, but the saving FD is open.

A new ioctl, VFIO_MIG_GET_PRECOPY_INFO, allows userspace to query the progress
of the precopy operation in the driver, so that it can decide to move to
STOP_COPY once the initial data set has been transferred, and possibly after
the dirty size has shrunk appropriately.
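
As a rough illustration of that decision, a userspace policy might look like
the sketch below. The struct only mirrors the two byte counters discussed for
VFIO_MIG_GET_PRECOPY_INFO; its name, the helper and the threshold parameter are
assumptions made for this example and are not part of the series.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the progress counters reported by
 * VFIO_MIG_GET_PRECOPY_INFO; this is not the uAPI structure itself. */
struct precopy_progress {
	uint64_t initial_bytes; /* initial data set still to be read */
	uint64_t dirty_bytes;   /* estimate of data dirtied since then */
};

/* Returns true when it looks reasonable to move to STOP_COPY: the
 * initial data set has been drained and the dirty size is at or below
 * a caller-chosen threshold (an assumed tuning knob). */
static inline bool precopy_should_stop(const struct precopy_progress *p,
				       uint64_t dirty_threshold)
{
	if (p->initial_bytes != 0)
		return false;
	return p->dirty_bytes <= dirty_threshold;
}
```

A caller would refresh the counters from the ioctl between reads of the saving
FD and re-evaluate the helper each time; the threshold would typically be
derived from the downtime budget.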

User space can detect whether PRE_COPY is supported for a given device by
checking the VFIO_MIGRATION_PRE_COPY flag, once, via the VFIO_DEVICE_FEATURE
ioctl with the VFIO_DEVICE_FEATURE_MIGRATION feature.
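
The check itself is a single bit test. The sketch below uses a stand-in
constant so that it builds without recent kernel headers; the constant name and
the helper are assumptions for illustration, and real code should use
VFIO_MIGRATION_PRE_COPY from include/uapi/linux/vfio.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-in for the VFIO_MIGRATION_PRE_COPY bit. The
 * authoritative definition lives in include/uapi/linux/vfio.h; this
 * value is an assumption made so the sketch is self-contained. */
#define EXAMPLE_MIGRATION_PRE_COPY (UINT64_C(1) << 2)

/* 'flags' is assumed to be the flags word obtained from the
 * VFIO_DEVICE_FEATURE_MIGRATION query. It only needs to be queried
 * once, so the result can be cached by the caller. */
static inline bool example_device_supports_precopy(uint64_t flags)
{
	return (flags & EXAMPLE_MIGRATION_PRE_COPY) != 0;
}
```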

Further details are in the uAPI patch of this series.

Finally, the series adds the mlx5 implementation, based on the device
specification for PRE_COPY.

To support PRE_COPY, the mlx5 driver transfers multiple states (images) of the
device: the source VF can save and transfer multiple states, and the target VF
loads them in that order.

The device saves three kinds of states:
1) Initial state - when the device moves to PRE_COPY state.
2) Middle state - during PRE_COPY phase via VFIO_MIG_GET_PRECOPY_INFO,
                  can be multiple such states.
3) Final state - when the device moves to STOP_COPY state.

After moving to the PRE_COPY state, the user holds the saving FD and should use
it to transfer data from the source to the target while the VM is still
running. From the user's point of view it is a single stream of data; from the
mlx5 driver's point of view it contains multiple images/states. To delimit
them, the driver adds headers with metadata on the source, to be parsed on the
target.
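
To make the framing idea concrete, the sketch below writes several images into
one byte stream and walks them back in order. The header layout, field name and
helper names are hypothetical and chosen only for this example; the actual mlx5
header is defined by the driver patches and is not reproduced here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-image header. Host byte order is used here for
 * brevity; a real wire format would fix the endianness. */
struct example_image_header {
	uint64_t image_size; /* payload bytes that follow this header */
};

/* Source side: append one header + payload record at offset '*off'.
 * Returns 0 on success, -1 if the record does not fit in 'cap'. */
static inline int example_put_image(uint8_t *buf, size_t cap, size_t *off,
				    const void *payload, uint64_t len)
{
	struct example_image_header hdr = { .image_size = len };

	if (*off > cap || len > cap - *off ||
	    sizeof(hdr) > cap - *off - len)
		return -1;
	memcpy(buf + *off, &hdr, sizeof(hdr));
	memcpy(buf + *off + sizeof(hdr), payload, (size_t)len);
	*off += sizeof(hdr) + (size_t)len;
	return 0;
}

/* Target side: walk the stream in order and count complete images.
 * Returns -1 if a header or payload is truncated. */
static inline int example_count_images(const uint8_t *buf, size_t len)
{
	size_t off = 0;
	int count = 0;

	while (off < len) {
		struct example_image_header hdr;

		if (len - off < sizeof(hdr))
			return -1;
		memcpy(&hdr, buf + off, sizeof(hdr));
		off += sizeof(hdr);
		if (hdr.image_size > len - off)
			return -1;
		off += (size_t)hdr.image_size;
		count++;
	}
	return count;
}
```

The point of the sketch is that userspace can forward the bytes without knowing
where one image ends and the next begins; only the two driver ends interpret
the headers.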

At some point the user may switch the device state from PRE_COPY to STOP_COPY;
this triggers saving of the final state.

As discussed earlier on the mailing list, the data returned as part of PRE_COPY
is not required to have any bearing on the data size available during the
STOP_COPY phase.

For an estimate of the STOP_COPY data size, there is the
VFIO_DEVICE_FEATURE_MIG_DATA_SIZE option.

With this series, the mlx5 driver gains roughly a 20-30 percent improvement in
downtime compared to the previous code, which did not support PRE_COPY.

The series starts with preparatory patches for managing multiple images,
followed by the PRE_COPY implementation itself.

The matching qemu changes can be previewed here [2].

They come on top of the v2 migration protocol patches that were already sent to
the mailing list.

[1] https://lore.kernel.org/kvm/20220302172903.1995-8-shameerali.kolothum.thodi@huawei.com/
[2] https://github.com/avihai1122/qemu/commits/mig_v2_precopy

Changes from V3: https://www.spinics.net/lists/kvm/msg297449.html
Patch #1:
- Add Acked-by: Leon Romanovsky.
Patch #10:
- Fix mlx5vf_precopy_ioctl() signature to return long instead of ssize_t
  as Alex pointed out.

Changes from V2: https://www.spinics.net/lists/kvm/msg297112.html

Patch #2:
- Add a note that the VFIO_MIG_GET_PRECOPY_INFO ioctl is mandatory when
  a driver claims to support VFIO_MIGRATION_PRE_COPY as was raised by
  Shameer Kolothum.
- Add Reviewed-by: Shameer Kolothum and Kevin Tian.
Patch #3:
- Add a comment in the code as suggested by Jason.
All:
- Add Reviewed-by: Jason Gunthorpe for the series.

Note:
As pointed out by Leon in the mailing list, no need for a PR for the
first patch of net/mlx5.

Changes from V1: https://www.spinics.net/lists/kvm/msg296475.html

Patch #2: Rephrase the 'initial_bytes' meaning as was suggested by Jason.
Patch #9: Fix to send header based on PRE_COPY support.
Patch #13: Fix some unwind flow to call complete().

Changes from V0: https://www.spinics.net/lists/kvm/msg294247.html

Drop the first 2 patches that Alex merged already.
Refactor the mlx5 implementation based on Jason's comments on V0, including:
* Refactor the PD usage to be aligned with the migration file life cycle.
* Refactor the MKEY usage to be aligned with the migration file life cycle.
* Refactor the migration file state.
* Use queue based data chunks to simplify the driver code.
* Use the FSM model on the target to simplify the driver code.
* Extend the driver pre_copy header for future use.

Yishai

Jason Gunthorpe (1):
  vfio: Extend the device migration protocol with PRE_COPY

Shay Drory (3):
  net/mlx5: Introduce ifc bits for pre_copy
  vfio/mlx5: Fallback to STOP_COPY upon specific PRE_COPY error
  vfio/mlx5: Enable MIGRATION_PRE_COPY flag

Yishai Hadas (10):
  vfio/mlx5: Enforce a single SAVE command at a time
  vfio/mlx5: Refactor PD usage
  vfio/mlx5: Refactor MKEY usage
  vfio/mlx5: Refactor migration file state
  vfio/mlx5: Refactor to use queue based data chunks
  vfio/mlx5: Introduce device transitions of PRE_COPY
  vfio/mlx5: Introduce SW headers for migration states
  vfio/mlx5: Introduce vfio precopy ioctl implementation
  vfio/mlx5: Consider temporary end of stream as part of PRE_COPY
  vfio/mlx5: Introduce multiple loads

 drivers/vfio/pci/mlx5/cmd.c   | 409 ++++++++++++++----
 drivers/vfio/pci/mlx5/cmd.h   |  96 ++++-
 drivers/vfio/pci/mlx5/main.c  | 752 ++++++++++++++++++++++++++++------
 drivers/vfio/vfio_main.c      |  74 +++-
 include/linux/mlx5/mlx5_ifc.h |  14 +-
 include/uapi/linux/vfio.h     | 123 +++++-
 6 files changed, 1248 insertions(+), 220 deletions(-)

Comments

Alex Williamson Dec. 6, 2022, 9:49 p.m. UTC | #1
Applied to vfio next branch for v6.2.  Thanks,

Alex