[V2,mlx5-next,00/14] Add mlx5 live migration driver

Message ID: 20211019105838.227569-1-yishaih@nvidia.com

Message

Yishai Hadas Oct. 19, 2021, 10:58 a.m. UTC
This series adds an mlx5 live migration driver for VFs that are
migration capable.

It uses vfio_pci_core to register with the VFIO subsystem and then
implements the mlx5-specific logic in the migration area.
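
As a rough sketch, registration via vfio_pci_core looks something like
the following (based on the vfio_pci_core API of this kernel era; the
mlx5-specific structure and names are illustrative assumptions, not the
exact code of this series):

#include <linux/module.h>
#include <linux/pci.h>
#include <linux/slab.h>
#include <linux/vfio_pci_core.h>

/* Hypothetical per-device state wrapping the generic core device */
struct mlx5vf_pci_core_device {
        struct vfio_pci_core_device core_device;
        /* mlx5 migration state (state mutex, saved data, ...) */
};

static const struct vfio_device_ops mlx5vf_pci_ops = {
        .name = "mlx5-vfio-pci",
        /*
         * open/release/ioctl/read/write/mmap mostly delegate to the
         * vfio_pci_core_* helpers; the migration region is added on top.
         */
};

static int mlx5vf_pci_probe(struct pci_dev *pdev,
                            const struct pci_device_id *id)
{
        struct mlx5vf_pci_core_device *mvdev;
        int ret;

        mvdev = kzalloc(sizeof(*mvdev), GFP_KERNEL);
        if (!mvdev)
                return -ENOMEM;

        vfio_pci_core_init_device(&mvdev->core_device, pdev,
                                  &mlx5vf_pci_ops);
        ret = vfio_pci_core_register_device(&mvdev->core_device);
        if (ret) {
                vfio_pci_core_uninit_device(&mvdev->core_device);
                kfree(mvdev);
                return ret;
        }
        dev_set_drvdata(&pdev->dev, mvdev);
        return 0;
}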

The migration implementation follows the definition from uapi/vfio.h and
uses the mlx5 VF->PF command channel to achieve it.

As part of the migration process the VF doesn't ride on mlx5_core;
instead, *two* different PCI devices are driven: the PF, owned by
mlx5_core, and the VF, owned by the mlx5 vfio driver.

The PF's mlx5_core is accessed only during the narrow window of a VF
ioctl that requires its services.

To let that work properly, a new API was added in the PCI layer (i.e.
pci_iov_get_pf_drvdata) that lets a VF driver safely access the PF's
drvdata. It is used in this series by mlx5_core and mlx5_vdpa when a VF
needs that functionality.
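
A minimal sketch of what such a helper can look like (the form that
eventually lands may differ in details):

/*
 * Sketch: return the PF's drvdata to a VF driver, but only when the PF
 * is bound to the driver the caller expects, so the returned pointer
 * has a known type and stays valid for the lifetime of the VF (VFs are
 * always removed before their parent PF).
 */
void *pci_iov_get_pf_drvdata(struct pci_dev *dev,
                             struct pci_driver *pf_driver)
{
        struct pci_dev *pf_dev;

        if (!dev->is_virtfn)
                return ERR_PTR(-EINVAL);
        pf_dev = dev->physfn;
        if (pf_dev->driver != pf_driver)
                return ERR_PTR(-EINVAL);
        return pci_get_drvdata(pf_dev);
}

A VF driver would then call, e.g., pci_iov_get_pf_drvdata(pdev,
&mlx5_core_driver) and get back either the PF's drvdata or an ERR_PTR.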

In addition, mlx5_core was aligned with other drivers to disable SRIOV
before the PF goes away, as part of the remove_one() callback.

This enables proper usage of the new PCI API above and prevents a
warning message that is emitted today when this is not done.

The series also exposes an API from the PCI subsystem named
pci_iov_vf_id() to get the index of a VF. The PCI core uses this index
internally, often calling it the vf_id, during the setup of the VF,
e.g. in pci_iov_add_virtfn().

The returned VF index is needed by the mlx5 vfio driver for its internal
operations to configure/control its VFs as part of the migration
process.
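
A minimal sketch of the helper, essentially inverting the routing-ID
arithmetic that pci_iov_add_virtfn() uses when creating VFs (the merged
form may differ):

/* Sketch: compute the VF index from the VF's and PF's routing IDs */
int pci_iov_vf_id(struct pci_dev *dev)
{
        struct pci_dev *pf;

        if (!dev->is_virtfn)
                return -EINVAL;

        pf = pci_physfn(dev);
        return (pci_dev_id(dev) - (pci_dev_id(pf) + pf->sriov->offset)) /
               pf->sriov->stride;
}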

With the above functionality in place, the driver implements the
suspend/resume flows to work with QEMU.

Changes from V1:
PCI/IOV:
- Name the actual interface in the subject, as asked by Bjorn, and add
  his Acked-by.
- Check explicitly for !dev->is_virtfn, as asked by Alex.
vfio:
- Add a separate patch to fix the non-compiling
  VFIO_DEVICE_STATE_SET_ERROR macro.
- Expose vfio_pci_aer_err_detected() so drivers can set it in their own
  pci error handlers.
- Add a macro for VFIO_DEVICE_STATE_ERROR in the uapi header file as was
  suggested by Alex.
vfio/mlx5:
- Use XOR as part of checking the 'state' change command, as suggested
  by Alex (see the sketch after this list).
- Set the state to VFIO_DEVICE_STATE_ERROR when an error occurs,
  instead of VFIO_DEVICE_STATE_INVALID.
- Improve state checking as was suggested by Jason.
- Use its own PCI reset_done error handler as was suggested by Jason and
  fix the locking scheme around the state mutex to work properly.
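
As an illustration of the XOR idea (all names in this sketch are
hypothetical, not the series' actual code): XOR-ing the current and
requested device_state yields exactly the set of bits a write flips, so
each transition can be validated and handled once.

static int mlx5vf_handle_state_change(u32 cur_state, u32 new_state)
{
        u32 flipped_bits = cur_state ^ new_state;

        if (!flipped_bits)      /* nothing changed */
                return 0;

        if (flipped_bits & VFIO_DEVICE_STATE_RUNNING) {
                /* RUNNING toggled: quiesce or restart the VF */
        }
        if (flipped_bits & VFIO_DEVICE_STATE_SAVING) {
                /* SAVING toggled: set up or tear down state saving */
        }
        if (flipped_bits & VFIO_DEVICE_STATE_RESUMING) {
                /* RESUMING toggled: prepare for, or apply, incoming state */
        }
        return 0;
}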

Changes from V0:
PCI/IOV:
- Add an API (i.e. pci_iov_get_pf_drvdata()) that allows SRIOV VF
  drivers to reach the drvdata of a PF.
net/mlx5:
- Add an extra patch to disable SRIOV before PF removal.
- Adapt to use the above PCI/IOV API as part of mlx5_vf_get_core_dev().
- Reuse the exported PCI/IOV virtfn index function call (i.e.
  pci_iov_vf_id()).
vfio:
- Add support in pci_core to let a driver be notified upon 'reset_done'
  so it can set its internal state accordingly.
- Add some helpers for 'invalid' state handling.
vfio/mlx5:
- Move to use the 'command mode' instead of the 'state machine' scheme,
  as discussed on the mailing list.
- Handle the RESET scenario when called by vfio_pci_core, setting the
  internal state accordingly.
- Set initial state as RUNNING.
- Put the driver files in a sub-folder named mlx5 under
  drivers/vfio/pci and update the MAINTAINERS file as asked.
vdpa/mlx5:
- Add a new patch to use mlx5_vf_get_core_dev() to get the PF device.

---------------------------------------------------------------
Alex,

This series touches our ethernet and RDMA drivers, so we will need to
route the patches through a separate shared branch (mlx5-next) in order
to eliminate the chance of merge conflicts between different
subsystems.

Are you fine with taking this V2 series through mlx5-next? We'll then
send you a PR to include it.

Thanks,
Yishai

Jason Gunthorpe (2):
  PCI/IOV: Add pci_iov_vf_id() to get VF index
  PCI/IOV: Add pci_iov_get_pf_drvdata() to allow VF reaching the drvdata
    of a PF

Leon Romanovsky (1):
  net/mlx5: Reuse exported virtfn index function call

Yishai Hadas (11):
  net/mlx5: Disable SRIOV before PF removal
  net/mlx5: Expose APIs to get/put the mlx5 core device
  vdpa/mlx5: Use mlx5_vf_get_core_dev() to get PF device
  vfio: Fix VFIO_DEVICE_STATE_SET_ERROR macro
  Vfio: Add a macro for VFIO_DEVICE_STATE_ERROR
  vfio/pci_core: Make the region->release() function optional
  net/mlx5: Introduce migration bits and structures
  vfio/mlx5: Expose migration commands over mlx5 device
  vfio/mlx5: Implement vfio_pci driver for mlx5 devices
  vfio/pci: Expose vfio_pci_aer_err_detected()
  vfio/mlx5: Use its own PCI reset_done error handler

 MAINTAINERS                                   |   6 +
 .../net/ethernet/mellanox/mlx5/core/main.c    |  44 ++
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   |   1 +
 .../net/ethernet/mellanox/mlx5/core/sriov.c   |  17 +-
 drivers/pci/iov.c                             |  43 ++
 drivers/vdpa/mlx5/net/mlx5_vnet.c             |  27 +-
 drivers/vfio/pci/Kconfig                      |   3 +
 drivers/vfio/pci/Makefile                     |   2 +
 drivers/vfio/pci/mlx5/Kconfig                 |  10 +
 drivers/vfio/pci/mlx5/Makefile                |   4 +
 drivers/vfio/pci/mlx5/cmd.c                   | 353 +++++++++
 drivers/vfio/pci/mlx5/cmd.h                   |  43 ++
 drivers/vfio/pci/mlx5/main.c                  | 727 ++++++++++++++++++
 drivers/vfio/pci/vfio_pci_core.c              |   8 +-
 include/linux/mlx5/driver.h                   |   3 +
 include/linux/mlx5/mlx5_ifc.h                 | 145 +++-
 include/linux/pci.h                           |  15 +-
 include/linux/vfio_pci_core.h                 |   2 +
 include/uapi/linux/vfio.h                     |   4 +-
 19 files changed, 1431 insertions(+), 26 deletions(-)
 create mode 100644 drivers/vfio/pci/mlx5/Kconfig
 create mode 100644 drivers/vfio/pci/mlx5/Makefile
 create mode 100644 drivers/vfio/pci/mlx5/cmd.c
 create mode 100644 drivers/vfio/pci/mlx5/cmd.h
 create mode 100644 drivers/vfio/pci/mlx5/main.c

Comments

Cornelia Huck Nov. 17, 2021, 4:42 p.m. UTC | #1
Ok, here's the contents (as of 2021-11-17 16:30 UTC) of the etherpad at
https://etherpad.opendev.org/p/VFIOMigrationDiscussions -- in the hope
of providing a better starting point for further discussion (I know that
discussions are still ongoing in other parts of this thread; but
frankly, I'm getting a headache trying to follow them, and I think it
would be beneficial to concentrate on the fundamental questions
first...)

VFIO migration: current state and open questions

Current status
  * Linux
    * uAPI has been merged with a8a24f3f6e38 ("vfio: UAPI for migration
    interface for device state") in 5.8
      * no kernel user of the uAPI merged
      * Several out of tree drivers apparently
    * support for mlx5 currently on the list (latest:
    https://lore.kernel.org/all/20211027095658.144468-1-yishaih@nvidia.com/
    with discussion still happening on older versions)
    * support for HiSilicon ACC devices is on the list too. It adds
    support for live migration of HiSilicon crypto accelerator VF
    devices. These are simple DMA-queue-based PCIe integrated endpoint
    devices; they have no P2P support and do not use DMA for
    migration. <latest:
    https://lore.kernel.org/lkml/20210915095037.1149-1-shameerali.kolothum.thodi@huawei.com/>
  * QEMU
    * basic support added in 5.2, some fixes later
      * support for vfio-pci only so far, still experimental
      ("x-enable-migration") as of 6.2
      * Only tested with out of tree drivers
  * other software?

Problems/open questions
  * Are the status bits currently defined in the uAPI
  (_RESUMING/_SAVING/_RUNNING) sufficient to express all states we need?
  * What does clearing _RUNNING imply? In particular, does it mean that
  the device is frozen, or are some operations still permitted?
  * various points brought up: P2P, SET_IRQS, ... <please summarize :)>:
    * P2P DMA support between devices requires an additional HW control
    state where the device can receive but not transmit DMA
    * No definition of what HW needs to preserve when RESUMING toggles
    off - (eg today SET_IRQS must work, what else?).
    * In general, how do IRQs work with future non-trapping IMS?
    * Dealing with pending IOMMU PRI faults during migration
  * Problems identified with the !RUNNING state:
    * When there are multiple devices in a user context (VM), we can't
    atomically move all devices to the !_RUNNING state concurrently.
      * Suggests the current uAPI has a usage restriction for
      environments that do not make use of peer-to-peer DMA (i.e. we
      can't have a device generating DMA to a p2p target that cannot
      accept it - especially if an error response from the target can
      generate host-fatal conditions)
      * Possible userspace implications:
        * VMs could be limited to a single device to guarantee that no
        p2p exists - non-vfio devices generating physical p2p DMA in the
        future is a concern
        * Hypervisor may skip creating p2p DMA mappings, creating a
        choice whether the VM supports migration or p2p
      * Jason proposed a new NDMA (no-dma) state that seems to match the
      mlx5 implementation of "quiesce" vs "freeze" states, where NDMA
      would indicate the device cannot generate DMA or interrupts, such
      that once userspace places all devices into the (NDMA | RUNNING)
      state the environment is fully quiesced.  A flag or capability on
      the migration region could indicate support for this feature. (A
      sketch of the state bits involved follows this list.)
      * Alex proposed that this could be equally resolved within the
      current device states if !RUNNING becomes the quiescent point
      where the device stops generating DMA and interrupts, with a
      requirement that the user moves all devices to !RUNNING before
      collecting device migration data (as indicated by reading
      pending_bytes) or else risk corrupting the migration data, which
      the device could indicate via an errno in the migration process.
      A flag or capability would still be required to indicate this
      support.
        * Jason does not favor this approach, objecting that the mode
        transition is implicit, and still needs qemu changes anyhow
    * In general, what operations or accesses is the user restricted
    from performing on the device while !RUNNING
      * Jason has proposed very restricted access (essentially none
      beyond the migration region itself), including no MMIO access
      <20211028234750.GP2744544@nvidia.com>  This essentially imposes a
      device transition to an intermediate state between SAVING and
      RUNNING.
        * Alex requested a formal uAPI update defining what accesses are
        allowed, including which regions and ioctls.
        * The existing uAPI does not require any such transition to a
        "null" state or TBD new device state bit.  QEMU currently
        expects access to config space and the ability to call SET_IRQS
        and create mmaps while in the RESUMING state, without the
        RUNNING bit set.  Restoring MSI-X interrupt configuration
        necessarily requires MMIO access to the device.
        * Jason suggested a new device state bit and user protocol to
        account for this, where the device is in a !RUNNING and
        !RESUMING state, but to some degree becomes manipulable via
        device regions and ioctls.  No compatibility mechanism was
        proposed.
        * Alex suggested that this is potentially supportable via a spec
        clarification that requires the device migration data to be
        written to completion before userspace performs other region or
        ioctl access to the device. (mlx5's driver is designed to not
        inspect the migration blob itself, so it can't detect the
        "end". The migration blob is finished when mlx5 sees RESUMING
        clear.)
    * PRI into the guest (guest user process SVA) has a sequencing
    problem with RUNNING - cannot migrate a vIOMMU in the middle of a
    page fault, must stop and flush faults before stopping vCPUs
  * The uAPI could benefit from some more detailed documentation
  (e.g. how to use it, what to do in edge cases, ...) outside of the
  header file.
  * Trying to use the mlx5 support currently on the list has unearthed
  some problems in QEMU <please summarize :)>
  * Discussion regarding dirty tracking and how much it should be
  controlled by user space is still ongoing
  * General questions:
    * How much do we want to change the uAPI and/or the documentation to
    accommodate what QEMU has implemented so far?
    * How much do we want to change QEMU?
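
(For reference, below is a sketch of the device_state bits under
discussion. The first three are the bits defined by the merged v1 uAPI;
the NDMA bit and its value are hypothetical, since nothing had been
agreed at this point.)

/* From the merged uAPI (include/uapi/linux/vfio.h) */
#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)

/* Hypothetical: the proposed NDMA bit; no value was agreed on-list */
#define VFIO_DEVICE_STATE_NDMA      (1 << 3)

/*
 * Under the NDMA proposal, userspace would first set (NDMA | RUNNING)
 * on every device in the VM - each device stops generating DMA and
 * interrupts but can still accept p2p DMA - and only then clear
 * RUNNING on all of them, avoiding the atomicity problem above.
 */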

Possible solutions
  * uAPI
    * fine as is, or
    * needs some clarifications, or
    * needs rework, which might mean a v2
  * QEMU
    * fine as is (modulo bugfixes), or
    * needs some rework, but not impacting the uAPI, or
    * needs some rework, which also needs some changes in the uAPI
  * Suggested approach:
    * Work on the documentation, and try to come up with some more
    HW-centric docs
    * Depending on that, decide how many changes we want/need to do in
    QEMU
Jason Gunthorpe Nov. 17, 2021, 5:47 p.m. UTC | #2
On Wed, Nov 17, 2021 at 05:42:58PM +0100, Cornelia Huck wrote:
> Ok, here's the contents (as of 2021-11-17 16:30 UTC) of the etherpad at
> https://etherpad.opendev.org/p/VFIOMigrationDiscussions -- in the hope
> of providing a better starting point for further discussion (I know that
> discussions are still ongoing in other parts of this thread; but
> frankly, I'm getting a headache trying to follow them, and I think it
> would be beneficial to concentrate on the fundamental questions
> first...)

In my mind several of these topics now have answers:

>       * Jason proposed a new NDMA (no-dma) state that seems to match the

NDMA solves the PRI problem too, and allows dirty tracking to be
iterative. So yes to adding to device_state vs implicit via !RUNNING

>     * No definition of what HW needs to preserve when RESUMING toggles
>     off - (eg today SET_IRQS must work, what else?).

Everything in the device controlled by other kernel subsystems (IRQs,
MSI, PCI config space) must continue to work across !RUNNING and must
not be disturbed by the migration driver during RESUME.

So, a clear yes: SET_IRQS during !RUNNING must be supported

>     * In general, what operations or accesses is the user restricted
>     from performing on the device while !RUNNING

This still needs an answer, beyond the carve-out above. HNS won't
work without restrictions, for instance.

>     * PRI into the guest (guest user process SVA) has a sequencing
>     problem with RUNNING - can not migrate a vIOMMU in the middle of a
>     page fault, must stop and flush faults before stopping vCPUs

NDMA|RUNNING allows suspending the vIOMMU

> The uAPI could benefit from some more detailed documentation
> (e.g. how to use it, what to do in edge cases, ...) outside of the
> header file.

We have an internal draft of this now

> Trying to use the mlx5 support currently on the list has unearthed
> some problems in QEMU <please summarize :)>

If the kernel does anything odd, qemu calls abort()

Performance is bad; Yishai sent a patch

Jason