[V2,00/45] Live update: vfio and iommufd

Message ID 1739542467-226739-1-git-send-email-steven.sistare@oracle.com (mailing list archive)

Message

Steven Sistare Feb. 14, 2025, 2:13 p.m. UTC
Support vfio and iommufd devices with the cpr-transfer live migration mode.
Devices that do not support live migration can still support cpr-transfer,
allowing live update to a new version of QEMU on the same host, with no loss
of guest connectivity.

No user-visible interfaces are added.

For legacy containers:

Pass vfio device descriptors to new QEMU.  In new QEMU, during vfio_realize,
skip the ioctls that configure the device, because it is already configured.
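
As a rough illustration of handing an open descriptor from one process to
another, the sketch below sends an fd over a Unix socket with SCM_RIGHTS.
It is not code from this series: send_fd and recv_fd are hypothetical names,
and the transport is an assumption made for the example.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: send one open descriptor over a connected Unix
 * socket.  Returns 0 on success, -1 on failure. */
int send_fd(int sock, int fd)
{
    char byte = 'F';            /* at least one data byte must accompany the fd */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;   /* forces correct alignment of buf */
    } u;
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;

    memset(&u, 0, sizeof(u));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one descriptor.  Returns the new fd number
 * in the receiving process, or -1 on failure. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } u;
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;
    int fd = -1;

    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1) {
        return -1;
    }
    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int))) {
        return -1;
    }
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The received fd refers to the same open file description as the sender's,
which is why a device that was configured through it stays configured.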

Use VFIO_DMA_UNMAP_FLAG_VADDR to abandon the old VAs for DMA-mapped
regions, and use VFIO_DMA_MAP_FLAG_VADDR to register the new VA in new
QEMU and update the locked memory accounting.  The physical pages remain
pinned, because the descriptor of the device that locked them remains open,
so DMA to those pages continues without interruption.  Mediated devices are
not supported, however, because they require the VA to always be valid, and
there is a brief window where no VA is registered.
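
A minimal sketch of the two requests, assuming the type1 uAPI in
<linux/vfio.h>.  The helper names are hypothetical and this is not the code
in the series; it only shows how the flags are meant to be paired for one
existing mapping.

```c
#include <linux/vfio.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Old process: build a request that invalidates the VA of one mapping.
 * The IOVA mapping and the pinned pages are left in place. */
void vaddr_discard_init(struct vfio_iommu_type1_dma_unmap *unmap,
                        uint64_t iova, uint64_t size)
{
    memset(unmap, 0, sizeof(*unmap));
    unmap->argsz = sizeof(*unmap);
    unmap->flags = VFIO_DMA_UNMAP_FLAG_VADDR;
    unmap->iova = iova;
    unmap->size = size;
}

/* New process: build a request that registers the new VA for the same
 * iova and size.  No READ/WRITE flags are set; protection is unchanged. */
void vaddr_restore_init(struct vfio_iommu_type1_dma_map *map,
                        void *new_vaddr, uint64_t iova, uint64_t size)
{
    memset(map, 0, sizeof(*map));
    map->argsz = sizeof(*map);
    map->flags = VFIO_DMA_MAP_FLAG_VADDR;
    map->vaddr = (uint64_t)(uintptr_t)new_vaddr;
    map->iova = iova;
    map->size = size;
}

/* Issue the requests on a container fd.  Each returns 0 on success or -1
 * with errno set, as ioctl() does. */
int vaddr_discard(int container_fd, uint64_t iova, uint64_t size)
{
    struct vfio_iommu_type1_dma_unmap unmap;

    vaddr_discard_init(&unmap, iova, size);
    return ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &unmap);
}

int vaddr_restore(int container_fd, void *new_vaddr,
                  uint64_t iova, uint64_t size)
{
    struct vfio_iommu_type1_dma_map map;

    vaddr_restore_init(&map, new_vaddr, iova, size);
    return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}
```

Between vaddr_discard in old QEMU and vaddr_restore in new QEMU, the mapping
has no valid VA, which is the window that rules out mediated devices.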

Save the MSI message area as part of vfio-pci vmstate, and pass the interrupt
and notifier eventfds to new QEMU.  New QEMU loads the MSI data, then the
vfio-pci post_load handler finds the eventfds in CPR state, rebuilds vector
data structures, and attaches the interrupts to the new KVM instance.  This
logic also applies to iommufd containers.
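
To illustrate why keeping the eventfds open matters, the sketch below shows
that an eventfd's counter is kernel state shared by every descriptor that
refers to it.  It is an illustration only: the function name is hypothetical,
and dup() stands in for passing the descriptor to another process.

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Post 'signals' events through the original descriptor, close it, then
 * read the counter through a duplicate.  Returns the number of events
 * seen, or -1 on error (including the case where none are pending). */
int64_t pending_events_via_dup(unsigned int signals)
{
    eventfd_t count = 0;
    unsigned int i;
    int rc;
    int efd = eventfd(0, EFD_NONBLOCK);
    int inherited;

    if (efd < 0) {
        return -1;
    }
    inherited = dup(efd);       /* stand-in for the descriptor the new owner holds */
    if (inherited < 0) {
        close(efd);
        return -1;
    }
    for (i = 0; i < signals; i++) {
        if (eventfd_write(efd, 1) < 0) {
            close(efd);
            close(inherited);
            return -1;
        }
    }
    close(efd);                 /* the old owner goes away */

    rc = eventfd_read(inherited, &count);
    close(inherited);
    return rc == 0 ? (int64_t)count : -1;
}
```

Events posted before the new owner starts reading are accumulated rather
than lost, because the counter lives with the eventfd and not with either
descriptor.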

For iommufd containers:

Use IOMMU_IOAS_MAP_FILE to register memory regions for DMA when they are
backed by a file (including a memfd), so DMA mappings do not depend on VA,
which can differ after live update.  This allows mediated devices to be
supported.
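
The property relied on here can be seen without iommufd at all.  In the
sketch below, an unlinked temporary file stands in for the RAM backing file
(a memfd would behave the same way), and the same contents are reached
through two mappings at different VAs.  The function name is hypothetical,
and no IOMMU_IOAS_MAP_FILE call is made.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if data written through one mapping of a file is visible
 * through a second mapping at a different VA, 0 if not, -1 on error. */
int file_contents_survive_remap(size_t len)
{
    static const char pattern[] = "guest data";
    char path[] = "/tmp/cpr-ram-XXXXXX";
    char *old_va, *new_va;
    int fd, ok;

    if (len < sizeof(pattern)) {
        return -1;
    }
    fd = mkstemp(path);
    if (fd < 0) {
        return -1;
    }
    unlink(path);               /* the file now lives only as long as fd */
    if (ftruncate(fd, (off_t)len) < 0) {
        close(fd);
        return -1;
    }

    old_va = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (old_va == MAP_FAILED) {
        close(fd);
        return -1;
    }
    memcpy(old_va, pattern, sizeof(pattern));

    /* Map again while the first mapping still exists, so the kernel must
     * choose a different address. */
    new_va = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (new_va == MAP_FAILED) {
        munmap(old_va, len);
        close(fd);
        return -1;
    }

    ok = new_va != old_va && memcmp(new_va, pattern, sizeof(pattern)) == 0;

    munmap(old_va, len);
    munmap(new_va, len);
    close(fd);
    return ok;
}
```

A mapping that is described by file and offset stays meaningful whichever
VA a given process happens to use, so there is no window without a valid
translation.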

Pass the iommufd and vfio device descriptors from old to new QEMU.  In new
QEMU, during vfio_realize, skip the ioctls that configure the device, because
it is already configured.

In new QEMU, call ioctl(IOMMU_IOAS_CHANGE_PROCESS) to update mm ownership and
locked memory accounting.

Patches 5 to 14 are specific to legacy containers.
Patches 27 to 44 are specific to iommufd containers.
The remainder apply to both.

Changes from previous versions:
  * V1 of this series contains minor changes from the "Live update: vfio" and
    "Live update: iommufd" series, mainly bug fixes and refactored patches.

Changes in V2:
  * refactored various vfio code snippets into new cpr helpers
  * refactored vfio struct members into cpr-specific structures
  * refactored various small changes into their own patches
  * split complex patches.  Notably:
    - split "refactor for cpr" into 5 patches
    - split "reconstruct device" into 4 patches
  * refactored vfio_connect_container using helpers and made its
    error recovery more robust.
  * moved vfio pci msi/vector/intx cpr functions to cpr.c
  * renamed "reused" to cpr_reused and cpr.reused
  * squashed vfio_cpr_[un]register_container to their call sites
  * simplified iommu_type setting after cpr
  * added cpr_open_fd and cpr_is_incoming helpers
  * removed changes from vfio_legacy_dma_map, and instead temporarily
    override dma_map and dma_unmap ops.
  * deleted error_report and returned Error to callers where possible.
  * simplified the memory_get_xlat_addr interface
  * fixed flags passed to iommufd_backend_alloc_hwpt
  * defined MIG_PRI_UNINITIALIZED
  * added maintainers

Steve Sistare (45):
  MAINTAINERS: Add reviewer for CPR
  migration: cpr helpers
  migration: lower handler priority
  vfio: vfio_find_ram_discard_listener
  vfio/container: ram discard disable helper
  vfio/container: reform vfio_connect_container cleanup
  vfio/container: vfio_container_group_add
  vfio/container: register container for cpr
  vfio/container: preserve descriptors
  vfio/container: export vfio_legacy_dma_map
  vfio/container: discard old DMA vaddr
  vfio/container: restore DMA vaddr
  vfio/container: mdev cpr blocker
  vfio/container: recover from unmap-all-vaddr failure
  pci: export msix_is_pending
  pci: skip reset during cpr
  vfio-pci: skip reset during cpr
  vfio/pci: vfio_vector_init
  vfio/pci: vfio_notifier_init
  vfio/pci: pass vector to virq functions
  vfio/pci: vfio_notifier_init cpr parameters
  vfio/pci: vfio_notifier_cleanup
  vfio/pci: export MSI functions
  vfio-pci: preserve MSI
  vfio-pci: preserve INTx
  migration: close kvm after cpr
  migration: cpr_get_fd_param helper
  vfio: return mr from vfio_get_xlat_addr
  vfio: pass ramblock to vfio_container_dma_map
  backends/iommufd: iommufd_backend_map_file_dma
  backends/iommufd: change process ioctl
  physmem: qemu_ram_get_fd_offset
  vfio/iommufd: use IOMMU_IOAS_MAP_FILE
  vfio/iommufd: export iommufd_cdev_get_info_iova_range
  vfio/iommufd: define hwpt constructors
  vfio/iommufd: invariant device name
  vfio/iommufd: fix cpr register
  vfio/iommufd: register container for cpr
  vfio/iommufd: preserve descriptors
  vfio/iommufd: reconstruct device
  vfio/iommufd: reconstruct hw_caps
  vfio/iommufd: reconstruct hwpt
  vfio/iommufd: change process
  iommufd: preserve DMA mappings
  vfio/container: delete old cpr register

 MAINTAINERS                           |  12 ++
 accel/kvm/kvm-all.c                   |  28 ++++
 backends/iommufd.c                    |  83 ++++++++--
 backends/trace-events                 |   2 +
 hw/pci/msix.c                         |   2 +-
 hw/pci/pci.c                          |  13 ++
 hw/vfio/common.c                      |  91 ++++++++---
 hw/vfio/container-base.c              |  12 +-
 hw/vfio/container.c                   | 216 +++++++++++++++++---------
 hw/vfio/cpr-iommufd.c                 | 172 +++++++++++++++++++++
 hw/vfio/cpr-legacy.c                  | 277 ++++++++++++++++++++++++++++++++++
 hw/vfio/cpr.c                         | 176 +++++++++++++++++++--
 hw/vfio/helpers.c                     |  20 +--
 hw/vfio/iommufd.c                     | 139 +++++++++++++----
 hw/vfio/meson.build                   |   4 +-
 hw/vfio/pci.c                         | 194 ++++++++++++++++++------
 hw/vfio/pci.h                         |  12 ++
 hw/vfio/trace-events                  |   1 +
 hw/virtio/vhost-vdpa.c                |   8 +-
 include/exec/cpu-common.h             |   1 +
 include/exec/memory.h                 |   6 +-
 include/hw/pci/msix.h                 |   1 +
 include/hw/vfio/vfio-common.h         |  20 ++-
 include/hw/vfio/vfio-container-base.h |   6 +-
 include/hw/vfio/vfio-cpr.h            |  69 +++++++++
 include/migration/cpr.h               |  10 ++
 include/migration/vmstate.h           |   6 +-
 include/system/iommufd.h              |   6 +
 include/system/kvm.h                  |   1 +
 migration/cpr-transfer.c              |  18 +++
 migration/cpr.c                       |  92 +++++++++++
 migration/migration.c                 |   1 +
 migration/savevm.c                    |   4 +-
 system/memory.c                       |  19 +--
 system/physmem.c                      |   5 +
 35 files changed, 1490 insertions(+), 237 deletions(-)
 create mode 100644 hw/vfio/cpr-iommufd.c
 create mode 100644 hw/vfio/cpr-legacy.c
 create mode 100644 include/hw/vfio/vfio-cpr.h

base-commit: de278e54aefed143526174335f8286f7437d20be

Comments

Steven Sistare Feb. 14, 2025, 3:56 p.m. UTC | #1
Hi all, it would be nice to get this into qemu 10.0.  Without it, the
basic support for cpr-transfer already in 10.0 is much less interesting.
Soft feature freeze is 2024-03-12.

- Steve

Peter Xu Feb. 14, 2025, 4:06 p.m. UTC | #2
On Fri, Feb 14, 2025 at 10:56:02AM -0500, Steven Sistare wrote:
> Hi all, it would be nice to get this into qemu 10.0.  Without it, the
> basic support for cpr-transfer already in 10.0 is much less interesting.

True..

> Soft feature freeze is 2024-03-12.

Said that, targeting 10.0 for such a huge series across multiple modules,
and especially during the time VFIO review is on heavy load.. may not be
easily achievable.  It might be more practical, IMHO, to target this 10.1.
Review can still happen during / after soft-freeze.

Thanks,
Steven Sistare Feb. 14, 2025, 4:20 p.m. UTC | #3
On 2/14/2025 11:06 AM, Peter Xu wrote:
> On Fri, Feb 14, 2025 at 10:56:02AM -0500, Steven Sistare wrote:
>> Hi all, it would be nice to get this into qemu 10.0.  Without it, the
>> basic support for cpr-transfer already in 10.0 is much less interesting.
> 
> True..
> 
>> Soft feature freeze is 2024-03-12.
> 
> Said that, targeting 10.0 for such a huge series across multiple modules,
> and especially during the time VFIO review is on heavy load.. may not be
> easily achievable.  It might be more practical, IMHO, to target this 10.1.
> Review can still happen during / after soft-freeze.

Understood.  Let me know if I can do anything to help.

BTW, the series is less huge than it looks.  I divided it into small patches
as requested.

- Steve
Cédric Le Goater Feb. 14, 2025, 4:48 p.m. UTC | #4
On 2/14/25 17:20, Steven Sistare wrote:
> On 2/14/2025 11:06 AM, Peter Xu wrote:
>> On Fri, Feb 14, 2025 at 10:56:02AM -0500, Steven Sistare wrote:
>>> Hi all, it would be nice to get this into qemu 10.0.  Without it, the
>>> basic support for cpr-transfer already in 10.0 is much less interesting.
>>
>> True..
>>
>>> Soft feature freeze is 2024-03-12.
>>
>> Said that, targeting 10.0 for such a huge series across multiple modules,
>> and especially during the time VFIO review is on heavy load.. may not be
>> easily achievable.  It might be more practical, IMHO, to target this 10.1.
>> Review can still happen during / after soft-freeze.

yes. It is *very* optimistic and it is also a question of stability and
maintenance. One "big" feature per release is more than enough. "multifd
support for VFIO migration" is the next candidate.

And I am sorry Steve, I still haven't looked at your answers on v1 ...
They are next on my TODO list.

> Understood.  Let me know if I can do anything to help.

Well, what bothers me today is that we have been adding a lot of new features
in the VFIO subsystem these last years (migration, IOMMUFD, etc) and we still
lack decent documentation in QEMU. That would be a great addition. For the
series "multifd support for VFIO migration" too.

> BTW, the series is less huge than it looks.  I divided it into small patches
> as requested.

That's better. Maybe we can merge the cleanup patches that prepare the ground
for the larger series.

Thanks,

C.