[v4,00/11] iommufd: Add vIOMMU infrastructure (Part-1)

Message ID: cover.1729553811.git.nicolinc@nvidia.com

Nicolin Chen Oct. 22, 2024, 12:19 a.m. UTC
This series introduces a new vIOMMU infrastructure and related ioctls.

IOMMUFD has been using the HWPT infrastructure for all cases, including
nested IO page table support. Yet, there are limitations for an HWPT-based
structure to support some advanced HW-accelerated features, such as CMDQV
on NVIDIA Grace and HW-accelerated vIOMMU on AMD. Even in a multi-IOMMU
environment, it is not straightforward for nested HWPTs to share the same
parent HWPT (stage-2 IO pagetable) with the HWPT infrastructure alone: a
parent HWPT typically holds one stage-2 IO pagetable and tags it with only
one ID in the cache entries. When sharing one large stage-2 IO pagetable
across physical IOMMU instances, that one ID may not always be available
across all the IOMMU instances. In other words, it is ideal for SW to have
a different container for the stage-2 IO pagetable, so it can hold another
ID that is available.

For this "different container", add vIOMMU, an additional layer to hold
extra virtualization information:
  _______________________________________________________________________
 |                      iommufd (with vIOMMU)                            |
 |                                                                       |
 |                             [5]                                       |
 |                        _____________                                  |
 |                       |             |                                 |
 |      |----------------|    vIOMMU   |                                 |
 |      |                |             |                                 |
 |      |                |             |                                 |
 |      |      [1]       |             |          [4]             [2]    |
 |      |     ______     |             |     _____________     ________  |
 |      |    |      |    |     [3]     |    |             |   |        | |
 |      |    | IOAS |<---|(HWPT_PAGING)|<---| HWPT_NESTED |<--| DEVICE | |
 |      |    |______|    |_____________|    |_____________|   |________| |
 |      |        |              |                  |               |     |
 |______|________|______________|__________________|_______________|_____|
        |        |              |                  |               |
  ______v_____   |        ______v_____       ______v_____       ___v__
 |   struct   |  |  PFN  |  (paging)  |     |  (nested)  |     |struct|
 |iommu_device|  |------>|iommu_domain|<----|iommu_domain|<----|device|
 |____________|   storage|____________|     |____________|     |______|

The vIOMMU object should be seen as a slice of a physical IOMMU instance
that is passed to or shared with a VM. That slice can include some HW/SW
resources:
 - Security namespace for guest-owned IDs, e.g. guest-controlled cache tags
 - Access to a sharable nesting parent pagetable across physical IOMMUs
 - Virtualization of various platform IDs, e.g. RIDs and others
 - Delivery of paravirtualized invalidation
 - Directly assigned invalidation queues
 - Directly assigned interrupts
 - Non-affiliated event reporting

On a multi-IOMMU system, a vIOMMU object must be instantiated for each of
the physical IOMMUs that are passed to (via devices) a guest VM, while all
of them can hold the same shareable parent HWPT. Each vIOMMU then only
needs to allocate its own individual ID to tag its own cache (a rough code
sketch of this flow follows the diagram):
                     ----------------------------
 ----------------    |         |  paging_hwpt0  |
 | hwpt_nested0 |--->| viommu0 ------------------
 ----------------    |         |      IDx       |
                     ----------------------------
                     ----------------------------
 ----------------    |         |  paging_hwpt0  |
 | hwpt_nested1 |--->| viommu1 ------------------
 ----------------    |         |      IDy       |
                     ----------------------------
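
Below is a rough userspace sketch of that flow, using the new
IOMMU_VIOMMU_ALLOC ioctl (introduced below) together with the existing
IOMMU_HWPT_ALLOC. The field names (type, dev_id, hwpt_id, out_viommu_id,
and pt_id carrying a vIOMMU ID) are assumptions based on this series'
proposed uAPI, so treat it as illustrative only:

  #include <sys/ioctl.h>
  #include <linux/iommufd.h>

  /* Allocate a vIOMMU for one physical IOMMU, on the shared parent HWPT */
  static int alloc_viommu(int iommufd, __u32 dev_id, __u32 paging_hwpt_id,
                          __u32 *out_viommu_id)
  {
          struct iommu_viommu_alloc cmd = {
                  .size = sizeof(cmd),
                  .type = IOMMU_VIOMMU_TYPE_ARM_SMMUV3, /* driver-specific */
                  .dev_id = dev_id,          /* a device behind that IOMMU */
                  .hwpt_id = paging_hwpt_id, /* shared nesting parent HWPT */
          };

          if (ioctl(iommufd, IOMMU_VIOMMU_ALLOC, &cmd))
                  return -1;
          *out_viommu_id = cmd.out_viommu_id;
          return 0;
  }

  /* Allocate a nested HWPT on a vIOMMU: pt_id carries the vIOMMU ID */
  static int alloc_nested_hwpt(int iommufd, __u32 dev_id, __u32 viommu_id,
                               __u32 *out_hwpt_id)
  {
          struct iommu_hwpt_alloc cmd = {
                  .size = sizeof(cmd),
                  .dev_id = dev_id,
                  .pt_id = viommu_id,
                  /* driver-specific data_type/data_len/data_uptr omitted */
          };

          if (ioctl(iommufd, IOMMU_HWPT_ALLOC, &cmd))
                  return -1;
          *out_hwpt_id = cmd.out_hwpt_id;
          return 0;
  }

Calling alloc_viommu() once per physical IOMMU (viommu0, viommu1) against
the same paging_hwpt0, and then alloc_nested_hwpt() once per vIOMMU, would
produce the topology shown in the diagram above.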

As an initial part-1, add an IOMMUFD_CMD_VIOMMU_ALLOC ioctl for allocation
only, and implement it in the arm-smmu-v3 driver as a real-world use case.
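
For reference, the allocation request carried by this new ioctl might look
roughly like the below. The actual layout is defined by this series'
changes to include/uapi/linux/iommufd.h, so this is only an approximation:

  /* Approximation of the proposed uAPI; see the patches for the real one */
  struct iommu_viommu_alloc {
          __u32 size;          /* sizeof(struct iommu_viommu_alloc) */
          __u32 flags;         /* no flags defined yet; must be 0 */
          __u32 type;          /* driver type, e.g. ARM SMMUv3 */
          __u32 dev_id;        /* a device attached to the physical IOMMU */
          __u32 hwpt_id;       /* nesting parent HWPT to share */
          __u32 out_viommu_id; /* returned vIOMMU object ID */
  };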

More vIOMMU-based structs and ioctls will be introduced in follow-up
series to support vDEVICE, vIRQ (vEVENT) and vQUEUE objects. The vIOMMU
object is repurposed from an earlier RFC; just for reference:
https://lore.kernel.org/all/cover.1712978212.git.nicolinc@nvidia.com/

This series is on Github:
https://github.com/nicolinc/iommufd/commits/iommufd_viommu_p1-v4
(a pairing QEMU branch for testing will be provided with the part-2 series)

Changelog
v4
 * Added "Reviewed-by" from Jason
 * Dropped IOMMU_VIOMMU_TYPE_DEFAULT support
 * Dropped iommufd_object_alloc_elm renamings
 * Renamed iommufd's viommu_api.c to driver.c
 * Reworked iommufd_viommu_alloc helper
 * Added a separate iommufd_hwpt_nested_alloc_for_viommu function for
   hwpt_nested allocations on a vIOMMU, and added comparison between
   viommu->iommu_dev->ops and dev_iommu_ops(idev->dev)
 * Replaced s2_parent with vsmmu in arm_smmu_nested_domain
 * Replaced domain_alloc_user in iommu_ops with domain_alloc_nested in
   viommu_ops
 * Replaced wait_queue_head_t with a completion, to delay the unplug of
   mock_iommu_dev
 * Corrected documentation graph that was missing struct iommu_device
 * Added an iommufd_verify_unfinalized_object helper to verify driver-
   allocated vIOMMU/vDEVICE objects
 * Added missing test cases for TEST_LENGTH and fail_nth
v3
 https://lore.kernel.org/all/cover.1728491453.git.nicolinc@nvidia.com/
 * Rebased on top of Jason's nesting v3 series
   https://lore.kernel.org/all/0-v3-e2e16cd7467f+2a6a1-smmuv3_nesting_jgg@nvidia.com/
 * Split the series into smaller parts
 * Added Jason's Reviewed-by
 * Added back viommu->iommu_dev
 * Added support for driver-allocated vIOMMU vs. core-allocated
 * Dropped arm_smmu_cache_invalidate_user
 * Added an iommufd_test_wait_for_users() in selftest
 * Reworked test code to make viommu an individual FIXTURE
 * Added missing TEST_LENGTH case for the new ioctl command
v2
 https://lore.kernel.org/all/cover.1724776335.git.nicolinc@nvidia.com/
 * Limited vdev_id to one per idev
 * Added a rw_sem to protect the vdev_id list
 * Reworked driver-level APIs with proper lockings
 * Added a new viommu_api file for IOMMUFD_DRIVER config
 * Dropped useless iommu_dev point from the viommu structure
 * Added missing index numbers to new types in the uAPI header
 * Dropped IOMMU_VIOMMU_INVALIDATE uAPI; Instead, reuse the HWPT one
 * Reworked mock_viommu_cache_invalidate() using the new iommu helper
 * Reordered details of set/unset_vdev_id handlers for proper lockings
v1
 https://lore.kernel.org/all/cover.1723061377.git.nicolinc@nvidia.com/

Thanks!
Nicolin

Nicolin Chen (11):
  iommufd: Move struct iommufd_object to public iommufd header
  iommufd: Introduce IOMMUFD_OBJ_VIOMMU and its related struct
  iommufd: Add iommufd_verify_unfinalized_object
  iommufd/viommu: Add IOMMU_VIOMMU_ALLOC ioctl
  iommufd: Add domain_alloc_nested op to iommufd_viommu_ops
  iommufd: Allow pt_id to carry viommu_id for IOMMU_HWPT_ALLOC
  iommufd/selftest: Add refcount to mock_iommu_device
  iommufd/selftest: Add IOMMU_VIOMMU_TYPE_SELFTEST
  iommufd/selftest: Add IOMMU_VIOMMU_ALLOC test coverage
  Documentation: userspace-api: iommufd: Update vIOMMU
  iommu/arm-smmu-v3: Add IOMMU_VIOMMU_TYPE_ARM_SMMUV3 support

 drivers/iommu/iommufd/Makefile                |  5 +-
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h   | 26 +++---
 drivers/iommu/iommufd/iommufd_private.h       | 36 ++------
 drivers/iommu/iommufd/iommufd_test.h          |  2 +
 include/linux/iommu.h                         | 14 +++
 include/linux/iommufd.h                       | 89 +++++++++++++++++++
 include/uapi/linux/iommufd.h                  | 56 ++++++++++--
 tools/testing/selftests/iommu/iommufd_utils.h | 28 ++++++
 .../arm/arm-smmu-v3/arm-smmu-v3-iommufd.c     | 79 ++++++++++------
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   |  9 +-
 drivers/iommu/iommufd/driver.c                | 38 ++++++++
 drivers/iommu/iommufd/hw_pagetable.c          | 69 +++++++++++++-
 drivers/iommu/iommufd/main.c                  | 58 ++++++------
 drivers/iommu/iommufd/selftest.c              | 73 +++++++++++++--
 drivers/iommu/iommufd/viommu.c                | 85 ++++++++++++++++++
 tools/testing/selftests/iommu/iommufd.c       | 78 ++++++++++++++++
 .../selftests/iommu/iommufd_fail_nth.c        | 11 +++
 Documentation/userspace-api/iommufd.rst       | 69 +++++++++++++-
 18 files changed, 701 insertions(+), 124 deletions(-)
 create mode 100644 drivers/iommu/iommufd/driver.c
 create mode 100644 drivers/iommu/iommufd/viommu.c

Comments

Tian, Kevin Oct. 25, 2024, 8:34 a.m. UTC | #1
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Tuesday, October 22, 2024 8:19 AM
> 
> This series introduces a new vIOMMU infrastructure and related ioctls.
> 
> IOMMUFD has been using the HWPT infrastructure for all cases, including a
> nested IO page table support. Yet, there're limitations for an HWPT-based
> structure to support some advanced HW-accelerated features, such as CMDQV
> on NVIDIA Grace, and HW-accelerated vIOMMU on AMD. Even for a multi-IOMMU
> environment, it is not straightforward for nested HWPTs to share the same
> parent HWPT (stage-2 IO pagetable), with the HWPT infrastructure alone: a
> parent HWPT typically hold one stage-2 IO pagetable and tag it with only
> one ID in the cache entries. When sharing one large stage-2 IO pagetable
> across physical IOMMU instances, that one ID may not always be available
> across all the IOMMU instances. In other word, it's ideal for SW to have
> a different container for the stage-2 IO pagetable so it can hold another
> ID that's available.

Just holding multiple IDs doesn't require a different container. This is
just a side effect, given that the vIOMMU will be required for the other
reasons listed above.

If we have to put more words here, I'd prefer adding a bit more about
CMDQV, which is more compelling. Not a big deal though.
Jason Gunthorpe Oct. 25, 2024, 3:42 p.m. UTC | #2
On Fri, Oct 25, 2024 at 08:34:05AM +0000, Tian, Kevin wrote:
> > The vIOMMU object should be seen as a slice of a physical IOMMU instance
> > that is passed to or shared with a VM. That can be some HW/SW resources:
> >  - Security namespace for guest owned ID, e.g. guest-controlled cache tags
> >  - Access to a sharable nesting parent pagetable across physical IOMMUs
> >  - Virtualization of various platforms IDs, e.g. RIDs and others
> >  - Delivery of paravirtualized invalidation
> >  - Direct assigned invalidation queues
> >  - Direct assigned interrupts
> >  - Non-affiliated event reporting
> 
> sorry no idea about 'non-affiliated event'. Can you elaborate?

This would be an event that is not connected to a device.

For instance, a CMDQ experienced a problem.

Jason
Nicolin Chen Oct. 25, 2024, 4:28 p.m. UTC | #3
On Fri, Oct 25, 2024 at 08:34:05AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Tuesday, October 22, 2024 8:19 AM
> >
> > This series introduces a new vIOMMU infrastructure and related ioctls.
> >
> > IOMMUFD has been using the HWPT infrastructure for all cases, including a
> > nested IO page table support. Yet, there're limitations for an HWPT-based
> > structure to support some advanced HW-accelerated features, such as CMDQV
> > on NVIDIA Grace, and HW-accelerated vIOMMU on AMD. Even for a multi-IOMMU
> > environment, it is not straightforward for nested HWPTs to share the same
> > parent HWPT (stage-2 IO pagetable), with the HWPT infrastructure alone: a
> > parent HWPT typically hold one stage-2 IO pagetable and tag it with only
> > one ID in the cache entries. When sharing one large stage-2 IO pagetable
> > across physical IOMMU instances, that one ID may not always be available
> > across all the IOMMU instances. In other word, it's ideal for SW to have
> > a different container for the stage-2 IO pagetable so it can hold another
> > ID that's available.
> 
> Just holding multiple IDs doesn't require a different container. This is
> just a side effect when vIOMMU will be required for other said reasons.
> 
> If we have to put more words here I'd prefer to adding a bit more for
> CMDQV which is more compelling. not a big deal though. 
Tian, Kevin Oct. 28, 2024, 2:35 a.m. UTC | #4
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, October 25, 2024 11:43 PM
> 
> On Fri, Oct 25, 2024 at 08:34:05AM +0000, Tian, Kevin wrote:
> > > The vIOMMU object should be seen as a slice of a physical IOMMU instance
> > > that is passed to or shared with a VM. That can be some HW/SW resources:
> > >  - Security namespace for guest owned ID, e.g. guest-controlled cache tags
> > >  - Access to a sharable nesting parent pagetable across physical IOMMUs
> > >  - Virtualization of various platforms IDs, e.g. RIDs and others
> > >  - Delivery of paravirtualized invalidation
> > >  - Direct assigned invalidation queues
> > >  - Direct assigned interrupts
> > >  - Non-affiliated event reporting
> >
> > sorry no idea about 'non-affiliated event'. Can you elaborate?
> 
> This would be an even that is not a connected to a device
> 
> For instance a CMDQ experienced a problem.
> 


Okay, then 'non-device-affiliated' is probably clearer.