Message ID | cover.1724776335.git.nicolinc@nvidia.com (mailing list archive) |
---|---|
Series | iommufd: Add VIOMMU infrastructure (Part-1) |
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Wednesday, August 28, 2024 1:00 AM
>
[...]
> On a multi-IOMMU system, the VIOMMU object can be instanced to the number
> of vIOMMUs in a guest VM, while holding the same parent HWPT to share the

Is there a restriction that multiple vIOMMU objects can only be created on a multi-IOMMU system?

> stage-2 IO pagetable. Each VIOMMU then just need to only allocate its own
> VMID to attach the shared stage-2 IO pagetable to the physical IOMMU:

This reads as if 'VMID' were a virtual ID allocated by the vIOMMU. But from the overall context, it actually means the physical 'VMID' allocated on the associated physical IOMMU, correct?
On Wed, Sep 11, 2024 at 06:12:21AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Wednesday, August 28, 2024 1:00 AM
> >
> [...]
> > On a multi-IOMMU system, the VIOMMU object can be instanced to the number
> > of vIOMMUs in a guest VM, while holding the same parent HWPT to share the
>
> Is there restriction that multiple vIOMMU objects can be only created
> on a multi-IOMMU system?

I think it should generally be restricted to the number of pIOMMUs, although we could likely (not 100% sure) have multiple vIOMMUs on a single-pIOMMU system. Any reason for doing that?

> > stage-2 IO pagetable. Each VIOMMU then just need to only allocate its own
> > VMID to attach the shared stage-2 IO pagetable to the physical IOMMU:
>
> this reads like 'VMID' is a virtual ID allocated by vIOMMU. But from the
> entire context it actually means the physical 'VMID' allocated on the
> associated physical IOMMU, correct?

Quoting Jason's narrative, a VMID is a "Security namespace for guest owned ID". The allocation, using SMMU as an example, should be a part of the vIOMMU instance allocation in the host SMMU driver. Then, this VMID will be used to mark the cache tags. So, it is still a software-allocated ID, while HW would use it too.

Thanks
Nicolin
> From: Nicolin Chen <nicolinc@nvidia.com> > Sent: Wednesday, September 11, 2024 3:08 PM > > On Wed, Sep 11, 2024 at 06:12:21AM +0000, Tian, Kevin wrote: > > > From: Nicolin Chen <nicolinc@nvidia.com> > > > Sent: Wednesday, August 28, 2024 1:00 AM > > > > > [...] > > > On a multi-IOMMU system, the VIOMMU object can be instanced to the > > > number > > > of vIOMMUs in a guest VM, while holding the same parent HWPT to > share > > > the > > > > Is there restriction that multiple vIOMMU objects can be only created > > on a multi-IOMMU system? > > I think it should be generally restricted to the number of pIOMMUs, > although likely (not 100% sure) we could do multiple vIOMMUs on a > single-pIOMMU system. Any reason for doing that? No idea. But if you stated so then there will be code to enforce it e.g. failing the attempt to create a vIOMMU object on a pIOMMU to which another vIOMMU object is already linked? > > > > stage-2 IO pagetable. Each VIOMMU then just need to only allocate its > own > > > VMID to attach the shared stage-2 IO pagetable to the physical IOMMU: > > > > this reads like 'VMID' is a virtual ID allocated by vIOMMU. But from the > > entire context it actually means the physical 'VMID' allocated on the > > associated physical IOMMU, correct? > > Quoting Jason's narratives, a VMID is a "Security namespace for > guest owned ID". The allocation, using SMMU as an example, should the VMID alone is not a namespace. It's one ID to tag another namespace. > be a part of vIOMMU instance allocation in the host SMMU driver. > Then, this VMID will be used to mark the cache tags. So, it is > still a software allocated ID, while HW would use it too. > VMIDs are physical resource belonging to the host SMMU driver. but I got your original point that it's each vIOMMU gets an unique VMID from the host SMMU driver, not exactly that each vIOMMU maintains its own VMID namespace. that'd be a different concept.
On Wed, Sep 11, 2024 at 07:18:10AM +0000, Tian, Kevin wrote: > > From: Nicolin Chen <nicolinc@nvidia.com> > > Sent: Wednesday, September 11, 2024 3:08 PM > > > > On Wed, Sep 11, 2024 at 06:12:21AM +0000, Tian, Kevin wrote: > > > > From: Nicolin Chen <nicolinc@nvidia.com> > > > > Sent: Wednesday, August 28, 2024 1:00 AM > > > > > > > [...] > > > > On a multi-IOMMU system, the VIOMMU object can be instanced to the > > > > number > > > > of vIOMMUs in a guest VM, while holding the same parent HWPT to > > share > > > > the > > > > > > Is there restriction that multiple vIOMMU objects can be only created > > > on a multi-IOMMU system? > > > > I think it should be generally restricted to the number of pIOMMUs, > > although likely (not 100% sure) we could do multiple vIOMMUs on a > > single-pIOMMU system. Any reason for doing that? > > No idea. But if you stated so then there will be code to enforce it e.g. > failing the attempt to create a vIOMMU object on a pIOMMU to which > another vIOMMU object is already linked? Yea, I can do that. > > > > stage-2 IO pagetable. Each VIOMMU then just need to only allocate its > > own > > > > VMID to attach the shared stage-2 IO pagetable to the physical IOMMU: > > > > > > this reads like 'VMID' is a virtual ID allocated by vIOMMU. But from the > > > entire context it actually means the physical 'VMID' allocated on the > > > associated physical IOMMU, correct? > > > > Quoting Jason's narratives, a VMID is a "Security namespace for > > guest owned ID". The allocation, using SMMU as an example, should > > the VMID alone is not a namespace. It's one ID to tag another namespace. > > > be a part of vIOMMU instance allocation in the host SMMU driver. > > Then, this VMID will be used to mark the cache tags. So, it is > > still a software allocated ID, while HW would use it too. > > > > VMIDs are physical resource belonging to the host SMMU driver. Yes. Just the lifecycle of a VMID is controlled by a vIOMMU, i.e. the guest. 
> but I got your original point that it's each vIOMMU gets an unique VMID
> from the host SMMU driver, not exactly that each vIOMMU maintains
> its own VMID namespace. that'd be a different concept.

What's a VMID namespace actually? Please educate me :)

Thanks
Nicolin
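The enforcement Kevin asks about earlier in this exchange (failing to create a vIOMMU on a pIOMMU that already has one linked) could look roughly like the minimal C sketch below. It is illustrative only: `struct piommu`, `struct viommu`, `viommu_link()`, `viommu_unlink()` and the `-EBUSY` return code are hypothetical names chosen here, not iommufd code, and the thread does not show how the series ends up implementing the check.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct viommu;

/* Hypothetical physical IOMMU handle; not an iommufd structure. */
struct piommu {
	int id;
	struct viommu *linked; /* NULL while no vIOMMU is linked */
};

/* Hypothetical vIOMMU object, reduced to its link to one pIOMMU. */
struct viommu {
	struct piommu *piommu;
};

/* Link v to p, refusing if another vIOMMU already owns p. */
static int viommu_link(struct viommu *v, struct piommu *p)
{
	if (p->linked)
		return -EBUSY;
	p->linked = v;
	v->piommu = p;
	return 0;
}

/* Drop the link so that a new vIOMMU may be created on the pIOMMU. */
static void viommu_unlink(struct viommu *v)
{
	if (v->piommu) {
		v->piommu->linked = NULL;
		v->piommu = NULL;
	}
}
```

With this shape, destroying the first vIOMMU (here, `viommu_unlink()`) is what frees the pIOMMU for a new one.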
> From: Nicolin Chen <nicolinc@nvidia.com> > Sent: Wednesday, September 11, 2024 3:41 PM > > On Wed, Sep 11, 2024 at 07:18:10AM +0000, Tian, Kevin wrote: > > > From: Nicolin Chen <nicolinc@nvidia.com> > > > Sent: Wednesday, September 11, 2024 3:08 PM > > > > > > On Wed, Sep 11, 2024 at 06:12:21AM +0000, Tian, Kevin wrote: > > > > > From: Nicolin Chen <nicolinc@nvidia.com> > > > > > Sent: Wednesday, August 28, 2024 1:00 AM > > > > > > > > > > stage-2 IO pagetable. Each VIOMMU then just need to only allocate its > > > own > > > > > VMID to attach the shared stage-2 IO pagetable to the physical IOMMU: > > > > > > > > this reads like 'VMID' is a virtual ID allocated by vIOMMU. But from the > > > > entire context it actually means the physical 'VMID' allocated on the > > > > associated physical IOMMU, correct? > > > > > > Quoting Jason's narratives, a VMID is a "Security namespace for > > > guest owned ID". The allocation, using SMMU as an example, should > > > > the VMID alone is not a namespace. It's one ID to tag another namespace. > > > > > be a part of vIOMMU instance allocation in the host SMMU driver. > > > Then, this VMID will be used to mark the cache tags. So, it is > > > still a software allocated ID, while HW would use it too. > > > > > > > VMIDs are physical resource belonging to the host SMMU driver. > > Yes. Just the lifecycle of a VMID is controlled by a vIOMMU, i.e. > the guest. > > > but I got your original point that it's each vIOMMU gets an unique VMID > > from the host SMMU driver, not exactly that each vIOMMU maintains > > its own VMID namespace. that'd be a different concept. > > What's a VMID namespace actually? Please educate me :) > I meant the 16bit VMID pool under each SMMU.
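To make the "16bit VMID pool under each SMMU" concrete, here is a minimal C sketch, assuming a plain bitmap per SMMU instance. Each vIOMMU created on an SMMU takes one VMID from that SMMU's pool and returns it when destroyed, so the VMID is a physical resource of the host driver while its lifetime follows the vIOMMU. The names `vmid_pool`, `vmid_alloc()` and `vmid_free()`, and the choice to reserve VMID 0, are assumptions of this sketch; a real kernel driver would use its own ID allocator.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define VMID_BITS 16
#define NUM_VMIDS (1u << VMID_BITS) /* 65536 IDs in a 16-bit pool */

/* Hypothetical per-SMMU pool: one bit per VMID, set = in use. */
struct vmid_pool {
	uint8_t used[NUM_VMIDS / 8];
};

static void vmid_pool_init(struct vmid_pool *pool)
{
	memset(pool->used, 0, sizeof(pool->used));
	pool->used[0] |= 1; /* assumption of this sketch: VMID 0 is reserved */
}

/* Called when a vIOMMU is created on this SMMU; -1 if the pool is full. */
static int vmid_alloc(struct vmid_pool *pool)
{
	for (uint32_t id = 0; id < NUM_VMIDS; id++) {
		if (!(pool->used[id / 8] & (1u << (id % 8)))) {
			pool->used[id / 8] |= (uint8_t)(1u << (id % 8));
			return (int)id;
		}
	}
	return -1;
}

/* Called when the vIOMMU is destroyed, returning the VMID to its SMMU. */
static void vmid_free(struct vmid_pool *pool, int id)
{
	pool->used[id / 8] &= (uint8_t)~(1u << (id % 8));
}
```

Because the pools are independent, the same numeric VMID can be handed out on two SMMUs at once, which is why a VMID is only meaningful together with the SMMU it came from.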
On Wed, Sep 11, 2024 at 08:08:04AM +0000, Tian, Kevin wrote: > > From: Nicolin Chen <nicolinc@nvidia.com> > > Sent: Wednesday, September 11, 2024 3:41 PM > > > > On Wed, Sep 11, 2024 at 07:18:10AM +0000, Tian, Kevin wrote: > > > > From: Nicolin Chen <nicolinc@nvidia.com> > > > > Sent: Wednesday, September 11, 2024 3:08 PM > > > > > > > > On Wed, Sep 11, 2024 at 06:12:21AM +0000, Tian, Kevin wrote: > > > > > > From: Nicolin Chen <nicolinc@nvidia.com> > > > > > > Sent: Wednesday, August 28, 2024 1:00 AM > > > > > > > > > > > > stage-2 IO pagetable. Each VIOMMU then just need to only allocate its > > > > own > > > > > > VMID to attach the shared stage-2 IO pagetable to the physical IOMMU: > > > > > > > > > > this reads like 'VMID' is a virtual ID allocated by vIOMMU. But from the > > > > > entire context it actually means the physical 'VMID' allocated on the > > > > > associated physical IOMMU, correct? > > > > > > > > Quoting Jason's narratives, a VMID is a "Security namespace for > > > > guest owned ID". The allocation, using SMMU as an example, should > > > > > > the VMID alone is not a namespace. It's one ID to tag another namespace. > > > > > > > be a part of vIOMMU instance allocation in the host SMMU driver. > > > > Then, this VMID will be used to mark the cache tags. So, it is > > > > still a software allocated ID, while HW would use it too. > > > > > > > > > > VMIDs are physical resource belonging to the host SMMU driver. > > > > Yes. Just the lifecycle of a VMID is controlled by a vIOMMU, i.e. > > the guest. > > > > > but I got your original point that it's each vIOMMU gets an unique VMID > > > from the host SMMU driver, not exactly that each vIOMMU maintains > > > its own VMID namespace. that'd be a different concept. > > > > What's a VMID namespace actually? Please educate me :) > > > > I meant the 16bit VMID pool under each SMMU. I see. Makes sense now. Thanks Nicolin
Hi Nic,

On 2024/8/28 00:59, Nicolin Chen wrote:
> This series introduces a new VIOMMU infrastructure and related ioctls.
>
> IOMMUFD has been using the HWPT infrastructure for all cases, including a
> nested IO page table support. Yet, there're limitations for an HWPT-based
> structure to support some advanced HW-accelerated features, such as CMDQV
> on NVIDIA Grace, and HW-accelerated vIOMMU on AMD. Even for a multi-IOMMU
> environment, it is not straightforward for nested HWPTs to share the same
> parent HWPT (stage-2 IO pagetable), with the HWPT infrastructure alone.

Could you elaborate a bit on the last sentence in the above paragraph?

>
> The new VIOMMU object is an additional layer, between the nested HWPT and
> its parent HWPT, to give to both the IOMMUFD core and an IOMMU driver an
> additional structure to support HW-accelerated feature:
>                      ----------------------------
>  ----------------    |         |  paging_hwpt0  |
>  | hwpt_nested0 |--->| viommu0 ------------------
>  ----------------    |         | HW-accel feats |
>                      ----------------------------
>
> On a multi-IOMMU system, the VIOMMU object can be instanced to the number
> of vIOMMUs in a guest VM, while holding the same parent HWPT to share the
> stage-2 IO pagetable. Each VIOMMU then just need to only allocate its own
> VMID to attach the shared stage-2 IO pagetable to the physical IOMMU:
>                      ----------------------------
>  ----------------    |         |  paging_hwpt0  |
>  | hwpt_nested0 |--->| viommu0 ------------------
>  ----------------    |         |     VMID0      |
>                      ----------------------------
>                      ----------------------------
>  ----------------    |         |  paging_hwpt0  |
>  | hwpt_nested1 |--->| viommu1 ------------------
>  ----------------    |         |     VMID1      |
>                      ----------------------------
>
> As an initial part-1, add ioctls to support a VIOMMU-based invalidation:
>     IOMMUFD_CMD_VIOMMU_ALLOC to allocate a VIOMMU object
>     IOMMUFD_CMD_VIOMMU_SET/UNSET_VDEV_ID to set/clear device's virtual ID
>     (Reuse IOMMUFD_CMD_HWPT_INVALIDATE for a VIOMMU object to flush cache
>      by a given driver data)
>
> Worth noting that the VDEV_ID is for a per-VIOMMU device list for drivers
> to look up the device's physical instance from its virtual ID in a VM. It
> is essential for a VIOMMU-based invalidation where the request contains a
> device's virtual ID for its device cache flush, e.g. ATC invalidation.
>
> As for the implementation of the series, add an IOMMU_VIOMMU_TYPE_DEFAULT
> type for a core-allocated-core-managed VIOMMU object, allowing drivers to
> simply hook a default viommu ops for viommu-based invalidation alone. And
> provide some viommu helpers to drivers for VDEV_ID translation and parent
> domain lookup. Add VIOMMU invalidation support to ARM SMMUv3 driver for a
> real world use case. This adds supports of arm-smmu-v3's CMDQ_OP_ATC_INV
> and CMDQ_OP_CFGI_CD/ALL commands, supplementing HWPT-based invalidations.
>
> In the future, drivers will also be able to choose a driver-managed type
> to hold its own structure by adding a new type to enum iommu_viommu_type.
> More VIOMMU-based structures and ioctls will be introduced in part-2/3 to
> support a driver-managed VIOMMU, e.g. VQUEUE object for a HW accelerated
> queue, VIRQ (or VEVENT) object for IRQ injections. Although we repurposed
> the VIOMMU object from an earlier RFC discussion, for a reference:
> https://lore.kernel.org/all/cover.1712978212.git.nicolinc@nvidia.com/
>
> This series is on Github:
> https://github.com/nicolinc/iommufd/commits/iommufd_viommu_p1-v2
> Pairing QEMU branch for testing:
> https://github.com/nicolinc/qemu/commits/wip/for_iommufd_viommu_p1-v2
>
> Changelog
> v2
>  * Limited vdev_id to one per idev
>  * Added a rw_sem to protect the vdev_id list
>  * Reworked driver-level APIs with proper lockings
>  * Added a new viommu_api file for IOMMUFD_DRIVER config
>  * Dropped useless iommu_dev pointer from the viommu structure
>  * Added missing index numbers to new types in the uAPI header
>  * Dropped IOMMU_VIOMMU_INVALIDATE uAPI; Instead, reuse the HWPT one
>  * Reworked mock_viommu_cache_invalidate() using the new iommu helper
>  * Reordered details of set/unset_vdev_id handlers for proper lockings
>  * Added arm_smmu_cache_invalidate_user patch from Jason's nesting series
> v1
>  https://lore.kernel.org/all/cover.1723061377.git.nicolinc@nvidia.com/
>
> Thanks!
> Nicolin
>
> Jason Gunthorpe (3):
>   iommu: Add iommu_copy_struct_from_full_user_array helper
>   iommu/arm-smmu-v3: Allow ATS for IOMMU_DOMAIN_NESTED
>   iommu/arm-smmu-v3: Update comments about ATS and bypass
>
> Nicolin Chen (16):
>   iommufd: Reorder struct forward declarations
>   iommufd/viommu: Add IOMMUFD_OBJ_VIOMMU and IOMMU_VIOMMU_ALLOC ioctl
>   iommu: Pass in a viommu pointer to domain_alloc_user op
>   iommufd: Allow pt_id to carry viommu_id for IOMMU_HWPT_ALLOC
>   iommufd/selftest: Add IOMMU_VIOMMU_ALLOC test coverage
>   iommufd/viommu: Add IOMMU_VIOMMU_SET/UNSET_VDEV_ID ioctl
>   iommufd/selftest: Add IOMMU_VIOMMU_SET/UNSET_VDEV_ID test coverage
>   iommufd/viommu: Add cache_invalidate for IOMMU_VIOMMU_TYPE_DEFAULT
>   iommufd: Allow hwpt_id to carry viommu_id for IOMMU_HWPT_INVALIDATE
>   iommufd/viommu: Add vdev_id helpers for IOMMU drivers
>   iommufd/selftest: Add mock_viommu_invalidate_user op
>   iommufd/selftest: Add IOMMU_TEST_OP_DEV_CHECK_CACHE test command
>   iommufd/selftest: Add VIOMMU coverage for IOMMU_HWPT_INVALIDATE ioctl
>   iommufd/viommu: Add iommufd_viommu_to_parent_domain helper
>   iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user
>   iommu/arm-smmu-v3: Add arm_smmu_viommu_cache_invalidate
>
>  drivers/iommu/amd/iommu.c | 1 +
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 218 ++++++++++++++-
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 3 +
>  drivers/iommu/intel/iommu.c | 1 +
>  drivers/iommu/iommufd/Makefile | 5 +-
>  drivers/iommu/iommufd/device.c | 12 +
>  drivers/iommu/iommufd/hw_pagetable.c | 59 +++-
>  drivers/iommu/iommufd/iommufd_private.h | 37 +++
>  drivers/iommu/iommufd/iommufd_test.h | 30 ++
>  drivers/iommu/iommufd/main.c | 12 +
>  drivers/iommu/iommufd/selftest.c | 101 ++++++-
>  drivers/iommu/iommufd/viommu.c | 196 +++++++++++++
>  drivers/iommu/iommufd/viommu_api.c | 53 ++++
>  include/linux/iommu.h | 56 +++-
>  include/linux/iommufd.h | 51 +++-
>  include/uapi/linux/iommufd.h | 117 +++++++-
>  tools/testing/selftests/iommu/iommufd.c | 259 +++++++++++++++++-
>  tools/testing/selftests/iommu/iommufd_utils.h | 126 +++++++++
>  18 files changed, 1299 insertions(+), 38 deletions(-)
>  create mode 100644 drivers/iommu/iommufd/viommu.c
>  create mode 100644 drivers/iommu/iommufd/viommu_api.c
>
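The VDEV_ID translation described in the cover letter can be sketched as a small per-vIOMMU table that maps a device's virtual ID (as the guest sees it) to the physical device, which is what a driver needs when a guest invalidation request, such as an ATC invalidation, names a device by its virtual ID. This is a hypothetical model: `struct pdev`, `struct viommu_vdevs`, `set_vdev_id()`, `unset_vdev_id()` and `find_dev_by_vdev_id()` are illustrative names, not the helpers added by the series. Following the v2 changelog, each device may hold only one vdev_id; rejecting a vdev_id already used by another device is an extra assumption here, and the rw_sem the series uses to protect its list is omitted from this single-threaded sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_VDEVS 8

/* Hypothetical physical device handle (stands in for an idev). */
struct pdev {
	uint32_t phys_sid; /* e.g. a physical stream ID */
	int has_vdev_id;   /* at most one vdev_id per device */
};

struct vdev_entry {
	uint64_t vdev_id;  /* the device's virtual ID inside the VM */
	struct pdev *pdev;
};

/* Hypothetical per-vIOMMU device list (locking omitted). */
struct viommu_vdevs {
	struct vdev_entry entries[MAX_VDEVS];
	size_t count;
};

static int set_vdev_id(struct viommu_vdevs *v, struct pdev *d, uint64_t vdev_id)
{
	if (d->has_vdev_id)
		return -EEXIST;
	for (size_t i = 0; i < v->count; i++)
		if (v->entries[i].vdev_id == vdev_id)
			return -EEXIST;
	if (v->count == MAX_VDEVS)
		return -ENOSPC;
	v->entries[v->count++] = (struct vdev_entry){ vdev_id, d };
	d->has_vdev_id = 1;
	return 0;
}

static int unset_vdev_id(struct viommu_vdevs *v, struct pdev *d)
{
	for (size_t i = 0; i < v->count; i++) {
		if (v->entries[i].pdev == d) {
			v->count--;
			v->entries[i] = v->entries[v->count]; /* swap-remove */
			d->has_vdev_id = 0;
			return 0;
		}
	}
	return -ENOENT;
}

/* Translate the virtual ID carried by a guest invalidation request. */
static struct pdev *find_dev_by_vdev_id(struct viommu_vdevs *v, uint64_t vdev_id)
{
	for (size_t i = 0; i < v->count; i++)
		if (v->entries[i].vdev_id == vdev_id)
			return v->entries[i].pdev;
	return NULL;
}
```

A lookup miss (`NULL`) would be the natural point for a driver to reject a guest command that names a device never registered on that vIOMMU.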
On Wed, Sep 25, 2024 at 06:30:20PM +0800, Yi Liu wrote:
> Hi Nic,
>
> On 2024/8/28 00:59, Nicolin Chen wrote:
> > This series introduces a new VIOMMU infrastructure and related ioctls.
> >
> > IOMMUFD has been using the HWPT infrastructure for all cases, including a
> > nested IO page table support. Yet, there're limitations for an HWPT-based
> > structure to support some advanced HW-accelerated features, such as CMDQV
> > on NVIDIA Grace, and HW-accelerated vIOMMU on AMD. Even for a multi-IOMMU
> > environment, it is not straightforward for nested HWPTs to share the same
> > parent HWPT (stage-2 IO pagetable), with the HWPT infrastructure alone.
>
> could you elaborate a bit for the last sentence in the above paragraph?

A stage-2 HWPT/domain on ARM holds a VMID. If we share the parent domain across IOMMU instances, we'd have to make sure that the VMID is available on all IOMMU instances. That is where the limitation and the potential resource starvation come from, so it is not ideal.

Baolu told me that Intel may have the same issue: different domain IDs on different IOMMUs, and multiple IOMMU instances on one chip:
https://lore.kernel.org/linux-iommu/cf4fe15c-8bcb-4132-a1fd-b2c8ddf2731b@linux.intel.com/
So, I think we are in the same situation here.

Adding another vIOMMU wrapper, on the other hand, allows us to allocate different VMIDs/DIDs for different IOMMUs.

Thanks
Nic
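The resource-starvation argument can be illustrated with a toy model, assuming two IOMMU instances with tiny 8-entry ID pools. If one parent S2 domain had to carry a single VMID that is valid on every instance, that ID must be free in all pools at once; with one vIOMMU per instance, each instance only needs some free ID of its own. The pool size and the functions `alloc_shared_id()` and `alloc_local_id()` are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_INSTANCES 2
#define NUM_IDS 8 /* tiny pools so that exhaustion is easy to show */

/* Hypothetical per-instance ID pools: true = in use. */
static bool in_use[NUM_INSTANCES][NUM_IDS];

/* One ID shared by a parent S2 on every instance: it must be free in
 * all pools at once. Returns -1 if no common free ID exists. */
static int alloc_shared_id(void)
{
	for (int id = 0; id < NUM_IDS; id++) {
		bool free_everywhere = true;
		for (int i = 0; i < NUM_INSTANCES; i++)
			if (in_use[i][id])
				free_everywhere = false;
		if (free_everywhere) {
			for (int i = 0; i < NUM_INSTANCES; i++)
				in_use[i][id] = true;
			return id;
		}
	}
	return -1;
}

/* Per-vIOMMU allocation: only the instance's own pool matters. */
static int alloc_local_id(int instance)
{
	for (int id = 0; id < NUM_IDS; id++) {
		if (!in_use[instance][id]) {
			in_use[instance][id] = true;
			return id;
		}
	}
	return -1;
}
```

If instance 0 already uses the even IDs and instance 1 the odd ones, no common ID is left for a shared S2, yet each instance still has free IDs for its own vIOMMU.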
On 2024/9/26 02:55, Nicolin Chen wrote:
> On Wed, Sep 25, 2024 at 06:30:20PM +0800, Yi Liu wrote:
>> Hi Nic,
>>
>> On 2024/8/28 00:59, Nicolin Chen wrote:
>>> This series introduces a new VIOMMU infrastructure and related ioctls.
>>>
>>> IOMMUFD has been using the HWPT infrastructure for all cases, including a
>>> nested IO page table support. Yet, there're limitations for an HWPT-based
>>> structure to support some advanced HW-accelerated features, such as CMDQV
>>> on NVIDIA Grace, and HW-accelerated vIOMMU on AMD. Even for a multi-IOMMU
>>> environment, it is not straightforward for nested HWPTs to share the same
>>> parent HWPT (stage-2 IO pagetable), with the HWPT infrastructure alone.
>>
>> could you elaborate a bit for the last sentence in the above paragraph?
>
> Stage-2 HWPT/domain on ARM holds a VMID. If we share the parent
> domain across IOMMU instances, we'd have to make sure that VMID
> is available on all IOMMU instances. There comes the limitation
> and potential resource starving, so not ideal.

Got it.

> Baolu told me that Intel may have the same: different domain IDs
> on different IOMMUs; multiple IOMMU instances on one chip:
> https://lore.kernel.org/linux-iommu/cf4fe15c-8bcb-4132-a1fd-b2c8ddf2731b@linux.intel.com/
> So, I think we are having the same situation here.

Yes, it's called an IOMMU unit or DMAR. A typical Intel server can have multiple IOMMU units. But as Baolu mentioned in that thread, the Intel IOMMU driver maintains a separate domain ID space for each IOMMU unit, which means a given IOMMU domain has different DIDs when associated with different IOMMU units. So the Intel side is not suffering from this so far.

> Adding another vIOMMU wrapper on the other hand can allow us to
> allocate different VMIDs/DIDs for different IOMMUs.

That looks like generalizing the association between the IOMMU domain and the IOMMU units?
On Thu, Sep 26, 2024 at 04:47:02PM +0800, Yi Liu wrote:
> On 2024/9/26 02:55, Nicolin Chen wrote:
> > On Wed, Sep 25, 2024 at 06:30:20PM +0800, Yi Liu wrote:
> > > Hi Nic,
> > >
> > > On 2024/8/28 00:59, Nicolin Chen wrote:
> > > > This series introduces a new VIOMMU infrastructure and related ioctls.
> > > >
> > > > IOMMUFD has been using the HWPT infrastructure for all cases, including a
> > > > nested IO page table support. Yet, there're limitations for an HWPT-based
> > > > structure to support some advanced HW-accelerated features, such as CMDQV
> > > > on NVIDIA Grace, and HW-accelerated vIOMMU on AMD. Even for a multi-IOMMU
> > > > environment, it is not straightforward for nested HWPTs to share the same
> > > > parent HWPT (stage-2 IO pagetable), with the HWPT infrastructure alone.
> > >
> > > could you elaborate a bit for the last sentence in the above paragraph?
> >
> > Stage-2 HWPT/domain on ARM holds a VMID. If we share the parent
> > domain across IOMMU instances, we'd have to make sure that VMID
> > is available on all IOMMU instances. There comes the limitation
> > and potential resource starving, so not ideal.
>
> got it.
>
> > Baolu told me that Intel may have the same: different domain IDs
> > on different IOMMUs; multiple IOMMU instances on one chip:
> > https://lore.kernel.org/linux-iommu/cf4fe15c-8bcb-4132-a1fd-b2c8ddf2731b@linux.intel.com/
> > So, I think we are having the same situation here.
>
> yes, it's called iommu unit or dmar. A typical Intel server can have
> multiple iommu units. But like Baolu mentioned in that thread, the intel
> iommu driver maintains separate domain ID spaces for iommu units, which
> means a given iommu domain has different DIDs when associated with
> different iommu units. So intel side is not suffering from this so far.

An ARM SMMU has its own VMID pool as well. The suffering comes from associating VMIDs with one shared parent S2 domain.

Is a DID allocated per S1 nested domain or per parent S2? If it is per S2, I think the same suffering applies when we share the S2 across IOMMU instances?

> > Adding another vIOMMU wrapper on the other hand can allow us to
> > allocate different VMIDs/DIDs for different IOMMUs.
>
> that looks like to generalize the association of the iommu domain and the
> iommu units?

A vIOMMU is a presentation/object of a physical IOMMU instance in a VM. This presentation gives a VMM the capability to take advantage of some of the HW resources of the physical IOMMU:
- a VMID is a small HW resource to tag the cache;
- a vIOMMU invalidation allows access to the device cache, which is not straightforwardly done via an S1 HWPT invalidation;
- a virtual device presentation of a physical device in a VM, related to the vIOMMU in the VM, which contains some VM-level info: virtual device ID, security level (ARM CCA), etc.;
- non-PRI IRQ forwarding to the guest VM;
- HW-accelerated virtualization resources: vCMDQ, AMD VIOMMU.

Thanks
Nicolin
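One way to picture the vIOMMU described above is as a per-pIOMMU object that sits between a nested (stage-1) HWPT and the shared, kernel-owned stage-2. The C sketch below is a hypothetical layout, not the iommufd definition: the field names are invented, the list items this example does not exercise are left as opaque pointers, and `nested_to_parent()` only plays a role similar to the series' `iommufd_viommu_to_parent_domain` helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical kernel-owned parent (stage-2) HWPT. */
struct s2_hwpt {
	int users; /* number of vIOMMUs sharing this stage-2 */
};

/* Hypothetical vIOMMU: one per physical IOMMU instance used by the VM. */
struct viommu {
	struct s2_hwpt *parent; /* shared stage-2 IO page table */
	uint16_t vmid;          /* cache tag allocated from this pIOMMU */
	void *vdevs;            /* virtual device presentations (not modeled) */
	void *virq;             /* non-PRI IRQ forwarding (not modeled) */
	void *hw_accel;         /* e.g. vCMDQ resources (not modeled) */
};

/* Hypothetical nested (stage-1) HWPT, allocated on top of a vIOMMU. */
struct nested_hwpt {
	struct viommu *viommu;
};

static void viommu_init(struct viommu *v, struct s2_hwpt *parent, uint16_t vmid)
{
	*v = (struct viommu){ .parent = parent, .vmid = vmid };
	parent->users++;
}

/* Find the stage-2 behind a nested HWPT by going through its vIOMMU. */
static struct s2_hwpt *nested_to_parent(const struct nested_hwpt *n)
{
	return n->viommu->parent;
}
```

In this model, two vIOMMUs created on different pIOMMUs can share one stage-2 and even hold the same numeric VMID, since each value comes from a different per-SMMU pool.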
On 9/27/24 4:03 AM, Nicolin Chen wrote: > On Thu, Sep 26, 2024 at 04:47:02PM +0800, Yi Liu wrote: >> On 2024/9/26 02:55, Nicolin Chen wrote: >>> On Wed, Sep 25, 2024 at 06:30:20PM +0800, Yi Liu wrote: >>>> Hi Nic, >>>> >>>> On 2024/8/28 00:59, Nicolin Chen wrote: >>>>> This series introduces a new VIOMMU infrastructure and related ioctls. >>>>> >>>>> IOMMUFD has been using the HWPT infrastructure for all cases, including a >>>>> nested IO page table support. Yet, there're limitations for an HWPT-based >>>>> structure to support some advanced HW-accelerated features, such as CMDQV >>>>> on NVIDIA Grace, and HW-accelerated vIOMMU on AMD. Even for a multi-IOMMU >>>>> environment, it is not straightforward for nested HWPTs to share the same >>>>> parent HWPT (stage-2 IO pagetable), with the HWPT infrastructure alone. >>>> could you elaborate a bit for the last sentence in the above paragraph? >>> Stage-2 HWPT/domain on ARM holds a VMID. If we share the parent >>> domain across IOMMU instances, we'd have to make sure that VMID >>> is available on all IOMMU instances. There comes the limitation >>> and potential resource starving, so not ideal. >> got it. >> >>> Baolu told me that Intel may have the same: different domain IDs >>> on different IOMMUs; multiple IOMMU instances on one chip: >>> https://lore.kernel.org/linux-iommu/cf4fe15c-8bcb-4132-a1fd-b2c8ddf2731b@linux.intel.com/ >>> So, I think we are having the same situation here. >> yes, it's called iommu unit or dmar. A typical Intel server can have >> multiple iommu units. But like Baolu mentioned in that thread, the intel >> iommu driver maintains separate domain ID spaces for iommu units, which >> means a given iommu domain has different DIDs when associated with >> different iommu units. So intel side is not suffering from this so far. > An ARM SMMU has its own VMID pool as well. The suffering comes > from associating VMIDs to one shared parent S2 domain. > > Does a DID per S1 nested domain or parent S2? 
> If it is per S2,
> I think the same suffering applies when we share the S2 across
> IOMMU instances?

It's per S1 nested domain in the current VT-d design. It's simple, but it lacks sharing of DIDs within a VM. We will probably change this later.

Thanks,
baolu
On 2024/9/27 04:03, Nicolin Chen wrote: > On Thu, Sep 26, 2024 at 04:47:02PM +0800, Yi Liu wrote: >> On 2024/9/26 02:55, Nicolin Chen wrote: >>> On Wed, Sep 25, 2024 at 06:30:20PM +0800, Yi Liu wrote: >>>> Hi Nic, >>>> >>>> On 2024/8/28 00:59, Nicolin Chen wrote: >>>>> This series introduces a new VIOMMU infrastructure and related ioctls. >>>>> >>>>> IOMMUFD has been using the HWPT infrastructure for all cases, including a >>>>> nested IO page table support. Yet, there're limitations for an HWPT-based >>>>> structure to support some advanced HW-accelerated features, such as CMDQV >>>>> on NVIDIA Grace, and HW-accelerated vIOMMU on AMD. Even for a multi-IOMMU >>>>> environment, it is not straightforward for nested HWPTs to share the same >>>>> parent HWPT (stage-2 IO pagetable), with the HWPT infrastructure alone. >>>> >>>> could you elaborate a bit for the last sentence in the above paragraph? >>> >>> Stage-2 HWPT/domain on ARM holds a VMID. If we share the parent >>> domain across IOMMU instances, we'd have to make sure that VMID >>> is available on all IOMMU instances. There comes the limitation >>> and potential resource starving, so not ideal. >> >> got it. >> >>> Baolu told me that Intel may have the same: different domain IDs >>> on different IOMMUs; multiple IOMMU instances on one chip: >>> https://lore.kernel.org/linux-iommu/cf4fe15c-8bcb-4132-a1fd-b2c8ddf2731b@linux.intel.com/ >>> So, I think we are having the same situation here. >> >> yes, it's called iommu unit or dmar. A typical Intel server can have >> multiple iommu units. But like Baolu mentioned in that thread, the intel >> iommu driver maintains separate domain ID spaces for iommu units, which >> means a given iommu domain has different DIDs when associated with >> different iommu units. So intel side is not suffering from this so far. > > An ARM SMMU has its own VMID pool as well. The suffering comes > from associating VMIDs to one shared parent S2 domain. 
Is this because the VMID is tied to an S2 domain?

> Does a DID per S1 nested domain or parent S2? If it is per S2,
> I think the same suffering applies when we share the S2 across
> IOMMU instances?

Per S1, I think. The IOTLB efficiency is low, as S2 caches would be tagged with different DIDs even though the page table is the same. :)

>>> Adding another vIOMMU wrapper on the other hand can allow us to
>>> allocate different VMIDs/DIDs for different IOMMUs.
>>
>> that looks like to generalize the association of the iommu domain and the
>> iommu units?
>
> A vIOMMU is a presentation/object of a physical IOMMU instance
> in a VM.

A slice of a physical IOMMU, is it? And you treat the S2 HWPT as a resource of the physical IOMMU as well.

> This presentation gives a VMM some capability to take
> advantage of some of HW resource of the physical IOMMU:
> - a VMID is a small HW resource to tag the cache;
> - a vIOMMU invalidation allows to access device cache that's
>   not straightforwardly done via an S1 HWPT invalidation;
> - a virtual device presentation of a physical device in a VM,
>   related to the vIOMMU in the VM, which contains some VM-level
>   info: virtual device ID, security level (ARM CCA), and etc;
> - Non-PRI IRQ forwarding to the guest VM;
> - HW-accelerated virtualization resource: vCMDQ, AMD VIOMMU;

It might be helpful to draw a diagram to show what the vIOMMU object contains. :)
On 2024/9/27 10:05, Baolu Lu wrote: > On 9/27/24 4:03 AM, Nicolin Chen wrote: >> On Thu, Sep 26, 2024 at 04:47:02PM +0800, Yi Liu wrote: >>> On 2024/9/26 02:55, Nicolin Chen wrote: >>>> On Wed, Sep 25, 2024 at 06:30:20PM +0800, Yi Liu wrote: >>>>> Hi Nic, >>>>> >>>>> On 2024/8/28 00:59, Nicolin Chen wrote: >>>>>> This series introduces a new VIOMMU infrastructure and related ioctls. >>>>>> >>>>>> IOMMUFD has been using the HWPT infrastructure for all cases, >>>>>> including a >>>>>> nested IO page table support. Yet, there're limitations for an >>>>>> HWPT-based >>>>>> structure to support some advanced HW-accelerated features, such as >>>>>> CMDQV >>>>>> on NVIDIA Grace, and HW-accelerated vIOMMU on AMD. Even for a >>>>>> multi-IOMMU >>>>>> environment, it is not straightforward for nested HWPTs to share the >>>>>> same >>>>>> parent HWPT (stage-2 IO pagetable), with the HWPT infrastructure alone. >>>>> could you elaborate a bit for the last sentence in the above paragraph? >>>> Stage-2 HWPT/domain on ARM holds a VMID. If we share the parent >>>> domain across IOMMU instances, we'd have to make sure that VMID >>>> is available on all IOMMU instances. There comes the limitation >>>> and potential resource starving, so not ideal. >>> got it. >>> >>>> Baolu told me that Intel may have the same: different domain IDs >>>> on different IOMMUs; multiple IOMMU instances on one chip: >>>> https://lore.kernel.org/linux-iommu/cf4fe15c-8bcb-4132-a1fd-b2c8ddf2731b@linux.intel.com/ >>>> So, I think we are having the same situation here. >>> yes, it's called iommu unit or dmar. A typical Intel server can have >>> multiple iommu units. But like Baolu mentioned in that thread, the intel >>> iommu driver maintains separate domain ID spaces for iommu units, which >>> means a given iommu domain has different DIDs when associated with >>> different iommu units. So intel side is not suffering from this so far. >> An ARM SMMU has its own VMID pool as well. 
>> The suffering comes
>> from associating VMIDs to one shared parent S2 domain.
>>
>> Does a DID per S1 nested domain or parent S2? If it is per S2,
>> I think the same suffering applies when we share the S2 across
>> IOMMU instances?
>
> It's per S1 nested domain in current VT-d design. It's simple but lacks
> sharing of DID within a VM. We probably will change this later.

Could you share a bit more about this? I hope it is not going to share the DID if the S1 nested domains share the same S2 HWPT. For first-stage caches, the tag is the PASID, the DID and the address. If both the PASID and the DID are the same, then there is a cache conflict. And the typical scenario is gIOVA, which uses the RIDPASID. :)
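Yi's point about first-stage cache tags can be reduced to a comparison of tag tuples. The sketch below assumes a simplified tag of (PASID, DID, address) as described above, with field widths chosen for illustration and the RIDPASID modeled by a hypothetical `RID_PASID` constant of 0 (an assumption of this sketch, not a value taken from the thread). Two S1 nested domains that both use the RIDPASID for gIOVA and share a DID would produce identical tags for the same address even though their page tables differ, whereas distinct DIDs keep them apart.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RID_PASID 0u /* hypothetical: PASID used for requests without PASID */

/* Simplified first-stage cache tag: (PASID, DID, address). */
struct fs_tag {
	uint32_t pasid;
	uint16_t did;
	uint64_t iova;
};

/* Two cached translations are indistinguishable if all tag fields match. */
static bool tags_collide(struct fs_tag a, struct fs_tag b)
{
	return a.pasid == b.pasid && a.did == b.did && a.iova == b.iova;
}
```

A collision here means the hardware could return one domain's cached translation for the other domain's request, which is the conflict Yi wants to avoid by not sharing DIDs across S1 nested domains.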
On Fri, Sep 27, 2024 at 01:54:45PM +0800, Yi Liu wrote: > > > > Baolu told me that Intel may have the same: different domain IDs > > > > on different IOMMUs; multiple IOMMU instances on one chip: > > > > https://lore.kernel.org/linux-iommu/cf4fe15c-8bcb-4132-a1fd-b2c8ddf2731b@linux.intel.com/ > > > > So, I think we are having the same situation here. > > > > > > yes, it's called iommu unit or dmar. A typical Intel server can have > > > multiple iommu units. But like Baolu mentioned in that thread, the intel > > > iommu driver maintains separate domain ID spaces for iommu units, which > > > means a given iommu domain has different DIDs when associated with > > > different iommu units. So intel side is not suffering from this so far. > > > > An ARM SMMU has its own VMID pool as well. The suffering comes > > from associating VMIDs to one shared parent S2 domain. > > Is this because of the VMID is tied with a S2 domain? On ARM, yes. VMID is a part of S2 domain stuff. > > Does a DID per S1 nested domain or parent S2? If it is per S2, > > I think the same suffering applies when we share the S2 across > > IOMMU instances? > > per S1 I think. The iotlb efficiency is low as S2 caches would be > tagged with different DIDs even the page table is the same. :) On ARM, the stage-1 is tagged with an ASID (Address Space ID) while the stage-2 is tagged with a VMID. Then an invalidation for a nested S1 domain must require the VMID from the S2. The ASID may be also required if the invalidation is specific to that address space (otherwise, broadcast per VMID.) I feel these two might act somehow similarly to the two DIDs during nested translations? > > > > Adding another vIOMMU wrapper on the other hand can allow us to > > > > allocate different VMIDs/DIDs for different IOMMUs. > > > > > > that looks like to generalize the association of the iommu domain and the > > > iommu units? > > > > A vIOMMU is a presentation/object of a physical IOMMU instance > > in a VM. 
> > a slice of a physical IOMMU. is it? Yes. When multiple nested translations happen at the same time, the IOMMU (just like a CPU) is shared by these slices, and so is an invalidation queue executing multiple requests. Perhaps calling it a slice is more accurate, as I guess all the confusion comes from the name "vIOMMU", which might be taken for a user space object/instance holding all the virtual stuff like the stage-1 HWPT? > and you treat S2 hwpt as a resource of the physical IOMMU as well. Yes. A parent HWPT (in the old days, we called it a "kernel-managed" HWPT) is not a user space thing. It is a kernel-owned object. > > This presentation gives a VMM some capability to take > > advantage of some of HW resource of the physical IOMMU: > > - a VMID is a small HW reousrce to tag the cache; > > - a vIOMMU invalidation allows to access device cache that's > > not straightforwardly done via an S1 HWPT invalidation; > > - a virtual device presentation of a physical device in a VM, > > related to the vIOMMU in the VM, which contains some VM-level > > info: virtual device ID, security level (ARM CCA), and etc; > > - Non-PRI IRQ forwarding to the guest VM; > > - HW-accelerated virtualization resource: vCMDQ, AMD VIOMMU; > might be helpful to draw a diagram to show what the vIOMMU obj contains.:) That's what I plan to do. It basically looks like: device---->stage1--->[ viommu [s2_hwpt, vmid, virq, HW-acc, etc.] ] Thanks Nic
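A minimal C sketch of the diagram above, assuming a vIOMMU holds a pointer to the shared, kernel-owned S2 HWPT plus a VMID drawn from the pool of the physical IOMMU it is instanced on. All names (`struct piommu`, `struct s2_hwpt`, `struct viommu`, `vmid_alloc()`, `viommu_init()`) are hypothetical and do not correspond to the iommufd or arm-smmu-v3 code; vIRQ and HW-accelerated resources are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VMID_POOL_SIZE 16 /* tiny pool for illustration only */

/* Hypothetical physical IOMMU instance with its own VMID pool. */
struct piommu {
	bool vmid_used[VMID_POOL_SIZE];
};

/* Hypothetical kernel-owned parent (stage-2) HWPT shared by vIOMMUs. */
struct s2_hwpt {
	int users; /* number of vIOMMUs holding this S2 */
};

/* Sketch of the diagram: viommu [ s2_hwpt, vmid, ... ] */
struct viommu {
	struct piommu *piommu;
	struct s2_hwpt *s2_hwpt;
	uint16_t vmid;
	/* virq, HW-accelerated queues, etc. omitted in this sketch */
};

/* Take the lowest free VMID from this pIOMMU's pool; -1 if exhausted. */
static int vmid_alloc(struct piommu *p)
{
	for (int i = 1; i < VMID_POOL_SIZE; i++) { /* VMID 0 kept reserved here */
		if (!p->vmid_used[i]) {
			p->vmid_used[i] = true;
			return i;
		}
	}
	return -1;
}

/* Instance a vIOMMU on pIOMMU @p that shares the parent S2 @s2. */
static int viommu_init(struct viommu *v, struct piommu *p, struct s2_hwpt *s2)
{
	int vmid;

	assert(v && p && s2);
	vmid = vmid_alloc(p);
	if (vmid < 0)
		return -1;
	v->piommu = p;
	v->s2_hwpt = s2;
	v->vmid = (uint16_t)vmid;
	s2->users++;
	return 0;
}
```

Because each pIOMMU has its own pool, two vIOMMUs instanced on two different pIOMMUs may both end up with VMID 1 while pointing at the same S2 HWPT, which is the sharing described at the start of this thread. Keeping VMID 0 reserved is only a choice made in this sketch.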
On 2024/9/27 14:32, Nicolin Chen wrote: > On Fri, Sep 27, 2024 at 01:54:45PM +0800, Yi Liu wrote: >>>>> Baolu told me that Intel may have the same: different domain IDs >>>>> on different IOMMUs; multiple IOMMU instances on one chip: >>>>> https://lore.kernel.org/linux-iommu/cf4fe15c-8bcb-4132-a1fd-b2c8ddf2731b@linux.intel.com/ >>>>> So, I think we are having the same situation here. >>>> >>>> yes, it's called iommu unit or dmar. A typical Intel server can have >>>> multiple iommu units. But like Baolu mentioned in that thread, the intel >>>> iommu driver maintains separate domain ID spaces for iommu units, which >>>> means a given iommu domain has different DIDs when associated with >>>> different iommu units. So intel side is not suffering from this so far. >>> >>> An ARM SMMU has its own VMID pool as well. The suffering comes >>> from associating VMIDs to one shared parent S2 domain. >> >> Is this because of the VMID is tied with a S2 domain? > > On ARM, yes. VMID is a part of S2 domain stuff. > >>> Does a DID per S1 nested domain or parent S2? If it is per S2, >>> I think the same suffering applies when we share the S2 across >>> IOMMU instances? >> >> per S1 I think. The iotlb efficiency is low as S2 caches would be >> tagged with different DIDs even the page table is the same. :) > > On ARM, the stage-1 is tagged with an ASID (Address Space ID) > while the stage-2 is tagged with a VMID. Then an invalidation > for a nested S1 domain must require the VMID from the S2. The > ASID may be also required if the invalidation is specific to > that address space (otherwise, broadcast per VMID.) Looks like the nested S1 caches are tagged with both the ASID and the VMID. > I feel these two might act somehow similarly to the two DIDs > during nested translations? Not quite the same. Is it possible for the ASID to be the same for stage-1? On the Intel VT-d side, the PASID can be the same: for gIOVA, all devices use the same RIDPASID.
In the scenario I described in my reply to Baolu[1], we choose to use different DIDs to differentiate the caches of the two devices. [1] https://lore.kernel.org/linux-iommu/4bc9bd20-5aae-440d-84fd-f530d0747c23@intel.com/ >>>>> Adding another vIOMMU wrapper on the other hand can allow us to >>>>> allocate different VMIDs/DIDs for different IOMMUs. >>>> >>>> that looks like to generalize the association of the iommu domain and the >>>> iommu units? >>> >>> A vIOMMU is a presentation/object of a physical IOMMU instance >>> in a VM. >> >> a slice of a physical IOMMU. is it? > > Yes. When multiple nested translations happen at the same time, > IOMMU (just like a CPU) is shared by these slices. And so is an > invalidation queue executing multiple requests. > > Perhaps calling it a slice sounds more accurate, as I guess all > the confusion comes from the name "vIOMMU" that might be thought > to be a user space object/instance that likely holds all virtual > stuff like stage-1 HWPT or so? Yeah. Maybe this confusion partly comes from starting the discussion with cache invalidation as well. At first glance, I failed to see why an S2 HWPT needs to be part of the vIOMMU object. > >> and you treat S2 hwpt as a resource of the physical IOMMU as well. > > Yes. A parent HWPT (in the old day, we called it "kernel-manged" > HWPT) is not a user space thing. This belongs to a kernel owned > object.
> >>> This presentation gives a VMM some capability to take >>> advantage of some of HW resource of the physical IOMMU: >>> - a VMID is a small HW reousrce to tag the cache; >>> - a vIOMMU invalidation allows to access device cache that's >>> not straightforwardly done via an S1 HWPT invalidation; >>> - a virtual device presentation of a physical device in a VM, >>> related to the vIOMMU in the VM, which contains some VM-level >>> info: virtual device ID, security level (ARM CCA), and etc; >>> - Non-PRI IRQ forwarding to the guest VM; >>> - HW-accelerated virtualization resource: vCMDQ, AMD VIOMMU; >> >> might be helpful to draw a diagram to show what the vIOMMU obj contains.:) > > That's what I plan to. Basically looks like: > device---->stage1--->[ viommu [s2_hwpt, vmid, virq, HW-acc, etc.] ] ok. let's see your new doc.
On Fri, Sep 27, 2024 at 08:12:20PM +0800, Yi Liu wrote: > > Perhaps calling it a slice sounds more accurate, as I guess all > > the confusion comes from the name "vIOMMU" that might be thought > > to be a user space object/instance that likely holds all virtual > > stuff like stage-1 HWPT or so? > > yeah. Maybe this confusion partly comes when you start it with the > cache invalidation as well. I failed to get why a S2 hwpt needs to > be part of the vIOMMU obj at the first glance. Both AMD and ARM have direct-to-VM queues for the IOMMU, and these queues have their DMA translated by the S2. So their vIOMMU HW concepts come with a requirement that there be a fixed translation for the VM, which we model by attaching an S2 HWPT to the VIOMMU object, which gets linked into the IOMMU HW as the translation for the queue memory. Jason
On Fri, Sep 27, 2024 at 08:12:20PM +0800, Yi Liu wrote: > On 2024/9/27 14:32, Nicolin Chen wrote: > > On Fri, Sep 27, 2024 at 01:54:45PM +0800, Yi Liu wrote: > > > > > > Baolu told me that Intel may have the same: different domain IDs > > > > > > on different IOMMUs; multiple IOMMU instances on one chip: > > > > > > https://lore.kernel.org/linux-iommu/cf4fe15c-8bcb-4132-a1fd-b2c8ddf2731b@linux.intel.com/ > > > > > > So, I think we are having the same situation here. > > > > > > > > > > yes, it's called iommu unit or dmar. A typical Intel server can have > > > > > multiple iommu units. But like Baolu mentioned in that thread, the intel > > > > > iommu driver maintains separate domain ID spaces for iommu units, which > > > > > means a given iommu domain has different DIDs when associated with > > > > > different iommu units. So intel side is not suffering from this so far. > > > > > > > > An ARM SMMU has its own VMID pool as well. The suffering comes > > > > from associating VMIDs to one shared parent S2 domain. > > > > > > Is this because of the VMID is tied with a S2 domain? > > > > On ARM, yes. VMID is a part of S2 domain stuff. > > > > > > Does a DID per S1 nested domain or parent S2? If it is per S2, > > > > I think the same suffering applies when we share the S2 across > > > > IOMMU instances? > > > > > > per S1 I think. The iotlb efficiency is low as S2 caches would be > > > tagged with different DIDs even the page table is the same. :) > > > > On ARM, the stage-1 is tagged with an ASID (Address Space ID) > > while the stage-2 is tagged with a VMID. Then an invalidation > > for a nested S1 domain must require the VMID from the S2. The > > ASID may be also required if the invalidation is specific to > > that address space (otherwise, broadcast per VMID.) > Looks like the nested s1 caches are tagged with both ASID and VMID. Yea, my understanding is similar. If both stages are enabled for a nested translation, the VMID is tagged onto the S1 cache entries too.
> > I feel these two might act somehow similarly to the two DIDs > > during nested translations? > > not quite the same. Is it possible that the ASID is the same for stage-1? > Intel VT-d side can have the pasid to be the same. Like the gIOVA, all > devices use the same ridpasid. Like the scenario I replied to Baolu[1], > do er choose to use different DIDs to differentiate the caches for the > two devices. On ARM, each S1 domain (either a normal stage-1 PASID=0 domain or an SVA PASID>0 domain) has a unique ASID. So two identical ASIDs are unlikely on the same vIOMMU, because the ASID pool is per IOMMU instance (whether p or v). With two vIOMMU instances, the same ASID may appear in both, but the entries will be tagged with different VMIDs. > [1] > https://lore.kernel.org/linux-iommu/4bc9bd20-5aae-440d-84fd-f530d0747c23@intel.com/ Is "gIOVA" a type of invalidation that only uses the "address" out of "PASID, DID and address"? I.e., the PASID and DID are not provided in the invalidation request, so it is going to be broadcast to all vIOMMUs? Thanks Nicolin
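The ASID/VMID tagging described above can be modeled in a few lines of C, assuming each nested S1 IOTLB entry carries both the S1 ASID and the VMID of the vIOMMU it belongs to. `struct tlb_entry` and `tlb_inv()` are hypothetical; this is a software model of the tagging rule, not the SMMU cache organization or its command format.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical nested S1 IOTLB entry: tagged with the S1 ASID and with the
 * VMID of the vIOMMU (and thus the S2) it was walked through.
 */
struct tlb_entry {
	bool valid;
	uint16_t vmid;
	uint16_t asid;
	uint64_t iova_pfn;
};

/*
 * Invalidate entries matching (vmid, asid). Passing asid < 0 models a
 * broadcast over every ASID under that VMID. Returns the number dropped.
 */
static size_t tlb_inv(struct tlb_entry *tlb, size_t n, uint16_t vmid, int asid)
{
	size_t hits = 0;

	assert(tlb || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (!tlb[i].valid || tlb[i].vmid != vmid)
			continue; /* other vIOMMUs are never touched */
		if (asid >= 0 && tlb[i].asid != (uint16_t)asid)
			continue;
		tlb[i].valid = false;
		hits++;
	}
	return hits;
}
```

With one entry tagged (VMID 1, ASID 7) and another tagged (VMID 2, ASID 7), an ASID-scoped invalidation for VMID 1 drops only the first, so reusing an ASID across two vIOMMUs is harmless; passing a negative `asid` models the per-VMID broadcast case.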
On 2024/9/28 04:44, Nicolin Chen wrote: > On Fri, Sep 27, 2024 at 08:12:20PM +0800, Yi Liu wrote: >> On 2024/9/27 14:32, Nicolin Chen wrote: >>> On Fri, Sep 27, 2024 at 01:54:45PM +0800, Yi Liu wrote: >>>>>>> Baolu told me that Intel may have the same: different domain IDs >>>>>>> on different IOMMUs; multiple IOMMU instances on one chip: >>>>>>> https://lore.kernel.org/linux-iommu/cf4fe15c-8bcb-4132-a1fd-b2c8ddf2731b@linux.intel.com/ >>>>>>> So, I think we are having the same situation here. >>>>>> >>>>>> yes, it's called iommu unit or dmar. A typical Intel server can have >>>>>> multiple iommu units. But like Baolu mentioned in that thread, the intel >>>>>> iommu driver maintains separate domain ID spaces for iommu units, which >>>>>> means a given iommu domain has different DIDs when associated with >>>>>> different iommu units. So intel side is not suffering from this so far. >>>>> >>>>> An ARM SMMU has its own VMID pool as well. The suffering comes >>>>> from associating VMIDs to one shared parent S2 domain. >>>> >>>> Is this because of the VMID is tied with a S2 domain? >>> >>> On ARM, yes. VMID is a part of S2 domain stuff. >>> >>>>> Does a DID per S1 nested domain or parent S2? If it is per S2, >>>>> I think the same suffering applies when we share the S2 across >>>>> IOMMU instances? >>>> >>>> per S1 I think. The iotlb efficiency is low as S2 caches would be >>>> tagged with different DIDs even the page table is the same. :) >>> >>> On ARM, the stage-1 is tagged with an ASID (Address Space ID) >>> while the stage-2 is tagged with a VMID. Then an invalidation >>> for a nested S1 domain must require the VMID from the S2. The >>> ASID may be also required if the invalidation is specific to >>> that address space (otherwise, broadcast per VMID.) > >> Looks like the nested s1 caches are tagged with both ASID and VMID. > > Yea, my understanding is similar. If both stages are enabled for > a nested translation, VMID is tagged for S1 cache too. 
> >>> I feel these two might act somehow similarly to the two DIDs >>> during nested translations? >> >> not quite the same. Is it possible that the ASID is the same for stage-1? >> Intel VT-d side can have the pasid to be the same. Like the gIOVA, all >> devices use the same ridpasid. Like the scenario I replied to Baolu[1], >> do er choose to use different DIDs to differentiate the caches for the >> two devices. > > On ARM, each S1 domain (either a normal stage-1 PASID=0 domain or > an SVA PASID>0 domain) has a unique ASID. I see. So the ASID is not the PASID. > So it unlikely has the > situation of two identical ASIDs if they are on the same vIOMMU, > because the ASID pool is per IOMMU instance (whether p or v). > > With two vIOMMU instances, there might be the same ASIDs but they > will be tagged with different VMIDs. > >> [1] >> https://lore.kernel.org/linux-iommu/4bc9bd20-5aae-440d-84fd-f530d0747c23@intel.com/ > > Is "gIOVA" a type of invalidation that only uses "address" out of > "PASID, DID and address"? I.e. PASID and DID are not provided via > the invalidation request, so it's going to broadcast all viommus? gIOVA is just a term used in contrast to vSVA; I just want to differentiate the two. :) The PASID and DID are still provided in the invalidation.
On 2024/9/27 20:20, Jason Gunthorpe wrote: > On Fri, Sep 27, 2024 at 08:12:20PM +0800, Yi Liu wrote: >>> Perhaps calling it a slice sounds more accurate, as I guess all >>> the confusion comes from the name "vIOMMU" that might be thought >>> to be a user space object/instance that likely holds all virtual >>> stuff like stage-1 HWPT or so? >> >> yeah. Maybe this confusion partly comes when you start it with the >> cache invalidation as well. I failed to get why a S2 hwpt needs to >> be part of the vIOMMU obj at the first glance. > > Both amd and arm have direct to VM queues for the iommu and these > queues have their DMA translated by the S2. OK, this explains why the S2 should be part of the vIOMMU object. > > So their viommu HW concepts come along with a requirement that there > be a fixed translation for the VM, which we model by attaching a S2 > HWPT to the VIOMMU object which get's linked into the IOMMU HW as > the translation for the queue memory. Is the S2 mapping static, or can it be unmapped by userspace?
On Sun, Sep 29, 2024 at 03:16:55PM +0800, Yi Liu wrote: > > > > I feel these two might act somehow similarly to the two DIDs > > > > during nested translations? > > > > > > not quite the same. Is it possible that the ASID is the same for stage-1? > > > Intel VT-d side can have the pasid to be the same. Like the gIOVA, all > > > devices use the same ridpasid. Like the scenario I replied to Baolu[1], > > > do er choose to use different DIDs to differentiate the caches for the > > > two devices. > > > > On ARM, each S1 domain (either a normal stage-1 PASID=0 domain or > > an SVA PASID>0 domain) has a unique ASID. > > I see. Looks like ASID is not the PASID. It's not. The PASID is called the Substream ID (SSID) in SMMU terms. It's used to index the PASID table. For cache invalidations, a PASID (SSID) is only used for ATC (device cache) or PASID table entry invalidation. > > So it unlikely has the > > situation of two identical ASIDs if they are on the same vIOMMU, > > because the ASID pool is per IOMMU instance (whether p or v). > > > > With two vIOMMU instances, there might be the same ASIDs but they > > will be tagged with different VMIDs. > > > > > [1] > > > https://lore.kernel.org/linux-iommu/4bc9bd20-5aae-440d-84fd-f530d0747c23@intel.com/ > > > > Is "gIOVA" a type of invalidation that only uses "address" out of > > "PASID, DID and address"? I.e. PASID and DID are not provided via > > the invalidation request, so it's going to broadcast all viommus? > > gIOVA is just a term v.s. vSVA. Just want to differentiate it from vSVA. :) > PASID and DID are still provided in the invalidation. I am still not getting this gIOVA. What exactly does it do, vs. vSVA? And should RIDPASID be IOMMU_NO_PASID? Nicolin
On 11/9/24 17:08, Nicolin Chen wrote: > On Wed, Sep 11, 2024 at 06:12:21AM +0000, Tian, Kevin wrote: >>> From: Nicolin Chen <nicolinc@nvidia.com> >>> Sent: Wednesday, August 28, 2024 1:00 AM >>> >> [...] >>> On a multi-IOMMU system, the VIOMMU object can be instanced to the >>> number >>> of vIOMMUs in a guest VM, while holding the same parent HWPT to share >>> the >> >> Is there restriction that multiple vIOMMU objects can be only created >> on a multi-IOMMU system? > > I think it should be generally restricted to the number of pIOMMUs, > although likely (not 100% sure) we could do multiple vIOMMUs on a > single-pIOMMU system. Any reason for doing that? Just to clarify the terminology here: what exactly are pIOMMU and vIOMMU? On AMD, the IOMMU is a pretend-PCIe device, one per root port, which manages a DT (device table) with one entry per BDFn; the entry owns a queue. A slice of that can be passed to a VM (== queues mapped directly to the VM, and such an IOMMU appears in the VM as a pretend-PCIe device too). So what is [pv]IOMMU here? Thanks, > >>> stage-2 IO pagetable. Each VIOMMU then just need to only allocate its own >>> VMID to attach the shared stage-2 IO pagetable to the physical IOMMU: >> >> this reads like 'VMID' is a virtual ID allocated by vIOMMU. But from the >> entire context it actually means the physical 'VMID' allocated on the >> associated physical IOMMU, correct? > > Quoting Jason's narratives, a VMID is a "Security namespace for > guest owned ID". The allocation, using SMMU as an example, should > be a part of vIOMMU instance allocation in the host SMMU driver. > Then, this VMID will be used to mark the cache tags. So, it is > still a software allocated ID, while HW would use it too. > > Thanks > Nicolin
On Tue, Oct 01, 2024 at 11:55:59AM +1000, Alexey Kardashevskiy wrote: > On 11/9/24 17:08, Nicolin Chen wrote: > > On Wed, Sep 11, 2024 at 06:12:21AM +0000, Tian, Kevin wrote: > > > > From: Nicolin Chen <nicolinc@nvidia.com> > > > > Sent: Wednesday, August 28, 2024 1:00 AM > > > > > > > [...] > > > > On a multi-IOMMU system, the VIOMMU object can be instanced to the > > > > number > > > > of vIOMMUs in a guest VM, while holding the same parent HWPT to share > > > > the > > > > > > Is there restriction that multiple vIOMMU objects can be only created > > > on a multi-IOMMU system? > > > > I think it should be generally restricted to the number of pIOMMUs, > > although likely (not 100% sure) we could do multiple vIOMMUs on a > > single-pIOMMU system. Any reason for doing that? > > > Just to clarify the terminology here - what are pIOMMU and vIOMMU exactly? > > On AMD, IOMMU is a pretend-pcie device, one per a rootport, manages a DT > - device table, one entry per BDFn, the entry owns a queue. A slice of > that can be passed to a VM (== queues mapped directly to the VM, and > such IOMMU appears in the VM as a pretend-pcie device too). So what is > [pv]IOMMU here? Thanks, The "p" stands for physical: the entire IOMMU unit/instance. In the IOMMU subsystem terminology, it's a struct iommu_device. It sounds like AMD would register one iommu device per rootport? The "v" stands for virtual: a slice of the pIOMMU that could be shared or passed through to a VM: - Intel IOMMU doesn't have passthrough queues, so it uses a shared queue (for invalidation). In this case, vIOMMU will be a pure SW structure for HW queue sharing (with the host machine and other VMs). That said, I think the channel (or the port) that Intel VT-d uses internally for a device to do a two-stage translation can be seen as a "passthrough" feature, held by a vIOMMU. 
- AMD IOMMU can assign passthrough queues to VMs, in which case the vIOMMU will be a structure holding all the passthrough resources (of the pIOMMU) assigned to a VM. If there is a shared resource, it can be packed into the vIOMMU struct too. FYI, a vQUEUE (future series), on the other hand, will represent each passthrough queue in a vIOMMU struct. The VM then, per that specific pIOMMU (root port?), will have one vIOMMU holding a number of vQUEUEs. - ARM SMMU is sort of in the middle, depending on the implementation. The vIOMMU will be a structure holding both passthrough and shared resources. It can define vQUEUEs if the implementation has passthrough queues like AMD does. Allowing a vIOMMU to hold shared resources makes it a somewhat upgraded model for IOMMU virtualization compared with the existing HWPT model, which now looks like a subset of the vIOMMU model. Thanks Nicolin
On 1/10/24 13:36, Nicolin Chen wrote: > On Tue, Oct 01, 2024 at 11:55:59AM +1000, Alexey Kardashevskiy wrote: >> On 11/9/24 17:08, Nicolin Chen wrote: >>> On Wed, Sep 11, 2024 at 06:12:21AM +0000, Tian, Kevin wrote: >>>>> From: Nicolin Chen <nicolinc@nvidia.com> >>>>> Sent: Wednesday, August 28, 2024 1:00 AM >>>>> >>>> [...] >>>>> On a multi-IOMMU system, the VIOMMU object can be instanced to the >>>>> number >>>>> of vIOMMUs in a guest VM, while holding the same parent HWPT to share >>>>> the >>>> >>>> Is there restriction that multiple vIOMMU objects can be only created >>>> on a multi-IOMMU system? >>> >>> I think it should be generally restricted to the number of pIOMMUs, >>> although likely (not 100% sure) we could do multiple vIOMMUs on a >>> single-pIOMMU system. Any reason for doing that? >> >> >> Just to clarify the terminology here - what are pIOMMU and vIOMMU exactly? >> >> On AMD, IOMMU is a pretend-pcie device, one per a rootport, manages a DT >> - device table, one entry per BDFn, the entry owns a queue. A slice of >> that can be passed to a VM (== queues mapped directly to the VM, and >> such IOMMU appears in the VM as a pretend-pcie device too). So what is >> [pv]IOMMU here? Thanks, > > The "p" stands for physical: the entire IOMMU unit/instance. In > the IOMMU subsystem terminology, it's a struct iommu_device. It > sounds like AMD would register one iommu device per rootport? Yup, my test machine has 4 of these. > The "v" stands for virtual: a slice of the pIOMMU that could be > shared or passed through to a VM: > - Intel IOMMU doesn't have passthrough queues, so it uses a > shared queue (for invalidation). In this case, vIOMMU will > be a pure SW structure for HW queue sharing (with the host > machine and other VMs). That said, I think the channel (or > the port) that Intel VT-d uses internally for a device to > do a two-stage translation can be seen as a "passthrough" > feature, held by a vIOMMU. 
> - AMD IOMMU can assign passthrough queues to VMs, in which > case, vIOMMU will be a structure holding all passthrough > resource (of the pIOMMU) assisgned to a VM. If there is a > shared resource, it can be packed into the vIOMMU struct > too. FYI, vQUEUE (future series) on the other hand will > represent each passthrough queue in a vIOMMU struct. The > VM then, per that specific pIOMMU (rootport?), will have > one vIOMMU holding a number of vQUEUEs. > - ARM SMMU is sort of in the middle, depending on the impls. > vIOMMU will be a structure holding both passthrough and > shared resource. It can define vQUEUEs, if the impl has > passthrough queues like AMD does. > > Allowing a vIOMMU to hold shared resource makes it a bit of an > upgraded model for IOMMU virtualization, from the existing HWPT > model that now looks like a subset of the vIOMMU model. Thanks for confirming. I've just read in this thread that "it should be generally restricted to the number of pIOMMUs, although likely (not 100% sure) we could do multiple vIOMMUs on a single-pIOMMU system. Any reason for doing that?" and thought "we have every reason to do that, unless p means something different", so I decided to ask :) Thanks, > > Thanks > Nicolin
On Tue, Oct 01, 2024 at 03:06:57PM +1000, Alexey Kardashevskiy wrote: > I've just read in this thread that "it should be generally restricted to the > number of pIOMMUs, although likely (not 100% sure) we could do multiple > vIOMMUs on a single-pIOMMU system. Any reason for doing that?"? thought "we > have every reason to do that, unless p means something different", so I > decided to ask :) Thanks, I think that was intended as "multiple vIOMMUs per pIOMMU within a single VM". There would always be multiple vIOMMUs per pIOMMU across VMs/etc. Jason
On Sun, Sep 29, 2024 at 03:19:42PM +0800, Yi Liu wrote: > > So their viommu HW concepts come along with a requirement that there > > be a fixed translation for the VM, which we model by attaching a S2 > > HWPT to the VIOMMU object which get's linked into the IOMMU HW as > > the translation for the queue memory. > > Is the mapping of the S2 be static? or it an be unmapped per userspace? In principle it should be dynamic, but I think the vCMDQ stuff will struggle to do that Jason
On Tue, Oct 01, 2024 at 10:48:15AM -0300, Jason Gunthorpe wrote: > On Sun, Sep 29, 2024 at 03:19:42PM +0800, Yi Liu wrote: > > > So their viommu HW concepts come along with a requirement that there > > > be a fixed translation for the VM, which we model by attaching a S2 > > > HWPT to the VIOMMU object which get's linked into the IOMMU HW as > > > the translation for the queue memory. > > > > Is the mapping of the S2 be static? or it an be unmapped per userspace? > > In principle it should be dynamic, but I think the vCMDQ stuff will > struggle to do that Yea. vCMDQ HW must be programmed with the physical address of the queue base, which lives in the VM's RAM. If the S2 mapping changes (resulting in a different queue location in physical memory), the VMM should notify the kernel so that it can reconfigure the HW. I wonder what all the use cases are that can shift S2 mappings. VM migration? Any others? Thanks Nicolin
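The vCMDQ constraint can be sketched as follows, assuming the HW register holds the physical address of the queue base while the guest only knows its IPA, so the kernel must redo the S2 lookup whenever that mapping changes. The flat `struct s2_map`, `struct vcmdq`, `s2_translate()` and `vcmdq_program()` are hypothetical simplifications, not the real vCMDQ driver interface.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define S2_PAGES   8 /* tiny IPA space for illustration */

/* Hypothetical flat S2: IPA page number -> PA page number (0 = unmapped). */
struct s2_map {
	uint64_t pa_pfn[S2_PAGES];
};

/* Hypothetical vCMDQ: the HW holds a physical base, SW keeps the guest IPA. */
struct vcmdq {
	uint64_t base_ipa;   /* what the guest programmed */
	uint64_t hw_base_pa; /* what the HW base register actually holds */
};

static uint64_t s2_translate(const struct s2_map *s2, uint64_t ipa)
{
	uint64_t pfn = ipa >> PAGE_SHIFT;

	assert(pfn < S2_PAGES && s2->pa_pfn[pfn] != 0);
	return (s2->pa_pfn[pfn] << PAGE_SHIFT) |
	       (ipa & ((1ULL << PAGE_SHIFT) - 1));
}

/* Must be re-run whenever the S2 mapping backing base_ipa changes. */
static void vcmdq_program(struct vcmdq *q, const struct s2_map *s2)
{
	q->hw_base_pa = s2_translate(s2, q->base_ipa);
}
```

If the S2 page backing the queue moves (for example during VM migration), `hw_base_pa` keeps pointing at the old page until `vcmdq_program()` is called again, which is why the VMM would have to notify the kernel.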
On 2024/10/1 05:59, Nicolin Chen wrote: > On Sun, Sep 29, 2024 at 03:16:55PM +0800, Yi Liu wrote: >>>>> I feel these two might act somehow similarly to the two DIDs >>>>> during nested translations? >>>> >>>> not quite the same. Is it possible that the ASID is the same for stage-1? >>>> Intel VT-d side can have the pasid to be the same. Like the gIOVA, all >>>> devices use the same ridpasid. Like the scenario I replied to Baolu[1], >>>> do er choose to use different DIDs to differentiate the caches for the >>>> two devices. >>> >>> On ARM, each S1 domain (either a normal stage-1 PASID=0 domain or >>> an SVA PASID>0 domain) has a unique ASID. >> >> I see. Looks like ASID is not the PASID. > > It's not. PASID is called Substream ID in SMMU term. It's used to > index the PASID table. For cache invalidations, a PASID (ssid) is > for ATC (dev cache) or PASID table entry invalidation only. Sure. Is there any relationship between PASID and ASID? Per the link below, the ASID is used to tag the TLB entries of an application. So it's used in the SVA case, right? https://developer.arm.com/documentation/102142/0100/Stage-2-translation >>> So it unlikely has the >>> situation of two identical ASIDs if they are on the same vIOMMU, >>> because the ASID pool is per IOMMU instance (whether p or v). >>> >>> With two vIOMMU instances, there might be the same ASIDs but they >>> will be tagged with different VMIDs. >>> >>>> [1] >>>> https://lore.kernel.org/linux-iommu/4bc9bd20-5aae-440d-84fd-f530d0747c23@intel.com/ >>> >>> Is "gIOVA" a type of invalidation that only uses "address" out of >>> "PASID, DID and address"? I.e. PASID and DID are not provided via >>> the invalidation request, so it's going to broadcast all viommus? >> >> gIOVA is just a term v.s. vSVA. Just want to differentiate it from vSVA. :) >> PASID and DID are still provided in the invalidation. > > I am still not getting this gIOVA. What it does exactly v.s. vSVA? > And should RIDPASID be IOMMU_NO_PASID?
gIOVA is the IOVA in the guest; vSVA is just SVA in the guest. Maybe the confusion is about why we say gIOVA instead of vIOVA, is it? I think you are clear about IOVA vs. SVA. :) Yes, RIDPASID is IOMMU_NO_PASID, although the VT-d architecture allows it to be a value other than IOMMU_NO_PASID.
On Wed, Oct 09, 2024 at 03:20:57PM +0800, Yi Liu wrote: > On 2024/10/1 05:59, Nicolin Chen wrote: > > On Sun, Sep 29, 2024 at 03:16:55PM +0800, Yi Liu wrote: > > > > > > I feel these two might act somehow similarly to the two DIDs > > > > > > during nested translations? > > > > > > > > > > not quite the same. Is it possible that the ASID is the same for stage-1? > > > > > Intel VT-d side can have the pasid to be the same. Like the gIOVA, all > > > > > devices use the same ridpasid. Like the scenario I replied to Baolu[1], > > > > > do er choose to use different DIDs to differentiate the caches for the > > > > > two devices. > > > > > > > > On ARM, each S1 domain (either a normal stage-1 PASID=0 domain or > > > > an SVA PASID>0 domain) has a unique ASID. > > > > > > I see. Looks like ASID is not the PASID. > > > > It's not. PASID is called Substream ID in SMMU term. It's used to > > index the PASID table. For cache invalidations, a PASID (ssid) is > > for ATC (dev cache) or PASID table entry invalidation only. > > sure. Is there any relationship between PASID and ASID? Per the below > link, ASID is used to tag the TLB entries of an application. So it's > used in the SVA case. right? Unlike Intel and AMD, on ARM the IOTLB tag is entirely controlled by software. The HW looks up the PASID and retrieves an ASID, then uses that as the cache tag. Intel and AMD use the PASID as the cache tag. As we've talked about several times, using the PASID directly as a cache tag robs the SW of optimization possibilities in some cases. The extra ASID indirection allows the SW to always tag the same page table top pointer with the same ASID, regardless of which PASID it is assigned to, and so guarantee IOTLB sharing. Jason
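The ASID indirection Jason describes can be sketched under a simplified model: a PASID (SSID) indexes a table of context descriptors, each carrying the page table top pointer and a SW-chosen ASID, and SW picks the ASID by page table rather than by PASID. Everything here (`struct cd`, `struct smmu_model`, `asid_for_ttb()`, `attach_pasid()`) is hypothetical and ignores refcounting, freeing, and the real CD layout in the arm-smmu-v3 driver.

```c
#include <assert.h>
#include <stdint.h>

#define NR_PASIDS 8
#define NR_ASIDS  8

/* Hypothetical context descriptor: what a PASID (SSID) indexes to. */
struct cd {
	uint64_t ttb;  /* page table top pointer, 0 = unused */
	uint16_t asid; /* SW-chosen IOTLB tag */
};

/* Hypothetical per-instance state: PASID (CD) table + ASID map keyed by ttb. */
struct smmu_model {
	struct cd cd_table[NR_PASIDS];
	uint64_t asid_ttb[NR_ASIDS]; /* asid -> ttb it tags, 0 = free */
};

/* Reuse the ASID already tagging this page table, else take a free one. */
static int asid_for_ttb(struct smmu_model *s, uint64_t ttb)
{
	int free_asid = -1;

	for (int a = 1; a < NR_ASIDS; a++) { /* ASID 0 unused in this sketch */
		if (s->asid_ttb[a] == ttb)
			return a;
		if (s->asid_ttb[a] == 0 && free_asid < 0)
			free_asid = a;
	}
	if (free_asid >= 0)
		s->asid_ttb[free_asid] = ttb;
	return free_asid;
}

/* Attach a page table to a PASID; HW would then tag with cd_table[pasid].asid. */
static int attach_pasid(struct smmu_model *s, uint32_t pasid, uint64_t ttb)
{
	int asid;

	assert(pasid < NR_PASIDS && ttb != 0);
	asid = asid_for_ttb(s, ttb);
	if (asid < 0)
		return -1;
	s->cd_table[pasid].ttb = ttb;
	s->cd_table[pasid].asid = (uint16_t)asid;
	return asid;
}
```

Because the ASID is derived from the page table rather than from the PASID, two PASIDs bound to the same table get the same tag and can share IOTLB entries; with the PASID itself as the tag, as in the Intel/AMD model described above, those entries would be tagged separately.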