| Message ID | 364cfbe5b228ab178093db2de13fa3accf7a6120.1678348754.git.nicolinc@nvidia.com (mailing list archive) |
| --- | --- |
| State | New, archived |
| Series | Add Nested Translation Support for SMMUv3 |
Hi Nicolin, On Thu, Mar 09, 2023 at 02:53:38AM -0800, Nicolin Chen wrote: > Add the following data structures for corresponding ioctls: > iommu_hwpt_arm_smmuv3 => IOMMUFD_CMD_HWPT_ALLOC > iommu_hwpt_invalidate_arm_smmuv3 => IOMMUFD_CMD_HWPT_INVALIDATE > > Also, add IOMMU_HW_INFO_TYPE_ARM_SMMUV3 and IOMMU_PGTBL_TYPE_ARM_SMMUV3_S1 > to the header and corresponding type/size arrays. > > Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> > +/** > + * struct iommu_hwpt_arm_smmuv3 - ARM SMMUv3 specific page table data > + * > + * @flags: page table entry attributes > + * @s2vmid: Virtual machine identifier > + * @s1ctxptr: Stage-1 context descriptor pointer > + * @s1cdmax: Number of CDs pointed to by s1ContextPtr > + * @s1fmt: Stage-1 Format > + * @s1dss: Default substream > + */ > +struct iommu_hwpt_arm_smmuv3 { > +#define IOMMU_SMMUV3_FLAG_S2 (1 << 0) /* if unset, stage-1 */ I don't understand the purpose of this flag, since the structure only provides stage-1 configuration fields > +#define IOMMU_SMMUV3_FLAG_VMID (1 << 1) /* vmid override */ Doesn't this break isolation? The VMID space is global for the SMMU, so the guest could access cached mappings of another device > + __u64 flags; > + __u32 s2vmid; > + __u32 __reserved; > + __u64 s1ctxptr; > + __u64 s1cdmax; > + __u64 s1fmt; > + __u64 s1dss; > +}; > + > +/** > + * struct iommu_hwpt_invalidate_arm_smmuv3 - ARM SMMUv3 cahce invalidation info > + * @flags: boolean attributes of cache invalidation command > + * @opcode: opcode of cache invalidation command > + * @ssid: SubStream ID > + * @granule_size: page/block size of the mapping in bytes > + * @range: IOVA range to invalidate > + */ > +struct iommu_hwpt_invalidate_arm_smmuv3 { > +#define IOMMU_SMMUV3_CMDQ_TLBI_VA_LEAF (1 << 0) > + __u64 flags; > + __u8 opcode; > + __u8 padding[3]; > + __u32 asid; > + __u32 ssid; > + __u32 granule_size; > + struct iommu_iova_range range; > +}; Although we can keep the alloc and hardware info separate for each IOMMU architecture, we should try to come up with common invalidation methods. It matters because things like vSVA, or just efficient dynamic mappings, will require optimal invalidation latency. A paravirtual interface like vhost-iommu can help with that, as the host kernel will handle guest invalidations directly instead of doing a round-trip to host userspace (and we'll likely want the same path for PRI.) Supporting HW page tables for a common PV IOMMU does require some architecture-specific knowledge, but invalidation messages contain roughly the same information on all architectures. The PV IOMMU won't include command opcodes for each possible architecture if one generic command does the same job. Ideally I'd like something like this for vhost-iommu: * slow path through userspace for complex requests like attach-table and probe, where the VMM can decode arch-specific information and translate them to iommufd and vhost-iommu ioctls to update the configuration. * fast path within the kernel for performance-critical requests like invalidate, page request and response. It would be absurd for the vhost-iommu driver to translate generic invalidation requests from the guest into arch-specific commands with special opcodes, when the next step is calling the IOMMU driver which does that for free. During previous discussions we came up with generic invalidations that could fit both Arm and x86 [1][2]. The only difference was the ASID (called archid/id in those proposals) which VT-d didn't need. Could we try to build on that? 
[1] https://elixir.bootlin.com/linux/v5.17/source/include/uapi/linux/iommu.h#L161 [2] https://lists.oasis-open.org/archives/virtio-dev/202102/msg00014.html Thanks, Jean
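For readers who have not followed those earlier threads, below is a minimal sketch of the kind of vendor-agnostic invalidation record being referred to, loosely in the spirit of [1][2]; the struct name, fields, and constants are illustrative assumptions for discussion, not the actual contents of either proposal.

```c
#include <linux/types.h>

/*
 * Illustrative sketch only: a generic invalidation record roughly in the
 * spirit of [1][2]. Names and layout are assumptions, not an existing uAPI.
 */
struct iommu_generic_invalidate {
#define IOMMU_INV_CACHE_IOTLB   (1 << 0)   /* IOMMU TLB entries */
#define IOMMU_INV_CACHE_DEVTLB  (1 << 1)   /* device ATC entries */
#define IOMMU_INV_CACHE_PASID   (1 << 2)   /* PASID/context cache entries */
	__u32 cache;       /* bitmask: which caches to invalidate */
#define IOMMU_INV_GRAN_DOMAIN   0          /* whole address space */
#define IOMMU_INV_GRAN_ID       1          /* everything tagged with @id */
#define IOMMU_INV_GRAN_ADDR     2          /* only the range below */
	__u32 granularity;
	__u64 id;          /* arch tag: ASID on Arm, unused on VT-d */
	__u64 addr;        /* start IOVA (IOMMU_INV_GRAN_ADDR only) */
	__u64 size;        /* length in bytes (IOMMU_INV_GRAN_ADDR only) */
	__u64 flags;       /* e.g. a leaf-only hint */
};
```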
On Thu, Mar 09, 2023 at 01:42:17PM +0000, Jean-Philippe Brucker wrote: > Although we can keep the alloc and hardware info separate for each IOMMU > architecture, we should try to come up with common invalidation methods. The invalidation language is tightly linked to the actual cache block and cache tag in the IOMMU HW design. Generality will lose or obfuscate the necessary specificity that is required for creating real vIOMMUs. Further, invalidation is a fast path; it is crazy to take a vIOMMU of a real HW receiving a native invalidation request, mangle it to some obfuscated kernel version and then de-mangle it again in the kernel driver. IMHO ideally qemu will simply point the invalidation at the WQE in the SW vIOMMU command queue and invoke the ioctl. (Nicolin, we should check more into this) The purpose of these interfaces is to support high-performance, full-functionality vIOMMUs of the real HW; we should not lose sight of that goal. We are actually planning to go further and expose direct invalidation work queues complete with HW doorbells to userspace. This obviously must be in native HW format. Nicolin, I think we should tweak the uAPI here so that the invalidation opaque data has a format tagged on its own, instead of re-using the HWPT tag. I.e. you can have an ARM SMMUv3 invalidate type tag and also a virtio-viommu invalidate type tag. This will allow Jean to put the invalidation decoding in the iommu drivers if we think that is the right direction. virtio can standardize it as a "HW format". > Ideally I'd like something like this for vhost-iommu: > > * slow path through userspace for complex requests like attach-table and > probe, where the VMM can decode arch-specific information and translate > them to iommufd and vhost-iommu ioctls to update the configuration. > > * fast path within the kernel for performance-critical requests like > invalidate, page request and response. It would be absurd for the > vhost-iommu driver to translate generic invalidation requests from the > guest into arch-specific commands with special opcodes, when the next > step is calling the IOMMU driver which does that for free. Someone has to do the conversion. If you don't think virtio should do it then I'd be OK to add a type tag for virtio format invalidation and put it in the IOMMU driver. But given virtio overall already has to know *a lot* about how the HW it is wrapping works I don't think it is necessarily absurd for virtio to do the conversion. I'd like to evaluate this in patches in context with how much other unique HW code ends up in kernel-side vhost-iommu. However, I don't know the rationale for virtio-viommu, it seems like a strange direction to me. All the iommu drivers have native command queues. ARM and AMD are both supporting native command queues directly in the guest, complete with a direct guest MMIO doorbell ring. If someone wants to optimize this I'd think the way to do it is to use virtio-like techniques to put SW command queue processing in the kernel iommu driver and continue to use the HW native interface in the VM. What benefit comes from replacing the HW native interface with virtio? Especially on ARM where the native interface is pretty clean? > During previous discussions we came up with generic invalidations that > could fit both Arm and x86 [1][2]. The only difference was the ASID > (called archid/id in those proposals) which VT-d didn't need. Could we try > to build on that? IMHO this was just unioning all the different invalidation types together. 
It makes sense for something like virtio but it is illogical/obfuscated as a user/kernel interface since it still requires a userspace HW driver to understand what subset of the invalidations are used on the actual HW. Jason
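A rough sketch of the tweak being proposed here, i.e. letting the invalidation payload carry its own format tag instead of inheriting the HWPT allocation tag; the enum values and struct layout below are hypothetical and for discussion only.

```c
#include <linux/types.h>

/* Hypothetical sketch: a self-describing invalidation ioctl payload. */
enum iommu_hwpt_invalidate_data_type {
	IOMMU_HWPT_INVALIDATE_DATA_ARM_SMMUV3,    /* native SMMUv3 commands */
	IOMMU_HWPT_INVALIDATE_DATA_VIRTIO_IOMMU,  /* virtio-defined format */
};

struct iommu_hwpt_invalidate {
	__u32 size;        /* sizeof(this struct), for extensibility */
	__u32 hwpt_id;     /* which hw_pagetable to invalidate */
	__u32 data_type;   /* enum iommu_hwpt_invalidate_data_type */
	__u32 data_len;    /* length of the type-specific payload */
	__u64 data_uptr;   /* user pointer to the type-specific payload */
};
```

The point is only that the format tag travels with the invalidation data, so a driver could accept, say, both a native SMMUv3 payload and a virtio-defined payload without changing the HWPT allocation path.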
> -----Original Message----- > From: Jean-Philippe Brucker [mailto:jean-philippe@linaro.org] > Sent: 09 March 2023 13:42 > To: Nicolin Chen <nicolinc@nvidia.com> > Cc: jgg@nvidia.com; robin.murphy@arm.com; will@kernel.org; > eric.auger@redhat.com; kevin.tian@intel.com; baolu.lu@linux.intel.com; > joro@8bytes.org; Shameerali Kolothum Thodi > <shameerali.kolothum.thodi@huawei.com>; > linux-arm-kernel@lists.infradead.org; iommu@lists.linux.dev; > linux-kernel@vger.kernel.org; yi.l.liu@intel.com > Subject: Re: [PATCH v1 02/14] iommufd: Add nesting related data structures > for ARM SMMUv3 > > Hi Nicolin, > > On Thu, Mar 09, 2023 at 02:53:38AM -0800, Nicolin Chen wrote: > > Add the following data structures for corresponding ioctls: > > iommu_hwpt_arm_smmuv3 => > IOMMUFD_CMD_HWPT_ALLOC > > iommu_hwpt_invalidate_arm_smmuv3 => > IOMMUFD_CMD_HWPT_INVALIDATE > > > > Also, add IOMMU_HW_INFO_TYPE_ARM_SMMUV3 and > IOMMU_PGTBL_TYPE_ARM_SMMUV3_S1 > > to the header and corresponding type/size arrays. > > > > Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> > > > +/** > > + * struct iommu_hwpt_arm_smmuv3 - ARM SMMUv3 specific page table > data > > + * > > + * @flags: page table entry attributes > > + * @s2vmid: Virtual machine identifier > > + * @s1ctxptr: Stage-1 context descriptor pointer > > + * @s1cdmax: Number of CDs pointed to by s1ContextPtr > > + * @s1fmt: Stage-1 Format > > + * @s1dss: Default substream > > + */ > > +struct iommu_hwpt_arm_smmuv3 { > > +#define IOMMU_SMMUV3_FLAG_S2 (1 << 0) /* if unset, stage-1 */ > > I don't understand the purpose of this flag, since the structure only > provides stage-1 configuration fields > > > +#define IOMMU_SMMUV3_FLAG_VMID (1 << 1) /* vmid override */ > > Doesn't this break isolation? The VMID space is global for the SMMU, so > the guest could access cached mappings of another device On platforms that support BTM [1], we may need the VMID allocated by KVM. But again, getting that from user space doesn't look safe. I have attempted to revise the earlier RFC to pin and use the KVM VMID from SMMUv3 here [2]. But the problem is getting the KVM instance associated with the device. Currently I am going through the VFIO layer to retrieve the KVM instance (vfio_device->kvm). On the previous RFC discussion thread [3], Jean proposed: " In the new design we can require from the start that creating a nesting IOMMU container through /dev/iommu *must* come with a KVM context, that way we're sure to reuse the existing VMID. " Is that something we can still do, or is there a better way to handle this now? Thanks, Shameer 1. https://lore.kernel.org/linux-arm-kernel/YEEUocRn3IfIDpLj@myrica/T/#m478f7e7d5dcb729e02721beda35efa12c1d20707 2. https://github.com/hisilicon/kernel-dev/commits/iommufd-v6.2-rc4-nesting-arm-btm-v2 3. https://lore.kernel.org/linux-arm-kernel/YEEUocRn3IfIDpLj@myrica/T/#m11cde7534943ea7cf35f534cb809a023eabd9da3
On Thu, Mar 09, 2023 at 03:26:12PM +0000, Shameerali Kolothum Thodi wrote: > On platforms that supports BTM [1], we may need the VMID allocated by KVM. > But again getting that from user pace doesn't look safe. I have attempted to revise > the earlier RFC to pin and use the KVM VMID from SMMUv3 here[2]. Gurk > " In the new design we can require from the start that creating a nesting IOMMU > container through /dev/iommu *must* come with a KVM context, that way > we're sure to reuse the existing VMID. " I've been dreading this, but yes, I expect we will eventually need to connect kvm and iommufd together. The iommu driver can receive a kvm pointer as part of the alloc domain operation to do any setup like this. If there is no KVM it should either fail to set up the domain or set up a domain disconnected from KVM. If IOMMU HW and KVM HW are using the same ID number space then arguably the two kernel drivers need to use a shared ID allocator in the arch, regardless of what iommufd/etc does. Using KVM should not be mandatory for iommufd. For ARM cases where there is no shared VMID space with KVM, the ARM VMID should be somehow assigned to the iommufd_ctx itself and the alloc domain op should receive it from there. Nicolin, that seems to be missing in this series? I'm not entirely sure how to elegantly code it :\ Jason
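As a shape for what is being described (not a proposal for the actual ops), an allocation path that optionally carries the KVM context might look like the sketch below; every name in it is invented for illustration.

```c
#include <stddef.h>

struct kvm;            /* opaque here: only its identity/VMID matters */
struct device;
struct iommu_domain;

/*
 * Hypothetical sketch: a nested-domain allocation op that receives the KVM
 * instance (or NULL). On BTM platforms the driver could take the VMID from
 * KVM; otherwise it falls back to its own VMID allocator or fails, per the
 * policy discussed above.
 */
struct iommu_nested_ops_sketch {
	struct iommu_domain *(*domain_alloc_nested)(struct device *dev,
						    struct iommu_domain *s2_parent,
						    struct kvm *kvm,
						    const void *user_data,
						    size_t user_data_len);
};
```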
> -----Original Message----- > From: Jason Gunthorpe [mailto:jgg@nvidia.com] > Sent: 09 March 2023 15:40 > To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>; Nicolin Chen > <nicolinc@nvidia.com>; robin.murphy@arm.com; will@kernel.org; > eric.auger@redhat.com; kevin.tian@intel.com; baolu.lu@linux.intel.com; > joro@8bytes.org; linux-arm-kernel@lists.infradead.org; > iommu@lists.linux.dev; linux-kernel@vger.kernel.org; yi.l.liu@intel.com > Subject: Re: [PATCH v1 02/14] iommufd: Add nesting related data structures > for ARM SMMUv3 > > On Thu, Mar 09, 2023 at 03:26:12PM +0000, Shameerali Kolothum Thodi > wrote: > > > On platforms that supports BTM [1], we may need the VMID allocated by > KVM. > > But again getting that from user pace doesn't look safe. I have attempted > to revise > > the earlier RFC to pin and use the KVM VMID from SMMUv3 here[2]. > > Gurk > > > " In the new design we can require from the start that creating a nesting > IOMMU > > container through /dev/iommu *must* come with a KVM context, that way > > we're sure to reuse the existing VMID. " > > I've been dreading this but yes I execpt we will eventually need to > connect kvm and iommufd together. The iommu driver can receive a kvm > pointer as part of the alloc domain operation to do any setups like > this. That will make life easier :) > If there is no KVM it should either fail to setup the domain or setup > a domain disconnected from KVM. > If no KVM the SMMUv3 can fall back to its internal VMID allocation I guess. And my intention was to use KVM VMID only if the platform supports BTM. > If IOMMU HW and KVM HW are using the same ID number space then > arguably the two kernel drivers need to use a shared ID allocator in > the arch, regardless of what iommufd/etc does. Using KVM should not be > mandatory for iommufd. > > For ARM cases where there is no shared VMID space with KVM, the ARM > VMID should be somehow assigned to the iommfd_ctx itself and the alloc > domain op should receive it from there. Is there any use of VMID outside SMMUv3? I was thinking if nested domain alloc doesn't provide the KVM instance, then SMMUv3 can use its internal VMID. Thanks, Shameer > Nicolin, that seems to be missing in this series? I'm not entirely > sure how to elegantly code it :\ > > Jason
On Thu, Mar 09, 2023 at 03:51:42PM +0000, Shameerali Kolothum Thodi wrote: > > For ARM cases where there is no shared VMID space with KVM, the ARM > > VMID should be somehow assigned to the iommfd_ctx itself and the alloc > > domain op should receive it from there. > > Is there any use of VMID outside SMMUv3? I was thinking if nested domain alloc > doesn't provide the KVM instance, then SMMUv3 can use its internal VMID. When we talk about exposing an SMMUv3 IOMMU CMDQ directly to userspace then VMID is the security token that protects it. So in that environment every domain under the same iommufd should share the same VMID so that the CMDQ's also share the same VMID. I expect this to be a common sort of requirement as we will see userspace command queues in the other HW as well. So, I suppose the answer for now is that ARM SMMUv3 should just allocate one VMID per iommu_domain and there should be no VMID in the uapi at all. Moving all iommu_domains to share the same VMID is a future patch. Though.. I have no idea how vVMID is handled in the SMMUv3 architecture. I suppose the guest IOMMU HW caps are set in a way that it knows it does not have VMID? Jason
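For reference, "one VMID per S2 iommu_domain and nothing in the uAPI" is essentially what a per-SMMU allocator gives you; below is a toy sketch of that pattern, mirroring the idea rather than the driver's actual code.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Toy sketch of a per-SMMU VMID allocator: each stage-2 domain gets a
 * private VMID out of a 2^vmid_bits space and userspace never sees it.
 */
struct smmu_vmid_space {
	uint64_t *map;            /* bitmap with (1 << vmid_bits) bits */
	unsigned int vmid_bits;
};

static int vmid_alloc(struct smmu_vmid_space *s)
{
	unsigned int nr = 1u << s->vmid_bits;

	for (unsigned int i = 1; i < nr; i++) {   /* keep VMID 0 unused */
		if (!(s->map[i / 64] & (1ull << (i % 64)))) {
			s->map[i / 64] |= 1ull << (i % 64);
			return i;                 /* this domain's VMID */
		}
	}
	return -ENOSPC;
}
```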
> -----Original Message----- > From: Jason Gunthorpe [mailto:jgg@nvidia.com] > Sent: 09 March 2023 16:00 > To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>; Nicolin Chen > <nicolinc@nvidia.com>; robin.murphy@arm.com; will@kernel.org; > eric.auger@redhat.com; kevin.tian@intel.com; baolu.lu@linux.intel.com; > joro@8bytes.org; linux-arm-kernel@lists.infradead.org; > iommu@lists.linux.dev; linux-kernel@vger.kernel.org; yi.l.liu@intel.com > Subject: Re: [PATCH v1 02/14] iommufd: Add nesting related data structures > for ARM SMMUv3 > > On Thu, Mar 09, 2023 at 03:51:42PM +0000, Shameerali Kolothum Thodi > wrote: > > > > For ARM cases where there is no shared VMID space with KVM, the ARM > > > VMID should be somehow assigned to the iommfd_ctx itself and the alloc > > > domain op should receive it from there. > > > > Is there any use of VMID outside SMMUv3? I was thinking if nested domain > alloc > > doesn't provide the KVM instance, then SMMUv3 can use its internal VMID. > > When we talk about exposing an SMMUv3 IOMMU CMDQ directly to > userspace then > VMID is the security token that protects it. > > So in that environment every domain under the same iommufd should > share the same VMID so that the CMDQ's also share the same VMID. > > I expect this to be a common sort of requirement as we will see > userspace command queues in the other HW as well. > > So, I suppose the answer for now is that ARM SMMUv3 should just > allocate one VMID per iommu_domain and there should be no VMID in the > uapi at all. > > Moving all iommu_domains to share the same VMID is a future patch. > > Though.. I have no idea how vVMID is handled in the SMMUv3 > architecture. I suppose the guest IOMMU HW caps are set in a way that > it knows it does not have VMID? I think, Guest only sets up the SMMUv3 S1 stage and it doesn't use VMID. Thanks, Shameer
On Thu, Mar 09, 2023 at 10:48:50AM -0400, Jason Gunthorpe wrote: > On Thu, Mar 09, 2023 at 01:42:17PM +0000, Jean-Philippe Brucker wrote: > > > Although we can keep the alloc and hardware info separate for each IOMMU > > architecture, we should try to come up with common invalidation methods. > > The invalidation language is tightly linked to the actual cache block > and cache tag in the IOMMU HW design. Concretely though, what are the incompatibilities between the HW designs? They all need to remove a range of TLB entries, using some address space tag. But if there is an actual difference I do need to know. > Generality will loose or > obfuscate the necessary specificity that is required for creating real > vIOMMUs. > > Further, invalidation is a fast path, it is crazy to take a vIOMMU of > a real HW receving a native invalidation request, mangle it to some > obfuscated kernel version and then de-mangle it again in the kernel > driver. IMHO ideally qemu will simply point the invalidation at the > WQE in the SW vIOMMU command queue and invoke the ioctl. (Nicolin, we > should check more into this) Avoiding copying a few bytes won't make up for the extra context switches to userspace. An emulated IOMMU can easily decode commands and translate them to generic kernel structures, in a handful of CPU cycles, just like they decode STEs. It's what they do, and it's the opposite of obfuscation. > > The purpose of these interfaces is to support high performance full > functionality vIOMMUs of the real HW, we should not loose sight of > that goal. > > We are actually planning to go futher and expose direct invalidation > work queues complete with HW doorbells to userspace. This obviously > must be in native HW format. Doesn't seem relevant since direct access to command queue wouldn't use this struct. > > Nicolin, I think we should tweak the uAPI here so that the > invalidation opaque data has a format tagged on its own, instead of > re-using the HWPT tag. Ie you can have a ARM SMMUv3 invalidate type > tag and also a virtio-viommu invalidate type tag. > > This will allow Jean to put the invalidation decoding in the iommu > drivers if we think that is the right direction. virtio can > standardize it as a "HW format". > > > Ideally I'd like something like this for vhost-iommu: > > > > * slow path through userspace for complex requests like attach-table and > > probe, where the VMM can decode arch-specific information and translate > > them to iommufd and vhost-iommu ioctls to update the configuration. > > > > * fast path within the kernel for performance-critical requests like > > invalidate, page request and response. It would be absurd for the > > vhost-iommu driver to translate generic invalidation requests from the > > guest into arch-specific commands with special opcodes, when the next > > step is calling the IOMMU driver which does that for free. > > Someone has to do the conversion. If you don't think virito should do > it then I'd be OK to add a type tag for virtio format invalidation and > put it in the IOMMU driver. Implementing two invalidation formats in each IOMMU driver does not seem practical. > > But given virtio overall already has to know *alot* about how the HW > it is wrapping works I don't think it is necessarily absurd for virtio > to do the conversion. I'd like to evaluate this in patches in context > with how much other unique HW code ends up in kernel-side vhost-iommu. Ideally none. 
I'd rather leave those, attach and probe, in userspace and if possible compatible with iommufd to avoid register decoding. > > However, I don't know the rational for virtio-viommu, it seems like a > strange direction to me. A couple of reasons are relevant here: non-QEMU VMMs don't want to emulate all vendor IOMMUs, new architectures get vIOMMU mostly for free, and vhost provides a faster path. Also the ability to optimize paravirtual interfaces for things like combined invalidation (IOTLB+ATC) or, later, nested page requests. For a while the main vIOMMU use-case was assignment to guest userspace, mainly DPDK, which works great with a generic and slow map/unmap interface. Since vSVA is still a niche use-case, and nesting without page faults requires pinning the whole guest memory, map/unmap still seems more desirable to me. But there is some renewed interest in supporting page tables with virtio-iommu for the reasons above. > All the iommu drivers have native command > queues. ARM and AMD are both supporting native command queues directly > in the guest, complete with a direct guest MMIO doorbell ring. Arm SMMUv3 mandates a single global command queue (SMMUv2 uses registers). An SMMUv3 can optionally implement multiple command queues, though I don't know if they can be safely assigned to guests. For a lot of SMMUv3 implementations that have a single queue and for other architectures, we can do better than hardware emulation. > > If someone wants to optimize this I'd think the way to do it is to use > virtio like techniques to put SW command queue processing in the > kernel iommu driver and continue to use the HW native interface in the > VM. I didn't get which kernel this is. > > What benifit comes from replacing the HW native interface with virtio? > Especially on ARM where the native interface is pretty clean? > > > During previous discussions we came up with generic invalidations that > > could fit both Arm and x86 [1][2]. The only difference was the ASID > > (called archid/id in those proposals) which VT-d didn't need. Could we try > > to build on that? > > IMHO this was just unioning all the different invalidation types > together. It makes sense for something like virtio but it is > illogical/obfuscated as a user/kernel interface since it still > requires a userspace HW driver to understand what subset of the > invalidations are used on the actual HW. As above, decoding arch-specific structures into generic ones is what an emulated IOMMU does, and it doesn't make a performance difference in which format it forwards that to the kernel. The host IOMMU driver checks the guest request and copies them into the command queue. Whether that request comes in the form of a structure binary-compatible with Arm SMMUvX.Y, or some generic structure, does not make a difference. Thanks, Jean
On Thu, Mar 09, 2023 at 06:26:59PM +0000, Jean-Philippe Brucker wrote: > On Thu, Mar 09, 2023 at 10:48:50AM -0400, Jason Gunthorpe wrote: > > On Thu, Mar 09, 2023 at 01:42:17PM +0000, Jean-Philippe Brucker wrote: > > > > > Although we can keep the alloc and hardware info separate for each IOMMU > > > architecture, we should try to come up with common invalidation methods. > > > > The invalidation language is tightly linked to the actual cache block > > and cache tag in the IOMMU HW design. > > Concretely though, what are the incompatibilities between the HW designs? > They all need to remove a range of TLB entries, using some address space > tag. But if there is an actual difference I do need to know. For instance the address space tags and the cache entries they match to are wildly different. ARM uses a fine-grained ASID and Intel uses a shared ASID called a DID and incorporates the PASID into the cache tag. AMD uses something called a DID that covers a different set of stuff than the Intel DID, and it doesn't seem to work for nesting. AMD uses PASID as the primary nested cache tag. Superficially you can say all three have an ASID and you can have an invalidate-ASID operation and make it "look" the same, but the actual behavior is totally ill-defined and the whole thing is utterly obfuscated as to what it actually MEANS. But this doesn't matter for virtio. You have already got a spec that defines invalidation in terms of virtio objects that sort of match things like iommu_domains. I hope the virtio VIRTIO_IOMMU_INVAL_S_DOMAIN is very well defined as to exactly what objects a DOMAIN match applies to. The job of the hypervisor is to translate that to whatever invalidation(s) the real HW requires. I.e. virtio is going to say "invalidate this domain" and expect the hypervisor to spew it to every attached PASID if that is what the HW requires (e.g. AMD), or do a single ASID invalidation (Intel, sometimes). But when a vIOMMU gets a vDID or vPASID invalidation command it doesn't mean the same thing as virtio. The driver must invalidate exactly what the vIOMMU programming model says to invalidate because the guest is going to spew more invalidations to cover what it needs. Over-invalidation would be a performance problem. Exposing these subtle differences to userspace is necessary. What I do not want is leaking those differences through an ill-defined "generic" interface. Even more importantly, Intel and ARM should not have to fight about the subtle implementation-specific details of the specification of the "generic" interface. If Intel needs something dumb to make their viommu work well then they should simply be able to do it. I don't want to spend 6 months of pointless arguing about language details in an unnecessary "generic" spec. > > The purpose of these interfaces is to support high performance full > > functionality vIOMMUs of the real HW, we should not loose sight of > > that goal. > > > > We are actually planning to go futher and expose direct invalidation > > work queues complete with HW doorbells to userspace. This obviously > > must be in native HW format. > > Doesn't seem relevant since direct access to command queue wouldn't use > this struct. The point is our design direction with iommufd is to expose the raw HW to userspace, not to obfuscate it with ill-defined generalizations. > > Someone has to do the conversion. If you don't think virito should do > > it then I'd be OK to add a type tag for virtio format invalidation and > > put it in the IOMMU driver. 
> > Implementing two invalidation formats in each IOMMU driver does not seem > practical. I don't see why. The advantage of the kernel side is that the implementation is not strong ABI. If we want to adjust and review the virtio invalidation path as new HW comes along we can, so long as it is only in the kernel. On the other hand if we mess up the uABI for iommufd we are stuck with it. The safest and best uABI for iommufd is the HW native uABI because it, almost by definition, cannot be wrong. Anyhow, I'm still not very convinced adapting to the virtio invalidation format should be in iommu drivers. I think what you end up with for virtio is that Intel/AMD have some nice common code to invalidate an iommu_domain address range (probably even the existing invalidation interface), and SMMUv3 is just totally different and special. This is because SMMUv3 has no option to keep the PASID table in the hypervisor so you are sadly forced to expose both the native ASID and native PASID caches to the virtio protocol. Given that the VM virtio driver has to have SMMUv3-specific code to handle the CD table it must get, I don't see the problem with also having SMMUv3-specific code in the hypervisor virtio driver to handle invalidating based on the CD table. Really, I want to see patches implementing all of this before we make any decision on what the ops interface is for virtio-iommu kernel side. > > However, I don't know the rational for virtio-viommu, it seems like a > > strange direction to me. > > A couple of reasons are relevant here: non-QEMU VMMs don't want to emulate > all vendor IOMMUs, new architectures get vIOMMU mostly for free, So your argument is you can implement a simple map/unmap API riding on the common IOMMU API and this is portable? Seems sensible, but that falls apart pretty quickly when we talk about nesting... I don't think we can avoid VMM components to set this up, so it stops being portable. At that point I'm back to asking why not use the real HW model? > > All the iommu drivers have native command > > queues. ARM and AMD are both supporting native command queues directly > > in the guest, complete with a direct guest MMIO doorbell ring. > > Arm SMMUv3 mandates a single global command queue (SMMUv2 uses > registers). An SMMUv3 can optionally implement multiple command > queues, though I don't know if they can be safely assigned to > guests. It is not standardized by ARM, but it can (and has) been done. > For a lot of SMMUv3 implementations that have a single queue and for > other architectures, we can do better than hardware emulation. How is using a SW emulated virtio formatted queue better than using a SW emulated SMMUv3 ECMDQ? The vSMMUv3 driver controls what capabilities are shown to the guest; it can definitely create an ECMDQ-enabled device and do something like assign the secondary ECMDQs to hypervisor kernel SW queues the same way virtio does. I don't think there is a very solid argument that virtio-iommu is necessary to get faster invalidation. > > If someone wants to optimize this I'd think the way to do it is to use > > virtio like techniques to put SW command queue processing in the > > kernel iommu driver and continue to use the HW native interface in the > > VM. > > I didn't get which kernel this is. hypervisor kernel. > > IMHO this was just unioning all the different invalidation types > > together. 
It makes sense for something like virtio but it is > > illogical/obfuscated as a user/kernel interface since it still > > requires a userspace HW driver to understand what subset of the > > invalidations are used on the actual HW. > > As above, decoding arch-specific structures into generic ones is what an > emulated IOMMU does, No, it is what virtio wants to do. We are deliberately trying not to do that for real accelerated HW vIOMMU emulators. > and it doesn't make a performance difference in which > format it forwards that to the kernel. The host IOMMU driver checks the > guest request and copies them into the command queue. Whether that request > comes in the form of a structure binary-compatible with Arm SMMUvX.Y, or > some generic structure, does not make a difference. It is not the structure layouts that matter! It is the semantic meaning of each request, on each unique piece of hardware. We actually want to leak the subtle semantic differences to userspace. Doing that and continuing to give them the same command label is exactly the kind of obfuscated ill defined "generic" interface I do not want. Jason
On Thu, Mar 09, 2023 at 10:48:50AM -0400, Jason Gunthorpe wrote: > Nicolin, I think we should tweak the uAPI here so that the > invalidation opaque data has a format tagged on its own, instead of > re-using the HWPT tag. Ie you can have a ARM SMMUv3 invalidate type > tag and also a virtio-viommu invalidate type tag. The invalidation tag is shared with the hwpt allocation. Does it mean that virtio-iommu won't have its own allocation tag? Thanks Nic
Hi Jean, Allow me to partially reply to your email: On Thu, Mar 09, 2023 at 01:42:17PM +0000, Jean-Philippe Brucker wrote: > > +struct iommu_hwpt_arm_smmuv3 { > > +#define IOMMU_SMMUV3_FLAG_S2 (1 << 0) /* if unset, stage-1 */ > > I don't understand the purpose of this flag, since the structure only > provides stage-1 configuration fields I should probably have put more description on this flag. It is used to allocate a stage-2 domain for a nested translation setup. The default allocation for a kernel-managed domain will allocate an S1 format of IO page table, at ARM_SMMU_DOMAIN_S1 stage. But a nested kernel-managed domain needs an S2 format, at ARM_SMMU_DOMAIN_S2. So the whole structure seems to only provide stage-1 info but it's used for both stages. And a stage-2 allocation will only need s2vmid if the VMID flag is set (explained below). > > +#define IOMMU_SMMUV3_FLAG_VMID (1 << 1) /* vmid override */ > > Doesn't this break isolation? The VMID space is global for the SMMU, so > the guest could access cached mappings of another device This flag isn't mature yet. I kept it from my internal RFC to see if we can have a better solution. There are use cases on certain platforms where the VMIDs across all devices in the same VM need to be aligned. Thanks Nic
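To make that concrete, the allocation-time branch described above would look roughly like the sketch below on the driver side; the stage names mirror the existing driver, while the helper itself is an invented illustration rather than code from the posted series.

```c
#include <linux/types.h>

/* Stage selection sketch: which kind of domain the allocation produces. */
enum arm_smmu_domain_stage {
	ARM_SMMU_DOMAIN_S1,    /* user-managed stage-1 (CD-based) table */
	ARM_SMMU_DOMAIN_S2,    /* kernel-managed stage-2 table for nesting */
};

#define IOMMU_SMMUV3_FLAG_S2	(1 << 0)

static enum arm_smmu_domain_stage stage_from_hwpt_flags(__u64 flags)
{
	/*
	 * Flag set: allocate the S2 parent of a nested setup.
	 * Flag clear: allocate the S1 domain described by
	 * s1ctxptr/s1cdmax/s1fmt/s1dss.
	 */
	return (flags & IOMMU_SMMUV3_FLAG_S2) ? ARM_SMMU_DOMAIN_S2
					      : ARM_SMMU_DOMAIN_S1;
}
```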
On Thu, Mar 09, 2023 at 11:40:16AM -0400, Jason Gunthorpe wrote: > On Thu, Mar 09, 2023 at 03:26:12PM +0000, Shameerali Kolothum Thodi wrote: > > > On platforms that supports BTM [1], we may need the VMID allocated by KVM. > > But again getting that from user pace doesn't look safe. I have attempted to revise > > the earlier RFC to pin and use the KVM VMID from SMMUv3 here[2]. > > Gurk > > > " In the new design we can require from the start that creating a nesting IOMMU > > container through /dev/iommu *must* come with a KVM context, that way > > we're sure to reuse the existing VMID. " > > I've been dreading this but yes I execpt we will eventually need to > connect kvm and iommufd together. The iommu driver can receive a kvm > pointer as part of the alloc domain operation to do any setups like > this. > > If there is no KVM it should either fail to setup the domain or setup > a domain disconnected from KVM. > > If IOMMU HW and KVM HW are using the same ID number space then > arguably the two kernel drivers need to use a shared ID allocator in > the arch, regardless of what iommufd/etc does. Using KVM should not be > mandatory for iommufd. > > For ARM cases where there is no shared VMID space with KVM, the ARM > VMID should be somehow assigned to the iommfd_ctx itself and the alloc > domain op should receive it from there. > > Nicolin, that seems to be missing in this series? I'm not entirely > sure how to elegantly code it :\ Yea, it's missing. The VMID thing is supposed to be a sneak peek of my next VCMDQ solution. Now it seems that BTM needs this too. Remember that my previous VCMDQ series had a big complication to share VMID across the passthrough devices in the same VM? During that patch review, we concluded that IOMMUFD would simply align VMIDs using a unified ctx ID or so, IIRC. Thanks Nic
On Thu, Mar 09, 2023 at 04:07:54PM +0000, Shameerali Kolothum Thodi wrote: > External email: Use caution opening links or attachments > > > > -----Original Message----- > > From: Jason Gunthorpe [mailto:jgg@nvidia.com] > > Sent: 09 March 2023 16:00 > > To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> > > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>; Nicolin Chen > > <nicolinc@nvidia.com>; robin.murphy@arm.com; will@kernel.org; > > eric.auger@redhat.com; kevin.tian@intel.com; baolu.lu@linux.intel.com; > > joro@8bytes.org; linux-arm-kernel@lists.infradead.org; > > iommu@lists.linux.dev; linux-kernel@vger.kernel.org; yi.l.liu@intel.com > > Subject: Re: [PATCH v1 02/14] iommufd: Add nesting related data structures > > for ARM SMMUv3 > > > > On Thu, Mar 09, 2023 at 03:51:42PM +0000, Shameerali Kolothum Thodi > > wrote: > > > > > > For ARM cases where there is no shared VMID space with KVM, the ARM > > > > VMID should be somehow assigned to the iommfd_ctx itself and the alloc > > > > domain op should receive it from there. > > > > > > Is there any use of VMID outside SMMUv3? I was thinking if nested domain > > alloc > > > doesn't provide the KVM instance, then SMMUv3 can use its internal VMID. > > > > When we talk about exposing an SMMUv3 IOMMU CMDQ directly to > > userspace then > > VMID is the security token that protects it. > > > > So in that environment every domain under the same iommufd should > > share the same VMID so that the CMDQ's also share the same VMID. > > > > I expect this to be a common sort of requirement as we will see > > userspace command queues in the other HW as well. > > > > So, I suppose the answer for now is that ARM SMMUv3 should just > > allocate one VMID per iommu_domain and there should be no VMID in the > > uapi at all. > > > > Moving all iommu_domains to share the same VMID is a future patch. > > > > Though.. I have no idea how vVMID is handled in the SMMUv3 > > architecture. I suppose the guest IOMMU HW caps are set in a way that > > it knows it does not have VMID? > > I think, Guest only sets up the SMMUv3 S1 stage and it doesn't use VMID. Yea, a vmid is only allocated in an S2 domain allocation. So, a guest allocating only S1 domains always sets VMID=0. Yet, I think that the hypervisor or some where in host kernel should replace the VMID=0 with a unified VMID. Thanks Nic
On Thu, Mar 09, 2023 at 09:26:57PM -0800, Nicolin Chen wrote: > On Thu, Mar 09, 2023 at 04:07:54PM +0000, Shameerali Kolothum Thodi wrote: > > External email: Use caution opening links or attachments > > > > > > > -----Original Message----- > > > From: Jason Gunthorpe [mailto:jgg@nvidia.com] > > > Sent: 09 March 2023 16:00 > > > To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> > > > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>; Nicolin Chen > > > <nicolinc@nvidia.com>; robin.murphy@arm.com; will@kernel.org; > > > eric.auger@redhat.com; kevin.tian@intel.com; baolu.lu@linux.intel.com; > > > joro@8bytes.org; linux-arm-kernel@lists.infradead.org; > > > iommu@lists.linux.dev; linux-kernel@vger.kernel.org; yi.l.liu@intel.com > > > Subject: Re: [PATCH v1 02/14] iommufd: Add nesting related data structures > > > for ARM SMMUv3 > > > > > > On Thu, Mar 09, 2023 at 03:51:42PM +0000, Shameerali Kolothum Thodi > > > wrote: > > > > > > > > For ARM cases where there is no shared VMID space with KVM, the ARM > > > > > VMID should be somehow assigned to the iommfd_ctx itself and the alloc > > > > > domain op should receive it from there. > > > > > > > > Is there any use of VMID outside SMMUv3? I was thinking if nested domain > > > alloc > > > > doesn't provide the KVM instance, then SMMUv3 can use its internal VMID. > > > > > > When we talk about exposing an SMMUv3 IOMMU CMDQ directly to > > > userspace then > > > VMID is the security token that protects it. > > > > > > So in that environment every domain under the same iommufd should > > > share the same VMID so that the CMDQ's also share the same VMID. > > > > > > I expect this to be a common sort of requirement as we will see > > > userspace command queues in the other HW as well. > > > > > > So, I suppose the answer for now is that ARM SMMUv3 should just > > > allocate one VMID per iommu_domain and there should be no VMID in the > > > uapi at all. > > > > > > Moving all iommu_domains to share the same VMID is a future patch. > > > > > > Though.. I have no idea how vVMID is handled in the SMMUv3 > > > architecture. I suppose the guest IOMMU HW caps are set in a way that > > > it knows it does not have VMID? > > > > I think, Guest only sets up the SMMUv3 S1 stage and it doesn't use VMID. > > Yea, a vmid is only allocated in an S2 domain allocation. So, > a guest allocating only S1 domains always sets VMID=0. Yet, I > think that the hypervisor or some where in host kernel should > replace the VMID=0 with a unified VMID. Ah, I just recall a conversation with Jason that a VM should only have one S2 domain. In that case, the VMID is already unified? Thanks Nic
Hi, On 3/9/23 14:42, Jean-Philippe Brucker wrote: > Hi Nicolin, > > On Thu, Mar 09, 2023 at 02:53:38AM -0800, Nicolin Chen wrote: >> Add the following data structures for corresponding ioctls: >> iommu_hwpt_arm_smmuv3 => IOMMUFD_CMD_HWPT_ALLOC >> iommu_hwpt_invalidate_arm_smmuv3 => IOMMUFD_CMD_HWPT_INVALIDATE >> >> Also, add IOMMU_HW_INFO_TYPE_ARM_SMMUV3 and IOMMU_PGTBL_TYPE_ARM_SMMUV3_S1 >> to the header and corresponding type/size arrays. >> >> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> >> +/** >> + * struct iommu_hwpt_arm_smmuv3 - ARM SMMUv3 specific page table data >> + * >> + * @flags: page table entry attributes >> + * @s2vmid: Virtual machine identifier >> + * @s1ctxptr: Stage-1 context descriptor pointer >> + * @s1cdmax: Number of CDs pointed to by s1ContextPtr >> + * @s1fmt: Stage-1 Format >> + * @s1dss: Default substream >> + */ >> +struct iommu_hwpt_arm_smmuv3 { >> +#define IOMMU_SMMUV3_FLAG_S2 (1 << 0) /* if unset, stage-1 */ > I don't understand the purpose of this flag, since the structure only > provides stage-1 configuration fields > >> +#define IOMMU_SMMUV3_FLAG_VMID (1 << 1) /* vmid override */ > Doesn't this break isolation? The VMID space is global for the SMMU, so > the guest could access cached mappings of another device > >> + __u64 flags; >> + __u32 s2vmid; >> + __u32 __reserved; >> + __u64 s1ctxptr; >> + __u64 s1cdmax; >> + __u64 s1fmt; >> + __u64 s1dss; >> +}; >> + > >> +/** >> + * struct iommu_hwpt_invalidate_arm_smmuv3 - ARM SMMUv3 cahce invalidation info >> + * @flags: boolean attributes of cache invalidation command >> + * @opcode: opcode of cache invalidation command >> + * @ssid: SubStream ID >> + * @granule_size: page/block size of the mapping in bytes >> + * @range: IOVA range to invalidate >> + */ >> +struct iommu_hwpt_invalidate_arm_smmuv3 { >> +#define IOMMU_SMMUV3_CMDQ_TLBI_VA_LEAF (1 << 0) >> + __u64 flags; >> + __u8 opcode; >> + __u8 padding[3]; >> + __u32 asid; >> + __u32 ssid; >> + __u32 granule_size; >> + struct iommu_iova_range range; >> +}; > Although we can keep the alloc and hardware info separate for each IOMMU > architecture, we should try to come up with common invalidation methods. > > It matters because things like vSVA, or just efficient dynamic mappings, > will require optimal invalidation latency. A paravirtual interface like > vhost-iommu can help with that, as the host kernel will handle guest > invalidations directly instead of doing a round-trip to host userspace > (and we'll likely want the same path for PRI.) > > Supporting HW page tables for a common PV IOMMU does require some > architecture-specific knowledge, but invalidation messages contain roughly > the same information on all architectures. The PV IOMMU won't include > command opcodes for each possible architecture if one generic command does > the same job. > > Ideally I'd like something like this for vhost-iommu: > > * slow path through userspace for complex requests like attach-table and > probe, where the VMM can decode arch-specific information and translate > them to iommufd and vhost-iommu ioctls to update the configuration. > > * fast path within the kernel for performance-critical requests like > invalidate, page request and response. It would be absurd for the > vhost-iommu driver to translate generic invalidation requests from the > guest into arch-specific commands with special opcodes, when the next > step is calling the IOMMU driver which does that for free. 
> > During previous discussions we came up with generic invalidations that > could fit both Arm and x86 [1][2]. The only difference was the ASID > (called archid/id in those proposals) which VT-d didn't need. Could we try > to build on that? I do agree with Jean. We all spent a lot of effort together to define this generic invalidation API and, unless there is a compelling reason that prevents us from using it, we should try to reuse it. Thanks Eric > > [1] https://elixir.bootlin.com/linux/v5.17/source/include/uapi/linux/iommu.h#L161 > [2] https://lists.oasis-open.org/archives/virtio-dev/202102/msg00014.html > > Thanks, > Jean >
On Thu, Mar 09, 2023 at 05:01:15PM -0400, Jason Gunthorpe wrote: > > Concretely though, what are the incompatibilities between the HW designs? > > They all need to remove a range of TLB entries, using some address space > > tag. But if there is an actual difference I do need to know. > > For instance the address space tags and the cache entires they match > to are wildly different. > > ARM uses a fine grained ASID and Intel uses a shared ASID called a DID > and incorporates the PASID into the cache tag. > > AMD uses something called a DID that covers a different set of stuff > than the Intel DID, and it doesn't seem to work for nesting. AMD uses > PASID as the primary nested cache tag. Thanks, we'll look into that > This is because SMMUv3 has no option to keep the PASID table in the > hypervisor so you are sadly forced to expose both the native ASID and > native PASID caches to the virtio protocol. It is possible to keep the PASID table in the host, but you need a way to allocate GPA since the SMMU accesses it after stage-2 translation. I think that necessarily requires a PV interface, but you could look into it. Anyway, even with that, ATC invalidations take a PASID. > > Given that the VM virtio driver has to have SMMUv3 specific code to > handle the CD table it must get, I don't see the problem with also > having SMMUv3 specific code in the hypervisor virtio driver to handle > invalidating based on the CD table. There isn't much we can't do, I'm just hoping to build something straightforward instead of having to work around awkward interfaces > > A couple of reasons are relevant here: non-QEMU VMMs don't want to emulate > > all vendor IOMMUs, new architectures get vIOMMU mostly for free, > > So your argument is you can implement a simple map/unmap API riding > on the common IOMMU API and this is portable? > > Seems sensible, but that falls apart pretty quickly when we talk about > nesting.. I don't think we can avoid VMM components to set this up, so > it stops being portable. At that point I'm back to asking why not use > the real HW model? A single VMM component that shovels data from the virtqueue to the kernel API and back, rather than four different hardware emulations, four different queues, four different device tables. It's obviously better for VMMs that don't do full-system emulation like QEMU, especially as they generally already implement a virtio transport. Smaller attack surface, fewer bugs. The VMM developer gets a multi-platform vIOMMU without having to study all the different architecture manuals. There is a small amount of HW specific data in there, but it only relates to table formats. Ideally it wouldn't need any HW knowledge, but that would requires the APIs to be aligned: instead of ID registers we pass plain features, and invalidations don't require HW specific opcodes. Otherwise there is going to be a layer of glue everywhere, which is what I'm trying to avoid here. > > > > All the iommu drivers have native command > > > queues. ARM and AMD are both supporting native command queues directly > > > in the guest, complete with a direct guest MMIO doorbell ring. > > > > Arm SMMUv3 mandates a single global command queue (SMMUv2 uses > > registers). An SMMUv3 can optionally implement multiple command > > queues, though I don't know if they can be safely assigned to > > guests. > > It is not standardized by ARM, but it can (and has) been done. 
> > > For a lot of SMMUv3 implementations that have a single queue and for > > other architectures, we can do better than hardware emulation. > > How is using a SW emulated virtio formatted queue better than using a > SW emulated SMMUv3 ECMDQ? We don't need to repeat it for all IOMMU architectures, not emulate a new queue in the kernel. The first motivator for virtio-iommu was avoiding to emulate hardware in the kernel. The SMMU maintainer saw how painful that was to do for the GIC, saw that there is a virtualization queue readily available in vhost and, well, it just made sense. Still does. > > As above, decoding arch-specific structures into generic ones is what an > > emulated IOMMU does, > > No, it is what virtio wants to do. We are deliberately trying not to > do that for real accelerated HW vIOMMU emulators. Yes there is a line somewhere, and I'd prefer it to be the page table. Given how many possible hardware combinations exist and how many more will show up, it would be good to abstract things where possible. > > > and it doesn't make a performance difference in which > > format it forwards that to the kernel. The host IOMMU driver checks the > > guest request and copies them into the command queue. Whether that request > > comes in the form of a structure binary-compatible with Arm SMMUvX.Y, or > > some generic structure, does not make a difference. > > It is not the structure layouts that matter! > > It is the semantic meaning of each request, on each unique piece of > hardware. We actually want to leak the subtle semantic differences to > userspace. These are hardware emulations, of course they have to know about hardware semantics. The QEMU IOMMUs can work in TCG mode where they decode and handle everything themselves. Thanks, Jean
On Fri, Mar 10, 2023 at 12:33:12PM +0100, Eric Auger wrote: > I do agree with Jean. We spent a lot of efforts all together to define > this generic invalidation API and if there is compelling reason that > prevents from using it, we should try to reuse it. That's the compelling reason in a nutshell right there. A lot of time was invested to create something that might be general. We still don't know if it is well defined and general. Even more time is going to be required on it before it could go forward. In the future, more time will be needed for every future HW to try and fit into it. We don't even know if it will scale to future HW. Nobody has even checked what today's POWER and S390 HW need. vs. this stuff was made in a few days. We know it is correct as a uAPI since it mirrors the HW and we know it is scalable to different HW schemes if they come up. So I don't see a good reason to take a risk on a "general" uAPI. If we make this wrong it could seriously damage the main goal of iommufd - to build accelerated vIOMMU models. Especially since the motivating reason in this thread - use it for virtio-iommu - doesn't even want to use it as a uAPI! If we get a vhost-virtio then we can decide what to do in-kernel and maybe this general API returns as an in-kernel API, I don't know, we need to see what this thing ends up looking like. Jason
On Thu, Mar 09, 2023 at 08:50:52PM -0800, Nicolin Chen wrote: > On Thu, Mar 09, 2023 at 10:48:50AM -0400, Jason Gunthorpe wrote: > > > Nicolin, I think we should tweak the uAPI here so that the > > invalidation opaque data has a format tagged on its own, instead of > > re-using the HWPT tag. Ie you can have a ARM SMMUv3 invalidate type > > tag and also a virtio-viommu invalidate type tag. > > The invalidation tage is shared with the hwpt allocation. Does > it mean that virtio-iommu won't have it's own allocation tag? I'm not entirely sure what you mean by allocation tag. For example with SMMU, when attaching page tables (SMMUv2), the guest passes an ASID at allocation, and when it modifies that address space it passes the same ASID for invalidation. When attaching PASID tables (SMMUv3), it writes the ASID/PASID in the PASID table, and passes both in the invalidation. Note that none of this is set in stone. It copies the Linux API we originally discussed, but we were waiting for progress on that front before committing to anything. Now we'll probably align to the new API where possible, leaving out what doesn't work for virtio-iommu. Thanks, Jean
On Thu, Mar 09, 2023 at 09:36:18PM -0800, Nicolin Chen wrote: > > Yea, a vmid is only allocated in an S2 domain allocation. So, > > a guest allocating only S1 domains always sets VMID=0. Yet, I > > think that the hypervisor or some where in host kernel should > > replace the VMID=0 with a unified VMID. > > Ah, I just recall a conversation with Jason that a VM should only > have one S2 domain. In that case, the VMID is already unified? Not required per se, but yes, most likely qemu would run that way. But you can't just re-use the VMID however you like. AFAIK the VMID is the cache tag for the S2 IOPTEs, so a given VMID must always refer to the same S2 translation. You can't mix different S2's with the same VMID. Thus you are stuck with the single S2 model in qemu if you want to use a userspace CMDQ. I suppose that suggests that if KVM supplies the VMID then it is assigned to a singular S2 iommu_domain also. Jason
On Fri, Mar 10, 2023 at 12:54:53PM +0000, Jean-Philippe Brucker wrote: > On Thu, Mar 09, 2023 at 08:50:52PM -0800, Nicolin Chen wrote: > > On Thu, Mar 09, 2023 at 10:48:50AM -0400, Jason Gunthorpe wrote: > > > > > Nicolin, I think we should tweak the uAPI here so that the > > > invalidation opaque data has a format tagged on its own, instead of > > > re-using the HWPT tag. Ie you can have a ARM SMMUv3 invalidate type > > > tag and also a virtio-viommu invalidate type tag. > > > > The invalidation tage is shared with the hwpt allocation. Does > > it mean that virtio-iommu won't have it's own allocation tag? > > I'm not entirely sure what you mean by allocation tag. He means the tag identifying the allocation driver specific data is the same tag that is passed in to identify the invalidation driver specific data. With the notion that the allocation data and invalidation data would be in the same driver's format. > Note that none of this is set in stone. It copies the Linux API we > originally discussed, but we were waiting for progress on that front > before committing to anything. Now we'll probably align to the new API > where possible, leaving out what doesn't work for virtio-iommu. IMHO virtio-iommu should stand alone and make sense with its own internal object model. eg I would probably try not to have guests invalidate PASID. Have a strong ASID model and in most cases have the hypervisor track where the ASID's are mapped to PASID/etc and rely on the hypervisor to spew the invalidations to PASID as required. It is more abstracted from the actual HW for the guest. The guest can simply say it changed an IOPTE under a certain ASID. The ugly wrinkle is SMMUv3 but perhaps your idea of allowing the hypervisor to manage the CD table in guest memory is reasonable. IMHO it is a missing SMMUv3 HW feature that the CD table doesn't have the option to be in hypervisor memory. AMD allows both options - so I'm not sure I would invest a huge amount to make special cases to support this... Assume a SMMUv3 update might gain the option someday. Jason
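A toy sketch of the fan-out model described above: the guest invalidates by ASID only, and the hypervisor-side driver expands that into per-PASID invalidations as the hardware requires. All names below are hypothetical pseudo-driver code for discussion.

```c
#include <stdint.h>

/* Hypothetical: where a guest ASID is currently bound on the host side. */
struct asid_binding {
	uint32_t host_pasid;
	struct asid_binding *next;
};

struct viommu_asid {
	uint32_t guest_asid;
	struct asid_binding *bindings;    /* PASIDs this ASID maps to */
};

/* Invented callback standing in for the arch driver's range invalidation. */
typedef void (*inv_range_fn)(uint32_t host_pasid, uint64_t iova, uint64_t size);

/*
 * Guest request: "I changed IOPTEs under this ASID in [iova, iova + size)".
 * The hypervisor fans it out to every PASID the ASID is bound to, so the
 * guest never issues PASID invalidations itself.
 */
static void viommu_invalidate_asid(struct viommu_asid *a, uint64_t iova,
				   uint64_t size, inv_range_fn inv)
{
	for (struct asid_binding *b = a->bindings; b; b = b->next)
		inv(b->host_pasid, iova, size);
}
```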
On 2023-03-09 21:01, Jason Gunthorpe wrote: >> For a lot of SMMUv3 implementations that have a single queue and for >> other architectures, we can do better than hardware emulation. > > How is using a SW emulated virtio formatted queue better than using a > SW emulated SMMUv3 ECMDQ? Since it's not been said, the really big thing is that virtio explicitly informs the host whenever the guest maps something. Emulating SMMUv3 means the host has to chase all the pagetable pointers in guest memory and trap writes such that it has visibility of invalid->valid transitions and can update the physical shadow pagetable correspondingly. FWIW we spent quite some time on and off discussing something like VT-d's "caching mode", but never found a convincing argument that it was a gap which needed filling, since we already had hardware nesting for maximum performance and a paravirtualisation option for efficient emulation. Thus full SMMUv3 emulation seems to just sit at the bottom as the maximum-compatibility option for pushing an unmodified legacy bare-metal software stack into a VM where nesting isn't available. Cheers, Robin.
On Fri, Mar 10, 2023 at 02:52:42PM +0000, Robin Murphy wrote: > On 2023-03-09 21:01, Jason Gunthorpe wrote: > > > For a lot of SMMUv3 implementations that have a single queue and for > > > other architectures, we can do better than hardware emulation. > > > > How is using a SW emulated virtio formatted queue better than using a > > SW emulated SMMUv3 ECMDQ? > > Since it's not been said, the really big thing is that virtio explicitly > informs the host whenever the guest maps something. Emulating SMMUv3 means > the host has to chase all the pagetable pointers in guest memory and trap > writes such that it has visibility of invalid->valid transitions and can > update the physical shadow pagetable correspondingly. Sorry, I mean in the context of a future virtio-iommu that is providing nested translation. E.g. why would anyone want to use virtio to provide SMMUv3-based HW-accelerated nesting? Jean suggested that the invalidation flow for virtio-iommu could be faster because it is in kernel, but I'm saying that we could also make the SMMUv3 invalidation in-kernel with the same basic technique. (and actively wondering if we should put more focus on that) I understand the appeal of the virtio scheme with its current map/unmap interface. I could also see some appeal of a simple virtio-iommu SVA that could point map a CPU page table as an option. The guest already has to know how to manage these anyhow so it is nicely general. If iommufd could provide a general cross-driver API to set exactly that scenario up then VMM code could also be general. That seems pretty interesting. But if the plan is to expose more detailed stuff like the CD or GCR3 PASID tables as something the guest has to manipulate and then a bunch of special invalidation to support that, and VMM code to back it, then I'm questioning the whole point. We lost the generality. Just use the normal HW accelerated SMMUv3 nesting model instead. If virtio-iommu SVA is really important for ARM then I'd suggest SMMUv3 should gain a new HW capability to allow the CD table to be in hypervisor memory so it works consistently for virtio-iommu SVA. Jason
On 2023-03-10 15:25, Jason Gunthorpe wrote:
> On Fri, Mar 10, 2023 at 02:52:42PM +0000, Robin Murphy wrote:
>> On 2023-03-09 21:01, Jason Gunthorpe wrote:
>>>> For a lot of SMMUv3 implementations that have a single queue and for
>>>> other architectures, we can do better than hardware emulation.
>>>
>>> How is using a SW emulated virtio formatted queue better than using a
>>> SW emulated SMMUv3 ECMDQ?
>>
>> Since it's not been said, the really big thing is that virtio explicitly
>> informs the host whenever the guest maps something. Emulating SMMUv3 means
>> the host has to chase all the pagetable pointers in guest memory and trap
>> writes such that it has visibility of invalid->valid transitions and can
>> update the physical shadow pagetable correspondingly.
>
> Sorry, I mean in the context of future virtio-iommu that is providing
> nested translation.

Ah, that's probably me missing the context again.

> eg why would anyone want to use virtio to provide SMMUv3 based HW
> accelerated nesting?
>
> Jean suggested that the invalidation flow for virtio-iommu could be
> faster because it is in kernel, but I'm saying that we could also make
> the SMMUv3 invalidation in-kernel with the same basic technique. (and
> actively wondering if we should put more focus on that)
>
> I understand the appeal of the virtio scheme with its current
> map/unmap interface.
>
> I could also see some appeal of a simple virtio-iommu SVA that could
> point map a CPU page table as an option. The guest already has to know
> how to manage these anyhow so it is nicely general.
>
> If iommufd could provide a general cross-driver API to set exactly
> that scenario up then VMM code could also be general. That seems
> pretty interesting.

Indeed, I've always assumed the niche for virtio would be that kind of
in-between use-case using nesting to accelerate simple translation,
where we plug a guest-owned pagetable into a host-owned context. That
way the guest retains the simple virtio interface and only needs to
understand a pagetable format (or, as you say, simply share a CPU
pagetable) without having to care about the nitty-gritty of all the
IOMMU-specific moving parts around it. For guests that want to get into
more advanced stuff like managing their own PASID tables, pushing them
towards "native" nesting probably does make more sense.

> But if the plan is to expose more detailed stuff like the CD or GCR3
> PASID tables as something the guest has to manipulate and then a bunch
> of special invalidation to support that, and VMM code to back it, then
> I'm questioning the whole point. We lost the generality.
>
> Just use the normal HW accelerated SMMUv3 nesting model instead.
>
> If virtio-iommu SVA is really important for ARM then I'd suggest
> SMMUv3 should gain a new HW capability to allow the CD table to be
> in hypervisor memory so it works consistently for virtio-iommu SVA.

Oh, maybe I should have read this far before reasoning the exact same
thing from scratch... oh well, this time I'm not going to go back and
edit :)

Thanks,
Robin.
On Fri, Mar 10, 2023 at 03:57:27PM +0000, Robin Murphy wrote:
> about the nitty-gritty of all the IOMMU-specific moving parts around it. For
> guests that want to get into more advanced stuff like managing their own
> PASID tables, pushing them towards "native" nesting probably does make more
> sense.

IMHO with the simplified virtio model I would say the guest should not
have its own PASID table.

Hypervisor trap to install a PASID and let the hypervisor driver handle
this abstractly. If abstractly is the whole point and benefit then
virtio should lean into that.

This also means virtio protocol doesn't do PASID invalidation. It
invalidates an ASID and the hypervisor takes care of whatever it is
connected to. Very simple and general for the VM.

Adding an S1 iommu_domain op for invalidating an address range is
perfectly fine, and the virtio kernel hypervisor driver can call it
generically.

The primary reason to have guest-owned PASID tables is CC stuff, which
definitely won't be part of virtio-iommu.

Jason
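The generic stage-1 invalidation op referred to here might look roughly like
the sketch below. This is a hypothetical shape, not an existing callback in
this series: the paravirtual IOMMU driver passes only the domain, the IOVA
range and a leaf-only hint, and each hardware driver translates that into
its own invalidation commands.

#include <stdbool.h>
#include <stddef.h>

struct iommu_domain;	/* opaque here; the real one lives in <linux/iommu.h> */

/* Hypothetical per-driver op table entry for stage-1 range invalidation */
struct s1_invalidate_ops_sketch {
	void (*invalidate_range)(struct iommu_domain *domain,
				 unsigned long iova, size_t size,
				 bool leaf_only);
};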
On Thu, Mar 09, 2023 at 08:50:52PM -0800, Nicolin Chen wrote:
> On Thu, Mar 09, 2023 at 10:48:50AM -0400, Jason Gunthorpe wrote:
>
> > Nicolin, I think we should tweak the uAPI here so that the
> > invalidation opaque data has a format tagged on its own, instead of
> > re-using the HWPT tag. Ie you can have an ARM SMMUv3 invalidate type
> > tag and also a virtio-viommu invalidate type tag.
>
> The invalidation tag is shared with the hwpt allocation. Does
> it mean that virtio-iommu won't have its own allocation tag?

We probably shouldn't assume it will.

Jason
On Fri, Mar 10, 2023 at 12:06:18PM -0400, Jason Gunthorpe wrote:
> On Thu, Mar 09, 2023 at 08:50:52PM -0800, Nicolin Chen wrote:
> > On Thu, Mar 09, 2023 at 10:48:50AM -0400, Jason Gunthorpe wrote:
> >
> > > Nicolin, I think we should tweak the uAPI here so that the
> > > invalidation opaque data has a format tagged on its own, instead of
> > > re-using the HWPT tag. Ie you can have an ARM SMMUv3 invalidate type
> > > tag and also a virtio-viommu invalidate type tag.
> >
> > The invalidation tag is shared with the hwpt allocation. Does
> > it mean that virtio-iommu won't have its own allocation tag?
>
> We probably shouldn't assume it will

In that case, why do we need an invalidation tag/type on its own? Can't
we use an IOMMU_HWPT_TYPE_VIRTIO tag for allocation and invalidation
together for virtio?

Or did you mean that we should define a flag inside the data structure
like this?

struct iommu_hwpt_invalidate_arm_smmuv3 {
#define IOMMU_SMMUV3_CMDQ_TLBI_VA_LEAF (1 << 0)
#define IOMMU_SMMUV3_FORMAT_VIRTIO (1ULL << 63)
	__u64 flags;
};

Thanks
Nic
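For comparison, the "format tagged on its own" idea could also be expressed
as a type field in the common invalidation ioctl rather than a flag bit
inside the SMMUv3 payload. The names below are hypothetical and not part of
this series:

#include <linux/types.h>

/* Hypothetical alternative: the common ioctl structure carries its own
 * data_type for the invalidation payload, decoupled from the HWPT
 * allocation type, so a virtio-iommu format could be added later without
 * touching the driver-specific structures.
 */
enum iommu_hwpt_invalidate_data_type_sketch {
	IOMMU_HWPT_INVALIDATE_DATA_ARM_SMMUV3_SKETCH,
	IOMMU_HWPT_INVALIDATE_DATA_VIRTIO_IOMMU_SKETCH,
};

struct iommu_hwpt_invalidate_tagged_sketch {
	__u32 size;		/* sizeof(this struct) */
	__u32 hwpt_id;		/* HWPT to invalidate */
	__u32 data_type;	/* enum iommu_hwpt_invalidate_data_type_sketch */
	__u32 data_len;		/* length of the driver-specific payload */
	__u64 data_uptr;	/* user pointer to that payload */
};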
> From: Robin Murphy <robin.murphy@arm.com>
> Sent: Friday, March 10, 2023 11:57 PM
>
> >
> > If iommufd could provide a general cross-driver API to set exactly
> > that scenario up then VMM code could also be general. That seems
> > pretty interesting.
>
> Indeed, I've always assumed the niche for virtio would be that kind of
> in-between use-case using nesting to accelerate simple translation,
> where we plug a guest-owned pagetable into a host-owned context. That
> way the guest retains the simple virtio interface and only needs to
> understand a pagetable format (or as you say, simply share a CPU
> pagetable) without having to care about the nitty-gritty of all the
> IOMMU-specific moving parts around it. For guests that want to get into
> more advanced stuff like managing their own PASID tables, pushing them
> towards "native" nesting probably does make more sense.
>

The interesting thing is that we cannot expose both virtio-iommu and an
emulated vIOMMU to one guest and let it choose. So if the guest has been
using virtio-iommu for whatever reason, naturally it may want the more
advanced features on it too.
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Saturday, March 11, 2023 12:03 AM
>
> On Fri, Mar 10, 2023 at 03:57:27PM +0000, Robin Murphy wrote:
>
> > about the nitty-gritty of all the IOMMU-specific moving parts around it. For
> > guests that want to get into more advanced stuff like managing their own
> > PASID tables, pushing them towards "native" nesting probably does
> > make more sense.
>
> IMHO with the simplified virtio model I would say the guest should
> not have its own PASID table.
>
> Hypervisor trap to install a PASID and let the hypervisor driver handle
> this abstractly. If abstractly is the whole point and benefit then
> virtio should lean into that.
>
> This also means virtio protocol doesn't do PASID invalidation. It
> invalidates an ASID and the hypervisor takes care of whatever it is
> connected to. Very simple and general for the VM.

This sounds fair, if "ASID" here refers to a general ID identifying the
page table instead of the ARM-specific ASID.
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, March 10, 2023 8:52 PM
>
> On Fri, Mar 10, 2023 at 12:33:12PM +0100, Eric Auger wrote:
>
> > I do agree with Jean. We spent a lot of effort all together to define
> > this generic invalidation API, and if there is a compelling reason that
> > prevents us from using it, we should try to reuse it.
>
> That's the compelling reason in a nutshell right there.
>
> A lot of time was invested to create something that might be
> general. We still don't know if it is well defined and general. Even
> more time is going to be required on it before it could go forward. In
> future more time will be needed for every future HW to try and fit
> into it. We don't even know if it will scale to future HW. Nobody has
> even checked what today's POWER and S390 HW need.
>
> vs, this stuff was made in a few days. We know it is correct as a uAPI
> since it mirrors the HW and we know it is scalable to different HW
> schemes if they come up.
>
> So I don't see a good reason to take a risk on a "general" uAPI. If we
> make this wrong it could seriously damage the main goal of iommufd -
> to build accelerated vIOMMU models.
>

I'm with this point. We can add a virtio format when it comes.
diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index 8f9985bddeeb..5e798b2f9a3a 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -173,6 +173,7 @@ iommufd_hw_pagetable_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
 static const size_t iommufd_hwpt_alloc_data_size[] = {
 	[IOMMU_HWPT_TYPE_DEFAULT] = 0,
 	[IOMMU_HWPT_TYPE_VTD_S1] = sizeof(struct iommu_hwpt_intel_vtd),
+	[IOMMU_HWPT_TYPE_ARM_SMMUV3] = sizeof(struct iommu_hwpt_arm_smmuv3),
 };
 
 /*
@@ -183,6 +184,8 @@ const u64 iommufd_hwpt_type_bitmaps[] = {
 	[IOMMU_HW_INFO_TYPE_DEFAULT] = BIT_ULL(IOMMU_HWPT_TYPE_DEFAULT),
 	[IOMMU_HW_INFO_TYPE_INTEL_VTD] = BIT_ULL(IOMMU_HWPT_TYPE_DEFAULT) |
 					 BIT_ULL(IOMMU_HWPT_TYPE_VTD_S1),
+	[IOMMU_HW_INFO_TYPE_ARM_SMMUV3] = BIT_ULL(IOMMU_HWPT_TYPE_DEFAULT) |
+					  BIT_ULL(IOMMU_HWPT_TYPE_ARM_SMMUV3),
 };
 
 /* Return true if type is supported, otherwise false */
@@ -329,6 +332,7 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
  */
 static const size_t iommufd_hwpt_invalidate_info_size[] = {
 	[IOMMU_HWPT_TYPE_VTD_S1] = sizeof(struct iommu_hwpt_invalidate_intel_vtd),
+	[IOMMU_HWPT_TYPE_ARM_SMMUV3] = sizeof(struct iommu_hwpt_invalidate_arm_smmuv3),
 };
 
 int iommufd_hwpt_invalidate(struct iommufd_ucmd *ucmd)
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 514db4c26927..0b0097af7c86 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -280,6 +280,7 @@ union ucmd_buffer {
 	 * path.
 	 */
 	struct iommu_hwpt_invalidate_intel_vtd vtd;
+	struct iommu_hwpt_invalidate_arm_smmuv3 smmuv3;
 };
 
 struct iommufd_ioctl_op {
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 2a6c326391b2..0d5551b1b2be 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -352,10 +352,13 @@ struct iommu_vfio_ioas {
  * enum iommu_hwpt_type - IOMMU HWPT Type
  * @IOMMU_HWPT_TYPE_DEFAULT: default
  * @IOMMU_HWPT_TYPE_VTD_S1: Intel VT-d stage-1 page table
+ * @IOMMU_HWPT_TYPE_ARM_SMMUV3: ARM SMMUv3 stage-1 Context Descriptor
+ *                              table
  */
 enum iommu_hwpt_type {
 	IOMMU_HWPT_TYPE_DEFAULT,
 	IOMMU_HWPT_TYPE_VTD_S1,
+	IOMMU_HWPT_TYPE_ARM_SMMUV3,
 };
 
 /**
@@ -411,6 +414,28 @@ struct iommu_hwpt_intel_vtd {
 	__u32 __reserved;
 };
 
+/**
+ * struct iommu_hwpt_arm_smmuv3 - ARM SMMUv3 specific page table data
+ *
+ * @flags: page table entry attributes
+ * @s2vmid: Virtual machine identifier
+ * @s1ctxptr: Stage-1 context descriptor pointer
+ * @s1cdmax: Number of CDs pointed to by s1ContextPtr
+ * @s1fmt: Stage-1 Format
+ * @s1dss: Default substream
+ */
+struct iommu_hwpt_arm_smmuv3 {
+#define IOMMU_SMMUV3_FLAG_S2 (1 << 0) /* if unset, stage-1 */
+#define IOMMU_SMMUV3_FLAG_VMID (1 << 1) /* vmid override */
+	__u64 flags;
+	__u32 s2vmid;
+	__u32 __reserved;
+	__u64 s1ctxptr;
+	__u64 s1cdmax;
+	__u64 s1fmt;
+	__u64 s1dss;
+};
+
 /**
  * struct iommu_hwpt_alloc - ioctl(IOMMU_HWPT_ALLOC)
  * @size: sizeof(struct iommu_hwpt_alloc)
@@ -446,6 +471,8 @@ struct iommu_hwpt_intel_vtd {
  * +------------------------------+-------------------------------------+-----------+
  * | IOMMU_HWPT_TYPE_VTD_S1       | struct iommu_hwpt_intel_vtd         | HWPT      |
  * +------------------------------+-------------------------------------+-----------+
+ * | IOMMU_HWPT_TYPE_ARM_SMMUV3   | struct iommu_hwpt_arm_smmuv3        | IOAS/HWPT |
+ * +------------------------------+-------------------------------------+-----------+
  */
 struct iommu_hwpt_alloc {
 	__u32 size;
@@ -463,10 +490,12 @@ struct iommu_hwpt_alloc {
 /**
  * enum iommu_hw_info_type - IOMMU Hardware Info Types
  * @IOMMU_HW_INFO_TYPE_INTEL_VTD: Intel VT-d iommu info type
+ * @IOMMU_HW_INFO_TYPE_ARM_SMMUV3: ARM SMMUv3 iommu info type
  */
 enum iommu_hw_info_type {
 	IOMMU_HW_INFO_TYPE_DEFAULT,
 	IOMMU_HW_INFO_TYPE_INTEL_VTD,
+	IOMMU_HW_INFO_TYPE_ARM_SMMUV3,
 };
 
 /**
@@ -591,6 +620,25 @@ struct iommu_hwpt_invalidate_intel_vtd {
 	__u64 nb_granules;
 };
 
+/**
+ * struct iommu_hwpt_invalidate_arm_smmuv3 - ARM SMMUv3 cache invalidation info
+ * @flags: boolean attributes of cache invalidation command
+ * @opcode: opcode of cache invalidation command
+ * @ssid: SubStream ID
+ * @granule_size: page/block size of the mapping in bytes
+ * @range: IOVA range to invalidate
+ */
+struct iommu_hwpt_invalidate_arm_smmuv3 {
+#define IOMMU_SMMUV3_CMDQ_TLBI_VA_LEAF (1 << 0)
+	__u64 flags;
+	__u8 opcode;
+	__u8 padding[3];
+	__u32 asid;
+	__u32 ssid;
+	__u32 granule_size;
+	struct iommu_iova_range range;
+};
+
 /**
  * struct iommu_hwpt_invalidate - ioctl(IOMMU_HWPT_INVALIDATE)
  * @size: sizeof(struct iommu_hwpt_invalidate)
@@ -609,6 +657,8 @@ struct iommu_hwpt_invalidate_intel_vtd {
  * +------------------------------+----------------------------------------+
  * | IOMMU_HWPT_TYPE_VTD_S1       | struct iommu_hwpt_invalidate_intel_vtd |
  * +------------------------------+----------------------------------------+
+ * | IOMMU_HWPT_TYPE_ARM_SMMUV3   | struct iommu_hwpt_invalidate_arm_smmuv3|
+ * +------------------------------+----------------------------------------+
  */
 struct iommu_hwpt_invalidate {
 	__u32 size;
Add the following data structures for corresponding ioctls:
       iommu_hwpt_arm_smmuv3            => IOMMUFD_CMD_HWPT_ALLOC
       iommu_hwpt_invalidate_arm_smmuv3 => IOMMUFD_CMD_HWPT_INVALIDATE

Also, add IOMMU_HW_INFO_TYPE_ARM_SMMUV3 and IOMMU_HWPT_TYPE_ARM_SMMUV3
to the header and corresponding type/size arrays.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/iommufd/hw_pagetable.c |  4 +++
 drivers/iommu/iommufd/main.c         |  1 +
 include/uapi/linux/iommufd.h         | 50 ++++++++++++++++++++++++++++
 3 files changed, 55 insertions(+)
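To make the proposed uAPI concrete, here is a rough sketch of how a VMM
might fill these driver-specific structures when it shadows a guest STE and
forwards a guest CMD_TLBI_NH_VA. The ioctl plumbing around them (the generic
IOMMU_HWPT_ALLOC / IOMMU_HWPT_INVALIDATE wrappers) comes from earlier
patches in the series and is only assumed here; the helper names and
variables below are hypothetical.

#include <stdbool.h>
#include <string.h>
#include <linux/types.h>
#include <linux/iommufd.h>	/* assumes a header with this patch applied */

/* Hypothetical VMM helper: build the stage-1 allocation data from the
 * fields of the guest STE (CD table pointer, S1CDMax, S1Fmt, S1DSS).
 */
static void fill_smmuv3_alloc_data(struct iommu_hwpt_arm_smmuv3 *data,
				   __u64 s1ctxptr, __u64 s1cdmax,
				   __u64 s1fmt, __u64 s1dss)
{
	memset(data, 0, sizeof(*data));
	data->flags = 0;		/* stage-1 attach, no VMID override */
	data->s1ctxptr = s1ctxptr;
	data->s1cdmax = s1cdmax;
	data->s1fmt = s1fmt;
	data->s1dss = s1dss;
}

/* Hypothetical VMM helper: translate a guest CMD_TLBI_NH_VA into the
 * proposed invalidation structure.
 */
static void fill_smmuv3_tlbi_nh_va(struct iommu_hwpt_invalidate_arm_smmuv3 *inv,
				   __u32 asid, __u64 start, __u64 last,
				   __u32 granule_size, bool leaf)
{
	memset(inv, 0, sizeof(*inv));
	inv->flags = leaf ? IOMMU_SMMUV3_CMDQ_TLBI_VA_LEAF : 0;
	inv->opcode = 0x12;		/* CMD_TLBI_NH_VA in the SMMUv3 spec */
	inv->asid = asid;
	inv->ssid = 0;			/* default substream */
	inv->granule_size = granule_size;
	inv->range.start = start;
	inv->range.last = last;
}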