Message ID | 11-v1-e79cd8d168e8+6-iommufd_jgg@nvidia.com (mailing list archive)
---|---
State | New, archived
Series | IOMMUFD Generic interface
On Fri, 18 Mar 2022 14:27:36 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> iommufd can directly implement the /dev/vfio/vfio container IOCTLs by
> mapping them into io_pagetable operations. Doing so allows the use of
> iommufd by symlinking /dev/vfio/vfio to /dev/iommufd. Allowing VFIO to
> SET_CONTAINER using an iommufd instead of a container fd is a followup
> series.
>
> Internally the compatibility API uses a normal IOAS object that, like
> vfio, is automatically allocated when the first device is attached.
>
> Userspace can also query or set this IOAS object directly using the
> IOMMU_VFIO_IOAS ioctl. This allows mixing and matching new iommufd-only
> features while still using the VFIO style map/unmap ioctls.
>
> While this is enough to operate qemu, it is still a bit of a WIP with a
> few gaps to be resolved:
>
> - Only the TYPE1v2 mode is supported, where unmap cannot punch holes or
>   split areas. The old mode can be implemented with a new operation to
>   split an iopt_area into two without disturbing the iopt_pages or the
>   domains, then unmapping a whole area as normal.
>
> - Resource limits rely on memory cgroups to bound what userspace can do,
>   instead of the module parameter dma_entry_limit.
>
> - VFIO P2P is not implemented. Avoiding the follow_pfn() mis-design will
>   require some additional work to properly expose PFN lifecycle between
>   VFIO and iommufd.
>
> - Various components of the mdev API are not completed yet.
>
> - Indefinite suspend of SW access (VFIO_DMA_MAP_FLAG_VADDR) is not
>   implemented.
>
> - The 'dirty tracking' is not implemented.
>
> - A full audit for pedantic compatibility details (eg errnos, etc) has
>   not yet been done.
>
> - powerpc SPAPR is left out, as it is not connected to the iommu_domain
>   framework. My hope is that SPAPR will be moved into the iommu_domain
>   framework as a special HW specific type, and I would expect power to
>   support the generic interface through a normal iommu_domain.
My overall question here would be whether we can actually achieve a compatibility interface that has sufficient feature transparency that we can dump vfio code in favor of this interface, or whether there will be enough niche use cases that we need to keep type1 and vfio containers around through a deprecation process.

The locked memory differences, for one, seem like something that libvirt wouldn't want hidden, and we have questions regarding support for vaddr hijacking and different ideas of how to implement dirty page tracking, not to mention the missing features that are currently well used, like p2p mappings, coherency tracking, mdev, etc.

It seems like quite an endeavor to fill all these gaps, while at the same time QEMU will be working to move to use iommufd directly in order to gain all the new features.

Where do we focus attention? Is symlinking device files our proposal to userspace, and is that something achievable? Or do we want to use this compatibility interface as a means to test the interface and allow userspace to make use of it for transition, if their use cases allow it, perhaps eventually performing the symlink after deprecation and eventual removal of the vfio container and type1 code? Thanks,

Alex
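[Editor's note: the cover letter above proposes supporting the old TYPE1v1 hole-punching unmap by splitting an iopt_area into two without disturbing the iopt_pages or domains. A toy model of why that works is sketched below; the structures and the `area_split` helper are invented stand-ins for illustration, not the series' actual implementation.]

```c
#include <assert.h>
#include <stddef.h>

#define PGSIZE 4096UL

/* Stand-in for iopt_pages: the pinned-pages object shared by areas. */
struct pages { size_t npages; };

/* Stand-in for iopt_area: an inclusive IOVA range backed by 'pages'. */
struct area {
	unsigned long iova_start, iova_last;	/* inclusive IOVA range */
	struct pages *pages;			/* pinned pages, shared */
	size_t page_offset;			/* first page of 'pages' we map */
};

/* Split 'a' at page-aligned 'cut': 'a' keeps [start, cut-1] and '*tail'
 * becomes [cut, last]. Both halves keep referencing the same pages object,
 * just at different offsets, so no pinning or domain mapping changes; a
 * hole-punching unmap then becomes "split, then unmap a whole area". */
static void area_split(struct area *a, unsigned long cut, struct area *tail)
{
	assert(cut % PGSIZE == 0);
	assert(cut > a->iova_start && cut <= a->iova_last);
	tail->iova_start = cut;
	tail->iova_last = a->iova_last;
	tail->pages = a->pages;
	tail->page_offset = a->page_offset + (cut - a->iova_start) / PGSIZE;
	a->iova_last = cut - 1;
}
```
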
On Wed, Mar 23, 2022 at 04:51:25PM -0600, Alex Williamson wrote:

> My overall question here would be whether we can actually achieve a
> compatibility interface that has sufficient feature transparency that we
> can dump vfio code in favor of this interface, or will there be enough
> niche use cases that we need to keep type1 and vfio containers around
> through a deprecation process?

Other than SPAPR, I think we can.

> The locked memory differences for one seem like something that
> libvirt wouldn't want hidden

I'm first interested to have an understanding of how this change becomes a real problem in practice that requires libvirt to do something different for vfio or iommufd. We can discuss that in the other thread.

If this is the make or break point then I think we can deal with it either by going back to what vfio does now or perhaps some other friendly compat approach.

> and we have questions regarding support for vaddr hijacking

I'm not sure what vaddr hijacking is? Do you mean VFIO_DMA_MAP_FLAG_VADDR? There is a comment that outlines my plan to implement it in a functionally compatible way without the deadlock problem. I estimate this as a small project.

> and different ideas how to implement dirty page tracking,

I don't think this is a compatibility issue. No kernel today triggers qemu to use this feature, as no kernel supports live migration, and no existing qemu will trigger this feature with new kernels that support live migration v2. Therefore we can adjust qemu's dirty tracking at the same time we enable migration v2 in qemu.

With Joao's work we are close to having a solid RFC to come with something that can be fully implemented.

Hopefully we can agree on this soon enough that qemu can come with a full package of migration v2 support, including the dirty tracking solution.

> not to mention the missing features that are currently well used,
> like p2p mappings, coherency tracking, mdev, etc.

I consider these all mandatory things; they won't be left out.
The reason they are not in the RFC is mostly because supporting them requires work outside just this iommufd area, and I'd like this series to remain self-contained.

I've already got a draft to add DMABUF support to VFIO PCI, which nicely solves the follow_pfn security problem; we want to do this for another reason already. I'm waiting for some testing feedback before posting it. I need some help from Daniel to make the DMABUF revoke semantic he and I have been talking about. In the worst case we can copy the follow_pfn approach.

Intel no-snoop is simple enough, it just needs some Intel cleanup parts.

mdev will come along with the final VFIO integration; all the really hard parts are done already. The VFIO integration is a medium sized task overall.

So, I'm not ready to give up yet :)

> Where do we focus attention? Is symlinking device files our proposal
> to userspace and is that something achievable, or do we want to use
> this compatibility interface as a means to test the interface and
> allow userspace to make use of it for transition, if their use cases
> allow it, perhaps eventually performing the symlink after deprecation
> and eventual removal of the vfio container and type1 code? Thanks,

Symlinking device files is definitely just a suggested way to expedite testing.

Things like qemu that are learning to use iommufd-only features should learn to directly open iommufd instead of the vfio container to activate those features.

Looking long down the road, I don't think we want to have type1 and iommufd code forever. So I would like to make an option to compile out vfio container support entirely and have that option arrange for iommufd to provide the container device node itself.

I think we can get there pretty quickly, or at least I haven't got anything that is scaring me a lot (beyond SPAPR of course).

For the dpdk/etcs of the world I think we are already there.

Jason
Hi,

On 3/24/22 1:33 AM, Jason Gunthorpe wrote:
> On Wed, Mar 23, 2022 at 04:51:25PM -0600, Alex Williamson wrote:
>
>> My overall question here would be whether we can actually achieve a
>> compatibility interface that has sufficient feature transparency that we
>> can dump vfio code in favor of this interface, or will there be enough
>> niche use cases that we need to keep type1 and vfio containers around
>> through a deprecation process?
> Other than SPAPR, I think we can.
>
>> The locked memory differences for one seem like something that
>> libvirt wouldn't want hidden
> I'm first interested to have an understanding how this change becomes
> a real problem in practice that requires libvirt to do something
> different for vfio or iommufd. We can discuss in the other thread
>
> If this is the make or break point then I think we can deal with it
> either by going back to what vfio does now or perhaps some other
> friendly compat approach..
>
>> and we have questions regarding support for vaddr hijacking
> I'm not sure what vaddr hijacking is? Do you mean
> VFIO_DMA_MAP_FLAG_VADDR ? There is a comment that outlines my plan to
> implement it in a functionally compatible way without the deadlock
> problem. I estimate this as a small project.
>
>> and different ideas how to implement dirty page tracking,
> I don't think this is compatibility. No kernel today triggers qemu to
> use this feature as no kernel supports live migration. No existing
> qemu will trigger this feature with new kernels that support live
> migration v2. Therefore we can adjust qemu's dirty tracking at the
> same time we enable migration v2 in qemu.
>
> With Joao's work we are close to having a solid RFC to come with
> something that can be fully implemented.
>
> Hopefully we can agree to this soon enough that qemu can come with a
> full package of migration v2 support including the dirty tracking
> solution.
>
>> not to mention the missing features that are currently well used,
>> like p2p mappings, coherency tracking, mdev, etc.
> I consider these all mandatory things, they won't be left out.
>
> The reason they are not in the RFC is mostly because supporting them
> requires work outside just this iommufd area, and I'd like this series
> to remain self-contained.
>
> I've already got a draft to add DMABUF support to VFIO PCI which
> nicely solves the follow_pfn security problem, we want to do this for
> another reason already. I'm waiting for some testing feedback before
> posting it. Need some help from Daniel make the DMABUF revoke semantic
> him and I have been talking about. In the worst case can copy the
> follow_pfn approach.
>
> Intel no-snoop is simple enough, just needs some Intel cleanup parts.
>
> mdev will come along with the final VFIO integration, all the really
> hard parts are done already. The VFIO integration is a medium sized
> task overall.
>
> So, I'm not ready to give up yet :)
>
>> Where do we focus attention? Is symlinking device files our proposal
>> to userspace and is that something achievable, or do we want to use
>> this compatibility interface as a means to test the interface and
>> allow userspace to make use of it for transition, if their use cases
>> allow it, perhaps eventually performing the symlink after deprecation
>> and eventual removal of the vfio container and type1 code? Thanks,
> symlinking device files is definitely just a suggested way to expedite
> testing.
>
> Things like qemu that are learning to use iommufd-only features should
> learn to directly open iommufd instead of vfio container to activate
> those features.
>
> Looking long down the road I don't think we want to have type 1 and
> iommufd code forever. So, I would like to make an option to compile
> out vfio container support entirely and have that option arrange for
> iommufd to provide the container device node itself.
I am currently working on migrating the QEMU VFIO device onto the new API because, after our discussions, the compat mode cannot be used anyway to implement nesting. I hope I will be able to present something next week.

Thanks

Eric

>
> I think we can get there pretty quickly, or at least I haven't got
> anything that is scaring me alot (beyond SPAPR of course)
>
> For the dpdk/etcs of the world I think we are already there.
>
> Jason
>
On Wed, 23 Mar 2022 21:33:42 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Mar 23, 2022 at 04:51:25PM -0600, Alex Williamson wrote:
>
> > My overall question here would be whether we can actually achieve a
> > compatibility interface that has sufficient feature transparency that we
> > can dump vfio code in favor of this interface, or will there be enough
> > niche use cases that we need to keep type1 and vfio containers around
> > through a deprecation process?
>
> Other than SPAPR, I think we can.

Does this mean #ifdef CONFIG_PPC in vfio core to retain infrastructure for POWER support?

> > The locked memory differences for one seem like something that
> > libvirt wouldn't want hidden
>
> I'm first interested to have an understanding how this change becomes
> a real problem in practice that requires libvirt to do something
> different for vfio or iommufd. We can discuss in the other thread
>
> If this is the make or break point then I think we can deal with it
> either by going back to what vfio does now or perhaps some other
> friendly compat approach..
>
> > and we have questions regarding support for vaddr hijacking
>
> I'm not sure what vaddr hijacking is? Do you mean
> VFIO_DMA_MAP_FLAG_VADDR ? There is a comment that outlines my plan to
> implement it in a functionally compatible way without the deadlock
> problem. I estimate this as a small project.
>
> > and different ideas how to implement dirty page tracking,
>
> I don't think this is compatibility. No kernel today triggers qemu to
> use this feature as no kernel supports live migration. No existing
> qemu will trigger this feature with new kernels that support live
> migration v2. Therefore we can adjust qemu's dirty tracking at the
> same time we enable migration v2 in qemu.

I guess I was assuming that enabling v2 migration in QEMU was dependent on the existing type1 dirty tracking because it's the only means we have to tell QEMU that all memory is perpetually dirty when we have a DMA device.
Is that not correct? If we don't intend to carry type1 dirty tracking into iommufd compatibility and we need it for this purpose, then our window for being able to rip it out entirely closes when QEMU gains v2 migration support.

> With Joao's work we are close to having a solid RFC to come with
> something that can be fully implemented.
>
> Hopefully we can agree to this soon enough that qemu can come with a
> full package of migration v2 support including the dirty tracking
> solution.
>
> > not to mention the missing features that are currently well used,
> > like p2p mappings, coherency tracking, mdev, etc.
>
> I consider these all mandatory things, they won't be left out.
>
> The reason they are not in the RFC is mostly because supporting them
> requires work outside just this iommufd area, and I'd like this series
> to remain self-contained.
>
> I've already got a draft to add DMABUF support to VFIO PCI which
> nicely solves the follow_pfn security problem, we want to do this for
> another reason already. I'm waiting for some testing feedback before
> posting it. Need some help from Daniel make the DMABUF revoke semantic
> him and I have been talking about. In the worst case can copy the
> follow_pfn approach.
>
> Intel no-snoop is simple enough, just needs some Intel cleanup parts.
>
> mdev will come along with the final VFIO integration, all the really
> hard parts are done already. The VFIO integration is a medium sized
> task overall.
>
> So, I'm not ready to give up yet :)

Ok, that's a more promising outlook than I was inferring from the long list of missing features.

> > Where do we focus attention?
> > Is symlinking device files our proposal
> > to userspace and is that something achievable, or do we want to use
> > this compatibility interface as a means to test the interface and
> > allow userspace to make use of it for transition, if their use cases
> > allow it, perhaps eventually performing the symlink after deprecation
> > and eventual removal of the vfio container and type1 code? Thanks,
>
> symlinking device files is definitely just a suggested way to expedite
> testing.
>
> Things like qemu that are learning to use iommufd-only features should
> learn to directly open iommufd instead of vfio container to activate
> those features.

Which is kind of the basis for my question: QEMU is racing for native support, Eric and Yi are already working on this, so some of these compatibility interfaces might only have short term usefulness.

> Looking long down the road I don't think we want to have type 1 and
> iommufd code forever.

Agreed.

> So, I would like to make an option to compile
> out vfio container support entirely and have that option arrange for
> iommufd to provide the container device node itself.
>
> I think we can get there pretty quickly, or at least I haven't got
> anything that is scaring me alot (beyond SPAPR of course)
>
> For the dpdk/etcs of the world I think we are already there.

That's essentially what I'm trying to reconcile: we're racing both to round out the compatibility interface to fully support QEMU, while also updating QEMU to use iommufd directly so it won't need that full support. It's a confusing message. Thanks,

Alex
On Thu, Mar 24, 2022 at 04:04:03PM -0600, Alex Williamson wrote:
> On Wed, 23 Mar 2022 21:33:42 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> > On Wed, Mar 23, 2022 at 04:51:25PM -0600, Alex Williamson wrote:
> >
> > > My overall question here would be whether we can actually achieve a
> > > compatibility interface that has sufficient feature transparency that we
> > > can dump vfio code in favor of this interface, or will there be enough
> > > niche use cases that we need to keep type1 and vfio containers around
> > > through a deprecation process?
> >
> > Other than SPAPR, I think we can.
>
> Does this mean #ifdef CONFIG_PPC in vfio core to retain infrastructure
> for POWER support?

Certainly initially - I have no ability to do better than that. I'm hoping someone from IBM will be willing to work on this in the long run, and then we can do better.

> > I don't think this is compatibility. No kernel today triggers qemu to
> > use this feature as no kernel supports live migration. No existing
> > qemu will trigger this feature with new kernels that support live
> > migration v2. Therefore we can adjust qemu's dirty tracking at the
> > same time we enable migration v2 in qemu.
>
> I guess I was assuming that enabling v2 migration in QEMU was dependent
> on the existing type1 dirty tracking because it's the only means we
> have to tell QEMU that all memory is perpetually dirty when we have a
> DMA device. Is that not correct?

I haven't looked closely at this part in qemu, but IMHO, if qemu sees that it has VFIO migration support but does not have any DMA dirty tracking capability, it should not do precopy flows.

If there is a bug here we should certainly fix it before progressing the v2 patches. I'll ask Yishai & Co to take a look.

> > Intel no-snoop is simple enough, just needs some Intel cleanup parts.

Patches for this exist now.

> > mdev will come along with the final VFIO integration, all the really
> > hard parts are done already.
> > The VFIO integration is a medium sized
> > task overall.
> >
> > So, I'm not ready to give up yet :)
>
> Ok, that's a more promising outlook than I was inferring from the long
> list of missing features.

Yeah, the list is just long, but these are not scary things, just priorities and patch planning.

> > I think we can get there pretty quickly, or at least I haven't got
> > anything that is scaring me alot (beyond SPAPR of course)
> >
> > For the dpdk/etcs of the world I think we are already there.
>
> That's essentially what I'm trying to reconcile, we're racing both
> to round out the compatibility interface to fully support QEMU, while
> also updating QEMU to use iommufd directly so it won't need that full
> support. It's a confusing message. Thanks,

The long term purpose of compatibility is to provide a config option to allow type 1 to be turned off while continuing to support old user space (eg in containers) that is running old qemu/dpdk/spdk/etc.

This shows that we have a plan/path to allow a distro to support only one iommu interface in their kernel, should they choose, without having to sacrifice uABI compatibility.

As for racing, my intention is to leave the compat interface alone for a while - the more urgent things on my personal list are the RFC for dirty tracking, mlx5 support for dirty tracking, and VFIO preparation for iommufd support.

Eric and Yi are focusing on userspace page tables and qemu updates.

Joao is working on implementing iommu driver dirty tracking.

Lu and Jacob are working on getting PASID support infrastructure together.

There is a lot going on!

A question to consider is what you would consider the minimum bar for merging?

Thanks,
Jason
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, March 25, 2022 7:12 AM
>
> On Thu, Mar 24, 2022 at 04:04:03PM -0600, Alex Williamson wrote:
> > That's essentially what I'm trying to reconcile, we're racing both
> > to round out the compatibility interface to fully support QEMU, while
> > also updating QEMU to use iommufd directly so it won't need that full
> > support. It's a confusing message. Thanks,
>
> The long term purpose of compatibility is to provide a config option
> to allow type 1 to be turned off and continue to support old user
> space (eg in containers) that is running old qemu/dpdk/spdk/etc.
>
> This shows that we have a plan/path to allow a distro to support only
> one iommu interface in their kernel should they choose without having
> to sacrifice uABI compatibility.
>
> As for racing, my intention is to leave the compat interface alone for
> awhile - the more urgent things in on my personal list are the RFC
> for dirty tracking, mlx5 support for dirty tracking, and VFIO preparation
> for iommufd support.
>
> Eric and Yi are focusing on userspace page tables and qemu updates.
>
> Joao is working on implementing iommu driver dirty tracking
>
> Lu and Jacob are working on getting PASID support infrastructure
> together.
>
> There is alot going on!
>
> A question to consider is what would you consider the minimum bar for
> merging?

My two cents.
On 3/24/22 23:11, Jason Gunthorpe wrote:
> On Thu, Mar 24, 2022 at 04:04:03PM -0600, Alex Williamson wrote:
>> On Wed, 23 Mar 2022 21:33:42 -0300
>> Jason Gunthorpe <jgg@nvidia.com> wrote:
>>> On Wed, Mar 23, 2022 at 04:51:25PM -0600, Alex Williamson wrote:
>>> I don't think this is compatibility. No kernel today triggers qemu to
>>> use this feature as no kernel supports live migration. No existing
>>> qemu will trigger this feature with new kernels that support live
>>> migration v2. Therefore we can adjust qemu's dirty tracking at the
>>> same time we enable migration v2 in qemu.
>>
>> I guess I was assuming that enabling v2 migration in QEMU was dependent
>> on the existing type1 dirty tracking because it's the only means we
>> have to tell QEMU that all memory is perpetually dirty when we have a
>> DMA device. Is that not correct?
>
> I haven't looked closely at this part in qemu, but IMHO, if qemu sees
> that it has VFIO migration support but does not have any DMA dirty
> tracking capability it should not do precopy flows.
>
> If there is a bug here we should certainly fix it before progressing
> the v2 patches. I'll ask Yishai & Co to take a look.

I think that's already the case.

With VFIO IOMMU type1, the kernel always exports a migration capability and the page sizes it supports. In the VMM, if that matches the page size QEMU is using (on x86 it is PAGE_SIZE), QEMU determines it will /use/ the vfio container ioctls. Which, well, I guess is always the case if the syscall is there, considering we dirty every page.

In QEMU, the start and stop of dirty tracking is actually unbounded (it attempts to do it without checking if the capability is there), although for syncing the dirties from vfio against QEMU's private tracking, it does check if dirty page tracking is supported prior to even trying the sync via the ioctl.
/Most importantly/, prior to all of this starting/stopping/syncing of dirty tracking, QEMU adds a live migration blocker if either the device doesn't support migration or the VFIO container doesn't support it (so migration won't even start). So I think the VMM knows how to deal with the lack of the dirty container ioctls, as far as my understanding goes.

TBH, I am not overly concerned with dirty page tracking in the vfio-compat layer -- I have been doing both in tandem (old and new). We mainly need to decide what we want to maintain in the compat layer. I can drop the IOMMU support code I have from vfio-compat, or we do the 'perpetual dirtying' that the current code does, or we don't support the dirty ioctls in vfio-compat at all. Maybe the latter makes more sense, as that might mimic more accurately what the hardware supports, and deprive VMMs from even starting migration. The second looks useful for testing, but dirtying all DMA-mapped memory seems to be too much in a real world migration scenario :( especially as the guest size increases.
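[Editor's note: for reference on what a vfio-compat layer would have to honor if it kept the dirty ioctls: the type1 GET_BITMAP flow uses one bit per page at the reported pgsize, with the bitmap's byte length rounded up to a whole number of u64 words, which is the sizing userspace computes before calling VFIO_IOMMU_DIRTY_PAGES. A minimal sketch of that sizing rule follows; the helper name is ours, not a kernel or QEMU symbol.]

```c
#include <assert.h>
#include <stdint.h>

/* One dirty bit per pgsize page over the range; the byte count is rounded
 * up so the bitmap is a whole number of 64-bit words, matching what the
 * type1 UAPI expects for VFIO_IOMMU_DIRTY_PAGES(..._FLAG_GET_BITMAP). */
static uint64_t dirty_bitmap_bytes(uint64_t range_bytes, uint64_t pgsize)
{
	uint64_t pages = (range_bytes + pgsize - 1) / pgsize;

	/* Round pages up to a multiple of 64 bits, then convert to bytes. */
	return ((pages + 63) / 64) * sizeof(uint64_t);
}
```

For example, a 1GiB range at 4KiB pages is 262144 pages, so the bitmap is 32768 bytes.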
On 2022/3/24 06:51, Alex Williamson wrote:
> On Fri, 18 Mar 2022 14:27:36 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
>
>> iommufd can directly implement the /dev/vfio/vfio container IOCTLs by
>> mapping them into io_pagetable operations. Doing so allows the use of
>> iommufd by symlinking /dev/vfio/vfio to /dev/iommufd. Allowing VFIO to
>> SET_CONTAINER using an iommufd instead of a container fd is a followup
>> series.
>>
>> Internally the compatibility API uses a normal IOAS object that, like
>> vfio, is automatically allocated when the first device is attached.
>>
>> Userspace can also query or set this IOAS object directly using the
>> IOMMU_VFIO_IOAS ioctl. This allows mixing and matching new iommufd-only
>> features while still using the VFIO style map/unmap ioctls.
>>
>> While this is enough to operate qemu, it is still a bit of a WIP with a
>> few gaps to be resolved:
>>
>> - Only the TYPE1v2 mode is supported where unmap cannot punch holes or
>>   split areas. The old mode can be implemented with a new operation to
>>   split an iopt_area into two without disturbing the iopt_pages or the
>>   domains, then unmapping a whole area as normal.
>>
>> - Resource limits rely on memory cgroups to bound what userspace can do
>>   instead of the module parameter dma_entry_limit.
>>
>> - VFIO P2P is not implemented. Avoiding the follow_pfn() mis-design will
>>   require some additional work to properly expose PFN lifecycle between
>>   VFIO and iommufd.
>>
>> - Various components of the mdev API are not completed yet.
>>
>> - Indefinite suspend of SW access (VFIO_DMA_MAP_FLAG_VADDR) is not
>>   implemented.
>>
>> - The 'dirty tracking' is not implemented.
>>
>> - A full audit for pedantic compatibility details (eg errnos, etc) has
>>   not yet been done.
>>
>> - powerpc SPAPR is left out, as it is not connected to the iommu_domain
>>   framework.
>> My hope is that SPAPR will be moved into the iommu_domain
>> framework as a special HW specific type and would expect power to
>> support the generic interface through a normal iommu_domain.
>
> My overall question here would be whether we can actually achieve a
> compatibility interface that has sufficient feature transparency that we
> can dump vfio code in favor of this interface, or will there be enough
> niche use cases that we need to keep type1 and vfio containers around
> through a deprecation process?
>
> The locked memory differences for one seem like something that libvirt
> wouldn't want hidden and we have questions regarding support for vaddr
> hijacking and different ideas how to implement dirty page tracking, not
> to mention the missing features that are currently well used, like p2p
> mappings, coherency tracking, mdev, etc.
>
> It seems like quite an endeavor to fill all these gaps, while at the
> same time QEMU will be working to move to use iommufd directly in order
> to gain all the new features.

Hi Alex,

Jason hasn't included the vfio changes for adapting to iommufd, but they are in this branch (https://github.com/luxis1999/iommufd/commits/iommufd-v5.17-rc6). Eric and I are working on adding iommufd support in QEMU as well. If the new QEMU is to run on old kernels, QEMU is supposed to support both the legacy group/container interface and the latest device/iommufd interface. We've got some draft code toward this direction (https://github.com/luxis1999/qemu/commits/qemu-for-5.17-rc4-vm). It works for both the legacy group/container and the device/iommufd paths. It's just for reference so far; Eric and I will have a further sync on it.

> Where do we focus attention?
> Is symlinking device files our proposal
> to userspace and is that something achievable, or do we want to use
> this compatibility interface as a means to test the interface and
> allow userspace to make use of it for transition, if their use cases
> allow it, perhaps eventually performing the symlink after deprecation
> and eventual removal of the vfio container and type1 code? Thanks,

I'm sure it is possible that one day the group/container interface will be removed from the kernel. Perhaps this will happen when SPAPR is supported by iommufd. But what about QEMU: should QEMU keep backward compatibility forever, or may QEMU one day also remove the group/container path and hence become unable to work on old kernels?
On Thu, Mar 24, 2022 at 04:04:03PM -0600, Alex Williamson wrote:
> On Wed, 23 Mar 2022 21:33:42 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> > On Wed, Mar 23, 2022 at 04:51:25PM -0600, Alex Williamson wrote:
> >
> > > My overall question here would be whether we can actually achieve a
> > > compatibility interface that has sufficient feature transparency that we
> > > can dump vfio code in favor of this interface, or will there be enough
> > > niche use cases that we need to keep type1 and vfio containers around
> > > through a deprecation process?
> >
> > Other than SPAPR, I think we can.
>
> Does this mean #ifdef CONFIG_PPC in vfio core to retain infrastructure
> for POWER support?

There are a few different levels to consider for dealing with PPC. For a suitable long term interface for ppc hosts and guests, dropping this is fine: the ppc specific iommu model was basically an ill-conceived idea from the beginning, because none of us had sufficiently understood what things were general and what things were iommu model/hw specific.

..mostly. There are several points of divergence for the ppc iommu model.

1) Limited IOVA windows. This one turned out to not really be ppc specific, and is (rightly) handled generically in the new interface. No problem here.

2) Costly GUPs. pseries (the most common ppc machine type) always expects a (v)IOMMU. That means that unlike the common x86 model of a host with IOMMU, but guests with no vIOMMU, guest initiated maps/unmaps can be a hot path. Accounting in that path can be prohibitive (and on POWER8 in particular it prevented us from optimizing that path the way we wanted). We had two solutions for that: in v1, the explicit ENABLE/DISABLE calls, which preaccounted based on the IOVA window sizes. That was improved in v2, which used the concept of preregistration. IIUC iommufd can achieve the same effect as preregistration using IOAS_COPY, so this one isn't really a problem either.

3) "dynamic DMA windows" (DDW).
The IBM IOMMU hardware allows for 2 IOVA windows, which aren't contiguous with each other. The base addresses of each of these are fixed, but the size of each window, the pagesize (i.e. granularity) of each window and the number of levels in the IOMMU pagetable are runtime configurable. Because it's true in the hardware, it's also true of the vIOMMU interface defined by the IBM hypervisor (and adopted by KVM as well). So, guests can request changes in how these windows are handled. Typical Linux guests will use the "low" window (IOVA 0..2GiB) dynamically, and the high window (IOVA 1<<60..???) to map all of RAM. However, as a hypervisor we can't count on that; the guest can use them however it wants.

(3) still needs a plan for how to fit it into the /dev/iommufd model. This is a secondary reason that in the past I advocated for the user requesting specific DMA windows which the kernel would accept or refuse, rather than having a query function - it connects easily to the DDW model. With the query-first model we'd need some sort of extension here; I'm not really sure what it should look like.

Then, there's handling existing qemu (or other software) that is using the VFIO SPAPR_TCE interfaces. First, it's not entirely clear if this should be a goal or not: as others have noted, working actively to port qemu to the new interface at the same time as making a comprehensive in-kernel compat layer is arguably redundant work.

That said, if we did want to handle this in an in-kernel compat layer, here's roughly what you'd need for SPAPR_TCE v2:

- VFIO_IOMMU_SPAPR_TCE_GET_INFO

  I think this should be fairly straightforward; the information you need should be in the now generic IOVA window stuff and would just need massaging into the expected format.
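[Editor's note: to make the two-window DDW layout above concrete, here is a toy containment check of the kind a hypervisor-side layer performs before accepting a mapping. The bases/sizes follow the typical pseries layout described in the message (low window at IOVA 0, high window at 1<<60); the real window sizes, pagesizes and pagetable levels are runtime configurable, and all names here are invented for illustration.]

```c
#include <assert.h>
#include <stdint.h>

/* Two fixed-base IOVA windows as in the DDW model; only the size is
 * modeled here, not the per-window pagesize or pagetable levels. */
struct ddw_window { uint64_t base, size; };

/* A mapping is only acceptable if it fits entirely inside one window.
 * Conditions are ordered to avoid unsigned wraparound near the top of
 * the address space. */
static int iova_fits_window(const struct ddw_window w[2],
			    uint64_t iova, uint64_t len)
{
	for (int i = 0; i < 2; i++)
		if (len <= w[i].size && iova >= w[i].base &&
		    iova - w[i].base <= w[i].size - len)
			return 1;
	return 0;
}
```
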
- VFIO_IOMMU_SPAPR_REGISTER_MEMORY / VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY

  IIUC, these could be translated into map/unmap operations onto a second implicit IOAS which represents the preregistered memory areas (to which we'd never connect an actual device). Along with this, VFIO_MAP and VFIO_UNMAP operations would need to check for this case, verify their addresses against the preregistered space and be translated into IOAS_COPY operations from the prereg address space instead of raw IOAS_MAP operations. Fiddly, but not fundamentally hard, I think.

For SPAPR_TCE_v1 things are a bit trickier:

- VFIO_IOMMU_ENABLE / VFIO_IOMMU_DISABLE

  I suspect you could get away with implementing these as no-ops. It wouldn't be strictly correct, but I think software which is using the interface correctly should work this way, though possibly not optimally. That might be good enough for this ugly old interface.

And... then there's VFIO_EEH_PE_OP. It's very hard to know what to do with this because the interface was completely broken for most of its lifetime. EEH is a fancy error handling feature of IBM PCI hardware, somewhat similar in concept, though not interface, to PCIe AER. I have a very strong impression that while this was a much-touted checkbox feature for RAS, no-one, ever, actually used it. As evidenced by the fact that there was, I believe, over a *decade* in which all the interfaces were completely broken by design, and apparently no-one noticed. So, cynically, you could probably get away with making this a no-op as well.

If you wanted to do it properly... well... that would require training up yet another person to actually understand this and hoping they get it done before they run screaming. This one gets very ugly because the EEH operations have to operate on the hardware (or firmware) "Partitionable Endpoints" (PEs), which correspond one to one with IOMMU groups, but not necessarily with VFIO containers, but there's not really any sensible way to expose that to users.
You might be able to do this by simply failing this outright if there's anything other than exactly one IOMMU group bound to the container / IOAS (which I think might be what VFIO itself does now). Handling that with a device centric API gets somewhat fiddlier, of course.
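To make the REGISTER_MEMORY / MAP translation above concrete, here is a toy model (all struct and function names here are invented for illustration, not the real vfio or iommufd API): the accounting charge happens once at registration time, and the compat map path only validates against the preregistered space, standing in for an IOAS_COPY out of the prereg IOAS rather than a raw IOAS_MAP.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: one preregistered user-memory region, accounted once */
struct prereg_space {
	uint64_t uaddr;
	uint64_t length;
	uint64_t accounted;	/* pages charged at registration time */
};

static int spapr_register_memory(struct prereg_space *p,
				 uint64_t uaddr, uint64_t length)
{
	p->uaddr = uaddr;
	p->length = length;
	p->accounted = length / 4096;	/* the charge happens here, not on map */
	return 0;
}

/*
 * VFIO_MAP in the compat layer: check the source address falls inside
 * the preregistered space; a real implementation would then issue an
 * IOAS_COPY from the prereg IOAS rather than a raw IOAS_MAP.
 */
static int compat_vfio_map(const struct prereg_space *p,
			   uint64_t uaddr, uint64_t length)
{
	if (uaddr < p->uaddr || uaddr + length > p->uaddr + p->length)
		return -1;	/* not preregistered: reject */
	return 0;		/* would be IOAS_COPY; no accounting in this path */
}
```

The point of the shape is that the guest-driven map path does no accounting at all, which was the whole motivation for preregistration.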
On Fri, Apr 29, 2022 at 12:53:16AM +1000, David Gibson wrote: > 2) Costly GUPs. pseries (the most common ppc machine type) always > expects a (v)IOMMU. That means that unlike the common x86 model of a > host with IOMMU, but guests with no-vIOMMU, guest initiated > maps/unmaps can be a hot path. Accounting in that path can be > prohibitive (and on POWER8 in particular it prevented us from > optimizing that path the way we wanted). We had two solutions for > that, in v1 the explicit ENABLE/DISABLE calls, which preaccounted > based on the IOVA window sizes. That was improved in the v2 which > used the concept of preregistration. IIUC iommufd can achieve the > same effect as preregistration using IOAS_COPY, so this one isn't > really a problem either. I think PPC and S390 are solving the same problem here. I think S390 is going to go to a SW nested model where it has an iommu_domain controlled by iommufd that is populated with the pinned pages, eg stored in an xarray. Then the performance map/unmap path is simply copying pages from the xarray to the real IOPTEs - and this would be modeled as a nested iommu_domain with a SW vIOPTE walker instead of a HW vIOPTE walker. Perhaps this is agreeable for PPC too? > 3) "dynamic DMA windows" (DDW). The IBM IOMMU hardware allows for 2 IOVA > windows, which aren't contiguous with each other. The base addresses > of each of these are fixed, but the size of each window, the pagesize > (i.e. granularity) of each window and the number of levels in the > IOMMU pagetable are runtime configurable. Because it's true in the > hardware, it's also true of the vIOMMU interface defined by the IBM > hypervisor (and adpoted by KVM as well). So, guests can request > changes in how these windows are handled. Typical Linux guests will > use the "low" window (IOVA 0..2GiB) dynamically, and the high window > (IOVA 1<<60..???) to map all of RAM. However, as a hypervisor we > can't count on that; the guest can use them however it wants. 
As part of nesting, iommufd will have a 'create iommu_domain using iommu driver specific data' primitive.

The driver specific data for PPC can include a description of these windows, so the PPC specific qemu driver can issue this new ioctl using the information provided by the guest.

The main issue is that internally to the iommu subsystem the iommu_domain aperture is assumed to be a single window. This kAPI will have to be improved to model the PPC multi-window iommu_domain.

If this API is not used then the PPC driver should choose some sensible default windows that make things like DPDK happy.

> Then, there's handling existing qemu (or other software) that is using
> the VFIO SPAPR_TCE interfaces. First, it's not entirely clear if this
> should be a goal or not: as others have noted, working actively to
> port qemu to the new interface at the same time as making a
> comprehensive in-kernel compat layer is arguably redundant work.

At the moment I think I would stick with not including the SPAPR interfaces in vfio_compat, but there does seem to be a path if someone with HW wants to build and test them?

> You might be able to do this by simply failing this outright if
> there's anything other than exactly one IOMMU group bound to the
> container / IOAS (which I think might be what VFIO itself does now).
> Handling that with a device centric API gets somewhat fiddlier, of
> course.

Maybe every device gets a copy of the error notification? ie maybe this should be part of vfio_pci and not part of iommufd, to mirror how AER works?

It feels strange to put device error notification into iommufd; is that connected to the IOMMU?

Jason
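The SW nested model described in this mail can be caricatured in a few lines of C (a toy model only: the real thing would use an xarray of pinned pages and real IOPTEs; these names are invented): pinning and accounting happen once on the parent, and the guest-driven hot path is a plain copy with no GUP and no accounting.

```c
#include <assert.h>
#include <stdint.h>

#define NPAGES 16

/*
 * Toy stand-in for the xarray of pinned pages: index = IOVA page
 * number in the parent IOAS, value = pinned PFN (0 = not pinned).
 */
static uint64_t pinned_pfn[NPAGES];

/* "Real" IOPTEs of the nested domain */
static uint64_t iopte[NPAGES];

/* Slow path: pin + account once, outside the guest's hot path */
static void parent_pin(uint64_t index, uint64_t pfn)
{
	pinned_pfn[index] = pfn;
}

/*
 * Hot path, as a guest H_PUT_TCE-style map would be: no GUP, no
 * accounting, just copy the already-pinned PFN into the IOPTE.
 */
static int nested_map(uint64_t index)
{
	if (!pinned_pfn[index])
		return -1;	/* guest mapped memory it never preregistered */
	iopte[index] = pinned_pfn[index];
	return 0;
}

static void nested_unmap(uint64_t index)
{
	iopte[index] = 0;
}
```

This is the sense in which the S390 and PPC problems look the same: the SW vIOPTE walker separates the expensive pin from the cheap per-map copy.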
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, April 28, 2022 11:11 PM
>
> > 3) "dynamic DMA windows" (DDW). The IBM IOMMU hardware allows for 2 IOVA
> >    windows, which aren't contiguous with each other. The base addresses
> >    of each of these are fixed, but the size of each window, the pagesize
> >    (i.e. granularity) of each window and the number of levels in the
> >    IOMMU pagetable are runtime configurable. Because it's true in the
> >    hardware, it's also true of the vIOMMU interface defined by the IBM
> >    hypervisor (and adpoted by KVM as well). So, guests can request
> >    changes in how these windows are handled. Typical Linux guests will
> >    use the "low" window (IOVA 0..2GiB) dynamically, and the high window
> >    (IOVA 1<<60..???) to map all of RAM. However, as a hypervisor we
> >    can't count on that; the guest can use them however it wants.
>
> As part of nesting iommufd will have a 'create iommu_domain using
> iommu driver specific data' primitive.
>
> The driver specific data for PPC can include a description of these
> windows so the PPC specific qemu driver can issue this new ioctl
> using the information provided by the guest.
>
> The main issue is that internally to the iommu subsystem the
> iommu_domain aperture is assumed to be a single window. This kAPI will
> have to be improved to model the PPC multi-window iommu_domain.

From the point of view of nesting, probably each window can be a separate domain; then the existing aperture should still work?
On Thu, Apr 28, 2022 at 12:10:37PM -0300, Jason Gunthorpe wrote: > On Fri, Apr 29, 2022 at 12:53:16AM +1000, David Gibson wrote: > > > 2) Costly GUPs. pseries (the most common ppc machine type) always > > expects a (v)IOMMU. That means that unlike the common x86 model of a > > host with IOMMU, but guests with no-vIOMMU, guest initiated > > maps/unmaps can be a hot path. Accounting in that path can be > > prohibitive (and on POWER8 in particular it prevented us from > > optimizing that path the way we wanted). We had two solutions for > > that, in v1 the explicit ENABLE/DISABLE calls, which preaccounted > > based on the IOVA window sizes. That was improved in the v2 which > > used the concept of preregistration. IIUC iommufd can achieve the > > same effect as preregistration using IOAS_COPY, so this one isn't > > really a problem either. > > I think PPC and S390 are solving the same problem here. I think S390 > is going to go to a SW nested model where it has an iommu_domain > controlled by iommufd that is populated with the pinned pages, eg > stored in an xarray. > > Then the performance map/unmap path is simply copying pages from the > xarray to the real IOPTEs - and this would be modeled as a nested > iommu_domain with a SW vIOPTE walker instead of a HW vIOPTE walker. > > Perhaps this is agreeable for PPC too? Uh.. maybe? Note that I'm making these comments based on working on this some years ago (the initial VFIO for ppc implementation in particular). I'm no longer actively involved in ppc kernel work. > > 3) "dynamic DMA windows" (DDW). The IBM IOMMU hardware allows for 2 IOVA > > windows, which aren't contiguous with each other. The base addresses > > of each of these are fixed, but the size of each window, the pagesize > > (i.e. granularity) of each window and the number of levels in the > > IOMMU pagetable are runtime configurable. 
Because it's true in the > > hardware, it's also true of the vIOMMU interface defined by the IBM > > hypervisor (and adpoted by KVM as well). So, guests can request > > changes in how these windows are handled. Typical Linux guests will > > use the "low" window (IOVA 0..2GiB) dynamically, and the high window > > (IOVA 1<<60..???) to map all of RAM. However, as a hypervisor we > > can't count on that; the guest can use them however it wants. > > As part of nesting iommufd will have a 'create iommu_domain using > iommu driver specific data' primitive. > > The driver specific data for PPC can include a description of these > windows so the PPC specific qemu driver can issue this new ioctl > using the information provided by the guest. Hmm.. not sure if that works. At the moment, qemu (for example) needs to set up the domains/containers/IOASes as it constructs the machine, because that's based on the virtual hardware topology. Initially they use the default windows (0..2GiB first window, second window disabled). Only once the guest kernel is up and running does it issue the hypercalls to set the final windows as it prefers. In theory the guest could change them during runtime though it's unlikely in practice. They could change during machine lifetime in practice, though, if you rebooted from one guest kernel to another that uses a different configuration. *Maybe* IOAS construction can be deferred somehow, though I'm not sure because the assigned devices need to live somewhere. > The main issue is that internally to the iommu subsystem the > iommu_domain aperture is assumed to be a single window. This kAPI will > have to be improved to model the PPC multi-window iommu_domain. Right. > If this API is not used then the PPC driver should choose some > sensible default windows that makes things like DPDK happy. > > > Then, there's handling existing qemu (or other software) that is using > > the VFIO SPAPR_TCE interfaces. 
> > First, it's not entirely clear if this
> > should be a goal or not: as others have noted, working actively to
> > port qemu to the new interface at the same time as making a
> > comprehensive in-kernel compat layer is arguably redundant work.
>
> At the moment I think I would stick with not including the SPAPR
> interfaces in vfio_compat, but there does seem to be a path if someone
> with HW wants to build and test them?
>
> > You might be able to do this by simply failing this outright if
> > there's anything other than exactly one IOMMU group bound to the
> > container / IOAS (which I think might be what VFIO itself does now).
> > Handling that with a device centric API gets somewhat fiddlier, of
> > course.
>
> Maybe every device gets a copy of the error notification?

Alas, it's harder than that. One of the things that can happen on an EEH fault is that the entire PE gets suspended (blocking both DMA and MMIO, IIRC) until the proper recovery steps are taken. Since that's handled at the hardware/firmware level, it will obviously only affect the host side PE (== host iommu group). However, the interfaces we have only allow things to be reported to the guest at the granularity of a guest side PE (== container/IOAS == guest host bridge in practice). So to handle this correctly when guest PE != host PE we'd need to synchronize suspended / recovery state between all the host PEs in the guest PE. That *might* be technically possible, but it's really damn fiddly.

> ie maybe this should be part of vfio_pci and not part of iommufd to
> mirror how AER works?
>
> It feels strange to put in device error notification to iommufd, is
> that connected to the IOMMU?

Only in that it operates at the granularity of a PE, which is mostly an IOMMU concept.
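As a toy illustration of the granularity mismatch described above (hypothetical structures, not kernel code): the guest PE's visible error state would have to be the union of the states of its backing host PEs, and any recovery action would have to be fanned out to all of them to keep them synchronized.

```c
#include <assert.h>
#include <stdbool.h>

#define HOST_PES 4

/*
 * Toy model: one guest PE backed by several host PEs.  The
 * guest-visible state is derived from all of the host states.
 */
struct guest_pe {
	bool host_frozen[HOST_PES];
};

static bool guest_pe_frozen(const struct guest_pe *pe)
{
	for (int i = 0; i < HOST_PES; i++)
		if (pe->host_frozen[i])
			return true;	/* any frozen host PE freezes the guest PE */
	return false;
}

static void guest_pe_recover(struct guest_pe *pe)
{
	/* recovery must be mirrored to every host PE, not just the faulty one */
	for (int i = 0; i < HOST_PES; i++)
		pe->host_frozen[i] = false;
}
```

Even this tiny sketch shows where it gets hairy: if mirroring the recovery fails on one host PE, the guest PE is left in a state the guest interfaces cannot express.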
On Fri, Apr 29, 2022 at 01:21:30AM +0000, Tian, Kevin wrote: > > From: Jason Gunthorpe <jgg@nvidia.com> > > Sent: Thursday, April 28, 2022 11:11 PM > > > > > > > 3) "dynamic DMA windows" (DDW). The IBM IOMMU hardware allows for > > 2 IOVA > > > windows, which aren't contiguous with each other. The base addresses > > > of each of these are fixed, but the size of each window, the pagesize > > > (i.e. granularity) of each window and the number of levels in the > > > IOMMU pagetable are runtime configurable. Because it's true in the > > > hardware, it's also true of the vIOMMU interface defined by the IBM > > > hypervisor (and adpoted by KVM as well). So, guests can request > > > changes in how these windows are handled. Typical Linux guests will > > > use the "low" window (IOVA 0..2GiB) dynamically, and the high window > > > (IOVA 1<<60..???) to map all of RAM. However, as a hypervisor we > > > can't count on that; the guest can use them however it wants. > > > > As part of nesting iommufd will have a 'create iommu_domain using > > iommu driver specific data' primitive. > > > > The driver specific data for PPC can include a description of these > > windows so the PPC specific qemu driver can issue this new ioctl > > using the information provided by the guest. > > > > The main issue is that internally to the iommu subsystem the > > iommu_domain aperture is assumed to be a single window. This kAPI will > > have to be improved to model the PPC multi-window iommu_domain. > > > > From the point of nesting probably each window can be a separate > domain then the existing aperture should still work? Maybe. There might be several different ways to represent it, but the vital piece is that any individual device (well, group, technically) must atomically join/leave both windows at once.
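The atomic join/leave requirement could be sketched like this (a toy model with invented names, not the iommu core API): attaching a group either succeeds against both windows or is fully rolled back, so a device is never visible in only one of the two.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: a group is attached to both windows or to neither */
struct ppc_domain {
	bool low_attached[8];	/* per-group, 0..2GiB window */
	bool high_attached[8];	/* per-group, 1<<60.. window */
};

static int window_attach(bool *attached, int group, bool fail)
{
	if (fail)
		return -1;
	attached[group] = true;
	return 0;
}

/*
 * Attach a group to both windows atomically: if the second window
 * rejects the group, roll the first back so we never end up half
 * attached.
 */
static int domain_attach_group(struct ppc_domain *d, int group,
			       bool high_fails)
{
	if (window_attach(d->low_attached, group, false))
		return -1;
	if (window_attach(d->high_attached, group, high_fails)) {
		d->low_attached[group] = false;	/* rollback */
		return -1;
	}
	return 0;
}
```

Modeling both windows inside one domain makes this rollback trivial; with one domain per window the same guarantee would need cross-domain coordination.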
On Fri, Apr 29, 2022 at 04:20:36PM +1000, David Gibson wrote: > > I think PPC and S390 are solving the same problem here. I think S390 > > is going to go to a SW nested model where it has an iommu_domain > > controlled by iommufd that is populated with the pinned pages, eg > > stored in an xarray. > > > > Then the performance map/unmap path is simply copying pages from the > > xarray to the real IOPTEs - and this would be modeled as a nested > > iommu_domain with a SW vIOPTE walker instead of a HW vIOPTE walker. > > > > Perhaps this is agreeable for PPC too? > > Uh.. maybe? Note that I'm making these comments based on working on > this some years ago (the initial VFIO for ppc implementation in > particular). I'm no longer actively involved in ppc kernel work. OK > > > 3) "dynamic DMA windows" (DDW). The IBM IOMMU hardware allows for 2 IOVA > > > windows, which aren't contiguous with each other. The base addresses > > > of each of these are fixed, but the size of each window, the pagesize > > > (i.e. granularity) of each window and the number of levels in the > > > IOMMU pagetable are runtime configurable. Because it's true in the > > > hardware, it's also true of the vIOMMU interface defined by the IBM > > > hypervisor (and adpoted by KVM as well). So, guests can request > > > changes in how these windows are handled. Typical Linux guests will > > > use the "low" window (IOVA 0..2GiB) dynamically, and the high window > > > (IOVA 1<<60..???) to map all of RAM. However, as a hypervisor we > > > can't count on that; the guest can use them however it wants. > > > > As part of nesting iommufd will have a 'create iommu_domain using > > iommu driver specific data' primitive. > > > > The driver specific data for PPC can include a description of these > > windows so the PPC specific qemu driver can issue this new ioctl > > using the information provided by the guest. > > Hmm.. not sure if that works. 
At the moment, qemu (for example) needs > to set up the domains/containers/IOASes as it constructs the machine, > because that's based on the virtual hardware topology. Initially they > use the default windows (0..2GiB first window, second window > disabled). Only once the guest kernel is up and running does it issue > the hypercalls to set the final windows as it prefers. In theory the > guest could change them during runtime though it's unlikely in > practice. They could change during machine lifetime in practice, > though, if you rebooted from one guest kernel to another that uses a > different configuration. > > *Maybe* IOAS construction can be deferred somehow, though I'm not sure > because the assigned devices need to live somewhere. This is a general requirement for all the nesting implementations, we start out with some default nested page table and then later the VM does the vIOMMU call to change it. So nesting will have to come along with some kind of 'switch domains IOCTL' In this case I would guess PPC could do the same and start out with a small (nested) iommu_domain and then create the VM's desired iommu_domain from the hypercall, and switch to it. It is a bit more CPU work since maps in the lower range would have to be copied over, but conceptually the model matches the HW nesting. > > > You might be able to do this by simply failing this outright if > > > there's anything other than exactly one IOMMU group bound to the > > > container / IOAS (which I think might be what VFIO itself does now). > > > Handling that with a device centric API gets somewhat fiddlier, of > > > course. > > > > Maybe every device gets a copy of the error notification? > > Alas, it's harder than that. One of the things that can happen on an > EEH fault is that the entire PE gets suspended (blocking both DMA and > MMIO, IIRC) until the proper recovery steps are taken. 
I think qemu would have to de-duplicate the duplicated device notifications, and then it can go from a device notification to the device's iommu_group to the IOAS to the vPE?

A simple serial number in the event would make this pretty simple.

The way back to clear the event would just forward the commands through a random device in the iommu_group to the PE?

Thanks,
Jason
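The serial-number idea might look roughly like this on the userspace side (a toy sketch; the event layout and names are invented): events fanned out to every device in a group carry the same serial, so only the first one seen starts recovery work.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace-side de-duplication of per-device EEH events: devices in
 * the same iommu_group would share a serial number per fault, so the
 * first event seen wins and the copies are dropped.
 */
static uint64_t last_serial;	/* 0 = nothing seen yet */

/* Returns 1 if this event starts new recovery work, 0 if a duplicate */
static int eeh_event_seen(uint64_t serial)
{
	if (serial == last_serial)
		return 0;
	last_serial = serial;
	return 1;
}
```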
On Fri, Apr 29, 2022 at 04:22:56PM +1000, David Gibson wrote: > On Fri, Apr 29, 2022 at 01:21:30AM +0000, Tian, Kevin wrote: > > > From: Jason Gunthorpe <jgg@nvidia.com> > > > Sent: Thursday, April 28, 2022 11:11 PM > > > > > > > > > > 3) "dynamic DMA windows" (DDW). The IBM IOMMU hardware allows for > > > 2 IOVA > > > > windows, which aren't contiguous with each other. The base addresses > > > > of each of these are fixed, but the size of each window, the pagesize > > > > (i.e. granularity) of each window and the number of levels in the > > > > IOMMU pagetable are runtime configurable. Because it's true in the > > > > hardware, it's also true of the vIOMMU interface defined by the IBM > > > > hypervisor (and adpoted by KVM as well). So, guests can request > > > > changes in how these windows are handled. Typical Linux guests will > > > > use the "low" window (IOVA 0..2GiB) dynamically, and the high window > > > > (IOVA 1<<60..???) to map all of RAM. However, as a hypervisor we > > > > can't count on that; the guest can use them however it wants. > > > > > > As part of nesting iommufd will have a 'create iommu_domain using > > > iommu driver specific data' primitive. > > > > > > The driver specific data for PPC can include a description of these > > > windows so the PPC specific qemu driver can issue this new ioctl > > > using the information provided by the guest. > > > > > > The main issue is that internally to the iommu subsystem the > > > iommu_domain aperture is assumed to be a single window. This kAPI will > > > have to be improved to model the PPC multi-window iommu_domain. > > > > > > > From the point of nesting probably each window can be a separate > > domain then the existing aperture should still work? > > Maybe. There might be several different ways to represent it, but the > vital piece is that any individual device (well, group, technically) > must atomically join/leave both windows at once. 
I'm not keen on multi-iommu_domains because it means we have to create the idea that a device can be attached to multiple iommu_domains, which we don't have at all today. A single iommu_domain also lets PPC implement its special rules, like the atomicness above.

Jason
On Fri, Apr 29, 2022 at 09:50:30AM -0300, Jason Gunthorpe wrote: > On Fri, Apr 29, 2022 at 04:22:56PM +1000, David Gibson wrote: > > On Fri, Apr 29, 2022 at 01:21:30AM +0000, Tian, Kevin wrote: > > > > From: Jason Gunthorpe <jgg@nvidia.com> > > > > Sent: Thursday, April 28, 2022 11:11 PM > > > > > > > > > > > > > 3) "dynamic DMA windows" (DDW). The IBM IOMMU hardware allows for > > > > 2 IOVA > > > > > windows, which aren't contiguous with each other. The base addresses > > > > > of each of these are fixed, but the size of each window, the pagesize > > > > > (i.e. granularity) of each window and the number of levels in the > > > > > IOMMU pagetable are runtime configurable. Because it's true in the > > > > > hardware, it's also true of the vIOMMU interface defined by the IBM > > > > > hypervisor (and adpoted by KVM as well). So, guests can request > > > > > changes in how these windows are handled. Typical Linux guests will > > > > > use the "low" window (IOVA 0..2GiB) dynamically, and the high window > > > > > (IOVA 1<<60..???) to map all of RAM. However, as a hypervisor we > > > > > can't count on that; the guest can use them however it wants. > > > > > > > > As part of nesting iommufd will have a 'create iommu_domain using > > > > iommu driver specific data' primitive. > > > > > > > > The driver specific data for PPC can include a description of these > > > > windows so the PPC specific qemu driver can issue this new ioctl > > > > using the information provided by the guest. > > > > > > > > The main issue is that internally to the iommu subsystem the > > > > iommu_domain aperture is assumed to be a single window. This kAPI will > > > > have to be improved to model the PPC multi-window iommu_domain. > > > > > > > > > > From the point of nesting probably each window can be a separate > > > domain then the existing aperture should still work? > > > > Maybe. 
There might be several different ways to represent it, but the > > vital piece is that any individual device (well, group, technically) > > must atomically join/leave both windows at once. > > I'm not keen on the multi-iommu_domains because it means we have to > create the idea that a device can be attached to multiple > iommu_domains, which we don't have at all today. > > Since iommu_domain allows PPC to implement its special rules, like the > atomicness above. I tend to agree; I think extending the iommu domain concept to incorporate multiple windows makes more sense than extending to allow multiple domains per device. I'm just saying there might be other ways of representing this, and that's not a sticking point for me as long as the right properties can be preserved.
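A multi-window aperture check inside one iommu_domain might look something like this (hypothetical: today's iommu_domain geometry has a single aperture, so this is a sketch of the kAPI improvement being discussed, with invented names): a map is valid only if it fits entirely inside one of the windows.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical extension of the single-aperture iommu_domain geometry
 * to the two PPC windows.
 */
struct iommu_window {
	uint64_t start;
	uint64_t last;		/* inclusive */
};

/* Valid only if the range fits entirely inside one window */
static int domain_check_map(const struct iommu_window *wins, int nwins,
			    uint64_t iova, uint64_t length)
{
	for (int i = 0; i < nwins; i++) {
		if (iova >= wins[i].start &&
		    iova + length - 1 <= wins[i].last)
			return 0;
	}
	return -1;	/* outside (or straddling) every window */
}
```

Because both windows live in one domain, the attach/detach path stays a single operation, which is exactly the atomicity property discussed above.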
On Fri, Apr 29, 2022 at 09:48:38AM -0300, Jason Gunthorpe wrote: > On Fri, Apr 29, 2022 at 04:20:36PM +1000, David Gibson wrote: > > > > I think PPC and S390 are solving the same problem here. I think S390 > > > is going to go to a SW nested model where it has an iommu_domain > > > controlled by iommufd that is populated with the pinned pages, eg > > > stored in an xarray. > > > > > > Then the performance map/unmap path is simply copying pages from the > > > xarray to the real IOPTEs - and this would be modeled as a nested > > > iommu_domain with a SW vIOPTE walker instead of a HW vIOPTE walker. > > > > > > Perhaps this is agreeable for PPC too? > > > > Uh.. maybe? Note that I'm making these comments based on working on > > this some years ago (the initial VFIO for ppc implementation in > > particular). I'm no longer actively involved in ppc kernel work. > > OK > > > > > 3) "dynamic DMA windows" (DDW). The IBM IOMMU hardware allows for 2 IOVA > > > > windows, which aren't contiguous with each other. The base addresses > > > > of each of these are fixed, but the size of each window, the pagesize > > > > (i.e. granularity) of each window and the number of levels in the > > > > IOMMU pagetable are runtime configurable. Because it's true in the > > > > hardware, it's also true of the vIOMMU interface defined by the IBM > > > > hypervisor (and adpoted by KVM as well). So, guests can request > > > > changes in how these windows are handled. Typical Linux guests will > > > > use the "low" window (IOVA 0..2GiB) dynamically, and the high window > > > > (IOVA 1<<60..???) to map all of RAM. However, as a hypervisor we > > > > can't count on that; the guest can use them however it wants. > > > > > > As part of nesting iommufd will have a 'create iommu_domain using > > > iommu driver specific data' primitive. 
> > > > > > The driver specific data for PPC can include a description of these > > > windows so the PPC specific qemu driver can issue this new ioctl > > > using the information provided by the guest. > > > > Hmm.. not sure if that works. At the moment, qemu (for example) needs > > to set up the domains/containers/IOASes as it constructs the machine, > > because that's based on the virtual hardware topology. Initially they > > use the default windows (0..2GiB first window, second window > > disabled). Only once the guest kernel is up and running does it issue > > the hypercalls to set the final windows as it prefers. In theory the > > guest could change them during runtime though it's unlikely in > > practice. They could change during machine lifetime in practice, > > though, if you rebooted from one guest kernel to another that uses a > > different configuration. > > > > *Maybe* IOAS construction can be deferred somehow, though I'm not sure > > because the assigned devices need to live somewhere. > > This is a general requirement for all the nesting implementations, we > start out with some default nested page table and then later the VM > does the vIOMMU call to change it. So nesting will have to come along > with some kind of 'switch domains IOCTL' > > In this case I would guess PPC could do the same and start out with a > small (nested) iommu_domain and then create the VM's desired > iommu_domain from the hypercall, and switch to it. > > It is a bit more CPU work since maps in the lower range would have to > be copied over, but conceptually the model matches the HW nesting. Ah.. ok. IIUC what you're saying is that the kernel-side IOASes have fixed windows, but we fake dynamic windows in the userspace implementation by flipping the devices over to a new IOAS with the new windows. Is that right? Where exactly would the windows be specified? 
My understanding was that when creating a back-end specific IOAS, that would typically be for the case where you're using a user / guest managed IO pagetable, with the backend specifying the format for that. In the ppc case we'd need to specify the windows, but we'd still need the IOAS_MAP/UNMAP operations to manage the mappings. The PAPR vIOMMU is paravirtualized, so all updates come via hypercalls; there's no user/guest managed data structure.

That should work from the point of view of the userspace and guest side interfaces. It might be fiddly from the point of view of the back end. The ppc iommu doesn't really have the notion of configurable domains - instead the address spaces are the hardware or firmware fixed PEs, so they have a fixed set of devices. At the bare metal level it's possible to sort of do domains by making the actual pagetable pointers for several PEs point to a common place.

However, in the future, nested KVM under PowerVM is likely to be the norm. In that situation the L1 as well as the L2 only has the paravirtualized interfaces, which don't have any notion of domains, only PEs. All updates take place via hypercalls which explicitly specify a PE (strictly speaking they take a "Logical IO Bus Number" (LIOBN), but those generally map one to one with PEs), so it can't use shared pointer tricks either.

Now obviously we can choose devices to share a logical iommu domain by just duplicating mappings across the PEs, but what we can't do is construct mappings in a domain and *then* atomically move devices into it.

So, here's an alternative set of interfaces that should work for ppc; maybe you can tell me whether they also work for x86 and others:

* Each domain/IOAS has a concept of one or more IOVA windows, which each
  have a base address, size, pagesize (granularity) and optionally other
  flags/attributes.
  - This has some bearing on hardware capabilities, but is primarily a
    software notion
* MAP/UNMAP operations are only permitted within an existing IOVA window
  (and with addresses aligned to the window's pagesize)
  - This is enforced by software whether or not it is required by the
    underlying hardware
* Likewise IOAS_COPY operations are only permitted if the source and
  destination windows have compatible attributes
* A newly created kernel-managed IOAS has *no* IOVA windows
* A CREATE_WINDOW operation is added
  - This takes a size, pagesize/granularity, optional base address and
    optional additional attributes
  - If any of the specified attributes are incompatible with the
    underlying hardware, the operation fails
    - "Compatibility" doesn't require an exact match: creating a small
      window within the hardware's much larger valid address range is
      fine, as is creating a window with a pagesize that's a multiple of
      the hardware pagesize
  - Unspecified attributes (including base address) are allocated by the
    kernel, based on the hardware capabilities/configuration
  - The properties of the window (including allocated ones) are recorded
    with the IOAS/domain
  - For IOMMUs like ppc which don't have fixed ranges, this would create
    windows on the backend to match, if possible, otherwise fail
* A REMOVE_WINDOW operation reverses a single CREATE_WINDOW
  - This would also delete any MAPs within that window
* ATTACH operations recalculate the hardware capabilities (as now), then
  verify them against the already created windows, and fail if the
  existing windows can't be preserved

MAP/UNMAP, CREATE/REMOVE_WINDOW and ATTACH/DETACH operations can happen in any order, though whether they succeed may depend on the order in some cases.

So, for a userspace driver setup (e.g. DPDK) the typical sequence would be:

1. Create IOAS
2. ATTACH all the devices we want to use
3. CREATE_WINDOW with the size and pagesize we need, other attributes
   unspecified
   - Because we already attached the devices, this will allocate a
     suitable base address and attributes for us
4. IOAS_MAP the buffers we want into the window that was allocated for us
5. Do work

For qemu or another hypervisor with a ppc (PAPR) guest, the sequence would be:

1. Create IOAS for each guest side bus
2. Determine the expected IOVA ranges based on the guest configuration
   (model of vIOMMU, what virtual devices are present)
3. CREATE_WINDOW ranges for the default guest side windows. This will
   have base address specified
   - If this fails, we're out of luck, report error and quit
4. ATTACH devices that are on the guest bus associated with each IOAS
   - Again, if this fails, report error and quit
5. Start machine, boot guest
6. During runtime:
   - If the guest attempts to create new windows, do new CREATE_WINDOW
     operations. If they fail, fail the initiating hypercall
   - If the guest maps things into the IO pagetable, IOAS_MAP them. If
     that fails, fail the hypercall
   - If the user hotplugs a device, try to ATTACH it. If that fails,
     fail the hotplug operation.
   - If the machine is reset, REMOVE_WINDOW whatever's there and
     CREATE_WINDOW the defaults again

For qemu with an x86 guest, the sequence would be:

1. Create default IOAS
2. CREATE_WINDOW for all guest memory blocks
   - If it fails, report and quit
3. IOAS_MAP guest RAM into default IOAS so that GPA==IOVA
4. ATTACH all virtual devices to the default IOAS
   - If it fails, report and quit
5. Start machine, boot guest
6. During runtime:
   - If the guest assigns devices to domains on the vIOMMU, create new
     IOAS for each guest domain, and ATTACH the device to the new IOAS
   - If the guest triggers a vIOMMU cache flush, (re)mirror the guest IO
     pagetables into the host using IOAS_MAP or COPY
   - If the user hotplugs a device, ATTACH it to the default domain. If
     that fails, fail the hotplug
   - If the user hotplugs memory, CREATE_WINDOW in the default IOAS for
     the new memory region. If that fails, fail the hotplug
   - On system reset, re-ATTACH all devices to default IOAS, remove all
     non-default IOAS

> > > > You might be able to do this by simply failing this outright if
> > > > there's anything other than exactly one IOMMU group bound to the
> > > > container / IOAS (which I think might be what VFIO itself does now).
> > > > Handling that with a device centric API gets somewhat fiddlier, of
> > > > course.
> > >
> > > Maybe every device gets a copy of the error notification?
> >
> > Alas, it's harder than that. One of the things that can happen on an
> > EEH fault is that the entire PE gets suspended (blocking both DMA and
> > MMIO, IIRC) until the proper recovery steps are taken.
>
> I think qemu would have to de-duplicate the duplicated device
> notifications and then it can go from a device notification to the
> device's iommu_group to the IOAS to the vPE?

It's not about the notifications. When the PE goes into suspended/error state, that changes guest visible behaviour of devices. It essentially blocks all IO on the bus (the design philosophy is to prioritize not propagating bad data above all else). That state is triggered on (certain sorts of) IO errors and will remain until explicit recovery steps are taken. The guest might see that IO blockage before it gets/processes the notifications (because concurrency).

On the host side only the host PE will be affected, but on the guest side we can only report and recover things at the guest PE level, which might be larger than a host PE. So the problem is if we have a guest PE with multiple host PEs, and we trip an error which puts one of the host PEs into error state, we have what's essentially an inconsistent, invalid state from the point of view of the guest interfaces. The guest has no way to sensibly query or recover from this.
To make this work, essentially we'd have to trap the error on the
host, then inject errors to put all of the host PEs in the guest PE
into the same error state. When the guest takes recovery actions we'd
need to duplicate those actions on all the host PEs. If one of those
steps fails on one of the host PEs we have to work out how to report
that and bring the PEs back into a synchronized state. Like I said,
it might be possible, but it's really, really hairy. All for a
feature that I'm not convinced anyone has ever used in earnest.

> A simple serial number in the event would make this pretty simple.
>
> The way back to clear the event would just forward the commands
> through a random device in the iommu_group to the PE?
>
> Thanks,
> Jason
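Stepping back to the window proposal at the top of this message, a toy
user-space model of the DPDK-style sequence may make the intended
semantics concrete. Everything here is hypothetical: CREATE_WINDOW
does not exist in the posted series, and the structs, function names
and allocation policy are invented purely for illustration.

```c
/* Toy model of the proposed DPDK-style flow.  All names are invented
 * stand-ins for hypothetical iommufd ioctls, not real kernel API. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct hw_caps {                      /* what a device's IOMMU allows */
	uint64_t aper_start, aper_end;    /* inclusive IOVA range */
	uint64_t pagesize;
};

struct window {
	uint64_t base, size, pagesize;
	bool valid;
};

struct ioas {
	struct hw_caps hw;    /* intersection over attached devices */
	struct window win;    /* a single window, for simplicity */
};

/* Step 2: each ATTACH narrows the usable range and raises the minimum
 * pagesize to what the newly attached device's IOMMU supports. */
static void attach_device(struct ioas *ioas, const struct hw_caps *dev)
{
	if (dev->aper_start > ioas->hw.aper_start)
		ioas->hw.aper_start = dev->aper_start;
	if (dev->aper_end < ioas->hw.aper_end)
		ioas->hw.aper_end = dev->aper_end;
	if (dev->pagesize > ioas->hw.pagesize)
		ioas->hw.pagesize = dev->pagesize;
}

/* Step 3: CREATE_WINDOW with base unspecified - since the devices are
 * already attached, the kernel can pick a base that fits them all. */
static bool create_window(struct ioas *ioas, uint64_t size, uint64_t pagesize)
{
	if (pagesize == 0 || pagesize % ioas->hw.pagesize != 0)
		return false;  /* must be a multiple of the hw pagesize */
	if (size > ioas->hw.aper_end - ioas->hw.aper_start + 1)
		return false;  /* doesn't fit the remaining aperture */
	ioas->win = (struct window){
		.base = ioas->hw.aper_start,
		.size = size,
		.pagesize = pagesize,
		.valid = true,
	};
	return true;
}

/* Step 4: IOAS_MAP is only legal inside the window, pagesize aligned. */
static bool ioas_map(const struct ioas *ioas, uint64_t iova, uint64_t len)
{
	const struct window *w = &ioas->win;

	return w->valid && iova % w->pagesize == 0 &&
	       iova >= w->base && iova + len <= w->base + w->size;
}
```

The point of the model is the ordering: because ATTACH happens before
CREATE_WINDOW, the window allocation can take the already-narrowed
capabilities into account instead of failing later.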
On Mon, May 02, 2022 at 05:30:05PM +1000, David Gibson wrote:

> > It is a bit more CPU work since maps in the lower range would have
> > to be copied over, but conceptually the model matches the HW
> > nesting.
>
> Ah.. ok. IIUC what you're saying is that the kernel-side IOASes have
> fixed windows, but we fake dynamic windows in the userspace
> implementation by flipping the devices over to a new IOAS with the
> new windows. Is that right?

Yes

> Where exactly would the windows be specified? My understanding was
> that when creating a back-end specific IOAS, that would typically be
> for the case where you're using a user / guest managed IO pagetable,
> with the backend specifying the format for that. In the ppc case
> we'd need to specify the windows, but we'd still need the
> IOAS_MAP/UNMAP operations to manage the mappings. The PAPR vIOMMU
> is paravirtualized, so all updates come via hypercalls, so there's
> no user/guest managed data structure.

When the iommu_domain is created I want to have an
iommu-driver-specific struct, so PPC can customize its iommu_domain
however it likes.

> That should work from the point of view of the userspace and guest
> side interfaces. It might be fiddly from the point of view of the
> back end. The ppc iommu doesn't really have the notion of
> configurable domains - instead the address spaces are the hardware
> or firmware fixed PEs, so they have a fixed set of devices. At the
> bare metal level it's possible to sort of do domains by making the
> actual pagetable pointers for several PEs point to a common place.

I'm not sure I understand this - a domain is just a storage container
for an IO page table, if the HW has IOPTEs then it should be able to
have a domain?

Making page table pointers point to a common IOPTE tree is exactly
what iommu_domains are for - why is that "sort of" for ppc?

> However, in the future, nested KVM under PowerVM is likely to be the
> norm. In that situation the L1 as well as the L2 only has the
> paravirtualized interfaces, which don't have any notion of domains,
> only PEs. All updates take place via hypercalls which explicitly
> specify a PE (strictly speaking they take a "Logical IO Bus Number"
> (LIOBN), but those generally map one to one with PEs), so it can't
> use shared pointer tricks either.

How do the paravirtualized interfaces deal with the page table? Does
it call a map/unmap hypercall instead of providing guest IOPTEs?

Assuming yes, I'd expect that:

The iommu_domain for nested PPC is just a log of map/unmap hypervisor
calls to make. Whenever a new PE is attached to that domain it gets
the logged maps replayed to set it up, and when a PE is detached the
log is used to unmap everything.

It is not perfectly memory efficient - and we could perhaps talk about
an API modification to allow re-use of the iommufd datastructure
somehow, but I think this is a good logical starting point.

The PE would have to be modeled as an iommu_group.

> So, here's an alternative set of interfaces that should work for
> ppc, maybe you can tell me whether they also work for x86 and
> others:

Fundamentally PPC has to fit into the iommu standard framework of
group and domains, we can talk about modifications, but drifting too
far away is a big problem.

> * Each domain/IOAS has a concept of one or more IOVA windows, which
>   each have a base address, size, pagesize (granularity) and
>   optionally other flags/attributes.
>     * This has some bearing on hardware capabilities, but is
>       primarily a software notion

iommu_domain has the aperture, PPC will require extending this to a
list of apertures since it is currently only one window. Once a
domain is created and attached to a group the aperture should be
immutable.
> * MAP/UNMAP operations are only permitted within an existing IOVA
>   window (and with addresses aligned to the window's pagesize)
>     * This is enforced by software whether or not it is required by
>       the underlying hardware
> * Likewise IOAS_COPY operations are only permitted if the source and
>   destination windows have compatible attributes

Already done, the domain's aperture restricts all the iommufd
operations

> * A newly created kernel-managed IOAS has *no* IOVA windows

Already done, the iommufd IOAS has no iommu_domains inside it at
creation time.

> * A CREATE_WINDOW operation is added
>     * This takes a size, pagesize/granularity, optional base address
>       and optional additional attributes
>     * If any of the specified attributes are incompatible with the
>       underlying hardware, the operation fails

The iommu layer has nothing called a window. The closest thing is a
domain. I really don't want to try to make a new iommu layer object
that is so unique and special to PPC - we have to figure out how to
fit PPC into the iommu_domain model with reasonable extensions.

> > > > Maybe every device gets a copy of the error notification?
> > >
> > > Alas, it's harder than that. One of the things that can happen
> > > on an EEH fault is that the entire PE gets suspended (blocking
> > > both DMA and MMIO, IIRC) until the proper recovery steps are
> > > taken.
> >
> > I think qemu would have to de-duplicate the duplicated device
> > notifications and then it can go from a device notification to the
> > device's iommu_group to the IOAS to the vPE?
>
> It's not about the notifications.

The only thing the kernel can do is relay a notification that
something happened to a PE. The kernel gets an event on the PE basis,
I would like it to replicate it to all the devices and push it through
the VFIO device FD. qemu will de-duplicate the replicates and recover
exactly the same event the kernel saw, delivered at exactly the same
time.

If instead you want to have one event per-PE then all that changes in
the kernel is one event is generated instead of N, and qemu doesn't
have to throw away the duplicates. With either method qemu still gets
the same PE centric event, delivered at the same time, with the same
races.

This is not a general mechanism, it is some PPC specific thing to
communicate a PPC specific PE centric event to userspace. I just
prefer it in VFIO instead of iommufd.

Jason
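For what it's worth, the qemu-side de-duplication described here,
keyed on the serial number suggested earlier in the thread, could be
as small as the following sketch. The event layout and all the names
are invented for illustration; nothing like this exists in any posted
uAPI.

```c
/* Sketch of qemu-side de-duplication of replicated PE events.
 * Hypothetical: the event layout and names are invented, not part of
 * any posted uAPI. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One copy of this would arrive on every VFIO device FD in the PE;
 * the kernel stamps all replicas of a single PE event with the same
 * serial number. */
struct eeh_event {
	uint32_t serial;     /* identical across replicas of one event */
	uint32_t host_pe;    /* which host PE tripped the error */
};

struct vpe_state {
	uint32_t last_serial;    /* highest serial already handled */
	bool frozen;             /* guest-visible PE error state */
};

/* Returns true only for the first replica seen; duplicates arriving
 * via the other devices in the PE are discarded. */
static bool vpe_handle_event(struct vpe_state *vpe,
			     const struct eeh_event *ev)
{
	if (ev->serial <= vpe->last_serial)
		return false;        /* already handled via another device */
	vpe->last_serial = ev->serial;
	vpe->frozen = true;          /* mirror the PE freeze to the guest */
	return true;
}
```

Either way the kernel delivers the event, this is the whole cost of
the per-device replication on the userspace side: one compare per
replica.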
On Thu, May 05, 2022 at 04:07:28PM -0300, Jason Gunthorpe wrote:

> On Mon, May 02, 2022 at 05:30:05PM +1000, David Gibson wrote:
>
> > > It is a bit more CPU work since maps in the lower range would
> > > have to be copied over, but conceptually the model matches the
> > > HW nesting.
> >
> > Ah.. ok. IIUC what you're saying is that the kernel-side IOASes
> > have fixed windows, but we fake dynamic windows in the userspace
> > implementation by flipping the devices over to a new IOAS with the
> > new windows. Is that right?
>
> Yes
>
> > Where exactly would the windows be specified? My understanding
> > was that when creating a back-end specific IOAS, that would
> > typically be for the case where you're using a user / guest
> > managed IO pagetable, with the backend specifying the format for
> > that. In the ppc case we'd need to specify the windows, but we'd
> > still need the IOAS_MAP/UNMAP operations to manage the mappings.
> > The PAPR vIOMMU is paravirtualized, so all updates come via
> > hypercalls, so there's no user/guest managed data structure.
>
> When the iommu_domain is created I want to have an
> iommu-driver-specific struct, so PPC can customize its iommu_domain
> however it likes.

This requires that the client be aware of the host side IOMMU model.
That's true in VFIO now, and it's nasty; I was really hoping we could
*stop* doing that.

Note that I'm talking here *purely* about the non-optimized case where
all updates to the host side IO pagetables are handled by IOAS_MAP /
IOAS_COPY, with no direct hardware access to user or guest managed IO
pagetables. The optimized case obviously requires end-to-end
agreement on the pagetable format amongst other domain properties.

What I'm hoping is that qemu (or whatever) can use this non-optimized
case as a fallback where it doesn't need to know the properties of
whatever host side IOMMU models there are.
It just requests what it needs based on the vIOMMU properties it needs
to replicate and the host kernel either can supply it or can't.

In many cases it should be perfectly possible to emulate a PPC style
vIOMMU on an x86 host, because the x86 IOMMU has such a colossal
aperture that it will encompass wherever the ppc apertures end up.
Similarly we could simulate an x86-style no-vIOMMU guest on a ppc host
(currently somewhere between awkward and impossible) by placing the
host apertures to cover guest memory.

Admittedly those are pretty niche cases, but allowing for them gives
us flexibility for the future. Emulating an ARM SMMUv3 guest on an
ARM SMMU v4 or v5 or v.whatever host is likely to be a real case in
the future, and AFAICT, ARM are much less conservative than x86 about
maintaining similar hw interfaces over time. That's why I think
considering these ppc cases will give a more robust interface for
other future possibilities as well.

> > That should work from the point of view of the userspace and guest
> > side interfaces. It might be fiddly from the point of view of the
> > back end. The ppc iommu doesn't really have the notion of
> > configurable domains - instead the address spaces are the hardware
> > or firmware fixed PEs, so they have a fixed set of devices. At
> > the bare metal level it's possible to sort of do domains by making
> > the actual pagetable pointers for several PEs point to a common
> > place.
>
> I'm not sure I understand this - a domain is just a storage
> container for an IO page table, if the HW has IOPTEs then it should
> be able to have a domain?
>
> Making page table pointers point to a common IOPTE tree is exactly
> what iommu_domains are for - why is that "sort of" for ppc?

Ok, fair enough, it's only "sort of" in the sense that the hw specs /
docs don't present any equivalent concept.

> > However, in the future, nested KVM under PowerVM is likely to be
> > the norm. In that situation the L1 as well as the L2 only has the
> > paravirtualized interfaces, which don't have any notion of
> > domains, only PEs. All updates take place via hypercalls which
> > explicitly specify a PE (strictly speaking they take a "Logical IO
> > Bus Number" (LIOBN), but those generally map one to one with PEs),
> > so it can't use shared pointer tricks either.
>
> How do the paravirtualized interfaces deal with the page table?
> Does it call a map/unmap hypercall instead of providing guest
> IOPTEs?

Sort of. The main interface is H_PUT_TCE ("TCE" - Translation Control
Entry - being IBMese for an IOPTE). This takes an LIOBN (which
selects which PE and aperture), an IOVA and a TCE value - which is a
guest physical address plus some permission bits. There are some
variants for performance that can set a batch of IOPTEs from a buffer,
or clear a range of IOPTEs, but they're just faster ways of doing the
same thing as a bunch of H_PUT_TCE calls.

You can consider that a map/unmap hypercall, but the size of the
mapping is fixed (the IO pagesize which was previously set for the
aperture).

> Assuming yes, I'd expect that:
>
> The iommu_domain for nested PPC is just a log of map/unmap
> hypervisor calls to make. Whenever a new PE is attached to that
> domain it gets the logged maps replayed to set it up, and when a PE
> is detached the log is used to unmap everything.

And likewise duplicate every H_PUT_TCE to all the PEs in the domain.
Sure. It means the changes won't be atomic across the domain, but I
guess that doesn't matter. I guess you could have the same thing on a
sufficiently complex x86 or ARM system, if you put two devices into
the IOAS that were sufficiently far from each other in the bus
topology that they use a different top-level host IOMMU.
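A toy model of the replay-log idea, under the fixed-pagesize
assumption described above (every H_PUT_TCE maps exactly one page, so
the log and a mirror of the TCE table are the same thing). All names
here are hypothetical; h_put_tce_fn stands in for issuing the real
hypercall.

```c
/* Toy model of a "replay log" domain for paravirtualized ppc.
 * Because every H_PUT_TCE maps one fixed-size page, logging the
 * hypercalls and mirroring the whole TCE table are equivalent: a flat
 * array indexed by IOVA page number.  Hypothetical, for illustration. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TCE_SHIFT 12u            /* fixed IO pagesize: 4K */
#define TCE_ENTRIES 1024u        /* toy window covering 4M of IOVA */

struct tce_log {
	uint64_t tce[TCE_ENTRIES];   /* 0 means unmapped */
};

typedef void (*h_put_tce_fn)(uint32_t liobn, uint64_t ioba, uint64_t tce);

/* Record a guest H_PUT_TCE; a tce value of 0 is an unmap. */
static void log_put_tce(struct tce_log *log, uint64_t ioba, uint64_t tce)
{
	log->tce[(ioba >> TCE_SHIFT) % TCE_ENTRIES] = tce;
}

/* Attaching a new PE to the domain replays the mirror through the
 * hypercall for that PE's LIOBN (detach would replay zeros the same
 * way).  put may be NULL to just count the live mappings.  Returns
 * the number of mapped pages replayed. */
static size_t replay_to_pe(const struct tce_log *log, uint32_t liobn,
			   h_put_tce_fn put)
{
	size_t n = 0;

	for (size_t i = 0; i < TCE_ENTRIES; i++) {
		if (!log->tce[i])
			continue;
		if (put)
			put(liobn, (uint64_t)i << TCE_SHIFT, log->tce[i]);
		n++;
	}
	return n;
}
```

This is also why the memory-efficiency point below holds: the "log" is
never larger than one word per IO page, exactly like the table it
mirrors.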
> It is not perfectly memory efficient - and we could perhaps talk
> about an API modification to allow re-use of the iommufd
> datastructure somehow, but I think this is a good logical starting
> point.

Because the map size is fixed, a "replay log" is effectively
equivalent to a mirror of the entire IO pagetable.

> The PE would have to be modeled as an iommu_group.

Yes, they already are.

> > So, here's an alternative set of interfaces that should work for
> > ppc, maybe you can tell me whether they also work for x86 and
> > others:
>
> Fundamentally PPC has to fit into the iommu standard framework of
> group and domains, we can talk about modifications, but drifting too
> far away is a big problem.

Well, what I'm trying to do here is to see how big a change to the
model this really requires. If it's not too much, we gain flexibility
for future unknown IOMMU models as well as for the admittedly niche
ppc case.

> > * Each domain/IOAS has a concept of one or more IOVA windows,
> >   which each have a base address, size, pagesize (granularity) and
> >   optionally other flags/attributes.
> >     * This has some bearing on hardware capabilities, but is
> >       primarily a software notion
>
> iommu_domain has the aperture, PPC will require extending this to a
> list of apertures since it is currently only one window.

Yes, that's needed as a minimum.

> Once a domain is created and attached to a group the aperture should
> be immutable.

I realize that's the model right now, but is there a strong rationale
for that immutability?

Come to that, IIUC that's true for the iommu_domain at the lower
level, but not for the IOAS at a higher level. You've stated that's
*not* immutable, since it can shrink as new devices are added to the
IOAS. So I guess in that case the IOAS must be mapped by multiple
iommu_domains?
> > * MAP/UNMAP operations are only permitted within an existing IOVA
> >   window (and with addresses aligned to the window's pagesize)
> >     * This is enforced by software whether or not it is required
> >       by the underlying hardware
> > * Likewise IOAS_COPY operations are only permitted if the source
> >   and destination windows have compatible attributes
>
> Already done, the domain's aperture restricts all the iommufd
> operations

Right, but that aperture is coming only from the hardware. What I'm
suggesting here is that userspace can essentially opt into a *smaller*
aperture (or apertures) than the hardware permits. The value of this
is that if the effective hardware aperture shrinks due to adding
devices, the kernel has the information it needs to determine if this
will be a problem for the userspace client or not.

> > * A newly created kernel-managed IOAS has *no* IOVA windows
>
> Already done, the iommufd IOAS has no iommu_domains inside it at
> creation time.

That.. doesn't seem like the same thing at all. If there are no
domains, there are no restrictions from the hardware, so there's
effectively an unlimited aperture.

> > * A CREATE_WINDOW operation is added
> >     * This takes a size, pagesize/granularity, optional base
> >       address and optional additional attributes
> >     * If any of the specified attributes are incompatible with the
> >       underlying hardware, the operation fails
>
> The iommu layer has nothing called a window. The closest thing is a
> domain.

Maybe "window" is a bad name. You called it "aperture" above (and
I've shifted to that naming) and implied it *was* part of the IOMMU
domain.

That said, it doesn't need to be at the iommu layer - as I said I'm
considering this primarily a software concept and it could be at the
IOAS layer. The linkage would be that every iommufd operation which
could change either the user-selected windows or the effective
hardware apertures must ensure that all the user windows (still) lie
within the hardware apertures.
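That invariant is simple to state in code. A sketch with hypothetical
types, nothing here matches actual iommufd internals: "apertures" are
what the attached hardware can address, "windows" are the narrower
ranges userspace opted into.

```c
/* Sketch of the two-level windows-within-apertures invariant.
 * Hypothetical types for illustration only. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct range {
	uint64_t start, last;    /* inclusive IOVA range */
};

/* A range is usable iff a single aperture fully contains it. */
static bool range_covered(const struct range *apertures, size_t n,
			  const struct range *r)
{
	for (size_t i = 0; i < n; i++)
		if (r->start >= apertures[i].start &&
		    r->last <= apertures[i].last)
			return true;
	return false;
}

/* What every window- or aperture-changing operation (CREATE_WINDOW,
 * ATTACH, ...) would re-verify before committing: each user window
 * still lies within some hardware aperture. */
static bool windows_valid(const struct range *apertures, size_t n_apertures,
			  const struct range *windows, size_t n_windows)
{
	for (size_t i = 0; i < n_windows; i++)
		if (!range_covered(apertures, n_apertures, &windows[i]))
			return false;
	return true;
}
```

Under this scheme an ATTACH that narrowed the apertures would run the
check and fail the attach outright, rather than silently invalidating
a window the client relies on.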
> I really don't want to try to make a new iommu layer object that is
> so unique and special to PPC - we have to figure out how to fit PPC
> into the iommu_domain model with reasonable extensions.

So, I'm probably conflating things between the iommu layer and the
iommufd/IOAS layer because it's been a while since I looked at this
code. I'm primarily interested in the interfaces at the iommufd
layer, I don't much care at what layer it's implemented.

Having the iommu layer only care about the hw limitations (let's call
those "apertures") and the iommufd/IOAS layer care about the software
chosen limitations (let's call those "windows") makes reasonable sense
to me: HW advertises apertures according to its capabilities, SW
requests windows according to its needs. The iommufd layer does its
best to accommodate SW's operations while maintaining the invariant
that the windows must always lie within the apertures.

At least.. that's the model on x86 and most other hosts. For a ppc
host, we'd want to have attempts to create windows also attempt to
create apertures on the hardware (or next level hypervisor). But with
this model, that doesn't require userspace to do anything different,
so that can be localized to the ppc host backend.

> > > > > Maybe every device gets a copy of the error notification?
> > > >
> > > > Alas, it's harder than that. One of the things that can
> > > > happen on an EEH fault is that the entire PE gets suspended
> > > > (blocking both DMA and MMIO, IIRC) until the proper recovery
> > > > steps are taken.
> > >
> > > I think qemu would have to de-duplicate the duplicated device
> > > notifications and then it can go from a device notification to
> > > the device's iommu_group to the IOAS to the vPE?
> >
> > It's not about the notifications.
>
> The only thing the kernel can do is relay a notification that
> something happened to a PE. The kernel gets an event on the PE
> basis, I would like it to replicate it to all the devices and push
> it through the VFIO device FD.

Again: it's not about the notifications (I don't even really know how
those work in EEH).

The way EEH is specified, the expectation is that when an error is
tripped, the whole PE goes into an error state, blocking IO for every
device in the PE. Something working with EEH expects that to happen
at the (virtual) hardware level. So if a guest trips an error on one
device, it expects IO to stop for every other device in the PE. But
if hardware's notion of the PE is smaller than the guest's, the host
hardware won't accomplish that itself. So the kernel would somehow
have to replicate the error state in one host PE to the other host PEs
within the domain. I think it's theoretically possible, but it's
really fiddly. It has to maintain that replicated state on every step
of the recovery process as well. At last count, trying to figure out
how to do this correctly has burnt out 3 or more separate (competent)
developers at IBM, so they either just give up and run away, or
procrastinate/avoid it until they get a different job.

Bear in mind that in this scenario things are already not working
quite right at the hw level (or there wouldn't be an error in the
first place), so different results from the same recovery/injection
steps on different host PEs is a very real possibility. We can't
present that back to the guest running the recovery, because the
interface only allows for a single result. So every operation's error
path becomes a nightmare of trying to resynchronize the state across
the domain, which involves steps that themselves could fail, meaning
more resync... and so on.

> qemu will de-duplicate the replicates and recover exactly the same
> event the kernel saw, delivered at exactly the same time.

qemu's not doing the recovery, the guest is.
So we can't choose the interfaces, we just have what PAPR specifies
(and like most PAPR interfaces, it's pretty crap).

> If instead you want to have one event per-PE then all that changes
> in the kernel is one event is generated instead of N, and qemu
> doesn't have to throw away the duplicates.
>
> With either method qemu still gets the same PE centric event,
> delivered at the same time, with the same races.
>
> This is not a general mechanism, it is some PPC specific thing to
> communicate a PPC specific PE centric event to userspace. I just
> prefer it in VFIO instead of iommufd.

Oh, yes, this one (unlike the multi-aperture stuff, where I'm not yet
convinced) is definitely ppc specific. And I'm frankly not convinced
it's of any value even on ppc. Given I'm not sure anyone has ever
used this in anger, I think a better approach is to just straight up
fail any attempt to do any EEH operation if there's not a 1 to 1
mapping from guest PE (domain) to host PE (group). Or just ignore it
entirely and count on no-one noticing.

EEH is a largely irrelevant tangent to the actual interface issues I'm
interested in.
> From: David Gibson <david@gibson.dropbear.id.au>
> Sent: Friday, May 6, 2022 1:25 PM
>
> > When the iommu_domain is created I want to have an
> > iommu-driver-specific struct, so PPC can customize its
> > iommu_domain however it likes.
>
> This requires that the client be aware of the host side IOMMU model.
> That's true in VFIO now, and it's nasty; I was really hoping we
> could *stop* doing that.

That model is inevitable anyway when talking about user page tables,
i.e. when nesting is enabled.

> Note that I'm talking here *purely* about the non-optimized case
> where all updates to the host side IO pagetables are handled by
> IOAS_MAP / IOAS_COPY, with no direct hardware access to user or
> guest managed IO pagetables. The optimized case obviously requires
> end-to-end agreement on the pagetable format amongst other domain
> properties.
>
> What I'm hoping is that qemu (or whatever) can use this
> non-optimized case as a fallback where it doesn't need to know the
> properties of whatever host side IOMMU models there are. It just
> requests what it needs based on the vIOMMU properties it needs to
> replicate and the host kernel either can supply it or can't.
>
> In many cases it should be perfectly possible to emulate a PPC style
> vIOMMU on an x86 host, because the x86 IOMMU has such a colossal
> aperture that it will encompass wherever the ppc apertures end up.
> Similarly we could simulate an x86-style no-vIOMMU guest on a ppc
> host (currently somewhere between awkward and impossible) by placing
> the host apertures to cover guest memory.
>
> Admittedly those are pretty niche cases, but allowing for them gives
> us flexibility for the future. Emulating an ARM SMMUv3 guest on an
> ARM SMMU v4 or v5 or v.whatever host is likely to be a real case in
> the future, and AFAICT, ARM are much less conservative than x86
> about maintaining similar hw interfaces over time. That's why I
> think considering these ppc cases will give a more robust interface
> for other future possibilities as well.

These are not niche cases. We already have virtio-iommu, which can
work on both ARM and x86 platforms, i.e. what current iommufd provides
is already generic enough except on PPC.

Then IMHO the key open question here is: can PPC adapt to the current
iommufd proposal if it can be refactored to fit the standard iommu
domain/group concepts? If not, what is the remaining gap after PPC
becomes a normal citizen in the iommu layer, and is it worth solving
it in the general interface or via an iommu-driver-specific domain
(given this will exist anyway)?

To close that open question I'm with Jason:

    "Fundamentally PPC has to fit into the iommu standard framework of
    group and domains, we can talk about modifications, but drifting
    too far away is a big problem."

Directly jumping to the iommufd layer for what changes might be
applied to all platforms sounds counter-intuitive if we haven't tried
to solve the gap in the iommu layer in the first place. Even if there
is an argument that certain changes in the iommufd layer can find a
matching concept on other platforms, it still sort of looks redundant
since those platforms already work with the current model.

My two cents.

Thanks
Kevin
On Fri, May 06, 2022 at 03:25:03PM +1000, David Gibson wrote:

> On Thu, May 05, 2022 at 04:07:28PM -0300, Jason Gunthorpe wrote:
>
> > When the iommu_domain is created I want to have an
> > iommu-driver-specific struct, so PPC can customize its
> > iommu_domain however it likes.
>
> This requires that the client be aware of the host side IOMMU model.
> That's true in VFIO now, and it's nasty; I was really hoping we
> could *stop* doing that.

iommufd has two modes. The 'generic interface', which is what this
patch series shows, does not require any device specific knowledge.
The default iommu_domain that the iommu driver creates will be used
here; it is up to the iommu driver to choose something reasonable for
use by applications like DPDK. ie PPC should probably pick its
biggest x86-like aperture.

The iommu-driver-specific struct is the "advanced" interface and
allows a user-space IOMMU driver to tightly control the HW with full
HW specific knowledge. This is where all the weird stuff that is not
general should go.

> Note that I'm talking here *purely* about the non-optimized case
> where all updates to the host side IO pagetables are handled by
> IOAS_MAP / IOAS_COPY, with no direct hardware access to user or
> guest managed IO pagetables. The optimized case obviously requires
> end-to-end agreement on the pagetable format amongst other domain
> properties.

Sure, this is how things are already..

> What I'm hoping is that qemu (or whatever) can use this
> non-optimized case as a fallback where it doesn't need to know the
> properties of whatever host side IOMMU models there are. It just
> requests what it needs based on the vIOMMU properties it needs to
> replicate and the host kernel either can supply it or can't.

There aren't really any negotiable vIOMMU properties beyond the
ranges, and the ranges are not *really* negotiable.
There are lots of dragons with the idea we can actually negotiate ranges - because asking for the wrong range for what the HW can do means you don't get anything. Which is completely contrary to the idea of easy generic support for things like DPDK. So DPDK can't ask for ranges, it is not generic. This means we are really talking about a qemu-only API, and IMHO, qemu is fine to have a PPC specific userspace driver to tweak this PPC unique thing if the default windows are not acceptable. IMHO it is no different from imagining an Intel specific userspace driver that is using userspace IO pagetables to optimize cross-platform qemu vIOMMU emulation. We should be comfortable with the idea that accessing the full device-specific feature set requires a HW specific user space driver. > Admittedly those are pretty niche cases, but allowing for them gives > us flexibility for the future. Emulating an ARM SMMUv3 guest on an > ARM SMMU v4 or v5 or v.whatever host is likely to be a real case in > the future, and AFAICT, ARM are much less conservative that x86 about > maintaining similar hw interfaces over time. That's why I think > considering these ppc cases will give a more robust interface for > other future possibilities as well. I don't think so - PPC has just done two things that are completely divergent from eveything else - having two IO page tables for the same end point, and using map/unmap hypercalls instead of a nested page table. Everyone else seems to be focused on IOPTEs that are similar to CPU PTEs, particularly to enable SVA and other tricks, and CPU's don't have either of this weirdness. > You can consider that a map/unmap hypercall, but the size of the > mapping is fixed (the IO pagesize which was previously set for the > aperture). Yes, I would consider that a map/unmap hypercall vs a nested translation. > > Assuming yes, I'd expect that: > > > > The iommu_domain for nested PPC is just a log of map/unmap hypervsior > > calls to make. 
Whenever a new PE is attached to that domain it gets > > the logged map's replayed to set it up, and when a PE is detached the > > log is used to unmap everything. > > And likewise duplicate every H_PUT_TCE to all the PEs in the domain. > Sure. It means the changes won't be atomic across the domain, but I > guess that doesn't matter. I guess you could have the same thing on a > sufficiently complex x86 or ARM system, if you put two devices into > the IOAS that were sufficiently far from each other in the bus > topology that they use a different top-level host IOMMU. Yes, strict atomicity is not needed. > > It is not perfectly memory efficient - and we could perhaps talk about > > a API modification to allow re-use of the iommufd datastructure > > somehow, but I think this is a good logical starting point. > > Because the map size is fixed, a "replay log" is effectively > equivalent to a mirror of the entire IO pagetable. So, for virtualized PPC the iommu_domain is an xarray of mapped PFNs and device attach/detach just sweeps the xarray and does the hypercalls. Very similar to what we discussed for S390. It seems OK, this isn't even really overhead in most cases as we always have to track the mapped PFNs anyhow. > > Once a domain is created and attached to a group the aperture should > > be immutable. > > I realize that's the model right now, but is there a strong rationale > for that immutability? I have a strong preference that iommu_domains have immutable properties once they are created just for overall sanity. Otherwise everything becomes a racy mess. If the iommu_domain changes dynamically then things using the aperture data get all broken - it is complicated to fix. So it would need a big reason to do it, I think. > Come to that, IIUC that's true for the iommu_domain at the lower > level, but not for the IOAS at a higher level. You've stated that's > *not* immutable, since it can shrink as new devices are added to the > IOAS. 
> So I guess in that case the IOAS must be mapped by multiple iommu_domains?

Yes, that's right. The IOAS is expressly mutable because it sits on multiple domains and multiple groups, each of which contribute to the aperture. The IOAS aperture is the narrowest of everything, and we have semi-reasonable semantics here. Here we have all the special code to juggle this, but even in this case we can't handle an iommu_domain or group changing asynchronously.

> Right but that aperture is coming only from the hardware. What I'm suggesting here is that userspace can essentially opt into a *smaller* aperture (or apertures) than the hardware permits. The value of this is that if the effective hardware aperture shrinks due to adding devices, the kernel has the information it needs to determine if this will be a problem for the userspace client or not.

We can do this check in userspace without more kernel APIs, userspace should fetch the ranges and confirm they are still good after attaching devices.

In general I have no issue with limiting the IOVA allocator in the kernel, I just don't have a use case of an application that could use the IOVA allocator (like DPDK) and also needs a limitation..

> > > * A newly created kernel-managed IOAS has *no* IOVA windows

> > Already done, the iommufd IOAS has no iommu_domains inside it at creation time.

> That.. doesn't seem like the same thing at all. If there are no domains, there are no restrictions from the hardware, so there's effectively an unlimited aperture.

Yes.. You wanted a 0 sized window instead? Why? That breaks what I'm trying to do to make DPDK/etc portable and dead simple.

> > > * A CREATE_WINDOW operation is added
> > > * This takes a size, pagesize/granularity, optional base address and optional additional attributes
> > > * If any of the specified attributes are incompatible with the underlying hardware, the operation fails

> > iommu layer has nothing called a window.
> > The closest thing is a domain.

> Maybe "window" is a bad name. You called it "aperture" above (and I've shifted to that naming) and implied it *was* part of the IOMMU domain.

It is, but not as an object that can be mutated - it is just a property.

You are talking about a window object that exists somehow, I don't know where this fits beyond some creation attribute of the domain..

> That said, it doesn't need to be at the iommu layer - as I said I'm considering this primarily a software concept and it could be at the IOAS layer. The linkage would be that every iommufd operation which could change either the user-selected windows or the effective hardware apertures must ensure that all the user windows (still) lie within the hardware apertures.

As Kevin said, we need to start at the iommu_domain layer first - when we understand how that needs to look then we can imagine what the uAPI should be.

I don't want to create something in iommufd that is wildly divergent from what the iommu_domain/etc can support.

> > The only thing the kernel can do is relay a notification that something happened to a PE. The kernel gets an event on the PE basis, I would like it to replicate it to all the devices and push it through the VFIO device FD.

> Again: it's not about the notifications (I don't even really know how those work in EEH).

Well, then I don't know what we are talking about - I'm interested in what uAPI is needed to support this, as far as I can tell that is a notification something bad happened and some control knobs?

As I said, I'd prefer this to be on the vfio_device FD vs on iommufd and would like to find a way to make that work.

> expects that to happen at the (virtual) hardware level. So if a guest trips an error on one device, it expects IO to stop for every other device in the PE. But if hardware's notion of the PE is smaller than the guest's the host hardware won't accomplish that itself.

So, why do that to the guest?
Shouldn't the PE in the guest be strictly a subset of the PE in the host, otherwise you hit all these problems you are talking about?

> used this in anger, I think a better approach is to just straight up fail any attempt to do any EEH operation if there's not a 1 to 1 mapping from guest PE (domain) to host PE (group).

That makes sense

Jason
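The replay-log idea agreed on above (an iommu_domain that is just a log of map calls, replayed into each PE on attach) can be sketched in plain C. This is purely illustrative: the kernel would use an xarray and the real H_PUT_TCE hypercall, while here a fixed array and a counting stub stand in for them, and all the names are made up for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ENTRIES 1024

/* One logged mapping; the IO pagesize is fixed, so IOVA -> PFN suffices. */
struct tce_entry {
	uint64_t iova;
	uint64_t pfn;
};

/* Stand-in for the iommu_domain's xarray of mapped PFNs. */
struct ppc_domain {
	struct tce_entry log[MAX_ENTRIES];
	size_t nr;
};

/* Stub for the H_PUT_TCE hypercall; just counts invocations per PE. */
static unsigned int tce_calls[16];

static void hcall_put_tce(int pe, uint64_t iova, uint64_t pfn)
{
	(void)iova;
	(void)pfn;
	tce_calls[pe]++;
}

/* Map: append to the log and duplicate the hypercall to every attached PE. */
static int domain_map(struct ppc_domain *d, const int *pes, size_t nr_pes,
		      uint64_t iova, uint64_t pfn)
{
	size_t i;

	if (d->nr == MAX_ENTRIES)
		return -1;
	d->log[d->nr].iova = iova;
	d->log[d->nr].pfn = pfn;
	d->nr++;
	for (i = 0; i < nr_pes; i++)
		hcall_put_tce(pes[i], iova, pfn);
	return 0;
}

/* Attach: replay the whole log into the newly attached PE. */
static size_t domain_attach_pe(struct ppc_domain *d, int pe)
{
	size_t i;

	for (i = 0; i < d->nr; i++)
		hcall_put_tce(pe, d->log[i].iova, d->log[i].pfn);
	return d->nr;
}
```

Detach would walk the same log issuing unmaps; as noted in the exchange, because the map size is fixed the log mirrors the entire IO pagetable, which is why it is effectively an xarray of every mapped PFN.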
On Fri, May 06, 2022 at 10:42:21AM +0000, Tian, Kevin wrote:

> > From: David Gibson <david@gibson.dropbear.id.au>
> > Sent: Friday, May 6, 2022 1:25 PM

> > > When the iommu_domain is created I want to have an iommu-driver-specific struct, so PPC can customize its iommu_domain however it likes.

> > This requires that the client be aware of the host side IOMMU model. That's true in VFIO now, and it's nasty; I was really hoping we could *stop* doing that.

> that model is anyway inevitable when talking about user page table,

Right, but I'm explicitly not talking about the user managed page table case. I'm talking about the case where the IO pagetable is still managed by the kernel and we update it via IOAS_MAP and similar operations.

> i.e. when nesting is enabled.

I don't really follow the connection you're drawing between a user managed table and nesting.

> > Note that I'm talking here *purely* about the non-optimized case where all updates to the host side IO pagetables are handled by IOAS_MAP / IOAS_COPY, with no direct hardware access to user or guest managed IO pagetables. The optimized case obviously requires end-to-end agreement on the pagetable format amongst other domain properties.

> > What I'm hoping is that qemu (or whatever) can use this non-optimized as a fallback case where it doesn't need to know the properties of whatever host side IOMMU models there are. It just requests what it needs based on the vIOMMU properties it needs to replicate and the host kernel either can supply it or can't.

> > In many cases it should be perfectly possible to emulate a PPC style vIOMMU on an x86 host, because the x86 IOMMU has such a colossal aperture that it will encompass wherever the ppc apertures end up.
> > Similarly we could simulate an x86-style no-vIOMMU guest on a ppc host (currently somewhere between awkward and impossible) by placing the host apertures to cover guest memory.

> > Admittedly those are pretty niche cases, but allowing for them gives us flexibility for the future. Emulating an ARM SMMUv3 guest on an ARM SMMU v4 or v5 or v.whatever host is likely to be a real case in the future, and AFAICT, ARM are much less conservative than x86 about maintaining similar hw interfaces over time. That's why I think considering these ppc cases will give a more robust interface for other future possibilities as well.

> It's not niche cases. We already have virtio-iommu which can work on both ARM and x86 platforms, i.e. what current iommufd provides is already generic enough except on PPC.

> Then IMHO the key open here is:

> Can PPC adapt to the current iommufd proposal if it can be refactored to fit the standard iommu domain/group concepts?

Right... and I'm still trying to figure out whether it can adapt to either part of that. We absolutely need to allow for multiple IOVA apertures within a domain. If we have that I *think* we can manage (if suboptimally), but I'm trying to figure out the corner cases to make sure I haven't missed something.

> If not, what is the remaining gap after PPC becomes a normal citizen in the iommu layer and is it worth solving it in the general interface or via iommu-driver-specific domain (given this will exist anyway)?

> to close that open I'm with Jason:

> "Fundamentally PPC has to fit into the iommu standard framework of group and domains, we can talk about modifications, but drifting too far away is a big problem."
> Directly jumping to the iommufd layer for what changes might be applied to all platforms sounds counter-intuitive if we haven't tried to solve the gap in the iommu layer in the first place, as even if there is an argument that certain changes in the iommufd layer can find a matching concept on other platforms, it still sort of looks redundant since those platforms already work with the current model.

I don't really follow what you're saying here.
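Since "multiple IOVA apertures within a domain" is the recurring requirement in this exchange, here is a small self-contained sketch of what it means for validation. The two PPC-style windows and all the values are invented for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct aperture {
	uint64_t base;
	uint64_t size;
};

/*
 * A PPC-like layout: a small 32-bit window at zero plus a huge dynamic
 * DMA window at a very high base.  Values are illustrative only.
 */
static const struct aperture ppc_apertures[] = {
	{ 0,          1ULL << 31 },	/* 2G low window */
	{ 1ULL << 59, 1ULL << 41 },	/* high DDW window */
};

/* A mapping is only valid if it lies wholly inside a single aperture. */
static bool iova_range_ok(const struct aperture *ap, size_t n,
			  uint64_t iova, uint64_t len)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (iova < ap[i].base || len > ap[i].size)
			continue;
		if (iova - ap[i].base <= ap[i].size - len)
			return true;
	}
	return false;
}
```

The same check covers the x86 "IO hole" case raised later in the thread: the hole is simply the gap between two apertures, and a mapping spanning it fails.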
On Fri, May 06, 2022 at 09:48:37AM -0300, Jason Gunthorpe wrote:

> On Fri, May 06, 2022 at 03:25:03PM +1000, David Gibson wrote:
> > On Thu, May 05, 2022 at 04:07:28PM -0300, Jason Gunthorpe wrote:

> > > When the iommu_domain is created I want to have an iommu-driver-specific struct, so PPC can customize its iommu_domain however it likes.

> > This requires that the client be aware of the host side IOMMU model. That's true in VFIO now, and it's nasty; I was really hoping we could *stop* doing that.

> iommufd has two modes, the 'generic interface', which is what this patch series shows, and which does not require any device specific knowledge.

Right, and I'm speaking specifically to that generic interface.

But I'm thinking particularly about the qemu case where we do have specific knowledge of the *guest* vIOMMU, but we want to avoid having specific knowledge of the host IOMMU, because they might not be the same. It would be good to have a way of seeing if the guest vIOMMU can be emulated on this host IOMMU without qemu having to have separate logic for every host IOMMU.

> The default iommu_domain that the iommu driver creates will be used here, it is up to the iommu driver to choose something reasonable for use by applications like DPDK. ie PPC should probably pick its biggest x86-like aperture.

So, using the big aperture means a very high base IOVA (1<<59)... which means that it won't work at all if you want to attach any devices that aren't capable of 64-bit DMA.

Using the maximum possible window size would mean we either potentially waste a lot of kernel memory on pagetables, or we use an unnecessarily large number of levels in the pagetable.

Basically we don't have enough information to make a good decision here.

More generally, the problem with the interface advertising limitations and it being up to userspace to work out if those are ok or not is that it's fragile.
It's pretty plausible that some future IOMMU model will have some new kind of limitation that can't be expressed in the query structure we invented now. That means that to add support for that we need some kind of gate to prevent old userspace using the new IOMMU (e.g. only allowing the new IOMMU to be used if userspace uses newly added queries to get the new limitations). That's true even if what userspace was actually doing with the IOMMU would fit just fine into those new limitations.

But if userspace requests the capabilities it wants, and the kernel acks or nacks that, we can support the new host IOMMU with existing software just fine. They won't be able to use any *new* features or capabilities of the new hardware, of course, but they'll be able to use what it does that overlaps with what they needed before.

ppc - or more correctly, the POWER and PAPR IOMMU models - is just acting here as an example of an IOMMU with limitations and capabilities that don't fit into the current query model.

> The iommu-driver-specific struct is the "advanced" interface and allows a user-space IOMMU driver to tightly control the HW with full HW specific knowledge. This is where all the weird stuff that is not general should go.

Right, but forcing anything more complicated than "give me some IOVA region" to go through the advanced interface means that qemu (or any hypervisor where the guest platform need not identically match the host) has to have n^2 complexity to match each guest IOMMU model to each host IOMMU model.

> > Note that I'm talking here *purely* about the non-optimized case where all updates to the host side IO pagetables are handled by IOAS_MAP / IOAS_COPY, with no direct hardware access to user or guest managed IO pagetables. The optimized case obviously requires end-to-end agreement on the pagetable format amongst other domain properties.

Sure, this is how things are already..
> > What I'm hoping is that qemu (or whatever) can use this non-optimized as a fallback case where it doesn't need to know the properties of whatever host side IOMMU models there are. It just requests what it needs based on the vIOMMU properties it needs to replicate and the host kernel either can supply it or can't.

> There aren't really any negotiable vIOMMU properties beyond the ranges, and the ranges are not *really* negotiable.

Errr.. how do you figure? On ppc the ranges and pagesizes are definitely negotiable. I'm not really familiar with other models, but anything which allows *any* variations in the pagetable structure will effectively have at least some negotiable properties.

Even if any individual host IOMMU doesn't have negotiable properties (which ppc demonstrates is false), there's still a negotiation here in the context that userspace doesn't know (and doesn't care) what specific host IOMMU model it has.

> There are lots of dragons with the idea we can actually negotiate ranges - because asking for the wrong range for what the HW can do means you don't get anything. Which is completely contrary to the idea of easy generic support for things like DPDK.

> So DPDK can't ask for ranges, it is not generic.

Which is why I'm suggesting that the base address be an optional request. DPDK *will* care about the size of the range, so it just requests that and gets told a base address. Qemu emulating a vIOMMU absolutely does care about the ranges, and if the HW can't do it, failing outright is the correct behaviour.

> This means we are really talking about a qemu-only API,

Kind of, yes. I can't immediately think of anything which would need this fine-grained control over the IOMMU capabilities other than an emulator/hypervisor emulating a different vIOMMU than the host IOMMU. qemu is by far the most prominent example of this.
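A minimal sketch of the negotiation being proposed here (size mandatory, base address optional). The function name `create_window`, the `NO_BASE` sentinel and the aperture values are all hypothetical, not part of any actual or proposed uAPI:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct hw_aperture {
	uint64_t base;
	uint64_t size;
};

#define NO_BASE UINT64_MAX	/* caller lets the kernel choose the base */

/*
 * Size is mandatory; the base address is an optional request.  If the
 * request cannot be satisfied by any hardware aperture, fail outright
 * rather than silently shrinking or moving it.
 */
static int create_window(const struct hw_aperture *ap, size_t n,
			 uint64_t size, uint64_t base, uint64_t *out_base)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (size == 0 || size > ap[i].size)
			continue;
		if (base == NO_BASE) {
			*out_base = ap[i].base;	/* kernel picks */
			return 0;
		}
		if (base >= ap[i].base &&
		    base - ap[i].base <= ap[i].size - size) {
			*out_base = base;	/* requested base fits */
			return 0;
		}
	}
	return -1;	/* incompatible with the hardware */
}
```

A DPDK-style caller passes `NO_BASE` and takes whatever base comes back; a qemu-style caller passes the fixed base its vIOMMU model demands and treats failure as "this guest can't run on this host".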
> and IMHO, qemu is fine to have a PPC specific userspace driver to tweak this PPC unique thing if the default windows are not acceptable.

Not really, because it's the ppc *target* (guest) side which requires the specific properties, but selecting the "advanced" interface requires special knowledge on the *host* side.

> IMHO it is no different from imagining an Intel specific userspace driver that is using userspace IO pagetables to optimize cross-platform qemu vIOMMU emulation.

I'm not quite sure what you have in mind here. How is it both Intel specific and cross-platform?

> We should be comfortable with the idea that accessing the full device-specific feature set requires a HW specific user space driver.

> > Admittedly those are pretty niche cases, but allowing for them gives us flexibility for the future. Emulating an ARM SMMUv3 guest on an ARM SMMU v4 or v5 or v.whatever host is likely to be a real case in the future, and AFAICT, ARM are much less conservative than x86 about maintaining similar hw interfaces over time. That's why I think considering these ppc cases will give a more robust interface for other future possibilities as well.

> I don't think so - PPC has just done two things that are completely divergent from everything else - having two IO page tables for the same end point, and using map/unmap hypercalls instead of a nested page table.

> Everyone else seems to be focused on IOPTEs that are similar to CPU PTEs, particularly to enable SVA and other tricks, and CPUs don't have either of those weirdnesses.

Well, you may have a point there. Though, my experience makes me pretty reluctant to ever bet on hw designers *not* inserting some weird new constraint and assuming sw can just figure it out somehow.

Note however, that having multiple apertures isn't really ppc specific. Anything with an IO hole effectively has separate apertures above and below the hole.
They're much closer together in address than POWER's two apertures, but I don't see that makes any fundamental difference to the handling of them.

> > You can consider that a map/unmap hypercall, but the size of the mapping is fixed (the IO pagesize which was previously set for the aperture).

> Yes, I would consider that a map/unmap hypercall vs a nested translation.

> > > Assuming yes, I'd expect that:

> > > The iommu_domain for nested PPC is just a log of map/unmap hypervisor calls to make. Whenever a new PE is attached to that domain it gets the logged maps replayed to set it up, and when a PE is detached the log is used to unmap everything.

> > And likewise duplicate every H_PUT_TCE to all the PEs in the domain. Sure. It means the changes won't be atomic across the domain, but I guess that doesn't matter. I guess you could have the same thing on a sufficiently complex x86 or ARM system, if you put two devices into the IOAS that were sufficiently far from each other in the bus topology that they use a different top-level host IOMMU.

> Yes, strict atomicity is not needed.

Ok, good to know.

> > > It is not perfectly memory efficient - and we could perhaps talk about an API modification to allow re-use of the iommufd datastructure somehow, but I think this is a good logical starting point.

> > Because the map size is fixed, a "replay log" is effectively equivalent to a mirror of the entire IO pagetable.

> So, for virtualized PPC the iommu_domain is an xarray of mapped PFNs and device attach/detach just sweeps the xarray and does the hypercalls. Very similar to what we discussed for S390.

> It seems OK, this isn't even really overhead in most cases as we always have to track the mapped PFNs anyhow.

Fair enough, I think that works.

> > > Once a domain is created and attached to a group the aperture should be immutable.
> > I realize that's the model right now, but is there a strong rationale for that immutability?

> I have a strong preference that iommu_domains have immutable properties once they are created just for overall sanity. Otherwise everything becomes a racy mess.

> If the iommu_domain changes dynamically then things using the aperture data get all broken - it is complicated to fix. So it would need a big reason to do it, I think.

Ok, that makes sense. I'm pretty ok with the iommu_domain having static apertures. We can't optimally implement ddw that way, but I think we can do it well enough by creating new domains and copying the mappings, as you suggested.

> > Come to that, IIUC that's true for the iommu_domain at the lower level, but not for the IOAS at a higher level. You've stated that's *not* immutable, since it can shrink as new devices are added to the IOAS. So I guess in that case the IOAS must be mapped by multiple iommu_domains?

> Yes, that's right. The IOAS is expressly mutable because it sits on multiple domains and multiple groups each of which contribute to the aperture. The IOAS aperture is the narrowest of everything, and we have semi-reasonable semantics here. Here we have all the special code to juggle this, but even in this case we can't handle an iommu_domain or group changing asynchronously.

Ok. So at the IOAS level, the apertures are already dynamic. Arguably the main point of my proposal is that it makes any changes in IOAS apertures explicit, rather than implicit. Typically userspace wouldn't need to rely on the aperture information from a query, because they know it will never shrink below the parameters they specified.

Another approach would be to give the required apertures / pagesizes in the initial creation of the domain/IOAS. In that case they would be static for the IOAS, as well as the underlying iommu_domains: any ATTACH which would be incompatible would fail.
That makes life hard for the DPDK case, though. Obviously we can still make the base address optional, but for it to be static the kernel would have to pick it immediately, before we know what devices will be attached, which will be a problem on any system where there are multiple IOMMUs with different constraints.

> > Right but that aperture is coming only from the hardware. What I'm suggesting here is that userspace can essentially opt into a *smaller* aperture (or apertures) than the hardware permits. The value of this is that if the effective hardware aperture shrinks due to adding devices, the kernel has the information it needs to determine if this will be a problem for the userspace client or not.

> We can do this check in userspace without more kernel APIs, userspace should fetch the ranges and confirm they are still good after attaching devices.

Well, yes, but I always think an API which stops you doing the wrong thing is better than one which requires you take a bunch of extra steps to use safely. Plus, as noted above, this breaks down if some future IOMMU has a new constraint not expressed in the current query API.

> In general I have no issue with limiting the IOVA allocator in the kernel, I just don't have a use case of an application that could use the IOVA allocator (like DPDK) and also needs a limitation..

Well, I imagine DPDK has at least the minimal limitation that it needs the aperture to be a certain minimum size (I'm guessing at least the size of its pinned hugepage working memory region). That's a limitation that's unlikely to fail on modern hardware, but it's there.

> > > > * A newly created kernel-managed IOAS has *no* IOVA windows

> > > Already done, the iommufd IOAS has no iommu_domains inside it at creation time.

> > That.. doesn't seem like the same thing at all. If there are no domains, there are no restrictions from the hardware, so there's effectively an unlimited aperture.
> Yes.. You wanted a 0 sized window instead?

Yes. Well.. since I'm thinking in terms of multiple windows, I was thinking of it as "0 windows" rather than a 0-sized window, but whatever. Starting with no permitted IOVAs is the relevant point.

> Why?

To force the app to declare its windows with CREATE_WINDOW before using them.

> That breaks what I'm trying to do to make DPDK/etc portable and dead simple.

It doesn't break portability at all. As for simplicity, yes it adds an extra required step, but the payoff is that it's now impossible to subtly screw up by failing to recheck your apertures after an ATTACH. That is, it's taking a step which was implicitly required and replacing it with one that's explicitly required.

> > > > * A CREATE_WINDOW operation is added
> > > > * This takes a size, pagesize/granularity, optional base address and optional additional attributes
> > > > * If any of the specified attributes are incompatible with the underlying hardware, the operation fails

> > > iommu layer has nothing called a window. The closest thing is a domain.

> > Maybe "window" is a bad name. You called it "aperture" above (and I've shifted to that naming) and implied it *was* part of the IOMMU domain.

> It is but not as an object that can be mutated - it is just a property.

> You are talking about a window object that exists somehow, I don't know where this fits beyond some creation attribute of the domain..

At the domain layer, I think we can manage it that way, albeit suboptimally.

> > That said, it doesn't need to be at the iommu layer - as I said I'm considering this primarily a software concept and it could be at the IOAS layer. The linkage would be that every iommufd operation which could change either the user-selected windows or the effective hardware apertures must ensure that all the user windows (still) lie within the hardware apertures.
> As Kevin said, we need to start at the iommu_domain layer first - when we understand how that needs to look then we can imagine what the uAPI should be.

Hm, ok. I'm looking in the other direction - what uAPIs are needed to do the things I'm considering in userspace, given the constraints of the hardware I know about. I'm seeing what layer we match one to the other at as secondary.

> I don't want to create something in iommufd that is wildly divergent from what the iommu_domain/etc can support.

Fair enough.

> > > The only thing the kernel can do is relay a notification that something happened to a PE. The kernel gets an event on the PE basis, I would like it to replicate it to all the devices and push it through the VFIO device FD.

> > Again: it's not about the notifications (I don't even really know how those work in EEH).

> Well, then I don't know what we are talking about - I'm interested in what uAPI is needed to support this, as far as I can tell that is a notification something bad happened and some control knobs?

As far as I remember that's basically right, but the question is the granularity at which those control knobs operate, given that some of the knobs can be automatically flipped in case of a hardware error.

> As I said, I'd prefer this to be on the vfio_device FD vs on iommufd and would like to find a way to make that work.

It probably does belong on the device FD, but we need to consider the IOMMU, because these things are operating at the granularity of (at least) an IOMMU group.

> > expects that to happen at the (virtual) hardware level. So if a guest trips an error on one device, it expects IO to stop for every other device in the PE. But if hardware's notion of the PE is smaller than the guest's the host hardware won't accomplish that itself.

> So, why do that to the guest?
> Shouldn't the PE in the guest be strictly a subset of the PE in the host, otherwise you hit all these problems you are talking about?

Ah.. yes, I did leave that out. It's moderately complicated:

- The guest PE can never be a (strict) subset of the host PE. A PE is an IOMMU group in Linux terms, so if a host PE were split amongst guest PEs we couldn't supply the isolation semantics that the guest kernel will expect. So one guest PE must always consist of 1 or more host PEs.

- We can't choose the guest PE boundaries completely arbitrarily. The PAPR interfaces tie the PE boundaries to (guest) PCI(e) topology in a pretty confusing way that comes from the history of IBM hardware.

- To keep things simple, qemu makes the choice that each guest PCI Host Bridge (vPHB) is a single (guest) PE. (Having many independent PHBs is routine on POWER systems.)

- So, where the guest PEs lie depends on how the user/configuration assigns devices to PHBs in the guest topology.

- We enforce (as we must) that devices from a single host PE can't be split across guest PEs. I think qemu has specific checks for this in order to give better errors, but it will also be enforced by VFIO: we have one container per PHB (guest PE) so we can't attach the same group (host PE) to two of them.

- We don't enforce that we have at *most* one host PE per PHB - and that's useful, because lots of scripts / management tools / whatever designed for x86 just put all PCI devices onto the same vPHB, regardless of how they're grouped on the host.

- There are some additional complications because of the possibility of qemu emulated devices, which don't have *any* host PE.

> > used this in anger, I think a better approach is to just straight up fail any attempt to do any EEH operation if there's not a 1 to 1 mapping from guest PE (domain) to host PE (group).

> That makes sense

Short of a time machine to murd^H^H^H^Heducate the people who wrote the EEH interfaces, I think it's the best we have.
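The constraints in the list above, plus the 1:1 rule both sides settle on for EEH, can be captured in a few lines. Everything here (`guest_pe`, the group ids, the function names) is invented for illustration, not taken from qemu or VFIO:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_GROUPS 8

/*
 * One guest PE (a qemu vPHB / one VFIO container): the host IOMMU
 * groups (host PEs) attached to it.  Emulated devices contribute no
 * host group at all.
 */
struct guest_pe {
	int groups[MAX_GROUPS];
	size_t nr_groups;
};

/* The agreed rule: allow EEH operations only when guest PE == one host PE. */
static bool eeh_allowed(const struct guest_pe *pe)
{
	return pe->nr_groups == 1;
}

/*
 * The invariant VFIO already enforces: one host group (PE) can never be
 * split across guest PEs, i.e. no group id may appear in two containers.
 */
static bool groups_disjoint(const struct guest_pe *pes, size_t n)
{
	size_t a, b, i, j;

	for (a = 0; a < n; a++)
		for (b = a + 1; b < n; b++)
			for (i = 0; i < pes[a].nr_groups; i++)
				for (j = 0; j < pes[b].nr_groups; j++)
					if (pes[a].groups[i] ==
					    pes[b].groups[j])
						return false;
	return true;
}
```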
On Mon, May 09, 2022 at 04:01:52PM +1000, David Gibson wrote:

> > The default iommu_domain that the iommu driver creates will be used here, it is up to the iommu driver to choose something reasonable for use by applications like DPDK. ie PPC should probably pick its biggest x86-like aperture.

> So, using the big aperture means a very high base IOVA (1<<59)... which means that it won't work at all if you want to attach any devices that aren't capable of 64-bit DMA.

I'd expect to include the 32 bit window too..

> Using the maximum possible window size would mean we either potentially waste a lot of kernel memory on pagetables, or we use an unnecessarily large number of levels in the pagetable.

All drivers have this issue to one degree or another. We seem to be ignoring it - in any case this is a micro optimization, not a functional need?

> More generally, the problem with the interface advertising limitations and it being up to userspace to work out if those are ok or not is that it's fragile. It's pretty plausible that some future IOMMU model will have some new kind of limitation that can't be expressed in the query structure we invented now.

The basic API is very simple - the driver needs to provide ranges of IOVA and map/unmap - I don't think we have a future problem here we need to try and guess and solve today.

Even PPC fits this just fine, the open question for DPDK is more around optimization, not functionality.

> But if userspace requests the capabilities it wants, and the kernel acks or nacks that, we can support the new host IOMMU with existing software just fine.

No, this just makes it fragile in the other direction because now userspace has to know what platform specific things to ask for *or it doesn't work at all*. This is not an improvement for the DPDK cases.

Kernel decides, using all the kernel knowledge it has, and tells the application what it can do - this is the basic simplified interface.
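The "kernel decides and tells the application" flow described here is essentially: query the ranges the kernel advertises, then allocate IOVA strictly inside them. A toy bump allocator shows the shape of it; the ranges are made-up stand-ins for whatever an IOVA-ranges query would return:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One range as the kernel would advertise it; values are invented. */
struct iova_range {
	uint64_t start;
	uint64_t last;	/* inclusive */
};

/*
 * Dead-simple bump allocator over whatever ranges the kernel reported.
 * The application never hard-codes a platform address.
 */
struct iova_alloc {
	const struct iova_range *r;
	size_t nr;
	size_t cur;
	uint64_t next;
};

static int iova_alloc_init(struct iova_alloc *a,
			   const struct iova_range *r, size_t nr)
{
	if (!nr)
		return -1;
	a->r = r;
	a->nr = nr;
	a->cur = 0;
	a->next = r[0].start;
	return 0;
}

static int iova_alloc(struct iova_alloc *a, uint64_t len, uint64_t *out)
{
	if (!len)
		return -1;
	while (a->cur < a->nr) {
		const struct iova_range *r = &a->r[a->cur];

		if (a->next <= r->last && r->last - a->next >= len - 1) {
			*out = a->next;
			a->next += len;
			return 0;
		}
		/* Current range exhausted; move to the next one. */
		if (++a->cur < a->nr)
			a->next = a->r[a->cur].start;
	}
	return -1;
}
```

The same loop works unchanged whether the kernel reports one big x86-style range or a PPC-style pair of widely separated windows, which is the portability argument being made here.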
> > The iommu-driver-specific struct is the "advanced" interface and allows a user-space IOMMU driver to tightly control the HW with full HW specific knowledge. This is where all the weird stuff that is not general should go.

> Right, but forcing anything more complicated than "give me some IOVA region" to go through the advanced interface means that qemu (or any hypervisor where the guest platform need not identically match the host) has to have n^2 complexity to match each guest IOMMU model to each host IOMMU model.

I wouldn't say n^2, but yes, qemu needs to have a userspace driver for the platform IOMMU, and yes it needs this to reach optimal behavior. We already know this is a hard requirement for using nesting as acceleration, I don't see why apertures are so different.

> Errr.. how do you figure? On ppc the ranges and pagesizes are definitely negotiable. I'm not really familiar with other models, but anything which allows *any* variations in the pagetable structure will effectively have at least some negotiable properties.

As above, if you ask for the wrong thing then you don't get anything. If DPDK asks for something that works on ARM like 0 -> 4G then PPC and x86 will always fail. How is this improving anything to require applications to carefully ask for exactly the right platform specific ranges? It isn't like there is some hard coded value we can put into DPDK that will work on every platform.

So kernel must pick for DPDK, IMHO. I don't see any feasible alternative.

> Which is why I'm suggesting that the base address be an optional request. DPDK *will* care about the size of the range, so it just requests that and gets told a base address.

We've talked about a size of IOVA address space before, strictly as a hint, to possibly optimize page table layout, or something, and I'm fine with that idea. But - we have no driver implementation today, so I'm not sure what we can really do with this right now..
Kevin, could Intel consume a hint on IOVA space and optimize the number of IO page table levels?

> > and IMHO, qemu is fine to have a PPC specific userspace driver to tweak this PPC unique thing if the default windows are not acceptable.

> Not really, because it's the ppc *target* (guest) side which requires the specific properties, but selecting the "advanced" interface requires special knowledge on the *host* side.

The ppc specific driver would be on the generic side of qemu in its viommu support framework. There is lots of host driver optimization possible here with knowledge of the underlying host iommu HW. It should not be connected to the qemu target.

It is not so different from today where qemu has to know about ppc's special vfio interface generically even to emulate x86.

> > IMHO it is no different from imagining an Intel specific userspace driver that is using userspace IO pagetables to optimize cross-platform qemu vIOMMU emulation.

> I'm not quite sure what you have in mind here. How is it both Intel specific and cross-platform?

It is part of the generic qemu iommu interface layer. For nesting qemu would copy the guest page table format to the host page table format in userspace and trigger invalidation - no pinning, no kernel map/unmap calls. It can only be done with detailed knowledge of the host iommu since the host iommu io page table format is exposed directly to userspace.

> Note however, that having multiple apertures isn't really ppc specific. Anything with an IO hole effectively has separate apertures above and below the hole. They're much closer together in address than POWER's two apertures, but I don't see that makes any fundamental difference to the handling of them.
In the iommu core it handled the io holes and things through the group reserved IOVA list - there isn't actually a limit in the iommu_domain, it has a flat pagetable format - and in cases like PASID/SVA the group reserved list doesn't apply at all. > Another approach would be to give the required apertures / pagesizes > in the initial creation of the domain/IOAS. In that case they would > be static for the IOAS, as well as the underlying iommu_domains: any > ATTACH which would be incompatible would fail. This is the device-specific iommu_domain creation path. The domain can have information defining its aperture. > That makes life hard for the DPDK case, though. Obviously we can > still make the base address optional, but for it to be static the > kernel would have to pick it immediately, before we know what devices > will be attached, which will be a problem on any system where there > are multiple IOMMUs with different constraints. Which is why the current scheme is fully automatic and we rely on the iommu driver to automatically select something sane for DPDK/etc today. > > In general I have no issue with limiting the IOVA allocator in the > > kernel, I just don't have a use case of an application that could use > > the IOVA allocator (like DPDK) and also needs a limitation.. > > Well, I imagine DPDK has at least the minimal limitation that it needs > the aperture to be a certain minimum size (I'm guessing at least the > size of its pinned hugepage working memory region). That's a > limitation that's unlikely to fail on modern hardware, but it's there. Yes, DPDK does assume there is some fairly large available aperture, that should be the driver default behavior, IMHO. > > That breaks what I'm > > trying to do to make DPDK/etc portable and dead simple. > > It doesn't break portability at all.
As for simplicity, yes it adds > an extra required step, but the payoff is that it's now impossible to > subtly screw up by failing to recheck your apertures after an ATTACH. > That is, it's taking a step which was implicitly required and > replacing it with one that's explicitly required. Again, as above, it breaks portability because apps have no hope to know what window range to ask for to succeed. It cannot just be a hard coded range. Jason
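The aperture bookkeeping being argued over here — the kernel subtracting the group reserved regions from a flat aperture and advertising what is left — can be sketched in a few lines. This is purely an illustration: the `range` struct and `holes_to_windows()` helper are invented for the example and are not part of any proposed uAPI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct range { uint64_t start, last; };  /* inclusive bounds */

/*
 * Given one flat aperture and a sorted, non-overlapping list of
 * reserved "holes" inside it, emit the usable windows.  Returns the
 * number of windows written to `out`.
 */
static size_t holes_to_windows(struct range ap,
                               const struct range *holes, size_t nholes,
                               struct range *out)
{
    size_t n = 0;
    uint64_t cur = ap.start;

    for (size_t i = 0; i < nholes; i++) {
        if (holes[i].start > cur)
            out[n++] = (struct range){ cur, holes[i].start - 1 };
        cur = holes[i].last + 1;
    }
    if (cur <= ap.last)
        out[n++] = (struct range){ cur, ap.last };
    return n;
}
```

An x86-style layout (a 48-bit aperture with one IO hole) yields two windows, which is the "two windows, two holes" symmetry discussed later in the thread.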
On Mon, May 09, 2022 at 11:00:41AM -0300, Jason Gunthorpe wrote: > On Mon, May 09, 2022 at 04:01:52PM +1000, David Gibson wrote: > > > > The default iommu_domain that the iommu driver creates will be used > > > here, it is up to the iommu driver to choose something reasonable for > > > use by applications like DPDK. ie PPC should probably pick its biggest > > > x86-like aperture. > > > > So, using the big aperture means a very high base IOVA > > (1<<59)... which means that it won't work at all if you want to attach > > any devices that aren't capable of 64-bit DMA. > > I'd expect to include the 32 bit window too.. I'm not entirely sure what you mean. Are you working on the assumption that we've extended to allowing multiple apertures, so we'd default to advertising both a small/low aperture and a large/high aperture? > > Using the maximum possible window size would mean we either > > potentially waste a lot of kernel memory on pagetables, or we use > > unnecessarily large number of levels to the pagetable. > > All drivers have this issue to one degree or another. We seem to be > ignoring it - in any case this is a micro optimization, not a > functional need? Ok, fair point. > > More generally, the problem with the interface advertising limitations > > and it being up to userspace to work out if those are ok or not is > > that it's fragile. It's pretty plausible that some future IOMMU model > > will have some new kind of limitation that can't be expressed in the > > query structure we invented now. > > The basic API is very simple - the driver needs to provide ranges of > IOVA and map/unmap - I don't think we have a future problem here we > need to try and guess and solve today. Well.. maybe. My experience of encountering hardware doing weird-arse stuff makes me less sanguine. > Even PPC fits this just fine, the open question for DPDK is more > around optimization, not functional. 
> > > But if userspace requests the capabilities it wants, and the kernel > > acks or nacks that, we can support the new host IOMMU with existing > > software just fine. > > No, this just makes it fragile in the other direction because now > userspace has to know what platform specific things to ask for *or it > doesn't work at all*. This is not an improvement for the DPDK cases. Um.. no. The idea is that userspace requests *what it needs*, not anything platform specific. In the case of DPDK that would be nothing more than the (minimum) aperture size. Nothing platform specific about that. > Kernel decides, using all the kernel knowledge it has and tells the > application what it can do - this is the basic simplified interface. > > > > The iommu-driver-specific struct is the "advanced" interface and > > > allows a user-space IOMMU driver to tightly control the HW with full > > > HW specific knowledge. This is where all the weird stuff that is not > > > general should go. > > > > Right, but forcing anything more complicated than "give me some IOVA > > region" to go through the advanced interface means that qemu (or any > > hypervisor where the guest platform need not identically match the > > host) has to have n^2 complexity to match each guest IOMMU model to > > each host IOMMU model. > > I wouldn't say n^2, but yes, qemu needs to have a userspace driver for > the platform IOMMU, and yes it needs this to reach optimal > behavior. We already know this is a hard requirement for using nesting > as acceleration, I don't see why apertures are so different. For one thing, because we only care about optimal behaviour on the host ~= guest KVM case. That means it's not n^2, just (roughly) one host driver for each matching guest driver. I'm considering the general X on Y case - we don't need to optimize it, but it would be nice for it to work without considering every combination separately. > > Errr.. how do you figure?
On ppc the ranges and pagesizes are > > definitely negotiable. I'm not really familiar with other models, but > > anything which allows *any* variations in the pagetable structure will > > effectively have at least some negotiable properties. > > As above, if you ask for the wrong thing then you don't get > anything. If DPDK asks for something that works on ARM like 0 -> 4G > then PPC and x86 will always fail. How is this improving anything to > require applications to carefully ask for exactly the right platform > specific ranges? Hm, looks like I didn't sufficiently emphasize that the base address would be optional for userspace to supply. So userspace would request a range *size* only, unless it needs a specific IOVA base address. It only requests the latter if it actually needs it - so failing in that case is correct. (Qemu, with or without a vIOMMU, is the obvious case for that, though I could also imagine it for a specialized driver for some broken device which has weird limitations on what IOVA addresses it can generate on the bus). > It isn't like there is some hard coded value we can put into DPDK that > will work on every platform. So kernel must pick for DPDK, IMHO. I > don't see any feasible alternative. Yes, hence *optionally specified* base address only. > > Which is why I'm suggesting that the base address be an optional > > request. DPDK *will* care about the size of the range, so it just > > requests that and gets told a base address. > > We've talked about a size of IOVA address space before, strictly as a > hint, to possibly optimize page table layout, or something, and I'm > fine with that idea. But - we have no driver implementation today, so > I'm not sure what we can really do with this right now.. You can check that the hardware aperture is at least as large as the requested range. Even on the pretty general x86 IOMMU, if the program wants 2^62 bytes of aperture, I'm pretty sure you won't be able to supply it.
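The check David describes — verifying a size-only request against the advertised windows and handing back a base address — could look like this in userspace. Everything here (the `range` struct, the `base_for_size()` helper) is hypothetical, invented to illustrate the semantics under discussion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct range { uint64_t start, last; };  /* inclusive bounds */

/*
 * Size-only request, DPDK style: find any advertised window big enough
 * and return its base.  Fails if no window can hold `size` -- e.g. a
 * 2^62-byte request against a 48-bit x86 aperture.
 */
static int base_for_size(const struct range *win, size_t nwin,
                         uint64_t size, uint64_t *base)
{
    for (size_t i = 0; i < nwin; i++) {
        if (win[i].last - win[i].start + 1 >= size) {
            *base = win[i].start;
            return 0;
        }
    }
    return -1;
}
```

With a ppc-like pair of windows (a low 32-bit one and a high one at 1&lt;&lt;59), a small request lands in the low window and an over-large request fails cleanly.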
> Kevin could Intel consume a hint on IOVA space and optimize the number > of IO page table levels? > > > > and IMHO, qemu > > > is fine to have a PPC specific userspace driver to tweak this PPC > > > unique thing if the default windows are not acceptable. > > > > Not really, because it's the ppc *target* (guest) side which requires > > the specific properties, but selecting the "advanced" interface > > requires special knowledge on the *host* side. > > The ppc specific driver would be on the generic side of qemu in its > viommu support framework. There is lots of host driver optimization > possible here with knowledge of the underlying host iommu HW. It > should not be connected to the qemu target. Thinking through this... So, I guess we could have basically the same logic I'm suggesting be in the qemu backend iommu driver instead. So the target side (machine type, strictly speaking) would request of the host side the apertures it needs, and the host side driver would see if it can do that, based on both specific knowledge of that driver and the query responses. ppc on x86 should work with that.. at least if the x86 aperture is large enough to reach up to ppc's high window. I guess we'd have the option here of using either the generic host driver or the x86-specific driver. The latter would mean qemu maintaining an x86-format shadow of the io pagetables; mildly tedious, but doable. x86-without-viommu on ppc could work in at least some cases if the ppc host driver requests a low window large enough to cover guest memory. Memory hotplug we can handle by creating a new IOAS using the ppc specific interface with a new window covering the hotplugged region. x86-with-viommu on ppc probably can't be done, since I don't think the ppc low window can be extended far enough to allow for the guest's expected IOVA range.. but there's probably no way around that.
Unless there's some way of configuring / advertising a "crippled" version of the x86 IOMMU with a more limited IOVA range. Device hotplug is the remaining curly case. We're set up happily, then we hotplug a device. The backend aperture shrinks, the host-side qemu driver notices this and notices it no longer covers the ranges that the target-side expects. So... is there any way of backing out of this gracefully. We could detach the device, but in the meantime ongoing DMA maps from previous devices might have failed. We could pre-attach the new device to a new IOAS and check the apertures there - but when we move it to the final IOAS is it guaranteed that the apertures will be (at least) the intersection of the old and new apertures, or is that just the probable outcome. Or I guess we could add a pre-attach-query operation of some sort in the kernel to check what the effect will be before doing the attach for real. Ok.. you convinced me. As long as we have some way to handle the device hotplug case, we can work with this. > It is not so different from today where qemu has to know about ppc's > special vfio interface generically even to emulate x86. Well, yes, and that's a horrible aspect of the current vfio interface. It arose because of (partly) *my* mistake, for not realizing at the time that we could reasonably extend the type1 interface to work for ppc as well. I'm hoping iommufd doesn't repeat my mistake. > > > IMHO it is no different from imagining an Intel specific userspace > > > driver that is using userspace IO pagetables to optimize > > > cross-platform qemu vIOMMU emulation. > > > > I'm not quite sure what you have in mind here. How is it both Intel > > specific and cross-platform? > > It is part of the generic qemu iommu interface layer. For nesting qemu > would copy the guest page table format to the host page table format > in userspace and trigger invalidation - no pinning, no kernel > map/unmap calls.
It can only be done with detailed knowledge of the > host iommu since the host iommu io page table format is exposed > directly to userspace. Ok, I see. That can certainly be done. I was really hoping we could have a working, though non-optimized, implementation using just the generic interface. > > Note however, that having multiple apertures isn't really ppc specific. > > Anything with an IO hole effectively has separate apertures above and > > below the hole. They're much closer together in address than POWER's > > two apertures, but I don't see that makes any fundamental difference > > to the handling of them. > > In the iommu core it handled the io holes and things through the group > reserved IOVA list - there isn't actually a limit in the iommu_domain, > it has a flat pagetable format - and in cases like PASID/SVA the group > reserved list doesn't apply at all. Sure, but how it's implemented internally doesn't change the user visible fact: some IOVAs can't be used, some can. Userspace needs to know which is which in order to operate correctly, and the only reasonable way for it to get them is to be told by the kernel. We should advertise that in a common way, not have different ways for "holes" versus "windows". We can choose either one; I think "windows" rather than "holes" makes more sense, but it doesn't really matter. Whichever one we choose, we need more than one of them for both ppc and x86: - x86 has a "window" from 0 to the bottom IO hole, and a window from the top of the IO hole to the maximum address describable in the IO page table.
- x86 has a hole for the IO hole (duh), and another hole from the maximum IO pagetable address to 2^64-1 (you could call it the "out of bounds hole", I guess) - ppc has a "window" at 0 to a configurable maximum, and another "window" from 2^59 to a configurable maximum - ppc has a hole between the two windows, and another from the end of the high window to 2^64-1 So either representation, either arch, it's 2 windows, 2 holes. There may be other cases that only need 1 of each (SVA, ancient ppc without the high window, probably others). Point is there are common cases that require more than 1. > > Another approach would be to give the required apertures / pagesizes > > in the initial creation of the domain/IOAS. In that case they would > > be static for the IOAS, as well as the underlying iommu_domains: any > > ATTACH which would be incompatible would fail. > > This is the device-specific iommu_domain creation path. The domain can > have information defining its aperture. But that also requires managing the pagetables yourself; I think tying these two concepts together is inflexible. > > That makes life hard for the DPDK case, though. Obviously we can > > still make the base address optional, but for it to be static the > > kernel would have to pick it immediately, before we know what devices > > will be attached, which will be a problem on any system where there > > are multiple IOMMUs with different constraints. > > Which is why the current scheme is fully automatic and we rely on the > iommu driver to automatically select something sane for DPDK/etc > today. But the cost is that the allowable addresses can change implicitly with every ATTACH. > > > In general I have no issue with limiting the IOVA allocator in the > > > kernel, I just don't have a use case of an application that could use > > > the IOVA allocator (like DPDK) and also needs a limitation.. 
> > > > Well, I imagine DPDK has at least the minimal limitation that it needs > > the aperture to be a certain minimum size (I'm guessing at least the > > size of its pinned hugepage working memory region). That's a > > limitation that's unlikely to fail on modern hardware, but it's there. > > Yes, DPDK does assume there is some fairly large available aperture, > that should be the driver default behavior, IMHO. Well, sure, but "fairly large" tends to change meaning over time. The idea is to ensure that the app's idea of "fairly large" matches the kernel's idea of "fairly large". > > > That breaks what I'm > > > trying to do to make DPDK/etc portable and dead simple. > > > > It doesn't break portability at all. As for simplicity, yes it adds > > an extra required step, but the payoff is that it's now impossible to > > subtly screw up by failing to recheck your apertures after an ATTACH. > > That is, it's taking a step which was implicitly required and > > replacing it with one that's explicitly required. > > Again, as above, it breaks portability because apps have no hope to > know what window range to ask for to succeed. It cannot just be a hard > coded range. I see the problem if you have an app where there's a difference between the smallest window it can cope with versus the largest window it can take advantage of. Not sure if that's likely in practice. AFAIK, DPDK will always require its hugepage memory pool mapped, can't deal with less, can't benefit from more. But maybe there's some use case for this I haven't thought of. Ok... here's a revised version of my proposal which I think addresses your concerns and simplifies things. - No new operations, but IOAS_MAP gets some new flags (and IOAS_COPY will probably need matching changes) - By default the IOVA given to IOAS_MAP is a hint only, and the IOVA is chosen by the kernel within the aperture(s).
This is closer to how mmap() operates, and DPDK and similar shouldn't care about having specific IOVAs, even at the individual mapping level. - IOAS_MAP gets an IOMAP_FIXED flag, analagous to mmap()'s MAP_FIXED, for when you really do want to control the IOVA (qemu, maybe some special userspace driver cases) - ATTACH will fail if the new device would shrink the aperture to exclude any already established mappings (I assume this is already the case) - IOAS_MAP gets an IOMAP_RESERVE flag, which operates a bit like a PROT_NONE mmap(). It reserves that IOVA space, so other (non-FIXED) MAPs won't use it, but doesn't actually put anything into the IO pagetables. - Like a regular mapping, ATTACHes that are incompatible with an IOMAP_RESERVEed region will fail - An IOMAP_RESERVEed area can be overmapped with an IOMAP_FIXED mapping So, for DPDK the sequence would be: 1. Create IOAS 2. ATTACH devices 3. IOAS_MAP some stuff 4. Do DMA with the IOVAs that IOAS_MAP returned (Note, not even any need for QUERY in simple cases) For (unoptimized) qemu it would be: 1. Create IOAS 2. IOAS_MAP(IOMAP_FIXED|IOMAP_RESERVE) the valid IOVA regions of the guest platform 3. ATTACH devices (this will fail if they're not compatible with the reserved IOVA regions) 4. Boot the guest (on guest map/invalidate) -> IOAS_MAP(IOMAP_FIXED) to overmap part of the reserved regions (on dev hotplug) -> ATTACH (which might fail, if it conflicts with the reserved regions) (on vIOMMU reconfiguration) -> UNMAP/MAP reserved regions as necessary (which might fail)
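The proposed FIXED/RESERVE semantics can be modelled as a toy in-memory allocator. The flag names below come from the proposal in this message; the single-aperture bump allocator, `toy_ioas` struct, and `toy_map()` function are all invented for illustration and say nothing about how a kernel implementation would actually work.

```c
#include <assert.h>
#include <stdint.h>

#define IOMAP_FIXED   (1u << 0)  /* proposal: caller controls the IOVA */
#define IOMAP_RESERVE (1u << 1)  /* proposal: reserve, nothing mapped  */

struct toy_ioas {
    uint64_t ap_start, ap_last;      /* one aperture, for simplicity   */
    uint64_t resv_start, resv_last;  /* at most one reserved region    */
    int has_resv;
    uint64_t next_free;              /* bump allocator for hinted maps */
};

/* Returns 0 on success; *iova is in/out, as with the proposed IOAS_MAP. */
static int toy_map(struct toy_ioas *s, unsigned int flags,
                   uint64_t len, uint64_t *iova)
{
    if (flags & IOMAP_RESERVE) {
        /* record the region; nothing enters the IO page table */
        s->resv_start = *iova;
        s->resv_last = *iova + len - 1;
        s->has_resv = 1;
        return 0;
    }
    if (flags & IOMAP_FIXED)  /* FIXED may overmap a reserved area */
        return (*iova >= s->ap_start &&
                *iova + len - 1 <= s->ap_last) ? 0 : -1;

    /* kernel-chosen IOVA: steer around the reserved region */
    uint64_t base = s->next_free;
    if (s->has_resv && base <= s->resv_last &&
        base + len - 1 >= s->resv_start)
        base = s->resv_last + 1;
    if (base + len - 1 > s->ap_last)
        return -1;
    *iova = base;
    s->next_free = base + len;
    return 0;
}
```

The three behaviors the proposal asks for fall out directly: a RESERVE records a region without mapping, a hinted map is steered away from it, and a FIXED map may overmap it.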
On Tue, May 10, 2022 at 05:12:04PM +1000, David Gibson wrote: > On Mon, May 09, 2022 at 11:00:41AM -0300, Jason Gunthorpe wrote: > > On Mon, May 09, 2022 at 04:01:52PM +1000, David Gibson wrote: > > > > > > The default iommu_domain that the iommu driver creates will be used > > > > here, it is up to the iommu driver to choose something reasonable for > > > > use by applications like DPDK. ie PPC should probably pick its biggest > > > > x86-like aperture. > > > > > > So, using the big aperture means a very high base IOVA > > > (1<<59)... which means that it won't work at all if you want to attach > > > any devices that aren't capable of 64-bit DMA. > > > > I'd expect to include the 32 bit window too.. > > I'm not entirely sure what you mean. Are you working on the > assumption that we've extended to allowing multiple apertures, so we'd > default to advertising both a small/low aperture and a large/high > aperture? Yes > > No, this just makes it fragile in the other direction because now > > userspace has to know what platform specific things to ask for *or it > > doesn't work at all*. This is not a improvement for the DPDK cases. > > Um.. no. The idea is that userspace requests *what it needs*, not > anything platform specific. In the case of DPDK that would be nothing > more than the (minimum) aperture size. Nothing platform specific > about that. Except a 32 bit platform can only maybe do a < 4G aperture, a 64 bit platform can do more, but it varies how much more, etc. There is no constant value DPDK could stuff in this request, unless it needs a really small amount of IOVA, like 1G or something. > > It isn't like there is some hard coded value we can put into DPDK that > > will work on every platform. So kernel must pick for DPDK, IMHO. I > > don't see any feasible alternative. > > Yes, hence *optionally specified* base address only. 
Okay, so imagine we've already done this and DPDK is not optionally specifying anything :) The structs can be extended so we can add this as an input to creation when a driver can implement it. > > The ppc specific driver would be on the generic side of qemu in its > > viommu support framework. There is lots of host driver optimization > > possible here with knowledge of the underlying host iommu HW. It > > should not be connected to the qemu target. > > Thinking through this... > > So, I guess we could have basically the same logic I'm suggesting be > in the qemu backend iommu driver instead. So the target side (machine > type, strictly speaking) would request of the host side the apertures > it needs, and the host side driver would see if it can do that, based > on both specific knowledge of that driver and the query responses. Yes, this is what I'm thinking > ppc on x86 should work with that.. at least if the x86 aperture is > large enough to reach up to ppc's high window. I guess we'd have the > option here of using either the generic host driver or the > x86-specific driver. The latter would mean qemu maintaining an > x86-format shadow of the io pagetables; mildly tedious, but doable. The appeal of having userspace page tables is performance, so it is tedious to shadow, but it should run faster. > So... is there any way of backing out of this gracefully. We could > detach the device, but in the meantime ongoing DMA maps from > previous devices might have failed. This sounds like a good use case for qemu to communicate ranges - but as I mentioned before Alex said qemu didn't know the ranges.. > We could pre-attach the new device to a new IOAS and check the > apertures there - but when we move it to the final IOAS is it > guaranteed that the apertures will be (at least) the intersection of > the old and new apertures, or is that just the probable outcome. Should be guaranteed > Ok.. you convinced me.
As long as we have some way to handle the > device hotplug case, we can work with this. I like the communicate ranges for hotplug, so long as we can actually implement it in qemu - I'm a bit unclear on that honestly. > Ok, I see. That can certainly be done. I was really hoping we could > have a working, though non-optimized, implementation using just the > generic interface. Oh, sure that should largely work as well too, this is just an additional direction people may find interesting and helps explain why qemu should have an iommu layer inside. > "holes" versus "windows". We can choose either one; I think "windows" > rather than "holes" makes more sense, but it doesn't really matter. Yes, I picked windows aka ranges for the uAPI - we translate the holes from the groups into windows and intersect them with the apertures. > > > Another approach would be to give the required apertures / pagesizes > > > in the initial creation of the domain/IOAS. In that case they would > > > be static for the IOAS, as well as the underlying iommu_domains: any > > > ATTACH which would be incompatible would fail. > > > > This is the device-specific iommu_domain creation path. The domain can > > have information defining its aperture. > > But that also requires managing the pagetables yourself; I think tying > these two concepts together is inflexible. Oh, no, those need to be independent for HW nesting already One device-specific creation path will create the kernel owned map/unmap iommu_domain with device-specific parameters to allow it to be the root of a nest - ie specify S2 on ARM. The second device-specific creation path will create the user owned iommu_domain with device-specific parameters, with the first as a parent. So you get to do both. > > Which is why the current scheme is fully automatic and we rely on the > > iommu driver to automatically select something sane for DPDK/etc > > today. 
> > But the cost is that the allowable addresses can change implicitly > with every ATTACH. Yes, dpdk/etc don't care. > I see the problem if you have an app where there's a difference > between the smallest window it can cope with versus the largest window > it can take advantage of. Not sure if that's likely in practice. > AFAIK, DPDK will always require its hugepage memory pool mapped, can't > deal with less, can't benefit from more. But maybe there's some use > case for this I haven't thought of. Other apps I've seen don't even have a fixed memory pool, they just malloc and can't really predict how much IOVA they need. "approximately the same amount as a process VA" is a reasonable goal for the kernel to default to. A tunable to allow efficiency from smaller allocations sounds great - but let's have driver support before adding the uAPI for it. Intel/AMD/ARM support to have fewer page table levels for instance would be perfect. > Ok... here's a revised version of my proposal which I think addresses > your concerns and simplifies things. > > - No new operations, but IOAS_MAP gets some new flags (and IOAS_COPY > will probably need matching changes) > > - By default the IOVA given to IOAS_MAP is a hint only, and the IOVA > is chosen by the kernel within the aperture(s). This is closer to > how mmap() operates, and DPDK and similar shouldn't care about > having specific IOVAs, even at the individual mapping level. > > - IOAS_MAP gets an IOMAP_FIXED flag, analogous to mmap()'s MAP_FIXED, > for when you really do want to control the IOVA (qemu, maybe some > special userspace driver cases) We already did both of these, the flag is called IOMMU_IOAS_MAP_FIXED_IOVA - if it is not specified then the kernel will select the IOVA internally. > - ATTACH will fail if the new device would shrink the aperture to > exclude any already established mappings (I assume this is already > the case) Yes > - IOAS_MAP gets an IOMAP_RESERVE flag, which operates a bit like a > PROT_NONE mmap().
It reserves that IOVA space, so other (non-FIXED) > MAPs won't use it, but doesn't actually put anything into the IO > pagetables. > - Like a regular mapping, ATTACHes that are incompatible with an > IOMAP_RESERVEed region will fail > - An IOMAP_RESERVEed area can be overmapped with an IOMAP_FIXED > mapping Yeah, this seems OK, I'm thinking a new API might make sense because you don't really want mmap replacement semantics but a permanent record of what IOVA must always be valid. IOMMU_IOA_REQUIRE_IOVA perhaps, similar signature to IOMMUFD_CMD_IOAS_IOVA_RANGES: struct iommu_ioas_require_iova { __u32 size; __u32 ioas_id; __u32 num_iovas; __u32 __reserved; struct iommu_required_iovas { __aligned_u64 start; __aligned_u64 last; } required_iovas[]; }; > So, for DPDK the sequence would be: > > 1. Create IOAS > 2. ATTACH devices > 3. IOAS_MAP some stuff > 4. Do DMA with the IOVAs that IOAS_MAP returned > > (Note, not even any need for QUERY in simple cases) Yes, this is done already > For (unoptimized) qemu it would be: > > 1. Create IOAS > 2. IOAS_MAP(IOMAP_FIXED|IOMAP_RESERVE) the valid IOVA regions of the > guest platform > 3. ATTACH devices (this will fail if they're not compatible with the > reserved IOVA regions) > 4. Boot the guest > > (on guest map/invalidate) -> IOAS_MAP(IOMAP_FIXED) to overmap part of > the reserved regions > (on dev hotplug) -> ATTACH (which might fail, if it conflicts with the > reserved regions) > (on vIOMMU reconfiguration) -> UNMAP/MAP reserved regions as > necessary (which might fail) OK, I will take care of it Thanks, Jason
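Filling in that variable-length struct from userspace would look roughly like this. The struct is copied from the proposal in the message above — it is not a merged uAPI — and the `build_require_iova()` helper is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t __u32;
typedef uint64_t __aligned_u64;

/* Copied from the proposal above; not a merged uAPI. */
struct iommu_ioas_require_iova {
    __u32 size;
    __u32 ioas_id;
    __u32 num_iovas;
    __u32 __reserved;
    struct iommu_required_iovas {
        __aligned_u64 start;
        __aligned_u64 last;
    } required_iovas[];
};

/* Allocate and fill the command for `n` inclusive [start, last] pairs. */
static struct iommu_ioas_require_iova *
build_require_iova(__u32 ioas_id, __aligned_u64 (*ranges)[2], __u32 n)
{
    size_t sz = offsetof(struct iommu_ioas_require_iova, required_iovas)
                + n * sizeof(struct iommu_required_iovas);
    struct iommu_ioas_require_iova *cmd = calloc(1, sz);

    if (!cmd)
        return NULL;
    cmd->size = (__u32)sz;
    cmd->ioas_id = ioas_id;
    cmd->num_iovas = n;
    for (__u32 i = 0; i < n; i++) {
        cmd->required_iovas[i].start = ranges[i][0];
        cmd->required_iovas[i].last = ranges[i][1];
    }
    return cmd;
}
```

Sizing with `offsetof()` plus `n` array elements is the usual pattern for flexible-array-member ioctl commands, and matches the `size`-first layout the thread's uAPI uses elsewhere.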
> From: Jason Gunthorpe > Sent: Monday, May 9, 2022 10:01 PM > > On Mon, May 09, 2022 at 04:01:52PM +1000, David Gibson wrote: > > > Which is why I'm suggesting that the base address be an optional > > request. DPDK *will* care about the size of the range, so it just > > requests that and gets told a base address. > > We've talked about a size of IOVA address space before, strictly as a > hint, to possibly optimize page table layout, or something, and I'm > fine with that idea. But - we have no driver implementation today, so > I'm not sure what we can really do with this right now.. > > Kevin could Intel consume a hint on IOVA space and optimize the number > of IO page table levels? > It could, but not implemented now.
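The optimization being discussed — sizing the IO page table depth from an IOVA-space hint — reduces to a small calculation. The 4KiB-page, 512-entries-per-level layout (9 bits per level over a 12-bit page offset) is assumed purely for illustration; it is not a claim about any specific driver's implementation.

```c
#include <assert.h>
#include <stdint.h>

/*
 * For a 4KiB-page table with 512 entries per level (9 bits per level,
 * 12-bit page offset), how many levels are needed to cover addresses
 * up to `max_iova`?
 */
static unsigned int levels_for(uint64_t max_iova)
{
    unsigned int bits = 12, levels = 0;

    do {
        bits += 9;
        levels++;
    } while (bits < 64 && (max_iova >> bits) != 0);
    return levels;
}
```

This is the payoff of the hint: a process-VA-sized default forces the deepest table, while an application that declares a small IOVA space could get by with two or three levels.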
> From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Wednesday, May 11, 2022 3:00 AM > > On Tue, May 10, 2022 at 05:12:04PM +1000, David Gibson wrote: > > Ok... here's a revised version of my proposal which I think addresses > > your concerns and simplfies things. > > > > - No new operations, but IOAS_MAP gets some new flags (and IOAS_COPY > > will probably need matching changes) > > > > - By default the IOVA given to IOAS_MAP is a hint only, and the IOVA > > is chosen by the kernel within the aperture(s). This is closer to > > how mmap() operates, and DPDK and similar shouldn't care about > > having specific IOVAs, even at the individual mapping level. > > > > - IOAS_MAP gets an IOMAP_FIXED flag, analagous to mmap()'s MAP_FIXED, > > for when you really do want to control the IOVA (qemu, maybe some > > special userspace driver cases) > > We already did both of these, the flag is called > IOMMU_IOAS_MAP_FIXED_IOVA - if it is not specified then kernel will > select the IOVA internally. > > > - ATTACH will fail if the new device would shrink the aperture to > > exclude any already established mappings (I assume this is already > > the case) > > Yes > > > - IOAS_MAP gets an IOMAP_RESERVE flag, which operates a bit like a > > PROT_NONE mmap(). It reserves that IOVA space, so other (non-FIXED) > > MAPs won't use it, but doesn't actually put anything into the IO > > pagetables. > > - Like a regular mapping, ATTACHes that are incompatible with an > > IOMAP_RESERVEed region will fail > > - An IOMAP_RESERVEed area can be overmapped with an IOMAP_FIXED > > mapping > > Yeah, this seems OK, I'm thinking a new API might make sense because > you don't really want mmap replacement semantics but a permanent > record of what IOVA must always be valid. 
> > IOMMU_IOA_REQUIRE_IOVA perhaps, similar signature to > IOMMUFD_CMD_IOAS_IOVA_RANGES: > > struct iommu_ioas_require_iova { > __u32 size; > __u32 ioas_id; > __u32 num_iovas; > __u32 __reserved; > struct iommu_required_iovas { > __aligned_u64 start; > __aligned_u64 last; > } required_iovas[]; > }; As a permanent record, do we want to enforce that, once the required range list is set, all FIXED and non-FIXED allocations must be within the list of ranges? If yes, we can take the end of the last range as the max size of the iova address space to optimize the page table layout. Otherwise we may need another dedicated hint for that optimization. > > > So, for DPDK the sequence would be: > > > > 1. Create IOAS > > 2. ATTACH devices > > 3. IOAS_MAP some stuff > > 4. Do DMA with the IOVAs that IOAS_MAP returned > > > > (Note, not even any need for QUERY in simple cases) > > Yes, this is done already > > > For (unoptimized) qemu it would be: > > > > 1. Create IOAS > > 2. IOAS_MAP(IOMAP_FIXED|IOMAP_RESERVE) the valid IOVA regions of the > > guest platform > > 3. ATTACH devices (this will fail if they're not compatible with the > > reserved IOVA regions) > > 4. Boot the guest I suppose the above is only the sample flow for a PPC vIOMMU. For non-PPC vIOMMUs regular mappings are required before booting the guest and reservation might be done but is not mandatory (at least not what current Qemu vfio can afford as it simply replays valid ranges in the CPU address space). > > > > (on guest map/invalidate) -> IOAS_MAP(IOMAP_FIXED) to overmap part of > > the reserved regions > > (on dev hotplug) -> ATTACH (which might fail, if it conflicts with the > > reserved regions) > > (on vIOMMU reconfiguration) -> UNMAP/MAP reserved regions as > > necessary (which might fail) > > OK, I will take care of it > > Thanks, > Jason
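Kevin's suggestion — deriving the IOVA-space bound from the required range list — is a scan over the list: the highest `last` (equivalently, the end of the last range if the list is sorted) is what the page table must be able to cover. The `req_range` type is invented for the example.

```c
#include <assert.h>
#include <stdint.h>

struct req_range { uint64_t start, last; };  /* inclusive bounds */

/*
 * If all future mappings must fall inside the required ranges, the
 * highest `last` bounds the IOVA space the page tables must cover.
 */
static uint64_t iova_space_ceiling(const struct req_range *r, unsigned int n)
{
    uint64_t max = 0;

    for (unsigned int i = 0; i < n; i++)
        if (r[i].last > max)
            max = r[i].last;
    return max;
}
```

Note that this only works as a sizing hint under the "enforce" interpretation Kevin raises; if non-FIXED maps may land outside the required list, a separate dedicated hint would indeed be needed.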
On Tue, May 10, 2022 at 04:00:09PM -0300, Jason Gunthorpe wrote: > On Tue, May 10, 2022 at 05:12:04PM +1000, David Gibson wrote: > > On Mon, May 09, 2022 at 11:00:41AM -0300, Jason Gunthorpe wrote: > > > On Mon, May 09, 2022 at 04:01:52PM +1000, David Gibson wrote: > > > > > > > > The default iommu_domain that the iommu driver creates will be used > > > > > here, it is up to the iommu driver to choose something reasonable for > > > > > use by applications like DPDK. ie PPC should probably pick its biggest > > > > > x86-like aperture. > > > > > > > > So, using the big aperture means a very high base IOVA > > > > (1<<59)... which means that it won't work at all if you want to attach > > > > any devices that aren't capable of 64-bit DMA. > > > > > > I'd expect to include the 32 bit window too.. > > > > I'm not entirely sure what you mean. Are you working on the > > assumption that we've extended to allowing multiple apertures, so we'd > > default to advertising both a small/low aperture and a large/high > > aperture? > > Yes Ok, that works assuming we can advertise multiple windows. > > > No, this just makes it fragile in the other direction because now > > > userspace has to know what platform specific things to ask for *or it > > > doesn't work at all*. This is not a improvement for the DPDK cases. > > > > Um.. no. The idea is that userspace requests *what it needs*, not > > anything platform specific. In the case of DPDK that would be nothing > > more than the (minimum) aperture size. Nothing platform specific > > about that. > > Except a 32 bit platform can only maybe do a < 4G aperture, a 64 bit > platform can do more, but it varies how much more, etc. > > There is no constant value DPDK could stuff in this request, unless it > needs a really small amount of IOVA, like 1G or something. Well, my assumption was that DPDK always wanted an IOVA window to cover its hugepage buffer space. So not "constant" exactly, but a value it will know at start up time. 
But I think we cover that more closely below.

> > > It isn't like there is some hard coded value we can put into DPDK that
> > > will work on every platform.  So kernel must pick for DPDK, IMHO.  I
> > > don't see any feasible alternative.
> >
> > Yes, hence *optionally specified* base address only.
>
> Okay, so imagine we've already done this and DPDK is not optionally
> specifying anything :)
>
> The structs can be extended so we can add this as an input to creation
> when a driver can implement it.
>
> > > The ppc specific driver would be on the generic side of qemu in its
> > > viommu support framework.  There is lots of host driver optimization
> > > possible here with knowledge of the underlying host iommu HW.  It
> > > should not be connected to the qemu target.
> >
> > Thinking through this...
> >
> > So, I guess we could have basically the same logic I'm suggesting be
> > in the qemu backend iommu driver instead.  So the target side (machine
> > type, strictly speaking) would request of the host side the apertures
> > it needs, and the host side driver would see if it can do that, based
> > on both specific knowledge of that driver and the query responses.
>
> Yes, this is what I'm thinking
>
> > ppc on x86 should work with that.. at least if the x86 aperture is
> > large enough to reach up to ppc's high window.  I guess we'd have the
> > option here of using either the generic host driver or the
> > x86-specific driver.  The latter would mean qemu maintaining an
> > x86-format shadow of the io pagetables; mildly tedious, but doable.
>
> The appeal of having userspace page tables is performance, so it is
> tedious to shadow, but it should run faster.

I doubt the difference is meaningful in the context of an emulated
guest, though.

> > So... is there any way of backing out of this gracefully.  We could
> > detach the device, but in the meantime ongoing DMA maps from
> > previous devices might have failed.
>
> This sounds like a good use case for qemu to communicate ranges - but
> as I mentioned before Alex said qemu didn't know the ranges..

Yeah, I'm a bit baffled by that, and I don't know the context.

Note that there are at least two very different users of the host
IOMMU backends: one is for emulation of guest DMA (with or without a
vIOMMU).  In that case the details of the guest platform should let
qemu know the ranges.  There's also a VFIO based NVMe backend; that
one's much more like a "normal" userspace driver, where it doesn't
care about the address ranges (because they're not guest visible).

> > We could pre-attach the new device to a new IOAS and check the
> > apertures there - but when we move it to the final IOAS is it
> > guaranteed that the apertures will be (at least) the intersection of
> > the old and new apertures, or is that just the probable outcome.
>
> Should be guaranteed

Ok; that would need to be documented.

> > Ok.. you convinced me.  As long as we have some way to handle the
> > device hotplug case, we can work with this.
>
> I like the communicate ranges for hotplug, so long as we can actually
> implement it in qemu - I'm a bit unclear on that honestly.
>
> > Ok, I see.  That can certainly be done.  I was really hoping we could
> > have a working, though non-optimized, implementation using just the
> > generic interface.
>
> Oh, sure that should largely work as well too, this is just an
> additional direction people may find interesting and helps explain why
> qemu should have an iommu layer inside.
>
> > "holes" versus "windows".  We can choose either one; I think "windows"
> > rather than "holes" makes more sense, but it doesn't really matter.
>
> Yes, I picked windows aka ranges for the uAPI - we translate the holes
> from the groups into windows and intersect them with the apertures.

Ok.

> > > > Another approach would be to give the required apertures / pagesizes
> > > > in the initial creation of the domain/IOAS.
> > > > In that case they would
> > > > be static for the IOAS, as well as the underlying iommu_domains: any
> > > > ATTACH which would be incompatible would fail.
> > >
> > > This is the device-specific iommu_domain creation path.  The domain can
> > > have information defining its aperture.
> >
> > But that also requires managing the pagetables yourself; I think tying
> > these two concepts together is inflexible.
>
> Oh, no, those need to be independent for HW nesting already
>
> One device-specific creation path will create the kernel owned
> map/unmap iommu_domain with device-specific parameters to allow it to
> be the root of a nest - ie specify S2 on ARM.
>
> The second device-specific creation path will create the user owned
> iommu_domain with device-specific parameters, with the first as a
> parent.
>
> So you get to do both.

Ah!  Good to know.

> > > Which is why the current scheme is fully automatic and we rely on the
> > > iommu driver to automatically select something sane for DPDK/etc
> > > today.
> >
> > But the cost is that the allowable addresses can change implicitly
> > with every ATTACH.
>
> Yes, dpdk/etc don't care.

Well... as long as nothing they've already mapped goes away.

> > I see the problem if you have an app where there's a difference
> > between the smallest window it can cope with versus the largest window
> > it can take advantage of.  Not sure if that's likely in practice.
> > AFAIK, DPDK will always require its hugepage memory pool mapped, can't
> > deal with less, can't benefit from more.  But maybe there's some use
> > case for this I haven't thought of.
>
> Other apps I've seen don't even have a fixed memory pool, they just
> malloc and can't really predict how much IOVA they
> need.  "approximately the same amount as a process VA" is a reasonable
> goal for the kernel to default to.

Hm, ok, I guess that makes sense.
> A tunable to allow efficiency from smaller allocations sounds great - > but let's have driver support before adding the uAPI for > it. Intel/AMD/ARM support to have fewer page table levels for instance > would be perfect. > > > Ok... here's a revised version of my proposal which I think addresses > > your concerns and simplfies things. > > > > - No new operations, but IOAS_MAP gets some new flags (and IOAS_COPY > > will probably need matching changes) > > > > - By default the IOVA given to IOAS_MAP is a hint only, and the IOVA > > is chosen by the kernel within the aperture(s). This is closer to > > how mmap() operates, and DPDK and similar shouldn't care about > > having specific IOVAs, even at the individual mapping level. > > > > - IOAS_MAP gets an IOMAP_FIXED flag, analagous to mmap()'s MAP_FIXED, > > for when you really do want to control the IOVA (qemu, maybe some > > special userspace driver cases) > > We already did both of these, the flag is called > IOMMU_IOAS_MAP_FIXED_IOVA - if it is not specified then kernel will > select the IOVA internally. Ok, great. > > - ATTACH will fail if the new device would shrink the aperture to > > exclude any already established mappings (I assume this is already > > the case) > > Yes Good to know. > > - IOAS_MAP gets an IOMAP_RESERVE flag, which operates a bit like a > > PROT_NONE mmap(). It reserves that IOVA space, so other (non-FIXED) > > MAPs won't use it, but doesn't actually put anything into the IO > > pagetables. > > - Like a regular mapping, ATTACHes that are incompatible with an > > IOMAP_RESERVEed region will fail > > - An IOMAP_RESERVEed area can be overmapped with an IOMAP_FIXED > > mapping > > Yeah, this seems OK, I'm thinking a new API might make sense because > you don't really want mmap replacement semantics but a permanent > record of what IOVA must always be valid. 
> > IOMMU_IOA_REQUIRE_IOVA perhaps, similar signature to > IOMMUFD_CMD_IOAS_IOVA_RANGES: > > struct iommu_ioas_require_iova { > __u32 size; > __u32 ioas_id; > __u32 num_iovas; > __u32 __reserved; > struct iommu_required_iovas { > __aligned_u64 start; > __aligned_u64 last; > } required_iovas[]; > }; Sounds reasonable. > > So, for DPDK the sequence would be: > > > > 1. Create IOAS > > 2. ATTACH devices > > 3. IOAS_MAP some stuff > > 4. Do DMA with the IOVAs that IOAS_MAP returned > > > > (Note, not even any need for QUERY in simple cases) > > Yes, this is done already > > > For (unoptimized) qemu it would be: > > > > 1. Create IOAS > > 2. IOAS_MAP(IOMAP_FIXED|IOMAP_RESERVE) the valid IOVA regions of the > > guest platform > > 3. ATTACH devices (this will fail if they're not compatible with the > > reserved IOVA regions) > > 4. Boot the guest > > > > (on guest map/invalidate) -> IOAS_MAP(IOMAP_FIXED) to overmap part of > > the reserved regions > > (on dev hotplug) -> ATTACH (which might fail, if it conflicts with the > > reserved regions) > > (on vIOMMU reconfiguration) -> UNMAP/MAP reserved regions as > > necessary (which might fail) > > OK, I will take care of it Hooray! Long contentious thread eventually reaches productive resolution :).
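The reserve/overmap rules agreed above can be modeled in a few lines of userspace C. This is a sketch under the thread's proposed semantics only — IOMAP_RESERVE is not a real uAPI flag at this point, and both function names are invented: a non-FIXED allocation avoids mapped *and* reserved areas, while a FIXED map may overmap reserved space but never an existing real mapping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum area_kind { AREA_MAPPED, AREA_RESERVED };

struct area {
	uint64_t start, last;	/* inclusive */
	enum area_kind kind;
};

/*
 * IOMAP_FIXED semantics: the caller-chosen range may overmap RESERVED
 * space, but may not overlap an existing real mapping.
 */
static bool fixed_map_allowed(const struct area *areas, size_t n,
			      uint64_t start, uint64_t last)
{
	for (size_t i = 0; i < n; i++)
		if (areas[i].kind == AREA_MAPPED &&
		    start <= areas[i].last && areas[i].start <= last)
			return false;
	return true;
}

/*
 * Non-FIXED allocation: first gap of the requested length inside
 * [lo, hi] that avoids every area, mapped or reserved.  'areas' is
 * sorted by start.
 */
static bool alloc_iova(const struct area *areas, size_t n,
		       uint64_t lo, uint64_t hi, uint64_t len, uint64_t *out)
{
	uint64_t cur = lo;

	for (size_t i = 0; i < n; i++) {
		if (areas[i].last < cur)
			continue;		/* entirely below the cursor */
		if (areas[i].start >= cur + len)
			break;			/* the gap before it fits */
		cur = areas[i].last + 1;	/* skip the occupied area */
	}
	if (cur + len - 1 > hi)
		return false;
	*out = cur;
	return true;
}
```

So with a reserved low region and one real mapping, a non-FIXED allocation lands in the first truly free gap, while a FIXED map is accepted inside the reserved region and rejected on top of the mapping.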
On Wed, May 11, 2022 at 03:15:22AM +0000, Tian, Kevin wrote: > > From: Jason Gunthorpe <jgg@nvidia.com> > > Sent: Wednesday, May 11, 2022 3:00 AM > > > > On Tue, May 10, 2022 at 05:12:04PM +1000, David Gibson wrote: > > > Ok... here's a revised version of my proposal which I think addresses > > > your concerns and simplfies things. > > > > > > - No new operations, but IOAS_MAP gets some new flags (and IOAS_COPY > > > will probably need matching changes) > > > > > > - By default the IOVA given to IOAS_MAP is a hint only, and the IOVA > > > is chosen by the kernel within the aperture(s). This is closer to > > > how mmap() operates, and DPDK and similar shouldn't care about > > > having specific IOVAs, even at the individual mapping level. > > > > > > - IOAS_MAP gets an IOMAP_FIXED flag, analagous to mmap()'s MAP_FIXED, > > > for when you really do want to control the IOVA (qemu, maybe some > > > special userspace driver cases) > > > > We already did both of these, the flag is called > > IOMMU_IOAS_MAP_FIXED_IOVA - if it is not specified then kernel will > > select the IOVA internally. > > > > > - ATTACH will fail if the new device would shrink the aperture to > > > exclude any already established mappings (I assume this is already > > > the case) > > > > Yes > > > > > - IOAS_MAP gets an IOMAP_RESERVE flag, which operates a bit like a > > > PROT_NONE mmap(). It reserves that IOVA space, so other (non-FIXED) > > > MAPs won't use it, but doesn't actually put anything into the IO > > > pagetables. > > > - Like a regular mapping, ATTACHes that are incompatible with an > > > IOMAP_RESERVEed region will fail > > > - An IOMAP_RESERVEed area can be overmapped with an IOMAP_FIXED > > > mapping > > > > Yeah, this seems OK, I'm thinking a new API might make sense because > > you don't really want mmap replacement semantics but a permanent > > record of what IOVA must always be valid. 
> >
> > IOMMU_IOA_REQUIRE_IOVA perhaps, similar signature to
> > IOMMUFD_CMD_IOAS_IOVA_RANGES:
> >
> > struct iommu_ioas_require_iova {
> >         __u32 size;
> >         __u32 ioas_id;
> >         __u32 num_iovas;
> >         __u32 __reserved;
> >         struct iommu_required_iovas {
> >                 __aligned_u64 start;
> >                 __aligned_u64 last;
> >         } required_iovas[];
> > };
>
> As a permanent record do we want to enforce that once the required
> range list is set all FIXED and non-FIXED allocations must be within the
> list of ranges?

No, I would just use this as a guarantee that going forward any
get_ranges will always return ranges that cover the listed required
ranges.  I.e. any narrowing of the ranges will be refused.

map/unmap should only be restricted to the get_ranges output.

Wouldn't burn CPU cycles to nanny userspace here.

> If yes we can take the end of the last range as the max size of the iova
> address space to optimize the page table layout.

I think this API should not interact with the driver.  Its only job is
to prevent devices from attaching that would narrow the ranges.

If we also use it to adjust the aperture of the created iommu_domain
then it loses its usefulness as a guard, since something like qemu
would have to leave room for hotplug as well.

I suppose optimizing the created iommu_domains should be some other
API, with a different set of ranges and the '# of bytes of IOVA' hint
as well.

> > > For (unoptimized) qemu it would be:
> > >
> > > 1. Create IOAS
> > > 2. IOAS_MAP(IOMAP_FIXED|IOMAP_RESERVE) the valid IOVA regions of the
> > >    guest platform
> > > 3. ATTACH devices (this will fail if they're not compatible with the
> > >    reserved IOVA regions)
> > > 4. Boot the guest
>
> I suppose above is only the sample flow for PPC vIOMMU.  For non-PPC
> vIOMMUs regular mappings are required before booting the guest and
> reservation might be done but not mandatory (at least not what current
> Qemu vfio can afford as it simply replays valid ranges in the CPU address
> space).
I think qemu can always do it, it feels like it would simplify error cases around aperture mismatches. Jason
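Earlier in the thread Jason notes that the uAPI "translates the holes from the groups into windows and intersects them with the apertures". That intersection is a standard sweep over two sorted interval lists; an illustrative userspace C version (not the kernel code, which keeps this state in the io_pagetable):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive [start, last] windows, sorted and non-overlapping. */
struct iova_range {
	uint64_t start;
	uint64_t last;
};

/*
 * Intersect two window lists; writes at most na + nb windows to 'out'
 * and returns how many were written.  Used conceptually each time a
 * device attach narrows the valid IOVA ranges of an IOAS.
 */
static size_t windows_intersect(const struct iova_range *a, size_t na,
				const struct iova_range *b, size_t nb,
				struct iova_range *out)
{
	size_t i = 0, j = 0, n = 0;

	while (i < na && j < nb) {
		uint64_t start = a[i].start > b[j].start ? a[i].start : b[j].start;
		uint64_t last  = a[i].last  < b[j].last  ? a[i].last  : b[j].last;

		if (start <= last)
			out[n++] = (struct iova_range){ start, last };
		/* Advance whichever window ends first */
		if (a[i].last < b[j].last)
			i++;
		else
			j++;
	}
	return n;
}
```

Intersecting a single 32-bit aperture with a two-window list (a small low range plus a ppc-style high window) keeps only the low range, which is exactly the "attach narrows the ranges" behavior the guard API is meant to police.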
> From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Thursday, May 12, 2022 12:32 AM > > On Wed, May 11, 2022 at 03:15:22AM +0000, Tian, Kevin wrote: > > > From: Jason Gunthorpe <jgg@nvidia.com> > > > Sent: Wednesday, May 11, 2022 3:00 AM > > > > > > On Tue, May 10, 2022 at 05:12:04PM +1000, David Gibson wrote: > > > > Ok... here's a revised version of my proposal which I think addresses > > > > your concerns and simplfies things. > > > > > > > > - No new operations, but IOAS_MAP gets some new flags (and > IOAS_COPY > > > > will probably need matching changes) > > > > > > > > - By default the IOVA given to IOAS_MAP is a hint only, and the IOVA > > > > is chosen by the kernel within the aperture(s). This is closer to > > > > how mmap() operates, and DPDK and similar shouldn't care about > > > > having specific IOVAs, even at the individual mapping level. > > > > > > > > - IOAS_MAP gets an IOMAP_FIXED flag, analagous to mmap()'s > MAP_FIXED, > > > > for when you really do want to control the IOVA (qemu, maybe some > > > > special userspace driver cases) > > > > > > We already did both of these, the flag is called > > > IOMMU_IOAS_MAP_FIXED_IOVA - if it is not specified then kernel will > > > select the IOVA internally. > > > > > > > - ATTACH will fail if the new device would shrink the aperture to > > > > exclude any already established mappings (I assume this is already > > > > the case) > > > > > > Yes > > > > > > > - IOAS_MAP gets an IOMAP_RESERVE flag, which operates a bit like a > > > > PROT_NONE mmap(). It reserves that IOVA space, so other (non- > FIXED) > > > > MAPs won't use it, but doesn't actually put anything into the IO > > > > pagetables. 
> > > > - Like a regular mapping, ATTACHes that are incompatible with an > > > > IOMAP_RESERVEed region will fail > > > > - An IOMAP_RESERVEed area can be overmapped with an > IOMAP_FIXED > > > > mapping > > > > > > Yeah, this seems OK, I'm thinking a new API might make sense because > > > you don't really want mmap replacement semantics but a permanent > > > record of what IOVA must always be valid. > > > > > > IOMMU_IOA_REQUIRE_IOVA perhaps, similar signature to > > > IOMMUFD_CMD_IOAS_IOVA_RANGES: > > > > > > struct iommu_ioas_require_iova { > > > __u32 size; > > > __u32 ioas_id; > > > __u32 num_iovas; > > > __u32 __reserved; > > > struct iommu_required_iovas { > > > __aligned_u64 start; > > > __aligned_u64 last; > > > } required_iovas[]; > > > }; > > > > As a permanent record do we want to enforce that once the required > > range list is set all FIXED and non-FIXED allocations must be within the > > list of ranges? > > No, I would just use this as a guarntee that going forward any > get_ranges will always return ranges that cover the listed required > ranges. Ie any narrowing of the ranges will be refused. > > map/unmap should only be restricted to the get_ranges output. > > Wouldn't burn CPU cycles to nanny userspace here. fair enough. > > > If yes we can take the end of the last range as the max size of the iova > > address space to optimize the page table layout. > > I think this API should not interact with the driver. Its only job is > to prevent devices from attaching that would narrow the ranges. > > If we also use it to adjust the aperture of the created iommu_domain > then it looses its usefullness as guard since something like qemu > would have to leave room for hotplug as well. > > I suppose optimizing the created iommu_domains should be some other > API, with a different set of ranges and the '# of bytes of IOVA' hint > as well. make sense. > > > > > For (unoptimized) qemu it would be: > > > > > > > > 1. Create IOAS > > > > 2. 
> > > >    IOAS_MAP(IOMAP_FIXED|IOMAP_RESERVE) the valid IOVA regions of the
> > > >    guest platform
> > > > 3. ATTACH devices (this will fail if they're not compatible with the
> > > >    reserved IOVA regions)
> > > > 4. Boot the guest
> >
> > I suppose above is only the sample flow for PPC vIOMMU.  For non-PPC
> > vIOMMUs regular mappings are required before booting the guest and
> > reservation might be done but not mandatory (at least not what current
> > Qemu vfio can afford as it simply replays valid ranges in the CPU address
> > space).
>
> I think qemu can always do it, it feels like it would simplify error
> cases around aperture mismatches.

It could, but that requires more changes in Qemu to define required
ranges in platform logic and then convey them from the Qemu address
space to VFIO.  I view it as an optimization, hence not necessarily
something to be done immediately.

Thanks
Kevin
On Wed, May 11, 2022 at 03:15:22AM +0000, Tian, Kevin wrote: > > From: Jason Gunthorpe <jgg@nvidia.com> > > Sent: Wednesday, May 11, 2022 3:00 AM > > > > On Tue, May 10, 2022 at 05:12:04PM +1000, David Gibson wrote: > > > Ok... here's a revised version of my proposal which I think addresses > > > your concerns and simplfies things. > > > > > > - No new operations, but IOAS_MAP gets some new flags (and IOAS_COPY > > > will probably need matching changes) > > > > > > - By default the IOVA given to IOAS_MAP is a hint only, and the IOVA > > > is chosen by the kernel within the aperture(s). This is closer to > > > how mmap() operates, and DPDK and similar shouldn't care about > > > having specific IOVAs, even at the individual mapping level. > > > > > > - IOAS_MAP gets an IOMAP_FIXED flag, analagous to mmap()'s MAP_FIXED, > > > for when you really do want to control the IOVA (qemu, maybe some > > > special userspace driver cases) > > > > We already did both of these, the flag is called > > IOMMU_IOAS_MAP_FIXED_IOVA - if it is not specified then kernel will > > select the IOVA internally. > > > > > - ATTACH will fail if the new device would shrink the aperture to > > > exclude any already established mappings (I assume this is already > > > the case) > > > > Yes > > > > > - IOAS_MAP gets an IOMAP_RESERVE flag, which operates a bit like a > > > PROT_NONE mmap(). It reserves that IOVA space, so other (non-FIXED) > > > MAPs won't use it, but doesn't actually put anything into the IO > > > pagetables. > > > - Like a regular mapping, ATTACHes that are incompatible with an > > > IOMAP_RESERVEed region will fail > > > - An IOMAP_RESERVEed area can be overmapped with an IOMAP_FIXED > > > mapping > > > > Yeah, this seems OK, I'm thinking a new API might make sense because > > you don't really want mmap replacement semantics but a permanent > > record of what IOVA must always be valid. 
>
> IOMMU_IOA_REQUIRE_IOVA perhaps, similar signature to
> IOMMUFD_CMD_IOAS_IOVA_RANGES:
>
> struct iommu_ioas_require_iova {
>         __u32 size;
>         __u32 ioas_id;
>         __u32 num_iovas;
>         __u32 __reserved;
>         struct iommu_required_iovas {
>                 __aligned_u64 start;
>                 __aligned_u64 last;
>         } required_iovas[];
> };

As a permanent record do we want to enforce that once the required
range list is set all FIXED and non-FIXED allocations must be within
the list of ranges?

No, I don't think so.  In fact, the way I was envisaging this,
non-FIXED mappings will *never* go into the reserved ranges.  This is
for the benefit of any use cases that need both mappings where they
don't care about the IOVA and those which do.

Essentially, reserving a region here is saying to the kernel "I want
to manage this IOVA space; make sure nothing else touches it".  That
means both that the kernel must disallow any hw associated changes
(like ATTACH) which would impinge on the reserved region, and also any
IOVA allocations that would take parts away from that space.

Whether we want to restrict FIXED mappings to the reserved regions is
an interesting question.  I wasn't thinking that would be necessary
(just as you can use mmap() MAP_FIXED anywhere).  However.. much as
MAP_FIXED is very dangerous to use if you don't previously reserve
address space, I think IOMAP_FIXED is dangerous if you haven't
previously reserved space.  So maybe it would make sense to only allow
FIXED mappings within reserved regions.

Strictly dividing the IOVA space into kernel managed and user managed
regions does make a certain amount of sense.

> If yes we can take the end of the last range as the max size of the iova
> address space to optimize the page table layout.
>
> otherwise we may need another dedicated hint for that optimization.

Right.  With the revised model where reserving windows is optional,
not required, I don't think we can quite re-use this for optimization
hints.
Which is a bit unfortunate.  I can't immediately see a way to tweak
this which handles both more neatly, but I like the idea if we can
figure out a way.

> > > So, for DPDK the sequence would be:
> > >
> > > 1. Create IOAS
> > > 2. ATTACH devices
> > > 3. IOAS_MAP some stuff
> > > 4. Do DMA with the IOVAs that IOAS_MAP returned
> > >
> > > (Note, not even any need for QUERY in simple cases)
> >
> > Yes, this is done already
> >
> > > For (unoptimized) qemu it would be:
> > >
> > > 1. Create IOAS
> > > 2. IOAS_MAP(IOMAP_FIXED|IOMAP_RESERVE) the valid IOVA regions of the
> > >    guest platform
> > > 3. ATTACH devices (this will fail if they're not compatible with the
> > >    reserved IOVA regions)
> > > 4. Boot the guest
>
> I suppose above is only the sample flow for PPC vIOMMU.  For non-PPC
> vIOMMUs regular mappings are required before booting the guest and
> reservation might be done but not mandatory (at least not what current
> Qemu vfio can afford as it simply replays valid ranges in the CPU address
> space).

That was a somewhat simplified description.  When we look in more
detail, I think the ppc and x86 models become more similar.  So, in
more detail, I think it would look like this:

1. Create base IOAS
2. Map guest memory into base IOAS so that IOVA==GPA
3. Create IOASes for each vIOMMU domain
4. Reserve windows in domain IOASes where the vIOMMU will allow
   mappings by default
5. ATTACH devices to appropriate IOASes (***)
6. Boot the guest

  On guest map/invalidate:
    Use IOAS_COPY to take mappings from base IOAS and put them into
    the domain IOAS
  On memory hotplug:
    IOAS_MAP new memory block into base IOAS
  On dev hotplug: (***)
    ATTACH devices to appropriate IOAS
  On guest reconfiguration of vIOMMU domains (x86 only):
    DETACH device from base IOAS, attach to vIOMMU domain IOAS
  On guest reconfiguration of vIOMMU apertures (ppc only):
    Alter reserved regions to match vIOMMU

The difference between ppc and x86 is at the places marked (***):
which IOAS each device gets attached to and when.  For x86 all devices
live in the base IOAS by default, and only get moved to domain IOASes
when those domains are set up in the vIOMMU.  For POWER each device
starts in a domain IOAS based on its guest PE, and never moves.

[This is still a bit simplified.  In practice, I imagine you'd
optimize to only create the domain IOASes at the point they're needed
- on boot for ppc, but only when the vIOMMU is configured for x86.  I
don't think that really changes the model, though.]

A few aspects of the model interact quite nicely here.  Mapping a
large memory guest with IOVA==GPA would probably fail on a ppc host
IOMMU.  But if both guest and host are ppc, then no devices get
attached to that base IOAS, so its apertures don't get restricted by
the host hardware.  That way we get a common model, and the benefits
of GUP sharing via IOAS_COPY, without it failing in the ppc-on-ppc
case.

x86-on-ppc and ppc-on-x86 will probably only work in limited cases
where the various sizes and windows line up, but the possibility isn't
precluded by the model or interfaces.

> > > (on guest map/invalidate) -> IOAS_MAP(IOMAP_FIXED) to overmap part of
> > >    the reserved regions
> > > (on dev hotplug) -> ATTACH (which might fail, if it conflicts with the
> > >    reserved regions)
> > > (on vIOMMU reconfiguration) -> UNMAP/MAP reserved regions as
> > >    necessary (which might fail)
> >
> > OK, I will take care of it
> >
> > Thanks,
> > Jason
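The payoff of the base-IOAS model above is that GUP and memory accounting happen once, when memory enters the base IOAS; IOAS_COPY into a vIOMMU domain IOAS then only takes a reference on the already-pinned pages, keeping the guest map/invalidate hot path cheap. A toy C model of that property (illustrative only — the real object lifecycle lives in iommufd's io_pagetable/iopt_pages, and these names are made up):

```c
#include <assert.h>
#include <stdint.h>

/* Models the one-time pin: GUP and accounting happen when memory
 * first enters the base IOAS. */
struct pinned_pages {
	int refcount;
	int gup_calls;
};

struct mapping {
	uint64_t iova;
	uint64_t len;
	struct pinned_pages *pages;
};

/* Mapping memory into the base IOAS pays the GUP/accounting cost. */
static struct mapping ioas_map(struct pinned_pages *pp, uint64_t iova,
			       uint64_t len)
{
	pp->gup_calls++;	/* the expensive part */
	pp->refcount++;
	return (struct mapping){ iova, len, pp };
}

/* IOAS_COPY: share the already-pinned pages; no GUP on the hot path. */
static struct mapping ioas_copy(const struct mapping *src, uint64_t dst_iova)
{
	src->pages->refcount++;
	return (struct mapping){ dst_iova, src->len, src->pages };
}
```

Mapping once into the base IOAS and then copying into a domain IOAS leaves a single GUP call and two references on the same pinned pages — the preregistration-like behavior David refers to.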
On 4/29/22 00:53, David Gibson wrote:
> On Thu, Mar 24, 2022 at 04:04:03PM -0600, Alex Williamson wrote:
>> On Wed, 23 Mar 2022 21:33:42 -0300
>> Jason Gunthorpe <jgg@nvidia.com> wrote:
>>
>>> On Wed, Mar 23, 2022 at 04:51:25PM -0600, Alex Williamson wrote:
>>>
>>>> My overall question here would be whether we can actually achieve a
>>>> compatibility interface that has sufficient feature transparency that we
>>>> can dump vfio code in favor of this interface, or will there be enough
>>>> niche use cases that we need to keep type1 and vfio containers around
>>>> through a deprecation process?
>>>
>>> Other than SPAPR, I think we can.
>>
>> Does this mean #ifdef CONFIG_PPC in vfio core to retain infrastructure
>> for POWER support?
>
> There are a few different levels to consider for dealing with PPC.
> For a suitable long term interface for ppc hosts and guests dropping
> this is fine: the ppc specific iommu model was basically an
> ill-conceived idea from the beginning, because none of us had
> sufficiently understood what things were general and what things were
> iommu model/hw specific.
>
> ..mostly.  There are several points of divergence for the ppc iommu
> model.
>
> 1) Limited IOVA windows.  This one turned out to not really be ppc
> specific, and is (rightly) handled generically in the new interface.
> No problem here.
>
> 2) Costly GUPs.  pseries (the most common ppc machine type) always
> expects a (v)IOMMU.  That means that unlike the common x86 model of a
> host with IOMMU, but guests with no-vIOMMU, guest initiated
> maps/unmaps can be a hot path.  Accounting in that path can be
> prohibitive (and on POWER8 in particular it prevented us from
> optimizing that path the way we wanted).  We had two solutions for
> that: in v1 the explicit ENABLE/DISABLE calls, which preaccounted
> based on the IOVA window sizes.  That was improved in the v2 which
> used the concept of preregistration.
> IIUC iommufd can achieve the
> same effect as preregistration using IOAS_COPY, so this one isn't
> really a problem either.

I am getting rid of those POWER8-related realmode handlers as POWER9
has the MMU enabled when hcalls are handled.  The costly GUP problem
is still there though (which the base IOAS should solve?).

> 3) "dynamic DMA windows" (DDW).  The IBM IOMMU hardware allows for 2 IOVA
> windows, which aren't contiguous with each other.  The base addresses
> of each of these are fixed, but the size of each window, the pagesize
> (i.e. granularity) of each window and the number of levels in the
> IOMMU pagetable are runtime configurable.  Because it's true in the
> hardware, it's also true of the vIOMMU interface defined by the IBM
> hypervisor (and adopted by KVM as well).  So, guests can request
> changes in how these windows are handled.  Typical Linux guests will
> use the "low" window (IOVA 0..2GiB) dynamically, and the high window
> (IOVA 1<<60..???) to map all of RAM.  However, as a hypervisor we
> can't count on that; the guest can use them however it wants.

The guest actually does this already.  AIX has always been like that,
and Linux is forced to do that for SR-IOV VFs as there can be many VFs
and TCEs (==IOPTEs) are a limited resource.  Today's pseries IOMMU
code first tries mapping 1:1 (as it has been for ages) but if there
are not enough TCEs, it removes the first window (which increases the
TCE budget), creates a new 64bit window (as big as possible but not
necessarily enough for 1:1; 64K/2M IOMMU page sizes allowed) and does
map/unmap as drivers go.

Which means the guest RAM does not need to be all mapped in that base
IOAS suggested down this thread, as that would mean all memory is
pinned and powervm won't be able to swap it out (yeah, it can do such
a thing now!).  Not sure if we really want to support this or stick to
a simpler design.

> (3) still needs a plan for how to fit it into the /dev/iommufd model.
> This is a secondary reason that in the past I advocated for the user
> requesting specific DMA windows which the kernel would accept or
> refuse, rather than having a query function - it connects easily to
> the DDW model.  With the query-first model we'd need some sort of
> extension here, not really sure what it should look like.
>
>
> Then, there's handling existing qemu (or other software) that is using
> the VFIO SPAPR_TCE interfaces.  First, it's not entirely clear if this
> should be a goal or not: as others have noted, working actively to
> port qemu to the new interface at the same time as making a
> comprehensive in-kernel compat layer is arguably redundant work.
>
> That said, if we did want to handle this in an in-kernel compat layer,
> here's roughly what you'd need for SPAPR_TCE v2:
>
> - VFIO_IOMMU_SPAPR_TCE_GET_INFO
>   I think this should be fairly straightforward; the information you
>   need should be in the now generic IOVA window stuff and would just
>   need massaging into the expected format.
> - VFIO_IOMMU_SPAPR_REGISTER_MEMORY /
>   VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY
>   IIUC, these could be translated into map/unmap operations onto a
>   second implicit IOAS which represents the preregistered memory
>   areas (to which we'd never connect an actual device).  Along with
>   this VFIO_MAP and VFIO_UNMAP operations would need to check for
>   this case, verify their addresses against the preregistered space
>   and be translated into IOAS_COPY operations from the prereg
>   address space instead of raw IOAS_MAP operations.  Fiddly, but not
>   fundamentally hard, I think.
>
> For SPAPR_TCE_v1 things are a bit trickier
>
> - VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE
>   I suspect you could get away with implementing these as no-ops.
>   It wouldn't be strictly correct, but I think software which is
>   using the interface correctly should work this way, though
>   possibly not optimally.  That might be good enough for this ugly
>   old interface.
>
> And...
then there's VFIO_EEH_PE_OP. It's very hard to know what to do > with this because the interface was completely broken for most of its > lifetime. EEH is a fancy error handling feature of IBM PCI hardware > somewhat similar in concept, though not interface, to PCIe AER. I have > a very strong impression that while this was a much-touted checkbox > feature for RAS, no-one, ever, actually used it. As evidenced by the > fact that there was, I believe, over a *decade* in which all the > interfaces were completely broken by design, and apparently no-one > noticed. > > So, cynically, you could probably get away with making this a no-op as > well. If you wanted to do it properly... well... that would require > training up yet another person to actually understand this and hoping > they get it done before they run screaming. This one gets very ugly > because the EEH operations have to operate on the hardware (or > firmware) "Partitionable Endpoints" (PEs) which correspond one to one > with IOMMU groups, but not necessarily with VFIO containers, but > there's not really any sensible way to expose that to users. > > You might be able to do this by simply failing this outright if > there's anything other than exactly one IOMMU group bound to the > container / IOAS (which I think might be what VFIO itself does now). > Handling that with a device centric API gets somewhat fiddlier, of > course. > > > _______________________________________________ > iommu mailing list > iommu@lists.linux-foundation.org > https://lists.linuxfoundation.org/mailman/listinfo/iommu
On Mon, May 23, 2022 at 04:02:22PM +1000, Alexey Kardashevskiy wrote: > Which means the guest RAM does not need to be all mapped in that base IOAS > suggested down this thread as that would mean all memory is pinned and > powervm won't be able to swap it out (yeah, it can do such thing now!). Not > sure if we really want to support this or stick to a simpler design. Huh? How can it swap? Calling GUP is not optional. Either you call GUP at the start and there is no swap, or you call GUP for each vIOMMU hypercall. Since everyone says PPC doesn't call GUP during the hypercall - how is it working? Jason
On Tue, May 24, 2022 at 10:25:53AM -0300, Jason Gunthorpe wrote: > On Mon, May 23, 2022 at 04:02:22PM +1000, Alexey Kardashevskiy wrote: > > > Which means the guest RAM does not need to be all mapped in that base IOAS > > suggested down this thread as that would mean all memory is pinned and > > powervm won't be able to swap it out (yeah, it can do such thing now!). Not > > sure if we really want to support this or stick to a simpler design. > > Huh? How can it swap? Calling GUP is not optional. Either you call GUP > at the start and there is no swap, or you call GUP for each vIOMMU > hypercall. > > Since everyone says PPC doesn't call GUP during the hypercall - how is > it working? The current implementation does GUP during the pre-reserve. I think Alexey's talking about a new PowerVM (IBM hypervisor) feature; I don't know how that works.
On 5/24/22 23:25, Jason Gunthorpe wrote: > On Mon, May 23, 2022 at 04:02:22PM +1000, Alexey Kardashevskiy wrote: > >> Which means the guest RAM does not need to be all mapped in that base IOAS >> suggested down this thread as that would mean all memory is pinned and >> powervm won't be able to swap it out (yeah, it can do such thing now!). Not >> sure if we really want to support this or stick to a simpler design. > > Huh? How can it swap? Calling GUP is not optional. Either you call GUP > at the start and there is no swap, or you call GUP for each vIOMMU > hypercall. Correct, not optional. > Since everyone says PPC doesn't call GUP during the hypercall - how is > it working? It does not call GUP during hypercalls because all VM pages are GUPed in advance at a special memory preregistration step, as we could not call GUP from a hypercall handler with the MMU off (often the case with POWER8 when this was developed in the first place). Things are better with POWER9 (bare metal can do all sorts of things pretty much) but the PowerVM interface with 2 windows is still there and this iommufd proposal is going to be ported on top of PowerVM at first. I am just saying there is a model where not everything is mapped and this has its uses. The PowerVM swapping capability is something new and I do not really know how that works though.
diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile index ca28a135b9675f..2fdff04000b326 100644 --- a/drivers/iommu/iommufd/Makefile +++ b/drivers/iommu/iommufd/Makefile @@ -5,6 +5,7 @@ iommufd-y := \ io_pagetable.o \ ioas.o \ main.o \ - pages.o + pages.o \ + vfio_compat.o obj-$(CONFIG_IOMMUFD) += iommufd.o diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h index e5c717231f851e..31628591591c17 100644 --- a/drivers/iommu/iommufd/iommufd_private.h +++ b/drivers/iommu/iommufd/iommufd_private.h @@ -67,6 +67,8 @@ void iopt_remove_reserved_iova(struct io_pagetable *iopt, void *owner); struct iommufd_ctx { struct file *filp; struct xarray objects; + + struct iommufd_ioas *vfio_ioas; }; struct iommufd_ctx *iommufd_fget(int fd); @@ -78,6 +80,9 @@ struct iommufd_ucmd { void *cmd; }; +int iommufd_vfio_ioctl(struct iommufd_ctx *ictx, unsigned int cmd, + unsigned long arg); + /* Copy the response in ucmd->cmd back to userspace. */ static inline int iommufd_ucmd_respond(struct iommufd_ucmd *ucmd, size_t cmd_len) @@ -186,6 +191,7 @@ int iommufd_ioas_iova_ranges(struct iommufd_ucmd *ucmd); int iommufd_ioas_map(struct iommufd_ucmd *ucmd); int iommufd_ioas_copy(struct iommufd_ucmd *ucmd); int iommufd_ioas_unmap(struct iommufd_ucmd *ucmd); +int iommufd_vfio_ioas(struct iommufd_ucmd *ucmd); /* * A HW pagetable is called an iommu_domain inside the kernel. 
This user object diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c index 6a895489fb5b82..f746fcff8145cb 100644 --- a/drivers/iommu/iommufd/main.c +++ b/drivers/iommu/iommufd/main.c @@ -122,6 +122,8 @@ bool iommufd_object_destroy_user(struct iommufd_ctx *ictx, return false; } __xa_erase(&ictx->objects, obj->id); + if (ictx->vfio_ioas && &ictx->vfio_ioas->obj == obj) + ictx->vfio_ioas = NULL; xa_unlock(&ictx->objects); iommufd_object_ops[obj->type].destroy(obj); @@ -219,27 +221,31 @@ static struct iommufd_ioctl_op iommufd_ioctl_ops[] = { __reserved), IOCTL_OP(IOMMU_IOAS_UNMAP, iommufd_ioas_unmap, struct iommu_ioas_unmap, length), + IOCTL_OP(IOMMU_VFIO_IOAS, iommufd_vfio_ioas, struct iommu_vfio_ioas, + __reserved), }; static long iommufd_fops_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) { + struct iommufd_ctx *ictx = filp->private_data; struct iommufd_ucmd ucmd = {}; struct iommufd_ioctl_op *op; union ucmd_buffer buf; unsigned int nr; int ret; - ucmd.ictx = filp->private_data; + nr = _IOC_NR(cmd); + if (nr < IOMMUFD_CMD_BASE || + (nr - IOMMUFD_CMD_BASE) >= ARRAY_SIZE(iommufd_ioctl_ops)) + return iommufd_vfio_ioctl(ictx, cmd, arg); + + ucmd.ictx = ictx; ucmd.ubuffer = (void __user *)arg; ret = get_user(ucmd.user_size, (u32 __user *)ucmd.ubuffer); if (ret) return ret; - nr = _IOC_NR(cmd); - if (nr < IOMMUFD_CMD_BASE || - (nr - IOMMUFD_CMD_BASE) >= ARRAY_SIZE(iommufd_ioctl_ops)) - return -ENOIOCTLCMD; op = &iommufd_ioctl_ops[nr - IOMMUFD_CMD_BASE]; if (op->ioctl_num != cmd) return -ENOIOCTLCMD; diff --git a/drivers/iommu/iommufd/vfio_compat.c b/drivers/iommu/iommufd/vfio_compat.c new file mode 100644 index 00000000000000..5c996bc9b44d48 --- /dev/null +++ b/drivers/iommu/iommufd/vfio_compat.c @@ -0,0 +1,401 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES + */ +#include <linux/file.h> +#include <linux/interval_tree.h> +#include <linux/iommu.h> +#include <linux/iommufd.h> 
+#include <linux/slab.h> +#include <linux/vfio.h> +#include <uapi/linux/vfio.h> +#include <uapi/linux/iommufd.h> + +#include "iommufd_private.h" + +static struct iommufd_ioas *get_compat_ioas(struct iommufd_ctx *ictx) +{ + struct iommufd_ioas *ioas = ERR_PTR(-ENODEV); + + xa_lock(&ictx->objects); + if (!ictx->vfio_ioas || !iommufd_lock_obj(&ictx->vfio_ioas->obj)) + goto out_unlock; + ioas = ictx->vfio_ioas; +out_unlock: + xa_unlock(&ictx->objects); + return ioas; +} + +/* + * Only attaching a group should cause a default creation of the internal ioas, + * this returns the existing ioas if it has already been assigned somehow + * FIXME: maybe_unused + */ +static __maybe_unused struct iommufd_ioas * +create_compat_ioas(struct iommufd_ctx *ictx) +{ + struct iommufd_ioas *ioas = NULL; + struct iommufd_ioas *out_ioas; + + ioas = iommufd_ioas_alloc(ictx); + if (IS_ERR(ioas)) + return ioas; + + xa_lock(&ictx->objects); + if (ictx->vfio_ioas && iommufd_lock_obj(&ictx->vfio_ioas->obj)) + out_ioas = ictx->vfio_ioas; + else + out_ioas = ioas; + xa_unlock(&ictx->objects); + + if (out_ioas != ioas) { + iommufd_object_abort(ictx, &ioas->obj); + return out_ioas; + } + if (!iommufd_lock_obj(&ioas->obj)) + WARN_ON(true); + iommufd_object_finalize(ictx, &ioas->obj); + return ioas; +} + +int iommufd_vfio_ioas(struct iommufd_ucmd *ucmd) +{ + struct iommu_vfio_ioas *cmd = ucmd->cmd; + struct iommufd_ioas *ioas; + + if (cmd->__reserved) + return -EOPNOTSUPP; + switch (cmd->op) { + case IOMMU_VFIO_IOAS_GET: + ioas = get_compat_ioas(ucmd->ictx); + if (IS_ERR(ioas)) + return PTR_ERR(ioas); + cmd->ioas_id = ioas->obj.id; + iommufd_put_object(&ioas->obj); + return iommufd_ucmd_respond(ucmd, sizeof(*cmd)); + + case IOMMU_VFIO_IOAS_SET: + ioas = iommufd_get_ioas(ucmd, cmd->ioas_id); + if (IS_ERR(ioas)) + return PTR_ERR(ioas); + xa_lock(&ucmd->ictx->objects); + ucmd->ictx->vfio_ioas = ioas; + xa_unlock(&ucmd->ictx->objects); + iommufd_put_object(&ioas->obj); + return 0; + + case 
IOMMU_VFIO_IOAS_CLEAR: + xa_lock(&ucmd->ictx->objects); + ucmd->ictx->vfio_ioas = NULL; + xa_unlock(&ucmd->ictx->objects); + return 0; + default: + return -EOPNOTSUPP; + } +} + +static int iommufd_vfio_map_dma(struct iommufd_ctx *ictx, unsigned int cmd, + void __user *arg) +{ + u32 supported_flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE; + size_t minsz = offsetofend(struct vfio_iommu_type1_dma_map, size); + struct vfio_iommu_type1_dma_map map; + int iommu_prot = IOMMU_CACHE; + struct iommufd_ioas *ioas; + unsigned long iova; + int rc; + + if (copy_from_user(&map, arg, minsz)) + return -EFAULT; + + if (map.argsz < minsz || map.flags & ~supported_flags) + return -EINVAL; + + if (map.flags & VFIO_DMA_MAP_FLAG_READ) + iommu_prot |= IOMMU_READ; + if (map.flags & VFIO_DMA_MAP_FLAG_WRITE) + iommu_prot |= IOMMU_WRITE; + + ioas = get_compat_ioas(ictx); + if (IS_ERR(ioas)) + return PTR_ERR(ioas); + + iova = map.iova; + rc = iopt_map_user_pages(&ioas->iopt, &iova, + u64_to_user_ptr(map.vaddr), map.size, + iommu_prot, 0); + iommufd_put_object(&ioas->obj); + return rc; +} + +static int iommufd_vfio_unmap_dma(struct iommufd_ctx *ictx, unsigned int cmd, + void __user *arg) +{ + size_t minsz = offsetofend(struct vfio_iommu_type1_dma_unmap, size); + u32 supported_flags = VFIO_DMA_UNMAP_FLAG_ALL; + struct vfio_iommu_type1_dma_unmap unmap; + struct iommufd_ioas *ioas; + int rc; + + if (copy_from_user(&unmap, arg, minsz)) + return -EFAULT; + + if (unmap.argsz < minsz || unmap.flags & ~supported_flags) + return -EINVAL; + + ioas = get_compat_ioas(ictx); + if (IS_ERR(ioas)) + return PTR_ERR(ioas); + + if (unmap.flags & VFIO_DMA_UNMAP_FLAG_ALL) + rc = iopt_unmap_all(&ioas->iopt); + else + rc = iopt_unmap_iova(&ioas->iopt, unmap.iova, unmap.size); + iommufd_put_object(&ioas->obj); + return rc; +} + +static int iommufd_vfio_check_extension(unsigned long type) +{ + switch (type) { + case VFIO_TYPE1v2_IOMMU: + case VFIO_UNMAP_ALL: + return 1; + /* + * FIXME: The type1 iommu allows 
splitting of maps, which can fail. This is doable but + * is a bunch of extra code that is only for supporting this case. + */ + case VFIO_TYPE1_IOMMU: + /* + * FIXME: No idea what VFIO_TYPE1_NESTING_IOMMU does as far as the uAPI + * is concerned. Seems like it was never completed, it only does + * something on ARM, but I can't figure out what or how to use it. Can't + * find any user implementation either. + */ + case VFIO_TYPE1_NESTING_IOMMU: + /* + * FIXME: Easy to support, but needs rework in the Intel iommu driver + * to expose the no snoop squashing to iommufd + */ + case VFIO_DMA_CC_IOMMU: + /* + * FIXME: VFIO_DMA_MAP_FLAG_VADDR + * https://lore.kernel.org/kvm/1611939252-7240-1-git-send-email-steven.sistare@oracle.com/ + * Wow, what a wild feature. This should have been implemented by + * allowing an iopt_pages to be associated with a memfd. It can then + * source mapping requests directly from a memfd without going through a + * mm_struct and thus doesn't care that the original qemu exec'd itself. + * The idea that userspace can flip a flag and cause kernel users to + * block indefinitely is unacceptable. + * + * For VFIO compat we should implement this in a slightly different way: + * creating an access_user that spans the whole area will immediately + * stop new faults as they will be handled from the xarray. We can then + * reparent the iopt_pages to the new mm_struct and undo the + * access_user. No blockage of kernel users is required, though it does + * require filling the xarray with pages. + */ + case VFIO_UPDATE_VADDR: + default: + return 0; + } + + /* FIXME: VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP I think everything with dirty + * tracking should be in its own ioctl, not muddled in unmap. If we want to + * atomically unmap and get the dirty bitmap it should be a flag in the dirty + * tracking ioctl, not here in unmap. Overall dirty tracking needs a careful + * review alongside HW drivers implementing it. 
+ */ +} + +static int iommufd_vfio_set_iommu(struct iommufd_ctx *ictx, unsigned long type) +{ + struct iommufd_ioas *ioas = NULL; + + if (type != VFIO_TYPE1v2_IOMMU) + return -EINVAL; + + /* VFIO fails the set_iommu if there is no group */ + ioas = get_compat_ioas(ictx); + if (IS_ERR(ioas)) + return PTR_ERR(ioas); + iommufd_put_object(&ioas->obj); + return 0; +} + +static u64 iommufd_get_pagesizes(struct iommufd_ioas *ioas) +{ + /* FIXME: See vfio_update_pgsize_bitmap(), for compat this should return + * the high bits too, and we need to decide if we should report that + * iommufd supports less than PAGE_SIZE alignment or stick to strict + * compatibility. qemu only cares about the first set bit. + */ + return ioas->iopt.iova_alignment; +} + +static int iommufd_fill_cap_iova(struct iommufd_ioas *ioas, + struct vfio_info_cap_header __user *cur, + size_t avail) +{ + struct vfio_iommu_type1_info_cap_iova_range __user *ucap_iovas = + container_of(cur, struct vfio_iommu_type1_info_cap_iova_range, + header); + struct vfio_iommu_type1_info_cap_iova_range cap_iovas = { + .header = { + .id = VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE, + .version = 1, + }, + }; + struct interval_tree_span_iter span; + + for (interval_tree_span_iter_first( + &span, &ioas->iopt.reserved_iova_itree, 0, ULONG_MAX); + !interval_tree_span_iter_done(&span); + interval_tree_span_iter_next(&span)) { + struct vfio_iova_range range; + + if (!span.is_hole) + continue; + range.start = span.start_hole; + range.end = span.last_hole; + if (avail >= struct_size(&cap_iovas, iova_ranges, + cap_iovas.nr_iovas + 1) && + copy_to_user(&ucap_iovas->iova_ranges[cap_iovas.nr_iovas], + &range, sizeof(range))) + return -EFAULT; + cap_iovas.nr_iovas++; + } + if (avail >= struct_size(&cap_iovas, iova_ranges, cap_iovas.nr_iovas) && + copy_to_user(ucap_iovas, &cap_iovas, sizeof(cap_iovas))) + return -EFAULT; + return struct_size(&cap_iovas, iova_ranges, cap_iovas.nr_iovas); +} + +static int iommufd_fill_cap_dma_avail(struct 
iommufd_ioas *ioas, + struct vfio_info_cap_header __user *cur, + size_t avail) +{ + struct vfio_iommu_type1_info_dma_avail cap_dma = { + .header = { + .id = VFIO_IOMMU_TYPE1_INFO_DMA_AVAIL, + .version = 1, + }, + /* iommufd has no limit, return the same value as VFIO. */ + .avail = U16_MAX, + }; + + if (avail >= sizeof(cap_dma) && + copy_to_user(cur, &cap_dma, sizeof(cap_dma))) + return -EFAULT; + return sizeof(cap_dma); +} + +static int iommufd_vfio_iommu_get_info(struct iommufd_ctx *ictx, + void __user *arg) +{ + typedef int (*fill_cap_fn)(struct iommufd_ioas *ioas, + struct vfio_info_cap_header __user *cur, + size_t avail); + static const fill_cap_fn fill_fns[] = { + iommufd_fill_cap_iova, + iommufd_fill_cap_dma_avail, + }; + size_t minsz = offsetofend(struct vfio_iommu_type1_info, iova_pgsizes); + struct vfio_info_cap_header __user *last_cap = NULL; + struct vfio_iommu_type1_info info; + struct iommufd_ioas *ioas; + size_t total_cap_size; + int rc; + int i; + + if (copy_from_user(&info, arg, minsz)) + return -EFAULT; + + if (info.argsz < minsz) + return -EINVAL; + minsz = min_t(size_t, info.argsz, sizeof(info)); + + ioas = get_compat_ioas(ictx); + if (IS_ERR(ioas)) + return PTR_ERR(ioas); + + down_read(&ioas->iopt.iova_rwsem); + info.flags = VFIO_IOMMU_INFO_PGSIZES; + info.iova_pgsizes = iommufd_get_pagesizes(ioas); + info.cap_offset = 0; + + total_cap_size = sizeof(info); + for (i = 0; i != ARRAY_SIZE(fill_fns); i++) { + int cap_size; + + if (info.argsz > total_cap_size) + cap_size = fill_fns[i](ioas, arg + total_cap_size, + info.argsz - total_cap_size); + else + cap_size = fill_fns[i](ioas, NULL, 0); + if (cap_size < 0) { + rc = cap_size; + goto out_put; + } + if (last_cap && info.argsz >= total_cap_size && + put_user(total_cap_size, &last_cap->next)) { + rc = -EFAULT; + goto out_put; + } + last_cap = arg + total_cap_size; + total_cap_size += cap_size; + } + + /* + * If the user did not provide enough space then only some caps are + * returned and the argsz 
will be updated to the correct amount to get + * all caps. + */ + if (info.argsz >= total_cap_size) + info.cap_offset = sizeof(info); + info.argsz = total_cap_size; + info.flags |= VFIO_IOMMU_INFO_CAPS; + if (copy_to_user(arg, &info, minsz)) { + rc = -EFAULT; + goto out_put; + } + rc = 0; + +out_put: + up_read(&ioas->iopt.iova_rwsem); + iommufd_put_object(&ioas->obj); + return rc; +} + +/* FIXME TODO: +PowerPC SPAPR only: +#define VFIO_IOMMU_ENABLE _IO(VFIO_TYPE, VFIO_BASE + 15) +#define VFIO_IOMMU_DISABLE _IO(VFIO_TYPE, VFIO_BASE + 16) +#define VFIO_IOMMU_SPAPR_TCE_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12) +#define VFIO_IOMMU_SPAPR_REGISTER_MEMORY _IO(VFIO_TYPE, VFIO_BASE + 17) +#define VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY _IO(VFIO_TYPE, VFIO_BASE + 18) +#define VFIO_IOMMU_SPAPR_TCE_CREATE _IO(VFIO_TYPE, VFIO_BASE + 19) +#define VFIO_IOMMU_SPAPR_TCE_REMOVE _IO(VFIO_TYPE, VFIO_BASE + 20) +*/ + +int iommufd_vfio_ioctl(struct iommufd_ctx *ictx, unsigned int cmd, + unsigned long arg) +{ + void __user *uarg = (void __user *)arg; + + switch (cmd) { + case VFIO_GET_API_VERSION: + return VFIO_API_VERSION; + case VFIO_SET_IOMMU: + return iommufd_vfio_set_iommu(ictx, arg); + case VFIO_CHECK_EXTENSION: + return iommufd_vfio_check_extension(arg); + case VFIO_IOMMU_GET_INFO: + return iommufd_vfio_iommu_get_info(ictx, uarg); + case VFIO_IOMMU_MAP_DMA: + return iommufd_vfio_map_dma(ictx, cmd, uarg); + case VFIO_IOMMU_UNMAP_DMA: + return iommufd_vfio_unmap_dma(ictx, cmd, uarg); + case VFIO_IOMMU_DIRTY_PAGES: + default: + return -ENOIOCTLCMD; + } + return -ENOIOCTLCMD; +} diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h index ba7b17ec3002e3..2c0f5ced417335 100644 --- a/include/uapi/linux/iommufd.h +++ b/include/uapi/linux/iommufd.h @@ -42,6 +42,7 @@ enum { IOMMUFD_CMD_IOAS_MAP, IOMMUFD_CMD_IOAS_COPY, IOMMUFD_CMD_IOAS_UNMAP, + IOMMUFD_CMD_VFIO_IOAS, }; /** @@ -184,4 +185,39 @@ struct iommu_ioas_unmap { __aligned_u64 length; }; #define IOMMU_IOAS_UNMAP _IO(IOMMUFD_TYPE, 
IOMMUFD_CMD_IOAS_UNMAP) + +/** + * enum iommufd_vfio_ioas_op + * @IOMMU_VFIO_IOAS_GET: Get the current compatibility IOAS + * @IOMMU_VFIO_IOAS_SET: Change the current compatibility IOAS + * @IOMMU_VFIO_IOAS_CLEAR: Disable VFIO compatibility + */ +enum iommufd_vfio_ioas_op { + IOMMU_VFIO_IOAS_GET = 0, + IOMMU_VFIO_IOAS_SET = 1, + IOMMU_VFIO_IOAS_CLEAR = 2, +}; + +/** + * struct iommu_vfio_ioas - ioctl(IOMMU_VFIO_IOAS) + * @size: sizeof(struct iommu_vfio_ioas) + * @ioas_id: For IOMMU_VFIO_IOAS_SET the input IOAS ID to set + * For IOMMU_VFIO_IOAS_GET will output the IOAS ID + * @op: One of enum iommufd_vfio_ioas_op + * @__reserved: Must be 0 + * + * The VFIO compatibility support uses a single ioas because VFIO APIs do not + * support the ID field. Set or Get the IOAS that VFIO compatibility will use. + * When VFIO_GROUP_SET_CONTAINER is used on an iommufd it will get the + * compatibility ioas, either by taking what is already set, or auto creating + * one. From then on VFIO will continue to use that ioas and is not affected by + * this ioctl. SET or CLEAR does not destroy any auto-created IOAS. + */ +struct iommu_vfio_ioas { + __u32 size; + __u32 ioas_id; + __u16 op; + __u16 __reserved; +}; +#define IOMMU_VFIO_IOAS _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VFIO_IOAS) #endif