
[RFC,11/12] iommufd: vfio container FD ioctl compatibility

Message ID 11-v1-e79cd8d168e8+6-iommufd_jgg@nvidia.com (mailing list archive)
State New, archived
Series IOMMUFD Generic interface

Commit Message

Jason Gunthorpe March 18, 2022, 5:27 p.m. UTC
iommufd can directly implement the /dev/vfio/vfio container IOCTLs by
mapping them into io_pagetable operations. Doing so allows the use of
iommufd by symlinking /dev/vfio/vfio to /dev/iommufd. Allowing VFIO to
SET_CONTAINER using an iommufd instead of a container fd is a followup
series.

Internally the compatibility API uses a normal IOAS object that, like
vfio, is automatically allocated when the first device is
attached.

Userspace can also query or set this IOAS object directly using the
IOMMU_VFIO_IOAS ioctl. This allows mixing and matching new iommufd-only
features while still using the VFIO-style map/unmap ioctls.
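
As a rough userspace sketch of that mixing (the ioctl name is from this
series, but the struct layout, field names and op values shown here are
assumptions that may differ from the final uapi):

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>

int main(void)
{
        int fd = open("/dev/iommufd", O_RDWR);
        /* Assumed layout: { size, ioas_id, op } with GET/SET/CLEAR ops */
        struct iommu_vfio_ioas cmd = {
                .size = sizeof(cmd),
                .op = IOMMU_VFIO_IOAS_GET,
        };

        /* Per the text above, the compat IOAS only exists once the first
         * device has been attached, so a real flow attaches a device first. */
        if (fd < 0 || ioctl(fd, IOMMU_VFIO_IOAS, &cmd))
                return 1;

        /* cmd.ioas_id is the IOAS the VFIO-style MAP_DMA/UNMAP_DMA ioctls
         * operate on; the same id can be passed to native iommufd ioctls
         * to mix the two models. */
        printf("compat IOAS id: %u\n", cmd.ioas_id);
        return 0;
}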

While this is enough to operate qemu, it is still a bit of a WIP with a
few gaps to be resolved:

 - Only the TYPE1v2 mode is supported where unmap cannot punch holes or
   split areas. The old mode can be implemented with a new operation to
   split an iopt_area into two without disturbing the iopt_pages or the
   domains, then unmapping a whole area as normal.

 - Resource limits rely on memory cgroups to bound what userspace can do
   instead of the module parameter dma_entry_limit.

 - VFIO P2P is not implemented. Avoiding the follow_pfn() mis-design will
   require some additional work to properly expose the PFN lifecycle between
   VFIO and iommufd.

 - Various components of the mdev API are not completed yet

 - Indefinite suspend of SW access (VFIO_DMA_MAP_FLAG_VADDR) is not
   implemented.

 - The 'dirty tracking' is not implemented

 - A full audit for pedantic compatibility details (eg errnos, etc) has
   not yet been done

 - powerpc SPAPR is left out, as it is not connected to the iommu_domain
   framework. My hope is that SPAPR will be moved into the iommu_domain
   framework as a special HW-specific type, with power then supporting
   the generic interface through a normal iommu_domain.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/iommufd/Makefile          |   3 +-
 drivers/iommu/iommufd/iommufd_private.h |   6 +
 drivers/iommu/iommufd/main.c            |  16 +-
 drivers/iommu/iommufd/vfio_compat.c     | 401 ++++++++++++++++++++++++
 include/uapi/linux/iommufd.h            |  36 +++
 5 files changed, 456 insertions(+), 6 deletions(-)
 create mode 100644 drivers/iommu/iommufd/vfio_compat.c

Comments

Alex Williamson March 23, 2022, 10:51 p.m. UTC | #1
On Fri, 18 Mar 2022 14:27:36 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> iommufd can directly implement the /dev/vfio/vfio container IOCTLs by
> mapping them into io_pagetable operations. Doing so allows the use of
> iommufd by symliking /dev/vfio/vfio to /dev/iommufd. Allowing VFIO to
> SET_CONTAINER using a iommufd instead of a container fd is a followup
> series.
> 
> Internally the compatibility API uses a normal IOAS object that, like
> vfio, is automatically allocated when the first device is
> attached.
> 
> Userspace can also query or set this IOAS object directly using the
> IOMMU_VFIO_IOAS ioctl. This allows mixing and matching new iommufd only
> features while still using the VFIO style map/unmap ioctls.
> 
> While this is enough to operate qemu, it is still a bit of a WIP with a
> few gaps to be resolved:
> 
>  - Only the TYPE1v2 mode is supported where unmap cannot punch holes or
>    split areas. The old mode can be implemented with a new operation to
>    split an iopt_area into two without disturbing the iopt_pages or the
>    domains, then unmapping a whole area as normal.
> 
>  - Resource limits rely on memory cgroups to bound what userspace can do
>    instead of the module parameter dma_entry_limit.
> 
>  - VFIO P2P is not implemented. Avoiding the follow_pfn() mis-design will
>    require some additional work to properly expose PFN lifecycle between
>    VFIO and iommfd
> 
>  - Various components of the mdev API are not completed yet
>
>  - Indefinite suspend of SW access (VFIO_DMA_MAP_FLAG_VADDR) is not
>    implemented.
> 
>  - The 'dirty tracking' is not implemented
> 
>  - A full audit for pedantic compatibility details (eg errnos, etc) has
>    not yet been done
> 
>  - powerpc SPAPR is left out, as it is not connected to the iommu_domain
>    framework. My hope is that SPAPR will be moved into the iommu_domain
>    framework as a special HW specific type and would expect power to
>    support the generic interface through a normal iommu_domain.

My overall question here would be whether we can actually achieve a
compatibility interface that has sufficient feature transparency that we
can dump vfio code in favor of this interface, or will there be enough
niche use cases that we need to keep type1 and vfio containers around
through a deprecation process?

The locked memory differences, for one, seem like something that libvirt
wouldn't want hidden, and we have questions regarding support for vaddr
hijacking and different ideas about how to implement dirty page tracking,
not to mention the missing features that are currently well used, like p2p
mappings, coherency tracking, mdev, etc.

It seems like quite an endeavor to fill all these gaps, while at the
same time QEMU will be working to move to use iommufd directly in order
to gain all the new features.

Where do we focus attention?  Is symlinking device files our proposal
to userspace and is that something achievable, or do we want to use
this compatibility interface as a means to test the interface and
allow userspace to make use of it for transition, if their use cases
allow it, perhaps eventually performing the symlink after deprecation
and eventual removal of the vfio container and type1 code?  Thanks,

Alex
Jason Gunthorpe March 24, 2022, 12:33 a.m. UTC | #2
On Wed, Mar 23, 2022 at 04:51:25PM -0600, Alex Williamson wrote:

> My overall question here would be whether we can actually achieve a
> compatibility interface that has sufficient feature transparency that we
> can dump vfio code in favor of this interface, or will there be enough
> niche use cases that we need to keep type1 and vfio containers around
> through a deprecation process?

Other than SPAPR, I think we can.

> The locked memory differences for one seem like something that
> libvirt wouldn't want hidden

I'm first interested in understanding how this change becomes a real
problem in practice that requires libvirt to do something different for
vfio or iommufd. We can discuss that in the other thread.

If this is the make or break point then I think we can deal with it
either by going back to what vfio does now or perhaps some other
friendly compat approach..

> and we have questions regarding support for vaddr hijacking

I'm not sure what vaddr hijacking is? Do you mean
VFIO_DMA_MAP_FLAG_VADDR ? There is a comment that outlines my plan to
implement it in a functionally compatible way without the deadlock
problem. I estimate this as a small project.

> and different ideas how to implement dirty page tracking, 

I don't think this is compatibility. No kernel today triggers qemu to
use this feature as no kernel supports live migration. No existing
qemu will trigger this feature with new kernels that support live
migration v2. Therefore we can adjust qemu's dirty tracking at the
same time we enable migration v2 in qemu.

With Joao's work we are close to having a solid RFC to come with
something that can be fully implemented.

Hopefully we can agree to this soon enough that qemu can come with a
full package of migration v2 support including the dirty tracking
solution.

> not to mention the missing features that are currently well used,
> like p2p mappings, coherency tracking, mdev, etc.

I consider these all mandatory things, they won't be left out.

The reason they are not in the RFC is mostly because supporting them
requires work outside just this iommufd area, and I'd like this series
to remain self-contained.

I've already got a draft to add DMABUF support to VFIO PCI, which
nicely solves the follow_pfn security problem; we want to do this for
another reason already. I'm waiting for some testing feedback before
posting it. I need some help from Daniel to make the DMABUF revoke
semantics he and I have been talking about. In the worst case we can
copy the follow_pfn approach.

Intel no-snoop is simple enough, just needs some Intel cleanup parts.

mdev will come along with the final VFIO integration, all the really
hard parts are done already. The VFIO integration is a medium sized
task overall.

So, I'm not ready to give up yet :)

> Where do we focus attention?  Is symlinking device files our proposal
> to userspace and is that something achievable, or do we want to use
> this compatibility interface as a means to test the interface and
> allow userspace to make use of it for transition, if their use cases
> allow it, perhaps eventually performing the symlink after deprecation
> and eventual removal of the vfio container and type1 code?  Thanks,

symlinking device files is definitely just a suggested way to expedite
testing.

Things like qemu that are learning to use iommufd-only features should
learn to directly open iommufd instead of vfio container to activate
those features.

Looking long down the road I don't think we want to have type 1 and
iommufd code forever. So, I would like to make an option to compile
out vfio container support entirely and have that option arrange for
iommufd to provide the container device node itself.

I think we can get there pretty quickly, or at least I haven't got
anything that is scaring me a lot (beyond SPAPR of course).

For the dpdk/etc. of the world I think we are already there.

Jason
Eric Auger March 24, 2022, 8:13 a.m. UTC | #3
Hi,

On 3/24/22 1:33 AM, Jason Gunthorpe wrote:
> On Wed, Mar 23, 2022 at 04:51:25PM -0600, Alex Williamson wrote:
>
>> My overall question here would be whether we can actually achieve a
>> compatibility interface that has sufficient feature transparency that we
>> can dump vfio code in favor of this interface, or will there be enough
>> niche use cases that we need to keep type1 and vfio containers around
>> through a deprecation process?
> Other than SPAPR, I think we can.
>
>> The locked memory differences for one seem like something that
>> libvirt wouldn't want hidden
> I'm first interested to have an understanding how this change becomes
> a real problem in practice that requires libvirt to do something
> different for vfio or iommufd. We can discuss in the other thread
>
> If this is the make or break point then I think we can deal with it
> either by going back to what vfio does now or perhaps some other
> friendly compat approach..
>
>> and we have questions regarding support for vaddr hijacking
> I'm not sure what vaddr hijacking is? Do you mean
> VFIO_DMA_MAP_FLAG_VADDR ? There is a comment that outlines my plan to
> implement it in a functionally compatible way without the deadlock
> problem. I estimate this as a small project.
>
>> and different ideas how to implement dirty page tracking, 
> I don't think this is compatibility. No kernel today triggers qemu to
> use this feature as no kernel supports live migration. No existing
> qemu will trigger this feature with new kernels that support live
> migration v2. Therefore we can adjust qemu's dirty tracking at the
> same time we enable migration v2 in qemu.
>
> With Joao's work we are close to having a solid RFC to come with
> something that can be fully implemented.
>
> Hopefully we can agree to this soon enough that qemu can come with a
> full package of migration v2 support including the dirty tracking
> solution.
>
>> not to mention the missing features that are currently well used,
>> like p2p mappings, coherency tracking, mdev, etc.
> I consider these all mandatory things, they won't be left out.
>
> The reason they are not in the RFC is mostly because supporting them
> requires work outside just this iommufd area, and I'd like this series
> to remain self-contained.
>
> I've already got a draft to add DMABUF support to VFIO PCI which
> nicely solves the follow_pfn security problem, we want to do this for
> another reason already. I'm waiting for some testing feedback before
> posting it. Need some help from Daniel make the DMABUF revoke semantic
> him and I have been talking about. In the worst case can copy the
> follow_pfn approach.
>
> Intel no-snoop is simple enough, just needs some Intel cleanup parts.
>
> mdev will come along with the final VFIO integration, all the really
> hard parts are done already. The VFIO integration is a medium sized
> task overall.
>
> So, I'm not ready to give up yet :)
>
>> Where do we focus attention?  Is symlinking device files our proposal
>> to userspace and is that something achievable, or do we want to use
>> this compatibility interface as a means to test the interface and
>> allow userspace to make use of it for transition, if their use cases
>> allow it, perhaps eventually performing the symlink after deprecation
>> and eventual removal of the vfio container and type1 code?  Thanks,
> symlinking device files is definitely just a suggested way to expedite
> testing.
>
> Things like qemu that are learning to use iommufd-only features should
> learn to directly open iommufd instead of vfio container to activate
> those features.
>
> Looking long down the road I don't think we want to have type 1 and
> iommufd code forever. So, I would like to make an option to compile
> out vfio container support entirely and have that option arrange for
> iommufd to provide the container device node itself.
I am currently working on migrating the QEMU VFIO device onto the new
API because, following our discussions, the compat mode cannot be used
to implement nesting anyway. I hope I will be able to present
something next week.

Thanks

Eric
>
> I think we can get there pretty quickly, or at least I haven't got
> anything that is scaring me alot (beyond SPAPR of course)
>
> For the dpdk/etcs of the world I think we are already there.
>
> Jason
>
Alex Williamson March 24, 2022, 10:04 p.m. UTC | #4
On Wed, 23 Mar 2022 21:33:42 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Mar 23, 2022 at 04:51:25PM -0600, Alex Williamson wrote:
> 
> > My overall question here would be whether we can actually achieve a
> > compatibility interface that has sufficient feature transparency that we
> > can dump vfio code in favor of this interface, or will there be enough
> > niche use cases that we need to keep type1 and vfio containers around
> > through a deprecation process?  
> 
> Other than SPAPR, I think we can.

Does this mean #ifdef CONFIG_PPC in vfio core to retain infrastructure
for POWER support?

> > The locked memory differences for one seem like something that
> > libvirt wouldn't want hidden  
> 
> I'm first interested to have an understanding how this change becomes
> a real problem in practice that requires libvirt to do something
> different for vfio or iommufd. We can discuss in the other thread
> 
> If this is the make or break point then I think we can deal with it
> either by going back to what vfio does now or perhaps some other
> friendly compat approach..
> 
> > and we have questions regarding support for vaddr hijacking  
> 
> I'm not sure what vaddr hijacking is? Do you mean
> VFIO_DMA_MAP_FLAG_VADDR ? There is a comment that outlines my plan to
> implement it in a functionally compatible way without the deadlock
> problem. I estimate this as a small project.
> 
> > and different ideas how to implement dirty page tracking,   
> 
> I don't think this is compatibility. No kernel today triggers qemu to
> use this feature as no kernel supports live migration. No existing
> qemu will trigger this feature with new kernels that support live
> migration v2. Therefore we can adjust qemu's dirty tracking at the
> same time we enable migration v2 in qemu.

I guess I was assuming that enabling v2 migration in QEMU was dependent
on the existing type1 dirty tracking because it's the only means we
have to tell QEMU that all memory is perpetually dirty when we have a
DMA device.  Is that not correct?  If we don't intend to carry type1
dirty tracking into iommufd compatibility and we need it for this
purpose, then our window for being able to rip it out entirely closes
when QEMU gains v2 migration support.

> With Joao's work we are close to having a solid RFC to come with
> something that can be fully implemented.
> 
> Hopefully we can agree to this soon enough that qemu can come with a
> full package of migration v2 support including the dirty tracking
> solution.
> 
> > not to mention the missing features that are currently well used,
> > like p2p mappings, coherency tracking, mdev, etc.  
> 
> I consider these all mandatory things, they won't be left out.
> 
> The reason they are not in the RFC is mostly because supporting them
> requires work outside just this iommufd area, and I'd like this series
> to remain self-contained.
> 
> I've already got a draft to add DMABUF support to VFIO PCI which
> nicely solves the follow_pfn security problem, we want to do this for
> another reason already. I'm waiting for some testing feedback before
> posting it. Need some help from Daniel make the DMABUF revoke semantic
> him and I have been talking about. In the worst case can copy the
> follow_pfn approach.
> 
> Intel no-snoop is simple enough, just needs some Intel cleanup parts.
> 
> mdev will come along with the final VFIO integration, all the really
> hard parts are done already. The VFIO integration is a medium sized
> task overall.
> 
> So, I'm not ready to give up yet :)

Ok, that's a more promising outlook than I was inferring from the long
list of missing features.

> > Where do we focus attention?  Is symlinking device files our proposal
> > to userspace and is that something achievable, or do we want to use
> > this compatibility interface as a means to test the interface and
> > allow userspace to make use of it for transition, if their use cases
> > allow it, perhaps eventually performing the symlink after deprecation
> > and eventual removal of the vfio container and type1 code?  Thanks,  
> 
> symlinking device files is definitely just a suggested way to expedite
> testing.
> 
> Things like qemu that are learning to use iommufd-only features should
> learn to directly open iommufd instead of vfio container to activate
> those features.

Which is kind of the basis for my question: QEMU is racing for native
support (Eric and Yi are already working on this), so some of these
compatibility interfaces might only have short-term usefulness.

> Looking long down the road I don't think we want to have type 1 and
> iommufd code forever.

Agreed.

> So, I would like to make an option to compile
> out vfio container support entirely and have that option arrange for
> iommufd to provide the container device node itself.
> 
> I think we can get there pretty quickly, or at least I haven't got
> anything that is scaring me alot (beyond SPAPR of course)
> 
> For the dpdk/etcs of the world I think we are already there.

That's essentially what I'm trying to reconcile: we're racing both
to round out the compatibility interface to fully support QEMU, while
also updating QEMU to use iommufd directly so it won't need that full
support.  It's a confusing message.  Thanks,

Alex
Jason Gunthorpe March 24, 2022, 11:11 p.m. UTC | #5
On Thu, Mar 24, 2022 at 04:04:03PM -0600, Alex Williamson wrote:
> On Wed, 23 Mar 2022 21:33:42 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Wed, Mar 23, 2022 at 04:51:25PM -0600, Alex Williamson wrote:
> > 
> > > My overall question here would be whether we can actually achieve a
> > > compatibility interface that has sufficient feature transparency that we
> > > can dump vfio code in favor of this interface, or will there be enough
> > > niche use cases that we need to keep type1 and vfio containers around
> > > through a deprecation process?  
> > 
> > Other than SPAPR, I think we can.
> 
> Does this mean #ifdef CONFIG_PPC in vfio core to retain infrastructure
> for POWER support?

Certainly initially - I have no ability to do better than that.

I'm hoping someone from IBM will be willing to work on this in the
long run and we can do better.

> > I don't think this is compatibility. No kernel today triggers qemu to
> > use this feature as no kernel supports live migration. No existing
> > qemu will trigger this feature with new kernels that support live
> > migration v2. Therefore we can adjust qemu's dirty tracking at the
> > same time we enable migration v2 in qemu.
> 
> I guess I was assuming that enabling v2 migration in QEMU was dependent
> on the existing type1 dirty tracking because it's the only means we
> have to tell QEMU that all memory is perpetually dirty when we have a
> DMA device.  Is that not correct?

I haven't looked closely at this part in qemu, but IMHO, if qemu sees
that it has VFIO migration support but does not have any DMA dirty
tracking capability it should not do precopy flows.

If there is a bug here we should certainly fix it before progressing
the v2 patches. I'll ask Yishai & Co to take a look.

> > Intel no-snoop is simple enough, just needs some Intel cleanup parts.

Patches for this exist now
 
> > mdev will come along with the final VFIO integration, all the really
> > hard parts are done already. The VFIO integration is a medium sized
> > task overall.
> > 
> > So, I'm not ready to give up yet :)
> 
> Ok, that's a more promising outlook than I was inferring from the long
> list of missing features.

Yeah, it is just long, but they are not scary things, just priorities
and patch planning.

> > I think we can get there pretty quickly, or at least I haven't got
> > anything that is scaring me alot (beyond SPAPR of course)
> > 
> > For the dpdk/etcs of the world I think we are already there.
> 
> That's essentially what I'm trying to reconcile, we're racing both
> to round out the compatibility interface to fully support QEMU, while
> also updating QEMU to use iommufd directly so it won't need that full
> support.  It's a confusing message.  Thanks,

The long term purpose of compatibility is to provide a config option
to allow type 1 to be turned off and continue to support old user
space (eg in containers) that is running old qemu/dpdk/spdk/etc.

This shows that we have a plan/path to allow a distro to support only
one iommu interface in their kernel should they choose without having
to sacrifice uABI compatibility.

As for racing, my intention is to leave the compat interface alone for
a while - the more urgent things on my personal list are the RFC
for dirty tracking, mlx5 support for dirty tracking, and VFIO preparation
for iommufd support.

Eric and Yi are focusing on userspace page tables and qemu updates.

Joao is working on implementing iommu driver dirty tracking

Lu and Jacob are working on getting PASID support infrastructure
together.

There is a lot going on!

A question to consider: what would you regard as the minimum bar for
merging?

Thanks,
Jason
Tian, Kevin March 25, 2022, 3:10 a.m. UTC | #6
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, March 25, 2022 7:12 AM
> 
> On Thu, Mar 24, 2022 at 04:04:03PM -0600, Alex Williamson wrote:
> > That's essentially what I'm trying to reconcile, we're racing both
> > to round out the compatibility interface to fully support QEMU, while
> > also updating QEMU to use iommufd directly so it won't need that full
> > support.  It's a confusing message.  Thanks,
> 
> The long term purpose of compatibility is to provide a config option
> to allow type 1 to be turned off and continue to support old user
> space (eg in containers) that is running old qemu/dpdk/spdk/etc.
> 
> This shows that we have a plan/path to allow a distro to support only
> one iommu interface in their kernel should they choose without having
> to sacrifice uABI compatibility.
> 
> As for racing, my intention is to leave the compat interface alone for
> awhile - the more urgent things in on my personal list are the RFC
> for dirty tracking, mlx5 support for dirty tracking, and VFIO preparation
> for iommufd support.
> 
> Eric and Yi are focusing on userspace page tables and qemu updates.
> 
> Joao is working on implementing iommu driver dirty tracking
> 
> Lu and Jacob are working on getting PASID support infrastructure
> together.
> 
> There is alot going on!
> 
> A question to consider is what would you consider the minimum bar for
> merging?
> 

My two cents. 
Joao Martins March 25, 2022, 11:24 a.m. UTC | #7
On 3/24/22 23:11, Jason Gunthorpe wrote:
> On Thu, Mar 24, 2022 at 04:04:03PM -0600, Alex Williamson wrote:
>> On Wed, 23 Mar 2022 21:33:42 -0300
>> Jason Gunthorpe <jgg@nvidia.com> wrote:
>>> On Wed, Mar 23, 2022 at 04:51:25PM -0600, Alex Williamson wrote:
>>> I don't think this is compatibility. No kernel today triggers qemu to
>>> use this feature as no kernel supports live migration. No existing
>>> qemu will trigger this feature with new kernels that support live
>>> migration v2. Therefore we can adjust qemu's dirty tracking at the
>>> same time we enable migration v2 in qemu.
>>
>> I guess I was assuming that enabling v2 migration in QEMU was dependent
>> on the existing type1 dirty tracking because it's the only means we
>> have to tell QEMU that all memory is perpetually dirty when we have a
>> DMA device.  Is that not correct?
> 
> I haven't looked closely at this part in qemu, but IMHO, if qemu sees
> that it has VFIO migration support but does not have any DMA dirty
> tracking capability it should not do precopy flows.
> 
> If there is a bug here we should certainly fix it before progressing
> the v2 patches. I'll ask Yishai & Co to take a look.

I think that's already the case.

With respect to VFIO IOMMU type1, the kernel always exports a migration
capability and the page sizes it supports. In the VMM, if that matches the
page size qemu is using (PAGE_SIZE on x86), qemu decides it will /use/ the
vfio container ioctls. Which, well, is effectively always when the ioctl is
there, considering we dirty every page.

In qemu, the start and stop of dirty tracking is actually unconditional (it
attempts to do it without checking if the capability is there), but when
syncing the dirties from vfio against qemu's private tracking it does check
whether dirty page tracking is supported before even trying the sync via the
ioctl. /Most importantly/, prior to all of this starting/stopping/syncing of
dirty tracking, qemu adds a live migration blocker if either the device
doesn't support migration or the VFIO container doesn't support it (so
migration won't even start). So, as far as my understanding goes, the VMM
knows how to deal with the lack of the dirty container ioctls.
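
To make that check concrete, here is a minimal sketch of how a VMM can probe
type1 dirty-tracking support through the VFIO_IOMMU_GET_INFO capability chain
(buffer sizing simplified to a fixed buffer):

#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int type1_has_dirty_tracking(int container_fd)
{
        char buf[4096];
        struct vfio_iommu_type1_info *info = (void *)buf;
        struct vfio_info_cap_header *hdr;

        memset(buf, 0, sizeof(buf));
        info->argsz = sizeof(buf);
        if (ioctl(container_fd, VFIO_IOMMU_GET_INFO, info))
                return 0;
        if (!(info->flags & VFIO_IOMMU_INFO_CAPS) || !info->cap_offset)
                return 0;

        /* Walk the capability chain; the migration capability advertises
         * the dirty bitmap page sizes mentioned above. */
        for (hdr = (void *)(buf + info->cap_offset); ;
             hdr = (void *)(buf + hdr->next)) {
                if (hdr->id == VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION)
                        return 1;
                if (!hdr->next)
                        return 0;
        }
}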

TBH, I am not overly concerned with dirty page tracking in the vfio-compat
layer -- I have been doing both in tandem (old and new). We mainly need to
decide what we want to maintain in the compat layer. I can drop the IOMMU
support code I have from vfio-compat, or we do the 'perpetual dirtying' that
the current code does, or we don't support the dirty ioctls in vfio-compat at
all. Maybe the latter makes more sense, as it might mimic more accurately what
hardware supports and deprive VMMs of even starting migration. The second
looks useful for testing, but dirtying all DMA-mapped memory seems to be too
much in a real-world migration scenario :( especially as the guest size
increases.
Yi Liu March 29, 2022, 9:17 a.m. UTC | #8
On 2022/3/24 06:51, Alex Williamson wrote:
> On Fri, 18 Mar 2022 14:27:36 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
>> iommufd can directly implement the /dev/vfio/vfio container IOCTLs by
>> mapping them into io_pagetable operations. Doing so allows the use of
>> iommufd by symliking /dev/vfio/vfio to /dev/iommufd. Allowing VFIO to
>> SET_CONTAINER using a iommufd instead of a container fd is a followup
>> series.
>>
>> Internally the compatibility API uses a normal IOAS object that, like
>> vfio, is automatically allocated when the first device is
>> attached.
>>
>> Userspace can also query or set this IOAS object directly using the
>> IOMMU_VFIO_IOAS ioctl. This allows mixing and matching new iommufd only
>> features while still using the VFIO style map/unmap ioctls.
>>
>> While this is enough to operate qemu, it is still a bit of a WIP with a
>> few gaps to be resolved:
>>
>>   - Only the TYPE1v2 mode is supported where unmap cannot punch holes or
>>     split areas. The old mode can be implemented with a new operation to
>>     split an iopt_area into two without disturbing the iopt_pages or the
>>     domains, then unmapping a whole area as normal.
>>
>>   - Resource limits rely on memory cgroups to bound what userspace can do
>>     instead of the module parameter dma_entry_limit.
>>
>>   - VFIO P2P is not implemented. Avoiding the follow_pfn() mis-design will
>>     require some additional work to properly expose PFN lifecycle between
>>     VFIO and iommfd
>>
>>   - Various components of the mdev API are not completed yet
>>
>>   - Indefinite suspend of SW access (VFIO_DMA_MAP_FLAG_VADDR) is not
>>     implemented.
>>
>>   - The 'dirty tracking' is not implemented
>>
>>   - A full audit for pedantic compatibility details (eg errnos, etc) has
>>     not yet been done
>>
>>   - powerpc SPAPR is left out, as it is not connected to the iommu_domain
>>     framework. My hope is that SPAPR will be moved into the iommu_domain
>>     framework as a special HW specific type and would expect power to
>>     support the generic interface through a normal iommu_domain.
> 
> My overall question here would be whether we can actually achieve a
> compatibility interface that has sufficient feature transparency that we
> can dump vfio code in favor of this interface, or will there be enough
> niche use cases that we need to keep type1 and vfio containers around
> through a deprecation process?
> 
> The locked memory differences for one seem like something that libvirt
> wouldn't want hidden and we have questions regarding support for vaddr
> hijacking and different ideas how to implement dirty page tracking, not
> to mention the missing features that are currently well used, like p2p
> mappings, coherency tracking, mdev, etc.
>
> It seems like quite an endeavor to fill all these gaps, while at the
> same time QEMU will be working to move to use iommufd directly in order
> to gain all the new features.

Hi Alex,

Jason hasn't included the vfio changes for adapting to iommufd, but they are
in this branch
(https://github.com/luxis1999/iommufd/commits/iommufd-v5.17-rc6). Eric and
I are working on adding iommufd support in QEMU as well. To run the new
QEMU on an old kernel, QEMU needs to support both the legacy
group/container interface and the new device/iommufd interface. We've
got some draft code in this direction
(https://github.com/luxis1999/qemu/commits/qemu-for-5.17-rc4-vm). It works
for both the legacy group/container path and the device/iommufd path. It's
just for reference so far; Eric and I will sync further on it.

> Where do we focus attention?  Is symlinking device files our proposal
> to userspace and is that something achievable, or do we want to use
> this compatibility interface as a means to test the interface and
> allow userspace to make use of it for transition, if their use cases
> allow it, perhaps eventually performing the symlink after deprecation
> and eventual removal of the vfio container and type1 code?  Thanks,

I'm sure it is possible that one day the group/container interface will be
removed from the kernel. Perhaps this will happen when SPAPR is supported by
iommufd. But what about QEMU: should QEMU keep backward compatibility
forever, or may QEMU one day also remove the group/container path and hence
be unable to work on old kernels?
David Gibson April 28, 2022, 2:53 p.m. UTC | #9
On Thu, Mar 24, 2022 at 04:04:03PM -0600, Alex Williamson wrote:
> On Wed, 23 Mar 2022 21:33:42 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Wed, Mar 23, 2022 at 04:51:25PM -0600, Alex Williamson wrote:
> > 
> > > My overall question here would be whether we can actually achieve a
> > > compatibility interface that has sufficient feature transparency that we
> > > can dump vfio code in favor of this interface, or will there be enough
> > > niche use cases that we need to keep type1 and vfio containers around
> > > through a deprecation process?  
> > 
> > Other than SPAPR, I think we can.
> 
> Does this mean #ifdef CONFIG_PPC in vfio core to retain infrastructure
> for POWER support?

There are a few different levels to consider for dealing with PPC.
For a suitable long-term interface for ppc hosts and guests, dropping
this is fine: the ppc-specific iommu model was basically an
ill-conceived idea from the beginning, because none of us had
sufficiently understood what things were general and what things were
iommu model/hw specific.

..mostly.  There are several points of divergence for the ppc iommu
model.

1) Limited IOVA windows.  This one turned out to not really be ppc
specific, and is (rightly) handled generically in the new interface.
No problem here.

2) Costly GUPs.  pseries (the most common ppc machine type) always
expects a (v)IOMMU.  That means that unlike the common x86 model of a
host with IOMMU, but guests with no-vIOMMU, guest initiated
maps/unmaps can be a hot path.  Accounting in that path can be
prohibitive (and on POWER8 in particular it prevented us from
optimizing that path the way we wanted).  We had two solutions for
that, in v1 the explicit ENABLE/DISABLE calls, which preaccounted
based on the IOVA window sizes.  That was improved in the v2 which
used the concept of preregistration.  IIUC iommufd can achieve the
same effect as preregistration using IOAS_COPY, so this one isn't
really a problem either.

3) "dynamic DMA windows" (DDW).  The IBM IOMMU hardware allows for 2 IOVA
windows, which aren't contiguous with each other.  The base addresses
of each of these are fixed, but the size of each window, the pagesize
(i.e. granularity) of each window and the number of levels in the
IOMMU pagetable are runtime configurable.  Because it's true in the
hardware, it's also true of the vIOMMU interface defined by the IBM
hypervisor (and adopted by KVM as well).  So, guests can request
changes in how these windows are handled.  Typical Linux guests will
use the "low" window (IOVA 0..2GiB) dynamically, and the high window
(IOVA 1<<60..???) to map all of RAM.  However, as a hypervisor we
can't count on that; the guest can use them however it wants.


(3) still needs a plan for how to fit it into the /dev/iommufd model.
This is a secondary reason that in the past I advocated for the user
requesting specific DMA windows which the kernel would accept or
refuse, rather than having a query function - it connects easily to
the DDW model.  With the query-first model we'd need some sort of
extension here, not really sure what it should look like.



Then, there's handling existing qemu (or other software) that is using
the VFIO SPAPR_TCE interfaces.  First, it's not entirely clear if this
should be a goal or not: as others have noted, working actively to
port qemu to the new interface at the same time as making a
comprehensive in-kernel compat layer is arguably redundant work.

That said, if we did want to handle this in an in-kernel compat layer,
here's roughly what you'd need for SPAPR_TCE v2:

- VFIO_IOMMU_SPAPR_TCE_GET_INFO
    I think this should be fairly straightforward; the information you
    need should be in the now generic IOVA window stuff and would just
    need massaging into the expected format.
- VFIO_IOMMU_SPAPR_REGISTER_MEMORY /
  VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY
    IIUC, these could be translated into map/unmap operations onto a
    second implicit IOAS which represents the preregistered memory
    areas (to which we'd never connect an actual device).  Along with
    this, VFIO_MAP and VFIO_UNMAP operations would need to check for
    this case, verify their addresses against the preregistered space
    and be translated into IOAS_COPY operations from the prereg
    address space instead of raw IOAS_MAP operations.  Fiddly, but not
    fundamentally hard, I think (a rough sketch follows below).
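
For illustration only, the translation could look roughly like this in terms
of the iommufd uapi proposed in this series (the struct and flag names below
are assumptions based on that proposal, not a settled interface):

#include <sys/ioctl.h>
#include <linux/iommufd.h>

/* ~ VFIO_IOMMU_SPAPR_REGISTER_MEMORY: pin and account once, into a
 * device-less "prereg" IOAS, using the user address as its own IOVA. */
static int prereg_memory(int iommufd, __u32 prereg_ioas,
                         void *vaddr, __u64 length)
{
        struct iommu_ioas_map map = {
                .size = sizeof(map),
                .flags = IOMMU_IOAS_MAP_READABLE |
                         IOMMU_IOAS_MAP_WRITEABLE |
                         IOMMU_IOAS_MAP_FIXED_IOVA,
                .ioas_id = prereg_ioas,
                .user_va = (__u64)vaddr,
                .iova = (__u64)vaddr,
                .length = length,
        };
        return ioctl(iommufd, IOMMU_IOAS_MAP, &map);
}

/* ~ a VFIO_MAP against preregistered memory: no new pinning, just copy
 * from the prereg IOAS into the IOAS the devices are attached to. */
static int map_from_prereg(int iommufd, __u32 prereg_ioas, __u32 dev_ioas,
                           __u64 src_iova, __u64 dst_iova, __u64 length)
{
        struct iommu_ioas_copy copy = {
                .size = sizeof(copy),
                .flags = IOMMU_IOAS_MAP_READABLE |
                         IOMMU_IOAS_MAP_WRITEABLE |
                         IOMMU_IOAS_MAP_FIXED_IOVA,
                .src_ioas_id = prereg_ioas,
                .dst_ioas_id = dev_ioas,
                .src_iova = src_iova,
                .dst_iova = dst_iova,
                .length = length,
        };
        return ioctl(iommufd, IOMMU_IOAS_COPY, &copy);
}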

For SPAPR_TCE_v1 things are a bit trickier

- VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE
    I suspect you could get away with implementing these as no-ops.
    It wouldn't be strictly correct, but I think software which is
    using the interface correctly should work this way, though
    possibly not optimally.  That might be good enough for this ugly
    old interface.

And... then there's VFIO_EEH_PE_OP.  It's very hard to know what to do
with this because the interface was completely broken for most of its
lifetime.  EEH is a fancy error handling feature of IBM PCI hardware
somewhat similar in concept, though not interface, to PCIe AER.  I have
a very strong impression that while this was a much-touted checkbox
feature for RAS, no-one, ever, actually used it.  As evidenced by the
fact that there was, I believe over a *decade* in which all the
interfaces were completely broken by design, and apparently no-one
noticed.

So, cynically, you could probably get away with making this a no-op as
well.  If you wanted to do it properly... well... that would require
training up yet another person to actually understand this and hoping
they get it done before they run screaming.  This one gets very ugly
because the EEH operations have to operate on the hardware (or
firmware) "Partitionable Endpoints" (PEs) which correspond one to one
with IOMMU groups, but not necessarily with VFIO containers, but
there's not really any sensible way to expose that to users.

You might be able to do this by simply failing this outright if
there's anything other than exactly one IOMMU group bound to the
container / IOAS (which I think might be what VFIO itself does now).
Handling that with a device centric API gets somewhat fiddlier, of
course.
Jason Gunthorpe April 28, 2022, 3:10 p.m. UTC | #10
On Fri, Apr 29, 2022 at 12:53:16AM +1000, David Gibson wrote:

> 2) Costly GUPs.  pseries (the most common ppc machine type) always
> expects a (v)IOMMU.  That means that unlike the common x86 model of a
> host with IOMMU, but guests with no-vIOMMU, guest initiated
> maps/unmaps can be a hot path.  Accounting in that path can be
> prohibitive (and on POWER8 in particular it prevented us from
> optimizing that path the way we wanted).  We had two solutions for
> that, in v1 the explicit ENABLE/DISABLE calls, which preaccounted
> based on the IOVA window sizes.  That was improved in the v2 which
> used the concept of preregistration.  IIUC iommufd can achieve the
> same effect as preregistration using IOAS_COPY, so this one isn't
> really a problem either.

I think PPC and S390 are solving the same problem here. I think S390
is going to go to a SW nested model where it has an iommu_domain
controlled by iommufd that is populated with the pinned pages, eg
stored in an xarray.

Then the performance map/unmap path is simply copying pages from the
xarray to the real IOPTEs - and this would be modeled as a nested
iommu_domain with a SW vIOPTE walker instead of a HW vIOPTE walker.

Perhaps this is agreeable for PPC too?
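
A minimal sketch of that model, with all names hypothetical (nothing below is
from this series): the iommufd-controlled parent domain keeps its pinned pages
in an xarray keyed by IOVA page index, and the guest-triggered map path only
copies PFNs out of it into the real IOPTEs:

#include <linux/errno.h>
#include <linux/io.h>
#include <linux/mm.h>
#include <linux/xarray.h>

/* Hypothetical hook that writes one real IOPTE into the HW page table. */
int sw_nested_set_iopte(void *hw_pgtable, unsigned long iova, phys_addr_t pa);

struct sw_nested_domain {
        struct xarray pinned;   /* parent IOVA >> PAGE_SHIFT -> struct page * */
        void *hw_pgtable;
};

static int sw_nested_map_page(struct sw_nested_domain *d,
                              unsigned long parent_iova,
                              unsigned long child_iova)
{
        struct page *page = xa_load(&d->pinned, parent_iova >> PAGE_SHIFT);

        if (!page)
                return -ENOENT;
        /* No GUP and no accounting on this path: pinning happened once,
         * when the parent IOAS was populated, which is what makes this
         * behave like preregistration. */
        return sw_nested_set_iopte(d->hw_pgtable, child_iova,
                                   page_to_phys(page));
}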

> 3) "dynamic DMA windows" (DDW).  The IBM IOMMU hardware allows for 2 IOVA
> windows, which aren't contiguous with each other.  The base addresses
> of each of these are fixed, but the size of each window, the pagesize
> (i.e. granularity) of each window and the number of levels in the
> IOMMU pagetable are runtime configurable.  Because it's true in the
> hardware, it's also true of the vIOMMU interface defined by the IBM
> hypervisor (and adpoted by KVM as well).  So, guests can request
> changes in how these windows are handled.  Typical Linux guests will
> use the "low" window (IOVA 0..2GiB) dynamically, and the high window
> (IOVA 1<<60..???) to map all of RAM.  However, as a hypervisor we
> can't count on that; the guest can use them however it wants.

As part of nesting iommufd will have a 'create iommu_domain using
iommu driver specific data' primitive.

The driver specific data for PPC can include a description of these
windows so the PPC specific qemu driver can issue this new ioctl
using the information provided by the guest.

The main issue is that internally to the iommu subsystem the
iommu_domain aperture is assumed to be a single window. This kAPI will
have to be improved to model the PPC multi-window iommu_domain.

If this API is not used then the PPC driver should choose some
sensible default windows that makes things like DPDK happy.

> Then, there's handling existing qemu (or other software) that is using
> the VFIO SPAPR_TCE interfaces.  First, it's not entirely clear if this
> should be a goal or not: as others have noted, working actively to
> port qemu to the new interface at the same time as making a
> comprehensive in-kernel compat layer is arguably redundant work.

At the moment I think I would stick with not including the SPAPR
interfaces in vfio_compat, but there does seem to be a path if someone
with HW wants to build and test them?

> You might be able to do this by simply failing this outright if
> there's anything other than exactly one IOMMU group bound to the
> container / IOAS (which I think might be what VFIO itself does now).
> Handling that with a device centric API gets somewhat fiddlier, of
> course.

Maybe every device gets a copy of the error notification?

ie maybe this should be part of vfio_pci and not part of iommufd to
mirror how AER works?

It feels strange to put device error notification into iommufd; is
that connected to the IOMMU?

Jason
Tian, Kevin April 29, 2022, 1:21 a.m. UTC | #11
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, April 28, 2022 11:11 PM
> 
> 
> > 3) "dynamic DMA windows" (DDW).  The IBM IOMMU hardware allows for 2 IOVA
> > windows, which aren't contiguous with each other.  The base addresses
> > of each of these are fixed, but the size of each window, the pagesize
> > (i.e. granularity) of each window and the number of levels in the
> > IOMMU pagetable are runtime configurable.  Because it's true in the
> > hardware, it's also true of the vIOMMU interface defined by the IBM
> > hypervisor (and adpoted by KVM as well).  So, guests can request
> > changes in how these windows are handled.  Typical Linux guests will
> > use the "low" window (IOVA 0..2GiB) dynamically, and the high window
> > (IOVA 1<<60..???) to map all of RAM.  However, as a hypervisor we
> > can't count on that; the guest can use them however it wants.
> 
> As part of nesting iommufd will have a 'create iommu_domain using
> iommu driver specific data' primitive.
> 
> The driver specific data for PPC can include a description of these
> windows so the PPC specific qemu driver can issue this new ioctl
> using the information provided by the guest.
> 
> The main issue is that internally to the iommu subsystem the
> iommu_domain aperture is assumed to be a single window. This kAPI will
> have to be improved to model the PPC multi-window iommu_domain.
> 

From the point of view of nesting, probably each window can be a separate
domain; then the existing aperture should still work?
David Gibson April 29, 2022, 6:20 a.m. UTC | #12
On Thu, Apr 28, 2022 at 12:10:37PM -0300, Jason Gunthorpe wrote:
> On Fri, Apr 29, 2022 at 12:53:16AM +1000, David Gibson wrote:
> 
> > 2) Costly GUPs.  pseries (the most common ppc machine type) always
> > expects a (v)IOMMU.  That means that unlike the common x86 model of a
> > host with IOMMU, but guests with no-vIOMMU, guest initiated
> > maps/unmaps can be a hot path.  Accounting in that path can be
> > prohibitive (and on POWER8 in particular it prevented us from
> > optimizing that path the way we wanted).  We had two solutions for
> > that, in v1 the explicit ENABLE/DISABLE calls, which preaccounted
> > based on the IOVA window sizes.  That was improved in the v2 which
> > used the concept of preregistration.  IIUC iommufd can achieve the
> > same effect as preregistration using IOAS_COPY, so this one isn't
> > really a problem either.
> 
> I think PPC and S390 are solving the same problem here. I think S390
> is going to go to a SW nested model where it has an iommu_domain
> controlled by iommufd that is populated with the pinned pages, eg
> stored in an xarray.
> 
> Then the performance map/unmap path is simply copying pages from the
> xarray to the real IOPTEs - and this would be modeled as a nested
> iommu_domain with a SW vIOPTE walker instead of a HW vIOPTE walker.
> 
> Perhaps this is agreeable for PPC too?

Uh.. maybe?  Note that I'm making these comments based on working on
this some years ago (the initial VFIO for ppc implementation in
particular).  I'm no longer actively involved in ppc kernel work.

> > 3) "dynamic DMA windows" (DDW).  The IBM IOMMU hardware allows for 2 IOVA
> > windows, which aren't contiguous with each other.  The base addresses
> > of each of these are fixed, but the size of each window, the pagesize
> > (i.e. granularity) of each window and the number of levels in the
> > IOMMU pagetable are runtime configurable.  Because it's true in the
> > hardware, it's also true of the vIOMMU interface defined by the IBM
> > hypervisor (and adpoted by KVM as well).  So, guests can request
> > changes in how these windows are handled.  Typical Linux guests will
> > use the "low" window (IOVA 0..2GiB) dynamically, and the high window
> > (IOVA 1<<60..???) to map all of RAM.  However, as a hypervisor we
> > can't count on that; the guest can use them however it wants.
> 
> As part of nesting iommufd will have a 'create iommu_domain using
> iommu driver specific data' primitive.
> 
> The driver specific data for PPC can include a description of these
> windows so the PPC specific qemu driver can issue this new ioctl
> using the information provided by the guest.

Hmm.. not sure if that works.  At the moment, qemu (for example) needs
to set up the domains/containers/IOASes as it constructs the machine,
because that's based on the virtual hardware topology.  Initially they
use the default windows (0..2GiB first window, second window
disabled).  Only once the guest kernel is up and running does it issue
the hypercalls to set the final windows as it prefers.  In theory the
guest could change them during runtime though it's unlikely in
practice.  They could change during machine lifetime in practice,
though, if you rebooted from one guest kernel to another that uses a
different configuration.

*Maybe* IOAS construction can be deferred somehow, though I'm not sure
because the assigned devices need to live somewhere.

> The main issue is that internally to the iommu subsystem the
> iommu_domain aperture is assumed to be a single window. This kAPI will
> have to be improved to model the PPC multi-window iommu_domain.

Right.

> If this API is not used then the PPC driver should choose some
> sensible default windows that makes things like DPDK happy.
> 
> > Then, there's handling existing qemu (or other software) that is using
> > the VFIO SPAPR_TCE interfaces.  First, it's not entirely clear if this
> > should be a goal or not: as others have noted, working actively to
> > port qemu to the new interface at the same time as making a
> > comprehensive in-kernel compat layer is arguably redundant work.
> 
> At the moment I think I would stick with not including the SPAPR
> interfaces in vfio_compat, but there does seem to be a path if someone
> with HW wants to build and test them?
> 
> > You might be able to do this by simply failing this outright if
> > there's anything other than exactly one IOMMU group bound to the
> > container / IOAS (which I think might be what VFIO itself does now).
> > Handling that with a device centric API gets somewhat fiddlier, of
> > course.
> 
> Maybe every device gets a copy of the error notification?

Alas, it's harder than that.  One of the things that can happen on an
EEH fault is that the entire PE gets suspended (blocking both DMA and
MMIO, IIRC) until the proper recovery steps are taken.  Since that's
handled at the hardware/firmware level, it will obviously only affect
the host side PE (== host iommu group).  However the interfaces we
have only allow things to be reported to the guest at the granularity
of a guest side PE (== container/IOAS == guest host bridge in
practice).  So to handle this correctly when guest PE != host PE we'd
need to synchronize suspended / recovery state between all the host
PEs in the guest PE.  That *might* be technically possible, but it's
really damn fiddly.

> ie maybe this should be part of vfio_pci and not part of iommufd to
> mirror how AER works?
> 
> It feels strange to put in device error notification to iommufd, is
> that connected the IOMMU?

Only in that it operates at the granularity of a PE, which is mostly an
IOMMU concept.
David Gibson April 29, 2022, 6:22 a.m. UTC | #13
On Fri, Apr 29, 2022 at 01:21:30AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Thursday, April 28, 2022 11:11 PM
> > 
> > 
> > > 3) "dynamic DMA windows" (DDW).  The IBM IOMMU hardware allows for 2 IOVA
> > > windows, which aren't contiguous with each other.  The base addresses
> > > of each of these are fixed, but the size of each window, the pagesize
> > > (i.e. granularity) of each window and the number of levels in the
> > > IOMMU pagetable are runtime configurable.  Because it's true in the
> > > hardware, it's also true of the vIOMMU interface defined by the IBM
> > > hypervisor (and adpoted by KVM as well).  So, guests can request
> > > changes in how these windows are handled.  Typical Linux guests will
> > > use the "low" window (IOVA 0..2GiB) dynamically, and the high window
> > > (IOVA 1<<60..???) to map all of RAM.  However, as a hypervisor we
> > > can't count on that; the guest can use them however it wants.
> > 
> > As part of nesting iommufd will have a 'create iommu_domain using
> > iommu driver specific data' primitive.
> > 
> > The driver specific data for PPC can include a description of these
> > windows so the PPC specific qemu driver can issue this new ioctl
> > using the information provided by the guest.
> > 
> > The main issue is that internally to the iommu subsystem the
> > iommu_domain aperture is assumed to be a single window. This kAPI will
> > have to be improved to model the PPC multi-window iommu_domain.
> > 
> 
> From the point of nesting probably each window can be a separate
> domain then the existing aperture should still work?

Maybe.  There might be several different ways to represent it, but the
vital piece is that any individual device (well, group, technically)
must atomically join/leave both windows at once.
Jason Gunthorpe April 29, 2022, 12:48 p.m. UTC | #14
On Fri, Apr 29, 2022 at 04:20:36PM +1000, David Gibson wrote:

> > I think PPC and S390 are solving the same problem here. I think S390
> > is going to go to a SW nested model where it has an iommu_domain
> > controlled by iommufd that is populated with the pinned pages, eg
> > stored in an xarray.
> > 
> > Then the performance map/unmap path is simply copying pages from the
> > xarray to the real IOPTEs - and this would be modeled as a nested
> > iommu_domain with a SW vIOPTE walker instead of a HW vIOPTE walker.
> > 
> > Perhaps this is agreeable for PPC too?
> 
> Uh.. maybe?  Note that I'm making these comments based on working on
> this some years ago (the initial VFIO for ppc implementation in
> particular).  I'm no longer actively involved in ppc kernel work.

OK
 
> > > 3) "dynamic DMA windows" (DDW).  The IBM IOMMU hardware allows for 2 IOVA
> > > windows, which aren't contiguous with each other.  The base addresses
> > > of each of these are fixed, but the size of each window, the pagesize
> > > (i.e. granularity) of each window and the number of levels in the
> > > IOMMU pagetable are runtime configurable.  Because it's true in the
> > > hardware, it's also true of the vIOMMU interface defined by the IBM
> > > hypervisor (and adpoted by KVM as well).  So, guests can request
> > > changes in how these windows are handled.  Typical Linux guests will
> > > use the "low" window (IOVA 0..2GiB) dynamically, and the high window
> > > (IOVA 1<<60..???) to map all of RAM.  However, as a hypervisor we
> > > can't count on that; the guest can use them however it wants.
> > 
> > As part of nesting iommufd will have a 'create iommu_domain using
> > iommu driver specific data' primitive.
> > 
> > The driver specific data for PPC can include a description of these
> > windows so the PPC specific qemu driver can issue this new ioctl
> > using the information provided by the guest.
> 
> Hmm.. not sure if that works.  At the moment, qemu (for example) needs
> to set up the domains/containers/IOASes as it constructs the machine,
> because that's based on the virtual hardware topology.  Initially they
> use the default windows (0..2GiB first window, second window
> disabled).  Only once the guest kernel is up and running does it issue
> the hypercalls to set the final windows as it prefers.  In theory the
> guest could change them during runtime though it's unlikely in
> practice.  They could change during machine lifetime in practice,
> though, if you rebooted from one guest kernel to another that uses a
> different configuration.
> 
> *Maybe* IOAS construction can be deferred somehow, though I'm not sure
> because the assigned devices need to live somewhere.

This is a general requirement for all the nesting implementations, we
start out with some default nested page table and then later the VM
does the vIOMMU call to change it. So nesting will have to come along
with some kind of 'switch domains IOCTL'

In this case I would guess PPC could do the same and start out with a
small (nested) iommu_domain and then create the VM's desired
iommu_domain from the hypercall, and switch to it.

It is a bit more CPU work since maps in the lower range would have to
be copied over, but conceptually the model matches the HW nesting.

> > > You might be able to do this by simply failing this outright if
> > > there's anything other than exactly one IOMMU group bound to the
> > > container / IOAS (which I think might be what VFIO itself does now).
> > > Handling that with a device centric API gets somewhat fiddlier, of
> > > course.
> > 
> > Maybe every device gets a copy of the error notification?
> 
> Alas, it's harder than that.  One of the things that can happen on an
> EEH fault is that the entire PE gets suspended (blocking both DMA and
> MMIO, IIRC) until the proper recovery steps are taken.  

I think qemu would have to de-duplicate the duplicated device
notifications, and then it can go from a device notification to the
device's iommu_group to the IOAS to the vPE?

A simple serial number in the event would make this pretty simple.

The way back to clear the event would just forward the commands
through a random device in the iommu_group to the PE?

Thanks,
Jason
Jason Gunthorpe April 29, 2022, 12:50 p.m. UTC | #15
On Fri, Apr 29, 2022 at 04:22:56PM +1000, David Gibson wrote:
> On Fri, Apr 29, 2022 at 01:21:30AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Thursday, April 28, 2022 11:11 PM
> > > 
> > > 
> > > > 3) "dynamic DMA windows" (DDW).  The IBM IOMMU hardware allows for 2 IOVA
> > > > windows, which aren't contiguous with each other.  The base addresses
> > > > of each of these are fixed, but the size of each window, the pagesize
> > > > (i.e. granularity) of each window and the number of levels in the
> > > > IOMMU pagetable are runtime configurable.  Because it's true in the
> > > > hardware, it's also true of the vIOMMU interface defined by the IBM
> > > > hypervisor (and adpoted by KVM as well).  So, guests can request
> > > > changes in how these windows are handled.  Typical Linux guests will
> > > > use the "low" window (IOVA 0..2GiB) dynamically, and the high window
> > > > (IOVA 1<<60..???) to map all of RAM.  However, as a hypervisor we
> > > > can't count on that; the guest can use them however it wants.
> > > 
> > > As part of nesting iommufd will have a 'create iommu_domain using
> > > iommu driver specific data' primitive.
> > > 
> > > The driver specific data for PPC can include a description of these
> > > windows so the PPC specific qemu driver can issue this new ioctl
> > > using the information provided by the guest.
> > > 
> > > The main issue is that internally to the iommu subsystem the
> > > iommu_domain aperture is assumed to be a single window. This kAPI will
> > > have to be improved to model the PPC multi-window iommu_domain.
> > > 
> > 
> > From the point of nesting probably each window can be a separate
> > domain then the existing aperture should still work?
> 
> Maybe.  There might be several different ways to represent it, but the
> vital piece is that any individual device (well, group, technically)
> must atomically join/leave both windows at once.

I'm not keen on the multi-iommu_domains because it means we have to
create the idea that a device can be attached to multiple
iommu_domains, which we don't have at all today.

A single iommu_domain already allows PPC to implement its special
rules, like the atomicity requirement above.

Jason
David Gibson May 2, 2022, 4:10 a.m. UTC | #16
On Fri, Apr 29, 2022 at 09:50:30AM -0300, Jason Gunthorpe wrote:
> On Fri, Apr 29, 2022 at 04:22:56PM +1000, David Gibson wrote:
> > On Fri, Apr 29, 2022 at 01:21:30AM +0000, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > Sent: Thursday, April 28, 2022 11:11 PM
> > > > 
> > > > 
> > > > > 3) "dynamic DMA windows" (DDW).  The IBM IOMMU hardware allows for
> > > > 2 IOVA
> > > > > windows, which aren't contiguous with each other.  The base addresses
> > > > > of each of these are fixed, but the size of each window, the pagesize
> > > > > (i.e. granularity) of each window and the number of levels in the
> > > > > IOMMU pagetable are runtime configurable.  Because it's true in the
> > > > > hardware, it's also true of the vIOMMU interface defined by the IBM
> > > > > hypervisor (and adpoted by KVM as well).  So, guests can request
> > > > > changes in how these windows are handled.  Typical Linux guests will
> > > > > use the "low" window (IOVA 0..2GiB) dynamically, and the high window
> > > > > (IOVA 1<<60..???) to map all of RAM.  However, as a hypervisor we
> > > > > can't count on that; the guest can use them however it wants.
> > > > 
> > > > As part of nesting iommufd will have a 'create iommu_domain using
> > > > iommu driver specific data' primitive.
> > > > 
> > > > The driver specific data for PPC can include a description of these
> > > > windows so the PPC specific qemu driver can issue this new ioctl
> > > > using the information provided by the guest.
> > > > 
> > > > The main issue is that internally to the iommu subsystem the
> > > > iommu_domain aperture is assumed to be a single window. This kAPI will
> > > > have to be improved to model the PPC multi-window iommu_domain.
> > > > 
> > > 
> > > From the point of nesting probably each window can be a separate
> > > domain then the existing aperture should still work?
> > 
> > Maybe.  There might be several different ways to represent it, but the
> > vital piece is that any individual device (well, group, technically)
> > must atomically join/leave both windows at once.
> 
> I'm not keen on the multi-iommu_domains because it means we have to
> create the idea that a device can be attached to multiple
> iommu_domains, which we don't have at all today.
> 
> Since iommu_domain allows PPC to implement its special rules, like the
> atomicness above.

I tend to agree; I think extending the iommu domain concept to
incorporate multiple windows makes more sense than extending to allow
multiple domains per device.  I'm just saying there might be other
ways of representing this, and that's not a sticking point for me as
long as the right properties can be preserved.
David Gibson May 2, 2022, 7:30 a.m. UTC | #17
On Fri, Apr 29, 2022 at 09:48:38AM -0300, Jason Gunthorpe wrote:
> On Fri, Apr 29, 2022 at 04:20:36PM +1000, David Gibson wrote:
> 
> > > I think PPC and S390 are solving the same problem here. I think S390
> > > is going to go to a SW nested model where it has an iommu_domain
> > > controlled by iommufd that is populated with the pinned pages, eg
> > > stored in an xarray.
> > > 
> > > Then the performance map/unmap path is simply copying pages from the
> > > xarray to the real IOPTEs - and this would be modeled as a nested
> > > iommu_domain with a SW vIOPTE walker instead of a HW vIOPTE walker.
> > > 
> > > Perhaps this is agreeable for PPC too?
> > 
> > Uh.. maybe?  Note that I'm making these comments based on working on
> > this some years ago (the initial VFIO for ppc implementation in
> > particular).  I'm no longer actively involved in ppc kernel work.
> 
> OK
>  
> > > > 3) "dynamic DMA windows" (DDW).  The IBM IOMMU hardware allows for 2 IOVA
> > > > windows, which aren't contiguous with each other.  The base addresses
> > > > of each of these are fixed, but the size of each window, the pagesize
> > > > (i.e. granularity) of each window and the number of levels in the
> > > > IOMMU pagetable are runtime configurable.  Because it's true in the
> > > > hardware, it's also true of the vIOMMU interface defined by the IBM
> > > > hypervisor (and adpoted by KVM as well).  So, guests can request
> > > > changes in how these windows are handled.  Typical Linux guests will
> > > > use the "low" window (IOVA 0..2GiB) dynamically, and the high window
> > > > (IOVA 1<<60..???) to map all of RAM.  However, as a hypervisor we
> > > > can't count on that; the guest can use them however it wants.
> > > 
> > > As part of nesting iommufd will have a 'create iommu_domain using
> > > iommu driver specific data' primitive.
> > > 
> > > The driver specific data for PPC can include a description of these
> > > windows so the PPC specific qemu driver can issue this new ioctl
> > > using the information provided by the guest.
> > 
> > Hmm.. not sure if that works.  At the moment, qemu (for example) needs
> > to set up the domains/containers/IOASes as it constructs the machine,
> > because that's based on the virtual hardware topology.  Initially they
> > use the default windows (0..2GiB first window, second window
> > disabled).  Only once the guest kernel is up and running does it issue
> > the hypercalls to set the final windows as it prefers.  In theory the
> > guest could change them during runtime though it's unlikely in
> > practice.  They could change during machine lifetime in practice,
> > though, if you rebooted from one guest kernel to another that uses a
> > different configuration.
> > 
> > *Maybe* IOAS construction can be deferred somehow, though I'm not sure
> > because the assigned devices need to live somewhere.
> 
> This is a general requirement for all the nesting implementations, we
> start out with some default nested page table and then later the VM
> does the vIOMMU call to change it. So nesting will have to come along
> with some kind of 'switch domains IOCTL'
> 
> In this case I would guess PPC could do the same and start out with a
> small (nested) iommu_domain and then create the VM's desired
> iommu_domain from the hypercall, and switch to it.
> 
> It is a bit more CPU work since maps in the lower range would have to
> be copied over, but conceptually the model matches the HW nesting.

Ah.. ok.  IIUC what you're saying is that the kernel-side IOASes have
fixed windows, but we fake dynamic windows in the userspace
implementation by flipping the devices over to a new IOAS with the new
windows.  Is that right?

Where exactly would the windows be specified?  My understanding was
that when creating a back-end specific IOAS, that would typically be
for the case where you're using a user / guest managed IO pagetable,
with the backend specifying the format for that.  In the ppc case we'd
need to specify the windows, but we'd still need the IOAS_MAP/UNMAP
operations to manage the mappings.  The PAPR vIOMMU is
paravirtualized, so all updates come via hypercalls, so there's no
user/guest managed data structure.

That should work from the point of view of the userspace and guest
side interfaces.  It might be fiddly from the point of view of the
back end.  The ppc iommu doesn't really have the notion of
configurable domains - instead the address spaces are the hardware or
firmware fixed PEs, so they have a fixed set of devices.  At the bare
metal level it's possible to sort of do domains by making the actual
pagetable pointers for several PEs point to a common place.

However, in the future, nested KVM under PowerVM is likely to be the
norm.  In that situation the L1 as well as the L2 only has the
paravirtualized interfaces, which don't have any notion of domains,
only PEs.  All updates take place via hypercalls which explicitly
specify a PE (strictly speaking they take a "Logical IO Bus Number"
(LIOBN), but those generally map one to one with PEs), so it can't use
shared pointer tricks either.

Now obviously we can choose devices to share a logical iommu domain by
just duplicating mappings across the PEs, but what we can't do is
construct mappings in a domain and *then* atomically move devices into
it.


So, here's an alternative set of interfaces that should work for ppc,
maybe you can tell me whether they also work for x86 and others:

  * Each domain/IOAS has a concept of one or more IOVA windows, which
    each have a base address, size, pagesize (granularity) and optionally
    other flags/attributes.
      * This has some bearing on hardware capabilities, but is
        primarily a software notion
  * MAP/UNMAP operations are only permitted within an existing IOVA
    window (and with addresses aligned to the window's pagesize)
      * This is enforced by software whether or not it is required by
        the underlying hardware
  * Likewise IOAS_COPY operations are only permitted if the source and
    destination windows have compatible attributes
  * A newly created kernel-managed IOAS has *no* IOVA windows
  * A CREATE_WINDOW operation is added
      * This takes a size, pagesize/granularity, optional base address
        and optional additional attributes 
      * If any of the specified attributes are incompatible with the
        underlying hardware, the operation fails
    * "Compatibility" doesn't require an exact match: creating a small
       window within the hardware's much larger valid address range is
       fine, as is creating a window with a pagesize that's a multiple
       of the hardware pagesize
    * Unspecified attributes (including base address) are allocated by
      the kernel, based on the hardware capabilities/configuration
    * The properties of the window (including allocated ones) are
      recorded with the IOAS/domain
    * For IOMMUs like ppc which don't have fixed ranges, this would
      create windows on the backend to match, if possible, otherwise
      fail
  * A REMOVE_WINDOW operation reverses a single CREATE_WINDOW
    * This would also delete any MAPs within that window
  * ATTACH operations recalculate the hardware capabilities (as now),
    then verify them against the already created windows, and fail if
    the existing windows can't be preserved

MAP/UNMAP, CREATE/REMOVE_WINDOW and ATTACH/DETACH operations can
happen in any order, though whether they succeed may depend on the
order in some cases.
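
To make the proposal concrete, a rough sketch of what CREATE_WINDOW
might look like as an iommufd ioctl struct. Nothing below exists in
this series; the struct name, fields and flag are all hypothetical and
only follow the general style of the other iommufd uAPI structs:

#include <linux/types.h>

struct iommu_ioas_create_window {
	__u32 size;		/* sizeof(struct iommu_ioas_create_window) */
	__u32 flags;		/* e.g. a FIXED_IOVA flag when a base address is supplied */
	__u32 ioas_id;
	__u32 __reserved;
	__aligned_u64 iova;	/* in: requested base (optional), out: chosen base */
	__aligned_u64 length;	/* window size in bytes */
	__aligned_u64 page_size; /* granularity of mappings inside the window */
};

With something like this, the DPDK sequence below reduces to a single
CREATE_WINDOW with only length and page_size filled in, followed by
IOAS_MAP into whatever base address the kernel returns.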

So, for a userspace driver setup (e.g. DPDK) the typical sequence
would be:

  1. Create IOAS
  2. ATTACH all the devices we want to use
  3. CREATE_WINDOW with the size and pagesize we need, other
     attributes unspecified
       - Because we already attached the devices, this will allocate a
         suitable base address and attributes for us
  4. IOAS_MAP the buffers we want into the window that was allocated
     for us
  5. Do work

For qemu or another hypervisor with a ppc (PAPR) guest, the sequence would be:

  1. Create IOAS for each guest side bus
  2. Determine the expected IOVA ranges based on the guest
     configuration (model of vIOMMU, what virtual devices are present)
  3. CREATE_WINDOW ranges for the default guest side windows.  This
     will have base address specified
       - If this fails, we're out of luck, report error and quit
  4. ATTACH devices that are on the guest bus associated with each
     IOAS
       - Again, if this fails, report error and quit
  5. Start machine, boot guest
  6. During runtime:
       - If the guest attempts to create new windows, do new
         CREATE_WINDOW operations.  If they fail, fail the initiating
	 hypercall
       - If the guest maps things into the IO pagetable, IOAS_MAP
         them.  If that fails, fail the hypercall
       - If the user hotplugs a device, try to ATTACH it.  If that
         fails, fail the hotplug operation.
       - If the machine is reset, REMOVE_WINDOW whatever's there and
         CREATE_WINDOW the defaults again

For qemu with an x86 guest, the sequence would be:

  1. Create default IOAS
  2. CREATE_WINDOW for all guest memory blocks
     - If it fails, report and quit
  3. IOAS_MAP guest RAM into default IOAS so that GPA==IOVA
  4. ATTACH all virtual devices to the default IOAS
     - If it fails, report and quit
  5. Start machine, boot guest
  6. During runtime
     - If the guest assigns devices to domains on the vIOMMU, create
       new IOAS for each guest domain, and ATTACH the device to the
       new IOAS
     - If the guest triggers a vIOMMU cache flush, (re)mirror the guest IO
       pagetables into the host using IOAS_MAP or COPY
     - If the user hotplugs a device, ATTACH it to the default
       domain.  If that fails, fail the hotplug
     - If the user hotplugs memory, CREATE_WINDOW in the default IOAS
       for the new memory region.  If that fails, fail the hotplug
     - On system reset, re-ATTACH all devices to default IOAS, remove
       all non-default IOAS


> > > > You might be able to do this by simply failing this outright if
> > > > there's anything other than exactly one IOMMU group bound to the
> > > > container / IOAS (which I think might be what VFIO itself does now).
> > > > Handling that with a device centric API gets somewhat fiddlier, of
> > > > course.
> > > 
> > > Maybe every device gets a copy of the error notification?
> > 
> > Alas, it's harder than that.  One of the things that can happen on an
> > EEH fault is that the entire PE gets suspended (blocking both DMA and
> > MMIO, IIRC) until the proper recovery steps are taken.  
> 
> I think qemu would have to de-duplicate the duplicated device
> notifications and then it can go from a device notifiation to the
> device's iommu_group to the IOAS to the vPE?

It's not about the notifications.  When the PE goes into
suspended/error state, that changes guest visible behaviour of
devices.  It essentially blocks all IO on the bus (the design
philosophy is to prioritize not propagating bad data above all else).
That state is triggered on (certain sorts of) IO errors and will
remain until explicit recovery steps are taken.

The guest might see that IO blockage before it gets/processes the
notifications (because concurrency).  On the host side only the host
PE will be affected, but on the guest side we can only report and
recover things at the guest PE level, which might be larger than a
host PE.

So the problem is if we have a guest PE with multiple host PEs, and we
trip an error which puts one of the host PEs into error state, we have
what's essentially an inconsistent, invalid state from the point of
view of the guest interfaces.  The guest has no way to sensibly query
or recover from this.

To make this work, essentially we'd have to trap the error on the
host, then inject errors to put all of the host PEs in the guest PE
into the same error state.  When the guest takes recovery actions we'd
need to duplicate those actions on all the host PEs.  If one of those
steps fails on one of the host PEs we have to work out how to report
that and bring the PEs back into a synchronized state.

Like I said, it might be possible, but it's really, really hairy.  All
for a feature that I'm not convinced anyone has ever used in earnest.

> A simple serial number in the event would make this pretty simple.
> 
> The way back to clear the event would just forward the commands
> through a random device in the iommu_group to the PE?
> 
> Thanks,
> Jason
>
Jason Gunthorpe May 5, 2022, 7:07 p.m. UTC | #18
On Mon, May 02, 2022 at 05:30:05PM +1000, David Gibson wrote:

> > It is a bit more CPU work since maps in the lower range would have to
> > be copied over, but conceptually the model matches the HW nesting.
> 
> Ah.. ok.  IIUC what you're saying is that the kernel-side IOASes have
> fixed windows, but we fake dynamic windows in the userspace
> implementation by flipping the devices over to a new IOAS with the new
> windows.  Is that right?

Yes

> Where exactly would the windows be specified?  My understanding was
> that when creating a back-end specific IOAS, that would typically be
> for the case where you're using a user / guest managed IO pagetable,
> with the backend specifying the format for that.  In the ppc case we'd
> need to specify the windows, but we'd still need the IOAS_MAP/UNMAP
> operations to manage the mappings.  The PAPR vIOMMU is
> paravirtualized, so all updates come via hypercalls, so there's no
> user/guest managed data structure.

When the iommu_domain is created I want to have an
iommu-driver-specific struct, so PPC can customize its iommu_domain
however it likes.

> That should work from the point of view of the userspace and guest
> side interfaces.  It might be fiddly from the point of view of the
> back end.  The ppc iommu doesn't really have the notion of
> configurable domains - instead the address spaces are the hardware or
> firmware fixed PEs, so they have a fixed set of devices.  At the bare
> metal level it's possible to sort of do domains by making the actual
> pagetable pointers for several PEs point to a common place.

I'm not sure I understand this - a domain is just a storage container
for an IO page table, if the HW has IOPTEs then it should be able to
have a domain?

Making page table pointers point to a common IOPTE tree is exactly
what iommu_domains are for - why is that "sort of" for ppc?

> However, in the future, nested KVM under PowerVM is likely to be the
> norm.  In that situation the L1 as well as the L2 only has the
> paravirtualized interfaces, which don't have any notion of domains,
> only PEs.  All updates take place via hypercalls which explicitly
> specify a PE (strictly speaking they take a "Logical IO Bus Number"
> (LIOBN), but those generally map one to one with PEs), so it can't use
> shared pointer tricks either.

How do the paravirtualized interfaces deal with the page table? Do
they call a map/unmap hypercall instead of providing guest IOPTEs?

Assuming yes, I'd expect that:

The iommu_domain for nested PPC is just a log of map/unmap hypervisor
calls to make. Whenever a new PE is attached to that domain it gets
the logged maps replayed to set it up, and when a PE is detached the
log is used to unmap everything.

It is not perfectly memory efficient - and we could perhaps talk about
an API modification to allow re-use of the iommufd data structure
somehow, but I think this is a good logical starting point.

The PE would have to be modeled as an iommu_group.
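
Concretely, the 'replay log' could be something like this minimal
sketch (names invented, xarray init, error handling and locking
elided - it only shows the map-log plus replay idea, not a real iommu
driver):

#include <linux/iommu.h>
#include <linux/xarray.h>

struct ppc_sw_domain {
	struct iommu_domain domain;
	struct xarray maps;		/* IO page index -> host PFN */
	unsigned long page_shift;
};

/* Hypothetical wrappers around the H_PUT_TCE hypercall */
int ppc_sw_put_tce(unsigned long liobn, unsigned long ioba, unsigned long tce);
int ppc_sw_put_tce_all_pes(struct ppc_sw_domain *d, unsigned long iova,
			   unsigned long pfn);

/* Record the mapping in the log, then mirror it to every attached PE */
static int ppc_sw_domain_map_page(struct ppc_sw_domain *d,
				  unsigned long iova, unsigned long pfn)
{
	int rc = xa_err(xa_store(&d->maps, iova >> d->page_shift,
				 xa_mk_value(pfn), GFP_KERNEL));

	if (rc)
		return rc;
	return ppc_sw_put_tce_all_pes(d, iova, pfn);
}

/* On attach, sweep the log and issue the hypercalls for the new PE */
static int ppc_sw_domain_replay(struct ppc_sw_domain *d, unsigned long liobn)
{
	unsigned long index;
	void *entry;

	xa_for_each(&d->maps, index, entry) {
		if (ppc_sw_put_tce(liobn, index << d->page_shift,
				   xa_to_value(entry)))
			return -EIO;
	}
	return 0;
}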

> So, here's an alternative set of interfaces that should work for ppc,
> maybe you can tell me whether they also work for x86 and others:

Fundamentally PPC has to fit into the iommu standard framework of
group and domains, we can talk about modifications, but drifting too
far away is a big problem.

>   * Each domain/IOAS has a concept of one or more IOVA windows, which
>     each have a base address, size, pagesize (granularity) and optionally
>     other flags/attributes.
>       * This has some bearing on hardware capabilities, but is
>         primarily a software notion

The iommu_domain has the aperture; PPC will require extending this to
a list of apertures since it is currently only one window.

Once a domain is created and attached to a group the aperture should
be immutable.
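
For reference, today this is a single range (aperture_start /
aperture_end in struct iommu_domain_geometry); a multi-window
extension might look roughly like the sketch below - purely
illustrative, not something this series implements:

#include <linux/iommu.h>
#include <linux/types.h>

/* Illustrative only: one entry per window, fixed at domain creation */
struct iommu_domain_aperture {
	dma_addr_t start;
	dma_addr_t end;			/* inclusive, like aperture_end today */
	unsigned long pgsize_bitmap;	/* per-window page sizes, if they differ */
};

struct iommu_domain_multi_geometry {
	unsigned int num_apertures;
	const struct iommu_domain_aperture *apertures;	/* immutable after attach */
	bool force_aperture;
};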

>   * MAP/UNMAP operations are only permitted within an existing IOVA
>     window (and with addresses aligned to the window's pagesize)
>       * This is enforced by software whether or not it is required by
>         the underlying hardware
>   * Likewise IOAS_COPY operations are only permitted if the source and
>     destination windows have compatible attributes

Already done, domain's aperture restricts all the iommufd operations

>   * A newly created kernel-managed IOAS has *no* IOVA windows

Already done, the iommufd IOAS has no iommu_domains inside it at
creation time.

>   * A CREATE_WINDOW operation is added
>       * This takes a size, pagesize/granularity, optional base address
>         and optional additional attributes 
>       * If any of the specified attributes are incompatible with the
>         underlying hardware, the operation fails

iommu layer has nothing called a window. The closest thing is a
domain.

I really don't want to try to make a new iommu layer object that is so
unique and special to PPC - we have to figure out how to fit PPC into
the iommu_domain model with reasonable extensions.

> > > > Maybe every device gets a copy of the error notification?
> > > 
> > > Alas, it's harder than that.  One of the things that can happen on an
> > > EEH fault is that the entire PE gets suspended (blocking both DMA and
> > > MMIO, IIRC) until the proper recovery steps are taken.  
> > 
> > I think qemu would have to de-duplicate the duplicated device
> > notifications and then it can go from a device notifiation to the
> > device's iommu_group to the IOAS to the vPE?
> 
> It's not about the notifications. 

The only thing the kernel can do is relay a notification that something
happened to a PE. The kernel gets an event on the PE basis, I would
like it to replicate it to all the devices and push it through the
VFIO device FD.

qemu will de-duplicate the replicates and recover exactly the same
event the kernel saw, delivered at exactly the same time.

If instead you want to have one event per-PE then all that changes in
the kernel is one event is generated instead of N, and qemu doesn't
have to throw away the duplicates.

With either method qemu still gets the same PE centric event,
delivered at the same time, with the same races.

This is not a general mechanism, it is some PPC specific thing to
communicate a PPC specific PE centric event to userspace. I just
prefer it in VFIO instead of iommufd.

Jason
David Gibson May 6, 2022, 5:25 a.m. UTC | #19
On Thu, May 05, 2022 at 04:07:28PM -0300, Jason Gunthorpe wrote:
> On Mon, May 02, 2022 at 05:30:05PM +1000, David Gibson wrote:
> 
> > > It is a bit more CPU work since maps in the lower range would have to
> > > be copied over, but conceptually the model matches the HW nesting.
> > 
> > Ah.. ok.  IIUC what you're saying is that the kernel-side IOASes have
> > fixed windows, but we fake dynamic windows in the userspace
> > implementation by flipping the devices over to a new IOAS with the new
> > windows.  Is that right?
> 
> Yes
> 
> > Where exactly would the windows be specified?  My understanding was
> > that when creating a back-end specific IOAS, that would typically be
> > for the case where you're using a user / guest managed IO pagetable,
> > with the backend specifying the format for that.  In the ppc case we'd
> > need to specify the windows, but we'd still need the IOAS_MAP/UNMAP
> > operations to manage the mappings.  The PAPR vIOMMU is
> > paravirtualized, so all updates come via hypercalls, so there's no
> > user/guest managed data structure.
> 
> When the iommu_domain is created I want to have a
> iommu-driver-specific struct, so PPC can customize its iommu_domain
> however it likes.

This requires that the client be aware of the host side IOMMU model.
That's true in VFIO now, and it's nasty; I was really hoping we could
*stop* doing that.

Note that I'm talking here *purely* about the non-optimized case where
all updates to the host side IO pagetables are handled by IOAS_MAP /
IOAS_COPY, with no direct hardware access to user or guest managed IO
pagetables.  The optimized case obviously requires end-to-end
agreement on the pagetable format amongst other domain properties.

What I'm hoping is that qemu (or whatever) can use this non-optimized
case as a fallback where it doesn't need to know the properties of
whatever host side IOMMU models there are.  It just requests what it
needs based on the vIOMMU properties it needs to replicate and the
host kernel either can supply it or can't.

In many cases it should be perfectly possible to emulate a PPC style
vIOMMU on an x86 host, because the x86 IOMMU has such a colossal
aperture that it will encompass wherever the ppc apertures end
up. Similarly we could simulate an x86-style no-vIOMMU guest on a ppc
host (currently somewhere between awkward and impossible) by placing
the host apertures to cover guest memory.

Admittedly those are pretty niche cases, but allowing for them gives
us flexibility for the future.  Emulating an ARM SMMUv3 guest on an
ARM SMMU v4 or v5 or v.whatever host is likely to be a real case in
the future, and AFAICT, ARM are much less conservative than x86 about
maintaining similar hw interfaces over time.  That's why I think
considering these ppc cases will give a more robust interface for
other future possibilities as well.

> > That should work from the point of view of the userspace and guest
> > side interfaces.  It might be fiddly from the point of view of the
> > back end.  The ppc iommu doesn't really have the notion of
> > configurable domains - instead the address spaces are the hardware or
> > firmware fixed PEs, so they have a fixed set of devices.  At the bare
> > metal level it's possible to sort of do domains by making the actual
> > pagetable pointers for several PEs point to a common place.
> 
> I'm not sure I understand this - a domain is just a storage container
> for an IO page table, if the HW has IOPTEs then it should be able to
> have a domain?
> 
> Making page table pointers point to a common IOPTE tree is exactly
> what iommu_domains are for - why is that "sort of" for ppc?

Ok, fair enough, it's only "sort of" in the sense that the hw specs /
docs don't present any equivalent concept.

> > However, in the future, nested KVM under PowerVM is likely to be the
> > norm.  In that situation the L1 as well as the L2 only has the
> > paravirtualized interfaces, which don't have any notion of domains,
> > only PEs.  All updates take place via hypercalls which explicitly
> > specify a PE (strictly speaking they take a "Logical IO Bus Number"
> > (LIOBN), but those generally map one to one with PEs), so it can't use
> > shared pointer tricks either.
> 
> How does the paravirtualized interfaces deal with the page table? Does
> it call a map/unmap hypercall instead of providing guest IOPTEs?

Sort of.  The main interface is H_PUT_TCE ("TCE" - Translation Control
Entry - being IBMese for an IOPTE). This takes an LIOBN (which selects
which PE and aperture), an IOVA and a TCE value - which is a guest
physical address plus some permission bits.  There are some variants
for performance that can set a batch of IOPTEs from a buffer, or clear
a range of IOPTEs, but they're just faster ways of doing the same
thing as a bunch of H_PUT_TCE calls.

You can consider that a map/unmap hypercall, but the size of the
mapping is fixed (the IO pagesize which was previously set for the
aperture).
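
For anyone not familiar with PAPR, the rough shape of the call is
sketched below. This only shows the semantics - the wrapper name and
the permission bit values are illustrative, not the kernel's actual
definitions:

#include <linux/types.h>

/* liobn selects the PE + aperture, ioba is the IOVA inside it, and one
 * TCE covers exactly one IO page of the aperture's fixed page size. */
long h_put_tce(unsigned long liobn, unsigned long ioba, unsigned long tce);

#define TCE_PERM_READ	0x1UL	/* illustrative bit values */
#define TCE_PERM_WRITE	0x2UL

static long map_one_io_page(unsigned long liobn, unsigned long ioba,
			    unsigned long guest_phys, bool writable)
{
	unsigned long tce = guest_phys | TCE_PERM_READ |
			    (writable ? TCE_PERM_WRITE : 0);

	return h_put_tce(liobn, ioba, tce);
}

static long unmap_one_io_page(unsigned long liobn, unsigned long ioba)
{
	return h_put_tce(liobn, ioba, 0);	/* a zero TCE clears the entry */
}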

> Assuming yes, I'd expect that:
> 
> The iommu_domain for nested PPC is just a log of map/unmap hypervsior
> calls to make. Whenever a new PE is attached to that domain it gets
> the logged map's replayed to set it up, and when a PE is detached the
> log is used to unmap everything.

And likewise duplicate every H_PUT_TCE to all the PEs in the domain.
Sure.  It means the changes won't be atomic across the domain, but I
guess that doesn't matter.  I guess you could have the same thing on a
sufficiently complex x86 or ARM system, if you put two devices into
the IOAS that were sufficiently far from each other in the bus
topology that they use a different top-level host IOMMU.

> It is not perfectly memory efficient - and we could perhaps talk about
> a API modification to allow re-use of the iommufd datastructure
> somehow, but I think this is a good logical starting point.

Because the map size is fixed, a "replay log" is effectively
equivalent to a mirror of the entire IO pagetable.

> The PE would have to be modeled as an iommu_group.

Yes, they already are.

> > So, here's an alternative set of interfaces that should work for ppc,
> > maybe you can tell me whether they also work for x86 and others:
> 
> Fundamentally PPC has to fit into the iommu standard framework of
> group and domains, we can talk about modifications, but drifting too
> far away is a big problem.

Well, what I'm trying to do here is to see how big a change to the
model this really requires.  If it's not too much, we gain flexibility
for future unknown IOMMU models as well as for the admittedly niche
ppc case.

> >   * Each domain/IOAS has a concept of one or more IOVA windows, which
> >     each have a base address, size, pagesize (granularity) and optionally
> >     other flags/attributes.
> >       * This has some bearing on hardware capabilities, but is
> >         primarily a software notion
> 
> iommu_domain has the aperture, PPC will require extending this to a
> list of apertures since it is currently only one window.

Yes, that's needed as a minimum.

> Once a domain is created and attached to a group the aperture should
> be immutable.

I realize that's the model right now, but is there a strong rationale
for that immutability?

Come to that, IIUC that's true for the iommu_domain at the lower
level, but not for the IOAS at a higher level.  You've stated that's
*not* immutable, since it can shrink as new devices are added to the
IOAS.  So I guess in that case the IOAS must be mapped by multiple
iommu_domains?

> >   * MAP/UNMAP operations are only permitted within an existing IOVA
> >     window (and with addresses aligned to the window's pagesize)
> >       * This is enforced by software whether or not it is required by
> >         the underlying hardware
> >   * Likewise IOAS_COPY operations are only permitted if the source and
> >     destination windows have compatible attributes
> 
> Already done, domain's aperture restricts all the iommufd operations

Right but that aperture is coming only from the hardware.  What I'm
suggesting here is that userspace can essentially opt into a *smaller*
aperture (or apertures) than the hardware permits.  The value of this
is that if the effective hardware aperture shrinks due to adding
devices, the kernel has the information it needs to determine if this
will be a problem for the userspace client or not.

> >   * A newly created kernel-managed IOAS has *no* IOVA windows
> 
> Already done, the iommufd IOAS has no iommu_domains inside it at
> creation time.

That.. doesn't seem like the same thing at all.  If there are no
domains, there are no restrictions from the hardware, so there's
effectively an unlimited aperture.

> >   * A CREATE_WINDOW operation is added
> >       * This takes a size, pagesize/granularity, optional base address
> >         and optional additional attributes 
> >       * If any of the specified attributes are incompatible with the
> >         underlying hardware, the operation fails
> 
> iommu layer has nothing called a window. The closest thing is a
> domain.

Maybe "window" is a bad name.  You called it "aperture" above (and
I've shifted to that naming) and implied it *was* part of the IOMMU
domain.  That said, it doesn't need to be at the iommu layer - as I
said I'm considering this primarily a software concept and it could be
at the IOAS layer.  The linkage would be that every iommufd operation
which could change either the user-selected windows or the effective
hardware apertures must ensure that all the user windows (still) lie
within the hardware apertures.

> I really don't want to try to make a new iommu layer object that is so
> unique and special to PPC - we have to figure out how to fit PPC into
> the iommu_domain model with reasonable extensions.

So, I'm probably conflating things between the iommu layer and the
iommufd/IOAS layer because it's a been a while since I looked at this
code. I'm primarily interested in the interfaces at the iommufd layer,
I don't much care at what layer it's implemented.

Having the iommu layer only care about the hw limitations (let's call
those "apertures") and the iommufd/IOAS layer care about the software
chosen limitations (let's call those "windows") makes reasonable sense
to me: HW advertises apertures according to its capabilities, SW
requests windows according to its needs.  The iommufd layer does its
best to accommodate SW's operations while maintaining the invariant
that the windows must always lie within the apertures.

At least.. that's the model on x86 and most other hosts.  For a ppc
host, we'd want attempts to create windows to also attempt to create
apertures on the hardware (or next-level hypervisor).  But
with this model, that doesn't require userspace to do anything
different, so that can be localized to the ppc host backend.

> > > > > Maybe every device gets a copy of the error notification?
> > > > 
> > > > Alas, it's harder than that.  One of the things that can happen on an
> > > > EEH fault is that the entire PE gets suspended (blocking both DMA and
> > > > MMIO, IIRC) until the proper recovery steps are taken.  
> > > 
> > > I think qemu would have to de-duplicate the duplicated device
> > > notifications and then it can go from a device notifiation to the
> > > device's iommu_group to the IOAS to the vPE?
> > 
> > It's not about the notifications. 
> 
> The only thing the kernel can do is rely a notification that something
> happened to a PE. The kernel gets an event on the PE basis, I would
> like it to replicate it to all the devices and push it through the
> VFIO device FD.

Again: it's not about the notifications (I don't even really know how
those work in EEH).  The way EEH is specified, the expectation is that
when an error is tripped, the whole PE goes into an error state,
blocking IO for every device in the PE.  Something working with EEH
expects that to happen at the (virtual) hardware level.  So if a guest
trips an error on one device, it expects IO to stop for every other
device in the PE.  But if hardware's notion of the PE is smaller than
the guest's, the host hardware won't accomplish that itself.

So the kernel would somehow have to replicate the error state in one
host PE to the other host PEs within the domain.  I think it's
theoretically possible, but it's really fiddly. It has to maintain
that replicated state on every step of the recovery process as
well. At last count, trying to figure out how to do this correctly has
burnt out 3 or more separate (competent) developers at IBM, who either
just gave up and ran away, or procrastinated and avoided it until they
got a different job.

Bear in mind that in this scenario things are already not working
quite right at the hw level (or there wouldn't be an error in the
first place), so different results from the same recovery/injection
steps on different host PEs are a very real possibility.  We can't
present that back to the guest running the recovery, because the
interface only allows for a single result.  So every operation's error
path becomes a nightmare of trying to resynchronize the state across
the domain, which involves steps that themselves could fail, meaning
more resync... and so on.

> qemu will de-duplicate the replicates and recover exactly the same
> event the kernel saw, delivered at exactly the same time.

qemu's not doing the recovery, the guest is.  So we can't choose the
interfaces, we just have what PAPR specifies (and like most PAPR
interfaces, it's pretty crap).

> If instead you want to have one event per-PE then all that changes in
> the kernel is one event is generated instead of N, and qemu doesn't
> have to throw away the duplicates.
> 
> With either method qemu still gets the same PE centric event,
> delivered at the same time, with the same races.
> 
> This is not a general mechanism, it is some PPC specific thing to
> communicate a PPC specific PE centric event to userspace. I just
> prefer it in VFIO instead of iommufd.

Oh, yes, this one (unlike the multi-aperture stuff, where I'm not yet
convinced) is definitely ppc specific.  And I'm frankly not convinced
it's of any value even on ppc.  Given I'm not sure anyone has ever
used this in anger, I think a better approach is to just straight up
fail any attempt to do any EEH operation if there's not a 1 to 1
mapping from guest PE (domain) to host PE (group).

Or just ignore it entirely and count on no-one noticing.  EEH is
a largely irrelevant tangent to the actual interface issues I'm
interested in.
Tian, Kevin May 6, 2022, 10:42 a.m. UTC | #20
> From: David Gibson <david@gibson.dropbear.id.au>
> Sent: Friday, May 6, 2022 1:25 PM
> 
> >
> > When the iommu_domain is created I want to have a
> > iommu-driver-specific struct, so PPC can customize its iommu_domain
> > however it likes.
> 
> This requires that the client be aware of the host side IOMMU model.
> That's true in VFIO now, and it's nasty; I was really hoping we could
> *stop* doing that.

That model is anyway inevitable when talking about user page tables,
i.e. when nesting is enabled.

> 
> Note that I'm talking here *purely* about the non-optimized case where
> all updates to the host side IO pagetables are handled by IOAS_MAP /
> IOAS_COPY, with no direct hardware access to user or guest managed IO
> pagetables.  The optimized case obviously requires end-to-end
> agreement on the pagetable format amongst other domain properties.
> 
> What I'm hoping is that qemu (or whatever) can use this non-optimized
> as a fallback case where it does't need to know the properties of
> whatever host side IOMMU models there are.  It just requests what it
> needs based on the vIOMMU properties it needs to replicate and the
> host kernel either can supply it or can't.
> 
> In many cases it should be perfectly possible to emulate a PPC style
> vIOMMU on an x86 host, because the x86 IOMMU has such a colossal
> aperture that it will encompass wherever the ppc apertures end
> up. Similarly we could simulate an x86-style no-vIOMMU guest on a ppc
> host (currently somewhere between awkward and impossible) by placing
> the host apertures to cover guest memory.
> 
> Admittedly those are pretty niche cases, but allowing for them gives
> us flexibility for the future.  Emulating an ARM SMMUv3 guest on an
> ARM SMMU v4 or v5 or v.whatever host is likely to be a real case in
> the future, and AFAICT, ARM are much less conservative that x86 about
> maintaining similar hw interfaces over time.  That's why I think
> considering these ppc cases will give a more robust interface for
> other future possibilities as well.

These are not niche cases. We already have virtio-iommu which can work
on both ARM and x86 platforms, i.e. what current iommufd provides
is already generic enough except on PPC.

Then IMHO the key open here is:

Can PPC adapt to the current iommufd proposal if it can be
refactored to fit the standard iommu domain/group concepts?

If not, what is the remaining gap after PPC becomes a normal
citizen in the iommu layer and is it worth solving it in the general
interface or via iommu-driver-specific domain (given this will
exist anyway)?

to close that open I'm with Jason:

   "Fundamentally PPC has to fit into the iommu standard framework of
   group and domains, we can talk about modifications, but drifting too
   far away is a big problem."

Directly jumping to the iommufd layer for changes that might be
applied to all platforms sounds counter-intuitive if we haven't tried
to solve the gap in the iommu layer in the first place. Even if there
is an argument that certain changes in the iommufd layer can find a
matching concept on other platforms, it still looks somewhat redundant,
since those platforms already work with the current model.

My two cents.

Thanks
Kevin
Jason Gunthorpe May 6, 2022, 12:48 p.m. UTC | #21
On Fri, May 06, 2022 at 03:25:03PM +1000, David Gibson wrote:
> On Thu, May 05, 2022 at 04:07:28PM -0300, Jason Gunthorpe wrote:

> > When the iommu_domain is created I want to have a
> > iommu-driver-specific struct, so PPC can customize its iommu_domain
> > however it likes.
> 
> This requires that the client be aware of the host side IOMMU model.
> That's true in VFIO now, and it's nasty; I was really hoping we could
> *stop* doing that.

iommufd has two modes. The 'generic interface', which is what this
patch series shows, does not require any device-specific knowledge.

The default iommu_domain that the iommu driver creates will be used
here; it is up to the iommu driver to choose something reasonable for
use by applications like DPDK, ie PPC should probably pick its biggest
x86-like aperture.

The iommu-driver-specific struct is the "advanced" interface and
allows a user-space IOMMU driver to tightly control the HW with full
HW specific knowledge. This is where all the weird stuff that is not
general should go.

> Note that I'm talking here *purely* about the non-optimized case where
> all updates to the host side IO pagetables are handled by IOAS_MAP /
> IOAS_COPY, with no direct hardware access to user or guest managed IO
> pagetables.  The optimized case obviously requires end-to-end
> agreement on the pagetable format amongst other domain properties.

Sure, this is how things are already..

> What I'm hoping is that qemu (or whatever) can use this non-optimized
> as a fallback case where it does't need to know the properties of
> whatever host side IOMMU models there are.  It just requests what it
> needs based on the vIOMMU properties it needs to replicate and the
> host kernel either can supply it or can't.

There aren't really any negotiable vIOMMU properties beyond the
ranges, and the ranges are not *really* negotiable.

There are lots of dragons with the idea we can actually negotiate
ranges - because asking for the wrong range for what the HW can do
means you don't get anything. Which is completely contrary to the idea
of easy generic support for things like DPDK.

So DPDK can't ask for ranges, it is not generic.

This means we are really talking about a qemu-only API, and IMHO, qemu
is fine to have a PPC specific userspace driver to tweak this PPC
unique thing if the default windows are not acceptable.

IMHO it is no different from imagining an Intel specific userspace
driver that is using userspace IO pagetables to optimize
cross-platform qemu vIOMMU emulation. We should be comfortable with
the idea that accessing the full device-specific feature set requires
a HW specific user space driver.

> Admittedly those are pretty niche cases, but allowing for them gives
> us flexibility for the future.  Emulating an ARM SMMUv3 guest on an
> ARM SMMU v4 or v5 or v.whatever host is likely to be a real case in
> the future, and AFAICT, ARM are much less conservative that x86 about
> maintaining similar hw interfaces over time.  That's why I think
> considering these ppc cases will give a more robust interface for
> other future possibilities as well.

I don't think so - PPC has just done two things that are completely
divergent from everything else - having two IO page tables for the same
end point, and using map/unmap hypercalls instead of a nested page
table.

Everyone else seems to be focused on IOPTEs that are similar to CPU
PTEs, particularly to enable SVA and other tricks, and CPUs don't
have either kind of weirdness.

> You can consider that a map/unmap hypercall, but the size of the
> mapping is fixed (the IO pagesize which was previously set for the
> aperture).

Yes, I would consider that a map/unmap hypercall vs a nested translation.
 
> > Assuming yes, I'd expect that:
> > 
> > The iommu_domain for nested PPC is just a log of map/unmap hypervsior
> > calls to make. Whenever a new PE is attached to that domain it gets
> > the logged map's replayed to set it up, and when a PE is detached the
> > log is used to unmap everything.
> 
> And likewise duplicate every H_PUT_TCE to all the PEs in the domain.
> Sure.  It means the changes won't be atomic across the domain, but I
> guess that doesn't matter.  I guess you could have the same thing on a
> sufficiently complex x86 or ARM system, if you put two devices into
> the IOAS that were sufficiently far from each other in the bus
> topology that they use a different top-level host IOMMU.

Yes, strict atomicity is not needed.

> > It is not perfectly memory efficient - and we could perhaps talk about
> > a API modification to allow re-use of the iommufd datastructure
> > somehow, but I think this is a good logical starting point.
> 
> Because the map size is fixed, a "replay log" is effectively
> equivalent to a mirror of the entire IO pagetable.

So, for virtualized PPC the iommu_domain is an xarray of mapped PFNs
and device attach/detach just sweeps the xarray and does the
hypercalls. Very similar to what we discussed for S390.

It seems OK, this isn't even really overhead in most cases as we
always have to track the mapped PFNs anyhow.

> > Once a domain is created and attached to a group the aperture should
> > be immutable.
> 
> I realize that's the model right now, but is there a strong rationale
> for that immutability?

I have a strong preference that iommu_domains have immutable
properties once they are created just for overall sanity. Otherwise
everything becomes a racy mess.

If the iommu_domain changes dynamically then things using the aperture
data get all broken - it is complicated to fix. So it would need a big
reason to do it, I think.
 
> Come to that, IIUC that's true for the iommu_domain at the lower
> level, but not for the IOAS at a higher level.  You've stated that's
> *not* immutable, since it can shrink as new devices are added to the
> IOAS.  So I guess in that case the IOAS must be mapped by multiple
> iommu_domains?

Yes, that's right. The IOAS is expressly mutable because it sits on
multiple domains and multiple groups, each of which contributes to the
aperture. The IOAS aperture is the narrowest of everything, and we
have semi-reasonable semantics here. Here we have all the special code
to juggle this, but even in this case we can't handle an iommu_domain
or group changing asynchronously.

> Right but that aperture is coming only from the hardware.  What I'm
> suggesting here is that userspace can essentially opt into a *smaller*
> aperture (or apertures) than the hardware permits.  The value of this
> is that if the effective hardware aperture shrinks due to adding
> devices, the kernel has the information it needs to determine if this
> will be a problem for the userspace client or not.

We can do this check in userspace without more kernel APIs; userspace
should fetch the ranges and confirm they are still good after
attaching devices.
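
Something like this userspace-side check, where fetching 'have' stands
in for the iommufd ranges-query ioctl (its exact struct layout is not
reproduced here):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct iova_range {
	uint64_t start;
	uint64_t last;		/* inclusive */
};

/* 'have' is what the kernel reports after the attach, 'need' is what
 * the application already relies on. */
static bool ranges_still_ok(const struct iova_range *have, size_t nr_have,
			    const struct iova_range *need, size_t nr_need)
{
	for (size_t i = 0; i < nr_need; i++) {
		bool covered = false;

		for (size_t j = 0; j < nr_have; j++) {
			if (need[i].start >= have[j].start &&
			    need[i].last <= have[j].last) {
				covered = true;
				break;
			}
		}
		if (!covered)
			return false;
	}
	return true;
}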

In general I have no issue with limiting the IOVA allocator in the
kernel; I just don't have a use case of an application that could use
the IOVA allocator (like DPDK) and also needs a limitation.

> > >   * A newly created kernel-managed IOAS has *no* IOVA windows
> > 
> > Already done, the iommufd IOAS has no iommu_domains inside it at
> > creation time.
> 
> That.. doesn't seem like the same thing at all.  If there are no
> domains, there are no restrictions from the hardware, so there's
> effectively an unlimited aperture.

Yes.. You wanted a 0 sized window instead? Why? That breaks what I'm
trying to do to make DPDK/etc portable and dead simple.
 
> > >   * A CREATE_WINDOW operation is added
> > >       * This takes a size, pagesize/granularity, optional base address
> > >         and optional additional attributes 
> > >       * If any of the specified attributes are incompatible with the
> > >         underlying hardware, the operation fails
> > 
> > iommu layer has nothing called a window. The closest thing is a
> > domain.
> 
> Maybe "window" is a bad name.  You called it "aperture" above (and
> I've shifted to that naming) and implied it *was* part of the IOMMU
> domain. 

It is but not as an object that can be mutated - it is just a
property.

You are talking about a window object that exists somehow; I don't
know how this fits beyond some creation attribute of the domain.

> That said, it doesn't need to be at the iommu layer - as I
> said I'm consdiering this primarily a software concept and it could be
> at the IOAS layer.  The linkage would be that every iommufd operation
> which could change either the user-selected windows or the effective
> hardware apertures must ensure that all the user windows (still) lie
> within the hardware apertures.

As Kevin said, we need to start at the iommu_domain layer first -
when we understand how that needs to look then we can imagine what the
uAPI should be.

I don't want to create something in iommufd that is wildly divergent
from what the iommu_domain/etc can support.

> > The only thing the kernel can do is rely a notification that something
> > happened to a PE. The kernel gets an event on the PE basis, I would
> > like it to replicate it to all the devices and push it through the
> > VFIO device FD.
> 
> Again: it's not about the notifications (I don't even really know how
> those work in EEH).

Well, then I don't know what we are talking about - I'm interested in
what uAPI is needed to support this; as far as I can tell that is a
notification that something bad happened and some control knobs?

As I said, I'd prefer this to be on the vfio_device FD vs on iommufd
and would like to find a way to make that work.

> expects that to happen at the (virtual) hardware level.  So if a guest
> trips an error on one device, it expects IO to stop for every other
> device in the PE.  But if hardware's notion of the PE is smaller than
> the guest's the host hardware won't accomplish that itself.

So, why do that to the guest? Shouldn't the PE in the guest be
strictly a subset of the PE in the host, otherwise you hit all these
problems you are talking about?

> used this in anger, I think a better approach is to just straight up
> fail any attempt to do any EEH operation if there's not a 1 to 1
> mapping from guest PE (domain) to host PE (group).

That makes sense

Jason
David Gibson May 9, 2022, 3:36 a.m. UTC | #22
On Fri, May 06, 2022 at 10:42:21AM +0000, Tian, Kevin wrote:
> > From: David Gibson <david@gibson.dropbear.id.au>
> > Sent: Friday, May 6, 2022 1:25 PM
> > 
> > >
> > > When the iommu_domain is created I want to have a
> > > iommu-driver-specific struct, so PPC can customize its iommu_domain
> > > however it likes.
> > 
> > This requires that the client be aware of the host side IOMMU model.
> > That's true in VFIO now, and it's nasty; I was really hoping we could
> > *stop* doing that.
> 
> that model is anyway inevitable when talking about user page table,

Right, but I'm explicitly not talking about the user managed page
table case.  I'm talking about the case where the IO pagetable is
still managed by the kernel and we update it via IOAS_MAP and similar
operations.

> i.e. when nesting is enabled.

I don't really follow the connection you're drawing between a user
managed table and nesting.

> > Note that I'm talking here *purely* about the non-optimized case where
> > all updates to the host side IO pagetables are handled by IOAS_MAP /
> > IOAS_COPY, with no direct hardware access to user or guest managed IO
> > pagetables.  The optimized case obviously requires end-to-end
> > agreement on the pagetable format amongst other domain properties.
> > 
> > What I'm hoping is that qemu (or whatever) can use this non-optimized
> > as a fallback case where it does't need to know the properties of
> > whatever host side IOMMU models there are.  It just requests what it
> > needs based on the vIOMMU properties it needs to replicate and the
> > host kernel either can supply it or can't.
> > 
> > In many cases it should be perfectly possible to emulate a PPC style
> > vIOMMU on an x86 host, because the x86 IOMMU has such a colossal
> > aperture that it will encompass wherever the ppc apertures end
> > up. Similarly we could simulate an x86-style no-vIOMMU guest on a ppc
> > host (currently somewhere between awkward and impossible) by placing
> > the host apertures to cover guest memory.
> > 
> > Admittedly those are pretty niche cases, but allowing for them gives
> > us flexibility for the future.  Emulating an ARM SMMUv3 guest on an
> > ARM SMMU v4 or v5 or v.whatever host is likely to be a real case in
> > the future, and AFAICT, ARM are much less conservative that x86 about
> > maintaining similar hw interfaces over time.  That's why I think
> > considering these ppc cases will give a more robust interface for
> > other future possibilities as well.
> 
> It's not niche cases. We already have virtio-iommu which can work
> on both ARM and x86 platforms, i.e. what current iommufd provides
> is already generic enough except on PPC.
> 
> Then IMHO the key open here is:
> 
> Can PPC adapt to the current iommufd proposal if it can be
> refactored to fit the standard iommu domain/group concepts?

Right...  and I'm still trying to figure out whether it can adapt to
either part of that.  We absolutely need to allow for multiple IOVA
apertures within a domain.  If we have that I *think* we can manage
(if suboptimally), but I'm trying to figure out the corner cases to
make sure I haven't missed something.

> If not, what is the remaining gap after PPC becomes a normal
> citizen in the iommu layer and is it worth solving it in the general
> interface or via iommu-driver-specific domain (given this will
> exist anyway)?
> 
> to close that open I'm with Jason:
> 
>    "Fundamentally PPC has to fit into the iommu standard framework of
>    group and domains, we can talk about modifications, but drifting too
>    far away is a big problem."
> 
> Directly jumping to the iommufd layer for what changes might be
> applied to all platforms sounds counter-intuitive if we haven't tried 
> to solve the gap in the iommu layer in the first place, as even
> there is argument that certain changes in iommufd layer can find
> matching concept on other platforms it still sort of looks redundant
> since those platforms already work with the current model.

I don't really follow what you're saying here.
David Gibson May 9, 2022, 6:01 a.m. UTC | #23
On Fri, May 06, 2022 at 09:48:37AM -0300, Jason Gunthorpe wrote:
> On Fri, May 06, 2022 at 03:25:03PM +1000, David Gibson wrote:
> > On Thu, May 05, 2022 at 04:07:28PM -0300, Jason Gunthorpe wrote:
> 
> > > When the iommu_domain is created I want to have a
> > > iommu-driver-specific struct, so PPC can customize its iommu_domain
> > > however it likes.
> > 
> > This requires that the client be aware of the host side IOMMU model.
> > That's true in VFIO now, and it's nasty; I was really hoping we could
> > *stop* doing that.
> 
> iommufd has two modes, the 'generic interface' which what this patch
> series shows that does not require any device specific knowledge.

Right, and I'm speaking specifically to that generic interface.  But
I'm thinking particularly about the qemu case where we do have
specific knowledge of the *guest* vIOMMU, but we want to avoid having
specific knowledge of the host IOMMU, because they might not be the same.

It would be good to have a way of seeing if the guest vIOMMU can be
emulated on this host IOMMU without qemu having to have separate
logic for every host IOMMU.

> The default iommu_domain that the iommu driver creates will be used
> here, it is up to the iommu driver to choose something reasonable for
> use by applications like DPDK. ie PPC should probably pick its biggest
> x86-like aperture.

So, using the big aperture means a very high base IOVA
(1<<59)... which means that it won't work at all if you want to attach
any devices that aren't capable of 64-bit DMA.  Using the maximum
possible window size would mean we either potentially waste a lot of
kernel memory on pagetables, or use an unnecessarily large number of
levels in the pagetable.

Basically we don't have enough information to make a good decision
here.

More generally, the problem with the interface advertising limitations
and it being up to userspace to work out if those are ok or not is
that it's fragile.  It's pretty plausible that some future IOMMU model
will have some new kind of limitation that can't be expressed in the
query structure we invented now.  That means that to add support for
that we need some kind of gate to prevent old userspace using the new
IOMMU (e.g. only allowing the new IOMMU to be used if userspace uses
newly added queries to get the new limitations).  That's true even if
what userspace was actually doing with the IOMMU would fit just fine
into those new limitations.

But if userspace requests the capabilities it wants, and the kernel
acks or nacks that, we can support the new host IOMMU with existing
software just fine.  They won't be able to use any *new* features or
capabilities of the new hardware, of course, but they'll be able to
use what it does that overlaps with what they needed before.

ppc - or more correctly, the POWER and PAPR IOMMU models - is just
acting here as an example of an IOMMU with limitations and
capabilities that don't fit into the current query model.

> The iommu-driver-specific struct is the "advanced" interface and
> allows a user-space IOMMU driver to tightly control the HW with full
> HW specific knowledge. This is where all the weird stuff that is not
> general should go.

Right, but forcing anything more complicated than "give me some IOVA
region" to go through the advanced interface means that qemu (or any
hypervisor where the guest platform need not identically match the
host) has to have n^2 complexity to match each guest IOMMU model to
each host IOMMU model.

> > Note that I'm talking here *purely* about the non-optimized case where
> > all updates to the host side IO pagetables are handled by IOAS_MAP /
> > IOAS_COPY, with no direct hardware access to user or guest managed IO
> > pagetables.  The optimized case obviously requires end-to-end
> > agreement on the pagetable format amongst other domain properties.
> 
> Sure, this is how things are already..
> 
> > What I'm hoping is that qemu (or whatever) can use this non-optimized
> > as a fallback case where it doesn't need to know the properties of
> > whatever host side IOMMU models there are.  It just requests what it
> > needs based on the vIOMMU properties it needs to replicate and the
> > host kernel either can supply it or can't.
> 
> There aren't really any negotiable vIOMMU properties beyond the
> ranges, and the ranges are not *really* negotiable.

Errr.. how do you figure?  On ppc the ranges and pagesizes are
definitely negotiable.  I'm not really familiar with other models, but
anything which allows *any* variations in the pagetable structure will
effectively have at least some negotiable properties.

Even if any individual host IOMMU doesn't have negotiable properties
(which ppc demonstrates is false), there's still a negotiation here in
the context that userspace doesn't know (and doesn't care) what
specific host IOMMU model it has.

> There are lots of dragons with the idea we can actually negotiate
> ranges - because asking for the wrong range for what the HW can do
> means you don't get anything. Which is completely contrary to the idea
> of easy generic support for things like DPDK.
>
> So DPDK can't ask for ranges, it is not generic.

Which is why I'm suggesting that the base address be an optional
request.  DPDK *will* care about the size of the range, so it just
requests that and gets told a base address.

Qemu emulating a vIOMMU absolutely does care about the ranges, and if
the HW can't do it, failing outright is the correct behaviour.

> This means we are really talking about a qemu-only API,

Kind of, yes.  I can't immediately think of anything which would need
this fine-grained control over the IOMMU capabilities other than an
emulator/hypervisor emulating a different vIOMMU than the host IOMMU.
qemu is by far the most prominent example of this.

> and IMHO, qemu
> is fine to have a PPC specific userspace driver to tweak this PPC
> unique thing if the default windows are not acceptable.

Not really, because it's the ppc *target* (guest) side which requires
the specific properties, but selecting the "advanced" interface
requires special knowledge on the *host* side.

> IMHO it is no different from imagining an Intel specific userspace
> driver that is using userspace IO pagetables to optimize
> cross-platform qemu vIOMMU emulation.

I'm not quite sure what you have in mind here.  How is it both Intel
specific and cross-platform?

> We should be comfortable with
> the idea that accessing the full device-specific feature set requires
> a HW specific user space driver.
> 
> > Admittedly those are pretty niche cases, but allowing for them gives
> > us flexibility for the future.  Emulating an ARM SMMUv3 guest on an
> > ARM SMMU v4 or v5 or v.whatever host is likely to be a real case in
> > the future, and AFAICT, ARM are much less conservative than x86 about
> > maintaining similar hw interfaces over time.  That's why I think
> > considering these ppc cases will give a more robust interface for
> > other future possibilities as well.
> 
> I don't think so - PPC has just done two things that are completely
> divergent from everything else - having two IO page tables for the same
> end point, and using map/unmap hypercalls instead of a nested page
> table.
> 
> Everyone else seems to be focused on IOPTEs that are similar to CPU
> PTEs, particularly to enable SVA and other tricks, and CPUs don't
> have either of those oddities.

Well, you may have a point there.  Though, my experience makes me
pretty reluctant to ever bet on hw designers *not* inserting some
weird new constraint and assuming sw can just figure it out somehow.

Note however, that having multiple apertures isn't really ppc specific.
Anything with an IO hole effectively has separate apertures above and
below the hole.  They're much closer together in address than POWER's
two apertures, but I don't see that makes any fundamental difference
to the handling of them.

> > You can consider that a map/unmap hypercall, but the size of the
> > mapping is fixed (the IO pagesize which was previously set for the
> > aperture).
> 
> Yes, I would consider that a map/unmap hypercall vs a nested translation.
>  
> > > Assuming yes, I'd expect that:
> > > 
> > > The iommu_domain for nested PPC is just a log of map/unmap hypervsior
> > > calls to make. Whenever a new PE is attached to that domain it gets
> > > the logged map's replayed to set it up, and when a PE is detached the
> > > log is used to unmap everything.
> > 
> > And likewise duplicate every H_PUT_TCE to all the PEs in the domain.
> > Sure.  It means the changes won't be atomic across the domain, but I
> > guess that doesn't matter.  I guess you could have the same thing on a
> > sufficiently complex x86 or ARM system, if you put two devices into
> > the IOAS that were sufficiently far from each other in the bus
> > topology that they use a different top-level host IOMMU.
> 
> Yes, strict atomicity is not needed.

Ok, good to know.

> > > It is not perfectly memory efficient - and we could perhaps talk about
> > > a API modification to allow re-use of the iommufd datastructure
> > > somehow, but I think this is a good logical starting point.
> > 
> > Because the map size is fixed, a "replay log" is effectively
> > equivalent to a mirror of the entire IO pagetable.
> 
> So, for virtualized PPC the iommu_domain is an xarray of mapped PFNs
> and device attach/detach just sweeps the xarray and does the
> hypercalls. Very similar to what we discussed for S390.
> 
> It seems OK, this isn't even really overhead in most cases as we
> always have to track the mapped PFNs anyhow.

Fair enough, I think that works.
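
For concreteness, a rough kernel-side sketch of that xarray "replay
log" idea - every name below is hypothetical, it only illustrates the
attach-time sweep:

#include <linux/iommu.h>
#include <linux/xarray.h>

struct ppc_iommu_domain {
	struct iommu_domain domain;
	struct xarray maps;		/* IOVA page index -> guest PFN */
};

static int ppc_domain_record_map(struct ppc_iommu_domain *pd,
				 unsigned long iova_pfn, unsigned long pfn)
{
	/* log the mapping so a later attach can replay it */
	return xa_err(xa_store(&pd->maps, iova_pfn, xa_mk_value(pfn),
			       GFP_KERNEL));
}

static int ppc_domain_replay(struct ppc_iommu_domain *pd, struct ppc_pe *pe)
{
	unsigned long index;
	void *entry;
	int ret;

	/* sweep the log, issuing one H_PUT_TCE-style call per entry */
	xa_for_each(&pd->maps, index, entry) {
		ret = ppc_pe_put_tce(pe, index, xa_to_value(entry));
		if (ret)
			return ret;
	}
	return 0;
}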

> > > Once a domain is created and attached to a group the aperture should
> > > be immutable.
> > 
> > I realize that's the model right now, but is there a strong rationale
> > for that immutability?
> 
> I have a strong preference that iommu_domains have immutable
> properties once they are created just for overall sanity. Otherwise
> everything becomes a racy mess.
> 
> If the iommu_domain changes dynamically then things using the aperture
> data get all broken - it is complicated to fix. So it would need a big
> reason to do it, I think.

Ok, that makes sense.  I'm pretty ok with the iommu_domain having
static apertures.  We can't optimally implement ddw that way, but I
think we can do it well enough by creating new domains and copying the
mappings, as you suggested.

> > Come to that, IIUC that's true for the iommu_domain at the lower
> > level, but not for the IOAS at a higher level.  You've stated that's
> > *not* immutable, since it can shrink as new devices are added to the
> > IOAS.  So I guess in that case the IOAS must be mapped by multiple
> > iommu_domains?
> 
> Yes, that's right. The IOAS is expressly mutable because it sits on
> multiple domains and multiple groups each of which contribute to the
> aperture. The IOAS aperture is the narrowest of everything, and we
> have semi-reasonable semantics here. Here we have all the special code
> to juggle this, but even in this case we can't handle an iommu_domain
> or group changing asynchronously.

Ok.

So at the IOAS level, the apertures are already dynamic.  Arguably the
main point of my proposal is that it makes any changes in IOAS
apertures explicit, rather than implicit.  Typically userspace
wouldn't need to rely on the aperture information from a query,
because they know it will never shrink below the parameters they
specified.

Another approach would be to give the required apertures / pagesizes
in the initial creation of the domain/IOAS.  In that case they would
be static for the IOAS, as well as the underlying iommu_domains: any
ATTACH which would be incompatible would fail.

That makes life hard for the DPDK case, though.  Obviously we can
still make the base address optional, but for it to be static the
kernel would have to pick it immediately, before we know what devices
will be attached, which will be a problem on any system where there
are multiple IOMMUs with different constraints.

> > Right but that aperture is coming only from the hardware.  What I'm
> > suggesting here is that userspace can essentially opt into a *smaller*
> > aperture (or apertures) than the hardware permits.  The value of this
> > is that if the effective hardware aperture shrinks due to adding
> > devices, the kernel has the information it needs to determine if this
> > will be a problem for the userspace client or not.
> 
> We can do this check in userspace without more kernel APIs, userspace
> should fetch the ranges and confirm they are still good after
> attaching devices.

Well, yes, but I always think an API which stops you doing the wrong
thing is better than one which requires you take a bunch of extra
steps to use safely.  Plus, as noted above, this breaks down if some
future IOMMU has a new constraint not expressed in the current query
API.

> In general I have no issue with limiting the IOVA allocator in the
> kernel, I just don't have a use case of an application that could use
> the IOVA allocator (like DPDK) and also needs a limitation..

Well, I imagine DPDK has at least the minimal limitation that it needs
the aperture to be a certain minimum size (I'm guessing at least the
size of its pinned hugepage working memory region).  That's a
limitation that's unlikely to fail on modern hardware, but it's there.

> > > >   * A newly created kernel-managed IOAS has *no* IOVA windows
> > > 
> > > Already done, the iommufd IOAS has no iommu_domains inside it at
> > > creation time.
> > 
> > That.. doesn't seem like the same thing at all.  If there are no
> > domains, there are no restrictions from the hardware, so there's
> > effectively an unlimited aperture.
> 
> Yes.. You wanted a 0 sized window instead?

Yes.  Well.. since I'm thinking in terms of multiple windows, I was
thinking of it as "0 windows" rather than a 0-sized window, but
whatever.  Starting with no permitted IOVAs is the relevant point.

> Why?

To force the app to declare its windows with CREATE_WINDOW before
using them.

> That breaks what I'm
> trying to do to make DPDK/etc portable and dead simple.

It doesn't break portability at all.  As for simplicity, yes it adds
an extra required step, but the payoff is that it's now impossible to
subtly screw up by failing to recheck your apertures after an ATTACH.
That is, it's taking a step which was implicitly required and
replacing it with one that's explicitly required.

> > > >   * A CREATE_WINDOW operation is added
> > > >       * This takes a size, pagesize/granularity, optional base address
> > > >         and optional additional attributes 
> > > >       * If any of the specified attributes are incompatible with the
> > > >         underlying hardware, the operation fails
> > > 
> > > iommu layer has nothing called a window. The closest thing is a
> > > domain.
> > 
> > Maybe "window" is a bad name.  You called it "aperture" above (and
> > I've shifted to that naming) and implied it *was* part of the IOMMU
> > domain. 
> 
> It is but not as an object that can be mutated - it is just a
> property.
> 
> You are talking about a window object that exists somehow, I don't
> know this fits beyond some creation attribute of the domain..

At the domain layer, I think we can manage it that way, albeit
suboptimally.

> > That said, it doesn't need to be at the iommu layer - as I
> > said I'm considering this primarily a software concept and it could be
> > at the IOAS layer.  The linkage would be that every iommufd operation
> > which could change either the user-selected windows or the effective
> > hardware apertures must ensure that all the user windows (still) lie
> > within the hardware apertures.
> 
> As Kevin said, we need to start at the iommu_domain layer first -
> when we understand how that needs to look then we can imagine what the
> uAPI should be.

Hm, ok.  I'm looking in the other direction - what uAPIs are needed to
do the things I'm considering in userspace, given the constraints of
the hardware I know about.  I'm seeing what layer we match one to the
other at as secondary.

> I don't want to create something in iommufd that is wildly divergent
> from what the iommu_domain/etc can support.

Fair enough.

> > > The only thing the kernel can do is rely a notification that something
> > > happened to a PE. The kernel gets an event on the PE basis, I would
> > > like it to replicate it to all the devices and push it through the
> > > VFIO device FD.
> > 
> > Again: it's not about the notifications (I don't even really know how
> > those work in EEH).
> 
> Well, then I don't know what we are talking about - I'm interested in
> what uAPI is needed to support this, as far as I can tell that is a
> notification something bad happened and some control knobs?

As far as I remember that's basically right, but the question is the
granularity at which those control knobs operate, given that some of
the knobs can be automatically flipped in case of a hardware error.

> As I said, I'd prefer this to be on the vfio_device FD vs on iommufd
> and would like to find a way to make that work.

It probably does belong on the device FD, but we need to consider the
IOMMU, because these things are operating at the granularity of (at
least) an IOMMU group.

> > expects that to happen at the (virtual) hardware level.  So if a guest
> > trips an error on one device, it expects IO to stop for every other
> > device in the PE.  But if hardware's notion of the PE is smaller than
> > the guest's the host hardware won't accomplish that itself.
> 
> So, why do that to the guest? Shouldn't the PE in the guest be
> strictly a subset of the PE in the host, otherwise you hit all these
> problems you are talking about?

Ah.. yes, I did leave that out.  It's moderately complicated

- The guest PE can never be a (strict) subset of the host PE.  A PE
  is an IOMMU group in Linux terms, so if a host PE were split
  amongst guest PEs we couldn't supply the isolation semantics that
  the guest kernel will expect.  So one guest PE must always consist
  of 1 or more host PEs.

- We can't choose the guest PE boundaries completely arbitrarily.  The
  PAPR interfaces tie the PE boundaries to (guest) PCI(e) topology in a
  pretty confusing way that comes from the history of IBM hardware.

- To keep things simple, qemu makes the choice that each guest PCI
  Host Bridge (vPHB) is a single (guest) PE.  (Having many independent
  PHBs is routine on POWER systems.)

- So, where the guest PEs lie depends on how the user/configuration
  assigns devices to PHBs in the guest topology.

- We enforce (as we must) that devices from a single host PE can't be
  split across guest PEs.  I think qemu has specific checks for this
  in order to give better errors, but it will also be enforced by
  VFIO: we have one container per PHB (guest PE) so we can't attach
  the same group (host PE) to two of them.

- We don't enforce that we have at *most* one host PE per PHB - and
  that's useful, because lots of scripts / management tools / whatever
  designed for x86 just put all PCI devices onto the same vPHB,
  regardless of how they're grouped on the host.

- There are some additional complications because of the possibility
  of qemu emulated devices, which don't have *any* host PE.

> > used this in anger, I think a better approach is to just straight up
> > fail any attempt to do any EEH operation if there's not a 1 to 1
> > mapping from guest PE (domain) to host PE (group).
> 
> That makes sense

Short of a time machine to murd^H^H^H^Heducate the people who wrote
the EEH interfaces, I think it's the best we have.
Jason Gunthorpe May 9, 2022, 2 p.m. UTC | #24
On Mon, May 09, 2022 at 04:01:52PM +1000, David Gibson wrote:

> > The default iommu_domain that the iommu driver creates will be used
> > here, it is up to the iommu driver to choose something reasonable for
> > use by applications like DPDK. ie PPC should probably pick its biggest
> > x86-like aperture.
> 
> So, using the big aperture means a very high base IOVA
> (1<<59)... which means that it won't work at all if you want to attach
> any devices that aren't capable of 64-bit DMA.

I'd expect to include the 32 bit window too..

> Using the maximum possible window size would mean we either
> potentially waste a lot of kernel memory on pagetables, or we use
> unnecessarily large number of levels to the pagetable.

All drivers have this issue to one degree or another. We seem to be
ignoring it - in any case this is a micro optimization, not a
functional need?

> More generally, the problem with the interface advertising limitations
> and it being up to userspace to work out if those are ok or not is
> that it's fragile.  It's pretty plausible that some future IOMMU model
> will have some new kind of limitation that can't be expressed in the
> query structure we invented now.

The basic API is very simple - the driver needs to provide ranges of
IOVA and map/unmap - I don't think we have a future problem here we
need to try and guess and solve today.

Even PPC fits this just fine, the open question for DPDK is more
around optimization, not functional.

> But if userspace requests the capabilities it wants, and the kernel
> acks or nacks that, we can support the new host IOMMU with existing
> software just fine.

No, this just makes it fragile in the other direction because now
userspace has to know what platform specific things to ask for *or it
doesn't work at all*. This is not an improvement for the DPDK cases.

Kernel decides, using all the kernel knowledge it has and tells the
application what it can do - this is the basic simplified interface.

> > The iommu-driver-specific struct is the "advanced" interface and
> > allows a user-space IOMMU driver to tightly control the HW with full
> > HW specific knowledge. This is where all the weird stuff that is not
> > general should go.
> 
> Right, but forcing anything more complicated than "give me some IOVA
> region" to go through the advanced interface means that qemu (or any
> hypervisor where the guest platform need not identically match the
> host) has to have n^2 complexity to match each guest IOMMU model to
> each host IOMMU model.

I wouldn't say n^2, but yes, qemu needs to have a userspace driver for
the platform IOMMU, and yes it needs this to reach optimal
behavior. We already know this is a hard requirement for using nesting
as acceleration, I don't see why apertures are so different.

> Errr.. how do you figure?  On ppc the ranges and pagesizes are
> definitely negotiable.  I'm not really familiar with other models, but
> anything which allows *any* variations in the pagetable structure will
> effectively have at least some negotiable properties.

As above, if you ask for the wrong thing then you don't get
anything. If DPDK asks for something that works on ARM like 0 -> 4G
then PPC and x86 will always fail. How is this improving anything to
require applications to carefully ask for exactly the right platform
specific ranges?

It isn't like there is some hard coded value we can put into DPDK that
will work on every platform. So kernel must pick for DPDK, IMHO. I
don't see any feasible alternative.

> Which is why I'm suggesting that the base address be an optional
> request.  DPDK *will* care about the size of the range, so it just
> requests that and gets told a base address.

We've talked about a size of IOVA address space before, strictly as a
hint, to possibly optimize page table layout, or something, and I'm
fine with that idea. But - we have no driver implementation today, so
I'm not sure what we can really do with this right now..

Kevin, could Intel consume a hint on IOVA space and optimize the number
of IO page table levels?

> > and IMHO, qemu
> > is fine to have a PPC specific userspace driver to tweak this PPC
> > unique thing if the default windows are not acceptable.
> 
> Not really, because it's the ppc *target* (guest) side which requires
> the specific properties, but selecting the "advanced" interface
> requires special knowledge on the *host* side.

The ppc specific driver would be on the generic side of qemu in its
viommu support framework. There is lots of host driver optimization
possible here with knowledge of the underlying host iommu HW. It
should not be connected to the qemu target.

It is not so different from today where qemu has to know about ppc's
special vfio interface generically even to emulate x86.

> > IMHO it is no different from imagining an Intel specific userspace
> > driver that is using userspace IO pagetables to optimize
> > cross-platform qemu vIOMMU emulation.
> 
> I'm not quite sure what you have in mind here.  How is it both Intel
> specific and cross-platform?

It is part of the generic qemu iommu interface layer. For nesting qemu
would copy the guest page table format to the host page table format
in userspace and trigger invalidation - no pinning, no kernel
map/unmap calls. It can only be done with detailed knowledge of the
host iommu since the host iommu io page table format is exposed
directly to userspace.

> Note however, that having multiple apertures isn't really ppc specific.
> Anything with an IO hole effectively has separate apertures above and
> below the hole.  They're much closer together in address than POWER's
> two apertures, but I don't see that makes any fundamental difference
> to the handling of them.

In the iommu core it handled the io holes and things through the group
reserved IOVA list - there isn't actually a limit in the iommu_domain,
it has a flat pagetable format - and in cases like PASID/SVA the group
reserved list doesn't apply at all.

> Another approach would be to give the required apertures / pagesizes
> in the initial creation of the domain/IOAS.  In that case they would
> be static for the IOAS, as well as the underlying iommu_domains: any
> ATTACH which would be incompatible would fail.

This is the device-specific iommu_domain creation path. The domain can
have information defining its aperture.

> That makes life hard for the DPDK case, though.  Obviously we can
> still make the base address optional, but for it to be static the
> kernel would have to pick it immediately, before we know what devices
> will be attached, which will be a problem on any system where there
> are multiple IOMMUs with different constraints.

Which is why the current scheme is fully automatic and we rely on the
iommu driver to automatically select something sane for DPDK/etc
today.

> > In general I have no issue with limiting the IOVA allocator in the
> > kernel, I just don't have a use case of an application that could use
> > the IOVA allocator (like DPDK) and also needs a limitation..
> 
> Well, I imagine DPDK has at least the minimal limitation that it needs
> the aperture to be a certain minimum size (I'm guessing at least the
> size of its pinned hugepage working memory region).  That's a
> limitation that's unlikely to fail on modern hardware, but it's there.

Yes, DPDK does assume there is some fairly large available aperture,
that should be the driver default behavior, IMHO.

> > That breaks what I'm
> > trying to do to make DPDK/etc portable and dead simple.
> 
> It doesn't break portability at all.  As for simplicity, yes it adds
> an extra required step, but the payoff is that it's now impossible to
> subtly screw up by failing to recheck your apertures after an ATTACH.
> That is, it's taking a step which was implicitly required and
> replacing it with one that's explicitly required.

Again, as above, it breaks portability because apps have no hope to
know what window range to ask for to succeed. It cannot just be a hard
coded range.

Jason
David Gibson May 10, 2022, 7:12 a.m. UTC | #25
On Mon, May 09, 2022 at 11:00:41AM -0300, Jason Gunthorpe wrote:
> On Mon, May 09, 2022 at 04:01:52PM +1000, David Gibson wrote:
> 
> > > The default iommu_domain that the iommu driver creates will be used
> > > here, it is up to the iommu driver to choose something reasonable for
> > > use by applications like DPDK. ie PPC should probably pick its biggest
> > > x86-like aperture.
> > 
> > So, using the big aperture means a very high base IOVA
> > (1<<59)... which means that it won't work at all if you want to attach
> > any devices that aren't capable of 64-bit DMA.
> 
> I'd expect to include the 32 bit window too..

I'm not entirely sure what you mean.  Are you working on the
assumption that we've extended to allowing multiple apertures, so we'd
default to advertising both a small/low aperture and a large/high
aperture?

> > Using the maximum possible window size would mean we either
> > potentially waste a lot of kernel memory on pagetables, or we use
> > unnecessarily large number of levels to the pagetable.
> 
> All drivers have this issue to one degree or another. We seem to be
> ignoring it - in any case this is a micro optimization, not a
> functional need?

Ok, fair point.

> > More generally, the problem with the interface advertising limitations
> > and it being up to userspace to work out if those are ok or not is
> > that it's fragile.  It's pretty plausible that some future IOMMU model
> > will have some new kind of limitation that can't be expressed in the
> > query structure we invented now.
> 
> The basic API is very simple - the driver needs to provide ranges of
> IOVA and map/unmap - I don't think we have a future problem here we
> need to try and guess and solve today.

Well.. maybe.  My experience of encountering hardware doing weird-arse
stuff makes me less sanguine.

> Even PPC fits this just fine, the open question for DPDK is more
> around optimization, not functional.
> 
> > But if userspace requests the capabilities it wants, and the kernel
> > acks or nacks that, we can support the new host IOMMU with existing
> > software just fine.
> 
> No, this just makes it fragile in the other direction because now
> userspace has to know what platform specific things to ask for *or it
> doesn't work at all*. This is not an improvement for the DPDK cases.

Um.. no.  The idea is that userspace requests *what it needs*, not
anything platform specific.  In the case of DPDK that would be nothing
more than the (minimum) aperture size.  Nothing platform specific
about that.

> Kernel decides, using all the kernel knowledge it has and tells the
> application what it can do - this is the basic simplified interface.
> 
> > > The iommu-driver-specific struct is the "advanced" interface and
> > > allows a user-space IOMMU driver to tightly control the HW with full
> > > HW specific knowledge. This is where all the weird stuff that is not
> > > general should go.
> > 
> > Right, but forcing anything more complicated than "give me some IOVA
> > region" to go through the advanced interface means that qemu (or any
> > hypervisor where the guest platform need not identically match the
> > host) has to have n^2 complexity to match each guest IOMMU model to
> > each host IOMMU model.
> 
> I wouldn't say n^2, but yes, qemu needs to have a userspace driver for
> the platform IOMMU, and yes it needs this to reach optimal
> behavior. We already know this is a hard requirement for using nesting
> as acceleration, I don't see why apertures are so different.

For one thing, because we only care about optimal behaviour on the
host ~= guest KVM case.  That means it's not n^2, just (roughly) one
host driver for each matching guest driver.  I'm considering the
general X on Y case - we don't need to optimize it, but it would be
nice for it to work without considering every combination separately.

> > Errr.. how do you figure?  On ppc the ranges and pagesizes are
> > definitely negotiable.  I'm not really familiar with other models, but
> > anything which allows *any* variations in the pagetable structure will
> > effectively have at least some negotiable properties.
> 
> As above, if you ask for the wrong thing then you don't get
> anything. If DPDK asks for something that works on ARM like 0 -> 4G
> then PPC and x86 will always fail. How is this improving anything to
> require applications to carefully ask for exactly the right platform
> specific ranges?

Hm, looks like I didn't sufficiently emphasize that the base address
would be optional for userspace to supply.  So userspace would request
a range *size* only, unless it needs a specific IOVA base address.  It
only requests the latter if it actually needs it - so failing in that
case is correct.  (Qemu, with or without a vIOMMU, is the obvious case
for that, though I could also imagine it for a specialized driver for
some broken device which has weird limitations on what IOVA addresses
it can generate on the bus).

> It isn't like there is some hard coded value we can put into DPDK that
> will work on every platform. So kernel must pick for DPDK, IMHO. I
> don't see any feasible alternative.

Yes, hence *optionally specified* base address only.

> > Which is why I'm suggesting that the base address be an optional
> > request.  DPDK *will* care about the size of the range, so it just
> > requests that and gets told a base address.
> 
> We've talked about a size of IOVA address space before, strictly as a
> hint, to possibly optimize page table layout, or something, and I'm
> fine with that idea. But - we have no driver implementation today, so
> I'm not sure what we can really do with this right now..

You can check that the hardware aperture is at least as large as the
requested range.  Even on the pretty general x86 IOMMU, if the program
wants 2^62 bytes of aperture, I'm pretty sure you won't be able to
supply it.

> Kevin, could Intel consume a hint on IOVA space and optimize the number
> of IO page table levels?
> 
> > > and IMHO, qemu
> > > is fine to have a PPC specific userspace driver to tweak this PPC
> > > unique thing if the default windows are not acceptable.
> > 
> > Not really, because it's the ppc *target* (guest) side which requires
> > the specific properties, but selecting the "advanced" interface
> > requires special knowledge on the *host* side.
> 
> The ppc specific driver would be on the generic side of qemu in its
> viommu support framework. There is lots of host driver optimization
> possible here with knowledge of the underlying host iommu HW. It
> should not be connected to the qemu target.

Thinking through this...

So, I guess we could have basically the same logic I'm suggesting be
in the qemu backend iommu driver instead.  So the target side (machine
type, strictly speaking) would request of the host side the apertures
it needs, and the host side driver would see if it can do that, based
on both specific knowledge of that driver and the query responses.

ppc on x86 should work with that.. at least if the x86 aperture is
large enough to reach up to ppc's high window.  I guess we'd have the
option here of using either the generic host driver or the
x86-specific driver.  The latter would mean qemu maintaining an
x86-format shadow of the io pagetables; mildly tedious, but doable.

x86-without-viommu on ppc could work in at least some cases if the ppc
host driver requests a low window large enough to cover guest memory.
Memory hotplug we can handle by creating a new IOAS using the ppc
specific interface with a new window covering the hotplugged region.

x86-with-viommu on ppc probably can't be done, since I don't think the
ppc low window can be extended far enough to allow for the guest's
expected IOVA range.. but there's probably no way around that.  Unless
there's some way of configuring / advertising a "crippled" version of
the x86 IOMMU with a more limited IOVA range.

Device hotplug is the remaining curly case.  We're set up happily,
then we hotplug a device.  The backend aperture shrinks, the host-side
qemu driver notices this and notices it no longer covers the ranges that
the target-side expects.  So... is there any way of backing out of
this gracefully.  We could detach the device, but in the meantime
ongoing DMA maps from previous devices might have failed.  We could
pre-attach the new device to a new IOAS and check the apertures there
- but when we move it to the final IOAS is it guaranteed that the
apertures will be (at least) the intersection of the old and new
apertures, or is that just the probable outcome.  Or I guess we could
add a pre-attach-query operation of some sort in the kernel to check
what the effect will be before doing the attach for real.

Ok.. you convinced me.  As long as we have some way to handle the
device hotplug case, we can work with this.

> It is not so different from today where qemu has to know about ppc's
> special vfio interface generically even to emulate x86.

Well, yes, and that's a horrible aspect of the current vfio interface.
It arose because of (partly) *my* mistake, for not realizing at the
time that we could reasonably extend the type1 interface to work for
ppc as well.  I'm hoping iommufd doesn't repeat my mistake.

> > > IMHO it is no different from imagining an Intel specific userspace
> > > driver that is using userspace IO pagetables to optimize
> > > cross-platform qemu vIOMMU emulation.
> > 
> > I'm not quite sure what you have in mind here.  How is it both Intel
> > specific and cross-platform?
> 
> It is part of the generic qemu iommu interface layer. For nesting qemu
> would copy the guest page table format to the host page table format
> in userspace and trigger invalidation - no pinning, no kernel
> map/unmap calls. It can only be done with detailed knowledge of the
> host iommu since the host iommu io page table format is exposed
> directly to userspace.

Ok, I see.  That can certainly be done.  I was really hoping we could
have a working, though non-optimized, implementation using just the
generic interface.

> > Note however, that having multiple apertures isn't really ppc specific.
> > Anything with an IO hole effectively has separate apertures above and
> > below the hole.  They're much closer together in address than POWER's
> > two apertures, but I don't see that makes any fundamental difference
> > to the handling of them.
> 
> In the iommu core it handled the io holes and things through the group
> reserved IOVA list - there isn't actually a limit in the iommu_domain,
> it has a flat pagetable format - and in cases like PASID/SVA the group
> reserved list doesn't apply at all.

Sure, but how it's implemented internally doesn't change the user
visible fact: some IOVAs can't be used, some can.  Userspace needs to
know which is which in order to operate correctly, and the only
reasonable way for it to get them is to be told by the kernel.  We
should advertise that in a common way, not have different ways for
"holes" versus "windows".  We can choose either one; I think "windows"
rather than "holes" makes more sense, but it doesn't really matter.
Whichever one we choose, we need more than one of them for both ppc
and x86:

    - x86 has a "window" from 0 to the bottom IO hole, and a window
      from the top of the IO hole to the maximum address describable
      in the IO page table.
    - x86 has a hole for the IO hole (duh), and another hole from the
      maximum IO pagetable address to 2^64-1 (you could call it the
      "out of bounds hole", I guess)

    - ppc has a "window" at 0 to a configurable maximum, and another
      "window" from 2^59 to a configurable maximum
    - ppc has a hole between the two windows, and another from the end
      of the high window to 2^64-1

So either representation, either arch, it's 2 windows, 2 holes.  There
may be other cases that only need 1 of each (SVA, ancient ppc without
the high window, probably others).  Point is there are common cases
that require more than 1.
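
Written down as data they have exactly the same shape; the values below
are placeholders only, since the IO-hole bounds and the ppc maxima vary
by machine:

#include <stdint.h>

struct iova_window { uint64_t start, last; };	/* inclusive */

/* x86-ish: below the IO hole, then above it up to the largest address
 * the IO page table can describe (48 bits in this example) */
static const struct iova_window x86_windows[] = {
	{ 0,             0xfedfffffULL },
	{ 0xff000000ULL, (1ULL << 48) - 1 },
};

/* ppc-ish: a low window up to a configurable maximum, plus the high
 * window starting at 1 << 59, also with a configurable maximum */
static const struct iova_window ppc_windows[] = {
	{ 0,           (2ULL << 30) - 1 },
	{ 1ULL << 59,  (1ULL << 59) + (64ULL << 30) - 1 },
};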

> > Another approach would be to give the required apertures / pagesizes
> > in the initial creation of the domain/IOAS.  In that case they would
> > be static for the IOAS, as well as the underlying iommu_domains: any
> > ATTACH which would be incompatible would fail.
> 
> This is the device-specific iommu_domain creation path. The domain can
> have information defining its aperture.

But that also requires managing the pagetables yourself; I think tying
these two concepts together is inflexible.

> > That makes life hard for the DPDK case, though.  Obviously we can
> > still make the base address optional, but for it to be static the
> > kernel would have to pick it immediately, before we know what devices
> > will be attached, which will be a problem on any system where there
> > are multiple IOMMUs with different constraints.
> 
> Which is why the current scheme is fully automatic and we rely on the
> iommu driver to automatically select something sane for DPDK/etc
> today.

But the cost is that the allowable addresses can change implicitly
with every ATTACH.

> > > In general I have no issue with limiting the IOVA allocator in the
> > > kernel, I just don't have a use case of an application that could use
> > > the IOVA allocator (like DPDK) and also needs a limitation..
> > 
> > Well, I imagine DPDK has at least the minimal limitation that it needs
> > the aperture to be a certain minimum size (I'm guessing at least the
> > size of its pinned hugepage working memory region).  That's a
> > limitation that's unlikely to fail on modern hardware, but it's there.
> 
> Yes, DPDK does assume there is some fairly large available aperture,
> that should be the driver default behavior, IMHO.

Well, sure, but "fairly large" tends to change meaning over time.  The
idea is to ensure that the app's idea of "fairly large" matches the
kernel's idea of "fairly large".

> > > That breaks what I'm
> > > trying to do to make DPDK/etc portable and dead simple.
> > 
> > It doesn't break portability at all.  As for simplicity, yes it adds
> > an extra required step, but the payoff is that it's now impossible to
> > subtly screw up by failing to recheck your apertures after an ATTACH.
> > That is, it's taking a step which was implicitly required and
> > replacing it with one that's explicitly required.
> 
> Again, as above, it breaks portability because apps have no hope to
> know what window range to ask for to succeed. It cannot just be a hard
> coded range.

I see the problem if you have an app where there's a difference
between the smallest window it can cope with versus the largest window
it can take advantage of.  Not sure if that's likely in practice.
AFAIK, DPDK will always require its hugepage memory pool mapped, can't
deal with less, can't benefit from more.  But maybe there's some use
case for this I haven't thought of.


Ok... here's a revised version of my proposal which I think addresses
your concerns and simplifies things.

- No new operations, but IOAS_MAP gets some new flags (and IOAS_COPY
  will probably need matching changes)

- By default the IOVA given to IOAS_MAP is a hint only, and the IOVA
  is chosen by the kernel within the aperture(s).  This is closer to
  how mmap() operates, and DPDK and similar shouldn't care about
  having specific IOVAs, even at the individual mapping level.

- IOAS_MAP gets an IOMAP_FIXED flag, analogous to mmap()'s MAP_FIXED,
  for when you really do want to control the IOVA (qemu, maybe some
  special userspace driver cases)

- ATTACH will fail if the new device would shrink the aperture to
  exclude any already established mappings (I assume this is already
  the case)

- IOAS_MAP gets an IOMAP_RESERVE flag, which operates a bit like a
  PROT_NONE mmap().  It reserves that IOVA space, so other (non-FIXED)
  MAPs won't use it, but doesn't actually put anything into the IO
  pagetables.
    - Like a regular mapping, ATTACHes that are incompatible with an
      IOMAP_RESERVEed region will fail
    - An IOMAP_RESERVEed area can be overmapped with an IOMAP_FIXED
      mapping

So, for DPDK the sequence would be:

1. Create IOAS
2. ATTACH devices
3. IOAS_MAP some stuff
4. Do DMA with the IOVAs that IOAS_MAP returned

(Note, not even any need for QUERY in simple cases)

For (unoptimized) qemu it would be:

1. Create IOAS
2. IOAS_MAP(IOMAP_FIXED|IOMAP_RESERVE) the valid IOVA regions of the
   guest platform
3. ATTACH devices (this will fail if they're not compatible with the
   reserved IOVA regions)
4. Boot the guest

  (on guest map/invalidate) -> IOAS_MAP(IOMAP_FIXED) to overmap part of
                               the reserved regions
  (on dev hotplug) -> ATTACH (which might fail, if it conflicts with the
                      reserved regions)
  (on vIOMMU reconfiguration) -> UNMAP/MAP reserved regions as
                                 necessary (which might fail)
Jason Gunthorpe May 10, 2022, 7 p.m. UTC | #26
On Tue, May 10, 2022 at 05:12:04PM +1000, David Gibson wrote:
> On Mon, May 09, 2022 at 11:00:41AM -0300, Jason Gunthorpe wrote:
> > On Mon, May 09, 2022 at 04:01:52PM +1000, David Gibson wrote:
> > 
> > > > The default iommu_domain that the iommu driver creates will be used
> > > > here, it is up to the iommu driver to choose something reasonable for
> > > > use by applications like DPDK. ie PPC should probably pick its biggest
> > > > x86-like aperture.
> > > 
> > > So, using the big aperture means a very high base IOVA
> > > (1<<59)... which means that it won't work at all if you want to attach
> > > any devices that aren't capable of 64-bit DMA.
> > 
> > I'd expect to include the 32 bit window too..
> 
> I'm not entirely sure what you mean.  Are you working on the
> assumption that we've extended to allowing multiple apertures, so we'd
> default to advertising both a small/low aperture and a large/high
> aperture?

Yes

> > No, this just makes it fragile in the other direction because now
> > userspace has to know what platform specific things to ask for *or it
> > doesn't work at all*. This is not an improvement for the DPDK cases.
> 
> Um.. no.  The idea is that userspace requests *what it needs*, not
> anything platform specific.  In the case of DPDK that would be nothing
> more than the (minimum) aperture size.  Nothing platform specific
> about that.

Except a 32 bit platform can only maybe do a < 4G aperture, a 64 bit
platform can do more, but it varies how much more, etc.

There is no constant value DPDK could stuff in this request, unless it
needs a really small amount of IOVA, like 1G or something.

> > It isn't like there is some hard coded value we can put into DPDK that
> > will work on every platform. So kernel must pick for DPDK, IMHO. I
> > don't see any feasible alternative.
> 
> Yes, hence *optionally specified* base address only.

Okay, so imagine we've already done this and DPDK is not optionally
specifying anything :)

The structs can be extended so we can add this as an input to creation
when a driver can implement it.

> > The ppc specific driver would be on the generic side of qemu in its
> > viommu support framework. There is lots of host driver optimization
> > possible here with knowledge of the underlying host iommu HW. It
> > should not be connected to the qemu target.
> 
> Thinking through this...
> 
> So, I guess we could have basically the same logic I'm suggesting be
> in the qemu backend iommu driver instead.  So the target side (machine
> type, strictly speaking) would request of the host side the apertures
> it needs, and the host side driver would see if it can do that, based
> on both specific knowledge of that driver and the query responses.

Yes, this is what I'm thinking

> ppc on x86 should work with that.. at least if the x86 aperture is
> large enough to reach up to ppc's high window.  I guess we'd have the
> option here of using either the generic host driver or the
> x86-specific driver.  The latter would mean qemu maintaining an
> x86-format shadow of the io pagetables; mildly tedious, but doable.

The appeal of having userspace page tables is performance, so it is
tedious to shadow, but it should run faster.

> So... is there any way of backing out of this gracefully.  We could
> detach the device, but in the meantime ongoing DMA maps from
> previous devices might have failed.  

This sounds like a good use case for qemu to communicate ranges - but
as I mentioned before Alex said qemu didn't know the ranges..

> We could pre-attach the new device to a new IOAS and check the
> apertures there - but when we move it to the final IOAS is it
> guaranteed that the apertures will be (at least) the intersection of
> the old and new apertures, or is that just the probable outcome. 

Should be guaranteed

> Ok.. you convinced me.  As long as we have some way to handle the
> device hotplug case, we can work with this.

I like the 'communicate ranges' idea for hotplug, so long as we can
actually implement it in qemu - I'm a bit unclear on that, honestly.

> Ok, I see.  That can certainly be done.  I was really hoping we could
> have a working, though non-optimized, implementation using just the
> generic interface.

Oh, sure that should largely work as well too, this is just an
additional direction people may find interesting and helps explain why
qemu should have an iommu layer inside.

> "holes" versus "windows".  We can choose either one; I think "windows"
> rather than "holes" makes more sense, but it doesn't really matter.

Yes, I picked windows aka ranges for the uAPI - we translate the holes
from the groups into windows and intersect them with the apertures.
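
As a self-contained sketch of that translation (not the kernel code,
just the shape of it: an aperture plus a sorted list of reserved holes
becomes the list of allowed windows; the example hole is the usual x86
MSI range):

#include <stdint.h>
#include <stdio.h>

struct range { uint64_t start, last; };		/* inclusive */

static size_t holes_to_windows(struct range aperture,
			       const struct range *holes, size_t nholes,
			       struct range *windows)
{
	uint64_t cur = aperture.start;
	size_t n = 0;

	for (size_t i = 0; i < nholes; i++) {
		if (holes[i].last < cur || holes[i].start > aperture.last)
			continue;	/* hole outside what is left */
		if (holes[i].start > cur)
			windows[n++] = (struct range){ cur, holes[i].start - 1 };
		if (holes[i].last >= aperture.last)
			return n;
		cur = holes[i].last + 1;
	}
	windows[n++] = (struct range){ cur, aperture.last };
	return n;
}

int main(void)
{
	struct range aperture = { 0, (1ULL << 48) - 1 };
	struct range hole = { 0xfee00000, 0xfeefffff };	/* x86 MSI window */
	struct range win[2];
	size_t n = holes_to_windows(aperture, &hole, 1, win);

	for (size_t i = 0; i < n; i++)
		printf("window %zu: 0x%llx - 0x%llx\n", i,
		       (unsigned long long)win[i].start,
		       (unsigned long long)win[i].last);
	return 0;
}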

> > > Another approach would be to give the required apertures / pagesizes
> > > in the initial creation of the domain/IOAS.  In that case they would
> > > be static for the IOAS, as well as the underlying iommu_domains: any
> > > ATTACH which would be incompatible would fail.
> > 
> > This is the device-specific iommu_domain creation path. The domain can
> > have information defining its aperture.
> 
> But that also requires managing the pagetables yourself; I think tying
> these two concepts together is inflexible.

Oh, no, those need to be independent for HW nesting already

One device-specific creation path will create the kernel owned
map/unmap iommu_domain with device-specific parameters to allow it to
be the root of a nest - ie specify S2 on ARM.

The second device-specific creation path will create the user owned
iommu_domain with device-specific parameters, with the first as a
parent.

So you get to do both.

> > Which is why the current scheme is fully automatic and we rely on the
> > iommu driver to automatically select something sane for DPDK/etc
> > today.
> 
> But the cost is that the allowable addresses can change implicitly
> with every ATTACH.

Yes, dpdk/etc don't care.
 
> I see the problem if you have an app where there's a difference
> between the smallest window it can cope with versus the largest window
> it can take advantage of.  Not sure if that's likely in practice.
> AFAIK, DPDK will always require its hugepage memory pool mapped, can't
> deal with less, can't benefit from more.  But maybe there's some use
> case for this I haven't thought of.

Other apps I've seen don't even have a fixed memory pool, they just
malloc and can't really predict how much IOVA they
need. "approximately the same amount as a process VA" is a reasonable
goal for the kernel to default to.

A tunable to allow efficiency from smaller allocations sounds great -
but let's have driver support before adding the uAPI for
it. Intel/AMD/ARM support to have fewer page table levels for instance
would be perfect.
 
> Ok... here's a revised version of my proposal which I think addresses
> your concerns and simplifies things.
> 
> - No new operations, but IOAS_MAP gets some new flags (and IOAS_COPY
>   will probably need matching changes)
> 
> - By default the IOVA given to IOAS_MAP is a hint only, and the IOVA
>   is chosen by the kernel within the aperture(s).  This is closer to
>   how mmap() operates, and DPDK and similar shouldn't care about
>   having specific IOVAs, even at the individual mapping level.
>
> - IOAS_MAP gets an IOMAP_FIXED flag, analogous to mmap()'s MAP_FIXED,
>   for when you really do want to control the IOVA (qemu, maybe some
>   special userspace driver cases)

We already did both of these, the flag is called
IOMMU_IOAS_MAP_FIXED_IOVA - if it is not specified then kernel will
select the IOVA internally.

> - ATTACH will fail if the new device would shrink the aperture to
>   exclude any already established mappings (I assume this is already
>   the case)

Yes

> - IOAS_MAP gets an IOMAP_RESERVE flag, which operates a bit like a
>   PROT_NONE mmap().  It reserves that IOVA space, so other (non-FIXED)
>   MAPs won't use it, but doesn't actually put anything into the IO
>   pagetables.
>     - Like a regular mapping, ATTACHes that are incompatible with an
>       IOMAP_RESERVEed region will fail
>     - An IOMAP_RESERVEed area can be overmapped with an IOMAP_FIXED
>       mapping

Yeah, this seems OK, I'm thinking a new API might make sense because
you don't really want mmap replacement semantics but a permanent
record of what IOVA must always be valid.

IOMMU_IOA_REQUIRE_IOVA perhaps, similar signature to
IOMMUFD_CMD_IOAS_IOVA_RANGES:

struct iommu_ioas_require_iova {
        __u32 size;
        __u32 ioas_id;
        __u32 num_iovas;
        __u32 __reserved;
        struct iommu_required_iovas {
                __aligned_u64 start;
                __aligned_u64 last;
        } required_iovas[];
};
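
Userspace usage would look something like the below - all hypothetical,
the ioctl doesn't exist yet, the request number is a placeholder and the
single-range wrapper struct is only for the example:

#include <sys/ioctl.h>
#include <linux/types.h>

#define IOMMU_IOA_REQUIRE_IOVA 0	/* placeholder, not a real ioctl number */

struct iommu_ioas_require_iova_one {
	__u32 size;
	__u32 ioas_id;
	__u32 num_iovas;
	__u32 __reserved;
	struct {
		__aligned_u64 start;
		__aligned_u64 last;
	} required_iovas[1];
};

/* Pin down the guest's valid IOVA windows before attaching any device */
static int require_low_window(int iommufd, __u32 ioas_id)
{
	struct iommu_ioas_require_iova_one cmd = {
		.size = sizeof(cmd),
		.ioas_id = ioas_id,
		.num_iovas = 1,
		.required_iovas = { { .start = 0, .last = 0x7fffffff } },
	};

	return ioctl(iommufd, IOMMU_IOA_REQUIRE_IOVA, &cmd);
}

The point of making it a permanent record is that any later ATTACH whose
aperture cannot cover the required ranges fails instead of silently
shrinking them.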

> So, for DPDK the sequence would be:
> 
> 1. Create IOAS
> 2. ATTACH devices
> 3. IOAS_MAP some stuff
> 4. Do DMA with the IOVAs that IOAS_MAP returned
> 
> (Note, not even any need for QUERY in simple cases)

Yes, this is done already
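
Roughly, as a userspace sketch - assuming ioctl and field names along
the lines of IOMMU_IOAS_ALLOC / IOMMU_IOAS_MAP from this series (the
details may not match the header exactly), with the attach step elided
since it goes through the VFIO side:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>

int main(void)
{
	static char buf[2 * 1024 * 1024];	/* stands in for the hugepage pool */
	int fd = open("/dev/iommufd", O_RDWR);
	if (fd < 0)
		return 1;

	/* 1. Create IOAS */
	struct iommu_ioas_alloc alloc = { .size = sizeof(alloc) };
	if (ioctl(fd, IOMMU_IOAS_ALLOC, &alloc))
		return 1;

	/* 2. ATTACH devices to alloc.out_ioas_id through their VFIO FDs (elided) */

	/* 3. IOAS_MAP some stuff; no FIXED_IOVA flag, so the kernel picks
	 * an IOVA inside whatever aperture the attached devices allow */
	struct iommu_ioas_map map = {
		.size = sizeof(map),
		.flags = IOMMU_IOAS_MAP_READABLE | IOMMU_IOAS_MAP_WRITEABLE,
		.ioas_id = alloc.out_ioas_id,
		.user_va = (uintptr_t)buf,
		.length = sizeof(buf),
	};
	if (ioctl(fd, IOMMU_IOAS_MAP, &map))
		return 1;

	/* 4. Do DMA with the IOVA the kernel chose */
	printf("mapped at IOVA 0x%llx\n", (unsigned long long)map.iova);
	return 0;
}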

> For (unoptimized) qemu it would be:
> 
> 1. Create IOAS
> 2. IOAS_MAP(IOMAP_FIXED|IOMAP_RESERVE) the valid IOVA regions of the
>    guest platform
> 3. ATTACH devices (this will fail if they're not compatible with the
>    reserved IOVA regions)
> 4. Boot the guest
> 
>   (on guest map/invalidate) -> IOAS_MAP(IOMAP_FIXED) to overmap part of
>                                the reserved regions
>   (on dev hotplug) -> ATTACH (which might fail, if it conflicts with the
>                       reserved regions)
>   (on vIOMMU reconfiguration) -> UNMAP/MAP reserved regions as
>                                  necessary (which might fail)

OK, I will take care of it

Thanks,
Jason
Tian, Kevin May 11, 2022, 2:46 a.m. UTC | #27
> From: Jason Gunthorpe
> Sent: Monday, May 9, 2022 10:01 PM
> 
> On Mon, May 09, 2022 at 04:01:52PM +1000, David Gibson wrote:
> 
> > Which is why I'm suggesting that the base address be an optional
> > request.  DPDK *will* care about the size of the range, so it just
> > requests that and gets told a base address.
> 
> We've talked about a size of IOVA address space before, strictly as a
> hint, to possibly optimize page table layout, or something, and I'm
> fine with that idea. But - we have no driver implementation today, so
> I'm not sure what we can really do with this right now..
> 
> Kevin, could Intel consume a hint on IOVA space and optimize the number
> of IO page table levels?
> 

It could, but not implemented now.
Tian, Kevin May 11, 2022, 3:15 a.m. UTC | #28
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, May 11, 2022 3:00 AM
> 
> On Tue, May 10, 2022 at 05:12:04PM +1000, David Gibson wrote:
> > Ok... here's a revised version of my proposal which I think addresses
> > your concerns and simplifies things.
> >
> > - No new operations, but IOAS_MAP gets some new flags (and IOAS_COPY
> >   will probably need matching changes)
> >
> > - By default the IOVA given to IOAS_MAP is a hint only, and the IOVA
> >   is chosen by the kernel within the aperture(s).  This is closer to
> >   how mmap() operates, and DPDK and similar shouldn't care about
> >   having specific IOVAs, even at the individual mapping level.
> >
> > - IOAS_MAP gets an IOMAP_FIXED flag, analogous to mmap()'s MAP_FIXED,
> >   for when you really do want to control the IOVA (qemu, maybe some
> >   special userspace driver cases)
> 
> We already did both of these, the flag is called
> IOMMU_IOAS_MAP_FIXED_IOVA - if it is not specified then kernel will
> select the IOVA internally.
> 
> > - ATTACH will fail if the new device would shrink the aperture to
> >   exclude any already established mappings (I assume this is already
> >   the case)
> 
> Yes
> 
> > - IOAS_MAP gets an IOMAP_RESERVE flag, which operates a bit like a
> >   PROT_NONE mmap().  It reserves that IOVA space, so other (non-FIXED)
> >   MAPs won't use it, but doesn't actually put anything into the IO
> >   pagetables.
> >     - Like a regular mapping, ATTACHes that are incompatible with an
> >       IOMAP_RESERVEed region will fail
> >     - An IOMAP_RESERVEed area can be overmapped with an IOMAP_FIXED
> >       mapping
> 
> Yeah, this seems OK, I'm thinking a new API might make sense because
> you don't really want mmap replacement semantics but a permanent
> record of what IOVA must always be valid.
> 
> IOMMU_IOA_REQUIRE_IOVA perhaps, similar signature to
> IOMMUFD_CMD_IOAS_IOVA_RANGES:
> 
> struct iommu_ioas_require_iova {
>         __u32 size;
>         __u32 ioas_id;
>         __u32 num_iovas;
>         __u32 __reserved;
>         struct iommu_required_iovas {
>                 __aligned_u64 start;
>                 __aligned_u64 last;
>         } required_iovas[];
> };

As a permanent record, do we want to enforce that once the required
range list is set, all FIXED and non-FIXED allocations must be within the
list of ranges?

If yes, we can take the end of the last range as the max size of the IOVA
address space to optimize the page table layout.

Otherwise we may need another dedicated hint for that optimization.

> 
> > So, for DPDK the sequence would be:
> >
> > 1. Create IOAS
> > 2. ATTACH devices
> > 3. IOAS_MAP some stuff
> > 4. Do DMA with the IOVAs that IOAS_MAP returned
> >
> > (Note, not even any need for QUERY in simple cases)
> 
> Yes, this is done already
> 
> > For (unoptimized) qemu it would be:
> >
> > 1. Create IOAS
> > 2. IOAS_MAP(IOMAP_FIXED|IOMAP_RESERVE) the valid IOVA regions of
> the
> >    guest platform
> > 3. ATTACH devices (this will fail if they're not compatible with the
> >    reserved IOVA regions)
> > 4. Boot the guest

I suppose the above is only the sample flow for a PPC vIOMMU. For non-PPC
vIOMMUs regular mappings are required before booting the guest, and
reservation might be done but is not mandatory (at least not something the
current Qemu vfio code can provide, as it simply replays the valid ranges
of the CPU address space).

> >
> >   (on guest map/invalidate) -> IOAS_MAP(IOMAP_FIXED) to overmap part
> of
> >                                the reserved regions
> >   (on dev hotplug) -> ATTACH (which might fail, if it conflicts with the
> >                       reserved regions)
> >   (on vIOMMU reconfiguration) -> UNMAP/MAP reserved regions as
> >                                  necessary (which might fail)
> 
> OK, I will take care of it
> 
> Thanks,
> Jason
David Gibson May 11, 2022, 4:40 a.m. UTC | #29
On Tue, May 10, 2022 at 04:00:09PM -0300, Jason Gunthorpe wrote:
> On Tue, May 10, 2022 at 05:12:04PM +1000, David Gibson wrote:
> > On Mon, May 09, 2022 at 11:00:41AM -0300, Jason Gunthorpe wrote:
> > > On Mon, May 09, 2022 at 04:01:52PM +1000, David Gibson wrote:
> > > 
> > > > > The default iommu_domain that the iommu driver creates will be used
> > > > > here, it is up to the iommu driver to choose something reasonable for
> > > > > use by applications like DPDK. ie PPC should probably pick its biggest
> > > > > x86-like aperture.
> > > > 
> > > > So, using the big aperture means a very high base IOVA
> > > > (1<<59)... which means that it won't work at all if you want to attach
> > > > any devices that aren't capable of 64-bit DMA.
> > > 
> > > I'd expect to include the 32 bit window too..
> > 
> > I'm not entirely sure what you mean.  Are you working on the
> > assumption that we've extended to allowing multiple apertures, so we'd
> > default to advertising both a small/low aperture and a large/high
> > aperture?
> 
> Yes

Ok, that works assuming we can advertise multiple windows.

> > > No, this just makes it fragile in the other direction because now
> > > userspace has to know what platform specific things to ask for *or it
> > > doesn't work at all*. This is not a improvement for the DPDK cases.
> > 
> > Um.. no.  The idea is that userspace requests *what it needs*, not
> > anything platform specific.  In the case of DPDK that would be nothing
> > more than the (minimum) aperture size.  Nothing platform specific
> > about that.
> 
> Except a 32 bit platform can only maybe do a < 4G aperture, a 64 bit
> platform can do more, but it varies how much more, etc.
> 
> There is no constant value DPDK could stuff in this request, unless it
> needs a really small amount of IOVA, like 1G or something.

Well, my assumption was that DPDK always wanted an IOVA window to
cover its hugepage buffer space.  So not "constant" exactly, but a
value it will know at start up time.  But I think we cover that more
closely below.

> > > It isn't like there is some hard coded value we can put into DPDK that
> > > will work on every platform. So kernel must pick for DPDK, IMHO. I
> > > don't see any feasible alternative.
> > 
> > Yes, hence *optionally specified* base address only.
> 
> Okay, so imagine we've already done this and DPDK is not optionally
> specifying anything :)
> 
> The structs can be extended so we can add this as an input to creation
> when a driver can implement it.
> 
> > > The ppc specific driver would be on the generic side of qemu in its
> > > viommu support framework. There is lots of host driver optimization
> > > possible here with knowledge of the underlying host iommu HW. It
> > > should not be connected to the qemu target.
> > 
> > Thinking through this...
> > 
> > So, I guess we could have basically the same logic I'm suggesting be
> > in the qemu backend iommu driver instead.  So the target side (machine
> > type, strictly speaking) would request of the host side the apertures
> > it needs, and the host side driver would see if it can do that, based
> > on both specific knowledge of that driver and the query reponses.
> 
> Yes, this is what I'm thinking
> 
> > ppc on x86 should work with that.. at least if the x86 aperture is
> > large enough to reach up to ppc's high window.  I guess we'd have the
> > option here of using either the generic host driver or the
> > x86-specific driver.  The latter would mean qemu maintaining an
> > x86-format shadow of the io pagetables; mildly tedious, but doable.
> 
> The appeal of having userspace page tables is performance, so it is
> tedious to shadow, but it should run faster.

I doubt the difference is meaningful in the context of an emulated
guest, though.

> > So... is there any way of backing out of this gracefully.  We could
> > detach the device, but in the meantime ongoing DMA maps from
> > previous devices might have failed.  
> 
> This sounds like a good use case for qemu to communicate ranges - but
> as I mentioned before Alex said qemu didn't know the ranges..

Yeah, I'm a bit baffled by that, and I don't know the context.  Note
that there are at least two very different users of the host
IOMMU backends: one is for emulation of guest DMA (with or without
a vIOMMU).  In that case the details of the guest platform should let
qemu know the ranges.  There's also a VFIO based NVMe backend; that
one's much more like a "normal" userspace driver, where it doesn't
care about the address ranges (because they're not guest visible).

> > We could pre-attach the new device to a new IOAS and check the
> > apertures there - but when we move it to the final IOAS is it
> > guaranteed that the apertures will be (at least) the intersection of
> > the old and new apertures, or is that just the probable outcome. 
> 
> Should be guarenteed

Ok; that would need to be documented.

> > Ok.. you convinced me.  As long as we have some way to handle the
> > device hotplug case, we can work with this.
> 
> I like the communicate ranges for hotplug, so long as we can actually
> implement it in qemu - I'm a bit unclear on that honestly.
> 
> > Ok, I see.  That can certainly be done.  I was really hoping we could
> > have a working, though non-optimized, implementation using just the
> > generic interface.
> 
> Oh, sure that should largely work as well too, this is just an
> additional direction people may find interesting and helps explain why
> qemu should have an iommu layer inside.

> > "holes" versus "windows".  We can choose either one; I think "windows"
> > rather than "holes" makes more sense, but it doesn't really matter.
> 
> Yes, I picked windows aka ranges for the uAPI - we translate the holes
> from the groups into windows and intersect them with the apertures.

Ok.

> > > > Another approach would be to give the required apertures / pagesizes
> > > > in the initial creation of the domain/IOAS.  In that case they would
> > > > be static for the IOAS, as well as the underlying iommu_domains: any
> > > > ATTACH which would be incompatible would fail.
> > > 
> > > This is the device-specific iommu_domain creation path. The domain can
> > > have information defining its aperture.
> > 
> > But that also requires managing the pagetables yourself; I think tying
> > these two concepts together is inflexible.
> 
> Oh, no, those need to be independent for HW nesting already
> 
> One device-specific creation path will create the kernel owned
> map/unmap iommu_domain with device-specific parameters to allow it to
> be the root of a nest - ie specify S2 on ARM.
> 
> The second device-specific creation path will create the user owned
> iommu_domain with device-specific parameters, with the first as a
> parent.
> 
> So you get to do both.

Ah! Good to know.

> > > Which is why the current scheme is fully automatic and we rely on the
> > > iommu driver to automatically select something sane for DPDK/etc
> > > today.
> > 
> > But the cost is that the allowable addresses can change implicitly
> > with every ATTACH.
> 
> Yes, dpdk/etc don't care.

Well... as long as nothing they've already mapped goes away.

> > I see the problem if you have an app where there's a difference
> > between the smallest window it can cope with versus the largest window
> > it can take advantage of.  Not sure if that's likely in pratice.
> > AFAIK, DPDK will alway require it's hugepage memory pool mapped, can't
> > deal with less, can't benefit from more.  But maybe there's some use
> > case for this I haven't thought of.
> 
> Other apps I've seen don't even have a fixed memory pool, they just
> malloc and can't really predict how much IOVA they
> need. "approximately the same amount as a process VA" is a reasonable
> goal for the kernel to default too.

Hm, ok, I guess that makes sense.

> A tunable to allow efficiency from smaller allocations sounds great -
> but let's have driver support before adding the uAPI for
> it. Intel/AMD/ARM support to have fewer page table levels for instance
> would be perfect.
>  
> > Ok... here's a revised version of my proposal which I think addresses
> > your concerns and simplfies things.
> > 
> > - No new operations, but IOAS_MAP gets some new flags (and IOAS_COPY
> >   will probably need matching changes)
> > 
> > - By default the IOVA given to IOAS_MAP is a hint only, and the IOVA
> >   is chosen by the kernel within the aperture(s).  This is closer to
> >   how mmap() operates, and DPDK and similar shouldn't care about
> >   having specific IOVAs, even at the individual mapping level.
> >
> > - IOAS_MAP gets an IOMAP_FIXED flag, analagous to mmap()'s MAP_FIXED,
> >   for when you really do want to control the IOVA (qemu, maybe some
> >   special userspace driver cases)
> 
> We already did both of these, the flag is called
> IOMMU_IOAS_MAP_FIXED_IOVA - if it is not specified then kernel will
> select the IOVA internally.

Ok, great.

> > - ATTACH will fail if the new device would shrink the aperture to
> >   exclude any already established mappings (I assume this is already
> >   the case)
> 
> Yes

Good to know.

> > - IOAS_MAP gets an IOMAP_RESERVE flag, which operates a bit like a
> >   PROT_NONE mmap().  It reserves that IOVA space, so other (non-FIXED)
> >   MAPs won't use it, but doesn't actually put anything into the IO
> >   pagetables.
> >     - Like a regular mapping, ATTACHes that are incompatible with an
> >       IOMAP_RESERVEed region will fail
> >     - An IOMAP_RESERVEed area can be overmapped with an IOMAP_FIXED
> >       mapping
> 
> Yeah, this seems OK, I'm thinking a new API might make sense because
> you don't really want mmap replacement semantics but a permanent
> record of what IOVA must always be valid.
> 
> IOMMU_IOA_REQUIRE_IOVA perhaps, similar signature to
> IOMMUFD_CMD_IOAS_IOVA_RANGES:
> 
> struct iommu_ioas_require_iova {
>         __u32 size;
>         __u32 ioas_id;
>         __u32 num_iovas;
>         __u32 __reserved;
>         struct iommu_required_iovas {
>                 __aligned_u64 start;
>                 __aligned_u64 last;
>         } required_iovas[];
> };

Sounds reasonable.

> > So, for DPDK the sequence would be:
> > 
> > 1. Create IOAS
> > 2. ATTACH devices
> > 3. IOAS_MAP some stuff
> > 4. Do DMA with the IOVAs that IOAS_MAP returned
> > 
> > (Note, not even any need for QUERY in simple cases)
> 
> Yes, this is done already
> 
> > For (unoptimized) qemu it would be:
> > 
> > 1. Create IOAS
> > 2. IOAS_MAP(IOMAP_FIXED|IOMAP_RESERVE) the valid IOVA regions of the
> >    guest platform
> > 3. ATTACH devices (this will fail if they're not compatible with the
> >    reserved IOVA regions)
> > 4. Boot the guest
> > 
> >   (on guest map/invalidate) -> IOAS_MAP(IOMAP_FIXED) to overmap part of
> >                                the reserved regions
> >   (on dev hotplug) -> ATTACH (which might fail, if it conflicts with the
> >                       reserved regions)
> >   (on vIOMMU reconfiguration) -> UNMAP/MAP reserved regions as
> >                                  necessary (which might fail)
> 
> OK, I will take care of it

Hooray!  Long contentious thread eventually reaches productive
resolution :).
Jason Gunthorpe May 11, 2022, 4:32 p.m. UTC | #30
On Wed, May 11, 2022 at 03:15:22AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, May 11, 2022 3:00 AM
> > 
> > On Tue, May 10, 2022 at 05:12:04PM +1000, David Gibson wrote:
> > > Ok... here's a revised version of my proposal which I think addresses
> > > your concerns and simplfies things.
> > >
> > > - No new operations, but IOAS_MAP gets some new flags (and IOAS_COPY
> > >   will probably need matching changes)
> > >
> > > - By default the IOVA given to IOAS_MAP is a hint only, and the IOVA
> > >   is chosen by the kernel within the aperture(s).  This is closer to
> > >   how mmap() operates, and DPDK and similar shouldn't care about
> > >   having specific IOVAs, even at the individual mapping level.
> > >
> > > - IOAS_MAP gets an IOMAP_FIXED flag, analagous to mmap()'s MAP_FIXED,
> > >   for when you really do want to control the IOVA (qemu, maybe some
> > >   special userspace driver cases)
> > 
> > We already did both of these, the flag is called
> > IOMMU_IOAS_MAP_FIXED_IOVA - if it is not specified then kernel will
> > select the IOVA internally.
> > 
> > > - ATTACH will fail if the new device would shrink the aperture to
> > >   exclude any already established mappings (I assume this is already
> > >   the case)
> > 
> > Yes
> > 
> > > - IOAS_MAP gets an IOMAP_RESERVE flag, which operates a bit like a
> > >   PROT_NONE mmap().  It reserves that IOVA space, so other (non-FIXED)
> > >   MAPs won't use it, but doesn't actually put anything into the IO
> > >   pagetables.
> > >     - Like a regular mapping, ATTACHes that are incompatible with an
> > >       IOMAP_RESERVEed region will fail
> > >     - An IOMAP_RESERVEed area can be overmapped with an IOMAP_FIXED
> > >       mapping
> > 
> > Yeah, this seems OK, I'm thinking a new API might make sense because
> > you don't really want mmap replacement semantics but a permanent
> > record of what IOVA must always be valid.
> > 
> > IOMMU_IOA_REQUIRE_IOVA perhaps, similar signature to
> > IOMMUFD_CMD_IOAS_IOVA_RANGES:
> > 
> > struct iommu_ioas_require_iova {
> >         __u32 size;
> >         __u32 ioas_id;
> >         __u32 num_iovas;
> >         __u32 __reserved;
> >         struct iommu_required_iovas {
> >                 __aligned_u64 start;
> >                 __aligned_u64 last;
> >         } required_iovas[];
> > };
> 
> As a permanent record do we want to enforce that once the required
> range list is set all FIXED and non-FIXED allocations must be within the
> list of ranges?

No, I would just use this as a guarantee that going forward any
get_ranges will always return ranges that cover the listed required
ranges. I.e. any narrowing of the ranges will be refused.

map/unmap should only be restricted to the get_ranges output.

Wouldn't burn CPU cycles to nanny userspace here.

> If yes we can take the end of the last range as the max size of the iova
> address space to optimize the page table layout.

I think this API should not interact with the driver. Its only job is
to prevent devices from attaching that would narrow the ranges.

If we also use it to adjust the aperture of the created iommu_domain
then it loses its usefulness as a guard since something like qemu
would have to leave room for hotplug as well.

I suppose optimizing the created iommu_domains should be some other
API, with a different set of ranges and the '# of bytes of IOVA' hint
as well.
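
To make that concrete, here is a rough userspace sketch of how the proposed
ioctl might be called. Only the struct layout comes from the proposal above;
the request number, the wrapper and the error handling are placeholders for
illustration - none of this exists in the kernel today:

#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>

/* Layout taken from the proposal above - not merged, subject to change */
struct iommu_required_iovas {
        uint64_t start;
        uint64_t last;
};

struct iommu_ioas_require_iova {
        uint32_t size;
        uint32_t ioas_id;
        uint32_t num_iovas;
        uint32_t __reserved;
        struct iommu_required_iovas required_iovas[];
};

/* Placeholder request number; the real one would come from the eventual
 * uapi header */
#define IOMMU_IOA_REQUIRE_IOVA 0

/* Record that [start, last] must always remain inside the IOAS aperture;
 * any later ATTACH that would narrow the ranges below this is expected to
 * be refused by the kernel. */
static int ioas_require_iova(int iommufd, uint32_t ioas_id,
                             uint64_t start, uint64_t last)
{
        size_t len = sizeof(struct iommu_ioas_require_iova) +
                     sizeof(struct iommu_required_iovas);
        struct iommu_ioas_require_iova *cmd = calloc(1, len);
        int rc;

        if (!cmd)
                return -1;
        cmd->size = len;
        cmd->ioas_id = ioas_id;
        cmd->num_iovas = 1;
        cmd->required_iovas[0].start = start;
        cmd->required_iovas[0].last = last;
        rc = ioctl(iommufd, IOMMU_IOA_REQUIRE_IOVA, cmd);
        free(cmd);
        return rc;
}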

> > > For (unoptimized) qemu it would be:
> > >
> > > 1. Create IOAS
> > > 2. IOAS_MAP(IOMAP_FIXED|IOMAP_RESERVE) the valid IOVA regions of the
> > >    guest platform
> > > 3. ATTACH devices (this will fail if they're not compatible with the
> > >    reserved IOVA regions)
> > > 4. Boot the guest
> 
> I suppose above is only the sample flow for PPC vIOMMU. For non-PPC
> vIOMMUs regular mappings are required before booting the guest and
> reservation might be done but not mandatory (at least not what current
> Qemu vfio can afford as it simply replays valid ranges in the CPU address
> space).

I think qemu can always do it; it feels like it would simplify error
cases around aperture mismatches.

Jason
Tian, Kevin May 11, 2022, 11:23 p.m. UTC | #31
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, May 12, 2022 12:32 AM
> 
> On Wed, May 11, 2022 at 03:15:22AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Wednesday, May 11, 2022 3:00 AM
> > >
> > > On Tue, May 10, 2022 at 05:12:04PM +1000, David Gibson wrote:
> > > > Ok... here's a revised version of my proposal which I think addresses
> > > > your concerns and simplfies things.
> > > >
> > > > - No new operations, but IOAS_MAP gets some new flags (and IOAS_COPY
> > > >   will probably need matching changes)
> > > >
> > > > - By default the IOVA given to IOAS_MAP is a hint only, and the IOVA
> > > >   is chosen by the kernel within the aperture(s).  This is closer to
> > > >   how mmap() operates, and DPDK and similar shouldn't care about
> > > >   having specific IOVAs, even at the individual mapping level.
> > > >
> > > > - IOAS_MAP gets an IOMAP_FIXED flag, analagous to mmap()'s MAP_FIXED,
> > > >   for when you really do want to control the IOVA (qemu, maybe some
> > > >   special userspace driver cases)
> > >
> > > We already did both of these, the flag is called
> > > IOMMU_IOAS_MAP_FIXED_IOVA - if it is not specified then kernel will
> > > select the IOVA internally.
> > >
> > > > - ATTACH will fail if the new device would shrink the aperture to
> > > >   exclude any already established mappings (I assume this is already
> > > >   the case)
> > >
> > > Yes
> > >
> > > > - IOAS_MAP gets an IOMAP_RESERVE flag, which operates a bit like a
> > > >   PROT_NONE mmap().  It reserves that IOVA space, so other (non-FIXED)
> > > >   MAPs won't use it, but doesn't actually put anything into the IO
> > > >   pagetables.
> > > >     - Like a regular mapping, ATTACHes that are incompatible with an
> > > >       IOMAP_RESERVEed region will fail
> > > >     - An IOMAP_RESERVEed area can be overmapped with an IOMAP_FIXED
> > > >       mapping
> > >
> > > Yeah, this seems OK, I'm thinking a new API might make sense because
> > > you don't really want mmap replacement semantics but a permanent
> > > record of what IOVA must always be valid.
> > >
> > > IOMMU_IOA_REQUIRE_IOVA perhaps, similar signature to
> > > IOMMUFD_CMD_IOAS_IOVA_RANGES:
> > >
> > > struct iommu_ioas_require_iova {
> > >         __u32 size;
> > >         __u32 ioas_id;
> > >         __u32 num_iovas;
> > >         __u32 __reserved;
> > >         struct iommu_required_iovas {
> > >                 __aligned_u64 start;
> > >                 __aligned_u64 last;
> > >         } required_iovas[];
> > > };
> >
> > As a permanent record do we want to enforce that once the required
> > range list is set all FIXED and non-FIXED allocations must be within the
> > list of ranges?
> 
> No, I would just use this as a guarntee that going forward any
> get_ranges will always return ranges that cover the listed required
> ranges. Ie any narrowing of the ranges will be refused.
> 
> map/unmap should only be restricted to the get_ranges output.
> 
> Wouldn't burn CPU cycles to nanny userspace here.

Fair enough.

> 
> > If yes we can take the end of the last range as the max size of the iova
> > address space to optimize the page table layout.
> 
> I think this API should not interact with the driver. Its only job is
> to prevent devices from attaching that would narrow the ranges.
> 
> If we also use it to adjust the aperture of the created iommu_domain
> then it looses its usefullness as guard since something like qemu
> would have to leave room for hotplug as well.
> 
> I suppose optimizing the created iommu_domains should be some other
> API, with a different set of ranges and the '# of bytes of IOVA' hint
> as well.

Makes sense.

> 
> > > > For (unoptimized) qemu it would be:
> > > >
> > > > 1. Create IOAS
> > > > 2. IOAS_MAP(IOMAP_FIXED|IOMAP_RESERVE) the valid IOVA regions of the
> > > >    guest platform
> > > > 3. ATTACH devices (this will fail if they're not compatible with the
> > > >    reserved IOVA regions)
> > > > 4. Boot the guest
> >
> > I suppose above is only the sample flow for PPC vIOMMU. For non-PPC
> > vIOMMUs regular mappings are required before booting the guest and
> > reservation might be done but not mandatory (at least not what current
> > Qemu vfio can afford as it simply replays valid ranges in the CPU address
> > space).
> 
> I think qemu can always do it, it feels like it would simplify error
> cases around aperture mismatches.
> 

It could, but that requires more changes in Qemu to define required ranges
in platform logic and then convey them from the Qemu address space to VFIO.
I view it as an optimization, hence not necessarily something to be done
immediately.

Thanks
Kevin
David Gibson May 13, 2022, 4:35 a.m. UTC | #32
On Wed, May 11, 2022 at 03:15:22AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, May 11, 2022 3:00 AM
> > 
> > On Tue, May 10, 2022 at 05:12:04PM +1000, David Gibson wrote:
> > > Ok... here's a revised version of my proposal which I think addresses
> > > your concerns and simplfies things.
> > >
> > > - No new operations, but IOAS_MAP gets some new flags (and IOAS_COPY
> > >   will probably need matching changes)
> > >
> > > - By default the IOVA given to IOAS_MAP is a hint only, and the IOVA
> > >   is chosen by the kernel within the aperture(s).  This is closer to
> > >   how mmap() operates, and DPDK and similar shouldn't care about
> > >   having specific IOVAs, even at the individual mapping level.
> > >
> > > - IOAS_MAP gets an IOMAP_FIXED flag, analagous to mmap()'s MAP_FIXED,
> > >   for when you really do want to control the IOVA (qemu, maybe some
> > >   special userspace driver cases)
> > 
> > We already did both of these, the flag is called
> > IOMMU_IOAS_MAP_FIXED_IOVA - if it is not specified then kernel will
> > select the IOVA internally.
> > 
> > > - ATTACH will fail if the new device would shrink the aperture to
> > >   exclude any already established mappings (I assume this is already
> > >   the case)
> > 
> > Yes
> > 
> > > - IOAS_MAP gets an IOMAP_RESERVE flag, which operates a bit like a
> > >   PROT_NONE mmap().  It reserves that IOVA space, so other (non-FIXED)
> > >   MAPs won't use it, but doesn't actually put anything into the IO
> > >   pagetables.
> > >     - Like a regular mapping, ATTACHes that are incompatible with an
> > >       IOMAP_RESERVEed region will fail
> > >     - An IOMAP_RESERVEed area can be overmapped with an IOMAP_FIXED
> > >       mapping
> > 
> > Yeah, this seems OK, I'm thinking a new API might make sense because
> > you don't really want mmap replacement semantics but a permanent
> > record of what IOVA must always be valid.
> > 
> > IOMMU_IOA_REQUIRE_IOVA perhaps, similar signature to
> > IOMMUFD_CMD_IOAS_IOVA_RANGES:
> > 
> > struct iommu_ioas_require_iova {
> >         __u32 size;
> >         __u32 ioas_id;
> >         __u32 num_iovas;
> >         __u32 __reserved;
> >         struct iommu_required_iovas {
> >                 __aligned_u64 start;
> >                 __aligned_u64 last;
> >         } required_iovas[];
> > };
> 
> As a permanent record do we want to enforce that once the required
> range list is set all FIXED and non-FIXED allocations must be within the
> list of ranges?

No, I don't think so.  In fact the way I was envisaging this,
non-FIXED mappings will *never* go into the reserved ranges.  This is
for the benefit of any use cases that need both mappings where they
don't care about the IOVA and those which do.

Essentially, reserving a region here is saying to the kernel "I want
to manage this IOVA space; make sure nothing else touches it".  That
means both that the kernel must disallow any hw associated changes
(like ATTACH) which would impinge on the reserved region, and also any
IOVA allocations that would take parts away from that space.

Whether we want to restrict FIXED mappings to the reserved regions is
an interesting question.  I wasn't thinking that would be necessary
(just as you can use mmap() MAP_FIXED anywhere).  However.. much as
MAP_FIXED is very dangerous to use if you don't previously reserve
address space, I think IOMAP_FIXED is dangerous if you haven't
previously reserved space.  So maybe it would make sense to only allow
FIXED mappings within reserved regions.

Strictly dividing the IOVA space into kernel managed and user managed
regions does make a certain amount of sense.
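
To illustrate just the allocation semantics I'm describing - this is only a
toy model, not the iommufd uAPI - non-FIXED requests are satisfied from the
kernel managed part of the space, while FIXED requests are only honoured
inside the user managed (reserved) part:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Pretend aperture: [0, 4G), with [0, 2G) reserved as user managed */
#define APERTURE_END    0x100000000ULL
#define RESERVED_START  0x0ULL
#define RESERVED_LAST   0x7fffffffULL

static uint64_t next_kernel_iova = RESERVED_LAST + 1;

/* non-FIXED: the kernel picks the IOVA, never inside the reserved area */
static bool map_any(uint64_t len, uint64_t *iova)
{
        if (next_kernel_iova + len > APERTURE_END)
                return false;
        *iova = next_kernel_iova;
        next_kernel_iova += len;
        return true;
}

/* FIXED: the caller picks the IOVA; only allowed inside the reserved area,
 * mirroring the "user managed" half of the split */
static bool map_fixed(uint64_t iova, uint64_t len)
{
        return iova >= RESERVED_START && iova + len - 1 <= RESERVED_LAST;
}

int main(void)
{
        uint64_t iova;

        if (map_any(0x10000, &iova))
                printf("kernel chose IOVA 0x%llx\n",
                       (unsigned long long)iova);
        printf("FIXED overmap at 0x1000: %s\n",
               map_fixed(0x1000, 0x1000) ? "ok" : "refused");
        printf("FIXED outside the reserved area: %s\n",
               map_fixed(0x90000000ULL, 0x1000) ? "ok" : "refused");
        return 0;
}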

> If yes we can take the end of the last range as the max size of the iova
> address space to optimize the page table layout.
> 
> otherwise we may need another dedicated hint for that optimization.

Right.  With the revised model where reserving windows is optional,
not required, I don't think we can quite re-use this for optimization
hints.  Which is a bit unfortunate.

I can't immediately see a way to tweak this which handles both more
neatly, but I like the idea if we can figure out a way.

> > > So, for DPDK the sequence would be:
> > >
> > > 1. Create IOAS
> > > 2. ATTACH devices
> > > 3. IOAS_MAP some stuff
> > > 4. Do DMA with the IOVAs that IOAS_MAP returned
> > >
> > > (Note, not even any need for QUERY in simple cases)
> > 
> > Yes, this is done already
> > 
> > > For (unoptimized) qemu it would be:
> > >
> > > 1. Create IOAS
> > > 2. IOAS_MAP(IOMAP_FIXED|IOMAP_RESERVE) the valid IOVA regions of the
> > >    guest platform
> > > 3. ATTACH devices (this will fail if they're not compatible with the
> > >    reserved IOVA regions)
> > > 4. Boot the guest
> 
> I suppose above is only the sample flow for PPC vIOMMU. For non-PPC
> vIOMMUs regular mappings are required before booting the guest and
> reservation might be done but not mandatory (at least not what current
> Qemu vfio can afford as it simply replays valid ranges in the CPU address
> space).

That was a somewhat simplified description.  When we look in more
detail, I think the ppc and x86 models become more similar.  So, in
more detail, I think it would look like this:

1. Create base IOAS
2. Map guest memory into base IOAS so that IOVA==GPA
3. Create IOASes for each vIOMMU domain
4. Reserve windows in domain IOASes where the vIOMMU will allow
   mappings by default
5. ATTACH devices to appropriate IOASes (***)
6. Boot the guest

  On guest map/invalidate:
        Use IOAS_COPY to take mappings from base IOAS and put them
	into the domain IOAS
  On memory hotplug:
        IOAS_MAP new memory block into base IOAS
  On dev hotplug: (***)
        ATTACH devices to appropriate IOAS
  On guest reconfiguration of vIOMMU domains (x86 only):
        DETACH device from base IOAS, attach to vIOMMU domain IOAS
  On guest reconfiguration of vIOMMU apertures (ppc only):
        Alter reserved regions to match vIOMMU

The difference between ppc and x86 is at the places marked (***):
which IOAS each device gets attached to and when. For x86 all devices
live in the base IOAS by default, and only get moved to domain IOASes
when those domains are set up in the vIOMMU.  For POWER each device
starts in a domain IOAS based on its guest PE, and never moves.

[This is still a bit simplified.  In practice, I imagine you'd
 optimize to only create the domain IOASes at the point
 they're needed - on boot for ppc, but only when the vIOMMU is
 configured for x86.  I don't think that really changes the model,
 though.]
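
In rough code terms, the x86-ish version of that flow would look something
like the sketch below. The helpers are empty stand-ins for the corresponding
iommufd operations (IOAS alloc, IOAS_MAP, the proposed reserve, IOAS_COPY,
device attach/detach); their signatures are invented purely so the ordering
can be written down:

#include <stdint.h>

/* Stand-ins only - real code would issue the corresponding ioctls */
static uint32_t ioas_alloc(void) { static uint32_t next_id; return ++next_id; }
static void ioas_map_fixed(uint32_t ioas, uint64_t iova, uint64_t len) {}
static void ioas_reserve(uint32_t ioas, uint64_t iova, uint64_t len) {}
static void ioas_copy(uint32_t dst, uint32_t src, uint64_t src_iova,
                      uint64_t dst_iova, uint64_t len) {}
static void attach_device(uint32_t ioas, const char *device) {}
static void detach_device(uint32_t ioas, const char *device) {}

int main(void)
{
        uint64_t guest_ram_size = 4ULL << 30;

        /* 1 + 2: base IOAS, pages pinned once, IOVA == GPA */
        uint32_t base_ioas = ioas_alloc();
        ioas_map_fixed(base_ioas, 0, guest_ram_size);

        /* 3 + 4: one IOAS per vIOMMU domain, with the window the vIOMMU
         * allows by default marked reserved (proposed semantics) */
        uint32_t dom_ioas = ioas_alloc();
        ioas_reserve(dom_ioas, 0, 1ULL << 31);

        /* 5: x86-ish default - the device starts in the base IOAS */
        attach_device(base_ioas, "0000:00:01.0");

        /* Guest programs a vIOMMU domain for the device: move it, then
         * service guest map requests with IOAS_COPY from the base IOAS so
         * the pages are not pinned a second time */
        detach_device(base_ioas, "0000:00:01.0");
        attach_device(dom_ioas, "0000:00:01.0");
        ioas_copy(dom_ioas, base_ioas, /* gpa */ 0x100000,
                  /* guest iova */ 0x1000, 0x10000);

        return 0;
}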

A few aspects of the model interact quite nicely here.  Mapping a
large memory guest with IOVA==GPA would probably fail on a ppc host
IOMMU.  But if both guest and host are ppc, then no devices get
attached to that base IOAS, so its apertures don't get restricted by
the host hardware.  That way we get a common model, and the benefits
of GUP sharing via IOAS_COPY, without it failing in the ppc-on-ppc
case.

x86-on-ppc and ppc-on-x86 will probably only work in limited cases
where the various sizes and windows line up, but the possibility isn't
precluded by the model or interfaces.

> > >   (on guest map/invalidate) -> IOAS_MAP(IOMAP_FIXED) to overmap part of
> > >                                the reserved regions
> > >   (on dev hotplug) -> ATTACH (which might fail, if it conflicts with the
> > >                       reserved regions)
> > >   (on vIOMMU reconfiguration) -> UNMAP/MAP reserved regions as
> > >                                  necessary (which might fail)
> > 
> > OK, I will take care of it
> > 
> > Thanks,
> > Jason
>
Alexey Kardashevskiy May 23, 2022, 6:02 a.m. UTC | #33
On 4/29/22 00:53, David Gibson wrote:
> On Thu, Mar 24, 2022 at 04:04:03PM -0600, Alex Williamson wrote:
>> On Wed, 23 Mar 2022 21:33:42 -0300
>> Jason Gunthorpe <jgg@nvidia.com> wrote:
>>
>>> On Wed, Mar 23, 2022 at 04:51:25PM -0600, Alex Williamson wrote:
>>>
>>>> My overall question here would be whether we can actually achieve a
>>>> compatibility interface that has sufficient feature transparency that we
>>>> can dump vfio code in favor of this interface, or will there be enough
>>>> niche use cases that we need to keep type1 and vfio containers around
>>>> through a deprecation process?
>>>
>>> Other than SPAPR, I think we can.
>>
>> Does this mean #ifdef CONFIG_PPC in vfio core to retain infrastructure
>> for POWER support?
> 
> There are a few different levels to consider for dealing with PPC.
> For a suitable long term interface for ppc hosts and guests dropping
> this is fine: the ppc specific iommu model was basically an
> ill-conceived idea from the beginning, because none of us had
> sufficiently understood what things were general and what things where
> iommu model/hw specific.
> 
> ..mostly.  There are several points of divergence for the ppc iommu
> model.
> 
> 1) Limited IOVA windows.  This one turned out to not really be ppc
> specific, and is (rightly) handlded generically in the new interface.
> No problem here.
> 
> 2) Costly GUPs.  pseries (the most common ppc machine type) always
> expects a (v)IOMMU.  That means that unlike the common x86 model of a
> host with IOMMU, but guests with no-vIOMMU, guest initiated
> maps/unmaps can be a hot path.  Accounting in that path can be
> prohibitive (and on POWER8 in particular it prevented us from
> optimizing that path the way we wanted).  We had two solutions for
> that, in v1 the explicit ENABLE/DISABLE calls, which preaccounted
> based on the IOVA window sizes.  That was improved in the v2 which
> used the concept of preregistration.  IIUC iommufd can achieve the
> same effect as preregistration using IOAS_COPY, so this one isn't
> really a problem either.


I am getting rid of those POWER8-related realmode handlers as POWER9 has
the MMU enabled when hcalls are handled. The costly GUP problem is still
there though (which the base IOAS should solve?).


> 3) "dynamic DMA windows" (DDW).  The IBM IOMMU hardware allows for 2 IOVA
> windows, which aren't contiguous with each other.  The base addresses
> of each of these are fixed, but the size of each window, the pagesize
> (i.e. granularity) of each window and the number of levels in the
> IOMMU pagetable are runtime configurable.  Because it's true in the
> hardware, it's also true of the vIOMMU interface defined by the IBM
> hypervisor (and adpoted by KVM as well).  So, guests can request
> changes in how these windows are handled.  Typical Linux guests will
> use the "low" window (IOVA 0..2GiB) dynamically, and the high window
> (IOVA 1<<60..???) to map all of RAM.  However, as a hypervisor we
> can't count on that; the guest can use them however it wants.


The guest actually does this already. AIX has always been like that, and
Linux is forced to do that for SRIOV VFs as there can be many VFs and
TCEs (==IOPTEs) are a limited resource. Today's pseries IOMMU code first
tries mapping 1:1 (as it has been for ages), but if there are not enough
TCEs it removes the first window (which increases the TCE budget), creates
a new 64-bit window (as big as possible but not necessarily enough for 1:1;
64K/2M IOMMU page sizes allowed) and does map/unmap as drivers go.


Which means the guest RAM does not all need to be mapped in the base
IOAS suggested earlier in this thread, as that would mean all memory is
pinned and PowerVM won't be able to swap it out (yeah, it can do such a
thing now!). Not sure if we really want to support this or stick to a
simpler design.



> 
> (3) still needs a plan for how to fit it into the /dev/iommufd model.
> This is a secondary reason that in the past I advocated for the user
> requesting specific DMA windows which the kernel would accept or
> refuse, rather than having a query function - it connects easily to
> the DDW model.  With the query-first model we'd need some sort of
> extension here, not really sure what it should look like.
> 
> 
> 
> Then, there's handling existing qemu (or other software) that is using
> the VFIO SPAPR_TCE interfaces.  First, it's not entirely clear if this
> should be a goal or not: as others have noted, working actively to
> port qemu to the new interface at the same time as making a
> comprehensive in-kernel compat layer is arguably redundant work.
> 
> That said, if we did want to handle this in an in-kernel compat layer,
> here's roughly what you'd need for SPAPR_TCE v2:
> 
> - VFIO_IOMMU_SPAPR_TCE_GET_INFO
>      I think this should be fairly straightforward; the information you
>      need should be in the now generic IOVA window stuff and would just
>      need massaging into the expected format.
> - VFIO_IOMMU_SPAPR_REGISTER_MEMORY /
>    VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY
>      IIUC, these could be traslated into map/unmap operations onto a
>      second implicit IOAS which represents the preregistered memory
>      areas (to which we'd never connect an actual device).  Along with
>      this VFIO_MAP and VFIO_UNMAP operations would need to check for
>      this case, verify their addresses against the preregistered space
>      and be translated into IOAS_COPY operations from the prereg
>      address space instead of raw IOAS_MAP operations.  Fiddly, but not
>      fundamentally hard, I think.
> 
> For SPAPR_TCE_v1 things are a bit trickier
> 
> - VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE
>      I suspect you could get away with implementing these as no-ops.
>      It wouldn't be strictly correct, but I think software which is
>      using the interface correctly should work this way, though
>      possibly not optimally.  That might be good enough for this ugly
>      old interface.
> 
> And... then there's VFIO_EEH_PE_OP.  It's very hard to know what to do
> with this because the interface was completely broken for most of its
> lifetime.  EEH is a fancy error handling feature of IBM PCI hardware
> somewhat similar in concept, though not interface, to PCIe AER.  I have
> a very strong impression that while this was a much-touted checkbox
> feature for RAS, no-one, ever. actually used it.  As evidenced by the
> fact that there was, I believe over a *decade* in which all the
> interfaces were completely broken by design, and apparently no-one
> noticed.
> 
> So, cynically, you could probably get away with making this a no-op as
> well.  If you wanted to do it properly... well... that would require
> training up yet another person to actually understand this and hoping
> they get it done before they run screaming.  This one gets very ugly
> because the EEH operations have to operate on the hardware (or
> firmware) "Partitionable Endpoints" (PEs) which correspond one to one
> with IOMMU groups, but not necessarily with VFIO containers, but
> there's not really any sensible way to expose that to users.
> 
> You might be able to do this by simply failing this outright if
> there's anything other than exactly one IOMMU group bound to the
> container / IOAS (which I think might be what VFIO itself does now).
> Handling that with a device centric API gets somewhat fiddlier, of
> course.
> 
> 
> _______________________________________________
> iommu mailing list
> iommu@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/iommu
Jason Gunthorpe May 24, 2022, 1:25 p.m. UTC | #34
On Mon, May 23, 2022 at 04:02:22PM +1000, Alexey Kardashevskiy wrote:

> Which means the guest RAM does not need to be all mapped in that base IOAS
> suggested down this thread as that would mean all memory is pinned and
> powervm won't be able to swap it out (yeah, it can do such thing now!). Not
> sure if we really want to support this or stick to a simpler design.

Huh? How can it swap? Calling GUP is not optional. Either you call GUP
at the start and there is no swap, or you call GUP for each vIOMMU
hypercall.

Since everyone says PPC doesn't call GUP during the hypercall - how is
it working?

Jason
David Gibson May 25, 2022, 1:39 a.m. UTC | #35
On Tue, May 24, 2022 at 10:25:53AM -0300, Jason Gunthorpe wrote:
> On Mon, May 23, 2022 at 04:02:22PM +1000, Alexey Kardashevskiy wrote:
> 
> > Which means the guest RAM does not need to be all mapped in that base IOAS
> > suggested down this thread as that would mean all memory is pinned and
> > powervm won't be able to swap it out (yeah, it can do such thing now!). Not
> > sure if we really want to support this or stick to a simpler design.
> 
> Huh? How can it swap? Calling GUP is not optional. Either you call GUP
> at the start and there is no swap, or you call GUP for each vIOMMU
> hypercall.
> 
> Since everyone says PPC doesn't call GUP during the hypercall - how is
> it working?

The current implementation does GUP during the pre-reserve.  I think
Alexey's talking about a new PowerVM (IBM hypervisor) feature; I don't
know how that works.
Alexey Kardashevskiy May 25, 2022, 2:09 a.m. UTC | #36
On 5/24/22 23:25, Jason Gunthorpe wrote:
> On Mon, May 23, 2022 at 04:02:22PM +1000, Alexey Kardashevskiy wrote:
> 
>> Which means the guest RAM does not need to be all mapped in that base IOAS
>> suggested down this thread as that would mean all memory is pinned and
>> powervm won't be able to swap it out (yeah, it can do such thing now!). Not
>> sure if we really want to support this or stick to a simpler design.
> 
> Huh? How can it swap? Calling GUP is not optional. Either you call GUP
> at the start and there is no swap, or you call GUP for each vIOMMU
> hypercall.

Correct, not optional.


> Since everyone says PPC doesn't call GUP during the hypercall - how is
> it working?

It does not call GUP during hypercalls because all VM pages are GUPed in
advance at a special memory preregistration step, as we could not call
GUP from a hypercall handler with the MMU off (often the case with POWER8
when this was developed in the first place). Things are better with
POWER9 (bare metal can do all sorts of things pretty much) but the
PowerVM interface with 2 windows is still there and this iommufd
proposal is going to be ported on top of PowerVM at first.

I am just saying there is a model where not everything is mapped and this
has its use. PowerVM's swapping capability is something new and I do not
really know how it works though.
diff mbox series

Patch

diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
index ca28a135b9675f..2fdff04000b326 100644
--- a/drivers/iommu/iommufd/Makefile
+++ b/drivers/iommu/iommufd/Makefile
@@ -5,6 +5,7 @@  iommufd-y := \
 	io_pagetable.o \
 	ioas.o \
 	main.o \
-	pages.o
+	pages.o \
+	vfio_compat.o
 
 obj-$(CONFIG_IOMMUFD) += iommufd.o
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index e5c717231f851e..31628591591c17 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -67,6 +67,8 @@  void iopt_remove_reserved_iova(struct io_pagetable *iopt, void *owner);
 struct iommufd_ctx {
 	struct file *filp;
 	struct xarray objects;
+
+	struct iommufd_ioas *vfio_ioas;
 };
 
 struct iommufd_ctx *iommufd_fget(int fd);
@@ -78,6 +80,9 @@  struct iommufd_ucmd {
 	void *cmd;
 };
 
+int iommufd_vfio_ioctl(struct iommufd_ctx *ictx, unsigned int cmd,
+		       unsigned long arg);
+
 /* Copy the response in ucmd->cmd back to userspace. */
 static inline int iommufd_ucmd_respond(struct iommufd_ucmd *ucmd,
 				       size_t cmd_len)
@@ -186,6 +191,7 @@  int iommufd_ioas_iova_ranges(struct iommufd_ucmd *ucmd);
 int iommufd_ioas_map(struct iommufd_ucmd *ucmd);
 int iommufd_ioas_copy(struct iommufd_ucmd *ucmd);
 int iommufd_ioas_unmap(struct iommufd_ucmd *ucmd);
+int iommufd_vfio_ioas(struct iommufd_ucmd *ucmd);
 
 /*
  * A HW pagetable is called an iommu_domain inside the kernel. This user object
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 6a895489fb5b82..f746fcff8145cb 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -122,6 +122,8 @@  bool iommufd_object_destroy_user(struct iommufd_ctx *ictx,
 		return false;
 	}
 	__xa_erase(&ictx->objects, obj->id);
+	if (ictx->vfio_ioas && &ictx->vfio_ioas->obj == obj)
+		ictx->vfio_ioas = NULL;
 	xa_unlock(&ictx->objects);
 
 	iommufd_object_ops[obj->type].destroy(obj);
@@ -219,27 +221,31 @@  static struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
 		 __reserved),
 	IOCTL_OP(IOMMU_IOAS_UNMAP, iommufd_ioas_unmap, struct iommu_ioas_unmap,
 		 length),
+	IOCTL_OP(IOMMU_VFIO_IOAS, iommufd_vfio_ioas, struct iommu_vfio_ioas,
+		 __reserved),
 };
 
 static long iommufd_fops_ioctl(struct file *filp, unsigned int cmd,
 			       unsigned long arg)
 {
+	struct iommufd_ctx *ictx = filp->private_data;
 	struct iommufd_ucmd ucmd = {};
 	struct iommufd_ioctl_op *op;
 	union ucmd_buffer buf;
 	unsigned int nr;
 	int ret;
 
-	ucmd.ictx = filp->private_data;
+	nr = _IOC_NR(cmd);
+	if (nr < IOMMUFD_CMD_BASE ||
+	    (nr - IOMMUFD_CMD_BASE) >= ARRAY_SIZE(iommufd_ioctl_ops))
+		return iommufd_vfio_ioctl(ictx, cmd, arg);
+
+	ucmd.ictx = ictx;
 	ucmd.ubuffer = (void __user *)arg;
 	ret = get_user(ucmd.user_size, (u32 __user *)ucmd.ubuffer);
 	if (ret)
 		return ret;
 
-	nr = _IOC_NR(cmd);
-	if (nr < IOMMUFD_CMD_BASE ||
-	    (nr - IOMMUFD_CMD_BASE) >= ARRAY_SIZE(iommufd_ioctl_ops))
-		return -ENOIOCTLCMD;
 	op = &iommufd_ioctl_ops[nr - IOMMUFD_CMD_BASE];
 	if (op->ioctl_num != cmd)
 		return -ENOIOCTLCMD;
diff --git a/drivers/iommu/iommufd/vfio_compat.c b/drivers/iommu/iommufd/vfio_compat.c
new file mode 100644
index 00000000000000..5c996bc9b44d48
--- /dev/null
+++ b/drivers/iommu/iommufd/vfio_compat.c
@@ -0,0 +1,401 @@ 
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
+ */
+#include <linux/file.h>
+#include <linux/interval_tree.h>
+#include <linux/iommu.h>
+#include <linux/iommufd.h>
+#include <linux/slab.h>
+#include <linux/vfio.h>
+#include <uapi/linux/vfio.h>
+#include <uapi/linux/iommufd.h>
+
+#include "iommufd_private.h"
+
+static struct iommufd_ioas *get_compat_ioas(struct iommufd_ctx *ictx)
+{
+	struct iommufd_ioas *ioas = ERR_PTR(-ENODEV);
+
+	xa_lock(&ictx->objects);
+	if (!ictx->vfio_ioas || !iommufd_lock_obj(&ictx->vfio_ioas->obj))
+		goto out_unlock;
+	ioas = ictx->vfio_ioas;
+out_unlock:
+	xa_unlock(&ictx->objects);
+	return ioas;
+}
+
+/*
+ * Only attaching a group should cause a default creation of the internal ioas;
+ * this returns the existing ioas if it has already been assigned somehow.
+ * FIXME: maybe_unused
+ */
+static __maybe_unused struct iommufd_ioas *
+create_compat_ioas(struct iommufd_ctx *ictx)
+{
+	struct iommufd_ioas *ioas = NULL;
+	struct iommufd_ioas *out_ioas;
+
+	ioas = iommufd_ioas_alloc(ictx);
+	if (IS_ERR(ioas))
+		return ioas;
+
+	xa_lock(&ictx->objects);
+	if (ictx->vfio_ioas && iommufd_lock_obj(&ictx->vfio_ioas->obj))
+		out_ioas = ictx->vfio_ioas;
+	else
+		out_ioas = ioas;
+	xa_unlock(&ictx->objects);
+
+	if (out_ioas != ioas) {
+		iommufd_object_abort(ictx, &ioas->obj);
+		return out_ioas;
+	}
+	if (!iommufd_lock_obj(&ioas->obj))
+		WARN_ON(true);
+	iommufd_object_finalize(ictx, &ioas->obj);
+	return ioas;
+}
+
+int iommufd_vfio_ioas(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_vfio_ioas *cmd = ucmd->cmd;
+	struct iommufd_ioas *ioas;
+
+	if (cmd->__reserved)
+		return -EOPNOTSUPP;
+	switch (cmd->op) {
+	case IOMMU_VFIO_IOAS_GET:
+		ioas = get_compat_ioas(ucmd->ictx);
+		if (IS_ERR(ioas))
+			return PTR_ERR(ioas);
+		cmd->ioas_id = ioas->obj.id;
+		iommufd_put_object(&ioas->obj);
+		return iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+
+	case IOMMU_VFIO_IOAS_SET:
+		ioas = iommufd_get_ioas(ucmd, cmd->ioas_id);
+		if (IS_ERR(ioas))
+			return PTR_ERR(ioas);
+		xa_lock(&ucmd->ictx->objects);
+		ucmd->ictx->vfio_ioas = ioas;
+		xa_unlock(&ucmd->ictx->objects);
+		iommufd_put_object(&ioas->obj);
+		return 0;
+
+	case IOMMU_VFIO_IOAS_CLEAR:
+		xa_lock(&ucmd->ictx->objects);
+		ucmd->ictx->vfio_ioas = NULL;
+		xa_unlock(&ucmd->ictx->objects);
+		return 0;
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+
+static int iommufd_vfio_map_dma(struct iommufd_ctx *ictx, unsigned int cmd,
+				void __user *arg)
+{
+	u32 supported_flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
+	size_t minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
+	struct vfio_iommu_type1_dma_map map;
+	int iommu_prot = IOMMU_CACHE;
+	struct iommufd_ioas *ioas;
+	unsigned long iova;
+	int rc;
+
+	if (copy_from_user(&map, arg, minsz))
+		return -EFAULT;
+
+	if (map.argsz < minsz || map.flags & ~supported_flags)
+		return -EINVAL;
+
+	if (map.flags & VFIO_DMA_MAP_FLAG_READ)
+		iommu_prot |= IOMMU_READ;
+	if (map.flags & VFIO_DMA_MAP_FLAG_WRITE)
+		iommu_prot |= IOMMU_WRITE;
+
+	ioas = get_compat_ioas(ictx);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+
+	iova = map.iova;
+	rc = iopt_map_user_pages(&ioas->iopt, &iova,
+				 u64_to_user_ptr(map.vaddr), map.size,
+				 iommu_prot, 0);
+	iommufd_put_object(&ioas->obj);
+	return rc;
+}
+
+static int iommufd_vfio_unmap_dma(struct iommufd_ctx *ictx, unsigned int cmd,
+				  void __user *arg)
+{
+	size_t minsz = offsetofend(struct vfio_iommu_type1_dma_unmap, size);
+	u32 supported_flags = VFIO_DMA_UNMAP_FLAG_ALL;
+	struct vfio_iommu_type1_dma_unmap unmap;
+	struct iommufd_ioas *ioas;
+	int rc;
+
+	if (copy_from_user(&unmap, arg, minsz))
+		return -EFAULT;
+
+	if (unmap.argsz < minsz || unmap.flags & ~supported_flags)
+		return -EINVAL;
+
+	ioas = get_compat_ioas(ictx);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+
+	if (unmap.flags & VFIO_DMA_UNMAP_FLAG_ALL)
+		rc = iopt_unmap_all(&ioas->iopt);
+	else
+		rc = iopt_unmap_iova(&ioas->iopt, unmap.iova, unmap.size);
+	iommufd_put_object(&ioas->obj);
+	return rc;
+}
+
+static int iommufd_vfio_check_extension(unsigned long type)
+{
+	switch (type) {
+	case VFIO_TYPE1v2_IOMMU:
+	case VFIO_UNMAP_ALL:
+		return 1;
+	/*
+	 * FIXME: The type1 iommu allows splitting of maps, which can fail. This is doable but
+	 * is a bunch of extra code that is only for supporting this case.
+	 */
+	case VFIO_TYPE1_IOMMU:
+	/*
+	 * FIXME: No idea what VFIO_TYPE1_NESTING_IOMMU does as far as the uAPI
+	 * is concerned. Seems like it was never completed, it only does
+	 * something on ARM, but I can't figure out what or how to use it. Can't
+	 * find any user implementation either.
+	 */
+	case VFIO_TYPE1_NESTING_IOMMU:
+	/*
+	 * FIXME: Easy to support, but needs rework in the Intel iommu driver
+	 * to expose the no snoop squashing to iommufd
+	 */
+	case VFIO_DMA_CC_IOMMU:
+	/*
+	 * FIXME: VFIO_DMA_MAP_FLAG_VADDR
+	 * https://lore.kernel.org/kvm/1611939252-7240-1-git-send-email-steven.sistare@oracle.com/
+	 * Wow, what a wild feature. This should have been implemented by
+	 * allowing a iopt_pages to be associated with a memfd. It can then
+	 * source mapping requests directly from a memfd without going through a
+	 * mm_struct and thus doesn't care that the original qemu exec'd itself.
+	 * The idea that userspace can flip a flag and cause kernel users to
+	 * block indefinitely is unacceptable.
+	 *
+	 * For VFIO compat we should implement this in a slightly different way,
+	 * Creating an access_user that spans the whole area will immediately
+	 * stop new faults as they will be handled from the xarray. We can then
+	 * reparent the iopt_pages to the new mm_struct and undo the
+	 * access_user. No blockage of kernel users required, does require
+	 * filling the xarray with pages though.
+	 */
+	case VFIO_UPDATE_VADDR:
+	default:
+		return 0;
+	}
+
+ /* FIXME: VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP I think everything with dirty
+  * tracking should be in its own ioctl, not muddled in unmap. If we want to
+  * atomically unmap and get the dirty bitmap it should be a flag in the dirty
+  * tracking ioctl, not here in unmap. Overall dirty tracking needs a careful
+  * review along side HW drivers implementing it.
+  */
+}
+
+static int iommufd_vfio_set_iommu(struct iommufd_ctx *ictx, unsigned long type)
+{
+	struct iommufd_ioas *ioas = NULL;
+
+	if (type != VFIO_TYPE1v2_IOMMU)
+		return -EINVAL;
+
+	/* VFIO fails the set_iommu if there is no group */
+	ioas = get_compat_ioas(ictx);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+	iommufd_put_object(&ioas->obj);
+	return 0;
+}
+
+static u64 iommufd_get_pagesizes(struct iommufd_ioas *ioas)
+{
+	/* FIXME: See vfio_update_pgsize_bitmap(), for compat this should return
+	 * the high bits too, and we need to decide if we should report that
+	 * iommufd supports less than PAGE_SIZE alignment or stick to strict
+	 * compatibility. qemu only cares about the first set bit.
+	 */
+	return ioas->iopt.iova_alignment;
+}
+
+static int iommufd_fill_cap_iova(struct iommufd_ioas *ioas,
+				 struct vfio_info_cap_header __user *cur,
+				 size_t avail)
+{
+	struct vfio_iommu_type1_info_cap_iova_range __user *ucap_iovas =
+		container_of(cur, struct vfio_iommu_type1_info_cap_iova_range,
+			     header);
+	struct vfio_iommu_type1_info_cap_iova_range cap_iovas = {
+		.header = {
+			.id = VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE,
+			.version = 1,
+		},
+	};
+	struct interval_tree_span_iter span;
+
+	for (interval_tree_span_iter_first(
+		     &span, &ioas->iopt.reserved_iova_itree, 0, ULONG_MAX);
+	     !interval_tree_span_iter_done(&span);
+	     interval_tree_span_iter_next(&span)) {
+		struct vfio_iova_range range;
+
+		if (!span.is_hole)
+			continue;
+		range.start = span.start_hole;
+		range.end = span.last_hole;
+		if (avail >= struct_size(&cap_iovas, iova_ranges,
+					 cap_iovas.nr_iovas + 1) &&
+		    copy_to_user(&ucap_iovas->iova_ranges[cap_iovas.nr_iovas],
+				 &range, sizeof(range)))
+			return -EFAULT;
+		cap_iovas.nr_iovas++;
+	}
+	if (avail >= struct_size(&cap_iovas, iova_ranges, cap_iovas.nr_iovas) &&
+	    copy_to_user(ucap_iovas, &cap_iovas, sizeof(cap_iovas)))
+		return -EFAULT;
+	return struct_size(&cap_iovas, iova_ranges, cap_iovas.nr_iovas);
+}
+
+static int iommufd_fill_cap_dma_avail(struct iommufd_ioas *ioas,
+				      struct vfio_info_cap_header __user *cur,
+				      size_t avail)
+{
+	struct vfio_iommu_type1_info_dma_avail cap_dma = {
+		.header = {
+			.id = VFIO_IOMMU_TYPE1_INFO_DMA_AVAIL,
+			.version = 1,
+		},
+		/* iommufd has no limit, return the same value as VFIO. */
+		.avail = U16_MAX,
+	};
+
+	if (avail >= sizeof(cap_dma) &&
+	    copy_to_user(cur, &cap_dma, sizeof(cap_dma)))
+		return -EFAULT;
+	return sizeof(cap_dma);
+}
+
+static int iommufd_vfio_iommu_get_info(struct iommufd_ctx *ictx,
+				       void __user *arg)
+{
+	typedef int (*fill_cap_fn)(struct iommufd_ioas *ioas,
+				   struct vfio_info_cap_header __user *cur,
+				   size_t avail);
+	static const fill_cap_fn fill_fns[] = {
+		iommufd_fill_cap_iova,
+		iommufd_fill_cap_dma_avail,
+	};
+	size_t minsz = offsetofend(struct vfio_iommu_type1_info, iova_pgsizes);
+	struct vfio_info_cap_header __user *last_cap = NULL;
+	struct vfio_iommu_type1_info info;
+	struct iommufd_ioas *ioas;
+	size_t total_cap_size;
+	int rc;
+	int i;
+
+	if (copy_from_user(&info, arg, minsz))
+		return -EFAULT;
+
+	if (info.argsz < minsz)
+		return -EINVAL;
+	minsz = min_t(size_t, info.argsz, sizeof(info));
+
+	ioas = get_compat_ioas(ictx);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+
+	down_read(&ioas->iopt.iova_rwsem);
+	info.flags = VFIO_IOMMU_INFO_PGSIZES;
+	info.iova_pgsizes = iommufd_get_pagesizes(ioas);
+	info.cap_offset = 0;
+
+	total_cap_size = sizeof(info);
+	for (i = 0; i != ARRAY_SIZE(fill_fns); i++) {
+		int cap_size;
+
+		if (info.argsz > total_cap_size)
+			cap_size = fill_fns[i](ioas, arg + total_cap_size,
+					       info.argsz - total_cap_size);
+		else
+			cap_size = fill_fns[i](ioas, NULL, 0);
+		if (cap_size < 0) {
+			rc = cap_size;
+			goto out_put;
+		}
+		if (last_cap && info.argsz >= total_cap_size &&
+		    put_user(total_cap_size, &last_cap->next)) {
+			rc = -EFAULT;
+			goto out_put;
+		}
+		last_cap = arg + total_cap_size;
+		total_cap_size += cap_size;
+	}
+
+	/*
+	 * If the user did not provide enough space then only some caps are
+	 * returned and the argsz will be updated to the correct amount to get
+	 * all caps.
+	 */
+	if (info.argsz >= total_cap_size)
+		info.cap_offset = sizeof(info);
+	info.argsz = total_cap_size;
+	info.flags |= VFIO_IOMMU_INFO_CAPS;
+	rc = 0;
+	if (copy_to_user(arg, &info, minsz))
+		rc = -EFAULT;
+
+out_put:
+	up_read(&ioas->iopt.iova_rwsem);
+	iommufd_put_object(&ioas->obj);
+	return rc;
+}
+
+/* FIXME TODO:
+PowerPC SPAPR only:
+#define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
+#define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
+#define VFIO_IOMMU_SPAPR_TCE_GET_INFO	_IO(VFIO_TYPE, VFIO_BASE + 12)
+#define VFIO_IOMMU_SPAPR_REGISTER_MEMORY	_IO(VFIO_TYPE, VFIO_BASE + 17)
+#define VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY	_IO(VFIO_TYPE, VFIO_BASE + 18)
+#define VFIO_IOMMU_SPAPR_TCE_CREATE	_IO(VFIO_TYPE, VFIO_BASE + 19)
+#define VFIO_IOMMU_SPAPR_TCE_REMOVE	_IO(VFIO_TYPE, VFIO_BASE + 20)
+*/
+
+int iommufd_vfio_ioctl(struct iommufd_ctx *ictx, unsigned int cmd,
+		       unsigned long arg)
+{
+	void __user *uarg = (void __user *)arg;
+
+	switch (cmd) {
+	case VFIO_GET_API_VERSION:
+		return VFIO_API_VERSION;
+	case VFIO_SET_IOMMU:
+		return iommufd_vfio_set_iommu(ictx, arg);
+	case VFIO_CHECK_EXTENSION:
+		return iommufd_vfio_check_extension(arg);
+	case VFIO_IOMMU_GET_INFO:
+		return iommufd_vfio_iommu_get_info(ictx, uarg);
+	case VFIO_IOMMU_MAP_DMA:
+		return iommufd_vfio_map_dma(ictx, cmd, uarg);
+	case VFIO_IOMMU_UNMAP_DMA:
+		return iommufd_vfio_unmap_dma(ictx, cmd, uarg);
+	case VFIO_IOMMU_DIRTY_PAGES:
+	default:
+		return -ENOIOCTLCMD;
+	}
+	return -ENOIOCTLCMD;
+}
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index ba7b17ec3002e3..2c0f5ced417335 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -42,6 +42,7 @@  enum {
 	IOMMUFD_CMD_IOAS_MAP,
 	IOMMUFD_CMD_IOAS_COPY,
 	IOMMUFD_CMD_IOAS_UNMAP,
+	IOMMUFD_CMD_VFIO_IOAS,
 };
 
 /**
@@ -184,4 +185,39 @@  struct iommu_ioas_unmap {
 	__aligned_u64 length;
 };
 #define IOMMU_IOAS_UNMAP _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_UNMAP)
+
+/**
+ * enum iommufd_vfio_ioas_op
+ * @IOMMU_VFIO_IOAS_GET: Get the current compatibility IOAS
+ * @IOMMU_VFIO_IOAS_SET: Change the current compatibility IOAS
+ * @IOMMU_VFIO_IOAS_CLEAR: Disable VFIO compatibility
+ */
+enum iommufd_vfio_ioas_op {
+	IOMMU_VFIO_IOAS_GET = 0,
+	IOMMU_VFIO_IOAS_SET = 1,
+	IOMMU_VFIO_IOAS_CLEAR = 2,
+};
+
+/**
+ * struct iommu_vfio_ioas - ioctl(IOMMU_VFIO_IOAS)
+ * @size: sizeof(struct iommu_vfio_ioas)
+ * @ioas_id: For IOMMU_VFIO_IOAS_SET the input IOAS ID to set
+ *           For IOMMU_VFIO_IOAS_GET will output the IOAS ID
+ * @op: One of enum iommufd_vfio_ioas_op
+ * @__reserved: Must be 0
+ *
+ * The VFIO compatibility support uses a single ioas because VFIO APIs do not
+ * support the ID field. Set or Get the IOAS that VFIO compatibility will use.
+ * When VFIO_GROUP_SET_CONTAINER is used on an iommufd it will get the
+ * compatibility ioas, either by taking what is already set, or auto creating
+ * one. From then on VFIO will continue to use that ioas and is not affected by
+ * this ioctl. SET or CLEAR does not destroy any auto-created IOAS.
+ */
+struct iommu_vfio_ioas {
+	__u32 size;
+	__u32 ioas_id;
+	__u16 op;
+	__u16 __reserved;
+};
+#define IOMMU_VFIO_IOAS _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VFIO_IOAS)
 #endif
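
For reference, a minimal userspace sketch of the compat path this patch adds.
It assumes the uapi headers from this series are installed, that fd is an
already open iommufd, and that a group/device has already been attached (or
an IOAS was set with IOMMU_VFIO_IOAS_SET) so the compat IOAS exists:

#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>
#include <linux/iommufd.h>   /* from this series: IOMMU_VFIO_IOAS */

int vfio_compat_demo(int fd, void *buf, uint64_t len, uint64_t iova)
{
        /* These container ioctls are answered by iommufd_vfio_ioctl() */
        if (ioctl(fd, VFIO_GET_API_VERSION) != VFIO_API_VERSION)
                return -1;
        if (!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1v2_IOMMU))
                return -1;
        if (ioctl(fd, VFIO_SET_IOMMU, VFIO_TYPE1v2_IOMMU))
                return -errno;  /* -ENODEV until the compat ioas exists */

        /* The compat IOAS can also be looked up with the new ioctl so
         * iommufd-native calls can be mixed with the VFIO-style ones */
        struct iommu_vfio_ioas vfio_ioas = {
                .size = sizeof(vfio_ioas),
                .op = IOMMU_VFIO_IOAS_GET,
        };
        if (ioctl(fd, IOMMU_VFIO_IOAS, &vfio_ioas))
                return -errno;
        printf("compat ioas_id = %u\n", vfio_ioas.ioas_id);

        /* Plain type1-style map, routed into the same IOAS */
        struct vfio_iommu_type1_dma_map map = {
                .argsz = sizeof(map),
                .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
                .vaddr = (uintptr_t)buf,
                .iova = iova,
                .size = len,
        };
        return ioctl(fd, VFIO_IOMMU_MAP_DMA, &map) ? -errno : 0;
}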