[RFC,08/12] iommufd: IOCTLs for the io_pagetable

Message ID 8-v1-e79cd8d168e8+6-iommufd_jgg@nvidia.com (mailing list archive)
State New, archived
Series IOMMUFD Generic interface

Commit Message

Jason Gunthorpe March 18, 2022, 5:27 p.m. UTC
Connect the IOAS to its IOCTL interface. This exposes most of the
functionality in the io_pagetable to userspace.

This is intended to be the core of the generic interface that IOMMUFD will
provide. Every IOMMU driver should be able to implement an iommu_domain
that is compatible with this generic mechanism.

It is also designed to be easy to use for simple non virtual machine
monitor users, like DPDK:
 - Universal simple support for all IOMMUs (no PPC special path)
 - An IOVA allocator that considers the aperture and the reserved ranges
 - io_pagetable allows any number of iommu_domains to be connected to the
   IOAS

Along with room in the design to add non-generic features to cater to
specific HW functionality.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/iommufd/Makefile          |   1 +
 drivers/iommu/iommufd/ioas.c            | 248 ++++++++++++++++++++++++
 drivers/iommu/iommufd/iommufd_private.h |  27 +++
 drivers/iommu/iommufd/main.c            |  17 ++
 include/uapi/linux/iommufd.h            | 132 +++++++++++++
 5 files changed, 425 insertions(+)
 create mode 100644 drivers/iommu/iommufd/ioas.c
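
As a rough illustration of the simple-user flow these IOCTLs are meant to
enable, here is a hedged userspace sketch written only against the uapi
structs added by this patch. The /dev/iommu node name, the header paths and
the minimal error handling are assumptions; binding/attaching a device to the
IOAS is a separate step (later patches) and is not shown. The allowed IOVA
ranges could similarly be queried with IOMMU_IOAS_IOVA_RANGES before mapping.

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>	/* uapi header added by this series */

/*
 * Allocate an IOAS and map a user buffer into it, letting the kernel pick
 * the IOVA (IOMMU_IOAS_MAP_FIXED_IOVA is not set). Returns the iommufd fd.
 */
static int ioas_map_buffer(void *buf, __u64 length, __u32 *ioas_id, __u64 *iova)
{
	struct iommu_ioas_alloc alloc = { .size = sizeof(alloc) };
	struct iommu_ioas_map map = { .size = sizeof(map) };
	int fd = open("/dev/iommu", O_RDWR);	/* device node name assumed */

	if (fd < 0)
		return -1;
	if (ioctl(fd, IOMMU_IOAS_ALLOC, &alloc))
		goto err;

	map.flags = IOMMU_IOAS_MAP_READABLE | IOMMU_IOAS_MAP_WRITEABLE;
	map.ioas_id = alloc.out_ioas_id;
	map.user_va = (__u64)(uintptr_t)buf;
	map.length = length;
	if (ioctl(fd, IOMMU_IOAS_MAP, &map))
		goto err;

	*ioas_id = alloc.out_ioas_id;
	*iova = map.iova;	/* kernel-selected IOVA comes back here */
	return fd;

err:
	close(fd);
	return -1;
}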

Comments

Alex Williamson March 23, 2022, 7:10 p.m. UTC | #1
On Fri, 18 Mar 2022 14:27:33 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> +static int conv_iommu_prot(u32 map_flags)
> +{
> +	int iommu_prot;
> +
> +	/*
> +	 * We provide no manual cache coherency ioctls to userspace and most
> +	 * architectures make the CPU ops for cache flushing privileged.
> +	 * Therefore we require the underlying IOMMU to support CPU coherent
> +	 * operation.
> +	 */
> +	iommu_prot = IOMMU_CACHE;

Where is this requirement enforced?  AIUI we'd need to test
IOMMU_CAP_CACHE_COHERENCY somewhere since functions like
intel_iommu_map() simply drop the flag when not supported by HW.

This also seems like an issue relative to vfio compatibility that I
don't see mentioned in that patch.  Thanks,

Alex

> +	if (map_flags & IOMMU_IOAS_MAP_WRITEABLE)
> +		iommu_prot |= IOMMU_WRITE;
> +	if (map_flags & IOMMU_IOAS_MAP_READABLE)
> +		iommu_prot |= IOMMU_READ;
> +	return iommu_prot;
> +}
Jason Gunthorpe March 23, 2022, 7:34 p.m. UTC | #2
On Wed, Mar 23, 2022 at 01:10:38PM -0600, Alex Williamson wrote:
> On Fri, 18 Mar 2022 14:27:33 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > +static int conv_iommu_prot(u32 map_flags)
> > +{
> > +	int iommu_prot;
> > +
> > +	/*
> > +	 * We provide no manual cache coherency ioctls to userspace and most
> > +	 * architectures make the CPU ops for cache flushing privileged.
> > +	 * Therefore we require the underlying IOMMU to support CPU coherent
> > +	 * operation.
> > +	 */
> > +	iommu_prot = IOMMU_CACHE;
> 
> Where is this requirement enforced?  AIUI we'd need to test
> IOMMU_CAP_CACHE_COHERENCY somewhere since functions like
> intel_iommu_map() simply drop the flag when not supported by HW.

You are right, the correct thing to do is to fail device
binding/attach entirely if IOMMU_CAP_CACHE_COHERENCY is not there,
however we can't do that because Intel abuses the meaning of
IOMMU_CAP_CACHE_COHERENCY to mean their special no-snoop behavior is
supported.

I want Intel to split out their special no-snoop from IOMMU_CACHE and
IOMMU_CAP_CACHE_COHERENCY so these things have a consistent meaning in
all iommu drivers. Once this is done vfio and iommufd should both
always set IOMMU_CACHE and refuse to work without
IOMMU_CAP_CACHE_COHERENCY. (unless someone knows of an !IOMMU_CACHE
arch that does in fact work today with vfio, somehow, but I don't..)
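
(As a minimal sketch only: such enforcement at bind time could be as simple as
the check below, using the existing iommu_capable() interface that vfio
already calls; the rest of this thread converges on testing
dev_is_dma_coherent() instead.)

	/*
	 * Hypothetical bind-time check: iommufd always maps with IOMMU_CACHE,
	 * so refuse devices whose IOMMU cannot provide coherent DMA.
	 */
	if (!iommu_capable(pdev->dev.bus, IOMMU_CAP_CACHE_COHERENCY))
		return ERR_PTR(-EINVAL);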

I added a fixme about this.

> This also seems like an issue relative to vfio compatibility that I
> don't see mentioned in that patch.  Thanks,

Yes, it was missed in the notes for vfio compat that Intel no-snoop is
not working currently, I fixed it.

Thanks,
Jason
Alex Williamson March 23, 2022, 8:04 p.m. UTC | #3
On Wed, 23 Mar 2022 16:34:39 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Mar 23, 2022 at 01:10:38PM -0600, Alex Williamson wrote:
> > On Fri, 18 Mar 2022 14:27:33 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > +static int conv_iommu_prot(u32 map_flags)
> > > +{
> > > +	int iommu_prot;
> > > +
> > > +	/*
> > > +	 * We provide no manual cache coherency ioctls to userspace and most
> > > +	 * architectures make the CPU ops for cache flushing privileged.
> > > +	 * Therefore we require the underlying IOMMU to support CPU coherent
> > > +	 * operation.
> > > +	 */
> > > +	iommu_prot = IOMMU_CACHE;  
> > 
> > Where is this requirement enforced?  AIUI we'd need to test
> > IOMMU_CAP_CACHE_COHERENCY somewhere since functions like
> > intel_iommu_map() simply drop the flag when not supported by HW.  
> 
> You are right, the correct thing to do is to fail device
> binding/attach entirely if IOMMU_CAP_CACHE_COHERENCY is not there,
> however we can't do that because Intel abuses the meaning of
> IOMMU_CAP_CACHE_COHERENCY to mean their special no-snoop behavior is
> supported.
> 
> I want Intel to split out their special no-snoop from IOMMU_CACHE and
> IOMMU_CAP_CACHE_COHERENCY so these things have a consisent meaning in
> all iommu drivers. Once this is done vfio and iommufd should both
> always set IOMMU_CACHE and refuse to work without
> IOMMU_CAP_CACHE_COHERENCY. (unless someone knows of an !IOMMU_CACHE
> arch that does in fact work today with vfio, somehow, but I don't..)

IIRC, the DMAR on Intel CPUs dedicated to IGD was where we'd often see
lack of snoop-control support, causing us to have mixed coherent and
non-coherent domains.  I don't recall whether the primary IOMMU might also
have lacked this support if you go back far enough in VT-d history.  So I
think there are systems we care about with IOMMUs that can't enforce
DMA coherency.

As it is today, if the IOMMU reports IOMMU_CAP_CACHE_COHERENCY and all
mappings make use of IOMMU_CACHE, then all DMA is coherent.  Are you
suggesting IOMMU_CAP_CACHE_COHERENCY should indicate that all mappings
are coherent regardless of mapping protection flags?  What's the point
of IOMMU_CACHE at that point?

> I added a fixme about this.
> 
> > This also seems like an issue relative to vfio compatibility that I
> > don't see mentioned in that patch.  Thanks,  
> 
> Yes, it was missed in the notes for vfio compat that Intel no-snoop is
> not working currently, I fixed it.

Right, I see it in the comments relative to extensions, but missed in
the commit log.  Thanks,

Alex
Jason Gunthorpe March 23, 2022, 8:34 p.m. UTC | #4
On Wed, Mar 23, 2022 at 02:04:46PM -0600, Alex Williamson wrote:
> On Wed, 23 Mar 2022 16:34:39 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Wed, Mar 23, 2022 at 01:10:38PM -0600, Alex Williamson wrote:
> > > On Fri, 18 Mar 2022 14:27:33 -0300
> > > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >   
> > > > +static int conv_iommu_prot(u32 map_flags)
> > > > +{
> > > > +	int iommu_prot;
> > > > +
> > > > +	/*
> > > > +	 * We provide no manual cache coherency ioctls to userspace and most
> > > > +	 * architectures make the CPU ops for cache flushing privileged.
> > > > +	 * Therefore we require the underlying IOMMU to support CPU coherent
> > > > +	 * operation.
> > > > +	 */
> > > > +	iommu_prot = IOMMU_CACHE;  
> > > 
> > > Where is this requirement enforced?  AIUI we'd need to test
> > > IOMMU_CAP_CACHE_COHERENCY somewhere since functions like
> > > intel_iommu_map() simply drop the flag when not supported by HW.  
> > 
> > You are right, the correct thing to do is to fail device
> > binding/attach entirely if IOMMU_CAP_CACHE_COHERENCY is not there,
> > however we can't do that because Intel abuses the meaning of
> > IOMMU_CAP_CACHE_COHERENCY to mean their special no-snoop behavior is
> > supported.
> > 
> > I want Intel to split out their special no-snoop from IOMMU_CACHE and
> > IOMMU_CAP_CACHE_COHERENCY so these things have a consisent meaning in
> > all iommu drivers. Once this is done vfio and iommufd should both
> > always set IOMMU_CACHE and refuse to work without
> > IOMMU_CAP_CACHE_COHERENCY. (unless someone knows of an !IOMMU_CACHE
> > arch that does in fact work today with vfio, somehow, but I don't..)
> 
> IIRC, the DMAR on Intel CPUs dedicated to IGD was where we'd often see
> lack of snoop-control support, causing us to have mixed coherent and
> non-coherent domains.  I don't recall if you go back far enough in VT-d
> history if the primary IOMMU might have lacked this support.  So I
> think there are systems we care about with IOMMUs that can't enforce
> DMA coherency.
> 
> As it is today, if the IOMMU reports IOMMU_CAP_CACHE_COHERENCY and all
> mappings make use of IOMMU_CACHE, then all DMA is coherent.  Are you
> suggesting IOMMU_CAP_CACHE_COHERENCY should indicate that all mappings
> are coherent regardless of mapping protection flags?  What's the point
> of IOMMU_CACHE at that point?

IOMMU_CAP_CACHE_COHERENCY should return to what it was before Intel's
change.

It only means normal DMAs issued in a normal way are coherent with the
CPU and do not require special cache flushing instructions. ie DMA
issued by a kernel driver using the DMA API.

It does not mean that non-coherent DMA does not exist, or that
platform- or device-specific ways to trigger non-coherence are blocked.

Stated another way, any platform that wires dev_is_dma_coherent() to
true, like all x86 does, must support IOMMU_CACHE and report
IOMMU_CAP_CACHE_COHERENCY for every iommu_domain the platform
supports. The platform obviously declares it supports this in order to
support the in-kernel DMA API.

Thus, a new cap indicating that 'all dma is coherent' or 'no-snoop
blocking' should be created to cover Intel's special need. From what I
know it is only implemented in the Intel driver and apparently only
for some IOMMUs connected to IGD.

> > Yes, it was missed in the notes for vfio compat that Intel no-snoop is
> > not working currently, I fixed it.
> 
> Right, I see it in the comments relative to extensions, but missed in
> the commit log.  Thanks,

Oh good, I remembered it was someplace..

Jason
Jason Gunthorpe March 23, 2022, 10:54 p.m. UTC | #5
On Wed, Mar 23, 2022 at 05:34:18PM -0300, Jason Gunthorpe wrote:

> Stated another way, any platform that wires dev_is_dma_coherent() to
> true, like all x86 does, must support IOMMU_CACHE and report
> IOMMU_CAP_CACHE_COHERENCY for every iommu_domain the platform
> supports. The platform obviously declares it support this in order to
> support the in-kernel DMA API.

That gives me a nice simple idea:

diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index 3c6b95ad026829..8366884df4a030 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -8,6 +8,7 @@
 #include <linux/pci.h>
 #include <linux/irqdomain.h>
 #include <linux/dma-iommu.h>
+#include <linux/dma-map-ops.h>
 
 #include "iommufd_private.h"
 
@@ -61,6 +62,10 @@ struct iommufd_device *iommufd_bind_pci_device(int fd, struct pci_dev *pdev,
 	struct iommu_group *group;
 	int rc;
 
+	/* iommufd always uses IOMMU_CACHE */
+	if (!dev_is_dma_coherent(&pdev->dev))
+		return ERR_PTR(-EINVAL);
+
 	ictx = iommufd_fget(fd);
 	if (!ictx)
 		return ERR_PTR(-EINVAL);
diff --git a/drivers/iommu/iommufd/ioas.c b/drivers/iommu/iommufd/ioas.c
index 48149988c84bbc..3d6df1ffbf93e6 100644
--- a/drivers/iommu/iommufd/ioas.c
+++ b/drivers/iommu/iommufd/ioas.c
@@ -129,7 +129,8 @@ static int conv_iommu_prot(u32 map_flags)
 	 * We provide no manual cache coherency ioctls to userspace and most
 	 * architectures make the CPU ops for cache flushing privileged.
 	 * Therefore we require the underlying IOMMU to support CPU coherent
-	 * operation.
+	 * operation. Support for IOMMU_CACHE is enforced by the
+	 * dev_is_dma_coherent() test during bind.
 	 */
 	iommu_prot = IOMMU_CACHE;
 	if (map_flags & IOMMU_IOAS_MAP_WRITEABLE)

Looking at it I would say all the places that test
IOMMU_CAP_CACHE_COHERENCY can be replaced with dev_is_dma_coherent()
except for the one call in VFIO that is supporting the Intel no-snoop
behavior.

Then we can rename IOMMU_CAP_CACHE_COHERENCY to something like
IOMMU_CAP_ENFORCE_CACHE_COHERENCY and just create a
IOMMU_ENFORCE_CACHE prot flag for Intel IOMMU to use instead of
abusing IOMMU_CACHE.

Jason
Tian, Kevin March 24, 2022, 6:46 a.m. UTC | #6
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, March 24, 2022 4:34 AM
> 
> On Wed, Mar 23, 2022 at 02:04:46PM -0600, Alex Williamson wrote:
> > On Wed, 23 Mar 2022 16:34:39 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > > On Wed, Mar 23, 2022 at 01:10:38PM -0600, Alex Williamson wrote:
> > > > On Fri, 18 Mar 2022 14:27:33 -0300
> > > > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > > >
> > > > > +static int conv_iommu_prot(u32 map_flags)
> > > > > +{
> > > > > +	int iommu_prot;
> > > > > +
> > > > > +	/*
> > > > > +	 * We provide no manual cache coherency ioctls to userspace
> and most
> > > > > +	 * architectures make the CPU ops for cache flushing
> privileged.
> > > > > +	 * Therefore we require the underlying IOMMU to support
> CPU coherent
> > > > > +	 * operation.
> > > > > +	 */
> > > > > +	iommu_prot = IOMMU_CACHE;
> > > >
> > > > Where is this requirement enforced?  AIUI we'd need to test
> > > > IOMMU_CAP_CACHE_COHERENCY somewhere since functions like
> > > > intel_iommu_map() simply drop the flag when not supported by HW.
> > >
> > > You are right, the correct thing to do is to fail device
> > > binding/attach entirely if IOMMU_CAP_CACHE_COHERENCY is not there,
> > > however we can't do that because Intel abuses the meaning of
> > > IOMMU_CAP_CACHE_COHERENCY to mean their special no-snoop
> behavior is
> > > supported.
> > >
> > > I want Intel to split out their special no-snoop from IOMMU_CACHE and
> > > IOMMU_CAP_CACHE_COHERENCY so these things have a consisent
> meaning in
> > > all iommu drivers. Once this is done vfio and iommufd should both
> > > always set IOMMU_CACHE and refuse to work without
> > > IOMMU_CAP_CACHE_COHERENCY. (unless someone knows of
> an !IOMMU_CACHE
> > > arch that does in fact work today with vfio, somehow, but I don't..)
> >
> > IIRC, the DMAR on Intel CPUs dedicated to IGD was where we'd often see
> > lack of snoop-control support, causing us to have mixed coherent and
> > non-coherent domains.  I don't recall if you go back far enough in VT-d
> > history if the primary IOMMU might have lacked this support.  So I
> > think there are systems we care about with IOMMUs that can't enforce
> > DMA coherency.
> >
> > As it is today, if the IOMMU reports IOMMU_CAP_CACHE_COHERENCY and
> all
> > mappings make use of IOMMU_CACHE, then all DMA is coherent.  Are you
> > suggesting IOMMU_CAP_CACHE_COHERENCY should indicate that all
> mappings
> > are coherent regardless of mapping protection flags?  What's the point
> > of IOMMU_CACHE at that point?
> 
> IOMMU_CAP_CACHE_COHERENCY should return to what it was before Intel's
> change.

One nit (as I explained in the previous v1 discussion): it is not that Intel
abuses this capability, as it was originally introduced for Intel's
force-snoop requirement. It is just that when its meaning was later changed
to match what you described below, the original Intel usage was not caught
and fixed properly.
Tian, Kevin March 24, 2022, 7:25 a.m. UTC | #7
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, March 24, 2022 6:55 AM
> 
> On Wed, Mar 23, 2022 at 05:34:18PM -0300, Jason Gunthorpe wrote:
> 
> > Stated another way, any platform that wires dev_is_dma_coherent() to
> > true, like all x86 does, must support IOMMU_CACHE and report
> > IOMMU_CAP_CACHE_COHERENCY for every iommu_domain the platform
> > supports. The platform obviously declares it support this in order to
> > support the in-kernel DMA API.
> 
> That gives me a nice simple idea:
> 
> diff --git a/drivers/iommu/iommufd/device.c
> b/drivers/iommu/iommufd/device.c
> index 3c6b95ad026829..8366884df4a030 100644
> --- a/drivers/iommu/iommufd/device.c
> +++ b/drivers/iommu/iommufd/device.c
> @@ -8,6 +8,7 @@
>  #include <linux/pci.h>
>  #include <linux/irqdomain.h>
>  #include <linux/dma-iommu.h>
> +#include <linux/dma-map-ops.h>
> 
>  #include "iommufd_private.h"
> 
> @@ -61,6 +62,10 @@ struct iommufd_device
> *iommufd_bind_pci_device(int fd, struct pci_dev *pdev,
>  	struct iommu_group *group;
>  	int rc;
> 
> +	/* iommufd always uses IOMMU_CACHE */
> +	if (!dev_is_dma_coherent(&pdev->dev))
> +		return ERR_PTR(-EINVAL);
> +
>  	ictx = iommufd_fget(fd);
>  	if (!ictx)
>  		return ERR_PTR(-EINVAL);
> diff --git a/drivers/iommu/iommufd/ioas.c b/drivers/iommu/iommufd/ioas.c
> index 48149988c84bbc..3d6df1ffbf93e6 100644
> --- a/drivers/iommu/iommufd/ioas.c
> +++ b/drivers/iommu/iommufd/ioas.c
> @@ -129,7 +129,8 @@ static int conv_iommu_prot(u32 map_flags)
>  	 * We provide no manual cache coherency ioctls to userspace and
> most
>  	 * architectures make the CPU ops for cache flushing privileged.
>  	 * Therefore we require the underlying IOMMU to support CPU
> coherent
> -	 * operation.
> +	 * operation. Support for IOMMU_CACHE is enforced by the
> +	 * dev_is_dma_coherent() test during bind.
>  	 */
>  	iommu_prot = IOMMU_CACHE;
>  	if (map_flags & IOMMU_IOAS_MAP_WRITEABLE)
> 
> Looking at it I would say all the places that test
> IOMMU_CAP_CACHE_COHERENCY can be replaced with
> dev_is_dma_coherent()
> except for the one call in VFIO that is supporting the Intel no-snoop
> behavior.
> 
> Then we can rename IOMMU_CAP_CACHE_COHERENCY to something like
> IOMMU_CAP_ENFORCE_CACHE_COHERENCY and just create a
> IOMMU_ENFORCE_CACHE prot flag for Intel IOMMU to use instead of
> abusing IOMMU_CACHE.
> 

Based on that here is a quick tweak of the force-snoop part (not compiled).

Several notes:

- IOMMU_CAP_CACHE_COHERENCY is kept as it's checked in vfio's
  group attach interface. Removing it may require a group_is_dma_coherent();

- vdpa is not changed as force-snoop is only for integrated GPU today which
  is not managed by vdpa. But adding the snoop support is easy if necessary;

- vfio type1 reports the force-snoop fact to KVM via VFIO_DMA_CC_IOMMU. For
  iommufd the compat layer may leverage that interface, but more thought is
  required on how it can be reused for non-compat usage, or whether a new
  interface is required between iommufd and kvm. Per earlier discussions
  Paolo prefers to reuse.

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 5b196cf..06cca04 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -5110,7 +5110,8 @@ static int intel_iommu_map(struct iommu_domain *domain,
 		prot |= DMA_PTE_READ;
 	if (iommu_prot & IOMMU_WRITE)
 		prot |= DMA_PTE_WRITE;
-	if ((iommu_prot & IOMMU_CACHE) && dmar_domain->iommu_snooping)
+	/* nothing to do for IOMMU_CACHE */
+	if ((iommu_prot & IOMMU_SNOOP) && dmar_domain->iommu_snooping)
 		prot |= DMA_PTE_SNP;
 
 	max_addr = iova + size;
@@ -5236,6 +5237,8 @@ static phys_addr_t intel_iommu_iova_to_phys(struct iommu_domain *domain,
 static bool intel_iommu_capable(enum iommu_cap cap)
 {
 	if (cap == IOMMU_CAP_CACHE_COHERENCY)
+		return true;
+	if (cap == IOMMU_CAP_FORCE_SNOOP)
 		return domain_update_iommu_snooping(NULL);
 	if (cap == IOMMU_CAP_INTR_REMAP)
 		return irq_remapping_enabled == 1;
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 9394aa9..abc4cfe 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -2270,6 +2270,9 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 	if (iommu_capable(bus, IOMMU_CAP_CACHE_COHERENCY))
 		domain->prot |= IOMMU_CACHE;
 
+	if (iommu_capable(bus, IOMMU_CAP_FORCE_SNOOP))
+		domain->prot |= IOMMU_SNOOP;
+
 	/*
 	 * Try to match an existing compatible domain.  We don't want to
 	 * preclude an IOMMU driver supporting multiple bus_types and being
@@ -2611,14 +2614,14 @@ static void vfio_iommu_type1_release(void *iommu_data)
 	kfree(iommu);
 }
 
-static int vfio_domains_have_iommu_cache(struct vfio_iommu *iommu)
+static int vfio_domains_have_iommu_snoop(struct vfio_iommu *iommu)
 {
 	struct vfio_domain *domain;
 	int ret = 1;
 
 	mutex_lock(&iommu->lock);
 	list_for_each_entry(domain, &iommu->domain_list, next) {
-		if (!(domain->prot & IOMMU_CACHE)) {
+		if (!(domain->prot & IOMMU_SNOOP)) {
 			ret = 0;
 			break;
 		}
@@ -2641,7 +2644,7 @@ static int vfio_iommu_type1_check_extension(struct vfio_iommu *iommu,
 	case VFIO_DMA_CC_IOMMU:
 		if (!iommu)
 			return 0;
-		return vfio_domains_have_iommu_cache(iommu);
+		return vfio_domains_have_iommu_snoop(iommu);
 	default:
 		return 0;
 	}
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index de0c57a..45184d7 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -21,6 +21,8 @@
 #define IOMMU_CACHE	(1 << 2) /* DMA cache coherency */
 #define IOMMU_NOEXEC	(1 << 3)
 #define IOMMU_MMIO	(1 << 4) /* e.g. things like MSI doorbells */
+#define IOMMU_SNOOP	(1 << 5) /* force DMA to snoop */
+
 /*
  * Where the bus hardware includes a privilege level as part of its access type
  * markings, and certain devices are capable of issuing transactions marked as
@@ -106,6 +108,8 @@ enum iommu_cap {
 					   transactions */
 	IOMMU_CAP_INTR_REMAP,		/* IOMMU supports interrupt isolation */
 	IOMMU_CAP_NOEXEC,		/* IOMMU_NOEXEC flag */
+	IOMMU_CAP_FORCE_SNOOP,		/* IOMMU forces all transactions to
+					   snoop cache */
 };
 
 /* These are the possible reserved region types */

Thanks
Kevin
Jason Gunthorpe March 24, 2022, 1:46 p.m. UTC | #8
On Thu, Mar 24, 2022 at 07:25:03AM +0000, Tian, Kevin wrote:

> Based on that here is a quick tweak of the force-snoop part (not compiled).

I liked your previous idea better, that IOMMU_CAP_CACHE_COHERENCY
started out OK but got weird. So lets fix it back to the way it was.

How about this:

https://github.com/jgunthorpe/linux/commits/intel_no_snoop

b11c19a4b34c2a iommu: Move the Intel no-snoop control off of IOMMU_CACHE
5263947f9d5f36 vfio: Require that device support DMA cache coherence
eab4b381c64a30 iommu: Restore IOMMU_CAP_CACHE_COHERENCY to its original meaning
2752e12bed48f6 iommu: Replace uses of IOMMU_CAP_CACHE_COHERENCY with dev_is_dma_coherent()

If you like it could you take it from here?

Thanks,
Jason
Tian, Kevin March 25, 2022, 2:15 a.m. UTC | #9
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, March 24, 2022 9:46 PM
> 
> On Thu, Mar 24, 2022 at 07:25:03AM +0000, Tian, Kevin wrote:
> 
> > Based on that here is a quick tweak of the force-snoop part (not compiled).
> 
> I liked your previous idea better, that IOMMU_CAP_CACHE_COHERENCY
> started out OK but got weird. So lets fix it back to the way it was.
> 
> How about this:
> 
> https://github.com/jgunthorpe/linux/commits/intel_no_snoop
> 
> b11c19a4b34c2a iommu: Move the Intel no-snoop control off of
> IOMMU_CACHE
> 5263947f9d5f36 vfio: Require that device support DMA cache coherence
> eab4b381c64a30 iommu: Restore IOMMU_CAP_CACHE_COHERENCY to its
> original meaning
> 2752e12bed48f6 iommu: Replace uses of IOMMU_CAP_CACHE_COHERENCY
> with dev_is_dma_coherent()
> 
> If you like it could you take it from here?
> 

this looks good to me except that the 2nd patch (eab4b381) should be
the last one, otherwise it affects bisect. And in that case the subject
would simply be about removing the capability instead of restoring it...

let me find a box to verify it.

Thanks
Kevin
Tian, Kevin March 27, 2022, 2:32 a.m. UTC | #10
> From: Tian, Kevin
> Sent: Friday, March 25, 2022 10:16 AM
> 
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Thursday, March 24, 2022 9:46 PM
> >
> > On Thu, Mar 24, 2022 at 07:25:03AM +0000, Tian, Kevin wrote:
> >
> > > Based on that here is a quick tweak of the force-snoop part (not
> compiled).
> >
> > I liked your previous idea better, that IOMMU_CAP_CACHE_COHERENCY
> > started out OK but got weird. So lets fix it back to the way it was.
> >
> > How about this:
> >
> > https://github.com/jgunthorpe/linux/commits/intel_no_snoop
> >
> > b11c19a4b34c2a iommu: Move the Intel no-snoop control off of
> > IOMMU_CACHE
> > 5263947f9d5f36 vfio: Require that device support DMA cache coherence
> > eab4b381c64a30 iommu: Restore IOMMU_CAP_CACHE_COHERENCY to its
> > original meaning
> > 2752e12bed48f6 iommu: Replace uses of
> IOMMU_CAP_CACHE_COHERENCY
> > with dev_is_dma_coherent()
> >
> > If you like it could you take it from here?
> >
> 
> this looks good to me except that the 2nd patch (eab4b381) should be
> the last one otherwise it affects bisect. and in that case the subject
> would be simply about removing the capability instead of restoring...
> 
> let me find a box to verify it.
> 

My colleague (Terrence) has the environment and helped verify it.

He will give his tested-by after you send out the formal series.

Thanks
Kevin
Jason Gunthorpe March 27, 2022, 2:28 p.m. UTC | #11
On Sun, Mar 27, 2022 at 02:32:23AM +0000, Tian, Kevin wrote:

> > this looks good to me except that the 2nd patch (eab4b381) should be
> > the last one otherwise it affects bisect. and in that case the subject
> > would be simply about removing the capability instead of
> > restoring...

Oh, because VFIO won't send IOMMU_CACHE in this case? Hmm. OK

> > let me find a box to verify it.
> 
> My colleague (Terrence) has the environment and helped verify it.
> 
> He will give his tested-by after you send out the formal series.

Okay, I can send it after the merge window

Jason
Alex Williamson March 28, 2022, 5:17 p.m. UTC | #12
On Thu, 24 Mar 2022 10:46:22 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Thu, Mar 24, 2022 at 07:25:03AM +0000, Tian, Kevin wrote:
> 
> > Based on that here is a quick tweak of the force-snoop part (not compiled).  
> 
> I liked your previous idea better, that IOMMU_CAP_CACHE_COHERENCY
> started out OK but got weird. So lets fix it back to the way it was.
> 
> How about this:
> 
> https://github.com/jgunthorpe/linux/commits/intel_no_snoop
> 
> b11c19a4b34c2a iommu: Move the Intel no-snoop control off of IOMMU_CACHE
> 5263947f9d5f36 vfio: Require that device support DMA cache coherence

I have some issues with the argument here:

  This will block device/platform/iommu combinations that do not
  support cache coherent DMA - but these never worked anyhow as VFIO
  did not expose any interface to perform the required cache
  maintenance operations.

VFIO never intended to provide such operations, it only tried to make
the coherence of the device visible to userspace such that it can
perform operations via other means, for example via KVM.  The "never
worked" statement here seems false.

Commit b11c19a4b34c2a also appears to be a behavioral change.  AIUI
vfio_domain.enforce_cache_coherency would only be set on Intel VT-d
where snoop-control is supported, this translates to KVM emulating
coherency instructions everywhere except VT-d w/ snoop-control.

My understanding of AMD-Vi is that no-snoop TLPs are always coherent, so
this would trigger unnecessary wbinvd emulation on those platforms.  I
don't know if other archs need similar, but it seems we're changing
polarity wrt no-snoop TLPs from "everyone is coherent except this case
on Intel" to "everyone is non-coherent except this opposite case on
Intel".  Thanks,

Alex

> eab4b381c64a30 iommu: Restore IOMMU_CAP_CACHE_COHERENCY to its original meaning
> 2752e12bed48f6 iommu: Replace uses of IOMMU_CAP_CACHE_COHERENCY with dev_is_dma_coherent()
> 
> If you like it could you take it from here?
> 
> Thanks,
> Jason
>
Jason Gunthorpe March 28, 2022, 6:57 p.m. UTC | #13
On Mon, Mar 28, 2022 at 11:17:23AM -0600, Alex Williamson wrote:
> On Thu, 24 Mar 2022 10:46:22 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Thu, Mar 24, 2022 at 07:25:03AM +0000, Tian, Kevin wrote:
> > 
> > > Based on that here is a quick tweak of the force-snoop part (not compiled).  
> > 
> > I liked your previous idea better, that IOMMU_CAP_CACHE_COHERENCY
> > started out OK but got weird. So lets fix it back to the way it was.
> > 
> > How about this:
> > 
> > https://github.com/jgunthorpe/linux/commits/intel_no_snoop
> > 
> > b11c19a4b34c2a iommu: Move the Intel no-snoop control off of IOMMU_CACHE
> > 5263947f9d5f36 vfio: Require that device support DMA cache coherence
> 
> I have some issues with the argument here:
> 
>   This will block device/platform/iommu combinations that do not
>   support cache coherent DMA - but these never worked anyhow as VFIO
>   did not expose any interface to perform the required cache
>   maintenance operations.
> 
> VFIO never intended to provide such operations, it only tried to make
> the coherence of the device visible to userspace such that it can
> perform operations via other means, for example via KVM.  The "never
> worked" statement here seems false.

VFIO is generic. I expect if DPDK connects to VFIO then it will work
properly. That is definitely not the case today when
dev_is_dma_coherent() is false. This is what the paragraph is talking
about.

Remember, x86 wires dev_is_dma_coherent() to true, so this above
remark is not related to anything about x86.

We don't have a way in VFIO to negotiate that 'vfio can only be used
with kvm' so I hope no cases like that really do exist :( Do you know
of any?

> Commit b11c19a4b34c2a also appears to be a behavioral change.  AIUI
> vfio_domain.enforce_cache_coherency would only be set on Intel VT-d
> where snoop-control is supported, this translates to KVM emulating
> coherency instructions everywhere except VT-d w/ snoop-control.

It seems so.

> My understanding of AMD-Vi is that no-snoop TLPs are always coherent, so
> this would trigger unnecessary wbinvd emulation on those platforms.  

I look in the AMD manual and it looks like it works the same as intel
with a dedicated IOPTE bit:

  #define IOMMU_PTE_FC (1ULL << 60)

https://www.amd.com/system/files/TechDocs/48882_IOMMU.pdf Pg 79:

 FC: Force Coherent. Software uses the FC bit in the PTE to indicate
 the source of the upstream coherent attribute state for an
 untranslated DMA transaction. 1 = the IOMMU sets the coherent attribute
 state in the upstream request. 0 = the IOMMU passes on the coherent
 attribute state from the originating request. Device internal
 address/page table translations are considered "untranslated accesses"
 by the IOMMU. The FC state is returned in the ATS response to the device
 endpoint via the state of the (N)oSnoop bit.

So, currently AMD and Intel have exactly the same HW feature with a
different kAPI..

I would say it is wrong that AMD creates kernel owned domains for the
DMA-API to use that do not support snoop.

> don't know if other archs need similar, but it seems we're changing
> polarity wrt no-snoop TLPs from "everyone is coherent except this case
> on Intel" to "everyone is non-coherent except this opposite case on
> Intel".

Yes. We should not assume no-snoop blocking is a HW feature without
explicit knowledge that it is.

From a kAPI compat perspective IOMMU_CAP_CACHE_COHERENCY
only has two impacts:
 - Only on x86 arch it controls kvm_arch_register_noncoherent_dma()
 - It triggers IOMMU_CACHE

If we look at the list of places where IOMMU_CAP_CACHE_COHERENCY is set:

 drivers/iommu/intel/iommu.c
   Must have IOMMU_CACHE set/clear to control no-snoop blocking

 drivers/iommu/amd/iommu.c
   Always sets its no-snoop block, inconsistent with Intel

 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
 drivers/iommu/arm/arm-smmu/arm-smmu.c
 drivers/iommu/arm/arm-smmu/qcom_iommu.c
   Must have IOMMU_CACHE set, ARM arch has no
   kvm_arch_register_noncoherent_dma()

   From what I could tell in the manuals and the prior discussion
   SMMU doesn't block no-snoop.

   ie ARM lies about IOMMU_CAP_CACHE_COHERENCY because it needs
   IOMMU_CACHE set to work.

 drivers/iommu/fsl_pamu_domain.c
 drivers/iommu/s390-iommu.c
   Ignore IOMMU_CACHE, arch has no kvm_arch_register_noncoherent_dma()

   No idea if the HW blocks no-snoop or not, but it doesn't matter.

So other than AMD, it is OK to change the sense, and it makes it clearer
for future driver authors what they are expected to do with this.

Thanks,
Jason
Jason Gunthorpe March 28, 2022, 7:47 p.m. UTC | #14
On Mon, Mar 28, 2022 at 03:57:53PM -0300, Jason Gunthorpe wrote:

> So, currently AMD and Intel have exactly the same HW feature with a
> different kAPI..

I fixed it like below and made the ordering changes Kevin pointed
to. Will send next week after the merge window:

527e438a974a06 iommu: Delete IOMMU_CAP_CACHE_COHERENCY
5cbc8603ffdf20 vfio: Move the Intel no-snoop control off of IOMMU_CACHE
ebc961f93d1af3 iommu: Introduce the domain op enforce_cache_coherency()
79c52a2bb1e60b vfio: Require that devices support DMA cache coherence
02168f961b6a75 iommu: Replace uses of IOMMU_CAP_CACHE_COHERENCY with dev_is_dma_coherent()

'79c can be avoided, we'd just drive IOMMU_CACHE off of
dev_is_dma_coherent() - but if we do that I'd like to properly
document the arch/iommu/platform/kvm combination that is using this..

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 3c0ac3c34a7f9a..f144eb9fea8e31 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -2269,6 +2269,12 @@ static int amd_iommu_def_domain_type(struct device *dev)
 	return 0;
 }
 
+static bool amd_iommu_enforce_cache_coherency(struct iommu_domain *domain)
+{
+	/* IOMMU_PTE_FC is always set */
+	return true;
+}
+
 const struct iommu_ops amd_iommu_ops = {
 	.capable = amd_iommu_capable,
 	.domain_alloc = amd_iommu_domain_alloc,
@@ -2291,6 +2297,7 @@ const struct iommu_ops amd_iommu_ops = {
 		.flush_iotlb_all = amd_iommu_flush_iotlb_all,
 		.iotlb_sync	= amd_iommu_iotlb_sync,
 		.free		= amd_iommu_domain_free,
+		.enforce_cache_coherency = amd_iommu_enforce_cache_coherency,
 	}
 };
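
(For comparison, a hypothetical sketch of what the corresponding Intel
callback could look like, keying off the existing dmar_domain->iommu_snooping
flag; the force_snooping field is an assumption, not code from this thread.)

static bool intel_iommu_enforce_cache_coherency(struct iommu_domain *domain)
{
	struct dmar_domain *dmar_domain = to_dmar_domain(domain);

	/* Can only enforce no-snoop blocking when the HW supports it */
	if (!dmar_domain->iommu_snooping)
		return false;
	/* hypothetical flag the map path would use to set DMA_PTE_SNP */
	dmar_domain->force_snooping = true;
	return true;
}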

Thanks,
Jason
Alex Williamson March 28, 2022, 9:26 p.m. UTC | #15
On Mon, 28 Mar 2022 16:47:49 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, Mar 28, 2022 at 03:57:53PM -0300, Jason Gunthorpe wrote:
> 
> > So, currently AMD and Intel have exactly the same HW feature with a
> > different kAPI..  
> 
> I fixed it like below and made the ordering changes Kevin pointed
> to. Will send next week after the merge window:
> 
> 527e438a974a06 iommu: Delete IOMMU_CAP_CACHE_COHERENCY
> 5cbc8603ffdf20 vfio: Move the Intel no-snoop control off of IOMMU_CACHE
> ebc961f93d1af3 iommu: Introduce the domain op enforce_cache_coherency()
> 79c52a2bb1e60b vfio: Require that devices support DMA cache coherence
> 02168f961b6a75 iommu: Replace uses of IOMMU_CAP_CACHE_COHERENCY with dev_is_dma_coherent()
> 
> '79c can be avoided, we'd just drive IOMMU_CACHE off of
> dev_is_dma_coherent() - but if we do that I'd like to properly
> document the arch/iommu/platform/kvm combination that is using this..

We can try to enforce dev_is_dma_coherent(), as you note it's not going
to affect any x86 users.  arm64 is the only obviously relevant arch that
defines ARCH_HAS_SYNC_DMA_FOR_{DEVICE,CPU} but the device.dma_coherent
setting comes from ACPI/OF firmware, so someone from ARM land will need
to shout if this is an issue.  I think we'd need to back off and go
with documentation if a broken use case shows up.  Thanks,

Alex

 
> diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
> index 3c0ac3c34a7f9a..f144eb9fea8e31 100644
> --- a/drivers/iommu/amd/iommu.c
> +++ b/drivers/iommu/amd/iommu.c
> @@ -2269,6 +2269,12 @@ static int amd_iommu_def_domain_type(struct device *dev)
>  	return 0;
>  }
>  
> +static bool amd_iommu_enforce_cache_coherency(struct iommu_domain *domain)
> +{
> +	/* IOMMU_PTE_FC is always set */
> +	return true;
> +}
> +
>  const struct iommu_ops amd_iommu_ops = {
>  	.capable = amd_iommu_capable,
>  	.domain_alloc = amd_iommu_domain_alloc,
> @@ -2291,6 +2297,7 @@ const struct iommu_ops amd_iommu_ops = {
>  		.flush_iotlb_all = amd_iommu_flush_iotlb_all,
>  		.iotlb_sync	= amd_iommu_iotlb_sync,
>  		.free		= amd_iommu_domain_free,
> +		.enforce_cache_coherency = amd_iommu_enforce_cache_coherency,
>  	}
>  };
> 
> Thanks,
> Jason
>
Liu, Yi L March 30, 2022, 1:35 p.m. UTC | #16
Hi Jason,

On 2022/3/19 01:27, Jason Gunthorpe wrote:
> Connect the IOAS to its IOCTL interface. This exposes most of the
> functionality in the io_pagetable to userspace.
> 
> This is intended to be the core of the generic interface that IOMMUFD will
> provide. Every IOMMU driver should be able to implement an iommu_domain
> that is compatible with this generic mechanism.
> 
> It is also designed to be easy to use for simple non virtual machine
> monitor users, like DPDK:
>   - Universal simple support for all IOMMUs (no PPC special path)
>   - An IOVA allocator that considerds the aperture and the reserved ranges
>   - io_pagetable allows any number of iommu_domains to be connected to the
>     IOAS
> 
> Along with room in the design to add non-generic features to cater to
> specific HW functionality.
> 
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>   drivers/iommu/iommufd/Makefile          |   1 +
>   drivers/iommu/iommufd/ioas.c            | 248 ++++++++++++++++++++++++
>   drivers/iommu/iommufd/iommufd_private.h |  27 +++
>   drivers/iommu/iommufd/main.c            |  17 ++
>   include/uapi/linux/iommufd.h            | 132 +++++++++++++
>   5 files changed, 425 insertions(+)
>   create mode 100644 drivers/iommu/iommufd/ioas.c
> 
> diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
> index b66a8c47ff55ec..2b4f36f1b72f9d 100644
> --- a/drivers/iommu/iommufd/Makefile
> +++ b/drivers/iommu/iommufd/Makefile
> @@ -1,6 +1,7 @@
>   # SPDX-License-Identifier: GPL-2.0-only
>   iommufd-y := \
>   	io_pagetable.o \
> +	ioas.o \
>   	main.o \
>   	pages.o
>   
> diff --git a/drivers/iommu/iommufd/ioas.c b/drivers/iommu/iommufd/ioas.c
> new file mode 100644
> index 00000000000000..c530b2ba74b06b
> --- /dev/null
> +++ b/drivers/iommu/iommufd/ioas.c
> @@ -0,0 +1,248 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
> + */
> +#include <linux/interval_tree.h>
> +#include <linux/iommufd.h>
> +#include <linux/iommu.h>
> +#include <uapi/linux/iommufd.h>
> +
> +#include "io_pagetable.h"
> +
> +void iommufd_ioas_destroy(struct iommufd_object *obj)
> +{
> +	struct iommufd_ioas *ioas = container_of(obj, struct iommufd_ioas, obj);
> +	int rc;
> +
> +	rc = iopt_unmap_all(&ioas->iopt);
> +	WARN_ON(rc);
> +	iopt_destroy_table(&ioas->iopt);
> +}
> +
> +struct iommufd_ioas *iommufd_ioas_alloc(struct iommufd_ctx *ictx)
> +{
> +	struct iommufd_ioas *ioas;
> +	int rc;
> +
> +	ioas = iommufd_object_alloc(ictx, ioas, IOMMUFD_OBJ_IOAS);
> +	if (IS_ERR(ioas))
> +		return ioas;
> +
> +	rc = iopt_init_table(&ioas->iopt);
> +	if (rc)
> +		goto out_abort;
> +	return ioas;
> +
> +out_abort:
> +	iommufd_object_abort(ictx, &ioas->obj);
> +	return ERR_PTR(rc);
> +}
> +
> +int iommufd_ioas_alloc_ioctl(struct iommufd_ucmd *ucmd)
> +{
> +	struct iommu_ioas_alloc *cmd = ucmd->cmd;
> +	struct iommufd_ioas *ioas;
> +	int rc;
> +
> +	if (cmd->flags)
> +		return -EOPNOTSUPP;
> +
> +	ioas = iommufd_ioas_alloc(ucmd->ictx);
> +	if (IS_ERR(ioas))
> +		return PTR_ERR(ioas);
> +
> +	cmd->out_ioas_id = ioas->obj.id;
> +	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
> +	if (rc)
> +		goto out_table;
> +	iommufd_object_finalize(ucmd->ictx, &ioas->obj);
> +	return 0;
> +
> +out_table:
> +	iommufd_ioas_destroy(&ioas->obj);
> +	return rc;
> +}
> +
> +int iommufd_ioas_iova_ranges(struct iommufd_ucmd *ucmd)
> +{
> +	struct iommu_ioas_iova_ranges __user *uptr = ucmd->ubuffer;
> +	struct iommu_ioas_iova_ranges *cmd = ucmd->cmd;
> +	struct iommufd_ioas *ioas;
> +	struct interval_tree_span_iter span;
> +	u32 max_iovas;
> +	int rc;
> +
> +	if (cmd->__reserved)
> +		return -EOPNOTSUPP;
> +
> +	max_iovas = cmd->size - sizeof(*cmd);
> +	if (max_iovas % sizeof(cmd->out_valid_iovas[0]))
> +		return -EINVAL;
> +	max_iovas /= sizeof(cmd->out_valid_iovas[0]);
> +
> +	ioas = iommufd_get_ioas(ucmd, cmd->ioas_id);
> +	if (IS_ERR(ioas))
> +		return PTR_ERR(ioas);
> +
> +	down_read(&ioas->iopt.iova_rwsem);
> +	cmd->out_num_iovas = 0;
> +	for (interval_tree_span_iter_first(
> +		     &span, &ioas->iopt.reserved_iova_itree, 0, ULONG_MAX);
> +	     !interval_tree_span_iter_done(&span);
> +	     interval_tree_span_iter_next(&span)) {
> +		if (!span.is_hole)
> +			continue;
> +		if (cmd->out_num_iovas < max_iovas) {
> +			rc = put_user((u64)span.start_hole,
> +				      &uptr->out_valid_iovas[cmd->out_num_iovas]
> +					       .start);
> +			if (rc)
> +				goto out_put;
> +			rc = put_user(
> +				(u64)span.last_hole,
> +				&uptr->out_valid_iovas[cmd->out_num_iovas].last);
> +			if (rc)
> +				goto out_put;
> +		}
> +		cmd->out_num_iovas++;
> +	}
> +	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
> +	if (rc)
> +		goto out_put;
> +	if (cmd->out_num_iovas > max_iovas)
> +		rc = -EMSGSIZE;
> +out_put:
> +	up_read(&ioas->iopt.iova_rwsem);
> +	iommufd_put_object(&ioas->obj);
> +	return rc;
> +}
> +
> +static int conv_iommu_prot(u32 map_flags)
> +{
> +	int iommu_prot;
> +
> +	/*
> +	 * We provide no manual cache coherency ioctls to userspace and most
> +	 * architectures make the CPU ops for cache flushing privileged.
> +	 * Therefore we require the underlying IOMMU to support CPU coherent
> +	 * operation.
> +	 */
> +	iommu_prot = IOMMU_CACHE;
> +	if (map_flags & IOMMU_IOAS_MAP_WRITEABLE)
> +		iommu_prot |= IOMMU_WRITE;
> +	if (map_flags & IOMMU_IOAS_MAP_READABLE)
> +		iommu_prot |= IOMMU_READ;
> +	return iommu_prot;
> +}
> +
> +int iommufd_ioas_map(struct iommufd_ucmd *ucmd)
> +{
> +	struct iommu_ioas_map *cmd = ucmd->cmd;
> +	struct iommufd_ioas *ioas;
> +	unsigned int flags = 0;
> +	unsigned long iova;
> +	int rc;
> +
> +	if ((cmd->flags &
> +	     ~(IOMMU_IOAS_MAP_FIXED_IOVA | IOMMU_IOAS_MAP_WRITEABLE |
> +	       IOMMU_IOAS_MAP_READABLE)) ||
> +	    cmd->__reserved)
> +		return -EOPNOTSUPP;
> +	if (cmd->iova >= ULONG_MAX || cmd->length >= ULONG_MAX)
> +		return -EOVERFLOW;
> +
> +	ioas = iommufd_get_ioas(ucmd, cmd->ioas_id);
> +	if (IS_ERR(ioas))
> +		return PTR_ERR(ioas);
> +
> +	if (!(cmd->flags & IOMMU_IOAS_MAP_FIXED_IOVA))
> +		flags = IOPT_ALLOC_IOVA;
> +	iova = cmd->iova;
> +	rc = iopt_map_user_pages(&ioas->iopt, &iova,
> +				 u64_to_user_ptr(cmd->user_va), cmd->length,
> +				 conv_iommu_prot(cmd->flags), flags);
> +	if (rc)
> +		goto out_put;
> +
> +	cmd->iova = iova;
> +	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
> +out_put:
> +	iommufd_put_object(&ioas->obj);
> +	return rc;
> +}
> +
> +int iommufd_ioas_copy(struct iommufd_ucmd *ucmd)
> +{
> +	struct iommu_ioas_copy *cmd = ucmd->cmd;
> +	struct iommufd_ioas *src_ioas;
> +	struct iommufd_ioas *dst_ioas;
> +	struct iopt_pages *pages;
> +	unsigned int flags = 0;
> +	unsigned long iova;
> +	unsigned long start_byte;
> +	int rc;
> +
> +	if ((cmd->flags &
> +	     ~(IOMMU_IOAS_MAP_FIXED_IOVA | IOMMU_IOAS_MAP_WRITEABLE |
> +	       IOMMU_IOAS_MAP_READABLE)))
> +		return -EOPNOTSUPP;
> +	if (cmd->length >= ULONG_MAX)
> +		return -EOVERFLOW;
> +
> +	src_ioas = iommufd_get_ioas(ucmd, cmd->src_ioas_id);
> +	if (IS_ERR(src_ioas))
> +		return PTR_ERR(src_ioas);
> +	/* FIXME: copy is not limited to an exact match anymore */
> +	pages = iopt_get_pages(&src_ioas->iopt, cmd->src_iova, &start_byte,
> +			       cmd->length);
> +	iommufd_put_object(&src_ioas->obj);
> +	if (IS_ERR(pages))
> +		return PTR_ERR(pages);
> +
> +	dst_ioas = iommufd_get_ioas(ucmd, cmd->dst_ioas_id);
> +	if (IS_ERR(dst_ioas)) {
> +		iopt_put_pages(pages);
> +		return PTR_ERR(dst_ioas);
> +	}
> +
> +	if (!(cmd->flags & IOMMU_IOAS_MAP_FIXED_IOVA))
> +		flags = IOPT_ALLOC_IOVA;
> +	iova = cmd->dst_iova;
> +	rc = iopt_map_pages(&dst_ioas->iopt, pages, &iova, start_byte,
> +			    cmd->length, conv_iommu_prot(cmd->flags), flags);
> +	if (rc) {
> +		iopt_put_pages(pages);
> +		goto out_put_dst;
> +	}
> +
> +	cmd->dst_iova = iova;
> +	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
> +out_put_dst:
> +	iommufd_put_object(&dst_ioas->obj);
> +	return rc;
> +}
> +
> +int iommufd_ioas_unmap(struct iommufd_ucmd *ucmd)
> +{
> +	struct iommu_ioas_unmap *cmd = ucmd->cmd;
> +	struct iommufd_ioas *ioas;
> +	int rc;
> +
> +	ioas = iommufd_get_ioas(ucmd, cmd->ioas_id);
> +	if (IS_ERR(ioas))
> +		return PTR_ERR(ioas);
> +
> +	if (cmd->iova == 0 && cmd->length == U64_MAX) {
> +		rc = iopt_unmap_all(&ioas->iopt);
> +	} else {
> +		if (cmd->iova >= ULONG_MAX || cmd->length >= ULONG_MAX) {
> +			rc = -EOVERFLOW;
> +			goto out_put;
> +		}
> +		rc = iopt_unmap_iova(&ioas->iopt, cmd->iova, cmd->length);
> +	}
> +
> +out_put:
> +	iommufd_put_object(&ioas->obj);
> +	return rc;
> +}
> diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
> index bcf08e61bc87e9..d24c9dac5a82a9 100644
> --- a/drivers/iommu/iommufd/iommufd_private.h
> +++ b/drivers/iommu/iommufd/iommufd_private.h
> @@ -96,6 +96,7 @@ static inline int iommufd_ucmd_respond(struct iommufd_ucmd *ucmd,
>   enum iommufd_object_type {
>   	IOMMUFD_OBJ_NONE,
>   	IOMMUFD_OBJ_ANY = IOMMUFD_OBJ_NONE,
> +	IOMMUFD_OBJ_IOAS,
>   	IOMMUFD_OBJ_MAX,
>   };
>   
> @@ -147,4 +148,30 @@ struct iommufd_object *_iommufd_object_alloc(struct iommufd_ctx *ictx,
>   			     type),                                            \
>   		     typeof(*(ptr)), obj)
>   
> +/*
> + * The IO Address Space (IOAS) pagetable is a virtual page table backed by the
> + * io_pagetable object. It is a user controlled mapping of IOVA -> PFNs. The
> + * mapping is copied into all of the associated domains and made available to
> + * in-kernel users.
> + */
> +struct iommufd_ioas {
> +	struct iommufd_object obj;
> +	struct io_pagetable iopt;
> +};
> +
> +static inline struct iommufd_ioas *iommufd_get_ioas(struct iommufd_ucmd *ucmd,
> +						    u32 id)
> +{
> +	return container_of(iommufd_get_object(ucmd->ictx, id,
> +					       IOMMUFD_OBJ_IOAS),
> +			    struct iommufd_ioas, obj);
> +}
> +
> +struct iommufd_ioas *iommufd_ioas_alloc(struct iommufd_ctx *ictx);
> +int iommufd_ioas_alloc_ioctl(struct iommufd_ucmd *ucmd);
> +void iommufd_ioas_destroy(struct iommufd_object *obj);
> +int iommufd_ioas_iova_ranges(struct iommufd_ucmd *ucmd);
> +int iommufd_ioas_map(struct iommufd_ucmd *ucmd);
> +int iommufd_ioas_copy(struct iommufd_ucmd *ucmd);
> +int iommufd_ioas_unmap(struct iommufd_ucmd *ucmd);
>   #endif
> diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
> index ae8db2f663004f..e506f493b54cfe 100644
> --- a/drivers/iommu/iommufd/main.c
> +++ b/drivers/iommu/iommufd/main.c
> @@ -184,6 +184,10 @@ static int iommufd_fops_release(struct inode *inode, struct file *filp)
>   }
>   
>   union ucmd_buffer {
> +	struct iommu_ioas_alloc alloc;
> +	struct iommu_ioas_iova_ranges iova_ranges;
> +	struct iommu_ioas_map map;
> +	struct iommu_ioas_unmap unmap;
>   	struct iommu_destroy destroy;
>   };
>   
> @@ -205,6 +209,16 @@ struct iommufd_ioctl_op {
>   	}
>   static struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
>   	IOCTL_OP(IOMMU_DESTROY, iommufd_destroy, struct iommu_destroy, id),
> +	IOCTL_OP(IOMMU_IOAS_ALLOC, iommufd_ioas_alloc_ioctl,
> +		 struct iommu_ioas_alloc, out_ioas_id),
> +	IOCTL_OP(IOMMU_IOAS_COPY, iommufd_ioas_copy, struct iommu_ioas_copy,
> +		 src_iova),
> +	IOCTL_OP(IOMMU_IOAS_IOVA_RANGES, iommufd_ioas_iova_ranges,
> +		 struct iommu_ioas_iova_ranges, __reserved),
> +	IOCTL_OP(IOMMU_IOAS_MAP, iommufd_ioas_map, struct iommu_ioas_map,
> +		 __reserved),
> +	IOCTL_OP(IOMMU_IOAS_UNMAP, iommufd_ioas_unmap, struct iommu_ioas_unmap,
> +		 length),
>   };
>   
>   static long iommufd_fops_ioctl(struct file *filp, unsigned int cmd,
> @@ -270,6 +284,9 @@ struct iommufd_ctx *iommufd_fget(int fd)
>   }
>   
>   static struct iommufd_object_ops iommufd_object_ops[] = {
> +	[IOMMUFD_OBJ_IOAS] = {
> +		.destroy = iommufd_ioas_destroy,
> +	},
>   };
>   
>   static struct miscdevice iommu_misc_dev = {
> diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
> index 2f7f76ec6db4cb..ba7b17ec3002e3 100644
> --- a/include/uapi/linux/iommufd.h
> +++ b/include/uapi/linux/iommufd.h
> @@ -37,6 +37,11 @@
>   enum {
>   	IOMMUFD_CMD_BASE = 0x80,
>   	IOMMUFD_CMD_DESTROY = IOMMUFD_CMD_BASE,
> +	IOMMUFD_CMD_IOAS_ALLOC,
> +	IOMMUFD_CMD_IOAS_IOVA_RANGES,
> +	IOMMUFD_CMD_IOAS_MAP,
> +	IOMMUFD_CMD_IOAS_COPY,
> +	IOMMUFD_CMD_IOAS_UNMAP,
>   };
>   
>   /**
> @@ -52,4 +57,131 @@ struct iommu_destroy {
>   };
>   #define IOMMU_DESTROY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_DESTROY)
>   
> +/**
> + * struct iommu_ioas_alloc - ioctl(IOMMU_IOAS_ALLOC)
> + * @size: sizeof(struct iommu_ioas_alloc)
> + * @flags: Must be 0
> + * @out_ioas_id: Output IOAS ID for the allocated object
> + *
> + * Allocate an IO Address Space (IOAS) which holds an IO Virtual Address (IOVA)
> + * to memory mapping.
> + */
> +struct iommu_ioas_alloc {
> +	__u32 size;
> +	__u32 flags;
> +	__u32 out_ioas_id;
> +};
> +#define IOMMU_IOAS_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_ALLOC)
> +
> +/**
> + * struct iommu_ioas_iova_ranges - ioctl(IOMMU_IOAS_IOVA_RANGES)
> + * @size: sizeof(struct iommu_ioas_iova_ranges)
> + * @ioas_id: IOAS ID to read ranges from
> + * @out_num_iovas: Output total number of ranges in the IOAS
> + * @__reserved: Must be 0
> + * @out_valid_iovas: Array of valid IOVA ranges. The array length is the smaller
> + *                   of out_num_iovas or the length implied by size.
> + * @out_valid_iovas.start: First IOVA in the allowed range
> + * @out_valid_iovas.last: Inclusive last IOVA in the allowed range
> + *
> + * Query an IOAS for ranges of allowed IOVAs. Operation outside these ranges is
> + * not allowed. out_num_iovas will be set to the total number of iovas
> + * and the out_valid_iovas[] will be filled in as space permits.
> + * size should include the allocated flex array.
> + */
> +struct iommu_ioas_iova_ranges {
> +	__u32 size;
> +	__u32 ioas_id;
> +	__u32 out_num_iovas;
> +	__u32 __reserved;
> +	struct iommu_valid_iovas {
> +		__aligned_u64 start;
> +		__aligned_u64 last;
> +	} out_valid_iovas[];
> +};
> +#define IOMMU_IOAS_IOVA_RANGES _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_IOVA_RANGES)
> +
> +/**
> + * enum iommufd_ioas_map_flags - Flags for map and copy
> + * @IOMMU_IOAS_MAP_FIXED_IOVA: If clear the kernel will compute an appropriate
> + *                             IOVA to place the mapping at
> + * @IOMMU_IOAS_MAP_WRITEABLE: DMA is allowed to write to this mapping
> + * @IOMMU_IOAS_MAP_READABLE: DMA is allowed to read from this mapping
> + */
> +enum iommufd_ioas_map_flags {
> +	IOMMU_IOAS_MAP_FIXED_IOVA = 1 << 0,
> +	IOMMU_IOAS_MAP_WRITEABLE = 1 << 1,
> +	IOMMU_IOAS_MAP_READABLE = 1 << 2,
> +};
> +
> +/**
> + * struct iommu_ioas_map - ioctl(IOMMU_IOAS_MAP)
> + * @size: sizeof(struct iommu_ioas_map)
> + * @flags: Combination of enum iommufd_ioas_map_flags
> + * @ioas_id: IOAS ID to change the mapping of
> + * @__reserved: Must be 0
> + * @user_va: Userspace pointer to start mapping from
> + * @length: Number of bytes to map
> + * @iova: IOVA the mapping was placed at. If IOMMU_IOAS_MAP_FIXED_IOVA is set
> + *        then this must be provided as input.
> + *
> + * Set an IOVA mapping from a user pointer. If FIXED_IOVA is specified then the
> + * mapping will be established at iova, otherwise a suitable location will be
> + * automatically selected and returned in iova.
> + */
> +struct iommu_ioas_map {
> +	__u32 size;
> +	__u32 flags;
> +	__u32 ioas_id;
> +	__u32 __reserved;
> +	__aligned_u64 user_va;
> +	__aligned_u64 length;
> +	__aligned_u64 iova;
> +};
> +#define IOMMU_IOAS_MAP _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_MAP)
> +
> +/**
> + * struct iommu_ioas_copy - ioctl(IOMMU_IOAS_COPY)
> + * @size: sizeof(struct iommu_ioas_copy)
> + * @flags: Combination of enum iommufd_ioas_map_flags
> + * @dst_ioas_id: IOAS ID to change the mapping of
> + * @src_ioas_id: IOAS ID to copy from

so the dst and src ioas_ids are allocated via the same iommufd,
right? Just out of curiosity, do you think it is possible that
the src/dst ioas_ids could come from different iommufds? In that case
we may need to add src/dst iommufd fields. It's not needed today, just to
see if there is any blocker in the kernel to supporting such a copy. :-)

> + * @length: Number of bytes to copy and map
> + * @dst_iova: IOVA the mapping was placed at. If IOMMU_IOAS_MAP_FIXED_IOVA is
> + *            set then this must be provided as input.
> + * @src_iova: IOVA to start the copy
> + *
> + * Copy an already existing mapping from src_ioas_id and establish it in
> + * dst_ioas_id. The src iova/length must exactly match a range used with
> + * IOMMU_IOAS_MAP.
> + */
> +struct iommu_ioas_copy {
> +	__u32 size;
> +	__u32 flags;
> +	__u32 dst_ioas_id;
> +	__u32 src_ioas_id;
> +	__aligned_u64 length;
> +	__aligned_u64 dst_iova;
> +	__aligned_u64 src_iova;
> +};
> +#define IOMMU_IOAS_COPY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_COPY)
> +
> +/**
> + * struct iommu_ioas_unmap - ioctl(IOMMU_IOAS_UNMAP)
> + * @size: sizeof(struct iommu_ioas_copy)
> + * @ioas_id: IOAS ID to change the mapping of
> + * @iova: IOVA to start the unmapping at
> + * @length: Number of bytes to unmap
> + *
> + * Unmap an IOVA range. The iova/length must exactly match a range
> + * used with IOMMU_IOAS_PAGETABLE_MAP, or be the values 0 & U64_MAX.
> + * In the latter case all IOVAs will be unmaped.
> + */
> +struct iommu_ioas_unmap {
> +	__u32 size;
> +	__u32 ioas_id;
> +	__aligned_u64 iova;
> +	__aligned_u64 length;
> +};
> +#define IOMMU_IOAS_UNMAP _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_UNMAP)
>   #endif
David Gibson March 31, 2022, 4:36 a.m. UTC | #17
On Fri, Mar 18, 2022 at 02:27:33PM -0300, Jason Gunthorpe wrote:
> Connect the IOAS to its IOCTL interface. This exposes most of the
> functionality in the io_pagetable to userspace.
> 
> This is intended to be the core of the generic interface that IOMMUFD will
> provide. Every IOMMU driver should be able to implement an iommu_domain
> that is compatible with this generic mechanism.
> 
> It is also designed to be easy to use for simple non virtual machine
> monitor users, like DPDK:
>  - Universal simple support for all IOMMUs (no PPC special path)
>  - An IOVA allocator that considerds the aperture and the reserved ranges
>  - io_pagetable allows any number of iommu_domains to be connected to the
>    IOAS
> 
> Along with room in the design to add non-generic features to cater to
> specific HW functionality.


[snip]
> +/**
> + * struct iommu_ioas_alloc - ioctl(IOMMU_IOAS_ALLOC)
> + * @size: sizeof(struct iommu_ioas_alloc)
> + * @flags: Must be 0
> + * @out_ioas_id: Output IOAS ID for the allocated object
> + *
> + * Allocate an IO Address Space (IOAS) which holds an IO Virtual Address (IOVA)
> + * to memory mapping.
> + */
> +struct iommu_ioas_alloc {
> +	__u32 size;
> +	__u32 flags;
> +	__u32 out_ioas_id;
> +};
> +#define IOMMU_IOAS_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_ALLOC)
> +
> +/**
> + * struct iommu_ioas_iova_ranges - ioctl(IOMMU_IOAS_IOVA_RANGES)
> + * @size: sizeof(struct iommu_ioas_iova_ranges)
> + * @ioas_id: IOAS ID to read ranges from
> + * @out_num_iovas: Output total number of ranges in the IOAS
> + * @__reserved: Must be 0
> + * @out_valid_iovas: Array of valid IOVA ranges. The array length is the smaller
> + *                   of out_num_iovas or the length implied by size.
> + * @out_valid_iovas.start: First IOVA in the allowed range
> + * @out_valid_iovas.last: Inclusive last IOVA in the allowed range
> + *
> + * Query an IOAS for ranges of allowed IOVAs. Operation outside these ranges is
> + * not allowed. out_num_iovas will be set to the total number of iovas
> + * and the out_valid_iovas[] will be filled in as space permits.
> + * size should include the allocated flex array.
> + */
> +struct iommu_ioas_iova_ranges {
> +	__u32 size;
> +	__u32 ioas_id;
> +	__u32 out_num_iovas;
> +	__u32 __reserved;
> +	struct iommu_valid_iovas {
> +		__aligned_u64 start;
> +		__aligned_u64 last;
> +	} out_valid_iovas[];
> +};
> +#define IOMMU_IOAS_IOVA_RANGES _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_IOVA_RANGES)

Is the information returned by this valid for the lifetime of the IOAS,
or can it change?  If it can change, what events can change it?

If it *can't* change, then how do we have enough information to
determine this at ALLOC time, since we don't necessarily know which
(if any) hardware IOMMU will be attached to it.

> +/**
> + * enum iommufd_ioas_map_flags - Flags for map and copy
> + * @IOMMU_IOAS_MAP_FIXED_IOVA: If clear the kernel will compute an appropriate
> + *                             IOVA to place the mapping at
> + * @IOMMU_IOAS_MAP_WRITEABLE: DMA is allowed to write to this mapping
> + * @IOMMU_IOAS_MAP_READABLE: DMA is allowed to read from this mapping
> + */
> +enum iommufd_ioas_map_flags {
> +	IOMMU_IOAS_MAP_FIXED_IOVA = 1 << 0,
> +	IOMMU_IOAS_MAP_WRITEABLE = 1 << 1,
> +	IOMMU_IOAS_MAP_READABLE = 1 << 2,
> +};
> +
> +/**
> + * struct iommu_ioas_map - ioctl(IOMMU_IOAS_MAP)
> + * @size: sizeof(struct iommu_ioas_map)
> + * @flags: Combination of enum iommufd_ioas_map_flags
> + * @ioas_id: IOAS ID to change the mapping of
> + * @__reserved: Must be 0
> + * @user_va: Userspace pointer to start mapping from
> + * @length: Number of bytes to map
> + * @iova: IOVA the mapping was placed at. If IOMMU_IOAS_MAP_FIXED_IOVA is set
> + *        then this must be provided as input.
> + *
> + * Set an IOVA mapping from a user pointer. If FIXED_IOVA is specified then the
> + * mapping will be established at iova, otherwise a suitable location will be
> + * automatically selected and returned in iova.
> + */
> +struct iommu_ioas_map {
> +	__u32 size;
> +	__u32 flags;
> +	__u32 ioas_id;
> +	__u32 __reserved;
> +	__aligned_u64 user_va;
> +	__aligned_u64 length;
> +	__aligned_u64 iova;
> +};
> +#define IOMMU_IOAS_MAP _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_MAP)
> +
> +/**
> + * struct iommu_ioas_copy - ioctl(IOMMU_IOAS_COPY)
> + * @size: sizeof(struct iommu_ioas_copy)
> + * @flags: Combination of enum iommufd_ioas_map_flags
> + * @dst_ioas_id: IOAS ID to change the mapping of
> + * @src_ioas_id: IOAS ID to copy from
> + * @length: Number of bytes to copy and map
> + * @dst_iova: IOVA the mapping was placed at. If IOMMU_IOAS_MAP_FIXED_IOVA is
> + *            set then this must be provided as input.
> + * @src_iova: IOVA to start the copy
> + *
> + * Copy an already existing mapping from src_ioas_id and establish it in
> + * dst_ioas_id. The src iova/length must exactly match a range used with
> + * IOMMU_IOAS_MAP.
> + */
> +struct iommu_ioas_copy {
> +	__u32 size;
> +	__u32 flags;
> +	__u32 dst_ioas_id;
> +	__u32 src_ioas_id;
> +	__aligned_u64 length;
> +	__aligned_u64 dst_iova;
> +	__aligned_u64 src_iova;
> +};
> +#define IOMMU_IOAS_COPY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_COPY)

Since it can only copy a single mapping, what's the benefit of this
over just repeating an IOAS_MAP in the new IOAS?

> +/**
> + * struct iommu_ioas_unmap - ioctl(IOMMU_IOAS_UNMAP)
> + * @size: sizeof(struct iommu_ioas_unmap)
> + * @ioas_id: IOAS ID to change the mapping of
> + * @iova: IOVA to start the unmapping at
> + * @length: Number of bytes to unmap
> + *
> + * Unmap an IOVA range. The iova/length must exactly match a range
> + * used with IOMMU_IOAS_MAP, or be the values 0 & U64_MAX.
> + * In the latter case all IOVAs will be unmapped.
> + */
> +struct iommu_ioas_unmap {
> +	__u32 size;
> +	__u32 ioas_id;
> +	__aligned_u64 iova;
> +	__aligned_u64 length;
> +};
> +#define IOMMU_IOAS_UNMAP _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_UNMAP)
>  #endif
Tian, Kevin March 31, 2022, 5:41 a.m. UTC | #18
> From: David Gibson <david@gibson.dropbear.id.au>
> Sent: Thursday, March 31, 2022 12:36 PM
> > +
> > +/**
> > + * struct iommu_ioas_iova_ranges - ioctl(IOMMU_IOAS_IOVA_RANGES)
> > + * @size: sizeof(struct iommu_ioas_iova_ranges)
> > + * @ioas_id: IOAS ID to read ranges from
> > + * @out_num_iovas: Output total number of ranges in the IOAS
> > + * @__reserved: Must be 0
> > + * @out_valid_iovas: Array of valid IOVA ranges. The array length is the
> smaller
> > + *                   of out_num_iovas or the length implied by size.
> > + * @out_valid_iovas.start: First IOVA in the allowed range
> > + * @out_valid_iovas.last: Inclusive last IOVA in the allowed range
> > + *
> > + * Query an IOAS for ranges of allowed IOVAs. Operation outside these
> ranges is
> > + * not allowed. out_num_iovas will be set to the total number of iovas
> > + * and the out_valid_iovas[] will be filled in as space permits.
> > + * size should include the allocated flex array.
> > + */
> > +struct iommu_ioas_iova_ranges {
> > +	__u32 size;
> > +	__u32 ioas_id;
> > +	__u32 out_num_iovas;
> > +	__u32 __reserved;
> > +	struct iommu_valid_iovas {
> > +		__aligned_u64 start;
> > +		__aligned_u64 last;
> > +	} out_valid_iovas[];
> > +};
> > +#define IOMMU_IOAS_IOVA_RANGES _IO(IOMMUFD_TYPE,
> IOMMUFD_CMD_IOAS_IOVA_RANGES)
> 
> Is the information returned by this valid for the lifetime of the IOAS,
> or can it change?  If it can change, what events can change it?
> 

It can change when a new device is attached to an ioas.

You can look at iopt_table_enforce_group_resv_regions() in patch7
which is called by iommufd_device_attach() in patch10. That function
first checks whether the new reserved ranges from the attached device
are already in use and, if there is no conflict, adds them to the list
of reserved ranges of this ioas.

Userspace can call this ioctl to retrieve updated IOVA range info after
attaching a device.

Thanks
Kevin
Jason Gunthorpe March 31, 2022, 12:58 p.m. UTC | #19
On Thu, Mar 31, 2022 at 03:36:29PM +1100, David Gibson wrote:

> > +/**
> > + * struct iommu_ioas_iova_ranges - ioctl(IOMMU_IOAS_IOVA_RANGES)
> > + * @size: sizeof(struct iommu_ioas_iova_ranges)
> > + * @ioas_id: IOAS ID to read ranges from
> > + * @out_num_iovas: Output total number of ranges in the IOAS
> > + * @__reserved: Must be 0
> > + * @out_valid_iovas: Array of valid IOVA ranges. The array length is the smaller
> > + *                   of out_num_iovas or the length implied by size.
> > + * @out_valid_iovas.start: First IOVA in the allowed range
> > + * @out_valid_iovas.last: Inclusive last IOVA in the allowed range
> > + *
> > + * Query an IOAS for ranges of allowed IOVAs. Operation outside these ranges is
> > + * not allowed. out_num_iovas will be set to the total number of iovas
> > + * and the out_valid_iovas[] will be filled in as space permits.
> > + * size should include the allocated flex array.
> > + */
> > +struct iommu_ioas_iova_ranges {
> > +	__u32 size;
> > +	__u32 ioas_id;
> > +	__u32 out_num_iovas;
> > +	__u32 __reserved;
> > +	struct iommu_valid_iovas {
> > +		__aligned_u64 start;
> > +		__aligned_u64 last;
> > +	} out_valid_iovas[];
> > +};
> > +#define IOMMU_IOAS_IOVA_RANGES _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_IOVA_RANGES)
> 
> Is the information returned by this valid for the lifetime of the IOAS,
> or can it change?  If it can change, what events can change it?
>
> If it *can't* change, then how do we have enough information to
> determine this at ALLOC time, since we don't necessarily know which
> (if any) hardware IOMMU will be attached to it.

It is a good point worth documenting. It can change. Particularly
after any device attachment.

I added this:

 * Query an IOAS for ranges of allowed IOVAs. Mapping IOVA outside these ranges
 * is not allowed. out_num_iovas will be set to the total number of iovas and
 * the out_valid_iovas[] will be filled in as space permits. size should include
 * the allocated flex array.
 *
 * The allowed ranges are dependent on the HW path the DMA operation takes, and
 * can change during the lifetime of the IOAS. A fresh empty IOAS will have a
 * full range, and each attached device will narrow the ranges based on that
 * device's HW restrictions.
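
As a minimal illustration (not part of the patch), userspace might drive this
query roughly as follows, growing the flex array and retrying when the kernel
reports more ranges than were allocated. The helper name is hypothetical, and
the retry assumes out_num_iovas is copied back even when the ioctl fails with
EMSGSIZE, as the handler in this RFC does:

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>

/* Hypothetical helper: print every allowed IOVA range of an IOAS */
static int dump_iova_ranges(int iommufd, __u32 ioas_id)
{
	__u32 num = 4;	/* initial guess for the flex array length */
	struct iommu_ioas_iova_ranges *cmd = NULL;
	int rc;

	for (;;) {
		size_t len = sizeof(*cmd) +
			     num * sizeof(cmd->out_valid_iovas[0]);

		free(cmd);
		cmd = calloc(1, len);
		if (!cmd)
			return -ENOMEM;
		cmd->size = len;
		cmd->ioas_id = ioas_id;

		rc = ioctl(iommufd, IOMMU_IOAS_IOVA_RANGES, cmd);
		if (!rc || errno != EMSGSIZE || cmd->out_num_iovas <= num)
			break;
		/* Not enough space; the kernel reported the real count */
		num = cmd->out_num_iovas;
	}

	if (!rc)
		for (__u32 i = 0; i != cmd->out_num_iovas; i++)
			printf("iova 0x%llx..0x%llx\n",
			       (unsigned long long)cmd->out_valid_iovas[i].start,
			       (unsigned long long)cmd->out_valid_iovas[i].last);
	else
		rc = -errno;
	free(cmd);
	return rc;
}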


> > +#define IOMMU_IOAS_COPY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_COPY)
> 
> Since it can only copy a single mapping, what's the benefit of this
> over just repeating an IOAS_MAP in the new IOAS?

It causes the underlying pin accounting to be shared and can avoid
calling GUP entirely.

Thanks,
Jason
Jason Gunthorpe March 31, 2022, 12:59 p.m. UTC | #20
On Wed, Mar 30, 2022 at 09:35:52PM +0800, Yi Liu wrote:

> > +/**
> > + * struct iommu_ioas_copy - ioctl(IOMMU_IOAS_COPY)
> > + * @size: sizeof(struct iommu_ioas_copy)
> > + * @flags: Combination of enum iommufd_ioas_map_flags
> > + * @dst_ioas_id: IOAS ID to change the mapping of
> > + * @src_ioas_id: IOAS ID to copy from
> 
> so the dst and src ioas_id are allocated via the same iommufd.
> right? just out of curiosity, do you think it is possible that
> the src/dst ioas_ids are from different iommufds? In that case
> may need to add src/dst iommufd. It's not needed today, just to
> see if any blocker in kernel to support such copy. :-)

Yes, all IDs in all ioctls are within the scope of a single iommufd.

There should be no reason for a single application to open multiple
iommufds.

Jason
Liu, Yi L April 1, 2022, 1:30 p.m. UTC | #21
On 2022/3/31 20:59, Jason Gunthorpe wrote:
> On Wed, Mar 30, 2022 at 09:35:52PM +0800, Yi Liu wrote:
> 
>>> +/**
>>> + * struct iommu_ioas_copy - ioctl(IOMMU_IOAS_COPY)
>>> + * @size: sizeof(struct iommu_ioas_copy)
>>> + * @flags: Combination of enum iommufd_ioas_map_flags
>>> + * @dst_ioas_id: IOAS ID to change the mapping of
>>> + * @src_ioas_id: IOAS ID to copy from
>>
>> so the dst and src ioas_id are allocated via the same iommufd.
>> right? just out of curiosity, do you think it is possible that
>> the src/dst ioas_ids are from different iommufds? In that case
>> may need to add src/dst iommufd. It's not needed today, just to
>> see if any blocker in kernel to support such copy. :-)
> 
> Yes, all IDs in all ioctls are within the scope of a single iommufd.
> 
> There should be no reason for a single application to open multiple
> iommufds.

then should this be documented?
David Gibson April 28, 2022, 5:58 a.m. UTC | #22
On Thu, Mar 31, 2022 at 09:58:41AM -0300, Jason Gunthorpe wrote:
> On Thu, Mar 31, 2022 at 03:36:29PM +1100, David Gibson wrote:
> 
> > > +/**
> > > + * struct iommu_ioas_iova_ranges - ioctl(IOMMU_IOAS_IOVA_RANGES)
> > > + * @size: sizeof(struct iommu_ioas_iova_ranges)
> > > + * @ioas_id: IOAS ID to read ranges from
> > > + * @out_num_iovas: Output total number of ranges in the IOAS
> > > + * @__reserved: Must be 0
> > > + * @out_valid_iovas: Array of valid IOVA ranges. The array length is the smaller
> > > + *                   of out_num_iovas or the length implied by size.
> > > + * @out_valid_iovas.start: First IOVA in the allowed range
> > > + * @out_valid_iovas.last: Inclusive last IOVA in the allowed range
> > > + *
> > > + * Query an IOAS for ranges of allowed IOVAs. Operation outside these ranges is
> > > + * not allowed. out_num_iovas will be set to the total number of iovas
> > > + * and the out_valid_iovas[] will be filled in as space permits.
> > > + * size should include the allocated flex array.
> > > + */
> > > +struct iommu_ioas_iova_ranges {
> > > +	__u32 size;
> > > +	__u32 ioas_id;
> > > +	__u32 out_num_iovas;
> > > +	__u32 __reserved;
> > > +	struct iommu_valid_iovas {
> > > +		__aligned_u64 start;
> > > +		__aligned_u64 last;
> > > +	} out_valid_iovas[];
> > > +};
> > > +#define IOMMU_IOAS_IOVA_RANGES _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_IOVA_RANGES)
> > 
> > Is the information returned by this valid for the lifetime of the IOAS,
> > or can it change?  If it can change, what events can change it?
> >
> > If it *can't* change, then how do we have enough information to
> > determine this at ALLOC time, since we don't necessarily know which
> > (if any) hardware IOMMU will be attached to it.
> 
> It is a good point worth documenting. It can change. Particularly
> after any device attachment.

Right.. this is vital and needs to be front and centre in the
comments/docs here.  Really, I think an interface that *doesn't* have
magically changing status would be better (which is why I was
advocating that the user set the constraints, and the kernel supplied
or failed outright).  Still I recognize that has its own problems.

> I added this:
> 
>  * Query an IOAS for ranges of allowed IOVAs. Mapping IOVA outside these ranges
>  * is not allowed. out_num_iovas will be set to the total number of iovas and
>  * the out_valid_iovas[] will be filled in as space permits. size should include
>  * the allocated flex array.
>  *
>  * The allowed ranges are dependent on the HW path the DMA operation takes, and
>  * can change during the lifetime of the IOAS. A fresh empty IOAS will have a
>  * full range, and each attached device will narrow the ranges based on that
>  * device's HW restrictions.

I think you need to be even more explicit about this: which exact
operations on the fd can invalidate exactly which items in the
information from this call?  Can it only ever be narrowed, or can it
be broadened with any operations?

> > > +#define IOMMU_IOAS_COPY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_COPY)
> > 
> > Since it can only copy a single mapping, what's the benefit of this
> > over just repeating an IOAS_MAP in the new IOAS?
> 
> It causes the underlying pin accounting to be shared and can avoid
> calling GUP entirely.

If that's the only purpose, then that needs to be right here in the
comments too.  So is expected best practice to IOAS_MAP everything you
might want to map into a sort of "scratch" IOAS, then IOAS_COPY the
mappings you actually end up wanting into the "real" IOASes for use?

Seems like it would be nicer for the interface to just figure it out
for you: I can see there being sufficient complications with that to
have this slightly awkward interface, but I think it needs a rationale
to accompany it.
Jason Gunthorpe April 28, 2022, 2:22 p.m. UTC | #23
On Thu, Apr 28, 2022 at 03:58:30PM +1000, David Gibson wrote:
> On Thu, Mar 31, 2022 at 09:58:41AM -0300, Jason Gunthorpe wrote:
> > On Thu, Mar 31, 2022 at 03:36:29PM +1100, David Gibson wrote:
> > 
> > > > +/**
> > > > + * struct iommu_ioas_iova_ranges - ioctl(IOMMU_IOAS_IOVA_RANGES)
> > > > + * @size: sizeof(struct iommu_ioas_iova_ranges)
> > > > + * @ioas_id: IOAS ID to read ranges from
> > > > + * @out_num_iovas: Output total number of ranges in the IOAS
> > > > + * @__reserved: Must be 0
> > > > + * @out_valid_iovas: Array of valid IOVA ranges. The array length is the smaller
> > > > + *                   of out_num_iovas or the length implied by size.
> > > > + * @out_valid_iovas.start: First IOVA in the allowed range
> > > > + * @out_valid_iovas.last: Inclusive last IOVA in the allowed range
> > > > + *
> > > > + * Query an IOAS for ranges of allowed IOVAs. Operation outside these ranges is
> > > > + * not allowed. out_num_iovas will be set to the total number of iovas
> > > > + * and the out_valid_iovas[] will be filled in as space permits.
> > > > + * size should include the allocated flex array.
> > > > + */
> > > > +struct iommu_ioas_iova_ranges {
> > > > +	__u32 size;
> > > > +	__u32 ioas_id;
> > > > +	__u32 out_num_iovas;
> > > > +	__u32 __reserved;
> > > > +	struct iommu_valid_iovas {
> > > > +		__aligned_u64 start;
> > > > +		__aligned_u64 last;
> > > > +	} out_valid_iovas[];
> > > > +};
> > > > +#define IOMMU_IOAS_IOVA_RANGES _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_IOVA_RANGES)
> > > 
> > > Is the information returned by this valid for the lifetime of the IOAS,
> > > or can it change?  If it can change, what events can change it?
> > >
> > > If it *can't* change, then how do we have enough information to
> > > determine this at ALLOC time, since we don't necessarily know which
> > > (if any) hardware IOMMU will be attached to it.
> > 
> > It is a good point worth documenting. It can change. Particularly
> > after any device attachment.
> 
> Right.. this is vital and needs to be front and centre in the
> comments/docs here.  Really, I think an interface that *doesn't* have
> magically changing status would be better (which is why I was
> advocating that the user set the constraints, and the kernel supplied
> or failed outright).  Still I recognize that has its own problems.

That is a neat idea, it could be a nice option, it lets userspace
further customize the kernel allocator.

But I don't have a use case in mind? The simplified things I know
about want to attach their devices then allocate valid IOVA, they
don't really have a notion about what IOVA regions they are willing to
accept, or necessarily do hotplug.

What might be interesting is to have some option to load in machine-specific
default ranges - ie the union of every group and every
iommu_domain. The idea being that after such a call hotplug of a
device should be very likely to succeed.

Though I don't have a user in mind..

> > I added this:
> > 
> >  * Query an IOAS for ranges of allowed IOVAs. Mapping IOVA outside these ranges
> >  * is not allowed. out_num_iovas will be set to the total number of iovas and
> >  * the out_valid_iovas[] will be filled in as space permits. size should include
> >  * the allocated flex array.
> >  *
> >  * The allowed ranges are dependent on the HW path the DMA operation takes, and
> >  * can change during the lifetime of the IOAS. A fresh empty IOAS will have a
> >  * full range, and each attached device will narrow the ranges based on that
> >  * device's HW restrictions.
> 
> I think you need to be even more explicit about this: which exact
> operations on the fd can invalidate exactly which items in the
> information from this call?  Can it only ever be narrowed, or can it
> be broadened with any operations?

I think "attach" is the phrase we are using for that operation - it is
not a specific IOCTL here because it happens on, say, the VFIO device FD.

Let's add "detaching a device can widen the ranges. Userspace should
query ranges after every attach/detach to know what IOVAs are valid
for mapping."

> > > > +#define IOMMU_IOAS_COPY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_COPY)
> > > 
> > > Since it can only copy a single mapping, what's the benefit of this
> > > over just repeating an IOAS_MAP in the new IOAS?
> > 
> > It causes the underlying pin accounting to be shared and can avoid
> > calling GUP entirely.
> 
> If that's the only purpose, then that needs to be right here in the
> comments too.  So is expected best practice to IOAS_MAP everything you
> might want to map into a sort of "scratch" IOAS, then IOAS_COPY the
> mappings you actually end up wanting into the "real" IOASes for use?

That is one possibility, yes. qemu seems to be using this to establish
a clone ioas of an existing operational one which is another usage
model.

I added this additionally:

 * This may be used to efficiently clone a subset of an IOAS to another, or as a
 * kind of 'cache' to speed up mapping. Copy has an efficiency advantage over
 * establishing equivalent new mappings, as internal resources are shared, and
 * the kernel will pin the user memory only once.
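
A rough sketch of that usage (hypothetical helper names, error handling mostly
elided): map a buffer once into a 'staging' IOAS, then IOMMU_IOAS_COPY the
resulting range into a second IOAS so the pages are pinned only once:

#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>

/* Hypothetical helper: allocate a new IOAS and return its ID via out_id */
static int ioas_alloc(int iommufd, __u32 *out_id)
{
	struct iommu_ioas_alloc cmd = { .size = sizeof(cmd) };

	if (ioctl(iommufd, IOMMU_IOAS_ALLOC, &cmd))
		return -1;
	*out_id = cmd.out_ioas_id;
	return 0;
}

/*
 * Map buf once into the staging IOAS, then clone that mapping into dst_id
 * with IOMMU_IOAS_COPY; the copy shares the pin accounting with the source.
 */
static int map_and_clone(int iommufd, void *buf, size_t len,
			 __u32 staging_id, __u32 dst_id, __u64 *dst_iova)
{
	struct iommu_ioas_map map = {
		.size = sizeof(map),
		.flags = IOMMU_IOAS_MAP_READABLE | IOMMU_IOAS_MAP_WRITEABLE,
		.ioas_id = staging_id,
		.user_va = (uintptr_t)buf,
		.length = len,
	};
	struct iommu_ioas_copy copy = {
		.size = sizeof(copy),
		.flags = IOMMU_IOAS_MAP_READABLE | IOMMU_IOAS_MAP_WRITEABLE,
		.dst_ioas_id = dst_id,
		.src_ioas_id = staging_id,
		.length = len,
	};

	/* No FIXED_IOVA flag: the kernel allocates the IOVA in both calls */
	if (ioctl(iommufd, IOMMU_IOAS_MAP, &map))
		return -1;

	copy.src_iova = map.iova;
	if (ioctl(iommufd, IOMMU_IOAS_COPY, &copy))
		return -1;

	*dst_iova = copy.dst_iova;
	return 0;
}

Leaving IOMMU_IOAS_MAP_FIXED_IOVA clear in both calls lets the kernel allocator
pick an IOVA from the currently valid ranges of each IOAS.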

> Seems like it would be nicer for the interface to just figure it out
> for you: I can see there being sufficient complications with that to
> have this slightly awkward interface, but I think it needs a rationale
> to accompany it.

It is more than complicated: the kernel has no way to accurately know
when a user pointer is an alias of an existing user pointer or is
something new because the mm has become incoherent.

It is possible that uncoordinated modules in userspace could
experience data corruption if the wrong decision is made - mm
coherence with pinning is pretty weak in Linux.. Since I dislike that
kind of unreliable magic I made it explicit.

Thanks,
Jason
David Gibson April 29, 2022, 6 a.m. UTC | #24
On Thu, Apr 28, 2022 at 11:22:58AM -0300, Jason Gunthorpe wrote:
> On Thu, Apr 28, 2022 at 03:58:30PM +1000, David Gibson wrote:
> > On Thu, Mar 31, 2022 at 09:58:41AM -0300, Jason Gunthorpe wrote:
> > > On Thu, Mar 31, 2022 at 03:36:29PM +1100, David Gibson wrote:
> > > 
> > > > > +/**
> > > > > + * struct iommu_ioas_iova_ranges - ioctl(IOMMU_IOAS_IOVA_RANGES)
> > > > > + * @size: sizeof(struct iommu_ioas_iova_ranges)
> > > > > + * @ioas_id: IOAS ID to read ranges from
> > > > > + * @out_num_iovas: Output total number of ranges in the IOAS
> > > > > + * @__reserved: Must be 0
> > > > > + * @out_valid_iovas: Array of valid IOVA ranges. The array length is the smaller
> > > > > + *                   of out_num_iovas or the length implied by size.
> > > > > + * @out_valid_iovas.start: First IOVA in the allowed range
> > > > > + * @out_valid_iovas.last: Inclusive last IOVA in the allowed range
> > > > > + *
> > > > > + * Query an IOAS for ranges of allowed IOVAs. Operation outside these ranges is
> > > > > + * not allowed. out_num_iovas will be set to the total number of iovas
> > > > > + * and the out_valid_iovas[] will be filled in as space permits.
> > > > > + * size should include the allocated flex array.
> > > > > + */
> > > > > +struct iommu_ioas_iova_ranges {
> > > > > +	__u32 size;
> > > > > +	__u32 ioas_id;
> > > > > +	__u32 out_num_iovas;
> > > > > +	__u32 __reserved;
> > > > > +	struct iommu_valid_iovas {
> > > > > +		__aligned_u64 start;
> > > > > +		__aligned_u64 last;
> > > > > +	} out_valid_iovas[];
> > > > > +};
> > > > > +#define IOMMU_IOAS_IOVA_RANGES _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_IOVA_RANGES)
> > > > 
> > > > Is the information returned by this valid for the lifetime of the IOAS,
> > > > or can it change?  If it can change, what events can change it?
> > > >
> > > > If it *can't* change, then how do we have enough information to
> > > > determine this at ALLOC time, since we don't necessarily know which
> > > > (if any) hardware IOMMU will be attached to it.
> > > 
> > > It is a good point worth documenting. It can change. Particularly
> > > after any device attachment.
> > 
> > Right.. this is vital and needs to be front and centre in the
> > comments/docs here.  Really, I think an interface that *doesn't* have
> > magically changing status would be better (which is why I was
> > advocating that the user set the constraints, and the kernel supplied
> > or failed outright).  Still I recognize that has its own problems.
> 
> That is a neat idea, it could be a nice option, it lets userspace
> further customize the kernel allocator.
> 
> But I don't have a use case in mind? The simplified things I know
> about want to attach their devices then allocate valid IOVA, they
> don't really have a notion about what IOVA regions they are willing to
> accept, or necessarily do hotplug.

The obvious use case is qemu (or whatever) emulating a vIOMMU.  The
emulation code knows the IOVA windows that are expected of the vIOMMU
(because that's a property of the emulated platform), and requests
them of the host IOMMU.  If the host can supply that, you're good
(this doesn't necessarily mean the host windows match exactly, just
that the requested windows fit within the host windows).  If not,
you report an error.  This can be done at any point when the host
windows might change - so try to attach a device that can't support
the requested windows, and it will fail.  Attaching a device which
shrinks the windows, but still fits the requested windows within, and
you're still good to go.

For a typical direct userspace case you don't want that.  However, it
probably *does* make sense for userspace to specify how large a window
it wants.  So some form that allows you to specify size without base
address also makes sense.  In that case the kernel would set a base
address according to the host IOMMU's capabilities, or fail if it
can't supply any window of the requested size.  When to allocate that
base address is a bit unclear though.  If you do it at window request
time, then you might pick something that a later device can't work
with.  If you do it later, it's less clear how to sensibly report it
to userspace.

One option might be to only allow IOAS_MAP (or COPY) operations after
windows are requested, but otherwise you can choose the order.  So,
things that have strict requirements for the windows (vIOMMU
emulation) would request the windows then add devices: they know the
windows they need, if the devices can't work with that, that's what
needs to fail.  A userspace driver, however, would attach the devices
it wants to use, then request a window (without specifying base
address).

A query ioctl to give the largest possible windows in the current
state could still be useful for debugging here, of course, but
wouldn't need to be used in the normal course of operation.

> What might be interesting is to have some option to load in machine-specific
> default ranges - ie the union of every group and every
> iommu_domain. The idea being that after such a call hotplug of a
> device should be very likely to succeed.
> 
> Though I don't have a user in mind..
> 
> > > I added this:
> > > 
> > >  * Query an IOAS for ranges of allowed IOVAs. Mapping IOVA outside these ranges
> > >  * is not allowed. out_num_iovas will be set to the total number of iovas and
> > >  * the out_valid_iovas[] will be filled in as space permits. size should include
> > >  * the allocated flex array.
> > >  *
> > >  * The allowed ranges are dependent on the HW path the DMA operation takes, and
> > >  * can change during the lifetime of the IOAS. A fresh empty IOAS will have a
> > >  * full range, and each attached device will narrow the ranges based on that
> > >  * device's HW restrictions.
> > 
> > I think you need to be even more explicit about this: which exact
> > operations on the fd can invalidate exactly which items in the
> > information from this call?  Can it only ever be narrowed, or can it
> > be broadened with any operations?
> 
> I think "attach" is the phrase we are using for that operation - it is
> not a specific IOCTL here because it happens on, say, the VFIO device FD.
> 
> Let's add "detaching a device can widen the ranges. Userspace should
> query ranges after every attach/detach to know what IOVAs are valid
> for mapping."
> 
> > > > > +#define IOMMU_IOAS_COPY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_COPY)
> > > > 
> > > > Since it can only copy a single mapping, what's the benefit of this
> > > > over just repeating an IOAS_MAP in the new IOAS?
> > > 
> > > It causes the underlying pin accounting to be shared and can avoid
> > > calling GUP entirely.
> > 
> > If that's the only purpose, then that needs to be right here in the
> > comments too.  So is expected best practice to IOAS_MAP everything you
> > might want to map into a sort of "scratch" IOAS, then IOAS_COPY the
> > mappings you actually end up wanting into the "real" IOASes for use?
> 
> That is one possibility, yes. qemu seems to be using this to establish
> a clone ioas of an existing operational one which is another usage
> model.

Right, for qemu (or other hypervisors) the obvious choice would be to
create a "staging" IOAS where IOVA == GPA, then COPY that into the various
emulated bus IOASes.  For a userspace driver situation, I'm guessing
you'd map your relevant memory pool into an IOAS, then COPY to the
IOAS you need for whatever specific devices you're using.

> I added this additionally:
> 
>  * This may be used to efficiently clone a subset of an IOAS to another, or as a
>  * kind of 'cache' to speed up mapping. Copy has an efficiency advantage over
>  * establishing equivalent new mappings, as internal resources are shared, and
>  * the kernel will pin the user memory only once.

I think adding that helps substantially.

> > Seems like it would be nicer for the interface to just figure it out
> > for you: I can see there being sufficient complications with that to
> > have this slightly awkward interface, but I think it needs a rationale
> > to accompany it.
> 
> It is more than complicated: the kernel has no way to accurately know
> when a user pointer is an alias of an existing user pointer or is
> something new because the mm has become incoherent.
> 
> It is possible that uncoordinated modules in userspace could
> experience data corruption if the wrong decision is made - mm
> coherence with pinning is pretty weak in Linux.. Since I dislike that
> kind of unreliable magic I made it explicit.

Fair enough.
Jason Gunthorpe April 29, 2022, 12:54 p.m. UTC | #25
On Fri, Apr 29, 2022 at 04:00:14PM +1000, David Gibson wrote:
> > But I don't have a use case in mind? The simplified things I know
> > about want to attach their devices then allocate valid IOVA, they
> > don't really have a notion about what IOVA regions they are willing to
> > accept, or necessarily do hotplug.
> 
> The obvious use case is qemu (or whatever) emulating a vIOMMU.  The
> emulation code knows the IOVA windows that are expected of the vIOMMU
> (because that's a property of the emulated platform), and requests
> them of the host IOMMU.  If the host can supply that, you're good
> (this doesn't necessarily mean the host windows match exactly, just
> that the requested windows fit within the host windows).  If not,
> you report an error.  This can be done at any point when the host
> windows might change - so try to attach a device that can't support
> the requested windows, and it will fail.  Attaching a device which
> shrinks the windows, but still fits the requested windows within, and
> you're still good to go.

We were just talking about this in another area - Alex said that qemu
doesn't know the IOVA ranges? Are there some vIOMMU cases where it does?

Even if yes, qemu is able to manage this on its own - it doesn't use
the kernel IOVA allocator, so there is not a strong reason to tell the
kernel what the narrowed ranges are.

> > That is one possibility, yes. qemu seems to be using this to establish
> > a clone ioas of an existing operational one which is another usage
> > model.
> 
> Right, for qemu (or other hypervisors) the obvious choice would be to
> create a "staging" IOAS where IOVA == GPA, then COPY that into the various
> emulated bus IOASes.  For a userspace driver situation, I'm guessing
> you'd map your relevant memory pool into an IOAS, then COPY to the
> IOAS you need for whatever specific devices you're using.

qemu seems simpler: it juggles multiple containers, so it literally
just copies when it instantiates a new container and does a map in
multi-container.

Jason
David Gibson April 30, 2022, 2:44 p.m. UTC | #26
On Fri, Apr 29, 2022 at 09:54:42AM -0300, Jason Gunthorpe wrote:
> On Fri, Apr 29, 2022 at 04:00:14PM +1000, David Gibson wrote:
> > > But I don't have a use case in mind? The simplified things I know
> > > about want to attach their devices then allocate valid IOVA, they
> > > don't really have a notion about what IOVA regions they are willing to
> > > accept, or necessarily do hotplug.
> > 
> > The obvious use case is qemu (or whatever) emulating a vIOMMU.  The
> > emulation code knows the IOVA windows that are expected of the vIOMMU
> > (because that's a property of the emulated platform), and requests
> > them of the host IOMMU.  If the host can supply that, you're good
> > (this doesn't necessarily mean the host windows match exactly, just
> > that the requested windows fit within the host windows).  If not,
> > you report an error.  This can be done at any point when the host
> > windows might change - so try to attach a device that can't support
> > the requested windows, and it will fail.  Attaching a device which
> > shrinks the windows, but still fits the requested windows within, and
> > you're still good to go.
> 
> We were just talking about this in another area - Alex said that qemu
> doesn't know the IOVA ranges? Are there some vIOMMU cases where it does?

Uh.. what?  We certainly know (or, rather, choose) the IOVA ranges for
ppc.  That is to say we set up the default IOVA ranges at machine
construction (those defaults have changed with machine version a
couple of times).  If the guest uses dynamic DMA windows we then
update those ranges based on the hypercalls, but at any point we know
what the IOVA windows are supposed to be.  I don't really see how x86
or anything else could not know the IOVA ranges.  Who else *could* set
the ranges when implementing a vIOMMU in TCG mode?

For the non-vIOMMU case then IOVA==GPA, so everything qemu knows about
the GPA space it also knows about the IOVA space.  Which, come to
think of it, means memory hotplug also complicates things.

> Even if yes, qemu is able to manage this on its own - it doesn't use
> the kernel IOVA allocator, so there is not a strong reason to tell the
> kernel what the narrowed ranges are.

I don't follow.  The problem for the qemu case here is if you hotplug
a device which narrows down the range to something smaller than the
guest expects.  If qemu has told the kernel the ranges it needs, that
can just fail (which is the best you can do).  If the kernel adds the
device but narrows the ranges, then you may have just put the guest
into a situation where the vIOMMU cannot do what the guest expects it
to.  If qemu can only query the windows, not specify them then it
won't know that adding a particular device will conflict with its
guest side requirements until after it's already added.  That could
mess up concurrent guest initiated map operations for existing devices
in the same guest side domain, so I don't think reversing the hotplug
after the problem is detected is enough.

> > > That is one possibility, yes. qemu seems to be using this to establish
> > > a clone ioas of an existing operational one which is another usage
> > > model.
> > 
> > Right, for qemu (or other hypervisors) the obvious choice would be to
> > create a "staging" IOAS where IOVA == GPA, then COPY that into the various
> > emulated bus IOASes.  For a userspace driver situation, I'm guessing
> > you'd map your relevant memory pool into an IOAS, then COPY to the
> > IOAS you need for whatever specific devices you're using.
> 
> qemu seems simpler: it juggles multiple containers, so it literally
> just copies when it instantiates a new container and does a map in
> multi-container.

I don't follow you.  Are you talking about the vIOMMU or non vIOMMU
case?  In the vIOMMU case the different containers can be for
different guest side iommu domains with different guest-IOVA spaces,
so you can't just copy from one to another.
diff mbox series

Patch

diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
index b66a8c47ff55ec..2b4f36f1b72f9d 100644
--- a/drivers/iommu/iommufd/Makefile
+++ b/drivers/iommu/iommufd/Makefile
@@ -1,6 +1,7 @@ 
 # SPDX-License-Identifier: GPL-2.0-only
 iommufd-y := \
 	io_pagetable.o \
+	ioas.o \
 	main.o \
 	pages.o
 
diff --git a/drivers/iommu/iommufd/ioas.c b/drivers/iommu/iommufd/ioas.c
new file mode 100644
index 00000000000000..c530b2ba74b06b
--- /dev/null
+++ b/drivers/iommu/iommufd/ioas.c
@@ -0,0 +1,248 @@ 
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
+ */
+#include <linux/interval_tree.h>
+#include <linux/iommufd.h>
+#include <linux/iommu.h>
+#include <uapi/linux/iommufd.h>
+
+#include "io_pagetable.h"
+
+void iommufd_ioas_destroy(struct iommufd_object *obj)
+{
+	struct iommufd_ioas *ioas = container_of(obj, struct iommufd_ioas, obj);
+	int rc;
+
+	rc = iopt_unmap_all(&ioas->iopt);
+	WARN_ON(rc);
+	iopt_destroy_table(&ioas->iopt);
+}
+
+struct iommufd_ioas *iommufd_ioas_alloc(struct iommufd_ctx *ictx)
+{
+	struct iommufd_ioas *ioas;
+	int rc;
+
+	ioas = iommufd_object_alloc(ictx, ioas, IOMMUFD_OBJ_IOAS);
+	if (IS_ERR(ioas))
+		return ioas;
+
+	rc = iopt_init_table(&ioas->iopt);
+	if (rc)
+		goto out_abort;
+	return ioas;
+
+out_abort:
+	iommufd_object_abort(ictx, &ioas->obj);
+	return ERR_PTR(rc);
+}
+
+int iommufd_ioas_alloc_ioctl(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_ioas_alloc *cmd = ucmd->cmd;
+	struct iommufd_ioas *ioas;
+	int rc;
+
+	if (cmd->flags)
+		return -EOPNOTSUPP;
+
+	ioas = iommufd_ioas_alloc(ucmd->ictx);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+
+	cmd->out_ioas_id = ioas->obj.id;
+	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+	if (rc)
+		goto out_table;
+	iommufd_object_finalize(ucmd->ictx, &ioas->obj);
+	return 0;
+
+out_table:
+	iommufd_ioas_destroy(&ioas->obj);
+	return rc;
+}
+
+int iommufd_ioas_iova_ranges(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_ioas_iova_ranges __user *uptr = ucmd->ubuffer;
+	struct iommu_ioas_iova_ranges *cmd = ucmd->cmd;
+	struct iommufd_ioas *ioas;
+	struct interval_tree_span_iter span;
+	u32 max_iovas;
+	int rc;
+
+	if (cmd->__reserved)
+		return -EOPNOTSUPP;
+
+	max_iovas = cmd->size - sizeof(*cmd);
+	if (max_iovas % sizeof(cmd->out_valid_iovas[0]))
+		return -EINVAL;
+	max_iovas /= sizeof(cmd->out_valid_iovas[0]);
+
+	ioas = iommufd_get_ioas(ucmd, cmd->ioas_id);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+
+	down_read(&ioas->iopt.iova_rwsem);
+	cmd->out_num_iovas = 0;
+	for (interval_tree_span_iter_first(
+		     &span, &ioas->iopt.reserved_iova_itree, 0, ULONG_MAX);
+	     !interval_tree_span_iter_done(&span);
+	     interval_tree_span_iter_next(&span)) {
+		if (!span.is_hole)
+			continue;
+		if (cmd->out_num_iovas < max_iovas) {
+			rc = put_user((u64)span.start_hole,
+				      &uptr->out_valid_iovas[cmd->out_num_iovas]
+					       .start);
+			if (rc)
+				goto out_put;
+			rc = put_user(
+				(u64)span.last_hole,
+				&uptr->out_valid_iovas[cmd->out_num_iovas].last);
+			if (rc)
+				goto out_put;
+		}
+		cmd->out_num_iovas++;
+	}
+	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+	if (rc)
+		goto out_put;
+	if (cmd->out_num_iovas > max_iovas)
+		rc = -EMSGSIZE;
+out_put:
+	up_read(&ioas->iopt.iova_rwsem);
+	iommufd_put_object(&ioas->obj);
+	return rc;
+}
+
+static int conv_iommu_prot(u32 map_flags)
+{
+	int iommu_prot;
+
+	/*
+	 * We provide no manual cache coherency ioctls to userspace and most
+	 * architectures make the CPU ops for cache flushing privileged.
+	 * Therefore we require the underlying IOMMU to support CPU coherent
+	 * operation.
+	 */
+	iommu_prot = IOMMU_CACHE;
+	if (map_flags & IOMMU_IOAS_MAP_WRITEABLE)
+		iommu_prot |= IOMMU_WRITE;
+	if (map_flags & IOMMU_IOAS_MAP_READABLE)
+		iommu_prot |= IOMMU_READ;
+	return iommu_prot;
+}
+
+int iommufd_ioas_map(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_ioas_map *cmd = ucmd->cmd;
+	struct iommufd_ioas *ioas;
+	unsigned int flags = 0;
+	unsigned long iova;
+	int rc;
+
+	if ((cmd->flags &
+	     ~(IOMMU_IOAS_MAP_FIXED_IOVA | IOMMU_IOAS_MAP_WRITEABLE |
+	       IOMMU_IOAS_MAP_READABLE)) ||
+	    cmd->__reserved)
+		return -EOPNOTSUPP;
+	if (cmd->iova >= ULONG_MAX || cmd->length >= ULONG_MAX)
+		return -EOVERFLOW;
+
+	ioas = iommufd_get_ioas(ucmd, cmd->ioas_id);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+
+	if (!(cmd->flags & IOMMU_IOAS_MAP_FIXED_IOVA))
+		flags = IOPT_ALLOC_IOVA;
+	iova = cmd->iova;
+	rc = iopt_map_user_pages(&ioas->iopt, &iova,
+				 u64_to_user_ptr(cmd->user_va), cmd->length,
+				 conv_iommu_prot(cmd->flags), flags);
+	if (rc)
+		goto out_put;
+
+	cmd->iova = iova;
+	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+out_put:
+	iommufd_put_object(&ioas->obj);
+	return rc;
+}
+
+int iommufd_ioas_copy(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_ioas_copy *cmd = ucmd->cmd;
+	struct iommufd_ioas *src_ioas;
+	struct iommufd_ioas *dst_ioas;
+	struct iopt_pages *pages;
+	unsigned int flags = 0;
+	unsigned long iova;
+	unsigned long start_byte;
+	int rc;
+
+	if ((cmd->flags &
+	     ~(IOMMU_IOAS_MAP_FIXED_IOVA | IOMMU_IOAS_MAP_WRITEABLE |
+	       IOMMU_IOAS_MAP_READABLE)))
+		return -EOPNOTSUPP;
+	if (cmd->length >= ULONG_MAX)
+		return -EOVERFLOW;
+
+	src_ioas = iommufd_get_ioas(ucmd, cmd->src_ioas_id);
+	if (IS_ERR(src_ioas))
+		return PTR_ERR(src_ioas);
+	/* FIXME: copy is not limited to an exact match anymore */
+	pages = iopt_get_pages(&src_ioas->iopt, cmd->src_iova, &start_byte,
+			       cmd->length);
+	iommufd_put_object(&src_ioas->obj);
+	if (IS_ERR(pages))
+		return PTR_ERR(pages);
+
+	dst_ioas = iommufd_get_ioas(ucmd, cmd->dst_ioas_id);
+	if (IS_ERR(dst_ioas)) {
+		iopt_put_pages(pages);
+		return PTR_ERR(dst_ioas);
+	}
+
+	if (!(cmd->flags & IOMMU_IOAS_MAP_FIXED_IOVA))
+		flags = IOPT_ALLOC_IOVA;
+	iova = cmd->dst_iova;
+	rc = iopt_map_pages(&dst_ioas->iopt, pages, &iova, start_byte,
+			    cmd->length, conv_iommu_prot(cmd->flags), flags);
+	if (rc) {
+		iopt_put_pages(pages);
+		goto out_put_dst;
+	}
+
+	cmd->dst_iova = iova;
+	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+out_put_dst:
+	iommufd_put_object(&dst_ioas->obj);
+	return rc;
+}
+
+int iommufd_ioas_unmap(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_ioas_unmap *cmd = ucmd->cmd;
+	struct iommufd_ioas *ioas;
+	int rc;
+
+	ioas = iommufd_get_ioas(ucmd, cmd->ioas_id);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+
+	if (cmd->iova == 0 && cmd->length == U64_MAX) {
+		rc = iopt_unmap_all(&ioas->iopt);
+	} else {
+		if (cmd->iova >= ULONG_MAX || cmd->length >= ULONG_MAX) {
+			rc = -EOVERFLOW;
+			goto out_put;
+		}
+		rc = iopt_unmap_iova(&ioas->iopt, cmd->iova, cmd->length);
+	}
+
+out_put:
+	iommufd_put_object(&ioas->obj);
+	return rc;
+}
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index bcf08e61bc87e9..d24c9dac5a82a9 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -96,6 +96,7 @@  static inline int iommufd_ucmd_respond(struct iommufd_ucmd *ucmd,
 enum iommufd_object_type {
 	IOMMUFD_OBJ_NONE,
 	IOMMUFD_OBJ_ANY = IOMMUFD_OBJ_NONE,
+	IOMMUFD_OBJ_IOAS,
 	IOMMUFD_OBJ_MAX,
 };
 
@@ -147,4 +148,30 @@  struct iommufd_object *_iommufd_object_alloc(struct iommufd_ctx *ictx,
 			     type),                                            \
 		     typeof(*(ptr)), obj)
 
+/*
+ * The IO Address Space (IOAS) pagetable is a virtual page table backed by the
+ * io_pagetable object. It is a user controlled mapping of IOVA -> PFNs. The
+ * mapping is copied into all of the associated domains and made available to
+ * in-kernel users.
+ */
+struct iommufd_ioas {
+	struct iommufd_object obj;
+	struct io_pagetable iopt;
+};
+
+static inline struct iommufd_ioas *iommufd_get_ioas(struct iommufd_ucmd *ucmd,
+						    u32 id)
+{
+	return container_of(iommufd_get_object(ucmd->ictx, id,
+					       IOMMUFD_OBJ_IOAS),
+			    struct iommufd_ioas, obj);
+}
+
+struct iommufd_ioas *iommufd_ioas_alloc(struct iommufd_ctx *ictx);
+int iommufd_ioas_alloc_ioctl(struct iommufd_ucmd *ucmd);
+void iommufd_ioas_destroy(struct iommufd_object *obj);
+int iommufd_ioas_iova_ranges(struct iommufd_ucmd *ucmd);
+int iommufd_ioas_map(struct iommufd_ucmd *ucmd);
+int iommufd_ioas_copy(struct iommufd_ucmd *ucmd);
+int iommufd_ioas_unmap(struct iommufd_ucmd *ucmd);
 #endif
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index ae8db2f663004f..e506f493b54cfe 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -184,6 +184,10 @@  static int iommufd_fops_release(struct inode *inode, struct file *filp)
 }
 
 union ucmd_buffer {
+	struct iommu_ioas_alloc alloc;
+	struct iommu_ioas_iova_ranges iova_ranges;
+	struct iommu_ioas_map map;
+	struct iommu_ioas_unmap unmap;
 	struct iommu_destroy destroy;
 };
 
@@ -205,6 +209,16 @@  struct iommufd_ioctl_op {
 	}
 static struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
 	IOCTL_OP(IOMMU_DESTROY, iommufd_destroy, struct iommu_destroy, id),
+	IOCTL_OP(IOMMU_IOAS_ALLOC, iommufd_ioas_alloc_ioctl,
+		 struct iommu_ioas_alloc, out_ioas_id),
+	IOCTL_OP(IOMMU_IOAS_COPY, iommufd_ioas_copy, struct iommu_ioas_copy,
+		 src_iova),
+	IOCTL_OP(IOMMU_IOAS_IOVA_RANGES, iommufd_ioas_iova_ranges,
+		 struct iommu_ioas_iova_ranges, __reserved),
+	IOCTL_OP(IOMMU_IOAS_MAP, iommufd_ioas_map, struct iommu_ioas_map,
+		 __reserved),
+	IOCTL_OP(IOMMU_IOAS_UNMAP, iommufd_ioas_unmap, struct iommu_ioas_unmap,
+		 length),
 };
 
 static long iommufd_fops_ioctl(struct file *filp, unsigned int cmd,
@@ -270,6 +284,9 @@  struct iommufd_ctx *iommufd_fget(int fd)
 }
 
 static struct iommufd_object_ops iommufd_object_ops[] = {
+	[IOMMUFD_OBJ_IOAS] = {
+		.destroy = iommufd_ioas_destroy,
+	},
 };
 
 static struct miscdevice iommu_misc_dev = {
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 2f7f76ec6db4cb..ba7b17ec3002e3 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -37,6 +37,11 @@ 
 enum {
 	IOMMUFD_CMD_BASE = 0x80,
 	IOMMUFD_CMD_DESTROY = IOMMUFD_CMD_BASE,
+	IOMMUFD_CMD_IOAS_ALLOC,
+	IOMMUFD_CMD_IOAS_IOVA_RANGES,
+	IOMMUFD_CMD_IOAS_MAP,
+	IOMMUFD_CMD_IOAS_COPY,
+	IOMMUFD_CMD_IOAS_UNMAP,
 };
 
 /**
@@ -52,4 +57,131 @@  struct iommu_destroy {
 };
 #define IOMMU_DESTROY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_DESTROY)
 
+/**
+ * struct iommu_ioas_alloc - ioctl(IOMMU_IOAS_ALLOC)
+ * @size: sizeof(struct iommu_ioas_alloc)
+ * @flags: Must be 0
+ * @out_ioas_id: Output IOAS ID for the allocated object
+ *
+ * Allocate an IO Address Space (IOAS) which holds an IO Virtual Address (IOVA)
+ * to memory mapping.
+ */
+struct iommu_ioas_alloc {
+	__u32 size;
+	__u32 flags;
+	__u32 out_ioas_id;
+};
+#define IOMMU_IOAS_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_ALLOC)
+
+/**
+ * struct iommu_ioas_iova_ranges - ioctl(IOMMU_IOAS_IOVA_RANGES)
+ * @size: sizeof(struct iommu_ioas_iova_ranges)
+ * @ioas_id: IOAS ID to read ranges from
+ * @out_num_iovas: Output total number of ranges in the IOAS
+ * @__reserved: Must be 0
+ * @out_valid_iovas: Array of valid IOVA ranges. The array length is the smaller
+ *                   of out_num_iovas or the length implied by size.
+ * @out_valid_iovas.start: First IOVA in the allowed range
+ * @out_valid_iovas.last: Inclusive last IOVA in the allowed range
+ *
+ * Query an IOAS for ranges of allowed IOVAs. Operation outside these ranges is
+ * not allowed. out_num_iovas will be set to the total number of iovas
+ * and the out_valid_iovas[] will be filled in as space permits.
+ * size should include the allocated flex array.
+ */
+struct iommu_ioas_iova_ranges {
+	__u32 size;
+	__u32 ioas_id;
+	__u32 out_num_iovas;
+	__u32 __reserved;
+	struct iommu_valid_iovas {
+		__aligned_u64 start;
+		__aligned_u64 last;
+	} out_valid_iovas[];
+};
+#define IOMMU_IOAS_IOVA_RANGES _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_IOVA_RANGES)
+
+/**
+ * enum iommufd_ioas_map_flags - Flags for map and copy
+ * @IOMMU_IOAS_MAP_FIXED_IOVA: If clear the kernel will compute an appropriate
+ *                             IOVA to place the mapping at
+ * @IOMMU_IOAS_MAP_WRITEABLE: DMA is allowed to write to this mapping
+ * @IOMMU_IOAS_MAP_READABLE: DMA is allowed to read from this mapping
+ */
+enum iommufd_ioas_map_flags {
+	IOMMU_IOAS_MAP_FIXED_IOVA = 1 << 0,
+	IOMMU_IOAS_MAP_WRITEABLE = 1 << 1,
+	IOMMU_IOAS_MAP_READABLE = 1 << 2,
+};
+
+/**
+ * struct iommu_ioas_map - ioctl(IOMMU_IOAS_MAP)
+ * @size: sizeof(struct iommu_ioas_map)
+ * @flags: Combination of enum iommufd_ioas_map_flags
+ * @ioas_id: IOAS ID to change the mapping of
+ * @__reserved: Must be 0
+ * @user_va: Userspace pointer to start mapping from
+ * @length: Number of bytes to map
+ * @iova: IOVA the mapping was placed at. If IOMMU_IOAS_MAP_FIXED_IOVA is set
+ *        then this must be provided as input.
+ *
+ * Set an IOVA mapping from a user pointer. If FIXED_IOVA is specified then the
+ * mapping will be established at iova, otherwise a suitable location will be
+ * automatically selected and returned in iova.
+ */
+struct iommu_ioas_map {
+	__u32 size;
+	__u32 flags;
+	__u32 ioas_id;
+	__u32 __reserved;
+	__aligned_u64 user_va;
+	__aligned_u64 length;
+	__aligned_u64 iova;
+};
+#define IOMMU_IOAS_MAP _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_MAP)
+
+/**
+ * struct iommu_ioas_copy - ioctl(IOMMU_IOAS_COPY)
+ * @size: sizeof(struct iommu_ioas_copy)
+ * @flags: Combination of enum iommufd_ioas_map_flags
+ * @dst_ioas_id: IOAS ID to change the mapping of
+ * @src_ioas_id: IOAS ID to copy from
+ * @length: Number of bytes to copy and map
+ * @dst_iova: IOVA the mapping was placed at. If IOMMU_IOAS_MAP_FIXED_IOVA is
+ *            set then this must be provided as input.
+ * @src_iova: IOVA to start the copy
+ *
+ * Copy an already existing mapping from src_ioas_id and establish it in
+ * dst_ioas_id. The src iova/length must exactly match a range used with
+ * IOMMU_IOAS_MAP.
+ */
+struct iommu_ioas_copy {
+	__u32 size;
+	__u32 flags;
+	__u32 dst_ioas_id;
+	__u32 src_ioas_id;
+	__aligned_u64 length;
+	__aligned_u64 dst_iova;
+	__aligned_u64 src_iova;
+};
+#define IOMMU_IOAS_COPY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_COPY)
+
+/**
+ * struct iommu_ioas_unmap - ioctl(IOMMU_IOAS_UNMAP)
+ * @size: sizeof(struct iommu_ioas_unmap)
+ * @ioas_id: IOAS ID to change the mapping of
+ * @iova: IOVA to start the unmapping at
+ * @length: Number of bytes to unmap
+ *
+ * Unmap an IOVA range. The iova/length must exactly match a range
+ * used with IOMMU_IOAS_MAP, or be the values 0 & U64_MAX.
+ * In the latter case all IOVAs will be unmapped.
+ */
+struct iommu_ioas_unmap {
+	__u32 size;
+	__u32 ioas_id;
+	__aligned_u64 iova;
+	__aligned_u64 length;
+};
+#define IOMMU_IOAS_UNMAP _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_UNMAP)
 #endif
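
To illustrate the unmap semantics documented above, a small userspace sketch
(helper names are hypothetical, not part of the series): pass back exactly the
iova/length that IOMMU_IOAS_MAP returned, or use 0 and U64_MAX to drop every
mapping in the IOAS:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>

/* Hypothetical helper wrapping IOMMU_IOAS_UNMAP */
static int ioas_unmap(int iommufd, __u32 ioas_id, __u64 iova, __u64 length)
{
	struct iommu_ioas_unmap cmd = {
		.size = sizeof(cmd),
		.ioas_id = ioas_id,
		.iova = iova,
		.length = length,
	};

	return ioctl(iommufd, IOMMU_IOAS_UNMAP, &cmd);
}

/*
 * Remove one previously mapped range by its exact iova/length, then
 * tear down whatever else is left in the IOAS with the 0 / U64_MAX form.
 */
static void example_teardown(int iommufd, __u32 ioas_id,
			     __u64 mapped_iova, __u64 mapped_len)
{
	ioas_unmap(iommufd, ioas_id, mapped_iova, mapped_len); /* exact match */
	ioas_unmap(iommufd, ioas_id, 0, UINT64_MAX);           /* unmap all */
}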