
[RFC,03/10] iommu/vt-d: Allocate groups for mediated devices

Message ID 1532239773-15325-4-git-send-email-baolu.lu@linux.intel.com (mailing list archive)
State New, archived
Series: vfio/mdev: IOMMU aware mediated device

Commit Message

Baolu Lu July 22, 2018, 6:09 a.m. UTC
With the Intel IOMMU supporting PASID granularity isolation and
protection, a mediated device could be isolated and protected by
an IOMMU unit. We need to allocate a new group instead of a PCI
group.

Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Liu Yi L <yi.l.liu@intel.com>
Signed-off-by: Sanjay Kumar <sanjay.k.kumar@intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/intel-iommu.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

Comments

Yi Liu July 23, 2018, 4:44 a.m. UTC | #1
> From: Lu Baolu [mailto:baolu.lu@linux.intel.com]
> Sent: Sunday, July 22, 2018 2:09 PM
> 
> With the Intel IOMMU supporting PASID granularity isolation and protection, a
> mediated device could be isolated and protected by an IOMMU unit. We need to
> allocate a new group instead of a PCI group.

The existing vfio mdev framework also allocates an iommu group for the mediated device.

mdev_probe()
  |_ mdev_attach_iommu()
       |_ iommu_group_alloc()

IMHO, this patch actually adds a wrapper around the group allocation API. The
reason is: it is not necessary to apply the group allocation rules when the
allocation is for a mediated device. Instead, just allocate a new group for
it. :)

Thanks,
Yi Liu
> 
> Cc: Ashok Raj <ashok.raj@intel.com>
> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Liu Yi L <yi.l.liu@intel.com>
> Signed-off-by: Sanjay Kumar <sanjay.k.kumar@intel.com>
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> ---
>  drivers/iommu/intel-iommu.c | 13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index 3ede34a..57ccfc4 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -5276,6 +5276,17 @@ static void intel_iommu_put_resv_regions(struct device *dev,
>  	}
>  }
> 
> +static struct iommu_group *intel_iommu_device_group(struct device *dev)
> +{
> +	if (dev_is_pci(dev))
> +		return pci_device_group(dev);
> +
> +	if (dev_is_mdev(dev))
> +		return iommu_group_alloc();
> +
> +	return ERR_PTR(-EINVAL);
> +}
> +
>  #ifdef CONFIG_INTEL_IOMMU_SVM
>  int intel_iommu_enable_pasid(struct intel_iommu *iommu, struct intel_svm_dev *sdev)
>  {
> @@ -5370,7 +5381,7 @@ const struct iommu_ops intel_iommu_ops = {
>  	.remove_device		= intel_iommu_remove_device,
>  	.get_resv_regions	= intel_iommu_get_resv_regions,
>  	.put_resv_regions	= intel_iommu_put_resv_regions,
> -	.device_group		= pci_device_group,
> +	.device_group		= intel_iommu_device_group,
>  	.pgsize_bitmap		= INTEL_IOMMU_PGSIZES,
>  };
> 
> --
> 2.7.4
Baolu Lu July 24, 2018, 2:22 a.m. UTC | #2
Hi,

On 07/23/2018 12:44 PM, Liu, Yi L wrote:
>> From: Lu Baolu [mailto:baolu.lu@linux.intel.com]
>> Sent: Sunday, July 22, 2018 2:09 PM
>>
>> With the Intel IOMMU supporting PASID granularity isolation and protection, a
>> mediated device could be isolated and protected by an IOMMU unit. We need to
>> allocate a new group instead of a PCI group.
> The existing vfio mdev framework also allocates an iommu group for the
> mediated device.
>
> mdev_probe()
>   |_ mdev_attach_iommu()
>        |_ iommu_group_alloc()

When external components ask the iommu core to allocate a group for a
device, it calls pci_device_group in the Intel IOMMU driver's
@device_group callback. In other words, the current Intel IOMMU driver
is not aware of mediated devices and treats all devices as PCI ones.
This patch extends the @device_group callback to make it aware of
mediated devices.

Best regards,
Lu Baolu

>
> IMHO, this patch actually adds a wrapper around the group allocation API. The
> reason is: it is not necessary to apply the group allocation rules when the
> allocation is for a mediated device. Instead, just allocate a new group for
> it. :)
>
> Thanks,
> Yi Liu
>> Cc: Ashok Raj <ashok.raj@intel.com>
>> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
>> Cc: Kevin Tian <kevin.tian@intel.com>
>> Cc: Liu Yi L <yi.l.liu@intel.com>
>> Signed-off-by: Sanjay Kumar <sanjay.k.kumar@intel.com>
>> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
>> ---
>>  drivers/iommu/intel-iommu.c | 13 ++++++++++++-
>>  1 file changed, 12 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
>> index 3ede34a..57ccfc4 100644
>> --- a/drivers/iommu/intel-iommu.c
>> +++ b/drivers/iommu/intel-iommu.c
>> @@ -5276,6 +5276,17 @@ static void intel_iommu_put_resv_regions(struct device *dev,
>>  	}
>>  }
>>
>> +static struct iommu_group *intel_iommu_device_group(struct device *dev)
>> +{
>> +	if (dev_is_pci(dev))
>> +		return pci_device_group(dev);
>> +
>> +	if (dev_is_mdev(dev))
>> +		return iommu_group_alloc();
>> +
>> +	return ERR_PTR(-EINVAL);
>> +}
>> +
>>  #ifdef CONFIG_INTEL_IOMMU_SVM
>>  int intel_iommu_enable_pasid(struct intel_iommu *iommu, struct intel_svm_dev *sdev)
>>  {
>> @@ -5370,7 +5381,7 @@ const struct iommu_ops intel_iommu_ops = {
>>  	.remove_device		= intel_iommu_remove_device,
>>  	.get_resv_regions	= intel_iommu_get_resv_regions,
>>  	.put_resv_regions	= intel_iommu_put_resv_regions,
>> -	.device_group		= pci_device_group,
>> +	.device_group		= intel_iommu_device_group,
>>  	.pgsize_bitmap		= INTEL_IOMMU_PGSIZES,
>>  };
>>
>> --
>> 2.7.4
>
Jean-Philippe Brucker July 24, 2018, 11:30 a.m. UTC | #3
Hi Baolu,

On 24/07/18 03:22, Lu Baolu wrote:
> Hi,
> 
> On 07/23/2018 12:44 PM, Liu, Yi L wrote:
>>> From: Lu Baolu [mailto:baolu.lu@linux.intel.com]
>>> Sent: Sunday, July 22, 2018 2:09 PM
>>>
>>> With the Intel IOMMU supporting PASID granularity isolation and protection, a
>>> mediated device could be isolated and protected by an IOMMU unit. We need to
>>> allocate a new group instead of a PCI group.
>> The existing vfio mdev framework also allocates an iommu group for the
>> mediated device.
>>
>> mdev_probe()
>>   |_ mdev_attach_iommu()
>>        |_ iommu_group_alloc()
> 
> When external components ask the iommu core to allocate a group for a
> device, it calls pci_device_group in the Intel IOMMU driver's
> @device_group callback. In other words, the current Intel IOMMU driver
> is not aware of mediated devices and treats all devices as PCI ones.
> This patch extends the @device_group callback to make it aware of
> mediated devices.

I agree that allocating two groups for an mdev seems strange, and in my
opinion we shouldn't export the notion of mdev to IOMMU drivers.
Otherwise each driver will have to add its own "dev_is_mdev()" special
cases, which will get messy in the long run. Besides, the macro is
currently private, and to be exported it should be wrapped in
symbol_get/put(mdev_bus_type).

There is another way: as we're planning to add a generic pasid_alloc()
function to the IOMMU API, the mdev module itself could allocate a
default PASID for each mdev by calling pasid_alloc() on the mdev's
parent, and then do map()/unmap() with that PASID. This way we don't
have to add IOMMU ops to the mdev bus, everything can still be done
using the ops of the parent. And IOMMU drivers "only" have to implement
PASID ops, which will be reused by drivers other than mdev.

The allocated PASID also needs to be installed into the parent device.
If the mdev module knows the PASID, we can do that by adding
set_pasid(mdev, pasid) and clear_pasid(mdev, pasid) operations to
mdev_parent_ops.

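Roughly, something like the following (a sketch only: iommu_pasid_alloc()
and the set_pasid()/clear_pasid() callbacks don't exist yet, they stand in
for the planned pasid_alloc() API and the mdev_parent_ops extension
suggested above):

/* Hypothetical flow in the mdev core, not an existing API */
static int mdev_setup_default_pasid(struct mdev_device *mdev,
				    const struct mdev_parent_ops *ops)
{
	struct device *parent = mdev_parent_dev(mdev);
	int pasid;

	/* Allocate the default PASID from the parent's IOMMU */
	pasid = iommu_pasid_alloc(parent);
	if (pasid < 0)
		return pasid;

	/* Install the PASID into the parent device's context */
	return ops->set_pasid(mdev, pasid);
}

Teardown would mirror this: clear_pasid() on the parent, then free the
PASID.
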
Thanks,
Jean
Alex Williamson July 24, 2018, 7:51 p.m. UTC | #4
On Tue, 24 Jul 2018 12:30:35 +0100
Jean-Philippe Brucker <jean-philippe.brucker@arm.com> wrote:

> Hi Baolu,
> 
> On 24/07/18 03:22, Lu Baolu wrote:
> > Hi,
> > 
> > On 07/23/2018 12:44 PM, Liu, Yi L wrote:  
> >>> From: Lu Baolu [mailto:baolu.lu@linux.intel.com]
> >>> Sent: Sunday, July 22, 2018 2:09 PM
> >>>
> >>> With the Intel IOMMU supporting PASID granularity isolation and protection, a
> >>> mediated device could be isolated and protected by an IOMMU unit. We need to
> >>> allocate a new group instead of a PCI group.  
> >> The existing vfio mdev framework also allocates an iommu group for the
> >> mediated device.
> >>
> >> mdev_probe()
> >>   |_ mdev_attach_iommu()
> >>        |_ iommu_group_alloc()  
> > 
> > When external components ask the iommu core to allocate a group for a
> > device, it calls pci_device_group in the Intel IOMMU driver's
> > @device_group callback. In other words, the current Intel IOMMU driver
> > is not aware of mediated devices and treats all devices as PCI ones.
> > This patch extends the @device_group callback to make it aware of
> > mediated devices.  
> 
> I agree that allocating two groups for an mdev seems strange, and in my
> opinion we shouldn't export the notion of mdev to IOMMU drivers.

Yep, I was just thinking the same thing.  This is too highly integrated
into VT-d and too narrowly focused on PASID being the only way that an
mdev could make use of the IOMMU.  Thanks,

Alex
Baolu Lu July 25, 2018, 2:06 a.m. UTC | #5
Hi Jean,

Pulling Kevin in. Not sure why he was removed from the cc list.

On 07/24/2018 07:30 PM, Jean-Philippe Brucker wrote:
> Hi Baolu,
>
> On 24/07/18 03:22, Lu Baolu wrote:
>> Hi,
>>
>> On 07/23/2018 12:44 PM, Liu, Yi L wrote:
>>>> From: Lu Baolu [mailto:baolu.lu@linux.intel.com]
>>>> Sent: Sunday, July 22, 2018 2:09 PM
>>>>
>>>> With the Intel IOMMU supporting PASID granularity isolation and protection, a
>>>> mediated device could be isolated and protected by an IOMMU unit. We need to
>>>> allocate a new group instead of a PCI group.
>>> The existing vfio mdev framework also allocates an iommu group for the
>>> mediated device.
>>>
>>> mdev_probe()
>>>   |_ mdev_attach_iommu()
>>>        |_ iommu_group_alloc()
>> When external components ask the iommu core to allocate a group for a
>> device, it calls pci_device_group in the Intel IOMMU driver's
>> @device_group callback. In other words, the current Intel IOMMU driver
>> is not aware of mediated devices and treats all devices as PCI ones.
>> This patch extends the @device_group callback to make it aware of
>> mediated devices.
> I agree that allocating two groups for an mdev seems strange, and in my
> opinion we shouldn't export the notion of mdev to IOMMU drivers.
> Otherwise each driver will have to add its own "dev_is_mdev()" special
> cases, which will get messy in the long run. Besides, the macro is
> currently private, and to be exported it should be wrapped in
> symbol_get/put(mdev_bus_type).

I agree with you.

It would be better if we could do this in a driver-agnostic way.

>
> There is another way: as we're planning to add a generic pasid_alloc()
> function to the IOMMU API, the mdev module itself could allocate a
> default PASID for each mdev by calling pasid_alloc() on the mdev's
> parent, and then do map()/unmap() with that PASID. This way we don't
> have to add IOMMU ops to the mdev bus, everything can still be done
> using the ops of the parent. And IOMMU drivers "only" have to implement
> PASID ops, which will be reused by drivers other than mdev.

A quick question that comes to mind is how to allocate a domain for a
mediated device that uses the default PASID for DMA transactions?

Best regards,
Lu Baolu

>
> The allocated PASID also needs to be installed into the parent device.
> If the mdev module knows the PASID, we can do that by adding
> set_pasid(mdev, pasid) and clear_pasid(mdev, pasid) operations to
> mdev_parent_ops.
>
> Thanks,
> Jean
>
>
Tian, Kevin July 25, 2018, 2:16 a.m. UTC | #6
> From: Jean-Philippe Brucker
> Sent: Tuesday, July 24, 2018 7:31 PM
> 
> Hi Baolu,
> 
> On 24/07/18 03:22, Lu Baolu wrote:
> > Hi,
> >
> > On 07/23/2018 12:44 PM, Liu, Yi L wrote:
> >>> From: Lu Baolu [mailto:baolu.lu@linux.intel.com]
> >>> Sent: Sunday, July 22, 2018 2:09 PM
> >>>
> >>> With the Intel IOMMU supporting PASID granularity isolation and protection, a
> >>> mediated device could be isolated and protected by an IOMMU unit. We need to
> >>> allocate a new group instead of a PCI group.
> >> The existing vfio mdev framework also allocates an iommu group for the
> >> mediated device.
> >>
> >> mdev_probe()
> >>   |_ mdev_attach_iommu()
> >>        |_ iommu_group_alloc()
> >
> > When external components ask the iommu core to allocate a group for a
> > device, it calls pci_device_group in the Intel IOMMU driver's
> > @device_group callback. In other words, the current Intel IOMMU driver
> > is not aware of mediated devices and treats all devices as PCI ones.
> > This patch extends the @device_group callback to make it aware of
> > mediated devices.
> 
> I agree that allocating two groups for an mdev seems strange, and in my
> opinion we shouldn't export the notion of mdev to IOMMU drivers.
> Otherwise each driver will have to add its own "dev_is_mdev()" special
> cases, which will get messy in the long run. Besides, the macro is
> currently private, and to be exported it should be wrapped in
> symbol_get/put(mdev_bus_type).

There is only one group per mdev, but I agree that avoiding mdev
awareness in the IOMMU driver is definitely good... We didn't find a
good way before, which is why the current RFC was implemented that
way.

> 
> There is another way: as we're planning to add a generic pasid_alloc()
> function to the IOMMU API, the mdev module itself could allocate a
> default PASID for each mdev by calling pasid_alloc() on the mdev's
> parent, and then do map()/unmap() with that PASID. This way we don't
> have to add IOMMU ops to the mdev bus, everything can still be done
> using the ops of the parent. And IOMMU drivers "only" have to implement
> PASID ops, which will be reused by drivers other than mdev.

Yes, PASID is a more general abstraction than mdev. However there is
one problem. Please note that with the new scalable mode, PASID is not
used just for SVA. You can do 1st-level, 2nd-level and nested translation
at PASID granularity, meaning each PASID can be bound to a unique iommu
domain. However, today the IOMMU driver allows only one iommu_domain
per device. If we simply install the allocated PASID into the parent
device, we should also remove that iommu_domain limitation. Is that the
way we want to pursue?

mdev avoids such a problem as it introduces a separate device...

> 
> The allocated PASID also needs to be installed into the parent device.
> If the mdev module knows the PASID, we can do that by adding
> set_pasid(mdev, pasid) and clear_pasid(mdev, pasid) operations to
> mdev_parent_ops.
> 

Thanks
Kevin
Yi Liu July 25, 2018, 2:35 a.m. UTC | #7
Hi Jean,

> From: Jean-Philippe Brucker [mailto:jean-philippe.brucker@arm.com]
> Sent: Tuesday, July 24, 2018 7:31 PM
> 
> Hi Baolu,
> 
> On 24/07/18 03:22, Lu Baolu wrote:
> > Hi,
> >
> > On 07/23/2018 12:44 PM, Liu, Yi L wrote:
> >>> From: Lu Baolu [mailto:baolu.lu@linux.intel.com]
> >>> Sent: Sunday, July 22, 2018 2:09 PM
> >>>
> >>> With the Intel IOMMU supporting PASID granularity isolation and protection, a
> >>> mediated device could be isolated and protected by an IOMMU unit. We need to
> >>> allocate a new group instead of a PCI group.
> >> The existing vfio mdev framework also allocates an iommu group for the
> >> mediated device.
> >>
> >> mdev_probe()
> >>   |_ mdev_attach_iommu()
> >>        |_ iommu_group_alloc()
> >
> > When external components ask the iommu core to allocate a group for a
> > device, it calls pci_device_group in the Intel IOMMU driver's
> > @device_group callback. In other words, the current Intel IOMMU driver
> > is not aware of mediated devices and treats all devices as PCI ones.
> > This patch extends the @device_group callback to make it aware of
> > mediated devices.
> 
> I agree that allocating two groups for an mdev seems strange, and in my

There will not be two groups for an mdev. Please refer to patch 08/10 of
this series. Baolu added an iommu_ops check when doing group allocation
in mdev_attach_iommu().

[RFC PATCH 08/10] vfio/mdev: Set iommu ops for mdev bus

@@ -21,6 +21,13 @@ static int mdev_attach_iommu(struct mdev_device *mdev)
 	int ret;
 	struct iommu_group *group;
 
+	/*
+	 * If iommu_ops is set for bus, add_device() will allocate
+	 * a group and add the device in the group.
+	 */
+	if (iommu_present(mdev->dev.bus))
+		return 0;
+

> opinion we shouldn't export the notion of mdev to IOMMU drivers.

The key idea of this RFC is to tag an iommu domain with a PASID. If any
mdev is added to such a domain, it gets that PASID and is configured in
its parent. Thus the transactions from the mdev can be isolated by the
iommu hardware.

Based on this idea, mdevs can be managed in a flexible manner, e.g. if
two mdevs are assigned to the same VM, they can share a PASID. This
sharing can easily be achieved by adding them to the same domain (see
the sketch below). If we allocate a PASID for each mdev by default, it
may be wasteful.

With the vendor-specific iommu driver handling the mdev difference, we
can largely keep the fundamental iommu concepts of the current software
implementation.

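As a minimal sketch of that sharing model (assuming the RFC's intel-iommu
changes tag the domain with a PASID at attach time, and glossing over the
fact that mdev_bus_type is currently private to the mdev core):

static int assign_mdevs_to_one_vm(struct mdev_device *m1,
				  struct mdev_device *m2,
				  unsigned long iova, phys_addr_t paddr,
				  size_t size)
{
	struct iommu_domain *domain;
	int ret;

	/* One domain == one PASID under this RFC's model */
	domain = iommu_domain_alloc(&mdev_bus_type);
	if (!domain)
		return -ENOMEM;

	/* Attaching both mdevs makes them share that PASID */
	ret = iommu_attach_device(domain, mdev_dev(m1));
	if (!ret)
		ret = iommu_attach_device(domain, mdev_dev(m2));
	if (!ret)
		/* The existing per-domain map/unmap works unchanged */
		ret = iommu_map(domain, iova, paddr, size,
				IOMMU_READ | IOMMU_WRITE);
	if (ret)
		iommu_domain_free(domain);	/* error path simplified */
	return ret;
}
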
> Otherwise each driver will have to add its own "dev_is_mdev()" special
> cases, which will get messy in the long run. Besides, the macro is
> currently private, and to be exported it should be wrapped in
> symbol_get/put(mdev_bus_type).

Agreed. We should figure out a better manner.

> There is another way: as we're planning to add a generic pasid_alloc()
> function to the IOMMU API, the mdev module itself could allocate a
> default PASID for each mdev by calling pasid_alloc() on the mdev's
> parent, and then do map()/unmap() with that PASID. This way we don't

So far, map/unmap are per-domain operations. In this way, passing a
PASID makes them kind of per-device operations. This may affect too much
of the existing software implementation.

> have to add IOMMU ops to the mdev bus, everything can still be done
> using the ops of the parent. And IOMMU drivers "only" have to implement
> PASID ops, which will be reused by drivers other than mdev.
> 
> The allocated PASID also needs to be installed into the parent device.
> If the mdev module knows the PASID, we can do that by adding
> set_pasid(mdev, pasid) and clear_pasid(mdev, pasid) operations to
> mdev_parent_ops.

Your idea is fascinating. Please feel free to let us know if we missed anything. :)

> Thanks,
> Jean

Thanks,
Yi Liu
Tian, Kevin July 25, 2018, 6:19 a.m. UTC | #8
> From: Tian, Kevin
> Sent: Wednesday, July 25, 2018 10:16 AM
> 
[...]
> >
> > There is another way: as we're planning to add a generic pasid_alloc()
> > function to the IOMMU API, the mdev module itself could allocate a
> > default PASID for each mdev by calling pasid_alloc() on the mdev's
> > parent, and then do map()/unmap() with that PASID. This way we don't
> > have to add IOMMU ops to the mdev bus, everything can still be done
> > using the ops of the parent. And IOMMU drivers "only" have to
> implement
> > PASID ops, which will be reused by drivers other than mdev.
> 
> Yes, PASID is a more general abstraction than mdev. However there is
> one problem. Please note that with the new scalable mode, PASID is not
> used just for SVA. You can do 1st-level, 2nd-level and nested translation
> at PASID granularity, meaning each PASID can be bound to a unique iommu
> domain. However, today the IOMMU driver allows only one iommu_domain
> per device. If we simply install the allocated PASID into the parent
> device, we should also remove that iommu_domain limitation. Is that the
> way we want to pursue?
> 
> mdev avoids such a problem as it introduces a separate device...

Another thing to think about is simplifying the caller logic. It would
be good to have consistent IOMMU APIs across PCI devices and mdevs, for
common VFIO operations e.g. map/unmap, iommu_attach_group, etc. If we
need special versions of the ops with PASID, it brings a burden to VFIO
and other IOMMU clients...

For IOMMU-capable mdevs, there are two use cases which have
been talked about so far:

1) mdev with PASID-granular DMA isolation
2) mdev inheriting the RID from the parent device (no sharing)

1) is what this RFC is currently for. 2) is what Alex is concerned
about, which is not covered in the RFC yet.

In my mind, an ideal design is like below:

a) VFIO provides an interface for the parent driver to specify
whether an mdev is IOMMU capable, with a flag to indicate
whether it's PASID-granular or RID-inherit;

b) Once an IOMMU-capable mdev is created, VFIO treats it no
differently from normal physical devices, using exactly the same
IOMMU APIs to operate it (including future SVA). VFIO doesn't
care whether PASID or RID will be used in hardware;

c) The IOMMU driver then handles the RID/PASID difference internally,
based on its own configuration and the mdev type.

To support the above goal, I feel a new device handle is a nice fit
to represent the mdev within the IOMMU driver, with the same set of
capabilities as observed on its parent device.

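As a rough illustration of a), where the MDEV_IOMMU_* flags and the
mdev_set_iommu_capable() helper are hypothetical names for the interface
I have in mind:

/* Hypothetical flags matching use cases 1) and 2) above */
#define MDEV_IOMMU_PASID_GRANULAR	(1 << 0)
#define MDEV_IOMMU_RID_INHERIT		(1 << 1)

static int my_parent_create(struct kobject *kobj, struct mdev_device *mdev)
{
	/* ... allocate the parent's resources backing this mdev ... */

	/*
	 * Declare the mdev IOMMU capable. VFIO then uses the normal
	 * IOMMU APIs on it; the IOMMU driver picks RID vs. PASID.
	 */
	return mdev_set_iommu_capable(mdev, MDEV_IOMMU_PASID_GRANULAR);
}
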
Jean, maybe I didn't fully understand your proposal. Can you
elaborate on whether the above goal can be achieved with it?

Thanks
Kevin
Jean-Philippe Brucker July 25, 2018, 7:19 p.m. UTC | #9
On 25/07/18 07:19, Tian, Kevin wrote:
>> From: Tian, Kevin
>> Sent: Wednesday, July 25, 2018 10:16 AM
>>
> [...]
>>>
>>> There is another way: as we're planning to add a generic pasid_alloc()
>>> function to the IOMMU API, the mdev module itself could allocate a
>>> default PASID for each mdev by calling pasid_alloc() on the mdev's
>>> parent, and then do map()/unmap() with that PASID. This way we don't
>>> have to add IOMMU ops to the mdev bus, everything can still be done
>>> using the ops of the parent. And IOMMU drivers "only" have to
>> implement
>>> PASID ops, which will be reused by drivers other than mdev.
>>
>> Yes, PASID is a more general abstraction than mdev. However there is
>> one problem. Please note that with the new scalable mode, PASID is not
>> used just for SVA. You can do 1st-level, 2nd-level and nested translation
>> at PASID granularity, meaning each PASID can be bound to a unique iommu
>> domain. However, today the IOMMU driver allows only one iommu_domain
>> per device. If we simply install the allocated PASID into the parent
>> device, we should also remove that iommu_domain limitation. Is that the
>> way we want to pursue?
>>
>> mdev avoids such a problem as it introduces a separate device...
> 
> Another thing to think about is simplifying the caller logic. It would
> be good to have consistent IOMMU APIs across PCI devices and mdevs, for
> common VFIO operations e.g. map/unmap, iommu_attach_group, etc. If we
> need special versions of the ops with PASID, it brings a burden to VFIO
> and other IOMMU clients...

Yes, providing special ops is the simplest to implement (or least
invasive, I guess) but isn't particularly elegant. See the patch from
Jordan below, which I was planning to add to the SVA series (I'm
laboriously working towards the next version), and my email from last week:
https://patchwork.kernel.org/patch/10412251/

Whenever I come back to hierarchical IOMMU domains I reject them as too
complicated, but maybe that is what we need. I find them difficult to
reason about because domains currently represent both a collection of
devices and one or more address spaces. I proposed the io_mm thing to
represent a single address space, and to avoid adding special cases to
every function in the IOMMU subsystem that manipulates domains.

> For IOMMU-capable mdevs, there are two use cases which have
> been talked about so far:
> 
> 1) mdev with PASID-granular DMA isolation
> 2) mdev inheriting the RID from the parent device (no sharing)

Would that be a single mdev per parent device? Otherwise I don't really
understand how it would work without all map/unmap operations going to
the parent device's driver.

> 1) is what this RFC is currently for. 2) is what Alex is concerned
> about, which is not covered in the RFC yet.
> 
> In my mind, an ideal design is like below:
> 
> a) VFIO provides an interface for the parent driver to specify
> whether an mdev is IOMMU capable, with a flag to indicate
> whether it's PASID-granular or RID-inherit;
> 
> b) Once an IOMMU-capable mdev is created, VFIO treats it no
> differently from normal physical devices, using exactly the same
> IOMMU APIs to operate it (including future SVA). VFIO doesn't
> care whether PASID or RID will be used in hardware;
> 
> c) The IOMMU driver then handles the RID/PASID difference internally,
> based on its own configuration and the mdev type.
> 
> To support the above goal, I feel a new device handle is a nice fit
> to represent the mdev within the IOMMU driver, with the same set of
> capabilities as observed on its parent device.

It's a good abstraction, but I'm still concerned about other users of
PASID-granular DMA isolation: for example, GPU drivers wanting to improve
isolation of DMA bufs will want the same functionality without going
through the vfio-mdev module.

The IOMMU operations we care about don't take a device handle, I think,
just a domain. And VFIO itself only deals with domains when doing
map/unmap. Maybe we could add this operation to the IOMMU subsystem:

child_domain = domain_create_child(parent_dev, parent_domain)

A child domain would be a smaller isolation granule, getting a PASID if
that's what the IOMMU implements or another mechanism for 2). It is
automatically attached to its parent's devices, so attach/detach
operations wouldn't be necessary on the child.

Depending on the underlying architecture the child domain can support
map/unmap on a single stage, or map/unmap for 2nd level and bind_pgtable
for 1st level.

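A sketch of how a driver could consume that hypothetical interface
(iommu_domain_create_child() is just a name for the operation proposed
above, not an existing function):

static int isolate_dma_buf(struct device *parent_dev, unsigned long iova,
			   phys_addr_t paddr, size_t size)
{
	struct iommu_domain *parent, *child;

	parent = iommu_get_domain_for_dev(parent_dev);
	if (!parent)
		return -ENODEV;

	/*
	 * Smaller isolation granule, backed by a PASID where the IOMMU
	 * implements one. No attach needed: the child automatically
	 * follows its parent's devices.
	 */
	child = iommu_domain_create_child(parent_dev, parent);
	if (IS_ERR(child))
		return PTR_ERR(child);

	/* Single-stage child: plain map/unmap */
	return iommu_map(child, iova, paddr, size, IOMMU_READ);
}
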
I'm not sure how this works for host SVA though. I think the
sva_bind_dev() API could stay the same, but the internals will need
to change.

Thanks,
Jean

-----
I was going to send the following patch for dealing with private PASIDs.
I used it for a vfio-mdev prototype internally: VFIO would call
iommu_sva_alloc_pasid on the parent device and then sva_map/sva_unmap on
the parent domain + allocated io_mm (which redirects to
iommu_map/iommu_unmap when there is no io_mm). Since it relies on io_mm
instead of child domains, it duplicates the iommu ops, and as discussed
might not be adapted to Vt-d scalable mode.

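For reference, the prototype used it roughly like this (a sketch only;
iommu_sva_device_init() must have been called on the parent first):

static int mdev_private_pasid_example(struct device *parent,
				      unsigned long iova, phys_addr_t paddr,
				      size_t size)
{
	struct iommu_domain *domain = iommu_get_domain_for_dev(parent);
	struct io_mm *io_mm;
	int ret;

	/* Allocate a private PASID on the parent device */
	ret = iommu_sva_alloc_pasid(parent, &io_mm);
	if (ret)
		return ret;

	/* Populate the private I/O address space */
	ret = iommu_sva_map(domain, io_mm, iova, paddr, size,
			    IOMMU_READ | IOMMU_WRITE);
	if (ret)
		goto out_free;

	/* ... program the parent device with io_mm->pasid ... */

	iommu_sva_unmap(domain, io_mm, iova, size);
out_free:
	iommu_sva_free_pasid(parent, io_mm);
	return ret;
}
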
Subject: [PATCH] iommu: sva: Add support for private PASIDs

Provide an API for allocating PASIDs and populating them manually. To ease
cleanup and factor allocation code, reuse the io_mm structure for private
PASID. Private io_mm has a NULL mm_struct pointer, and cannot be bound to
multiple devices. The mm_alloc() IOMMU op must now check if the mm
argument is NULL, in which case it should allocate io_pgtables instead of
binding to an mm.

Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
---
 drivers/iommu/iommu-sva.c | 211 +++++++++++++++++++++++++++++++++-----
 drivers/iommu/iommu.c     |  49 ++++++---
 include/linux/iommu.h     | 118 +++++++++++++++++++--
 3 files changed, 331 insertions(+), 47 deletions(-)

diff --git a/drivers/iommu/iommu-sva.c b/drivers/iommu/iommu-sva.c
index 35d0f387fbef..12c9b861014d 100644
--- a/drivers/iommu/iommu-sva.c
+++ b/drivers/iommu/iommu-sva.c
@@ -15,11 +15,11 @@
 /**
  * DOC: io_mm model
  *
- * The io_mm keeps track of process address spaces shared between CPU and IOMMU.
- * The following example illustrates the relation between structures
- * iommu_domain, io_mm and iommu_bond. An iommu_bond is a link between io_mm and
- * device. A device can have multiple io_mm and an io_mm may be bound to
- * multiple devices.
+ * When used with the bind()/unbind() functions, the io_mm keeps track of
+ * process address spaces shared between CPU and IOMMU. The following example
+ * illustrates the relation between structures iommu_domain, io_mm and
+ * iommu_bond. An iommu_bond is a link between io_mm and device. A device can
+ * have multiple io_mm and an io_mm may be bound to multiple devices.
  *              ___________________________
  *             |  IOMMU domain A           |
  *             |  ________________         |
@@ -97,6 +97,12 @@
  * non-PASID translations. In this case PASID 0 is reserved and entry 0 points
  * to the io_pgtable base. On Intel IOMMU, the io_pgtable base would be held in
  * the device table and PASID 0 would be available to the allocator.
+ *
+ * The io_mm can also represent a private IOMMU address space, which isn't
+ * shared with a process. The device driver calls iommu_sva_alloc_pasid which
+ * returns an io_mm that can be populated with the iommu_sva_map/unmap
+ * functions. The principle is the same as shared io_mm, except that a private
+ * io_mm cannot be bound to multiple devices.
  */

 struct iommu_bond {
@@ -130,6 +136,9 @@ static DEFINE_SPINLOCK(iommu_sva_lock);

 static struct mmu_notifier_ops iommu_mmu_notifier;

+#define io_mm_is_private(io_mm) ((io_mm) != NULL && (io_mm)->mm == NULL)
+#define io_mm_is_shared(io_mm) ((io_mm) != NULL && (io_mm)->mm != NULL)
+
 static struct io_mm *
 io_mm_alloc(struct iommu_domain *domain, struct device *dev,
 	    struct mm_struct *mm, unsigned long flags)
@@ -148,19 +157,10 @@ io_mm_alloc(struct iommu_domain *domain, struct device *dev,
 	if (!io_mm)
 		return ERR_PTR(-ENOMEM);

-	/*
-	 * The mm must not be freed until after the driver frees the io_mm
-	 * (which may involve unpinning the CPU ASID for instance, requiring a
-	 * valid mm struct.)
-	 */
-	mmgrab(mm);
-
 	io_mm->flags		= flags;
 	io_mm->mm		= mm;
-	io_mm->notifier.ops	= &iommu_mmu_notifier;
 	io_mm->release		= domain->ops->mm_free;
 	INIT_LIST_HEAD(&io_mm->devices);
-	/* Leave kref to zero until the io_mm is fully initialized */

 	idr_preload(GFP_KERNEL);
 	spin_lock(&iommu_sva_lock);
@@ -175,6 +175,32 @@ io_mm_alloc(struct iommu_domain *domain, struct device *dev,
 		goto err_free_mm;
 	}

+	return io_mm;
+
+err_free_mm:
+	io_mm->release(io_mm);
+	return ERR_PTR(ret);
+}
+
+static struct io_mm *
+io_mm_alloc_shared(struct iommu_domain *domain, struct device *dev,
+		   struct mm_struct *mm, unsigned long flags)
+{
+	int ret;
+	struct io_mm *io_mm;
+
+	io_mm = io_mm_alloc(domain, dev, mm, flags);
+	if (IS_ERR(io_mm))
+		return io_mm;
+
+	/*
+	 * The mm must not be freed until after the driver frees the io_mm
+	 * (which may involve unpinning the CPU ASID for instance, requiring a
+	 * valid mm struct.)
+	 */
+	mmgrab(mm);
+
+	io_mm->notifier.ops = &iommu_mmu_notifier;
 	ret = mmu_notifier_register(&io_mm->notifier, mm);
 	if (ret)
 		goto err_free_pasid;
@@ -202,7 +228,6 @@ io_mm_alloc(struct iommu_domain *domain, struct device *dev,
 	idr_remove(&iommu_pasid_idr, io_mm->pasid);
 	spin_unlock(&iommu_sva_lock);

-err_free_mm:
 	io_mm->release(io_mm);
 	mmdrop(mm);

@@ -231,6 +256,11 @@ static void io_mm_release(struct kref *kref)
 	/* The PASID can now be reallocated for another mm. */
 	idr_remove(&iommu_pasid_idr, io_mm->pasid);

+	if (io_mm_is_private(io_mm)) {
+		io_mm->release(io_mm);
+		return;
+	}
+
 	/*
 	 * If we're being released from mm exit, the notifier callback ->release
 	 * has already been called. Otherwise we don't need ->release, the io_mm
@@ -258,7 +288,7 @@ static int io_mm_get_locked(struct io_mm *io_mm)
 	if (io_mm && kref_get_unless_zero(&io_mm->kref)) {
 		/*
 		 * kref_get_unless_zero doesn't provide ordering for reads. This
-		 * barrier pairs with the one in io_mm_alloc.
+		 * barrier pairs with the one in io_mm_alloc_shared.
 		 */
 		smp_rmb();
 		return 1;
@@ -289,7 +319,7 @@ static int io_mm_attach(struct iommu_domain *domain, struct device *dev,
 	struct iommu_sva_param *param = dev->iommu_param->sva_param;

 	if (!domain->ops->mm_attach || !domain->ops->mm_detach ||
-	    !domain->ops->mm_invalidate)
+	    (io_mm_is_shared(io_mm) && !domain->ops->mm_invalidate))
 		return -ENODEV;

 	if (pasid > param->max_pasid || pasid < param->min_pasid)
@@ -550,7 +580,7 @@ int iommu_sva_device_init(struct device *dev, unsigned long features,
 	struct iommu_sva_param *param;
 	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);

-	if (!domain || !domain->ops->sva_device_init)
+	if (!domain)
 		return -ENODEV;

 	if (features & ~IOMMU_SVA_FEAT_IOPF)
@@ -584,9 +614,11 @@ int iommu_sva_device_init(struct device *dev, unsigned long features,
 	 * IOMMU driver updates the limits depending on the IOMMU and device
 	 * capabilities.
 	 */
-	ret = domain->ops->sva_device_init(dev, param);
-	if (ret)
-		goto err_free_param;
+	if (domain->ops->sva_device_init) {
+		ret = domain->ops->sva_device_init(dev, param);
+		if (ret)
+			goto err_free_param;
+	}

 	dev->iommu_param->sva_param = param;
 	mutex_unlock(&dev->iommu_param->lock);
@@ -688,7 +720,7 @@ int __iommu_sva_bind_device(struct device *dev, struct mm_struct *mm,
 	}

 	if (!io_mm) {
-		io_mm = io_mm_alloc(domain, dev, mm, flags);
+		io_mm = io_mm_alloc_shared(domain, dev, mm, flags);
 		if (IS_ERR(io_mm))
 			return PTR_ERR(io_mm);
 	}
@@ -723,6 +755,9 @@ int __iommu_sva_unbind_device(struct device *dev, int pasid)
 	/* spin_lock_irq matches the one in wait_event_lock_irq */
 	spin_lock_irq(&iommu_sva_lock);
 	list_for_each_entry(bond, &param->mm_list, dev_head) {
+		if (io_mm_is_private(bond->io_mm))
+			continue;
+
 		if (bond->io_mm->pasid == pasid) {
 			io_mm_detach_locked(bond, true);
 			ret = 0;
@@ -765,6 +800,136 @@ void __iommu_sva_unbind_dev_all(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(__iommu_sva_unbind_dev_all);

+/*
+ * iommu_sva_alloc_pasid - Allocate a private PASID
+ *
+ * Allocate a PASID for private map/unmap operations. Create a new I/O address
+ * space for this device, that isn't bound to any process.
+ *
+ * iommu_sva_device_init must have been called first.
+ */
+int iommu_sva_alloc_pasid(struct device *dev, struct io_mm **out)
+{
+	int ret;
+	struct io_mm *io_mm;
+	struct iommu_domain *domain;
+	struct iommu_sva_param *param = dev->iommu_param->sva_param;
+
+	if (!out || !param)
+		return -EINVAL;
+
+	domain = iommu_get_domain_for_dev(dev);
+	if (!domain)
+		return -EINVAL;
+
+	io_mm = io_mm_alloc(domain, dev, NULL, 0);
+	if (IS_ERR(io_mm))
+		return PTR_ERR(io_mm);
+
+	kref_init(&io_mm->kref);
+
+	ret = io_mm_attach(domain, dev, io_mm, NULL);
+	if (ret) {
+		io_mm_put(io_mm);
+		return ret;
+	}
+
+	*out = io_mm;
+	return 0;
+}
+EXPORT_SYMBOL_GPL(iommu_sva_alloc_pasid);
+
+void iommu_sva_free_pasid(struct device *dev, struct io_mm *io_mm)
+{
+	struct iommu_bond *bond;
+
+	if (WARN_ON(io_mm_is_shared(io_mm)))
+		return;
+
+	spin_lock(&iommu_sva_lock);
+	list_for_each_entry(bond, &io_mm->devices, mm_head) {
+		if (bond->dev == dev) {
+			io_mm_detach_locked(bond, false);
+			break;
+		}
+	}
+	spin_unlock(&iommu_sva_lock);
+}
+EXPORT_SYMBOL_GPL(iommu_sva_free_pasid);
+
+int iommu_sva_map(struct iommu_domain *domain, struct io_mm *io_mm,
+		  unsigned long iova, phys_addr_t paddr, size_t size, int prot)
+{
+	if (WARN_ON(io_mm_is_shared(io_mm)))
+		return -ENODEV;
+
+	return __iommu_map(domain, io_mm, iova, paddr, size, prot);
+}
+EXPORT_SYMBOL_GPL(iommu_sva_map);
+
+size_t iommu_sva_map_sg(struct iommu_domain *domain, struct io_mm *io_mm,
+			unsigned long iova, struct scatterlist *sg,
+			unsigned int nents, int prot)
+{
+	if (WARN_ON(io_mm_is_shared(io_mm)))
+		return -ENODEV;
+
+	return domain->ops->map_sg(domain, io_mm, iova, sg, nents, prot);
+}
+EXPORT_SYMBOL_GPL(iommu_sva_map_sg);
+
+size_t iommu_sva_unmap(struct iommu_domain *domain, struct io_mm *io_mm,
+		       unsigned long iova, size_t size)
+{
+	if (WARN_ON(io_mm_is_shared(io_mm)))
+		return 0;
+
+	return __iommu_unmap(domain, io_mm, iova, size, true);
+}
+EXPORT_SYMBOL_GPL(iommu_sva_unmap);
+
+size_t iommu_sva_unmap_fast(struct iommu_domain *domain, struct io_mm *io_mm,
+			    unsigned long iova, size_t size)
+{
+	if (WARN_ON(io_mm_is_shared(io_mm)))
+		return 0;
+
+	return __iommu_unmap(domain, io_mm, iova, size, false);
+}
+EXPORT_SYMBOL_GPL(iommu_sva_unmap_fast);
+
+phys_addr_t iommu_sva_iova_to_phys(struct iommu_domain *domain,
+				   struct io_mm *io_mm, dma_addr_t iova)
+{
+	if (!io_mm)
+		return iommu_iova_to_phys(domain, iova);
+
+	if (WARN_ON(io_mm_is_shared(io_mm)))
+		return 0;
+
+	if (unlikely(domain->ops->sva_iova_to_phys == NULL))
+		return 0;
+
+	return domain->ops->sva_iova_to_phys(domain, io_mm, iova);
+}
+EXPORT_SYMBOL_GPL(iommu_sva_iova_to_phys);
+
+void iommu_sva_tlb_range_add(struct iommu_domain *domain, struct io_mm *io_mm,
+			     unsigned long iova, size_t size)
+{
+	if (!io_mm) {
+		iommu_tlb_range_add(domain, iova, size);
+		return;
+	}
+
+	if (WARN_ON(io_mm_is_shared(io_mm)))
+		return;
+
+	if (domain->ops->sva_iotlb_range_add != NULL)
+		domain->ops->sva_iotlb_range_add(domain, io_mm, iova, size);
+}
+EXPORT_SYMBOL_GPL(iommu_sva_tlb_range_add);
+
 /**
  * iommu_sva_find() - Find mm associated to the given PASID
  * @pasid: Process Address Space ID assigned to the mm
@@ -779,7 +944,7 @@ struct mm_struct *iommu_sva_find(int pasid)

 	spin_lock(&iommu_sva_lock);
 	io_mm = idr_find(&iommu_pasid_idr, pasid);
-	if (io_mm && io_mm_get_locked(io_mm)) {
+	if (io_mm_is_shared(io_mm) && io_mm_get_locked(io_mm)) {
 		if (mmget_not_zero(io_mm->mm))
 			mm = io_mm->mm;

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index b02e1d32c733..26ac9b2a813e 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1816,8 +1816,8 @@ static size_t iommu_pgsize(struct iommu_domain *domain,
 	return pgsize;
 }

-int iommu_map(struct iommu_domain *domain, unsigned long iova,
-	      phys_addr_t paddr, size_t size, int prot)
+int __iommu_map(struct iommu_domain *domain, struct io_mm *io_mm,
+		unsigned long iova, phys_addr_t paddr, size_t size, int prot)
 {
 	unsigned long orig_iova = iova;
 	unsigned int min_pagesz;
@@ -1825,7 +1825,8 @@ int iommu_map(struct iommu_domain *domain, unsigned long iova,
 	phys_addr_t orig_paddr = paddr;
 	int ret = 0;

-	if (unlikely(domain->ops->map == NULL ||
+	if (unlikely((!io_mm && domain->ops->map == NULL) ||
+		     (io_mm && domain->ops->sva_map == NULL) ||
 		     domain->pgsize_bitmap == 0UL))
 		return -ENODEV;

@@ -1854,7 +1855,12 @@ int iommu_map(struct iommu_domain *domain, unsigned long iova,
 		pr_debug("mapping: iova 0x%lx pa %pa pgsize 0x%zx\n",
 			 iova, &paddr, pgsize);

-		ret = domain->ops->map(domain, iova, paddr, pgsize, prot);
+		if (io_mm)
+			ret = domain->ops->sva_map(domain, io_mm, iova, paddr,
+						   pgsize, prot);
+		else
+			ret = domain->ops->map(domain, iova, paddr, pgsize,
+					       prot);
 		if (ret)
 			break;

@@ -1865,24 +1871,30 @@ int iommu_map(struct iommu_domain *domain, unsigned long iova,

 	/* unroll mapping in case something went wrong */
 	if (ret)
-		iommu_unmap(domain, orig_iova, orig_size - size);
+		__iommu_unmap(domain, io_mm, orig_iova, orig_size - size, true);
 	else
 		trace_map(orig_iova, orig_paddr, orig_size);

 	return ret;
 }
+
+int iommu_map(struct iommu_domain *domain, unsigned long iov,
+	      phys_addr_t paddr, size_t size, int prot)
+{
+	return __iommu_map(domain, NULL, iov, paddr, size, prot);
+}
 EXPORT_SYMBOL_GPL(iommu_map);

-static size_t __iommu_unmap(struct iommu_domain *domain,
-			    unsigned long iova, size_t size,
-			    bool sync)
+size_t __iommu_unmap(struct iommu_domain *domain, struct io_mm *io_mm,
+		     unsigned long iova, size_t size, bool sync)
 {
 	const struct iommu_ops *ops = domain->ops;
 	size_t unmapped_page, unmapped = 0;
 	unsigned long orig_iova = iova;
 	unsigned int min_pagesz;

-	if (unlikely(ops->unmap == NULL ||
+	if (unlikely((!io_mm && ops->unmap == NULL) ||
+		     (io_mm && ops->sva_unmap == NULL) ||
 		     domain->pgsize_bitmap == 0UL))
 		return 0;

@@ -1912,7 +1924,11 @@ static size_t __iommu_unmap(struct iommu_domain *domain,
 	while (unmapped < size) {
 		size_t pgsize = iommu_pgsize(domain, iova, size - unmapped);

-		unmapped_page = ops->unmap(domain, iova, pgsize);
+		if (io_mm)
+			unmapped_page = ops->sva_unmap(domain, io_mm, iova,
+						       pgsize);
+		else
+			unmapped_page = ops->unmap(domain, iova, pgsize);
 		if (!unmapped_page)
 			break;

@@ -1936,19 +1952,20 @@ static size_t __iommu_unmap(struct iommu_domain *domain,
 size_t iommu_unmap(struct iommu_domain *domain,
 		   unsigned long iova, size_t size)
 {
-	return __iommu_unmap(domain, iova, size, true);
+	return __iommu_unmap(domain, NULL, iova, size, true);
 }
 EXPORT_SYMBOL_GPL(iommu_unmap);

 size_t iommu_unmap_fast(struct iommu_domain *domain,
 			unsigned long iova, size_t size)
 {
-	return __iommu_unmap(domain, iova, size, false);
+	return __iommu_unmap(domain, NULL, iova, size, false);
 }
 EXPORT_SYMBOL_GPL(iommu_unmap_fast);

-size_t default_iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
-			 struct scatterlist *sg, unsigned int nents, int prot)
+size_t default_iommu_map_sg(struct iommu_domain *domain, struct io_mm *io_mm,
+			    unsigned long iova, struct scatterlist *sg, unsigned
+			    int nents, int prot)
 {
 	struct scatterlist *s;
 	size_t mapped = 0;
@@ -1972,7 +1989,7 @@ size_t default_iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
 		if (!IS_ALIGNED(s->offset, min_pagesz))
 			goto out_err;

-		ret = iommu_map(domain, iova + mapped, phys, s->length, prot);
+		ret = __iommu_map(domain, io_mm, iova + mapped, phys, s->length, prot);
 		if (ret)
 			goto out_err;

@@ -1983,7 +2000,7 @@ size_t default_iommu_map_sg(struct iommu_domain *domain, unsigned long iova,

 out_err:
 	/* undo mappings already done */
-	iommu_unmap(domain, iova, mapped);
+	__iommu_unmap(domain, io_mm, iova, mapped, true);

 	return 0;

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index b525f5636df8..1f63a8691173 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -248,13 +248,17 @@ struct iommu_sva_param {
  * @mm_invalidate: Invalidate a range of mappings for an mm
  * @map: map a physically contiguous memory region to an iommu domain
  * @unmap: unmap a physically contiguous memory region from an iommu domain
+ * @sva_map: map a physically contiguous memory region to an address space
+ * @sva_unmap: unmap a physically contiguous memory region from an address space
  * @map_sg: map a scatter-gather list of physically contiguous memory chunks
  *          to an iommu domain
  * @flush_tlb_all: Synchronously flush all hardware TLBs for this domain
  * @tlb_range_add: Add a given iova range to the flush queue for this domain
+ * @sva_iotlb_range_add: Add a given iova range to the flush queue for this mm
  * @tlb_sync: Flush all queued ranges from the hardware TLBs and empty flush
  *            queue
  * @iova_to_phys: translate iova to physical address
+ * @sva_iova_to_phys: translate iova to physical address
  * @add_device: add device to iommu grouping
  * @remove_device: remove device from iommu grouping
  * @device_group: find iommu group for a particular device
@@ -302,13 +306,24 @@ struct iommu_ops {
 		   phys_addr_t paddr, size_t size, int prot);
 	size_t (*unmap)(struct iommu_domain *domain, unsigned long iova,
 		     size_t size);
-	size_t (*map_sg)(struct iommu_domain *domain, unsigned long iova,
-			 struct scatterlist *sg, unsigned int nents, int prot);
+	int (*sva_map)(struct iommu_domain *domain, struct io_mm *io_mm,
+		       unsigned long iova, phys_addr_t paddr, size_t size,
+		       int prot);
+	size_t (*sva_unmap)(struct iommu_domain *domain, struct io_mm *io_mm,
+			    unsigned long iova, size_t size);
+	size_t (*map_sg)(struct iommu_domain *domain, struct io_mm *io_mm,
+			 unsigned long iova, struct scatterlist *sg,
+			 unsigned int nents, int prot);
 	void (*flush_iotlb_all)(struct iommu_domain *domain);
 	void (*iotlb_range_add)(struct iommu_domain *domain,
 				unsigned long iova, size_t size);
+	void (*sva_iotlb_range_add)(struct iommu_domain *domain,
+				    struct io_mm *io_mm, unsigned long iova,
+				    size_t size);
 	void (*iotlb_sync)(struct iommu_domain *domain);
 	phys_addr_t (*iova_to_phys)(struct iommu_domain *domain, dma_addr_t iova);
+	phys_addr_t (*sva_iova_to_phys)(struct iommu_domain *domain,
+					struct io_mm *io_mm, dma_addr_t iova);
 	int (*add_device)(struct device *dev);
 	void (*remove_device)(struct device *dev);
 	struct iommu_group *(*device_group)(struct device *dev);
@@ -528,15 +543,20 @@ extern int iommu_sva_invalidate(struct iommu_domain *domain,
 		struct device *dev, struct tlb_invalidate_info *inv_info);

 extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
+extern int __iommu_map(struct iommu_domain *domain, struct io_mm *io_mm,
+		       unsigned long iova, phys_addr_t paddr, size_t size,
+		       int prot);
 extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
 		     phys_addr_t paddr, size_t size, int prot);
+extern size_t __iommu_unmap(struct iommu_domain *domain, struct io_mm *io_mm,
+			    unsigned long iova, size_t size, bool sync);
 extern size_t iommu_unmap(struct iommu_domain *domain, unsigned long iova,
 			  size_t size);
 extern size_t iommu_unmap_fast(struct iommu_domain *domain,
 			       unsigned long iova, size_t size);
-extern size_t default_iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
-				struct scatterlist *sg,unsigned int nents,
-				int prot);
+extern size_t default_iommu_map_sg(struct iommu_domain *domain, struct io_mm *io_mm,
+				   unsigned long iova, struct scatterlist *sg,
+				   unsigned int nents, int prot);
 extern phys_addr_t iommu_iova_to_phys(struct iommu_domain *domain, dma_addr_t iova);
 extern void iommu_set_fault_handler(struct iommu_domain *domain,
 			iommu_fault_handler_t handler, void *token);
@@ -622,7 +642,7 @@ static inline size_t iommu_map_sg(struct iommu_domain *domain,
 				  unsigned long iova, struct scatterlist *sg,
 				  unsigned int nents, int prot)
 {
-	return domain->ops->map_sg(domain, iova, sg, nents, prot);
+	return domain->ops->map_sg(domain, NULL, iova, sg, nents, prot);
 }

 /* PCI device grouping function */
@@ -710,12 +730,25 @@ static inline struct iommu_domain *iommu_get_domain_for_dev(struct device *dev)
 	return NULL;
 }

+static inline int __iommu_map(struct iommu_domain *domain, struct io_mm *io_mm,
+			      unsigned long iova, phys_addr_t paddr,
+			      size_t size, int prot)
+{
+	return -ENODEV;
+}
+
 static inline int iommu_map(struct iommu_domain *domain, unsigned long iova,
 			    phys_addr_t paddr, size_t size, int prot)
 {
 	return -ENODEV;
 }

+static inline size_t __iommu_unmap(struct iommu_domain *domain, struct io_mm *io_mm,
+		     unsigned long iova, size_t size, bool sync)
+{
+	return 0;
+}
+
 static inline size_t iommu_unmap(struct iommu_domain *domain,
 				 unsigned long iova, size_t size)
 {
@@ -729,8 +762,9 @@ static inline size_t iommu_unmap_fast(struct iommu_domain *domain,
 }

 static inline size_t iommu_map_sg(struct iommu_domain *domain,
-				  unsigned long iova, struct scatterlist *sg,
-				  unsigned int nents, int prot)
+				  struct io_mm *io_mm, unsigned long iova,
+				  struct scatterlist *sg, unsigned int nents,
+				  int prot)
 {
 	return 0;
 }
@@ -1017,6 +1051,22 @@ extern int __iommu_sva_bind_device(struct device *dev, struct mm_struct *mm,
 extern int __iommu_sva_unbind_device(struct device *dev, int pasid);
 extern void __iommu_sva_unbind_dev_all(struct device *dev);

+int iommu_sva_alloc_pasid(struct device *dev, struct io_mm **io_mm);
+void iommu_sva_free_pasid(struct device *dev, struct io_mm *io_mm);
+
+int iommu_sva_map(struct iommu_domain *domain, struct io_mm *io_mm,
+		  unsigned long iova, phys_addr_t paddr, size_t size, int prot);
+size_t iommu_sva_map_sg(struct iommu_domain *domain, struct io_mm *io_mm,
+			unsigned long iova, struct scatterlist *sg,
+			unsigned int nents, int prot);
+size_t iommu_sva_unmap(struct iommu_domain *domain,
+		       struct io_mm *io_mm, unsigned long iova, size_t size);
+size_t iommu_sva_unmap_fast(struct iommu_domain *domain, struct io_mm *io_mm,
+			    unsigned long iova, size_t size);
+phys_addr_t iommu_sva_iova_to_phys(struct iommu_domain *domain,
+				   struct io_mm *io_mm, dma_addr_t iova);
+void iommu_sva_tlb_range_add(struct iommu_domain *domain, struct io_mm *io_mm,
+			     unsigned long iova, size_t size);
 extern struct mm_struct *iommu_sva_find(int pasid);
 #else /* CONFIG_IOMMU_SVA */
 static inline int iommu_sva_device_init(struct device *dev,
@@ -1048,6 +1098,58 @@ static inline void __iommu_sva_unbind_dev_all(struct device *dev)
 {
 }

+static inline struct io_mm *
+iommu_sva_alloc_pasid(struct iommu_domain *domain, struct device *dev)
+{
+	return ERR_PTR(-ENODEV);
+}
+
+static inline void iommu_sva_free_pasid(struct io_mm *io_mm, struct device *dev)
+{
+}
+
+static inline int iommu_sva_map(struct iommu_domain *domain,
+				struct io_mm *io_mm, unsigned long iova,
+				phys_addr_t paddr, size_t size, int prot)
+{
+	return -EINVAL;
+}
+
+static inline size_t iommu_sva_map_sg(struct iommu_domain *domain,
+				      struct io_mm *io_mm, unsigned long iova,
+				      struct scatterlist *sg,
+				      unsigned int nents, int prot)
+{
+	return 0;
+}
+
+static inline size_t iommu_sva_unmap(struct iommu_domain *domain,
+				     struct io_mm *io_mm, unsigned long iova,
+				     size_t size)
+{
+	return 0;
+}
+
+static inline size_t iommu_sva_unmap_fast(struct iommu_domain *domain,
+					  struct io_mm *io_mm,
+					  unsigned long iova, size_t size)
+{
+	return 0;
+}
+
+static inline phys_addr_t iommu_sva_iova_to_phys(struct iommu_domain *domain,
+						 struct io_mm *io_mm,
+						 dma_addr_t iova)
+{
+	return 0;
+}
+
+static inline void iommu_sva_tlb_range_add(struct iommu_domain *domain,
+					   struct io_mm *io_mm,
+					   unsigned long iova, size_t size)
+{
+}
+
 static inline struct mm_struct *iommu_sva_find(int pasid)
 {
 	return NULL;
Tian, Kevin July 26, 2018, 3:03 a.m. UTC | #10
> From: Jean-Philippe Brucker [mailto:jean-philippe.brucker@arm.com]
> Sent: Thursday, July 26, 2018 3:20 AM
> 
> On 25/07/18 07:19, Tian, Kevin wrote:
> >> From: Tian, Kevin
> >> Sent: Wednesday, July 25, 2018 10:16 AM
> >>
> > [...]
> >>>
> >>> There is another way: as we're planning to add a generic pasid_alloc()
> >>> function to the IOMMU API, the mdev module itself could allocate a
> >>> default PASID for each mdev by calling pasid_alloc() on the mdev's
> >>> parent, and then do map()/unmap() with that PASID. This way we don't
> >>> have to add IOMMU ops to the mdev bus, everything can still be done
> >>> using the ops of the parent. And IOMMU drivers "only" have to
> >> implement
> >>> PASID ops, which will be reused by drivers other than mdev.
> >>
> >> Yes, PASID is a more general abstraction than mdev. However there is
> >> one problem. Please note that with the new scalable mode, PASID is not
> >> used just for SVA. You can do 1st-level, 2nd-level and nested translation
> >> at PASID granularity, meaning each PASID can be bound to a unique iommu
> >> domain. However, today the IOMMU driver allows only one iommu_domain
> >> per device. If we simply install the allocated PASID into the parent
> >> device, we should also remove that iommu_domain limitation. Is that the
> >> way we want to pursue?
> >>
> >> mdev avoids such a problem as it introduces a separate device...
> >
> > Another thing to think about is simplifying the caller logic. It would
> > be good to have consistent IOMMU APIs across PCI devices and mdevs, for
> > common VFIO operations e.g. map/unmap, iommu_attach_group, etc. If we
> > need special versions of the ops with PASID, it brings a burden to VFIO
> > and other IOMMU clients...
> 
> Yes, providing special ops is the simplest to implement (or least
> invasive, I guess) but isn't particularly elegant. See the patch from
> Jordan below, which I was planning to add to the SVA series (I'm
> laboriously working towards the next version), and my email from last week:
> https://patchwork.kernel.org/patch/10412251/
> 
> Whenever I come back to hierarchical IOMMU domains I reject them as too
> complicated, but maybe that is what we need. I find them difficult to
> reason about because domains currently represent both a collection of
> devices and one or more address spaces. I proposed the io_mm thing to
> represent a single address space, and to avoid adding special cases to
> every function in the IOMMU subsystem that manipulates domains.

I suppose io_mm is still under iommu_domain, right? This is different
from the hierarchical iommu domain concept...

> 
> > For IOMMU-capable mdevs, there are two use cases which have
> > been talked about so far:
> >
> > 1) mdev with PASID-granular DMA isolation
> > 2) mdev inheriting the RID from the parent device (no sharing)
> 
> Would that be a single mdev per parent device? Otherwise I don't really
> understand how it would work without all map/unmap operations going to
> the parent device's driver.

Yes, that is why I said "no sharing". In that special case, the host
driver itself doesn't do DMA. Only the mdev does.

> 
> > 1) is what this RFC is currently for. 2) is what Alex is concerned
> > about, which is not covered in the RFC yet.
> >
> > In my mind, an ideal design is like below:
> >
> > a) VFIO provides an interface for the parent driver to specify
> > whether an mdev is IOMMU capable, with a flag to indicate
> > whether it's PASID-granular or RID-inherit;
> >
> > b) Once an IOMMU-capable mdev is created, VFIO treats it no
> > differently from normal physical devices, using exactly the same
> > IOMMU APIs to operate it (including future SVA). VFIO doesn't
> > care whether PASID or RID will be used in hardware;
> >
> > c) The IOMMU driver then handles the RID/PASID difference internally,
> > based on its own configuration and the mdev type.
> >
> > To support the above goal, I feel a new device handle is a nice fit
> > to represent the mdev within the IOMMU driver, with the same set of
> > capabilities as observed on its parent device.
> 
> It's a good abstraction, but I'm still concerned about other users of
> PASID-granular DMA isolation: for example, GPU drivers wanting to improve
> isolation of DMA bufs will want the same functionality without going
> through the vfio-mdev module.

For GPU I think you meant SVA. That one anyway needs its own
interface, as we have been discussing in your and Jacob's series.

Here mdev is orthogonal to a specific capability like SVA. It is
sort of a logical representation of a subset of the parent device's
resources, on top of which we can enable IOMMU capabilities including SVA.

I'm not sure whether it is good to couple the two requirements so closely...

> 
> The IOMMU operations we care about don't take a device handle, I think,
> just a domain. And VFIO itself only deals with domains when doing
> map/unmap. Maybe we could add this operation to the IOMMU subsystem:
> 
> child_domain = domain_create_child(parent_dev, parent_domain)
> 
> A child domain would be a smaller isolation granule, getting a PASID if
> that's what the IOMMU implements or another mechanism for 2). It is
> automatically attached to its parent's devices, so attach/detach
> operations wouldn't be necessary on the child.
> 
> Depending on the underlying architecture the child domain can support
> map/unmap on a single stage, or map/unmap for 2nd level and
> bind_pgtable
> for 1st level.
> 
> I'm not sure how this works for host SVA though. I think the
> sva_bind_dev() API could stay the same, but the internals will need
> to change.
> 

Hierarchical domains might be the right way to go, but let's do more
thinking on corner cases.

One open question is whether an iommu domain can fully carry all the
required attributes of an mdev. Note that today, for a physical device,
each vendor driver maintains a device structure for device-specific info
which may impact IOMMU settings (e.g. struct device_domain_info in
intel-iommu, and struct arm_smmu_device in arm-smmu). If we want an mdev
to have different attributes from its parent device, then a new
representation might be required. But honestly speaking I don't think
that is a valid requirement now, since physically it is still the IOMMU
structure of the parent device being configured, so the mdev should just
inherit the same attributes as its parent. If the role of the mdev
representation in the current RFC is just to connect the iommu_domain
when a hierarchical domain is missing, then we might instead just make
the latter happen...

Let's think more in this direction. Btw, can you elaborate on any other
complexities you found when you evaluated this option earlier?

Thanks
Kevin
Tian, Kevin July 26, 2018, 3:28 a.m. UTC | #11
> From: Tian, Kevin
> Sent: Thursday, July 26, 2018 11:04 AM
[...]
> >
> > The IOMMU operations we care about don't take a device handle, I think,
> > just a domain. And VFIO itself only deals with domains when doing
> > map/unmap. Maybe we could add this operation to the IOMMU
> subsystem:
> >
> > child_domain = domain_create_child(parent_dev, parent_domain)
> >
> > A child domain would be a smaller isolation granule, getting a PASID if
> > that's what the IOMMU implements or another mechanism for 2). It is
> > automatically attached to its parent's devices, so attach/detach
> > operations wouldn't be necessary on the child.
> >
> > Depending on the underlying architecture the child domain can support
> > map/unmap on a single stage, or map/unmap for 2nd level and
> > bind_pgtable
> > for 1st level.
> >
> > I'm not sure how this works for host SVA though. I think the
> > sva_bind_dev() API could stay the same, but the internals will need
> > to change.
> >
> 
> A hierarchical domain might be the right way to go, but let's do more
> thinking on any corner cases.
> 

BTW, maybe we don't need to make it 'hierarchical', as maintaining a
hierarchy usually brings more work. What we require is possibly just
the capability of binding one device to multiple iommu_domains. One
domain is reserved for the parent driver's own DMA activities (e.g.
serving the DMA API), while the other domains are auxiliary and can be
tagged with a PASID (or any other identifier which the IOMMU can use
to support multiple domains). Such identifiers may be provisioned
automatically when an auxiliary domain is attached, i.e. not requiring
an explicit request from the caller. IMO it's safe to assume that
supporting multiple iommu_domains implies some finer-grained capability
than RID-based isolation in the underlying IOMMU. Then there is no need
for a parent/child concept.
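
To illustrate the model, a parent driver's flow could be something
like the sketch below (all the aux-domain function names here are
invented for illustration, not existing API):

	static int example_add_aux_domain(struct device *parent,
					  struct iommu_domain **out)
	{
		struct iommu_domain *aux;
		int pasid, ret;

		aux = iommu_domain_alloc(parent->bus);
		if (!aux)
			return -ENOMEM;

		/* Hypothetical op: add an auxiliary translation; the
		 * default domain serving the parent's own DMA API
		 * traffic stays attached as-is. */
		ret = iommu_attach_aux_domain(aux, parent);
		if (ret) {
			iommu_domain_free(aux);
			return ret;
		}

		/* The IOMMU driver provisioned a PASID (or other tag)
		 * at attach time; read it back to program the device. */
		pasid = iommu_get_aux_pasid(aux, parent);
		if (pasid < 0) {
			iommu_detach_aux_domain(aux, parent);
			iommu_domain_free(aux);
			return pasid;
		}

		*out = aux;
		return 0;
	}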

Thoughts?

Thanks
Kevin
Jean-Philippe Brucker July 26, 2018, 3:09 p.m. UTC | #12
On 26/07/18 04:03, Tian, Kevin wrote:
>> Whenever I come back to hierarchical IOMMU domains I reject it as too
>> complicated, but maybe that is what we need. I find it difficult to
>> reason about because domains currently represent both a collection of
>> devices and one or more address spaces. I proposed the io_mm thing to
>> represent a single address space, and to avoid adding special cases to
>> every function in the IOMMU subsystem that manipulates domains.
> 
> I suppose io_mm is still under iommu_domain, right? This is different
> from the hierarchical iommu domain concept...

Yes, my initial solution attaches io_mm to domains, but in the rest of
yesterday's mail I tried to explore a way to replace io_mm with child
domains instead. In my current patches, io_mm, when used for private
PASIDs, is basically a lightweight child domain. I don't know which
solution is best, or whether an mdev-aware IOMMU is still preferable.
For the moment I'll keep the private PASID proposal that uses io_mm,
until we've thought a bit more about this.

[...]
>> It's a good abstraction, but I'm still concerned that other users of
>> PASID-granular DMA isolation, for example GPU drivers wanting to improve
>> isolation of DMA bufs, will want the same functionality without going
>> through the vfio-mdev module.
> 
> For GPU I think you meant SVA. That one anyway needs its own
> interface, as we have been discussing in your and Jacob's series.

Actually I didn't mean SVA. People would like to allocate PASIDs in
kernel drivers and do iommu_map/unmap on them, without binding process
address spaces. Not counting hardware validation, I've actually seen
more demand for private PASID management than for SVA.

See for example the discussion on RFCv2 of SVA, or Jordan's series:
https://www.spinics.net/lists/arm-kernel/msg611038.html
https://lwn.net/Articles/747889/

It would be good if mdev and non-mdev could reuse most of the same code
for private PASIDs, whatever it turns out to be.
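
For reference, the private PASID case I have in mind is along these
lines (a sketch only; iommu_private_pasid_alloc/free are made-up
placeholders for whatever API we end up with):

	static int example_map_dmabuf(struct device *dev,
				      struct iommu_domain *pasid_dom,
				      unsigned long iova, phys_addr_t phys,
				      size_t size)
	{
		int pasid, ret;

		/* Allocate a PASID for this domain on this device; no
		 * process mm is bound anywhere. */
		pasid = iommu_private_pasid_alloc(dev, pasid_dom);
		if (pasid < 0)
			return pasid;

		/* Ordinary map, but into the PASID-tagged address
		 * space instead of the device's main one. */
		ret = iommu_map(pasid_dom, iova, phys, size,
				IOMMU_READ | IOMMU_WRITE);
		if (ret)
			iommu_private_pasid_free(dev, pasid);
		return ret;
	}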

> Here mdev is orthogonal to a specific capability like SVA. It is a
> sort of logical representation of a subset of the parent device's
> resources, on top of which we can enable IOMMU capabilities including
> SVA.
> 
> I'm not sure whether it is a good idea to combine the two requirements
> so closely...
> 
>>
>> The IOMMU operations we care about don't take a device handle, I think,
>> just a domain. And VFIO itself only deals with domains when doing
>> map/unmap. Maybe we could add this operation to the IOMMU subsystem:
>>
>> child_domain = domain_create_child(parent_dev, parent_domain)
>>
>> A child domain would be a smaller isolation granule, getting a PASID if
>> that's what the IOMMU implements or another mechanism for 2). It is
>> automatically attached to its parent's devices, so attach/detach
>> operations wouldn't be necessary on the child.
>>
>> Depending on the underlying architecture the child domain can support
>> map/unmap on a single stage, or map/unmap for 2nd level and
>> bind_pgtable
>> for 1st level.
>>
>> I'm not sure how this works for host SVA though. I think the
>> sva_bind_dev() API could stay the same, but the internals will need
>> to change.

Thinking more about this, the SVA case seems a bit nasty. If we replaced
io_mm with iommu_domain for SVA, a child domain would represent a single
process address space, and therefore have multiple parent domains... Not
sure we should go down that road. Maybe we should keep SVA separate, and
keep io_mm as a wrapper for mm_struct.


> A hierarchical domain might be the right way to go, but let's do more
> thinking on any corner cases.
> 
> One open question is whether an iommu_domain can fully carry all the
> attributes required for an mdev. Note that today, for a physical device,
> each vendor driver maintains a device structure for device-specific
> info which may impact IOMMU settings (e.g. struct device_domain_info
> in intel-iommu, and struct arm_smmu_device in arm-smmu). If we want an
> mdev to have different attributes from its parent device, then a new
> representation might be required. But honestly speaking I don't think
> that is a valid requirement now, since physically it is still the IOMMU
> structure of the parent device being configured, so the mdev should
> just inherit the same attributes as its parent. If the role of the mdev
> representation in the current RFC is just to connect the iommu_domain
> when hierarchical domains are missing, then we might instead just make
> the latter happen...

Agreed, the mdev would inherit the properties of its parent, since
physically the IOMMU sees DMA transactions coming from the parent.

> Let's think more in this direction. BTW, can you elaborate on any
> other complexities you found when you evaluated this option earlier?

I can't think of anything right now - my prototype worked well with
iommu_sva_alloc_pasid(), but the goal was mainly for me to understand
mdev using a toy DMA engine, so I didn't spend time thinking about
corner cases.

Thanks,
Jean
Jean-Philippe Brucker July 26, 2018, 3:09 p.m. UTC | #13
On 26/07/18 04:28, Tian, Kevin wrote:
>> A hierarchical domain might be the right way to go, but let's do more
>> thinking on any corner cases.
>>
> 
> BTW, maybe we don't need to make it 'hierarchical', as maintaining a
> hierarchy usually brings more work. What we require is possibly just
> the capability of binding one device to multiple iommu_domains. One
> domain is reserved for the parent driver's own DMA activities (e.g.
> serving the DMA API), while the other domains are auxiliary and can be
> tagged with a PASID (or any other identifier which the IOMMU can use
> to support multiple domains). Such identifiers may be provisioned
> automatically when an auxiliary domain is attached, i.e. not requiring
> an explicit request from the caller. IMO it's safe to assume that
> supporting multiple iommu_domains implies some finer-grained capability
> than RID-based isolation in the underlying IOMMU. Then there is no need
> for a parent/child concept.

Right, we probably don't need a hierarchy. I like this model (it's
actually the one I favor for supporting PASID in virtio-iommu), though
I'm hoping we can avoid changing the logic of iommu_attach/detach, and
instead introduce a new "get_child_domain", "attach_aux_domain" or
simply "clone" op.

Currently the attach_dev() op toggles the domain of a device. For
example, a device automatically gets a default domain for kernel DMA,
then VFIO attaches a new domain for assigning the device to a VM.
attach_dev() disables the default domain and installs fresh page
tables. detach_dev() isn't called, and the SMMU drivers don't even
implement it. If we wanted to reuse attach_dev() for auxiliary domains,
we'd need some flag to differentiate the "add an auxiliary domain"
operation from the "detach everything and attach one new domain" one.
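
Concretely, I'm picturing additions to struct iommu_ops along these
lines (op names invented for the sake of the example):

	struct iommu_ops {
		/* ... existing ops ... */
		int  (*attach_dev)(struct iommu_domain *domain,
				   struct device *dev);
		void (*detach_dev)(struct iommu_domain *domain,
				   struct device *dev);

		/* New: install an auxiliary translation without
		 * replacing whatever domain (default or VFIO's) is
		 * currently attached to the device. */
		int  (*attach_aux_dev)(struct iommu_domain *domain,
				       struct device *dev);
		void (*detach_aux_dev)(struct iommu_domain *domain,
				       struct device *dev);
	};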

Thanks,
Jean

Patch

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 3ede34a..57ccfc4 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -5276,6 +5276,17 @@  static void intel_iommu_put_resv_regions(struct device *dev,
 	}
 }
 
+static struct iommu_group *intel_iommu_device_group(struct device *dev)
+{
+	if (dev_is_pci(dev))
+		return pci_device_group(dev);
+
+	if (dev_is_mdev(dev))
+		return iommu_group_alloc();
+
+	return ERR_PTR(-EINVAL);
+}
+
 #ifdef CONFIG_INTEL_IOMMU_SVM
 int intel_iommu_enable_pasid(struct intel_iommu *iommu, struct intel_svm_dev *sdev)
 {
@@ -5370,7 +5381,7 @@  const struct iommu_ops intel_iommu_ops = {
 	.remove_device		= intel_iommu_remove_device,
 	.get_resv_regions	= intel_iommu_get_resv_regions,
 	.put_resv_regions	= intel_iommu_put_resv_regions,
-	.device_group		= pci_device_group,
+	.device_group		= intel_iommu_device_group,
 	.pgsize_bitmap		= INTEL_IOMMU_PGSIZES,
 };
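
For context, dev_is_mdev() is introduced by an earlier patch in this
series; a plausible shape for it (an assumption for illustration, not
quoted from that patch) is:

	/* Assumed helper: true if this struct device sits on the mdev
	 * bus.  The actual definition in the series may differ. */
	static inline bool dev_is_mdev(struct device *dev)
	{
		return dev->bus && !strcmp(dev->bus->name, "mdev");
	}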