diff mbox

[RFC,1/8] iommu: Introduce bind_pasid_table API function

Message ID 1493201525-14418-2-git-send-email-yi.l.liu@intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

Yi Liu April 26, 2017, 10:11 a.m. UTC
From: Jacob Pan <jacob.jun.pan@linux.intel.com>

Virtual IOMMU was proposed to support Shared Virtual Memory (SVM) use
case in the guest:
https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg05311.html

As part of the proposed architecture, when a SVM capable PCI
device is assigned to a guest, nested mode is turned on. Guest owns the
first level page tables (request with PASID) and performs GVA->GPA
translation. Second level page tables are owned by the host for GPA->HPA
translation for both request with and without PASID.

A new IOMMU driver interface is therefore needed to perform tasks as
follows:
* Enable nested translation and appropriate translation type
* Assign guest PASID table pointer (in GPA) and size to host IOMMU

This patch introduces new functions called iommu_(un)bind_pasid_table()
to IOMMU APIs. Architecture specific IOMMU function can be added later
to perform the specific steps for binding pasid table of assigned devices.

This patch also adds model definition in iommu.h. It would be used to
check if the bind request is from a compatible entity. e.g. a bind
request from an intel_iommu emulator may not be supported by an ARM SMMU
driver.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
---
 drivers/iommu/iommu.c | 19 +++++++++++++++++++
 include/linux/iommu.h | 31 +++++++++++++++++++++++++++++++
 2 files changed, 50 insertions(+)

Comments

Jean-Philippe Brucker April 26, 2017, 4:56 p.m. UTC | #1
Hi Yi, Jacob,

On 26/04/17 11:11, Liu, Yi L wrote:
> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> 
> Virtual IOMMU was proposed to support Shared Virtual Memory (SVM) use
> case in the guest:
> https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg05311.html
> 
> As part of the proposed architecture, when a SVM capable PCI
> device is assigned to a guest, nested mode is turned on. Guest owns the
> first level page tables (request with PASID) and performs GVA->GPA
> translation. Second level page tables are owned by the host for GPA->HPA
> translation for both request with and without PASID.
> 
> A new IOMMU driver interface is therefore needed to perform tasks as
> follows:
> * Enable nested translation and appropriate translation type
> * Assign guest PASID table pointer (in GPA) and size to host IOMMU
> 
> This patch introduces new functions called iommu_(un)bind_pasid_table()
> to IOMMU APIs. Architecture specific IOMMU function can be added later
> to perform the specific steps for binding pasid table of assigned devices.
> 
> This patch also adds model definition in iommu.h. It would be used to
> check if the bind request is from a compatible entity. e.g. a bind
> request from an intel_iommu emulator may not be supported by an ARM SMMU
> driver.
> 
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> ---
>  drivers/iommu/iommu.c | 19 +++++++++++++++++++
>  include/linux/iommu.h | 31 +++++++++++++++++++++++++++++++
>  2 files changed, 50 insertions(+)
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index dbe7f65..f2da636 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -1134,6 +1134,25 @@ int iommu_attach_device(struct iommu_domain *domain, struct device *dev)
>  }
>  EXPORT_SYMBOL_GPL(iommu_attach_device);
>  
> +int iommu_bind_pasid_table(struct iommu_domain *domain, struct device *dev,
> +			struct pasid_table_info *pasidt_binfo)

I guess that domain can always be deduced from dev using
iommu_get_domain_for_dev, and doesn't need to be passed as argument?

For the next version of my SVM series, I was thinking of passing group
instead of device to iommu_bind. Since all devices in a group are expected
to share the same mappings (whether they want it or not), users will have
to do iommu_group_for_each_dev anyway (as you do in patch 6/8). So it
might be simpler to let the IOMMU core take the group lock and do
group->domain->ops->bind_task(dev...) for each device. The question also
holds for iommu_do_invalidate in patch 3/8.

This way the prototypes would be:
int iommu_bind...(struct iommu_group *group, struct ... *info)
int iommu_unbind...(struct iommu_group *group, struct ...*info)
int iommu_invalidate...(struct iommu_group *group, struct ...*info)

For PASID table binding it might not matter much, as VFIO will most likely
be the only user. But task binding will be called by device drivers, which
by now should be encouraged to do things at iommu_group granularity.
Alternatively it could be done implicitly like in iommu_attach_device,
with "iommu_bind_device_x" calling "iommu_bind_group_x".


Extending this reasoning, since groups in a domain are also supposed to
have the same mappings, then similarly to map/unmap,
bind/unbind/invalidate should really be done with an iommu_domain (and
nothing else) as target argument. However this requires the IOMMU core to
keep a group list in each domain, which might complicate things a little
too much.

But "all devices in a domain share the same PASID table" is the paradigm
I'm currently using in the guts of arm-smmu-v3. And I wonder if, as with
iommu_group, it should be made more explicit to users, so they don't
assume that devices within a domain are isolated from each others with
regard to PASID DMA.

> +{
> +	if (unlikely(!domain->ops->bind_pasid_table))
> +		return -EINVAL;
> +
> +	return domain->ops->bind_pasid_table(domain, dev, pasidt_binfo);
> +}
> +EXPORT_SYMBOL_GPL(iommu_bind_pasid_table);
> +
> +int iommu_unbind_pasid_table(struct iommu_domain *domain, struct device *dev)
> +{
> +	if (unlikely(!domain->ops->unbind_pasid_table))
> +		return -EINVAL;
> +
> +	return domain->ops->unbind_pasid_table(domain, dev);
> +}
> +EXPORT_SYMBOL_GPL(iommu_unbind_pasid_table);
> +
>  static void __iommu_detach_device(struct iommu_domain *domain,
>  				  struct device *dev)
>  {
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 0ff5111..491a011 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -131,6 +131,15 @@ struct iommu_dm_region {
>  	int			prot;
>  };
>  
> +struct pasid_table_info {
> +	__u64	ptr;	/* PASID table ptr */
> +	__u64	size;	/* PASID table size*/
> +	__u32	model;	/* magic number */
> +#define INTEL_IOMMU	(1 << 0)
> +#define ARM_SMMU	(1 << 1)

Not sure if there is any advantage in this being a bitfield rather than
simple values (1, 2, 3, etc).
The names should also have a prefix, such as "PASID_TABLE_MODEL_"

Thanks a lot for doing this,
Jean
Jacob Pan April 26, 2017, 6:29 p.m. UTC | #2
On Wed, 26 Apr 2017 17:56:45 +0100
Jean-Philippe Brucker <jean-philippe.brucker@arm.com> wrote:

> Hi Yi, Jacob,
> 
> On 26/04/17 11:11, Liu, Yi L wrote:
> > From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > 
> > Virtual IOMMU was proposed to support Shared Virtual Memory (SVM)
> > use case in the guest:
> > https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg05311.html
> > 
> > As part of the proposed architecture, when a SVM capable PCI
> > device is assigned to a guest, nested mode is turned on. Guest owns
> > the first level page tables (request with PASID) and performs
> > GVA->GPA translation. Second level page tables are owned by the
> > host for GPA->HPA translation for both request with and without
> > PASID.
> > 
> > A new IOMMU driver interface is therefore needed to perform tasks as
> > follows:
> > * Enable nested translation and appropriate translation type
> > * Assign guest PASID table pointer (in GPA) and size to host IOMMU
> > 
> > This patch introduces new functions called
> > iommu_(un)bind_pasid_table() to IOMMU APIs. Architecture specific
> > IOMMU function can be added later to perform the specific steps for
> > binding pasid table of assigned devices.
> > 
> > This patch also adds model definition in iommu.h. It would be used
> > to check if the bind request is from a compatible entity. e.g. a
> > bind request from an intel_iommu emulator may not be supported by
> > an ARM SMMU driver.
> > 
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> > ---
> >  drivers/iommu/iommu.c | 19 +++++++++++++++++++
> >  include/linux/iommu.h | 31 +++++++++++++++++++++++++++++++
> >  2 files changed, 50 insertions(+)
> > 
> > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > index dbe7f65..f2da636 100644
> > --- a/drivers/iommu/iommu.c
> > +++ b/drivers/iommu/iommu.c
> > @@ -1134,6 +1134,25 @@ int iommu_attach_device(struct iommu_domain
> > *domain, struct device *dev) }
> >  EXPORT_SYMBOL_GPL(iommu_attach_device);
> >  
> > +int iommu_bind_pasid_table(struct iommu_domain *domain, struct
> > device *dev,
> > +			struct pasid_table_info *pasidt_binfo)
> 
> I guess that domain can always be deduced from dev using
> iommu_get_domain_for_dev, and doesn't need to be passed as argument?
> 
true. device should have attached a domain before calling this API.
> For the next version of my SVM series, I was thinking of passing group
> instead of device to iommu_bind. Since all devices in a group are
> expected to share the same mappings (whether they want it or not),
> users will have to do iommu_group_for_each_dev anyway (as you do in
> patch 6/8). So it might be simpler to let the IOMMU core take the
> group lock and do group->domain->ops->bind_task(dev...) for each
> device. The question also holds for iommu_do_invalidate in patch 3/8.
> 
> This way the prototypes would be:
> int iommu_bind...(struct iommu_group *group, struct ... *info)
> int iommu_unbind...(struct iommu_group *group, struct ...*info)
> int iommu_invalidate...(struct iommu_group *group, struct ...*info)
> 
Just to understand this granularity implication of fault notification
(e.g. page request) of this change. PRI for all devices in the group
will be enabled. IOMMU driver receives page request per device with the
same PASID bond to the group. There can be two scenarios:
1. If iommu_bind() to a task, IOMMU driver handles page fault
internally per device, there is no need to do group level, true?
2. If the device iommu_bind_pasid_table() is called, then we propagate
PRQ to VFIO per device.


> For PASID table binding it might not matter much, as VFIO will most
> likely be the only user. But task binding will be called by device
> drivers, which by now should be encouraged to do things at
> iommu_group granularity. Alternatively it could be done implicitly
> like in iommu_attach_device, with "iommu_bind_device_x" calling
> "iommu_bind_group_x".
> 
> 
> Extending this reasoning, since groups in a domain are also supposed
> to have the same mappings, then similarly to map/unmap,
> bind/unbind/invalidate should really be done with an iommu_domain (and
> nothing else) as target argument. However this requires the IOMMU
> core to keep a group list in each domain, which might complicate
> things a little too much.
> 
> But "all devices in a domain share the same PASID table" is the
> paradigm I'm currently using in the guts of arm-smmu-v3. And I wonder
> if, as with iommu_group, it should be made more explicit to users, so
> they don't assume that devices within a domain are isolated from each
> others with regard to PASID DMA.
> 
> > +{
> > +	if (unlikely(!domain->ops->bind_pasid_table))
> > +		return -EINVAL;
> > +
> > +	return domain->ops->bind_pasid_table(domain, dev,
> > pasidt_binfo); +}
> > +EXPORT_SYMBOL_GPL(iommu_bind_pasid_table);
> > +
> > +int iommu_unbind_pasid_table(struct iommu_domain *domain, struct
> > device *dev) +{
> > +	if (unlikely(!domain->ops->unbind_pasid_table))
> > +		return -EINVAL;
> > +
> > +	return domain->ops->unbind_pasid_table(domain, dev);
> > +}
> > +EXPORT_SYMBOL_GPL(iommu_unbind_pasid_table);
> > +
> >  static void __iommu_detach_device(struct iommu_domain *domain,
> >  				  struct device *dev)
> >  {
> > diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> > index 0ff5111..491a011 100644
> > --- a/include/linux/iommu.h
> > +++ b/include/linux/iommu.h
> > @@ -131,6 +131,15 @@ struct iommu_dm_region {
> >  	int			prot;
> >  };
> >  
> > +struct pasid_table_info {
> > +	__u64	ptr;	/* PASID table ptr */
> > +	__u64	size;	/* PASID table size*/
> > +	__u32	model;	/* magic number */
> > +#define INTEL_IOMMU	(1 << 0)
> > +#define ARM_SMMU	(1 << 1)
> 
> Not sure if there is any advantage in this being a bitfield rather
> than simple values (1, 2, 3, etc).
> The names should also have a prefix, such as "PASID_TABLE_MODEL_"
> 
> Thanks a lot for doing this,
> Jean
Jean-Philippe Brucker April 26, 2017, 6:59 p.m. UTC | #3
On 26/04/17 19:29, jacob pan wrote:
> On Wed, 26 Apr 2017 17:56:45 +0100
> Jean-Philippe Brucker <jean-philippe.brucker@arm.com> wrote:
> 
>> Hi Yi, Jacob,
>>
>> On 26/04/17 11:11, Liu, Yi L wrote:
>>> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
>>>
>>> Virtual IOMMU was proposed to support Shared Virtual Memory (SVM)
>>> use case in the guest:
>>> https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg05311.html
>>>
>>> As part of the proposed architecture, when a SVM capable PCI
>>> device is assigned to a guest, nested mode is turned on. Guest owns
>>> the first level page tables (request with PASID) and performs
>>> GVA->GPA translation. Second level page tables are owned by the
>>> host for GPA->HPA translation for both request with and without
>>> PASID.
>>>
>>> A new IOMMU driver interface is therefore needed to perform tasks as
>>> follows:
>>> * Enable nested translation and appropriate translation type
>>> * Assign guest PASID table pointer (in GPA) and size to host IOMMU
>>>
>>> This patch introduces new functions called
>>> iommu_(un)bind_pasid_table() to IOMMU APIs. Architecture specific
>>> IOMMU function can be added later to perform the specific steps for
>>> binding pasid table of assigned devices.
>>>
>>> This patch also adds model definition in iommu.h. It would be used
>>> to check if the bind request is from a compatible entity. e.g. a
>>> bind request from an intel_iommu emulator may not be supported by
>>> an ARM SMMU driver.
>>>
>>> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
>>> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
>>> ---
>>>  drivers/iommu/iommu.c | 19 +++++++++++++++++++
>>>  include/linux/iommu.h | 31 +++++++++++++++++++++++++++++++
>>>  2 files changed, 50 insertions(+)
>>>
>>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>>> index dbe7f65..f2da636 100644
>>> --- a/drivers/iommu/iommu.c
>>> +++ b/drivers/iommu/iommu.c
>>> @@ -1134,6 +1134,25 @@ int iommu_attach_device(struct iommu_domain
>>> *domain, struct device *dev) }
>>>  EXPORT_SYMBOL_GPL(iommu_attach_device);
>>>  
>>> +int iommu_bind_pasid_table(struct iommu_domain *domain, struct
>>> device *dev,
>>> +			struct pasid_table_info *pasidt_binfo)
>>
>> I guess that domain can always be deduced from dev using
>> iommu_get_domain_for_dev, and doesn't need to be passed as argument?
>>
> true. device should have attached a domain before calling this API.
>> For the next version of my SVM series, I was thinking of passing group
>> instead of device to iommu_bind. Since all devices in a group are
>> expected to share the same mappings (whether they want it or not),
>> users will have to do iommu_group_for_each_dev anyway (as you do in
>> patch 6/8). So it might be simpler to let the IOMMU core take the
>> group lock and do group->domain->ops->bind_task(dev...) for each
>> device. The question also holds for iommu_do_invalidate in patch 3/8.
>>
>> This way the prototypes would be:
>> int iommu_bind...(struct iommu_group *group, struct ... *info)
>> int iommu_unbind...(struct iommu_group *group, struct ...*info)
>> int iommu_invalidate...(struct iommu_group *group, struct ...*info)
>>
> Just to understand this granularity implication of fault notification
> (e.g. page request) of this change. PRI for all devices in the group
> will be enabled. IOMMU driver receives page request per device with the
> same PASID bond to the group. There can be two scenarios:
> 1. If iommu_bind() to a task, IOMMU driver handles page fault
> internally per device, there is no need to do group level, true?

Yes, we find the task corresponding to the PASID, and call handle_mm_fault
on it.

> 2. If the device iommu_bind_pasid_table() is called, then we propagate
> PRQ to VFIO per device.

I guess yes. Although it could be reported on the container, but the guest
IOMMU driver probably wants to know which device triggered the fault anyway.

The implication of having a group granularity instead of device is that
after the fault is handled, all other devices in the group are also able
to access the region that was just mapped.

If I understand correctly, unlike putting multiple groups in a domain,
putting multiple devices in a group is generally not a software choice.
Usually with PCIe there is a single device per group, but in some cases
(lack of ACS isolation, legacy PCI, bugs), functions cannot be
distinguished by the IOMMU, or can snoop each other's DMA. If this is the
case they need to be put in the same group by the bus driver.

Thanks,
Jean

>> For PASID table binding it might not matter much, as VFIO will most
>> likely be the only user. But task binding will be called by device
>> drivers, which by now should be encouraged to do things at
>> iommu_group granularity. Alternatively it could be done implicitly
>> like in iommu_attach_device, with "iommu_bind_device_x" calling
>> "iommu_bind_group_x".
>>
>>
>> Extending this reasoning, since groups in a domain are also supposed
>> to have the same mappings, then similarly to map/unmap,
>> bind/unbind/invalidate should really be done with an iommu_domain (and
>> nothing else) as target argument. However this requires the IOMMU
>> core to keep a group list in each domain, which might complicate
>> things a little too much.
>>
>> But "all devices in a domain share the same PASID table" is the
>> paradigm I'm currently using in the guts of arm-smmu-v3. And I wonder
>> if, as with iommu_group, it should be made more explicit to users, so
>> they don't assume that devices within a domain are isolated from each
>> others with regard to PASID DMA.
>>
>>> +{
>>> +	if (unlikely(!domain->ops->bind_pasid_table))
>>> +		return -EINVAL;
>>> +
>>> +	return domain->ops->bind_pasid_table(domain, dev,
>>> pasidt_binfo); +}
>>> +EXPORT_SYMBOL_GPL(iommu_bind_pasid_table);
>>> +
>>> +int iommu_unbind_pasid_table(struct iommu_domain *domain, struct
>>> device *dev) +{
>>> +	if (unlikely(!domain->ops->unbind_pasid_table))
>>> +		return -EINVAL;
>>> +
>>> +	return domain->ops->unbind_pasid_table(domain, dev);
>>> +}
>>> +EXPORT_SYMBOL_GPL(iommu_unbind_pasid_table);
>>> +
>>>  static void __iommu_detach_device(struct iommu_domain *domain,
>>>  				  struct device *dev)
>>>  {
>>> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
>>> index 0ff5111..491a011 100644
>>> --- a/include/linux/iommu.h
>>> +++ b/include/linux/iommu.h
>>> @@ -131,6 +131,15 @@ struct iommu_dm_region {
>>>  	int			prot;
>>>  };
>>>  
>>> +struct pasid_table_info {
>>> +	__u64	ptr;	/* PASID table ptr */
>>> +	__u64	size;	/* PASID table size*/
>>> +	__u32	model;	/* magic number */
>>> +#define INTEL_IOMMU	(1 << 0)
>>> +#define ARM_SMMU	(1 << 1)
>>
>> Not sure if there is any advantage in this being a bitfield rather
>> than simple values (1, 2, 3, etc).
>> The names should also have a prefix, such as "PASID_TABLE_MODEL_"
>>
>> Thanks a lot for doing this,
>> Jean
>
Liu, Yi L April 27, 2017, 6:36 a.m. UTC | #4
On Wed, Apr 26, 2017 at 05:56:45PM +0100, Jean-Philippe Brucker wrote:
> Hi Yi, Jacob,
> 
> On 26/04/17 11:11, Liu, Yi L wrote:
> > From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > 
> > Virtual IOMMU was proposed to support Shared Virtual Memory (SVM) use
> > case in the guest:
> > https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg05311.html
> > 
> > As part of the proposed architecture, when a SVM capable PCI
> > device is assigned to a guest, nested mode is turned on. Guest owns the
> > first level page tables (request with PASID) and performs GVA->GPA
> > translation. Second level page tables are owned by the host for GPA->HPA
> > translation for both request with and without PASID.
> > 
> > A new IOMMU driver interface is therefore needed to perform tasks as
> > follows:
> > * Enable nested translation and appropriate translation type
> > * Assign guest PASID table pointer (in GPA) and size to host IOMMU
> > 
> > This patch introduces new functions called iommu_(un)bind_pasid_table()
> > to IOMMU APIs. Architecture specific IOMMU function can be added later
> > to perform the specific steps for binding pasid table of assigned devices.
> > 
> > This patch also adds model definition in iommu.h. It would be used to
> > check if the bind request is from a compatible entity. e.g. a bind
> > request from an intel_iommu emulator may not be supported by an ARM SMMU
> > driver.
> > 
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> > ---
> >  drivers/iommu/iommu.c | 19 +++++++++++++++++++
> >  include/linux/iommu.h | 31 +++++++++++++++++++++++++++++++
> >  2 files changed, 50 insertions(+)
> > 
> > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > index dbe7f65..f2da636 100644
> > --- a/drivers/iommu/iommu.c
> > +++ b/drivers/iommu/iommu.c
> > @@ -1134,6 +1134,25 @@ int iommu_attach_device(struct iommu_domain *domain, struct device *dev)
> >  }
> >  EXPORT_SYMBOL_GPL(iommu_attach_device);
> >  
> > +int iommu_bind_pasid_table(struct iommu_domain *domain, struct device *dev,
> > +			struct pasid_table_info *pasidt_binfo)
> 
> I guess that domain can always be deduced from dev using
> iommu_get_domain_for_dev, and doesn't need to be passed as argument?
> 
> For the next version of my SVM series, I was thinking of passing group
> instead of device to iommu_bind. Since all devices in a group are expected
> to share the same mappings (whether they want it or not), users will have
> to do iommu_group_for_each_dev anyway (as you do in patch 6/8). So it
> might be simpler to let the IOMMU core take the group lock and do
> group->domain->ops->bind_task(dev...) for each device. The question also
> holds for iommu_do_invalidate in patch 3/8.
> 
> This way the prototypes would be:
> int iommu_bind...(struct iommu_group *group, struct ... *info)
> int iommu_unbind...(struct iommu_group *group, struct ...*info)
> int iommu_invalidate...(struct iommu_group *group, struct ...*info)
> 
> For PASID table binding it might not matter much, as VFIO will most likely
> be the only user. But task binding will be called by device drivers, which
> by now should be encouraged to do things at iommu_group granularity.
> Alternatively it could be done implicitly like in iommu_attach_device,
> with "iommu_bind_device_x" calling "iommu_bind_group_x".
> 
> 
> Extending this reasoning, since groups in a domain are also supposed to
> have the same mappings, then similarly to map/unmap,
> bind/unbind/invalidate should really be done with an iommu_domain (and
> nothing else) as target argument. However this requires the IOMMU core to
> keep a group list in each domain, which might complicate things a little
> too much.
> 
> But "all devices in a domain share the same PASID table" is the paradigm
> I'm currently using in the guts of arm-smmu-v3. And I wonder if, as with
> iommu_group, it should be made more explicit to users, so they don't
> assume that devices within a domain are isolated from each others with
> regard to PASID DMA.
> 
> > +{
> > +	if (unlikely(!domain->ops->bind_pasid_table))
> > +		return -EINVAL;
> > +
> > +	return domain->ops->bind_pasid_table(domain, dev, pasidt_binfo);
> > +}
> > +EXPORT_SYMBOL_GPL(iommu_bind_pasid_table);
> > +
> > +int iommu_unbind_pasid_table(struct iommu_domain *domain, struct device *dev)
> > +{
> > +	if (unlikely(!domain->ops->unbind_pasid_table))
> > +		return -EINVAL;
> > +
> > +	return domain->ops->unbind_pasid_table(domain, dev);
> > +}
> > +EXPORT_SYMBOL_GPL(iommu_unbind_pasid_table);
> > +
> >  static void __iommu_detach_device(struct iommu_domain *domain,
> >  				  struct device *dev)
> >  {
> > diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> > index 0ff5111..491a011 100644
> > --- a/include/linux/iommu.h
> > +++ b/include/linux/iommu.h
> > @@ -131,6 +131,15 @@ struct iommu_dm_region {
> >  	int			prot;
> >  };
> >  
> > +struct pasid_table_info {
> > +	__u64	ptr;	/* PASID table ptr */
> > +	__u64	size;	/* PASID table size*/
> > +	__u32	model;	/* magic number */
> > +#define INTEL_IOMMU	(1 << 0)
> > +#define ARM_SMMU	(1 << 1)
> 
> Not sure if there is any advantage in this being a bitfield rather than
> simple values (1, 2, 3, etc).
> The names should also have a prefix, such as "PASID_TABLE_MODEL_"

Hi Jean,

For the value, no special reason from my side. so it is fine to just
use (1,2..).
For the prefix, model definition may also needed for iommu tlb
invalidate propagation. So it may be suitable to have a prefix like
"SVM_MODEL_" or so? Does it make sense?

Thanks,
Yi L

> Thanks a lot for doing this,
> Jean
Jean-Philippe Brucker April 27, 2017, 10:12 a.m. UTC | #5
On 27/04/17 07:36, Liu, Yi L wrote:
> On Wed, Apr 26, 2017 at 05:56:45PM +0100, Jean-Philippe Brucker wrote:
>> Hi Yi, Jacob,
>>
>> On 26/04/17 11:11, Liu, Yi L wrote:
>>> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
>>>
>>> Virtual IOMMU was proposed to support Shared Virtual Memory (SVM) use
>>> case in the guest:
>>> https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg05311.html
>>>
>>> As part of the proposed architecture, when a SVM capable PCI
>>> device is assigned to a guest, nested mode is turned on. Guest owns the
>>> first level page tables (request with PASID) and performs GVA->GPA
>>> translation. Second level page tables are owned by the host for GPA->HPA
>>> translation for both request with and without PASID.
>>>
>>> A new IOMMU driver interface is therefore needed to perform tasks as
>>> follows:
>>> * Enable nested translation and appropriate translation type
>>> * Assign guest PASID table pointer (in GPA) and size to host IOMMU
>>>
>>> This patch introduces new functions called iommu_(un)bind_pasid_table()
>>> to IOMMU APIs. Architecture specific IOMMU function can be added later
>>> to perform the specific steps for binding pasid table of assigned devices.
>>>
>>> This patch also adds model definition in iommu.h. It would be used to
>>> check if the bind request is from a compatible entity. e.g. a bind
>>> request from an intel_iommu emulator may not be supported by an ARM SMMU
>>> driver.
>>>
>>> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
>>> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
>>> ---
>>>  drivers/iommu/iommu.c | 19 +++++++++++++++++++
>>>  include/linux/iommu.h | 31 +++++++++++++++++++++++++++++++
>>>  2 files changed, 50 insertions(+)
>>>
>>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>>> index dbe7f65..f2da636 100644
>>> --- a/drivers/iommu/iommu.c
>>> +++ b/drivers/iommu/iommu.c
>>> @@ -1134,6 +1134,25 @@ int iommu_attach_device(struct iommu_domain *domain, struct device *dev)
>>>  }
>>>  EXPORT_SYMBOL_GPL(iommu_attach_device);
>>>  
>>> +int iommu_bind_pasid_table(struct iommu_domain *domain, struct device *dev,
>>> +			struct pasid_table_info *pasidt_binfo)
>>
>> I guess that domain can always be deduced from dev using
>> iommu_get_domain_for_dev, and doesn't need to be passed as argument?
>>
>> For the next version of my SVM series, I was thinking of passing group
>> instead of device to iommu_bind. Since all devices in a group are expected
>> to share the same mappings (whether they want it or not), users will have
>> to do iommu_group_for_each_dev anyway (as you do in patch 6/8). So it
>> might be simpler to let the IOMMU core take the group lock and do
>> group->domain->ops->bind_task(dev...) for each device. The question also
>> holds for iommu_do_invalidate in patch 3/8.
>>
>> This way the prototypes would be:
>> int iommu_bind...(struct iommu_group *group, struct ... *info)
>> int iommu_unbind...(struct iommu_group *group, struct ...*info)
>> int iommu_invalidate...(struct iommu_group *group, struct ...*info)
>>
>> For PASID table binding it might not matter much, as VFIO will most likely
>> be the only user. But task binding will be called by device drivers, which
>> by now should be encouraged to do things at iommu_group granularity.
>> Alternatively it could be done implicitly like in iommu_attach_device,
>> with "iommu_bind_device_x" calling "iommu_bind_group_x".
>>
>>
>> Extending this reasoning, since groups in a domain are also supposed to
>> have the same mappings, then similarly to map/unmap,
>> bind/unbind/invalidate should really be done with an iommu_domain (and
>> nothing else) as target argument. However this requires the IOMMU core to
>> keep a group list in each domain, which might complicate things a little
>> too much.
>>
>> But "all devices in a domain share the same PASID table" is the paradigm
>> I'm currently using in the guts of arm-smmu-v3. And I wonder if, as with
>> iommu_group, it should be made more explicit to users, so they don't
>> assume that devices within a domain are isolated from each others with
>> regard to PASID DMA.
>>
>>> +{
>>> +	if (unlikely(!domain->ops->bind_pasid_table))
>>> +		return -EINVAL;
>>> +
>>> +	return domain->ops->bind_pasid_table(domain, dev, pasidt_binfo);
>>> +}
>>> +EXPORT_SYMBOL_GPL(iommu_bind_pasid_table);
>>> +
>>> +int iommu_unbind_pasid_table(struct iommu_domain *domain, struct device *dev)
>>> +{
>>> +	if (unlikely(!domain->ops->unbind_pasid_table))
>>> +		return -EINVAL;
>>> +
>>> +	return domain->ops->unbind_pasid_table(domain, dev);
>>> +}
>>> +EXPORT_SYMBOL_GPL(iommu_unbind_pasid_table);
>>> +
>>>  static void __iommu_detach_device(struct iommu_domain *domain,
>>>  				  struct device *dev)
>>>  {
>>> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
>>> index 0ff5111..491a011 100644
>>> --- a/include/linux/iommu.h
>>> +++ b/include/linux/iommu.h
>>> @@ -131,6 +131,15 @@ struct iommu_dm_region {
>>>  	int			prot;
>>>  };
>>>  
>>> +struct pasid_table_info {
>>> +	__u64	ptr;	/* PASID table ptr */
>>> +	__u64	size;	/* PASID table size*/
>>> +	__u32	model;	/* magic number */
>>> +#define INTEL_IOMMU	(1 << 0)
>>> +#define ARM_SMMU	(1 << 1)
>>
>> Not sure if there is any advantage in this being a bitfield rather than
>> simple values (1, 2, 3, etc).
>> The names should also have a prefix, such as "PASID_TABLE_MODEL_"
> 
> Hi Jean,
> 
> For the value, no special reason from my side. so it is fine to just
> use (1,2..).
> For the prefix, model definition may also needed for iommu tlb
> invalidate propagation. So it may be suitable to have a prefix like
> "SVM_MODEL_" or so? Does it make sense?

Sure, SVM_MODEL would do. However, once the PASID table is bound and the
model has been negotiated, do we need to repeat the model in each
invalidation? At that point user and driver already agreed on the model to
use (and the various structure formats), so they could just transmit
opaque structures and each end of the pipe would know how to interpret it.

Thanks,
Jean
Liu, Yi L April 28, 2017, 7:59 a.m. UTC | #6
On Thu, Apr 27, 2017 at 11:12:45AM +0100, Jean-Philippe Brucker wrote:
> On 27/04/17 07:36, Liu, Yi L wrote:
> > On Wed, Apr 26, 2017 at 05:56:45PM +0100, Jean-Philippe Brucker wrote:
> >> Hi Yi, Jacob,
> >>
> >> On 26/04/17 11:11, Liu, Yi L wrote:
> >>> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> >>>
> >>> Virtual IOMMU was proposed to support Shared Virtual Memory (SVM) use
> >>> case in the guest:
> >>> https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg05311.html
> >>>
> >>> As part of the proposed architecture, when a SVM capable PCI
> >>> device is assigned to a guest, nested mode is turned on. Guest owns the
> >>> first level page tables (request with PASID) and performs GVA->GPA
> >>> translation. Second level page tables are owned by the host for GPA->HPA
> >>> translation for both request with and without PASID.
> >>>
> >>> A new IOMMU driver interface is therefore needed to perform tasks as
> >>> follows:
> >>> * Enable nested translation and appropriate translation type
> >>> * Assign guest PASID table pointer (in GPA) and size to host IOMMU
> >>>
> >>> This patch introduces new functions called iommu_(un)bind_pasid_table()
> >>> to IOMMU APIs. Architecture specific IOMMU function can be added later
> >>> to perform the specific steps for binding pasid table of assigned devices.
> >>>
> >>> This patch also adds model definition in iommu.h. It would be used to
> >>> check if the bind request is from a compatible entity. e.g. a bind
> >>> request from an intel_iommu emulator may not be supported by an ARM SMMU
> >>> driver.
> >>>
> >>> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> >>> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> >>> ---
> >>>  drivers/iommu/iommu.c | 19 +++++++++++++++++++
> >>>  include/linux/iommu.h | 31 +++++++++++++++++++++++++++++++
> >>>  2 files changed, 50 insertions(+)
> >>>
> >>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> >>> index dbe7f65..f2da636 100644
> >>> --- a/drivers/iommu/iommu.c
> >>> +++ b/drivers/iommu/iommu.c
> >>> @@ -1134,6 +1134,25 @@ int iommu_attach_device(struct iommu_domain *domain, struct device *dev)
> >>>  }
> >>>  EXPORT_SYMBOL_GPL(iommu_attach_device);
> >>>  
> >>> +int iommu_bind_pasid_table(struct iommu_domain *domain, struct device *dev,
> >>> +			struct pasid_table_info *pasidt_binfo)
> >>
> >> I guess that domain can always be deduced from dev using
> >> iommu_get_domain_for_dev, and doesn't need to be passed as argument?
> >>
> >> For the next version of my SVM series, I was thinking of passing group
> >> instead of device to iommu_bind. Since all devices in a group are expected
> >> to share the same mappings (whether they want it or not), users will have
> >> to do iommu_group_for_each_dev anyway (as you do in patch 6/8). So it
> >> might be simpler to let the IOMMU core take the group lock and do
> >> group->domain->ops->bind_task(dev...) for each device. The question also
> >> holds for iommu_do_invalidate in patch 3/8.
> >>
> >> This way the prototypes would be:
> >> int iommu_bind...(struct iommu_group *group, struct ... *info)
> >> int iommu_unbind...(struct iommu_group *group, struct ...*info)
> >> int iommu_invalidate...(struct iommu_group *group, struct ...*info)
> >>
> >> For PASID table binding it might not matter much, as VFIO will most likely
> >> be the only user. But task binding will be called by device drivers, which
> >> by now should be encouraged to do things at iommu_group granularity.
> >> Alternatively it could be done implicitly like in iommu_attach_device,
> >> with "iommu_bind_device_x" calling "iommu_bind_group_x".
> >>
> >>
> >> Extending this reasoning, since groups in a domain are also supposed to
> >> have the same mappings, then similarly to map/unmap,
> >> bind/unbind/invalidate should really be done with an iommu_domain (and
> >> nothing else) as target argument. However this requires the IOMMU core to
> >> keep a group list in each domain, which might complicate things a little
> >> too much.
> >>
> >> But "all devices in a domain share the same PASID table" is the paradigm
> >> I'm currently using in the guts of arm-smmu-v3. And I wonder if, as with
> >> iommu_group, it should be made more explicit to users, so they don't
> >> assume that devices within a domain are isolated from each others with
> >> regard to PASID DMA.
> >>
> >>> +{
> >>> +	if (unlikely(!domain->ops->bind_pasid_table))
> >>> +		return -EINVAL;
> >>> +
> >>> +	return domain->ops->bind_pasid_table(domain, dev, pasidt_binfo);
> >>> +}
> >>> +EXPORT_SYMBOL_GPL(iommu_bind_pasid_table);
> >>> +
> >>> +int iommu_unbind_pasid_table(struct iommu_domain *domain, struct device *dev)
> >>> +{
> >>> +	if (unlikely(!domain->ops->unbind_pasid_table))
> >>> +		return -EINVAL;
> >>> +
> >>> +	return domain->ops->unbind_pasid_table(domain, dev);
> >>> +}
> >>> +EXPORT_SYMBOL_GPL(iommu_unbind_pasid_table);
> >>> +
> >>>  static void __iommu_detach_device(struct iommu_domain *domain,
> >>>  				  struct device *dev)
> >>>  {
> >>> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> >>> index 0ff5111..491a011 100644
> >>> --- a/include/linux/iommu.h
> >>> +++ b/include/linux/iommu.h
> >>> @@ -131,6 +131,15 @@ struct iommu_dm_region {
> >>>  	int			prot;
> >>>  };
> >>>  
> >>> +struct pasid_table_info {
> >>> +	__u64	ptr;	/* PASID table ptr */
> >>> +	__u64	size;	/* PASID table size*/
> >>> +	__u32	model;	/* magic number */
> >>> +#define INTEL_IOMMU	(1 << 0)
> >>> +#define ARM_SMMU	(1 << 1)
> >>
> >> Not sure if there is any advantage in this being a bitfield rather than
> >> simple values (1, 2, 3, etc).
> >> The names should also have a prefix, such as "PASID_TABLE_MODEL_"
> > 
> > Hi Jean,
> > 
> > For the value, no special reason from my side. so it is fine to just
> > use (1,2..).
> > For the prefix, model definition may also needed for iommu tlb
> > invalidate propagation. So it may be suitable to have a prefix like
> > "SVM_MODEL_" or so? Does it make sense?
> 
> Sure, SVM_MODEL would do. However, once the PASID table is bound and the
> model has been negotiated, do we need to repeat the model in each
> invalidation? At that point user and driver already agreed on the model to
> use (and the various structure formats), so they could just transmit
> opaque structures and each end of the pipe would know how to interpret it.

Jean,

yes, it works if this invalidate API is only for SVM. But I remember there may
be requirement from gIOVA. If guest tries to invalidate large range of gIOVA
mapping, using unmap is low efficiency. Tianyu Lan may provide detail about
this case. That's why I added the model check all the same in invalidate path.
Let's double check if the requirement is really needed from gIOVA. If no, I
agree with you on the suggested method.

Thanks,
Yi L
Liu, Yi L April 28, 2017, 9:04 a.m. UTC | #7
On Wed, Apr 26, 2017 at 05:56:45PM +0100, Jean-Philippe Brucker wrote:
> Hi Yi, Jacob,
> 
> On 26/04/17 11:11, Liu, Yi L wrote:
> > From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > 
> > Virtual IOMMU was proposed to support Shared Virtual Memory (SVM) use
> > case in the guest:
> > https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg05311.html
> > 
> > As part of the proposed architecture, when a SVM capable PCI
> > device is assigned to a guest, nested mode is turned on. Guest owns the
> > first level page tables (request with PASID) and performs GVA->GPA
> > translation. Second level page tables are owned by the host for GPA->HPA
> > translation for both request with and without PASID.
> > 
> > A new IOMMU driver interface is therefore needed to perform tasks as
> > follows:
> > * Enable nested translation and appropriate translation type
> > * Assign guest PASID table pointer (in GPA) and size to host IOMMU
> > 
> > This patch introduces new functions called iommu_(un)bind_pasid_table()
> > to IOMMU APIs. Architecture specific IOMMU function can be added later
> > to perform the specific steps for binding pasid table of assigned devices.
> > 
> > This patch also adds model definition in iommu.h. It would be used to
> > check if the bind request is from a compatible entity. e.g. a bind
> > request from an intel_iommu emulator may not be supported by an ARM SMMU
> > driver.
> > 
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> > ---
> >  drivers/iommu/iommu.c | 19 +++++++++++++++++++
> >  include/linux/iommu.h | 31 +++++++++++++++++++++++++++++++
> >  2 files changed, 50 insertions(+)
> > 
> > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > index dbe7f65..f2da636 100644
> > --- a/drivers/iommu/iommu.c
> > +++ b/drivers/iommu/iommu.c
> > @@ -1134,6 +1134,25 @@ int iommu_attach_device(struct iommu_domain *domain, struct device *dev)
> >  }
> >  EXPORT_SYMBOL_GPL(iommu_attach_device);
> >  
> > +int iommu_bind_pasid_table(struct iommu_domain *domain, struct device *dev,
> > +			struct pasid_table_info *pasidt_binfo)
> 
> I guess that domain can always be deduced from dev using
> iommu_get_domain_for_dev, and doesn't need to be passed as argument?
> 
> For the next version of my SVM series, I was thinking of passing group
> instead of device to iommu_bind. Since all devices in a group are expected
> to share the same mappings (whether they want it or not), users will have

Virtual address space is not tied to protection domain as I/O virtual address
space does. Is it really necessary to affect all the devices in this group.
Or it is just for consistence?

> to do iommu_group_for_each_dev anyway (as you do in patch 6/8). So it
> might be simpler to let the IOMMU core take the group lock and do
> group->domain->ops->bind_task(dev...) for each device. The question also
> holds for iommu_do_invalidate in patch 3/8.

In my understanding, it is moving the for_each_dev loop into iommu driver?
Is it?

> This way the prototypes would be:
> int iommu_bind...(struct iommu_group *group, struct ... *info)
> int iommu_unbind...(struct iommu_group *group, struct ...*info)
> int iommu_invalidate...(struct iommu_group *group, struct ...*info)

For PASID table binding from guest, I think it'd better to be per-device op
since the bind operation wants to modify the host context entry. But we may
still share the API and do things differently in iommu driver.

For invalidation, I think it'd better to be per-group. Actually, with guest
IOMMU exists, there is only one group in a domain on Intel platform. Do it for
each device is not expected. How about it on ARM?

> For PASID table binding it might not matter much, as VFIO will most likely
> be the only user. But task binding will be called by device drivers, which
> by now should be encouraged to do things at iommu_group granularity.
> Alternatively it could be done implicitly like in iommu_attach_device,
> with "iommu_bind_device_x" calling "iommu_bind_group_x".

Do you mean the bind task from userspace driver? I guess you're trying to do
different types of binding request in a single svm_bind API?

> 
> Extending this reasoning, since groups in a domain are also supposed to
> have the same mappings, then similarly to map/unmap,
> bind/unbind/invalidate should really be done with an iommu_domain (and
> nothing else) as target argument. However this requires the IOMMU core to
> keep a group list in each domain, which might complicate things a little
> too much.
> 
> But "all devices in a domain share the same PASID table" is the paradigm
> I'm currently using in the guts of arm-smmu-v3. And I wonder if, as with
> iommu_group, it should be made more explicit to users, so they don't
> assume that devices within a domain are isolated from each others with
> regard to PASID DMA.

Is the isolation you mentioned means forbidding to do PASID DMA to the same
virtual address space when the device comes from different domain?

Thanks,
Yi L
Jean-Philippe Brucker April 28, 2017, 12:51 p.m. UTC | #8
On 28/04/17 10:04, Liu, Yi L wrote:
> On Wed, Apr 26, 2017 at 05:56:45PM +0100, Jean-Philippe Brucker wrote:
>> Hi Yi, Jacob,
>>
>> On 26/04/17 11:11, Liu, Yi L wrote:
>>> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
>>>
>>> Virtual IOMMU was proposed to support Shared Virtual Memory (SVM) use
>>> case in the guest:
>>> https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg05311.html
>>>
>>> As part of the proposed architecture, when a SVM capable PCI
>>> device is assigned to a guest, nested mode is turned on. Guest owns the
>>> first level page tables (request with PASID) and performs GVA->GPA
>>> translation. Second level page tables are owned by the host for GPA->HPA
>>> translation for both request with and without PASID.
>>>
>>> A new IOMMU driver interface is therefore needed to perform tasks as
>>> follows:
>>> * Enable nested translation and appropriate translation type
>>> * Assign guest PASID table pointer (in GPA) and size to host IOMMU
>>>
>>> This patch introduces new functions called iommu_(un)bind_pasid_table()
>>> to IOMMU APIs. Architecture specific IOMMU function can be added later
>>> to perform the specific steps for binding pasid table of assigned devices.
>>>
>>> This patch also adds model definition in iommu.h. It would be used to
>>> check if the bind request is from a compatible entity. e.g. a bind
>>> request from an intel_iommu emulator may not be supported by an ARM SMMU
>>> driver.
>>>
>>> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
>>> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
>>> ---
>>>  drivers/iommu/iommu.c | 19 +++++++++++++++++++
>>>  include/linux/iommu.h | 31 +++++++++++++++++++++++++++++++
>>>  2 files changed, 50 insertions(+)
>>>
>>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>>> index dbe7f65..f2da636 100644
>>> --- a/drivers/iommu/iommu.c
>>> +++ b/drivers/iommu/iommu.c
>>> @@ -1134,6 +1134,25 @@ int iommu_attach_device(struct iommu_domain *domain, struct device *dev)
>>>  }
>>>  EXPORT_SYMBOL_GPL(iommu_attach_device);
>>>  
>>> +int iommu_bind_pasid_table(struct iommu_domain *domain, struct device *dev,
>>> +			struct pasid_table_info *pasidt_binfo)
>>
>> I guess that domain can always be deduced from dev using
>> iommu_get_domain_for_dev, and doesn't need to be passed as argument?
>>
>> For the next version of my SVM series, I was thinking of passing group
>> instead of device to iommu_bind. Since all devices in a group are expected
>> to share the same mappings (whether they want it or not), users will have
> 
> Virtual address space is not tied to protection domain as I/O virtual address
> space does. Is it really necessary to affect all the devices in this group.
> Or it is just for consistence?

It's mostly about consistency, and also avoid hiding implicit behavior in
the IOMMU driver. I have the following example, described using group and
domain structures from the IOMMU API:
                 ____________________
                |IOMMU  ____________ |
                |      |DOM  ______ ||
                |      |    |GRP   |||     bind
                |      |    |    A<-----------------Task 1
                |      |    |    B |||
                |      |    |______|||
                |      |     ______ ||
                |      |    |GRP   |||
                |      |    |    C |||
                |      |    |______|||
                |      |____________||
                |       ____________ |
                |      |DOM  ______ ||
                |      |    |GRP   |||
                |      |    |    D |||
                |      |    |______|||
                |      |____________||
                |____________________|

Let's take PCI functions A, B, C, and D, all with PASID capabilities. Due
to some hardware limitation (in the bus, the device or the IOMMU), B can
see all DMA transactions issued by A. A and B are therefore in the same
IOMMU group. C and D can be isolated by the IOMMU, so they each have their
own group.

(As far as I know, in the SVM world at the moment, devices are neatly
integrated and there is no need for putting multiple devices in the same
IOMMU group, but I don't think we should expect all future SVM systems to
be well-behaved.)

So when a user binds Task 1 to device A, it is *implicitly* giving device
B access to Task 1 as well. Simply because the IOMMU is unable to isolate
A from B, PASID or not. B could access the same address space as A, even
if you don't call bind again to explicitly attach the PASID table to B.

If the bind is done with device as argument, maybe users will believe that
using PASIDs provides an additional level of isolation within a group,
when it really doesn't. That's why I'm inclined to have the whole bind API
be on groups rather than devices, if only for clarity.

But I don't know, maybe a comment explaining the above would be sufficient.

To be frank my comment about group versus device is partly to make sure
that I grasp the various concepts correctly and that we're on the same
page. Doing the bind on groups is less significant in your case, for PASID
table binding, because VFIO already takes care of IOMMU group properly. In
my case I expect DRM, network, DMA drivers to use the API as well for
binding tasks, and I don't want to introduce ambiguity in the API that
would lead to security holes later.

>> to do iommu_group_for_each_dev anyway (as you do in patch 6/8). So it
>> might be simpler to let the IOMMU core take the group lock and do
>> group->domain->ops->bind_task(dev...) for each device. The question also
>> holds for iommu_do_invalidate in patch 3/8.
> 
> In my understanding, it is moving the for_each_dev loop into iommu driver?
> Is it?

Yes, that's what I meant

>> This way the prototypes would be:
>> int iommu_bind...(struct iommu_group *group, struct ... *info)
>> int iommu_unbind...(struct iommu_group *group, struct ...*info)
>> int iommu_invalidate...(struct iommu_group *group, struct ...*info)
> 
> For PASID table binding from guest, I think it'd better to be per-device op
> since the bind operation wants to modify the host context entry. But we may
> still share the API and do things differently in iommu driver.

Sure, as said above the use cases for PASID table and single PASID binding
are different, sharing the API is not strictly necessary.

> For invalidation, I think it'd better to be per-group. Actually, with guest
> IOMMU exists, there is only one group in a domain on Intel platform. Do it for
> each device is not expected. How about it on ARM?

In ARM systems with the DMA API (IOMMU_DOMAIN_DMA), there is one group per
domain. But with VFIO (IOMMU_DOMAIN_UNMANAGED), VFIO will try to attach
multiple groups in the same container to the same domain when possible.

>> For PASID table binding it might not matter much, as VFIO will most likely
>> be the only user. But task binding will be called by device drivers, which
>> by now should be encouraged to do things at iommu_group granularity.
>> Alternatively it could be done implicitly like in iommu_attach_device,
>> with "iommu_bind_device_x" calling "iommu_bind_group_x".
> 
> Do you mean the bind task from userspace driver? I guess you're trying to do
> different types of binding request in a single svm_bind API?
> 
>>
>> Extending this reasoning, since groups in a domain are also supposed to
>> have the same mappings, then similarly to map/unmap,
>> bind/unbind/invalidate should really be done with an iommu_domain (and
>> nothing else) as target argument. However this requires the IOMMU core to
>> keep a group list in each domain, which might complicate things a little
>> too much.
>>
>> But "all devices in a domain share the same PASID table" is the paradigm
>> I'm currently using in the guts of arm-smmu-v3. And I wonder if, as with
>> iommu_group, it should be made more explicit to users, so they don't
>> assume that devices within a domain are isolated from each others with
>> regard to PASID DMA.
> 
> Is the isolation you mentioned means forbidding to do PASID DMA to the same
> virtual address space when the device comes from different domain?

In the above example, devices A, B and C are in the same IOMMU domain
(because, for instance, user put the two groups in the same VFIO
container.) Then in the SMMUv3 driver they would all share the same PASID
table. A, B and C can access Task 1 with the PASID obtained during the
depicted bind. They don't need to call bind again for device C, though it
would be good practice.

But D is in a different domain, so unless you also call bind on Task 1 for
device D, there is no way that D can access Task 1.

Thanks,
Jean
Alex Williamson May 12, 2017, 9:59 p.m. UTC | #9
On Wed, 26 Apr 2017 18:11:58 +0800
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> 
> Virtual IOMMU was proposed to support Shared Virtual Memory (SVM) use
> case in the guest:
> https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg05311.html
> 
> As part of the proposed architecture, when a SVM capable PCI
> device is assigned to a guest, nested mode is turned on. Guest owns the
> first level page tables (request with PASID) and performs GVA->GPA
> translation. Second level page tables are owned by the host for GPA->HPA
> translation for both request with and without PASID.
> 
> A new IOMMU driver interface is therefore needed to perform tasks as
> follows:
> * Enable nested translation and appropriate translation type
> * Assign guest PASID table pointer (in GPA) and size to host IOMMU
> 
> This patch introduces new functions called iommu_(un)bind_pasid_table()
> to IOMMU APIs. Architecture specific IOMMU function can be added later
> to perform the specific steps for binding pasid table of assigned devices.
> 
> This patch also adds model definition in iommu.h. It would be used to
> check if the bind request is from a compatible entity. e.g. a bind
> request from an intel_iommu emulator may not be supported by an ARM SMMU
> driver.
> 
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> ---
>  drivers/iommu/iommu.c | 19 +++++++++++++++++++
>  include/linux/iommu.h | 31 +++++++++++++++++++++++++++++++
>  2 files changed, 50 insertions(+)
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index dbe7f65..f2da636 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -1134,6 +1134,25 @@ int iommu_attach_device(struct iommu_domain *domain, struct device *dev)
>  }
>  EXPORT_SYMBOL_GPL(iommu_attach_device);
>  
> +int iommu_bind_pasid_table(struct iommu_domain *domain, struct device *dev,
> +			struct pasid_table_info *pasidt_binfo)
> +{
> +	if (unlikely(!domain->ops->bind_pasid_table))
> +		return -EINVAL;
> +
> +	return domain->ops->bind_pasid_table(domain, dev, pasidt_binfo);
> +}
> +EXPORT_SYMBOL_GPL(iommu_bind_pasid_table);
> +
> +int iommu_unbind_pasid_table(struct iommu_domain *domain, struct device *dev)
> +{
> +	if (unlikely(!domain->ops->unbind_pasid_table))
> +		return -EINVAL;
> +
> +	return domain->ops->unbind_pasid_table(domain, dev);
> +}
> +EXPORT_SYMBOL_GPL(iommu_unbind_pasid_table);
> +
>  static void __iommu_detach_device(struct iommu_domain *domain,
>  				  struct device *dev)
>  {
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 0ff5111..491a011 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -131,6 +131,15 @@ struct iommu_dm_region {
>  	int			prot;
>  };
>  
> +struct pasid_table_info {
> +	__u64	ptr;	/* PASID table ptr */
> +	__u64	size;	/* PASID table size*/
> +	__u32	model;	/* magic number */
> +#define INTEL_IOMMU	(1 << 0)
> +#define ARM_SMMU	(1 << 1)
> +	__u8	opaque[];/* IOMMU-specific details */
> +};

This needs to be in uapi since you're expecting a user to pass it 

> +
>  #ifdef CONFIG_IOMMU_API
>  
>  /**
> @@ -159,6 +168,8 @@ struct iommu_dm_region {
>   * @domain_get_windows: Return the number of windows for a domain
>   * @of_xlate: add OF master IDs to iommu grouping
>   * @pgsize_bitmap: bitmap of all possible supported page sizes
> + * @bind_pasid_table: bind pasid table pointer for guest SVM
> + * @unbind_pasid_table: unbind pasid table pointer and restore defaults
>   */
>  struct iommu_ops {
>  	bool (*capable)(enum iommu_cap);
> @@ -200,6 +211,10 @@ struct iommu_ops {
>  	u32 (*domain_get_windows)(struct iommu_domain *domain);
>  
>  	int (*of_xlate)(struct device *dev, struct of_phandle_args *args);
> +	int (*bind_pasid_table)(struct iommu_domain *domain, struct device *dev,
> +				struct pasid_table_info *pasidt_binfo);
> +	int (*unbind_pasid_table)(struct iommu_domain *domain,
> +				struct device *dev);
>  
>  	unsigned long pgsize_bitmap;
>  };
> @@ -221,6 +236,10 @@ extern int iommu_attach_device(struct iommu_domain *domain,
>  			       struct device *dev);
>  extern void iommu_detach_device(struct iommu_domain *domain,
>  				struct device *dev);
> +extern int iommu_bind_pasid_table(struct iommu_domain *domain,
> +		struct device *dev, struct pasid_table_info *pasidt_binfo);
> +extern int iommu_unbind_pasid_table(struct iommu_domain *domain,
> +				struct device *dev);
>  extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
>  extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
>  		     phys_addr_t paddr, size_t size, int prot);
> @@ -595,6 +614,18 @@ const struct iommu_ops *iommu_get_instance(struct fwnode_handle *fwnode)
>  	return NULL;
>  }
>  
> +static inline
> +int iommu_bind_pasid_table(struct iommu_domain *domain, struct device *dev,
> +			struct pasid_table_info *pasidt_binfo)
> +{
> +	return -EINVAL;
> +}
> +static inline
> +int iommu_unbind_pasid_table(struct iommu_domain *domain, struct device *dev)
> +{
> +	return -EINVAL;
> +}
> +
>  #endif /* CONFIG_IOMMU_API */
>  
>  #endif /* __LINUX_IOMMU_H */
Liu, Yi L May 14, 2017, 10:56 a.m. UTC | #10
On Fri, May 12, 2017 at 03:59:14PM -0600, Alex Williamson wrote:
> On Wed, 26 Apr 2017 18:11:58 +0800
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > 
> > Virtual IOMMU was proposed to support Shared Virtual Memory (SVM) use
> > case in the guest:
> > https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg05311.html
> > 
> > As part of the proposed architecture, when a SVM capable PCI
> > device is assigned to a guest, nested mode is turned on. Guest owns the
> > first level page tables (request with PASID) and performs GVA->GPA
> > translation. Second level page tables are owned by the host for GPA->HPA
> > translation for both request with and without PASID.
> > 
> > A new IOMMU driver interface is therefore needed to perform tasks as
> > follows:
> > * Enable nested translation and appropriate translation type
> > * Assign guest PASID table pointer (in GPA) and size to host IOMMU
> > 
> > This patch introduces new functions called iommu_(un)bind_pasid_table()
> > to IOMMU APIs. Architecture specific IOMMU function can be added later
> > to perform the specific steps for binding pasid table of assigned devices.
> > 
> > This patch also adds model definition in iommu.h. It would be used to
> > check if the bind request is from a compatible entity. e.g. a bind
> > request from an intel_iommu emulator may not be supported by an ARM SMMU
> > driver.
> > 
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> > ---
> >  drivers/iommu/iommu.c | 19 +++++++++++++++++++
> >  include/linux/iommu.h | 31 +++++++++++++++++++++++++++++++
> >  2 files changed, 50 insertions(+)
> > 
> > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > index dbe7f65..f2da636 100644
> > --- a/drivers/iommu/iommu.c
> > +++ b/drivers/iommu/iommu.c
> > @@ -1134,6 +1134,25 @@ int iommu_attach_device(struct iommu_domain *domain, struct device *dev)
> >  }
> >  EXPORT_SYMBOL_GPL(iommu_attach_device);
> >  
> > +int iommu_bind_pasid_table(struct iommu_domain *domain, struct device *dev,
> > +			struct pasid_table_info *pasidt_binfo)
> > +{
> > +	if (unlikely(!domain->ops->bind_pasid_table))
> > +		return -EINVAL;
> > +
> > +	return domain->ops->bind_pasid_table(domain, dev, pasidt_binfo);
> > +}
> > +EXPORT_SYMBOL_GPL(iommu_bind_pasid_table);
> > +
> > +int iommu_unbind_pasid_table(struct iommu_domain *domain, struct device *dev)
> > +{
> > +	if (unlikely(!domain->ops->unbind_pasid_table))
> > +		return -EINVAL;
> > +
> > +	return domain->ops->unbind_pasid_table(domain, dev);
> > +}
> > +EXPORT_SYMBOL_GPL(iommu_unbind_pasid_table);
> > +
> >  static void __iommu_detach_device(struct iommu_domain *domain,
> >  				  struct device *dev)
> >  {
> > diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> > index 0ff5111..491a011 100644
> > --- a/include/linux/iommu.h
> > +++ b/include/linux/iommu.h
> > @@ -131,6 +131,15 @@ struct iommu_dm_region {
> >  	int			prot;
> >  };
> >  
> > +struct pasid_table_info {
> > +	__u64	ptr;	/* PASID table ptr */
> > +	__u64	size;	/* PASID table size*/
> > +	__u32	model;	/* magic number */
> > +#define INTEL_IOMMU	(1 << 0)
> > +#define ARM_SMMU	(1 << 1)
> > +	__u8	opaque[];/* IOMMU-specific details */
> > +};
> 
> This needs to be in uapi since you're expecting a user to pass it 

yes, it is. Thx for the correction.

Thanks,
Yi L
> > +
> >  #ifdef CONFIG_IOMMU_API
> >  
> >  /**
> > @@ -159,6 +168,8 @@ struct iommu_dm_region {
> >   * @domain_get_windows: Return the number of windows for a domain
> >   * @of_xlate: add OF master IDs to iommu grouping
> >   * @pgsize_bitmap: bitmap of all possible supported page sizes
> > + * @bind_pasid_table: bind pasid table pointer for guest SVM
> > + * @unbind_pasid_table: unbind pasid table pointer and restore defaults
> >   */
> >  struct iommu_ops {
> >  	bool (*capable)(enum iommu_cap);
> > @@ -200,6 +211,10 @@ struct iommu_ops {
> >  	u32 (*domain_get_windows)(struct iommu_domain *domain);
> >  
> >  	int (*of_xlate)(struct device *dev, struct of_phandle_args *args);
> > +	int (*bind_pasid_table)(struct iommu_domain *domain, struct device *dev,
> > +				struct pasid_table_info *pasidt_binfo);
> > +	int (*unbind_pasid_table)(struct iommu_domain *domain,
> > +				struct device *dev);
> >  
> >  	unsigned long pgsize_bitmap;
> >  };
> > @@ -221,6 +236,10 @@ extern int iommu_attach_device(struct iommu_domain *domain,
> >  			       struct device *dev);
> >  extern void iommu_detach_device(struct iommu_domain *domain,
> >  				struct device *dev);
> > +extern int iommu_bind_pasid_table(struct iommu_domain *domain,
> > +		struct device *dev, struct pasid_table_info *pasidt_binfo);
> > +extern int iommu_unbind_pasid_table(struct iommu_domain *domain,
> > +				struct device *dev);
> >  extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
> >  extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
> >  		     phys_addr_t paddr, size_t size, int prot);
> > @@ -595,6 +614,18 @@ const struct iommu_ops *iommu_get_instance(struct fwnode_handle *fwnode)
> >  	return NULL;
> >  }
> >  
> > +static inline
> > +int iommu_bind_pasid_table(struct iommu_domain *domain, struct device *dev,
> > +			struct pasid_table_info *pasidt_binfo)
> > +{
> > +	return -EINVAL;
> > +}
> > +static inline
> > +int iommu_unbind_pasid_table(struct iommu_domain *domain, struct device *dev)
> > +{
> > +	return -EINVAL;
> > +}
> > +
> >  #endif /* CONFIG_IOMMU_API */
> >  
> >  #endif /* __LINUX_IOMMU_H */
>
Liu, Yi L May 23, 2017, 7:50 a.m. UTC | #11
On Fri, Apr 28, 2017 at 01:51:42PM +0100, Jean-Philippe Brucker wrote:
> On 28/04/17 10:04, Liu, Yi L wrote:
Hi Jean,

Sorry for the delay response. Still have some follow-up comments on
per-device or per-group. Pls refer to comments inline.

> > On Wed, Apr 26, 2017 at 05:56:45PM +0100, Jean-Philippe Brucker wrote:
> >> Hi Yi, Jacob,
> >>
> >> On 26/04/17 11:11, Liu, Yi L wrote:
> >>> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> >>>
> >>> Virtual IOMMU was proposed to support Shared Virtual Memory (SVM) use
> >>> case in the guest:
> >>> https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg05311.html
> >>>
> >>> As part of the proposed architecture, when a SVM capable PCI
> >>> device is assigned to a guest, nested mode is turned on. Guest owns the
> >>> first level page tables (request with PASID) and performs GVA->GPA
> >>> translation. Second level page tables are owned by the host for GPA->HPA
> >>> translation for both request with and without PASID.
> >>>
> >>> A new IOMMU driver interface is therefore needed to perform tasks as
> >>> follows:
> >>> * Enable nested translation and appropriate translation type
> >>> * Assign guest PASID table pointer (in GPA) and size to host IOMMU
> >>>
> >>> This patch introduces new functions called iommu_(un)bind_pasid_table()
> >>> to IOMMU APIs. Architecture specific IOMMU function can be added later
> >>> to perform the specific steps for binding pasid table of assigned devices.
> >>>
> >>> This patch also adds model definition in iommu.h. It would be used to
> >>> check if the bind request is from a compatible entity. e.g. a bind
> >>> request from an intel_iommu emulator may not be supported by an ARM SMMU
> >>> driver.
> >>>
> >>> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> >>> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> >>> ---
> >>>  drivers/iommu/iommu.c | 19 +++++++++++++++++++
> >>>  include/linux/iommu.h | 31 +++++++++++++++++++++++++++++++
> >>>  2 files changed, 50 insertions(+)
> >>>
> >>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> >>> index dbe7f65..f2da636 100644
> >>> --- a/drivers/iommu/iommu.c
> >>> +++ b/drivers/iommu/iommu.c
> >>> @@ -1134,6 +1134,25 @@ int iommu_attach_device(struct iommu_domain *domain, struct device *dev)
> >>>  }
> >>>  EXPORT_SYMBOL_GPL(iommu_attach_device);
> >>>  
> >>> +int iommu_bind_pasid_table(struct iommu_domain *domain, struct device *dev,
> >>> +			struct pasid_table_info *pasidt_binfo)
> >>
> >> I guess that domain can always be deduced from dev using
> >> iommu_get_domain_for_dev, and doesn't need to be passed as argument?
> >>
> >> For the next version of my SVM series, I was thinking of passing group
> >> instead of device to iommu_bind. Since all devices in a group are expected
> >> to share the same mappings (whether they want it or not), users will have
> > 
> > Virtual address space is not tied to protection domain as I/O virtual address
> > space does. Is it really necessary to affect all the devices in this group.
> > Or it is just for consistence?
> 
> It's mostly about consistency, and also avoid hiding implicit behavior in
> the IOMMU driver. I have the following example, described using group and
> domain structures from the IOMMU API:
>                  ____________________
>                 |IOMMU  ____________ |
>                 |      |DOM  ______ ||
>                 |      |    |GRP   |||     bind
>                 |      |    |    A<-----------------Task 1
>                 |      |    |    B |||
>                 |      |    |______|||
>                 |      |     ______ ||
>                 |      |    |GRP   |||
>                 |      |    |    C |||
>                 |      |    |______|||
>                 |      |____________||
>                 |       ____________ |
>                 |      |DOM  ______ ||
>                 |      |    |GRP   |||
>                 |      |    |    D |||
>                 |      |    |______|||
>                 |      |____________||
>                 |____________________|
> 
> Let's take PCI functions A, B, C, and D, all with PASID capabilities. Due
> to some hardware limitation (in the bus, the device or the IOMMU), B can
> see all DMA transactions issued by A. A and B are therefore in the same
> IOMMU group. C and D can be isolated by the IOMMU, so they each have their
> own group.
> 
> (As far as I know, in the SVM world at the moment, devices are neatly
> integrated and there is no need for putting multiple devices in the same
> IOMMU group, but I don't think we should expect all future SVM systems to
> be well-behaved.)
>
> So when a user binds Task 1 to device A, it is *implicitly* giving device
> B access to Task 1 as well. Simply because the IOMMU is unable to isolate
> A from B, PASID or not. B could access the same address space as A, even
> if you don't call bind again to explicitly attach the PASID table to B.
> 
> If the bind is done with device as argument, maybe users will believe that
> using PASIDs provides an additional level of isolation within a group,
> when it really doesn't. That's why I'm inclined to have the whole bind API
> be on groups rather than devices, if only for clarity.

This may depend on how the user understand the isolation. I think different
PASID does mean different address space. From this perspective, it does look
like isolation.

> But I don't know, maybe a comment explaining the above would be sufficient.
> 
> To be frank my comment about group versus device is partly to make sure
> that I grasp the various concepts correctly and that we're on the same
> page. Doing the bind on groups is less significant in your case, for PASID
> table binding, because VFIO already takes care of IOMMU group properly. In
> my case I expect DRM, network, DMA drivers to use the API as well for
> binding tasks, and I don't want to introduce ambiguity in the API that
> would lead to security holes later.

For this part, would you provide more detail about why it would be more
significant to bind on group level in your case? I think we need strong
reason to support it. Currently, the other map_page APIs are passing
device as argument. Would it also be recommended to use group as argument?

Thanks,
Yi L

> >> to do iommu_group_for_each_dev anyway (as you do in patch 6/8). So it
> >> might be simpler to let the IOMMU core take the group lock and do
> >> group->domain->ops->bind_task(dev...) for each device. The question also
> >> holds for iommu_do_invalidate in patch 3/8.
> > 
> > In my understanding, it is moving the for_each_dev loop into iommu driver?
> > Is it?
> 
> Yes, that's what I meant
> 
> >> This way the prototypes would be:
> >> int iommu_bind...(struct iommu_group *group, struct ... *info)
> >> int iommu_unbind...(struct iommu_group *group, struct ...*info)
> >> int iommu_invalidate...(struct iommu_group *group, struct ...*info)
> > 
> > For PASID table binding from guest, I think it'd better to be per-device op
> > since the bind operation wants to modify the host context entry. But we may
> > still share the API and do things differently in iommu driver.
> 
> Sure, as said above the use cases for PASID table and single PASID binding
> are different, sharing the API is not strictly necessary.
> 
> > For invalidation, I think it'd better to be per-group. Actually, with guest
> > IOMMU exists, there is only one group in a domain on Intel platform. Do it for
> > each device is not expected. How about it on ARM?
> 
> In ARM systems with the DMA API (IOMMU_DOMAIN_DMA), there is one group per
> domain. But with VFIO (IOMMU_DOMAIN_UNMANAGED), VFIO will try to attach
> multiple groups in the same container to the same domain when possible.
> 
> >> For PASID table binding it might not matter much, as VFIO will most likely
> >> be the only user. But task binding will be called by device drivers, which
> >> by now should be encouraged to do things at iommu_group granularity.
> >> Alternatively it could be done implicitly like in iommu_attach_device,
> >> with "iommu_bind_device_x" calling "iommu_bind_group_x".
> > 
> > Do you mean the bind task from userspace driver? I guess you're trying to do
> > different types of binding request in a single svm_bind API?
> > 
> >>
> >> Extending this reasoning, since groups in a domain are also supposed to
> >> have the same mappings, then similarly to map/unmap,
> >> bind/unbind/invalidate should really be done with an iommu_domain (and
> >> nothing else) as target argument. However this requires the IOMMU core to
> >> keep a group list in each domain, which might complicate things a little
> >> too much.
> >>
> >> But "all devices in a domain share the same PASID table" is the paradigm
> >> I'm currently using in the guts of arm-smmu-v3. And I wonder if, as with
> >> iommu_group, it should be made more explicit to users, so they don't
> >> assume that devices within a domain are isolated from each others with
> >> regard to PASID DMA.
> > 
> > Is the isolation you mentioned means forbidding to do PASID DMA to the same
> > virtual address space when the device comes from different domain?
> 
> In the above example, devices A, B and C are in the same IOMMU domain
> (because, for instance, user put the two groups in the same VFIO
> container.) Then in the SMMUv3 driver they would all share the same PASID
> table. A, B and C can access Task 1 with the PASID obtained during the
> depicted bind. They don't need to call bind again for device C, though it
> would be good practice.
> 
> But D is in a different domain, so unless you also call bind on Task 1 for
> device D, there is no way that D can access Task 1.
> 
> Thanks,
> Jean
>
Jean-Philippe Brucker May 25, 2017, 12:33 p.m. UTC | #12
On 23/05/17 08:50, Liu, Yi L wrote:
> On Fri, Apr 28, 2017 at 01:51:42PM +0100, Jean-Philippe Brucker wrote:
[...]
>>>>
>>>> For the next version of my SVM series, I was thinking of passing group
>>>> instead of device to iommu_bind. Since all devices in a group are expected
>>>> to share the same mappings (whether they want it or not), users will have
>>>
>>> Virtual address space is not tied to protection domain as I/O virtual address
>>> space does. Is it really necessary to affect all the devices in this group.
>>> Or it is just for consistence?
>>
>> It's mostly about consistency, and also avoid hiding implicit behavior in
>> the IOMMU driver. I have the following example, described using group and
>> domain structures from the IOMMU API:
>>                  ____________________
>>                 |IOMMU  ____________ |
>>                 |      |DOM  ______ ||
>>                 |      |    |GRP   |||     bind
>>                 |      |    |    A<-----------------Task 1
>>                 |      |    |    B |||
>>                 |      |    |______|||
>>                 |      |     ______ ||
>>                 |      |    |GRP   |||
>>                 |      |    |    C |||
>>                 |      |    |______|||
>>                 |      |____________||
>>                 |       ____________ |
>>                 |      |DOM  ______ ||
>>                 |      |    |GRP   |||
>>                 |      |    |    D |||
>>                 |      |    |______|||
>>                 |      |____________||
>>                 |____________________|
>>
>> Let's take PCI functions A, B, C, and D, all with PASID capabilities. Due
>> to some hardware limitation (in the bus, the device or the IOMMU), B can
>> see all DMA transactions issued by A. A and B are therefore in the same
>> IOMMU group. C and D can be isolated by the IOMMU, so they each have their
>> own group.
>>
>> (As far as I know, in the SVM world at the moment, devices are neatly
>> integrated and there is no need for putting multiple devices in the same
>> IOMMU group, but I don't think we should expect all future SVM systems to
>> be well-behaved.)
>>
>> So when a user binds Task 1 to device A, it is *implicitly* giving device
>> B access to Task 1 as well. Simply because the IOMMU is unable to isolate
>> A from B, PASID or not. B could access the same address space as A, even
>> if you don't call bind again to explicitly attach the PASID table to B.
>>
>> If the bind is done with device as argument, maybe users will believe that
>> using PASIDs provides an additional level of isolation within a group,
>> when it really doesn't. That's why I'm inclined to have the whole bind API
>> be on groups rather than devices, if only for clarity.
> 
> This may depend on how the user understand the isolation. I think different
> PASID does mean different address space. From this perspective, it does look
> like isolation.

Yes, and it isn't isolation. Not at device granularity, that is. IOMMU has
the concept of group because sometimes the hardware simply cannot isolate
devices. Different PASIDs does mean different address spaces, but two
devices in the same group may be able to access each other's address
spaces, regardless of the presence of a PASID.

To illustrate the problem with PASIDs, let's say that for whatever reason
(e.g. lack of ACS Source Validation in a PCI switch), device B (0x0100)
can spoof device A's RID (0x0200). Therefore we put A and B in the same
IOMMU group.

User binds Task 1 to device A and Task 2 to device B. They use PASIDs X
and Y, so user thinks that they are isolated. But given the physical
properties of the system, device B can pretend it is device A, and access
the whole address space of Task 1 by sending transactions with RID 0x0200
and PASID X. So user effectively created a backdoor between tasks 1 and 2
without knowing it, and using PASIDs didn't add any protection.

>> But I don't know, maybe a comment explaining the above would be sufficient.
>>
>> To be frank my comment about group versus device is partly to make sure
>> that I grasp the various concepts correctly and that we're on the same
>> page. Doing the bind on groups is less significant in your case, for PASID
>> table binding, because VFIO already takes care of IOMMU group properly. In
>> my case I expect DRM, network, DMA drivers to use the API as well for
>> binding tasks, and I don't want to introduce ambiguity in the API that
>> would lead to security holes later.
> 
> For this part, would you provide more detail about why it would be more
> significant to bind on group level in your case? I think we need strong
> reason to support it. Currently, the other map_page APIs are passing
> device as argument. Would it also be recommended to use group as argument?

Well I'm only concerned about the API we're introducing at the moment, I'm
not suggesting we change existing ones. Because PASID is a new concept and
is currently unregulated, it would be good for new users to understand
what kind of isolation they are getting from it. And it is more important
than previous APIs because SVM's main objective is to simplify userspace
programming model, and therefore bring e.g. GPU programming to users that
will be more naive with regard to hardware properties and limitations.

I'm thinking for instance about GPU drivers using the bind API to provide
OpenCL SVM to userspace. If the person writing the driver has to pass
IOMMU groups instead of devices, they will have less chance to fall into
the trap described above. They would have to follow the VFIO model, and
propagate the concept of IOMMU groups all the way to userspace.

As I said, maybe we can just add a comment warning future users about the
limitations of PASID isolation and that will be enough, I really don't
know what's best. Since VFIO will likely stay the only user of PASID
tables binding and handles IOMMU groups well, I think it boils down to
stylistic decision in your case.

Thanks,
Jean

>>>> to do iommu_group_for_each_dev anyway (as you do in patch 6/8). So it
>>>> might be simpler to let the IOMMU core take the group lock and do
>>>> group->domain->ops->bind_task(dev...) for each device. The question also
>>>> holds for iommu_do_invalidate in patch 3/8.
>>>
>>> In my understanding, it is moving the for_each_dev loop into iommu driver?
>>> Is it?
>>
>> Yes, that's what I meant
>>
>>>> This way the prototypes would be:
>>>> int iommu_bind...(struct iommu_group *group, struct ... *info)
>>>> int iommu_unbind...(struct iommu_group *group, struct ...*info)
>>>> int iommu_invalidate...(struct iommu_group *group, struct ...*info)
>>>
>>> For PASID table binding from guest, I think it'd better to be per-device op
>>> since the bind operation wants to modify the host context entry. But we may
>>> still share the API and do things differently in iommu driver.
>>
>> Sure, as said above the use cases for PASID table and single PASID binding
>> are different, sharing the API is not strictly necessary.
>>
>>> For invalidation, I think it'd better to be per-group. Actually, with guest
>>> IOMMU exists, there is only one group in a domain on Intel platform. Do it for
>>> each device is not expected. How about it on ARM?
>>
>> In ARM systems with the DMA API (IOMMU_DOMAIN_DMA), there is one group per
>> domain. But with VFIO (IOMMU_DOMAIN_UNMANAGED), VFIO will try to attach
>> multiple groups in the same container to the same domain when possible.
>>
>>>> For PASID table binding it might not matter much, as VFIO will most likely
>>>> be the only user. But task binding will be called by device drivers, which
>>>> by now should be encouraged to do things at iommu_group granularity.
>>>> Alternatively it could be done implicitly like in iommu_attach_device,
>>>> with "iommu_bind_device_x" calling "iommu_bind_group_x".
>>>
>>> Do you mean the bind task from userspace driver? I guess you're trying to do
>>> different types of binding request in a single svm_bind API?
>>>
>>>>
>>>> Extending this reasoning, since groups in a domain are also supposed to
>>>> have the same mappings, then similarly to map/unmap,
>>>> bind/unbind/invalidate should really be done with an iommu_domain (and
>>>> nothing else) as target argument. However this requires the IOMMU core to
>>>> keep a group list in each domain, which might complicate things a little
>>>> too much.
>>>>
>>>> But "all devices in a domain share the same PASID table" is the paradigm
>>>> I'm currently using in the guts of arm-smmu-v3. And I wonder if, as with
>>>> iommu_group, it should be made more explicit to users, so they don't
>>>> assume that devices within a domain are isolated from each others with
>>>> regard to PASID DMA.
>>>
>>> Is the isolation you mentioned means forbidding to do PASID DMA to the same
>>> virtual address space when the device comes from different domain?
>>
>> In the above example, devices A, B and C are in the same IOMMU domain
>> (because, for instance, user put the two groups in the same VFIO
>> container.) Then in the SMMUv3 driver they would all share the same PASID
>> table. A, B and C can access Task 1 with the PASID obtained during the
>> depicted bind. They don't need to call bind again for device C, though it
>> would be good practice.
>>
>> But D is in a different domain, so unless you also call bind on Task 1 for
>> device D, there is no way that D can access Task 1.
>>
>> Thanks,
>> Jean
>>
diff mbox

Patch

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index dbe7f65..f2da636 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1134,6 +1134,25 @@  int iommu_attach_device(struct iommu_domain *domain, struct device *dev)
 }
 EXPORT_SYMBOL_GPL(iommu_attach_device);
 
+int iommu_bind_pasid_table(struct iommu_domain *domain, struct device *dev,
+			struct pasid_table_info *pasidt_binfo)
+{
+	if (unlikely(!domain->ops->bind_pasid_table))
+		return -EINVAL;
+
+	return domain->ops->bind_pasid_table(domain, dev, pasidt_binfo);
+}
+EXPORT_SYMBOL_GPL(iommu_bind_pasid_table);
+
+int iommu_unbind_pasid_table(struct iommu_domain *domain, struct device *dev)
+{
+	if (unlikely(!domain->ops->unbind_pasid_table))
+		return -EINVAL;
+
+	return domain->ops->unbind_pasid_table(domain, dev);
+}
+EXPORT_SYMBOL_GPL(iommu_unbind_pasid_table);
+
 static void __iommu_detach_device(struct iommu_domain *domain,
 				  struct device *dev)
 {
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 0ff5111..491a011 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -131,6 +131,15 @@  struct iommu_dm_region {
 	int			prot;
 };
 
+struct pasid_table_info {
+	__u64	ptr;	/* PASID table ptr */
+	__u64	size;	/* PASID table size*/
+	__u32	model;	/* magic number */
+#define INTEL_IOMMU	(1 << 0)
+#define ARM_SMMU	(1 << 1)
+	__u8	opaque[];/* IOMMU-specific details */
+};
+
 #ifdef CONFIG_IOMMU_API
 
 /**
@@ -159,6 +168,8 @@  struct iommu_dm_region {
  * @domain_get_windows: Return the number of windows for a domain
  * @of_xlate: add OF master IDs to iommu grouping
  * @pgsize_bitmap: bitmap of all possible supported page sizes
+ * @bind_pasid_table: bind pasid table pointer for guest SVM
+ * @unbind_pasid_table: unbind pasid table pointer and restore defaults
  */
 struct iommu_ops {
 	bool (*capable)(enum iommu_cap);
@@ -200,6 +211,10 @@  struct iommu_ops {
 	u32 (*domain_get_windows)(struct iommu_domain *domain);
 
 	int (*of_xlate)(struct device *dev, struct of_phandle_args *args);
+	int (*bind_pasid_table)(struct iommu_domain *domain, struct device *dev,
+				struct pasid_table_info *pasidt_binfo);
+	int (*unbind_pasid_table)(struct iommu_domain *domain,
+				struct device *dev);
 
 	unsigned long pgsize_bitmap;
 };
@@ -221,6 +236,10 @@  extern int iommu_attach_device(struct iommu_domain *domain,
 			       struct device *dev);
 extern void iommu_detach_device(struct iommu_domain *domain,
 				struct device *dev);
+extern int iommu_bind_pasid_table(struct iommu_domain *domain,
+		struct device *dev, struct pasid_table_info *pasidt_binfo);
+extern int iommu_unbind_pasid_table(struct iommu_domain *domain,
+				struct device *dev);
 extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
 extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
 		     phys_addr_t paddr, size_t size, int prot);
@@ -595,6 +614,18 @@  const struct iommu_ops *iommu_get_instance(struct fwnode_handle *fwnode)
 	return NULL;
 }
 
+static inline
+int iommu_bind_pasid_table(struct iommu_domain *domain, struct device *dev,
+			struct pasid_table_info *pasidt_binfo)
+{
+	return -EINVAL;
+}
+static inline
+int iommu_unbind_pasid_table(struct iommu_domain *domain, struct device *dev)
+{
+	return -EINVAL;
+}
+
 #endif /* CONFIG_IOMMU_API */
 
 #endif /* __LINUX_IOMMU_H */