diff mbox

[02/37] iommu/sva: Bind process address spaces to devices

Message ID 20180212183352.22730-3-jean-philippe.brucker@arm.com (mailing list archive)
State New, archived
Delegated to: Bjorn Helgaas
Headers show

Commit Message

Jean-Philippe Brucker Feb. 12, 2018, 6:33 p.m. UTC
Add bind() and unbind() operations to the IOMMU API. Device drivers can
use them to share process page tables with their devices. bind_group()
is provided for VFIO's convenience, as it needs to provide a coherent
interface on containers. Other device drivers will most likely want to
use bind_device(), which binds a single device in the group.

Regardless of the IOMMU group or domain a device is in, device drivers
should call bind() for each device that will use the PASID.

This patch only adds skeletons for the device driver API; most of the
implementation is still missing.

Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
---
 drivers/iommu/iommu-sva.c | 105 ++++++++++++++++++++++++++++++++++++++++++++++
 drivers/iommu/iommu.c     |  63 ++++++++++++++++++++++++++++
 include/linux/iommu.h     |  36 ++++++++++++++++
 3 files changed, 204 insertions(+)

Comments

Tian, Kevin Feb. 13, 2018, 7:54 a.m. UTC | #1
> From: Jean-Philippe Brucker
> Sent: Tuesday, February 13, 2018 2:33 AM
> 
> Add bind() and unbind() operations to the IOMMU API. Device drivers can
> use them to share process page tables with their devices. bind_group()
> is provided for VFIO's convenience, as it needs to provide a coherent
> interface on containers. Other device drivers will most likely want to
> use bind_device(), which binds a single device in the group.

I saw your bind_group implementation tries to bind the address space
for all devices within a group, which IMO has some problem. Based on PCIe
spec, packet routing on the bus doesn't take PASID into consideration. 
Since devices within the same group cannot be isolated based on requestor-ID,
i.e. traffic is not guaranteed to go to the IOMMU, enabling SVA on multiple
devices could cause undesired p2p.
 
If my understanding of the PCIe spec is correct, we should probably fail
calls to bind_group()/bind_device() when there are multiple devices within
the given group. If there is only one device, then bind_group() is essentially
a wrapper around bind_device().

> 
> Regardless of the IOMMU group or domain a device is in, device drivers
> should call bind() for each device that will use the PASID.
> 
> This patch only adds skeletons for the device driver API, most of the
> implementation is still missing.
> 
> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> ---
>  drivers/iommu/iommu-sva.c | 105
> ++++++++++++++++++++++++++++++++++++++++++++++
>  drivers/iommu/iommu.c     |  63 ++++++++++++++++++++++++++++
>  include/linux/iommu.h     |  36 ++++++++++++++++
>  3 files changed, 204 insertions(+)
> 
> diff --git a/drivers/iommu/iommu-sva.c b/drivers/iommu/iommu-sva.c
> index cab5d723520f..593685d891bf 100644
> --- a/drivers/iommu/iommu-sva.c
> +++ b/drivers/iommu/iommu-sva.c
> @@ -9,6 +9,9 @@
> 
>  #include <linux/iommu.h>
> 
> +/* TODO: stub for the fault queue. Remove later. */
> +#define iommu_fault_queue_flush(...)
> +
>  /**
>   * iommu_sva_device_init() - Initialize Shared Virtual Addressing for a
> device
>   * @dev: the device
> @@ -78,6 +81,8 @@ int iommu_sva_device_shutdown(struct device *dev)
>  	if (!domain)
>  		return -ENODEV;
> 
> +	__iommu_sva_unbind_dev_all(dev);
> +
>  	if (domain->ops->sva_device_shutdown)
>  		domain->ops->sva_device_shutdown(dev);
> 
> @@ -88,3 +93,103 @@ int iommu_sva_device_shutdown(struct device
> *dev)
>  	return 0;
>  }
>  EXPORT_SYMBOL_GPL(iommu_sva_device_shutdown);
> +
> +/**
> + * iommu_sva_bind_device() - Bind a process address space to a device
> + * @dev: the device
> + * @mm: the mm to bind, caller must hold a reference to it
> + * @pasid: valid address where the PASID will be stored
> + * @flags: bond properties (IOMMU_SVA_FEAT_*)
> + * @drvdata: private data passed to the mm exit handler
> + *
> + * Create a bond between device and task, allowing the device to access
> the mm
> + * using the returned PASID. A subsequent bind() for the same device and
> mm will
> + * reuse the bond (and return the same PASID), but users will have to call
> + * unbind() twice.

what's the point of requiring unbind twice?

> + *
> + * Callers should have taken care of setting up SVA for this device with
> + * iommu_sva_device_init() beforehand. They may also be notified of the
> bond
> + * disappearing, for example when the last task that uses the mm dies, by
> + * registering a notifier with iommu_register_mm_exit_handler().
> + *
> + * If IOMMU_SVA_FEAT_PASID is requested, a PASID is allocated and
> returned.
> + * TODO: The alternative, binding the non-PASID context to an mm, isn't
> + * supported at the moment because existing IOMMU domain types
> initialize the
> + * non-PASID context for iommu_map()/unmap() or bypass. This requires
> a new
> + * domain type.
> + *
> + * If IOMMU_SVA_FEAT_IOPF is not requested, the caller must pin down
> all
> + * mappings shared with the device. mlock() isn't sufficient, as it doesn't
> + * prevent minor page faults (e.g. copy-on-write). TODO: !IOPF isn't
> allowed at
> + * the moment.
> + *
> + * On success, 0 is returned and @pasid contains a valid ID. Otherwise, an
> error
> + * is returned.
> + */
> +int iommu_sva_bind_device(struct device *dev, struct mm_struct *mm,
> int *pasid,
> +			  unsigned long flags, void *drvdata)
> +{
> +	struct iommu_domain *domain;
> +	struct iommu_param *dev_param = dev->iommu_param;
> +
> +	domain = iommu_get_domain_for_dev(dev);
> +	if (!domain)
> +		return -EINVAL;
> +
> +	if (!pasid)
> +		return -EINVAL;
> +
> +	if (!dev_param || (flags & ~dev_param->sva_features))
> +		return -EINVAL;
> +
> +	if (flags != (IOMMU_SVA_FEAT_PASID | IOMMU_SVA_FEAT_IOPF))
> +		return -EINVAL;
> +
> +	return -ENOSYS; /* TODO */
> +}
> +EXPORT_SYMBOL_GPL(iommu_sva_bind_device);
> +
> +/**
> + * iommu_sva_unbind_device() - Remove a bond created with
> iommu_sva_bind_device
> + * @dev: the device
> + * @pasid: the pasid returned by bind()
> + *
> + * Remove bond between device and address space identified by @pasid.
> Users
> + * should not call unbind() if the corresponding mm exited (as the PASID
> might
> + * have been reallocated to another process.)
> + *
> + * The device must not be issuing any more transaction for this PASID. All
> + * outstanding page requests for this PASID must have been flushed to the
> IOMMU.
> + *
> + * Returns 0 on success, or an error value
> + */
> +int iommu_sva_unbind_device(struct device *dev, int pasid)
> +{
> +	struct iommu_domain *domain;
> +
> +	domain = iommu_get_domain_for_dev(dev);
> +	if (WARN_ON(!domain))
> +		return -EINVAL;
> +
> +	/*
> +	 * Caller stopped the device from issuing PASIDs, now make sure
> they are
> +	 * out of the fault queue.
> +	 */
> +	iommu_fault_queue_flush(dev);
> +
> +	return -ENOSYS; /* TODO */
> +}
> +EXPORT_SYMBOL_GPL(iommu_sva_unbind_device);
> +
> +/**
> + * __iommu_sva_unbind_dev_all() - Detach all address spaces from this
> device
> + *
> + * When detaching @device from a domain, IOMMU drivers should use
> this helper.
> + */
> +void __iommu_sva_unbind_dev_all(struct device *dev)
> +{
> +	iommu_fault_queue_flush(dev);
> +
> +	/* TODO */
> +}
> +EXPORT_SYMBOL_GPL(__iommu_sva_unbind_dev_all);
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index d4a4edaf2d8c..f977851c522b 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -1535,6 +1535,69 @@ void iommu_detach_group(struct
> iommu_domain *domain, struct iommu_group *group)
>  }
>  EXPORT_SYMBOL_GPL(iommu_detach_group);
> 
> +/*
> + * iommu_sva_bind_group() - Share address space with all devices in the
> group.
> + * @group: the iommu group
> + * @mm: the mm to bind
> + * @pasid: valid address where the PASID will be stored
> + * @flags: bond properties (IOMMU_PROCESS_BIND_*)
> + * @drvdata: private data passed to the mm exit handler
> + *
> + * Create a bond between group and process, allowing devices in the
> group to
> + * access the process address space using @pasid.
> + *
> + * Refer to iommu_sva_bind_device() for more details.
> + *
> + * On success, 0 is returned and @pasid contains a valid ID. Otherwise, an
> error
> + * is returned.
> + */
> +int iommu_sva_bind_group(struct iommu_group *group, struct
> mm_struct *mm,
> +			 int *pasid, unsigned long flags, void *drvdata)
> +{
> +	struct group_device *device;
> +	int ret = -ENODEV;
> +
> +	if (!group->domain)
> +		return -EINVAL;
> +
> +	mutex_lock(&group->mutex);
> +	list_for_each_entry(device, &group->devices, list) {
> +		ret = iommu_sva_bind_device(device->dev, mm, pasid,
> flags,
> +					    drvdata);
> +		if (ret)
> +			break;
> +	}
> +
> +	if (ret) {
> +		list_for_each_entry_continue_reverse(device, &group-
> >devices, list)
> +			iommu_sva_unbind_device(device->dev, *pasid);
> +	}
> +	mutex_unlock(&group->mutex);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(iommu_sva_bind_group);
> +
> +/**
> + * iommu_sva_unbind_group() - Remove a bond created with
> iommu_sva_bind_group()
> + * @group: the group
> + * @pasid: the pasid returned by bind
> + *
> + * Refer to iommu_sva_unbind_device() for more details.
> + */
> +int iommu_sva_unbind_group(struct iommu_group *group, int pasid)
> +{
> +	struct group_device *device;
> +
> +	mutex_lock(&group->mutex);
> +	list_for_each_entry(device, &group->devices, list)
> +		iommu_sva_unbind_device(device->dev, pasid);
> +	mutex_unlock(&group->mutex);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(iommu_sva_unbind_group);
> +
>  phys_addr_t iommu_iova_to_phys(struct iommu_domain *domain,
> dma_addr_t iova)
>  {
>  	if (unlikely(domain->ops->iova_to_phys == NULL))
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index e9e09eecdece..1fb10d64b9e5 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -576,6 +576,10 @@ int iommu_fwspec_init(struct device *dev, struct
> fwnode_handle *iommu_fwnode,
>  void iommu_fwspec_free(struct device *dev);
>  int iommu_fwspec_add_ids(struct device *dev, u32 *ids, int num_ids);
>  const struct iommu_ops *iommu_ops_from_fwnode(struct
> fwnode_handle *fwnode);
> +extern int iommu_sva_bind_group(struct iommu_group *group,
> +				struct mm_struct *mm, int *pasid,
> +				unsigned long flags, void *drvdata);
> +extern int iommu_sva_unbind_group(struct iommu_group *group, int
> pasid);
> 
>  #else /* CONFIG_IOMMU_API */
> 
> @@ -890,12 +894,28 @@ const struct iommu_ops
> *iommu_ops_from_fwnode(struct fwnode_handle *fwnode)
>  	return NULL;
>  }
> 
> +static inline int iommu_sva_bind_group(struct iommu_group *group,
> +				       struct mm_struct *mm, int *pasid,
> +				       unsigned long flags, void *drvdata)
> +{
> +	return -ENODEV;
> +}
> +
> +static inline int iommu_sva_unbind_group(struct iommu_group *group,
> int pasid)
> +{
> +	return -ENODEV;
> +}
> +
>  #endif /* CONFIG_IOMMU_API */
> 
>  #ifdef CONFIG_IOMMU_SVA
>  extern int iommu_sva_device_init(struct device *dev, unsigned long
> features,
>  				 unsigned int max_pasid);
>  extern int iommu_sva_device_shutdown(struct device *dev);
> +extern int iommu_sva_bind_device(struct device *dev, struct mm_struct
> *mm,
> +				int *pasid, unsigned long flags, void
> *drvdata);
> +extern int iommu_sva_unbind_device(struct device *dev, int pasid);
> +extern void __iommu_sva_unbind_dev_all(struct device *dev);
>  #else /* CONFIG_IOMMU_SVA */
>  static inline int iommu_sva_device_init(struct device *dev,
>  					unsigned long features,
> @@ -908,6 +928,22 @@ static inline int
> iommu_sva_device_shutdown(struct device *dev)
>  {
>  	return -ENODEV;
>  }
> +
> +static inline int iommu_sva_bind_device(struct device *dev,
> +					struct mm_struct *mm, int *pasid,
> +					unsigned long flags, void *drvdata)
> +{
> +	return -ENODEV;
> +}
> +
> +static inline int iommu_sva_unbind_device(struct device *dev, int pasid)
> +{
> +	return -ENODEV;
> +}
> +
> +static inline void __iommu_sva_unbind_dev_all(struct device *dev)
> +{
> +}
>  #endif /* CONFIG_IOMMU_SVA */
> 
>  #endif /* __LINUX_IOMMU_H */
> --
> 2.15.1
Jean-Philippe Brucker Feb. 13, 2018, 12:57 p.m. UTC | #2
On 13/02/18 07:54, Tian, Kevin wrote:
>> From: Jean-Philippe Brucker
>> Sent: Tuesday, February 13, 2018 2:33 AM
>>
>> Add bind() and unbind() operations to the IOMMU API. Device drivers can
>> use them to share process page tables with their devices. bind_group()
>> is provided for VFIO's convenience, as it needs to provide a coherent
>> interface on containers. Other device drivers will most likely want to
>> use bind_device(), which binds a single device in the group.
> 
> I saw your bind_group implementation tries to bind the address space
> for all devices within a group, which IMO has some problem. Based on PCIe
> spec, packet routing on the bus doesn't take PASID into consideration. 
> since devices within same group cannot be isolated based on requestor-ID
> i.e. traffic not guaranteed going to IOMMU, enabling SVA on multiple devices
> could cause undesired p2p.
But so does enabling "classic" DMA... If two devices are not protected by
ACS for example, they are put in the same IOMMU group, and one device
might be able to snoop the other's DMA. VFIO allows userspace to create a
container for them and use MAP/UNMAP, but makes it explicit to the user
that for DMA, these devices are not isolated and must be considered as a
single device (you can't pass them to different VMs or put them in
different containers). So I tried to keep the same idea as MAP/UNMAP for
SVA, performing BIND/UNBIND operations on the VFIO container instead of
the device.

I kept the analogy simple though, because I don't think there will be many
SVA-capable systems that require IOMMU groups. They will likely implement
proper device isolation. Unlike iommu_attach_device(), bind_device()
doesn't call bind_group(), because keeping bonds consistent in groups is
complicated, not worth implementing (drivers can explicitly bind() all
devices that need it) and probably wouldn't ever be used. I also can't
test it. But maybe we could implement the following for now:

* bind_device() fails if the device's group has more than one device,
otherwise calls __bind_device(). This prevents device drivers that are
oblivious to IOMMU groups from opening a backdoor.

* bind_group() calls __bind_device() for all devices in group. This way
users that are aware of IOMMU groups can still use them safely. Note that
at the moment bind_group() fails as soon as it finds a device that doesn't
support SVA. Having all devices support SVA in a given group is
unrealistic and this behavior ought to be improved.

* hotplugging a device into a group still succeeds even if the group
already has mm bonds. Same happens for classic DMA, a hotplugged device
will have access to all mappings already present in the domain.

> If my understanding of PCIe spec is correct, probably we should fail 
> calling bind_group()/bind_device() when there are multiple devices within 
> the given group. If only one device then bind_group is essentially a wrapper
> to bind_device.>>
>> Regardless of the IOMMU group or domain a device is in, device drivers
>> should call bind() for each device that will use the PASID.
>>
[...]
>> +/**
>> + * iommu_sva_bind_device() - Bind a process address space to a device
>> + * @dev: the device
>> + * @mm: the mm to bind, caller must hold a reference to it
>> + * @pasid: valid address where the PASID will be stored
>> + * @flags: bond properties (IOMMU_SVA_FEAT_*)
>> + * @drvdata: private data passed to the mm exit handler
>> + *
>> + * Create a bond between device and task, allowing the device to access
>> the mm
>> + * using the returned PASID. A subsequent bind() for the same device and
>> mm will
>> + * reuse the bond (and return the same PASID), but users will have to call
>> + * unbind() twice.
> 
> what's the point of requiring unbind twice?

Mmh, that was necessary when we kept bond information as domain<->mm, but
since it's now device<->mm, we can probably remove the bond refcount. I
consider that a bind() between a given device and mm will always be issued
by the same driver.

Thanks,
Jean
Tian, Kevin Feb. 13, 2018, 11:34 p.m. UTC | #3
> From: Jean-Philippe Brucker
> Sent: Tuesday, February 13, 2018 8:57 PM
> 
> On 13/02/18 07:54, Tian, Kevin wrote:
> >> From: Jean-Philippe Brucker
> >> Sent: Tuesday, February 13, 2018 2:33 AM
> >>
> >> Add bind() and unbind() operations to the IOMMU API. Device drivers
> can
> >> use them to share process page tables with their devices. bind_group()
> >> is provided for VFIO's convenience, as it needs to provide a coherent
> >> interface on containers. Other device drivers will most likely want to
> >> use bind_device(), which binds a single device in the group.
> >
> > I saw your bind_group implementation tries to bind the address space
> > for all devices within a group, which IMO has some problem. Based on
> PCIe
> > spec, packet routing on the bus doesn't take PASID into consideration.
> > since devices within same group cannot be isolated based on requestor-
> ID
> > i.e. traffic not guaranteed going to IOMMU, enabling SVA on multiple
> devices
> > could cause undesired p2p.
> But so does enabling "classic" DMA... If two devices are not protected by
> ACS for example, they are put in the same IOMMU group, and one device
> might be able to snoop the other's DMA. VFIO allows userspace to create a
> container for them and use MAP/UNMAP, but makes it explicit to the user
> that for DMA, these devices are not isolated and must be considered as a
> single device (you can't pass them to different VMs or put them in
> different containers). So I tried to keep the same idea as MAP/UNMAP for
> SVA, performing BIND/UNBIND operations on the VFIO container instead of
> the device.

There is a small difference. For classic DMA we can reserve PCI BARs
when allocating IOVAs, so multiple devices in the same group can
still work correctly with the same translation applied, if isolation
between them is not a concern. For SVA, however, the addresses are
CPU virtual addresses managed by the kernel mm, so it is difficult
to introduce a similar address reservation. It is then possible for
a VA to fall into another device's BAR in the same group and cause
undesired p2p traffic. In that regard, SVA is actually functionally broken.

> 
> I kept the analogy simple though, because I don't think there will be many
> SVA-capable systems that require IOMMU groups. They will likely

I agree that multiple SVA-capable devices in the same IOMMU group is not
a typical configuration, especially since SVA is usually found on new devices.
Given the above limitation, I think we could just explicitly avoid
enabling SVA in that case. :-)

> implement
> proper device isolation. Unlike iommu_attach_device(), bind_device()
> doesn't call bind_group(), because keeping bonds consistent in groups is
> complicated, not worth implementing (drivers can explicitly bind() all
> devices that need it) and probably wouldn't ever be used. I also can't
> test it. But maybe we could implement the following for now:
> 
> * bind_device() fails if the device's group has more than one device,
> otherwise calls __bind_device(). This prevents device drivers that are
> oblivious to IOMMU groups from opening a backdoor.
> 
> * bind_group() calls __bind_device() for all devices in group. This way
> users that are aware of IOMMU groups can still use them safely. Note that
> at the moment bind_group() fails as soon as it finds a device that doesn't
> support SVA. Having all devices support SVA in a given group is
> unrealistic and this behavior ought to be improved.
> 
> * hotplugging a device into a group still succeeds even if the group
> already has mm bonds. Same happens for classic DMA, a hotplugged
> device
> will have access to all mappings already present in the domain.
> 
> > If my understanding of PCIe spec is correct, probably we should fail
> > calling bind_group()/bind_device() when there are multiple devices within
> > the given group. If only one device then bind_group is essentially a
> wrapper
> > to bind_device.>>
> >> Regardless of the IOMMU group or domain a device is in, device drivers
> >> should call bind() for each device that will use the PASID.
> >>
> [...]
> >> +/**
> >> + * iommu_sva_bind_device() - Bind a process address space to a device
> >> + * @dev: the device
> >> + * @mm: the mm to bind, caller must hold a reference to it
> >> + * @pasid: valid address where the PASID will be stored
> >> + * @flags: bond properties (IOMMU_SVA_FEAT_*)
> >> + * @drvdata: private data passed to the mm exit handler
> >> + *
> >> + * Create a bond between device and task, allowing the device to access
> >> the mm
> >> + * using the returned PASID. A subsequent bind() for the same device
> and
> >> mm will
> >> + * reuse the bond (and return the same PASID), but users will have to
> call
> >> + * unbind() twice.
> >
> > what's the point of requiring unbind twice?
> 
> Mmh, that was necessary when we kept bond information as domain<-
> >mm, but
> since it's now device<->mm, we can probably remove the bond refcount. I
> consider that a bind() between a given device and mm will always be issued
> by the same driver.
> 
> Thanks,
> Jean
Jean-Philippe Brucker Feb. 15, 2018, 12:40 p.m. UTC | #4
On 13/02/18 23:34, Tian, Kevin wrote:
>> From: Jean-Philippe Brucker
>> Sent: Tuesday, February 13, 2018 8:57 PM
>>
>> On 13/02/18 07:54, Tian, Kevin wrote:
>>>> From: Jean-Philippe Brucker
>>>> Sent: Tuesday, February 13, 2018 2:33 AM
>>>>
>>>> Add bind() and unbind() operations to the IOMMU API. Device drivers
>> can
>>>> use them to share process page tables with their devices. bind_group()
>>>> is provided for VFIO's convenience, as it needs to provide a coherent
>>>> interface on containers. Other device drivers will most likely want to
>>>> use bind_device(), which binds a single device in the group.
>>>
>>> I saw your bind_group implementation tries to bind the address space
>>> for all devices within a group, which IMO has some problem. Based on
>> PCIe
>>> spec, packet routing on the bus doesn't take PASID into consideration.
>>> since devices within same group cannot be isolated based on requestor-
>> ID
>>> i.e. traffic not guaranteed going to IOMMU, enabling SVA on multiple
>> devices
>>> could cause undesired p2p.
>> But so does enabling "classic" DMA... If two devices are not protected by
>> ACS for example, they are put in the same IOMMU group, and one device
>> might be able to snoop the other's DMA. VFIO allows userspace to create a
>> container for them and use MAP/UNMAP, but makes it explicit to the user
>> that for DMA, these devices are not isolated and must be considered as a
>> single device (you can't pass them to different VMs or put them in
>> different containers). So I tried to keep the same idea as MAP/UNMAP for
>> SVA, performing BIND/UNBIND operations on the VFIO container instead of
>> the device.
> 
> there is a small difference. for classic DMA we can reserve PCI BARs 
> when allocating IOVA, thus multiple devices in the same group can 
> still work correctly applied with same translation, if isolation is not
> cared in between. However for SVA it's CPU virtual addresses 
> managed by kernel mm thus difficult to introduce similar address 
> reservation. Then it's possible for a VA falling into other device's 
> BAR in the same group and cause undesired p2p traffic. In such 
> regard, SVA is actually functionally-broken.

I think the problem exists even if there is a single device in the group.
If for example, malloc() returns a VA that corresponds to a PCI host
bridge in IOVA space, performing DMA on that buffer won't reach the IOMMU
and will cause undesirable side-effects.

My series doesn't address the problem, but I believe we should carve
reserved regions out of the process address space during bind(), for
example by creating a PROT_NONE vma preventing userspace from obtaining
that VA.

If you solve this problem, you also solve it for multiple devices in a
group, because the IOMMU core provides the resv API on groups... That's
until you hotplug a device into a live group (currently WARN in VFIO),
with different resv regions.

>> I kept the analogy simple though, because I don't think there will be many
>> SVA-capable systems that require IOMMU groups. They will likely
> 
> I agree that multiple SVA-capable devices in same IOMMU group is not
> a typical configuration, especially it's usually observed on new devices.
> Then based on above limitation, I think we could just explicitly avoid
> enabling SVA in such case. :-)

I'd certainly like that :)

Thanks,
Jean
Sinan Kaya Feb. 28, 2018, 8:34 p.m. UTC | #5
On 2/12/2018 1:33 PM, Jean-Philippe Brucker wrote:
> +int iommu_sva_unbind_group(struct iommu_group *group, int pasid)
> +{
> +	struct group_device *device;
> +
> +	mutex_lock(&group->mutex);
> +	list_for_each_entry(device, &group->devices, list)
> +		iommu_sva_unbind_device(device->dev, pasid);
> +	mutex_unlock(&group->mutex);
> +
> +	return 0;
> +}

I think we should handle the errors returned by iommu_sva_unbind_device() here
or at least print a warning if we want to still continue unbinding.
Yi Liu March 1, 2018, 3:03 a.m. UTC | #6
Hi Jean,

> From: Jean-Philippe Brucker [mailto:jean-philippe.brucker@arm.com]
> Sent: Thursday, February 15, 2018 8:41 PM
> Subject: Re: [PATCH 02/37] iommu/sva: Bind process address spaces to devices
> 
> On 13/02/18 23:34, Tian, Kevin wrote:
> >> From: Jean-Philippe Brucker
> >> Sent: Tuesday, February 13, 2018 8:57 PM
> >>
> >> On 13/02/18 07:54, Tian, Kevin wrote:
> >>>> From: Jean-Philippe Brucker
> >>>> Sent: Tuesday, February 13, 2018 2:33 AM
> >>>>
> >>>> Add bind() and unbind() operations to the IOMMU API. Device drivers
> >> can
> >>>> use them to share process page tables with their devices.
> >>>> bind_group() is provided for VFIO's convenience, as it needs to
> >>>> provide a coherent interface on containers. Other device drivers
> >>>> will most likely want to use bind_device(), which binds a single device in the
> group.
> >>>
> >>> I saw your bind_group implementation tries to bind the address space
> >>> for all devices within a group, which IMO has some problem. Based on
> >> PCIe
> >>> spec, packet routing on the bus doesn't take PASID into consideration.
> >>> since devices within same group cannot be isolated based on
> >>> requestor-
> >> ID
> >>> i.e. traffic not guaranteed going to IOMMU, enabling SVA on multiple
> >> devices
> >>> could cause undesired p2p.
> >> But so does enabling "classic" DMA... If two devices are not
> >> protected by ACS for example, they are put in the same IOMMU group,
> >> and one device might be able to snoop the other's DMA. VFIO allows
> >> userspace to create a container for them and use MAP/UNMAP, but makes
> >> it explicit to the user that for DMA, these devices are not isolated
> >> and must be considered as a single device (you can't pass them to
> >> different VMs or put them in different containers). So I tried to
> >> keep the same idea as MAP/UNMAP for SVA, performing BIND/UNBIND
> >> operations on the VFIO container instead of the device.
> >
> > there is a small difference. for classic DMA we can reserve PCI BARs
> > when allocating IOVA, thus multiple devices in the same group can
> > still work correctly applied with same translation, if isolation is
> > not cared in between. However for SVA it's CPU virtual addresses
> > managed by kernel mm thus difficult to introduce similar address
> > reservation. Then it's possible for a VA falling into other device's
> > BAR in the same group and cause undesired p2p traffic. In such regard,
> > SVA is actually functionally-broken.
> 
> I think the problem exists even if there is a single device in the group.
> If for example, malloc() returns a VA that corresponds to a PCI host bridge in IOVA
> space, performing DMA on that buffer won't reach the IOMMU and will cause
> undesirable side-effects.

If there is only a single device in a group, does that mean there is ACS
support in the path from this device to the root complex? That would mean
any memory request from this device is upstreamed to the root complex, so
undesired p2p traffic should be avoided. So I'm inclined to believe that,
even if we bind at the group level, we actually expect it to work only for
the case where there is a single device within the group.

Thanks,
Yi Liu
Jean-Philippe Brucker March 2, 2018, 12:32 p.m. UTC | #7
On 28/02/18 20:34, Sinan Kaya wrote:
> On 2/12/2018 1:33 PM, Jean-Philippe Brucker wrote:
>> +int iommu_sva_unbind_group(struct iommu_group *group, int pasid)
>> +{
>> +	struct group_device *device;
>> +
>> +	mutex_lock(&group->mutex);
>> +	list_for_each_entry(device, &group->devices, list)
>> +		iommu_sva_unbind_device(device->dev, pasid);
>> +	mutex_unlock(&group->mutex);
>> +
>> +	return 0;
>> +}
> 
> I think we should handle the errors returned by iommu_sva_unbind_device() here
> or at least print a warning if we want to still continue unbinding. 

Agreed, though bind_group/unbind_group are probably going away in next
series

Thanks,
Jean
Jean-Philippe Brucker March 2, 2018, 4:03 p.m. UTC | #8
On 01/03/18 03:03, Liu, Yi L wrote:
> Hi Jean,
> 
>> From: Jean-Philippe Brucker [mailto:jean-philippe.brucker@arm.com]
>> Sent: Thursday, February 15, 2018 8:41 PM
>> Subject: Re: [PATCH 02/37] iommu/sva: Bind process address spaces to devices
>>
>> On 13/02/18 23:34, Tian, Kevin wrote:
>>>> From: Jean-Philippe Brucker
>>>> Sent: Tuesday, February 13, 2018 8:57 PM
>>>>
>>>> On 13/02/18 07:54, Tian, Kevin wrote:
>>>>>> From: Jean-Philippe Brucker
>>>>>> Sent: Tuesday, February 13, 2018 2:33 AM
>>>>>>
>>>>>> Add bind() and unbind() operations to the IOMMU API. Device drivers
>>>> can
>>>>>> use them to share process page tables with their devices.
>>>>>> bind_group() is provided for VFIO's convenience, as it needs to
>>>>>> provide a coherent interface on containers. Other device drivers
>>>>>> will most likely want to use bind_device(), which binds a single device in the
>> group.
>>>>>
>>>>> I saw your bind_group implementation tries to bind the address space
>>>>> for all devices within a group, which IMO has some problem. Based on
>>>> PCIe
>>>>> spec, packet routing on the bus doesn't take PASID into consideration.
>>>>> since devices within same group cannot be isolated based on
>>>>> requestor-
>>>> ID
>>>>> i.e. traffic not guaranteed going to IOMMU, enabling SVA on multiple
>>>> devices
>>>>> could cause undesired p2p.
>>>> But so does enabling "classic" DMA... If two devices are not
>>>> protected by ACS for example, they are put in the same IOMMU group,
>>>> and one device might be able to snoop the other's DMA. VFIO allows
>>>> userspace to create a container for them and use MAP/UNMAP, but makes
>>>> it explicit to the user that for DMA, these devices are not isolated
>>>> and must be considered as a single device (you can't pass them to
>>>> different VMs or put them in different containers). So I tried to
>>>> keep the same idea as MAP/UNMAP for SVA, performing BIND/UNBIND
>>>> operations on the VFIO container instead of the device.
>>>
>>> there is a small difference. for classic DMA we can reserve PCI BARs
>>> when allocating IOVA, thus multiple devices in the same group can
>>> still work correctly applied with same translation, if isolation is
>>> not cared in between. However for SVA it's CPU virtual addresses
>>> managed by kernel mm thus difficult to introduce similar address
>>> reservation. Then it's possible for a VA falling into other device's
>>> BAR in the same group and cause undesired p2p traffic. In such regard,
>>> SVA is actually functionally-broken.
>>
>> I think the problem exists even if there is a single device in the group.
>> If for example, malloc() returns a VA that corresponds to a PCI host bridge in IOVA
>> space, performing DMA on that buffer won't reach the IOMMU and will cause
>> undesirable side-effects.
> 
> If only a single device in a group, should it mean there is ACS support in
> the path from this device to root complex? It means any memory request
> from this device would be upstreamed to root complex, thus it should be
> able to avoid undesired p2p traffics. So I intend to believe, even we do
> bind in group level, we actually expect to make it work only for the case
> where a single device within a group.

Yes if each device has its own group then ACS is properly enabled.

Even without thinking about ACS or p2p, all memory requests don't
necessarily make it to the IOMMU. For example transactions targeting the
PCI host bridge MMIO window (marked as RESV_RESERVED by dma-iommu.c),
may get eaten by the RC and not reach the IOMMU (I'm blindly following
the code here, don't have anything in the spec to back me up). Commit
fade1ec055dc also refers to "faults, corruption and other badness"
though I don't know if that's only for PCI or could also affect future
systems.

And I don't think prefixing transactions with a PASID changes the
situation. I couldn't find anything in the PCIe spec contradicting it
and I guess it's up to the root complex implementation. So I tend to
take a conservative approach and assume that RESV_RESERVED regions will
also apply to PASID-prefixed traffic.

Thanks,
Jean
diff mbox

Patch

diff --git a/drivers/iommu/iommu-sva.c b/drivers/iommu/iommu-sva.c
index cab5d723520f..593685d891bf 100644
--- a/drivers/iommu/iommu-sva.c
+++ b/drivers/iommu/iommu-sva.c
@@ -9,6 +9,9 @@ 
 
 #include <linux/iommu.h>
 
+/* TODO: stub for the fault queue. Remove later. */
+#define iommu_fault_queue_flush(...)
+
 /**
  * iommu_sva_device_init() - Initialize Shared Virtual Addressing for a device
  * @dev: the device
@@ -78,6 +81,8 @@  int iommu_sva_device_shutdown(struct device *dev)
 	if (!domain)
 		return -ENODEV;
 
+	__iommu_sva_unbind_dev_all(dev);
+
 	if (domain->ops->sva_device_shutdown)
 		domain->ops->sva_device_shutdown(dev);
 
@@ -88,3 +93,103 @@  int iommu_sva_device_shutdown(struct device *dev)
 	return 0;
 }
 EXPORT_SYMBOL_GPL(iommu_sva_device_shutdown);
+
+/**
+ * iommu_sva_bind_device() - Bind a process address space to a device
+ * @dev: the device
+ * @mm: the mm to bind, caller must hold a reference to it
+ * @pasid: valid address where the PASID will be stored
+ * @flags: bond properties (IOMMU_SVA_FEAT_*)
+ * @drvdata: private data passed to the mm exit handler
+ *
+ * Create a bond between device and task, allowing the device to access the mm
+ * using the returned PASID. A subsequent bind() for the same device and mm will
+ * reuse the bond (and return the same PASID), but users will have to call
+ * unbind() twice.
+ *
+ * Callers should have taken care of setting up SVA for this device with
+ * iommu_sva_device_init() beforehand. They may also be notified of the bond
+ * disappearing, for example when the last task that uses the mm dies, by
+ * registering a notifier with iommu_register_mm_exit_handler().
+ *
+ * If IOMMU_SVA_FEAT_PASID is requested, a PASID is allocated and returned.
+ * TODO: The alternative, binding the non-PASID context to an mm, isn't
+ * supported at the moment because existing IOMMU domain types initialize the
+ * non-PASID context for iommu_map()/unmap() or bypass. This requires a new
+ * domain type.
+ *
+ * If IOMMU_SVA_FEAT_IOPF is not requested, the caller must pin down all
+ * mappings shared with the device. mlock() isn't sufficient, as it doesn't
+ * prevent minor page faults (e.g. copy-on-write). TODO: !IOPF isn't allowed at
+ * the moment.
+ *
+ * On success, 0 is returned and @pasid contains a valid ID. Otherwise, an error
+ * is returned.
+ */
+int iommu_sva_bind_device(struct device *dev, struct mm_struct *mm, int *pasid,
+			  unsigned long flags, void *drvdata)
+{
+	struct iommu_domain *domain;
+	struct iommu_param *dev_param = dev->iommu_param;
+
+	domain = iommu_get_domain_for_dev(dev);
+	if (!domain)
+		return -EINVAL;
+
+	if (!pasid)
+		return -EINVAL;
+
+	if (!dev_param || (flags & ~dev_param->sva_features))
+		return -EINVAL;
+
+	if (flags != (IOMMU_SVA_FEAT_PASID | IOMMU_SVA_FEAT_IOPF))
+		return -EINVAL;
+
+	return -ENOSYS; /* TODO */
+}
+EXPORT_SYMBOL_GPL(iommu_sva_bind_device);
+
+/**
+ * iommu_sva_unbind_device() - Remove a bond created with iommu_sva_bind_device
+ * @dev: the device
+ * @pasid: the pasid returned by bind()
+ *
+ * Remove bond between device and address space identified by @pasid. Users
+ * should not call unbind() if the corresponding mm exited (as the PASID might
+ * have been reallocated to another process.)
+ *
+ * The device must not be issuing any more transaction for this PASID. All
+ * outstanding page requests for this PASID must have been flushed to the IOMMU.
+ *
+ * Returns 0 on success, or an error value
+ */
+int iommu_sva_unbind_device(struct device *dev, int pasid)
+{
+	struct iommu_domain *domain;
+
+	domain = iommu_get_domain_for_dev(dev);
+	if (WARN_ON(!domain))
+		return -EINVAL;
+
+	/*
+	 * Caller stopped the device from issuing PASIDs, now make sure they are
+	 * out of the fault queue.
+	 */
+	iommu_fault_queue_flush(dev);
+
+	return -ENOSYS; /* TODO */
+}
+EXPORT_SYMBOL_GPL(iommu_sva_unbind_device);
+
+/**
+ * __iommu_sva_unbind_dev_all() - Detach all address spaces from this device
+ *
+ * When detaching @device from a domain, IOMMU drivers should use this helper.
+ */
+void __iommu_sva_unbind_dev_all(struct device *dev)
+{
+	iommu_fault_queue_flush(dev);
+
+	/* TODO */
+}
+EXPORT_SYMBOL_GPL(__iommu_sva_unbind_dev_all);
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index d4a4edaf2d8c..f977851c522b 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1535,6 +1535,69 @@  void iommu_detach_group(struct iommu_domain *domain, struct iommu_group *group)
 }
 EXPORT_SYMBOL_GPL(iommu_detach_group);
 
+/*
+ * iommu_sva_bind_group() - Share address space with all devices in the group.
+ * @group: the iommu group
+ * @mm: the mm to bind
+ * @pasid: valid address where the PASID will be stored
+ * @flags: bond properties (IOMMU_PROCESS_BIND_*)
+ * @drvdata: private data passed to the mm exit handler
+ *
+ * Create a bond between group and process, allowing devices in the group to
+ * access the process address space using @pasid.
+ *
+ * Refer to iommu_sva_bind_device() for more details.
+ *
+ * On success, 0 is returned and @pasid contains a valid ID. Otherwise, an error
+ * is returned.
+ */
+int iommu_sva_bind_group(struct iommu_group *group, struct mm_struct *mm,
+			 int *pasid, unsigned long flags, void *drvdata)
+{
+	struct group_device *device;
+	int ret = -ENODEV;
+
+	if (!group->domain)
+		return -EINVAL;
+
+	mutex_lock(&group->mutex);
+	list_for_each_entry(device, &group->devices, list) {
+		ret = iommu_sva_bind_device(device->dev, mm, pasid, flags,
+					    drvdata);
+		if (ret)
+			break;
+	}
+
+	if (ret) {
+		list_for_each_entry_continue_reverse(device, &group->devices, list)
+			iommu_sva_unbind_device(device->dev, *pasid);
+	}
+	mutex_unlock(&group->mutex);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_sva_bind_group);
+
+/**
+ * iommu_sva_unbind_group() - Remove a bond created with iommu_sva_bind_group()
+ * @group: the group
+ * @pasid: the pasid returned by bind
+ *
+ * Refer to iommu_sva_unbind_device() for more details.
+ */
+int iommu_sva_unbind_group(struct iommu_group *group, int pasid)
+{
+	struct group_device *device;
+
+	mutex_lock(&group->mutex);
+	list_for_each_entry(device, &group->devices, list)
+		iommu_sva_unbind_device(device->dev, pasid);
+	mutex_unlock(&group->mutex);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(iommu_sva_unbind_group);
+
 phys_addr_t iommu_iova_to_phys(struct iommu_domain *domain, dma_addr_t iova)
 {
 	if (unlikely(domain->ops->iova_to_phys == NULL))
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index e9e09eecdece..1fb10d64b9e5 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -576,6 +576,10 @@  int iommu_fwspec_init(struct device *dev, struct fwnode_handle *iommu_fwnode,
 void iommu_fwspec_free(struct device *dev);
 int iommu_fwspec_add_ids(struct device *dev, u32 *ids, int num_ids);
 const struct iommu_ops *iommu_ops_from_fwnode(struct fwnode_handle *fwnode);
+extern int iommu_sva_bind_group(struct iommu_group *group,
+				struct mm_struct *mm, int *pasid,
+				unsigned long flags, void *drvdata);
+extern int iommu_sva_unbind_group(struct iommu_group *group, int pasid);
 
 #else /* CONFIG_IOMMU_API */
 
@@ -890,12 +894,28 @@  const struct iommu_ops *iommu_ops_from_fwnode(struct fwnode_handle *fwnode)
 	return NULL;
 }
 
+static inline int iommu_sva_bind_group(struct iommu_group *group,
+				       struct mm_struct *mm, int *pasid,
+				       unsigned long flags, void *drvdata)
+{
+	return -ENODEV;
+}
+
+static inline int iommu_sva_unbind_group(struct iommu_group *group, int pasid)
+{
+	return -ENODEV;
+}
+
 #endif /* CONFIG_IOMMU_API */
 
 #ifdef CONFIG_IOMMU_SVA
 extern int iommu_sva_device_init(struct device *dev, unsigned long features,
 				 unsigned int max_pasid);
 extern int iommu_sva_device_shutdown(struct device *dev);
+extern int iommu_sva_bind_device(struct device *dev, struct mm_struct *mm,
+				int *pasid, unsigned long flags, void *drvdata);
+extern int iommu_sva_unbind_device(struct device *dev, int pasid);
+extern void __iommu_sva_unbind_dev_all(struct device *dev);
 #else /* CONFIG_IOMMU_SVA */
 static inline int iommu_sva_device_init(struct device *dev,
 					unsigned long features,
@@ -908,6 +928,22 @@  static inline int iommu_sva_device_shutdown(struct device *dev)
 {
 	return -ENODEV;
 }
+
+static inline int iommu_sva_bind_device(struct device *dev,
+					struct mm_struct *mm, int *pasid,
+					unsigned long flags, void *drvdata)
+{
+	return -ENODEV;
+}
+
+static inline int iommu_sva_unbind_device(struct device *dev, int pasid)
+{
+	return -ENODEV;
+}
+
+static inline void __iommu_sva_unbind_dev_all(struct device *dev)
+{
+}
 #endif /* CONFIG_IOMMU_SVA */
 
 #endif /* __LINUX_IOMMU_H */