
[v3,03/10] iommu/sva: Manage process address spaces

Message ID 20180920170046.20154-4-jean-philippe.brucker@arm.com (mailing list archive)
State New, archived
Delegated to: Bjorn Helgaas
Series: Shared Virtual Addressing for the IOMMU

Commit Message

Jean-Philippe Brucker Sept. 20, 2018, 5 p.m. UTC
Allocate IOMMU mm structures and bind them to devices. Four operations are
added to IOMMU drivers:

* mm_alloc(): to create an io_mm structure and perform architecture-
  specific operations required to grab the process (for instance on ARM,
  pin down the CPU ASID so that the process doesn't get assigned a new
  ASID on rollover).

  There is a single valid io_mm structure per Linux mm. Future extensions
  may also use io_mm for kernel-managed address spaces, populated with
  map()/unmap() calls instead of being bound to process address spaces. This
  patch focuses on "shared" io_mm.

* mm_attach(): attach an mm to a device. The IOMMU driver checks that the
  device is capable of sharing an address space, and writes the PASID
  table entry to install the pgd.

  Some IOMMU drivers will have a single PASID table per domain, for
  convenience. Others can implement it differently, but to help these
  drivers, mm_attach() and mm_detach() take 'attach_domain' and
  'detach_domain' parameters that tell whether they need to set and clear
  the PASID entry, or only send the required TLB invalidations.

* mm_detach(): detach an mm from a device. The IOMMU driver removes the
  PASID table entry and invalidates the IOTLBs.

* mm_free(): free a structure allocated by mm_alloc(), and let the arch
  release the process.
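
To make the driver side concrete, a minimal hypothetical skeleton wiring up
the four operations could look like this (illustrative names only, not part
of this patch):

struct my_io_mm {
	struct io_mm	io_mm;
	/* arch-specific state, e.g. the pinned CPU ASID */
};

static struct io_mm *my_mm_alloc(struct iommu_domain *domain,
				 struct mm_struct *mm, unsigned long flags)
{
	struct my_io_mm *my_mm = kzalloc(sizeof(*my_mm), GFP_KERNEL);

	if (!my_mm)
		return NULL;	/* iommu-sva turns this into -ENOMEM */
	/* grab arch resources here, e.g. pin the CPU ASID */
	return &my_mm->io_mm;
}

static void my_mm_free(struct io_mm *io_mm)
{
	/* release arch resources; must not sleep */
	kfree(container_of(io_mm, struct my_io_mm, io_mm));
}

static int my_mm_attach(struct iommu_domain *domain, struct device *dev,
			struct io_mm *io_mm, bool attach_domain)
{
	/*
	 * Check that the device can share an address space, then write the
	 * PASID table entry installing io_mm->mm->pgd when attach_domain
	 * is true.
	 */
	return 0;
}

static void my_mm_detach(struct iommu_domain *domain, struct device *dev,
			 struct io_mm *io_mm, bool detach_domain)
{
	/* Clear the PASID entry if detach_domain, then invalidate IOTLBs */
}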

mm_attach() and mm_detach() operations are serialized with a spinlock. When
trying to optimize this code, we should at least prevent concurrent
attach()/detach() on the same domain (so multi-level PASID table code can
allocate tables lazily). mm_alloc() can sleep, but mm_free() must not
(because we'll have to call it from call_srcu() later on).
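
A minimal sketch of that plan, assuming io_mm later grows a struct rcu_head
member (hypothetical here, not part of this patch):

static struct srcu_struct iommu_sva_srcu;

static void io_mm_free_rcu(struct rcu_head *rcu)
{
	/* Runs from the SRCU callback: io_mm->release() must not sleep */
	io_mm_free(container_of(rcu, struct io_mm, rcu));
}

static void io_mm_release(struct kref *kref)
{
	struct io_mm *io_mm = container_of(kref, struct io_mm, kref);

	idr_remove(&iommu_pasid_idr, io_mm->pasid);
	/* Readers inside srcu_read_lock() may still hold the io_mm */
	call_srcu(&iommu_sva_srcu, &io_mm->rcu, io_mm_free_rcu);
}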

At the moment we use an IDR for allocating PASIDs and retrieving contexts.
We also use a single spinlock. These can be refined and optimized later (a
custom allocator will be needed for top-down PASID allocation).

Keeping track of address spaces requires the use of MMU notifiers.
Handling process exit with regard to unbind() is tricky, so it is left for
another patch and we explicitly fail mm_alloc() for the moment.
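
The eventual hookup might look roughly like this (sketch only; the notifier
member of io_mm and these callbacks are not part of this patch):

static void io_mm_notifier_release(struct mmu_notifier *mn,
				   struct mm_struct *mm)
{
	/* The process is exiting: detach every bond of this io_mm */
}

static void io_mm_notifier_invalidate_range(struct mmu_notifier *mn,
					    struct mm_struct *mm,
					    unsigned long start,
					    unsigned long end)
{
	/* Mirror CPU TLB invalidations to the IOTLB for this PASID */
}

static const struct mmu_notifier_ops iommu_mmu_notifier_ops = {
	.release		= io_mm_notifier_release,
	.invalidate_range	= io_mm_notifier_invalidate_range,
};

/* in io_mm_alloc(), after mmgrab(mm): */
	io_mm->notifier.ops = &iommu_mmu_notifier_ops;
	ret = mmu_notifier_register(&io_mm->notifier, mm);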

Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
---
v2->v3: use sva_lock, comment updates
---
 drivers/iommu/iommu-sva.c | 397 +++++++++++++++++++++++++++++++++++++-
 drivers/iommu/iommu.c     |   1 +
 include/linux/iommu.h     |  29 +++
 3 files changed, 424 insertions(+), 3 deletions(-)

Comments

Baolu Lu Sept. 25, 2018, 3:15 a.m. UTC | #1
Hi,

On 09/21/2018 01:00 AM, Jean-Philippe Brucker wrote:
> Allocate IOMMU mm structures and bind them to devices. Four operations are
> added to IOMMU drivers:
> 
> * mm_alloc(): to create an io_mm structure and perform architecture-
>    specific operations required to grab the process (for instance on ARM,
>    pin down the CPU ASID so that the process doesn't get assigned a new
>    ASID on rollover).
> 
>    There is a single valid io_mm structure per Linux mm. Future extensions
>    may also use io_mm for kernel-managed address spaces, populated with
>    map()/unmap() calls instead of bound to process address spaces. This
>    patch focuses on "shared" io_mm.
> 
> * mm_attach(): attach an mm to a device. The IOMMU driver checks that the
>    device is capable of sharing an address space, and writes the PASID
>    table entry to install the pgd.
> 
>    Some IOMMU drivers will have a single PASID table per domain, for
>    convenience. Other can implement it differently but to help these
>    drivers, mm_attach and mm_detach take 'attach_domain' and
>    'detach_domain' parameters, that tell whether they need to set and clear
>    the PASID entry or only send the required TLB invalidations.
> 
> * mm_detach(): detach an mm from a device. The IOMMU driver removes the
>    PASID table entry and invalidates the IOTLBs.
> 
> * mm_free(): free a structure allocated by mm_alloc(), and let arch
>    release the process.
> 
> mm_attach and mm_detach operations are serialized with a spinlock. When
> trying to optimize this code, we should at least prevent concurrent
> attach()/detach() on the same domain (so multi-level PASID table code can
> allocate tables lazily). mm_alloc() can sleep, but mm_free must not
> (because we'll have to call it from call_srcu later on).
> 
> At the moment we use an IDR for allocating PASIDs and retrieving contexts.
> We also use a single spinlock. These can be refined and optimized later (a
> custom allocator will be needed for top-down PASID allocation).
> 
> Keeping track of address spaces requires the use of MMU notifiers.
> Handling process exit with regard to unbind() is tricky, so it is left for
> another patch and we explicitly fail mm_alloc() for the moment.
> 
> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> ---
> v2->v3: use sva_lock, comment updates
> ---
>   drivers/iommu/iommu-sva.c | 397 +++++++++++++++++++++++++++++++++++++-
>   drivers/iommu/iommu.c     |   1 +
>   include/linux/iommu.h     |  29 +++
>   3 files changed, 424 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/iommu/iommu-sva.c b/drivers/iommu/iommu-sva.c
> index d60d4f0bb89e..a486bc947335 100644
> --- a/drivers/iommu/iommu-sva.c
> +++ b/drivers/iommu/iommu-sva.c
> @@ -5,25 +5,415 @@
>    * Copyright (C) 2018 ARM Ltd.
>    */
>   
> +#include <linux/idr.h>
>   #include <linux/iommu.h>
> +#include <linux/sched/mm.h>
>   #include <linux/slab.h>
> +#include <linux/spinlock.h>
> +
> +/**
> + * DOC: io_mm model
> + *
> + * The io_mm keeps track of process address spaces shared between CPU and IOMMU.
> + * The following example illustrates the relation between structures
> + * iommu_domain, io_mm and iommu_bond. An iommu_bond is a link between io_mm and
> + * device. A device can have multiple io_mm and an io_mm may be bound to
> + * multiple devices.
> + *              ___________________________
> + *             |  IOMMU domain A           |
> + *             |  ________________         |
> + *             | |  IOMMU group   |        +------- io_pgtables
> + *             | |                |        |
> + *             | |   dev 00:00.0 ----+------- bond --- io_mm X
> + *             | |________________|   \    |
> + *             |                       '----- bond ---.
> + *             |___________________________|           \
> + *              ___________________________             \
> + *             |  IOMMU domain B           |           io_mm Y
> + *             |  ________________         |           / /
> + *             | |  IOMMU group   |        |          / /
> + *             | |                |        |         / /
> + *             | |   dev 00:01.0 ------------ bond -' /
> + *             | |   dev 00:01.1 ------------ bond --'
> + *             | |________________|        |
> + *             |                           +------- io_pgtables
> + *             |___________________________|
> + *
> + * In this example, device 00:00.0 is in domain A, devices 00:01.* are in domain
> + * B. All devices within the same domain access the same address spaces. Device
> + * 00:00.0 accesses address spaces X and Y, each corresponding to an mm_struct.
> + * Devices 00:01.* only access address space Y. In addition each
> + * IOMMU_DOMAIN_DMA domain has a private address space, io_pgtable, that is
> + * managed with iommu_map()/iommu_unmap(), and isn't shared with the CPU MMU.
> + *
> + * To obtain the above configuration, users would for instance issue the
> + * following calls:
> + *
> + *     iommu_sva_bind_device(dev 00:00.0, mm X, ...) -> PASID 1
> + *     iommu_sva_bind_device(dev 00:00.0, mm Y, ...) -> PASID 2
> + *     iommu_sva_bind_device(dev 00:01.0, mm Y, ...) -> PASID 2
> + *     iommu_sva_bind_device(dev 00:01.1, mm Y, ...) -> PASID 2
> + *
> + * A single Process Address Space ID (PASID) is allocated for each mm. In the
> + * example, devices use PASID 1 to read/write into address space X and PASID 2
> + * to read/write into address space Y.
> + *
> + * Hardware tables describing this configuration in the IOMMU would typically
> + * look like this:
> + *
> + *                                PASID tables
> + *                                 of domain A
> + *                              .->+--------+
> + *                             / 0 |        |-------> io_pgtable
> + *                            /    +--------+
> + *            Device tables  /   1 |        |-------> pgd X
> + *              +--------+  /      +--------+
> + *      00:00.0 |      A |-'     2 |        |--.
> + *              +--------+         +--------+   \
> + *              :        :       3 |        |    \
> + *              +--------+         +--------+     --> pgd Y
> + *      00:01.0 |      B |--.                    /
> + *              +--------+   \                  |
> + *      00:01.1 |      B |----+   PASID tables  |
> + *              +--------+     \   of domain B  |
> + *                              '->+--------+   |
> + *                               0 |        |-- | --> io_pgtable
> + *                                 +--------+   |
> + *                               1 |        |   |
> + *                                 +--------+   |
> + *                               2 |        |---'
> + *                                 +--------+
> + *                               3 |        |
> + *                                 +--------+
> + *
> + * With this model, a single call binds all devices in a given domain to an
> + * address space. Other devices in the domain will get the same bond implicitly.
> + * However, users must issue one bind() for each device, because IOMMUs may
> + * implement SVA differently. Furthermore, mandating one bind() per device
> + * allows the driver to perform sanity-checks on device capabilities.
> + *
> + * In some IOMMUs, one entry (typically the first one) of the PASID table can be
> + * used to hold non-PASID translations. In this case PASID #0 is reserved and
> + * the first entry points to the io_pgtable pointer. In other IOMMUs the
> + * io_pgtable pointer is held in the device table and PASID #0 is available to
> + * the allocator.
> + */
> +
> +struct iommu_bond {
> +	struct io_mm		*io_mm;
> +	struct device		*dev;
> +	struct iommu_domain	*domain;
> +
> +	struct list_head	mm_head;
> +	struct list_head	dev_head;
> +	struct list_head	domain_head;
> +
> +	void			*drvdata;
> +};
> +
> +/*
> + * Because we're using an IDR, PASIDs are limited to 31 bits (the sign bit is
> + * used for returning errors). In practice implementations will use at most 20
> + * bits, which is the PCI limit.
> + */
> +static DEFINE_IDR(iommu_pasid_idr);
> +
> +/*
> + * For the moment this is an all-purpose lock. It serializes
> + * access/modifications to bonds, access/modifications to the PASID IDR, and
> + * changes to io_mm refcount as well.
> + */
> +static DEFINE_SPINLOCK(iommu_sva_lock);
> +
> +static struct io_mm *
> +io_mm_alloc(struct iommu_domain *domain, struct device *dev,
> +	    struct mm_struct *mm, unsigned long flags)
> +{
> +	int ret;
> +	int pasid;
> +	struct io_mm *io_mm;
> +	struct iommu_sva_param *param = dev->iommu_param->sva_param;
> +
> +	if (!domain->ops->mm_alloc || !domain->ops->mm_free)
> +		return ERR_PTR(-ENODEV);
> +
> +	io_mm = domain->ops->mm_alloc(domain, mm, flags);
> +	if (IS_ERR(io_mm))
> +		return io_mm;
> +	if (!io_mm)
> +		return ERR_PTR(-ENOMEM);
> +
> +	/*
> +	 * The mm must not be freed until after the driver frees the io_mm
> +	 * (which may involve unpinning the CPU ASID for instance, requiring a
> +	 * valid mm struct.)
> +	 */
> +	mmgrab(mm);
> +
> +	io_mm->flags		= flags;
> +	io_mm->mm		= mm;
> +	io_mm->release		= domain->ops->mm_free;
> +	INIT_LIST_HEAD(&io_mm->devices);
> +	/* Leave kref as zero until the io_mm is fully initialized */
> +
> +	idr_preload(GFP_KERNEL);
> +	spin_lock(&iommu_sva_lock);
> +	pasid = idr_alloc(&iommu_pasid_idr, io_mm, param->min_pasid,
> +			  param->max_pasid + 1, GFP_ATOMIC);
> +	io_mm->pasid = pasid;
> +	spin_unlock(&iommu_sva_lock);
> +	idr_preload_end();
> +
> +	if (pasid < 0) {
> +		ret = pasid;
> +		goto err_free_mm;
> +	}
> +
> +	/* TODO: keep track of mm. For the moment, abort. */
> +	ret = -ENOSYS;
> +	spin_lock(&iommu_sva_lock);
> +	idr_remove(&iommu_pasid_idr, io_mm->pasid);
> +	spin_unlock(&iommu_sva_lock);
> +
> +err_free_mm:
> +	io_mm->release(io_mm);
> +	mmdrop(mm);
> +
> +	return ERR_PTR(ret);
> +}
> +
> +static void io_mm_free(struct io_mm *io_mm)
> +{
> +	struct mm_struct *mm = io_mm->mm;
> +
> +	io_mm->release(io_mm);
> +	mmdrop(mm);
> +}
> +
> +static void io_mm_release(struct kref *kref)
> +{
> +	struct io_mm *io_mm;
> +
> +	io_mm = container_of(kref, struct io_mm, kref);
> +	WARN_ON(!list_empty(&io_mm->devices));
> +
> +	/* The PASID can now be reallocated for another mm... */
> +	idr_remove(&iommu_pasid_idr, io_mm->pasid);
> +	/* ... but this mm is freed after a grace period (TODO) */
> +	io_mm_free(io_mm);
> +}
> +
> +/*
> + * Returns non-zero if a reference to the io_mm was successfully taken.
> + * Returns zero if the io_mm is being freed and should not be used.
> + */
> +static int io_mm_get_locked(struct io_mm *io_mm)
> +{
> +	if (io_mm)
> +		return kref_get_unless_zero(&io_mm->kref);
> +
> +	return 0;
> +}
> +
> +static void io_mm_put_locked(struct io_mm *io_mm)
> +{
> +	kref_put(&io_mm->kref, io_mm_release);
> +}
> +
> +static void io_mm_put(struct io_mm *io_mm)
> +{
> +	spin_lock(&iommu_sva_lock);
> +	io_mm_put_locked(io_mm);
> +	spin_unlock(&iommu_sva_lock);
> +}
> +
> +static int io_mm_attach(struct iommu_domain *domain, struct device *dev,
> +			struct io_mm *io_mm, void *drvdata)
> +{
> +	int ret;
> +	bool attach_domain = true;
> +	int pasid = io_mm->pasid;
> +	struct iommu_bond *bond, *tmp;
> +	struct iommu_sva_param *param = dev->iommu_param->sva_param;
> +
> +	if (!domain->ops->mm_attach || !domain->ops->mm_detach)
> +		return -ENODEV;
> +
> +	if (pasid > param->max_pasid || pasid < param->min_pasid)
> +		return -ERANGE;
> +
> +	bond = kzalloc(sizeof(*bond), GFP_KERNEL);
> +	if (!bond)
> +		return -ENOMEM;
> +
> +	bond->domain		= domain;
> +	bond->io_mm		= io_mm;
> +	bond->dev		= dev;
> +	bond->drvdata		= drvdata;
> +
> +	spin_lock(&iommu_sva_lock);
> +	/*
> +	 * Check if this io_mm is already bound to the domain. In which case the
> +	 * IOMMU driver doesn't have to install the PASID table entry.
> +	 */
> +	list_for_each_entry(tmp, &domain->mm_list, domain_head) {
> +		if (tmp->io_mm == io_mm) {
> +			attach_domain = false;
> +			break;
> +		}
> +	}
> +
> +	ret = domain->ops->mm_attach(domain, dev, io_mm, attach_domain);
> +	if (ret) {
> +		kfree(bond);
> +		goto out_unlock;
> +	}
> +
> +	list_add(&bond->mm_head, &io_mm->devices);
> +	list_add(&bond->domain_head, &domain->mm_list);
> +	list_add(&bond->dev_head, &param->mm_list);
> +
> +out_unlock:
> +	spin_unlock(&iommu_sva_lock);
> +	return ret;
> +}
> +
> +static void io_mm_detach_locked(struct iommu_bond *bond)
> +{
> +	struct iommu_bond *tmp;
> +	bool detach_domain = true;
> +	struct iommu_domain *domain = bond->domain;
> +
> +	list_for_each_entry(tmp, &domain->mm_list, domain_head) {
> +		if (tmp->io_mm == bond->io_mm && tmp->dev != bond->dev) {
> +			detach_domain = false;
> +			break;
> +		}
> +	}
> +
> +	list_del(&bond->mm_head);
> +	list_del(&bond->domain_head);
> +	list_del(&bond->dev_head);
> +
> +	domain->ops->mm_detach(domain, bond->dev, bond->io_mm, detach_domain);
> +
> +	io_mm_put_locked(bond->io_mm);
> +	kfree(bond);
> +}
>   
>   int __iommu_sva_bind_device(struct device *dev, struct mm_struct *mm, int *pasid,
>   			    unsigned long flags, void *drvdata)
>   {
> -	return -ENOSYS; /* TODO */
> +	int i;
> +	int ret = 0;
> +	struct iommu_bond *bond;
> +	struct io_mm *io_mm = NULL;
> +	struct iommu_domain *domain;
> +	struct iommu_sva_param *param;
> +
> +	domain = iommu_get_domain_for_dev(dev);
> +	if (!domain)
> +		return -EINVAL;
> +
> +	mutex_lock(&dev->iommu_param->sva_lock);
> +	param = dev->iommu_param->sva_param;
> +	if (!param || (flags & ~param->features)) {
> +		ret = -EINVAL;
> +		goto out_unlock;
> +	}
> +
> +	/* If an io_mm already exists, use it */
> +	spin_lock(&iommu_sva_lock);
> +	idr_for_each_entry(&iommu_pasid_idr, io_mm, i) {

This might be problematic for vt-d (and other possible architectures
which use PASIDs for purposes other than SVA). When the vt-d iommu works
in scalable mode, a PASID might be allocated for:

(1) SVA
(2) Device Assignable Interface (might be an mdev or directly managed
     within a device driver).
(3) SVA in a VM guest
(4) Device Assignable Interface in a VM guest

So we can't expect that an io_mm pointer is associated with each PASID.
And this code might run into problems if the PASID is allocated for
usages other than SVA.

Best regards,
Lu Baolu

> +		if (io_mm->mm == mm && io_mm_get_locked(io_mm)) {
> +			/* ... Unless it's already bound to this device */
> +			list_for_each_entry(bond, &io_mm->devices, mm_head) {
> +				if (bond->dev == dev) {
> +					ret = -EEXIST;
> +					io_mm_put_locked(io_mm);
> +					break;
> +				}
> +			}
> +			break;
> +		}
> +	}
> +	spin_unlock(&iommu_sva_lock);
> +	if (ret)
> +		goto out_unlock;
> +
> +	/* Require identical features within an io_mm for now */
> +	if (io_mm && (flags != io_mm->flags)) {
> +		io_mm_put(io_mm);
> +		ret = -EDOM;
> +		goto out_unlock;
> +	}
> +
> +	if (!io_mm) {
> +		io_mm = io_mm_alloc(domain, dev, mm, flags);
> +		if (IS_ERR(io_mm)) {
> +			ret = PTR_ERR(io_mm);
> +			goto out_unlock;
> +		}
> +	}
> +
> +	ret = io_mm_attach(domain, dev, io_mm, drvdata);
> +	if (ret)
> +		io_mm_put(io_mm);
> +	else
> +		*pasid = io_mm->pasid;
> +
> +out_unlock:
> +	mutex_unlock(&dev->iommu_param->sva_lock);
> +	return ret;
>   }
>   EXPORT_SYMBOL_GPL(__iommu_sva_bind_device);
>   
>   int __iommu_sva_unbind_device(struct device *dev, int pasid)
>   {
> -	return -ENOSYS; /* TODO */
> +	int ret = -ESRCH;
> +	struct iommu_domain *domain;
> +	struct iommu_bond *bond = NULL;
> +	struct iommu_sva_param *param;
> +
> +	domain = iommu_get_domain_for_dev(dev);
> +	if (!domain)
> +		return -EINVAL;
> +
> +	mutex_lock(&dev->iommu_param->sva_lock);
> +	param = dev->iommu_param->sva_param;
> +	if (!param) {
> +		ret = -EINVAL;
> +		goto out_unlock;
> +	}
> +
> +	spin_lock(&iommu_sva_lock);
> +	list_for_each_entry(bond, &param->mm_list, dev_head) {
> +		if (bond->io_mm->pasid == pasid) {
> +			io_mm_detach_locked(bond);
> +			ret = 0;
> +			break;
> +		}
> +	}
> +	spin_unlock(&iommu_sva_lock);
> +
> +out_unlock:
> +	mutex_unlock(&dev->iommu_param->sva_lock);
> +	return ret;
>   }
>   EXPORT_SYMBOL_GPL(__iommu_sva_unbind_device);
>   
>   static void __iommu_sva_unbind_device_all(struct device *dev)
>   {
> -	/* TODO */
> +	struct iommu_sva_param *param = dev->iommu_param->sva_param;
> +	struct iommu_bond *bond, *next;
> +
> +	if (!param)
> +		return;
> +
> +	spin_lock(&iommu_sva_lock);
> +	list_for_each_entry_safe(bond, next, &param->mm_list, dev_head)
> +		io_mm_detach_locked(bond);
> +	spin_unlock(&iommu_sva_lock);
>   }
>   
>   /**
> @@ -82,6 +472,7 @@ int iommu_sva_init_device(struct device *dev, unsigned long features,
>   	param->features		= features;
>   	param->min_pasid	= min_pasid;
>   	param->max_pasid	= max_pasid;
> +	INIT_LIST_HEAD(&param->mm_list);
>   
>   	mutex_lock(&dev->iommu_param->sva_lock);
>   	if (dev->iommu_param->sva_param) {
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index aba3bf15d46c..7113fe398b70 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -1525,6 +1525,7 @@ static struct iommu_domain *__iommu_domain_alloc(struct bus_type *bus,
>   	domain->type = type;
>   	/* Assume all sizes by default; the driver may override this later */
>   	domain->pgsize_bitmap  = bus->iommu_ops->pgsize_bitmap;
> +	INIT_LIST_HEAD(&domain->mm_list);
>   
>   	return domain;
>   }
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 9c49877e37a5..6a3ced6a5aa1 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -99,6 +99,20 @@ struct iommu_domain {
>   	void *handler_token;
>   	struct iommu_domain_geometry geometry;
>   	void *iova_cookie;
> +
> +	struct list_head mm_list;
> +};
> +
> +struct io_mm {
> +	int			pasid;
> +	/* IOMMU_SVA_FEAT_* */
> +	unsigned long		flags;
> +	struct list_head	devices;
> +	struct kref		kref;
> +	struct mm_struct	*mm;
> +
> +	/* Release callback for this mm */
> +	void (*release)(struct io_mm *io_mm);
>   };
>   
>   enum iommu_cap {
> @@ -201,6 +215,7 @@ struct iommu_sva_param {
>   	unsigned long features;
>   	unsigned int min_pasid;
>   	unsigned int max_pasid;
> +	struct list_head mm_list;
>   };
>   
>   /**
> @@ -212,6 +227,12 @@ struct iommu_sva_param {
>    * @detach_dev: detach device from an iommu domain
>    * @sva_init_device: initialize Shared Virtual Addressing for a device
>    * @sva_shutdown_device: shutdown Shared Virtual Addressing for a device
> + * @mm_alloc: allocate io_mm
> + * @mm_free: free io_mm
> + * @mm_attach: attach io_mm to a device. Install PASID entry if necessary. Must
> + *             not sleep.
> + * @mm_detach: detach io_mm from a device. Remove PASID entry and
> + *             flush associated TLB entries if necessary. Must not sleep.
>    * @map: map a physically contiguous memory region to an iommu domain
>    * @unmap: unmap a physically contiguous memory region from an iommu domain
>    * @flush_tlb_all: Synchronously flush all hardware TLBs for this domain
> @@ -249,6 +270,14 @@ struct iommu_ops {
>   	void (*detach_dev)(struct iommu_domain *domain, struct device *dev);
>   	int (*sva_init_device)(struct device *dev, struct iommu_sva_param *param);
>   	void (*sva_shutdown_device)(struct device *dev);
> +	struct io_mm *(*mm_alloc)(struct iommu_domain *domain,
> +				  struct mm_struct *mm,
> +				  unsigned long flags);
> +	void (*mm_free)(struct io_mm *io_mm);
> +	int (*mm_attach)(struct iommu_domain *domain, struct device *dev,
> +			 struct io_mm *io_mm, bool attach_domain);
> +	void (*mm_detach)(struct iommu_domain *domain, struct device *dev,
> +			  struct io_mm *io_mm, bool detach_domain);
>   	int (*map)(struct iommu_domain *domain, unsigned long iova,
>   		   phys_addr_t paddr, size_t size, int prot);
>   	size_t (*unmap)(struct iommu_domain *domain, unsigned long iova,
>
Jean-Philippe Brucker Sept. 25, 2018, 10:32 a.m. UTC | #2
On 25/09/2018 04:15, Lu Baolu wrote:
>> +     /* If an io_mm already exists, use it */
>> +     spin_lock(&iommu_sva_lock);
>> +     idr_for_each_entry(&iommu_pasid_idr, io_mm, i) {
> 
> This might be problematic for vt-d (and other possible arch's which use
> PASID other than SVA). When vt-d iommu works in scalable mode, a PASID
> might be allocated for:
> 
> (1) SVA
> (2) Device Assignable Interface (might be a mdev or directly managed
>      within a device driver).
> (3) SVA in VM guest
> (4) Device Assignable Interface in VM guest
> 
> So we can't expect that an io_mm pointer was associated with each PASID.

Yes, as discussed on the previous series, we'll need to move the PASID
allocator outside of iommu-sva at some point, and even outside of
drivers/iommu, since device drivers might want to use the PASID allocator
without an IOMMU.

To be usable by all consumers of PASIDs, that allocator will need the
same interface as IDR, so I don't think we have a problem here. I
haven't had time or need to write such an allocator yet (and don't plan
to do it as part of this series), but I drafted an interface that at
least fulfills the needs of SVA.

* Single system-wide PASID space
* Multiple consumers, each associating their own structure to PASIDs.
  Each consumer gets a token.
* Device drivers might want to use both SVA and private PASIDs for a
  device at the same time.
* In my opinion "pasid" isn't the right name, "ioasid" would be better
  but that's not important.

typedef unsigned int pasid_t;

/* Returns consumer token */
void *pasid_get_consumer(void);
void pasid_put_consumer(void *consumer);

/* Returns pasid or invalid (pasid_t)(-1) */
pasid_t pasid_alloc(void *consumer, pasid_t min, pasid_t max,
                    void *private);
void pasid_remove(pasid_t pasid);

/* Iterate over PASIDs for this consumer. Func returns non-zero to stop
iterating */
int pasid_for_each(void *consumer, void *iter_data,
		   int (*func)(void *iter_data, pasid_t pasid,
			       void *private));
/* Returns priv data or NULL */
void *pasid_find(void *consumer, pasid_t pasid);
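
A hypothetical usage sketch from iommu-sva's point of view (nothing here
exists yet):

static void *sva_consumer;	/* obtained once via pasid_get_consumer() */

/* in io_mm_alloc(), instead of idr_alloc(): */
pasid_t pasid = pasid_alloc(sva_consumer, param->min_pasid,
			    param->max_pasid, io_mm);
if (pasid == (pasid_t)-1)
	return ERR_PTR(-ENOSPC);
io_mm->pasid = pasid;

/* in io_mm_release(), instead of idr_remove(): */
pasid_remove(io_mm->pasid);

/* in unbind() or fault paths, instead of walking the IDR: */
struct io_mm *io_mm = pasid_find(sva_consumer, pasid);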

Thanks,
Jean

> And this code might run into problem if the pasid is allocated for
> usages other than SVA.
> 
> Best regards,
> Lu Baolu
Joerg Roedel Sept. 25, 2018, 1:26 p.m. UTC | #3
On Tue, Sep 25, 2018 at 11:15:40AM +0800, Lu Baolu wrote:
> This might be problematic for vt-d (and other possible arch's which use
> PASID other than SVA). When vt-d iommu works in scalable mode, a PASID
> might be allocated for:
> 
> (1) SVA
> (2) Device Assignable Interface (might be a mdev or directly managed
>     within a device driver).
> (3) SVA in VM guest
> (4) Device Assignable Interface in VM guest
> 
> So we can't expect that an io_mm pointer was associated with each PASID.
> And this code might run into problem if the pasid is allocated for
> usages other than SVA.

So all of these use-cases above should work in parallel on the same
device, just with different PASIDs? Or is a device always using only one
of the above modes at the same time?

Regards,

	Joerg
Baolu Lu Sept. 25, 2018, 11:33 p.m. UTC | #4
Hi Joerg,

On 09/25/2018 09:26 PM, Joerg Roedel wrote:
> On Tue, Sep 25, 2018 at 11:15:40AM +0800, Lu Baolu wrote:
>> This might be problematic for vt-d (and other possible arch's which use
>> PASID other than SVA). When vt-d iommu works in scalable mode, a PASID
>> might be allocated for:
>>
>> (1) SVA
>> (2) Device Assignable Interface (might be a mdev or directly managed
>>      within a device driver).
>> (3) SVA in VM guest
>> (4) Device Assignable Interface in VM guest
>>
>> So we can't expect that an io_mm pointer was associated with each PASID.
>> And this code might run into problem if the pasid is allocated for
>> usages other than SVA.
> 
> So all of these use-cases above should work in parallel on the same
> device, just with different PASIDs?

No. It's not required.

> Or is a device always using only one
> of the above modes at the same time?

A device might use one or multiple modes described above at the same
time.

Best regards,
Lu Baolu
Baolu Lu Sept. 26, 2018, 3:12 a.m. UTC | #5
Hi,

On 09/25/2018 06:32 PM, Jean-Philippe Brucker wrote:
> On 25/09/2018 04:15, Lu Baolu wrote:
>>> +     /* If an io_mm already exists, use it */
>>> +     spin_lock(&iommu_sva_lock);
>>> +     idr_for_each_entry(&iommu_pasid_idr, io_mm, i) {
>>
>> This might be problematic for vt-d (and other possible arch's which use
>> PASID other than SVA). When vt-d iommu works in scalable mode, a PASID
>> might be allocated for:
>>
>> (1) SVA
>> (2) Device Assignable Interface (might be a mdev or directly managed
>>       within a device driver).
>> (3) SVA in VM guest
>> (4) Device Assignable Interface in VM guest
>>
>> So we can't expect that an io_mm pointer was associated with each PASID.
> 
> Yes as discussed on the previous series, we'll need to move the PASID
> allocator outside of iommu-sva at some point, and even outside of
> drivers/iommu since device driver might want to use the PASID allocator
> without an IOMMU.
> 
> To be usable by all consumers of PASIDs, that allocator will need the
> same interface as IDR, so I don't think we have a problem here. I
> haven't had time or need to write such allocator yet (and don't plan to
> do it as part of this series), but I drafted an interface, that at least
> fulfills the needs of SVA.

I have a patch set for the global pasid allocator. It mostly matches
your idea. I can send it out for comments later.

Best regards,
Lu Baolu

> 
> * Single system-wide PASID space
> * Multiple consumers, each associating their own structure to PASIDs.
>    Each consumer gets a token.
> * Device drivers might want to use both SVA and private PASIDs for a
>    device at the same time.
> * In my opinion "pasid" isn't the right name, "ioasid" would be better
>    but that's not important.
> 
> typedef unsigned int pasid_t;
> 
> /* Returns consumer token */
> void *pasid_get_consumer();
> void pasid_put_consumer(void *consumer);
> 
> /* Returns pasid or invalid (pasid_t)(-1) */
> pasid_t pasid_alloc(void *consumer, pasid_t min, pasid_t max,
>                      void *private);
> void pasid_remove(pasid_t pasid);
> 
> /* Iterate over PASIDs for this consumer. Func returns non-zero to stop
> iterating */
> int pasid_for_each(void *consumer, void *iter_data,
> 		   int (*func)(void *iter_data, pasid_t pasid,
> 			       void *private));
> /* Returns priv data or NULL */
> void *pasid_find(void *consumer, pasid_t pasid);
> 
> Thanks,
> Jean
> 
>> And this code might run into problem if the pasid is allocated for
>> usages other than SVA.
>>
>> Best regards,
>> Lu Baolu
> 
>
Jean-Philippe Brucker Sept. 26, 2018, 10:20 a.m. UTC | #6
On 26/09/2018 00:33, Lu Baolu wrote:
> Hi Joerg,
> 
> On 09/25/2018 09:26 PM, Joerg Roedel wrote:
>> On Tue, Sep 25, 2018 at 11:15:40AM +0800, Lu Baolu wrote:
>>> This might be problematic for vt-d (and other possible arch's which use
>>> PASID other than SVA). When vt-d iommu works in scalable mode, a PASID
>>> might be allocated for:
>>>
>>> (1) SVA
>>> (2) Device Assignable Interface (might be a mdev or directly managed
>>>      within a device driver).
>>> (3) SVA in VM guest
>>> (4) Device Assignable Interface in VM guest
>>>
>>> So we can't expect that an io_mm pointer was associated with each PASID.
>>> And this code might run into problem if the pasid is allocated for
>>> usages other than SVA.
>> 
>> So all of these use-cases above should work in parallel on the same
>> device, just with different PASIDs?
> 
> No. It's not required.
> 
>> Or is a device always using only one
>> of the above modes at the same time?
> 
> A device might use one or multiple modes described above at the same
> time.

Yes, at the moment it's difficult to guess what device drivers will
want, but I can imagine some driver offering SVA to userspace, while
keeping a few PASIDs for themselves to map kernel memory. Or create mdev
devices for virtualization while also allowing bare-metal SVA. So I
think we should aim at enabling these use-cases in parallel, even if it
doesn't necessarily need to be possible right now.

Thanks,
Jean
Joerg Roedel Sept. 26, 2018, 12:45 p.m. UTC | #7
On Wed, Sep 26, 2018 at 11:20:34AM +0100, Jean-Philippe Brucker wrote:
> Yes, at the moment it's difficult to guess what device drivers will
> want, but I can imagine some driver offering SVA to userspace, while
> keeping a few PASIDs for themselves to map kernel memory. Or create mdev
> devices for virtualization while also allowing bare-metal SVA. So I
> think we should aim at enabling these use-cases in parallel, even if it
> doesn't necessarily need to be possible right now.

Yeah okay, but allowing these use-cases in parallel basically disallows
giving any guest control over a device's pasid-table, no?

I am just asking because I want to make up my mind about the necessary
extensions to the IOMMU-API.


Regards,

	Joerg
Jean-Philippe Brucker Sept. 26, 2018, 1:50 p.m. UTC | #8
On 26/09/2018 13:45, Joerg Roedel wrote:
> On Wed, Sep 26, 2018 at 11:20:34AM +0100, Jean-Philippe Brucker wrote:
>> Yes, at the moment it's difficult to guess what device drivers will
>> want, but I can imagine some driver offering SVA to userspace, while
>> keeping a few PASIDs for themselves to map kernel memory. Or create mdev
>> devices for virtualization while also allowing bare-metal SVA. So I
>> think we should aim at enabling these use-cases in parallel, even if it
>> doesn't necessarily need to be possible right now.
> 
> Yeah okay, but allowing these use-cases in parallel basically disallows
> giving any guest control over a device's pasid-table, no?

All of these use-cases require the host to manage the PASID tables, so
while any one of them is enabled, we can't give a guest control over the
PASID tables. But allowing these use-cases in parallel doesn't change that.

There is an ambiguity: I understand "(3) SVA in VM guest" as SVA for a
device-assignable interface assigned to a guest, using vfio-mdev and the
new Intel vt-d architecture (right?). That case does require the host to
allocate and manage PASIDs (because the PCI device is shared between
multiple VMs).

For the "classic" vfio-pci case, "SVA in guest" still means giving the
guest control over the whole PASID table.

Thanks,
Jean
Jacob Pan Sept. 26, 2018, 10:35 p.m. UTC | #9
On Thu, 20 Sep 2018 18:00:39 +0100
Jean-Philippe Brucker <jean-philippe.brucker@arm.com> wrote:

> +
> +static int io_mm_attach(struct iommu_domain *domain, struct device *dev,
> +			struct io_mm *io_mm, void *drvdata)
> +{
> +	int ret;
> +	bool attach_domain = true;
> +	int pasid = io_mm->pasid;
> +	struct iommu_bond *bond, *tmp;
> +	struct iommu_sva_param *param = dev->iommu_param->sva_param;
> +
> +	if (!domain->ops->mm_attach || !domain->ops->mm_detach)
> +		return -ENODEV;
> +
> +	if (pasid > param->max_pasid || pasid < param->min_pasid)
> +		return -ERANGE;
> +
> +	bond = kzalloc(sizeof(*bond), GFP_KERNEL);
> +	if (!bond)
> +		return -ENOMEM;
> +
> +	bond->domain		= domain;
> +	bond->io_mm		= io_mm;
> +	bond->dev		= dev;
> +	bond->drvdata		= drvdata;
> +
> +	spin_lock(&iommu_sva_lock);
> +	/*
> +	 * Check if this io_mm is already bound to the domain. In which case the
> +	 * IOMMU driver doesn't have to install the PASID table entry.
> +	 */
> +	list_for_each_entry(tmp, &domain->mm_list, domain_head) {
> +		if (tmp->io_mm == io_mm) {
> +			attach_domain = false;
> +			break;
> +		}
> +	}
> +
> +	ret = domain->ops->mm_attach(domain, dev, io_mm, attach_domain);
> +	if (ret) {
> +		kfree(bond);
> +		goto out_unlock;
> +	}
> +
> +	list_add(&bond->mm_head, &io_mm->devices);
> +	list_add(&bond->domain_head, &domain->mm_list);
> +	list_add(&bond->dev_head, &param->mm_list);
> +

I am trying to understand whether mm_list is needed both per device and
per domain. Do you always unbind and detach the domain? It seems the
device could use domain->mm_list to track all mm's, true?
Jacob Pan Sept. 26, 2018, 10:58 p.m. UTC | #10
On Wed, 26 Sep 2018 14:45:27 +0200
Joerg Roedel <joro@8bytes.org> wrote:

> On Wed, Sep 26, 2018 at 11:20:34AM +0100, Jean-Philippe Brucker wrote:
> > Yes, at the moment it's difficult to guess what device drivers will
> > want, but I can imagine some driver offering SVA to userspace, while
> > keeping a few PASIDs for themselves to map kernel memory. Or create
> > mdev devices for virtualization while also allowing bare-metal SVA.
> > So I think we should aim at enabling these use-cases in parallel,
> > even if it doesn't necessarily need to be possible right now.  
> 
> Yeah okay, but allowing these use-cases in parallel basically
> disallows giving any guest control over a device's pasid-table, no?
> 
For VT-d 3 (which is the only revision to support PASID), the PASID table
is always controlled by the host driver. Guest SVA usage would bind a
PASID with a gCR3.
But I thought ARM (https://lkml.org/lkml/2018/9/18/1082) is using the
bind-PASID-table approach, which gives the guest control of the device's
PASID table. I don't know if that is intended for any parallel use of
PASIDs on the same device.
> I am just asking because I want to make up my mind about the necessary
> extensions to the IOMMU-API.
> 
One extension we will need, and which is being developed, is
bind_guest_pasid() for guest SVA usage.
Usage:
1. guest allocates a system-wide PASID for SVA
2. guest writes the PASID to its PASID table
3. a PASID cache flush results in binding the PASID (from the guest) to the device
4. host IOMMU driver installs the gCR3s of the PASID into the device PASID
table (ops.bind_guest_pasid)
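
(Purely as an illustration of the shape such an op might take; the real
signature was still under discussion at this point:)

	/* hypothetical sketch, not an existing API */
	int (*bind_guest_pasid)(struct iommu_domain *domain, struct device *dev,
				int pasid,	/* system-wide PASID, step 1 */
				u64 gcr3,	/* guest page table, step 2 */
				unsigned long flags);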

Thanks,

Jacob
> 
> Regards,
> 
> 	Joerg
> 

[Jacob Pan]
Yi Liu Sept. 27, 2018, 3:22 a.m. UTC | #11
> From: iommu-bounces@lists.linux-foundation.org [mailto:iommu-
> bounces@lists.linux-foundation.org] On Behalf Of Jean-Philippe Brucker
> Sent: Wednesday, September 26, 2018 9:50 PM
> Subject: Re: [PATCH v3 03/10] iommu/sva: Manage process address spaces
> 
> On 26/09/2018 13:45, Joerg Roedel wrote:
> > On Wed, Sep 26, 2018 at 11:20:34AM +0100, Jean-Philippe Brucker wrote:
> >> Yes, at the moment it's difficult to guess what device drivers will
> >> want, but I can imagine some driver offering SVA to userspace, while
> >> keeping a few PASIDs for themselves to map kernel memory. Or create mdev
> >> devices for virtualization while also allowing bare-metal SVA. So I
> >> think we should aim at enabling these use-cases in parallel, even if it
> >> doesn't necessarily need to be possible right now.
> >
> > Yeah okay, but allowing these use-cases in parallel basically disallows
> > giving any guest control over a device's pasid-table, no?
> All of these use-cases require the host to manage the PASID tables, so
> while any one of them is enabled, we can't give a guest control over the
> PASID tables. But allowing these use-cases in parallel doesn't change that.
> 
> There is an ambiguity: I understand "(3) SVA in VM guest" as SVA for a
> device-assignable interface assigned to a guest, using vfio-mdev and the
> new Intel vt-d architecture (right?). That case does require the host to
> allocate and manage PASIDs (because the PCI device is shared between
> multiple VMs).

Correct. For such a case, we put the host in charge of allocating and
managing PASIDs, and the reason above is correct.

> 
> For the "classic" vfio-pci case, "SVA in guest" still means giving the
> guest control over the whole PASID table.

No, if we give the guest control over the whole PASID table, it means the
guest may have its own PASID namespace, right? And for the vfio-mdev case, it
gets PASIDs from the host. So there would be multiple PASID namespaces.
Consider the following scenario:

A PF/VF is assigned to a VM via "classic" vfio-pci, and an
assignable-device-interface is assigned to this VM via vfio-mdev. If an
application in this VM tries to manipulate these two "devices", it should have
the same PASID programmed to them, right? But as the above comments mentioned,
for the vfio-pci case it would get a PASID from its own PASID namespace, while
the vfio-mdev case would get a PASID from the host. This would result in a
conflict.

So I would like the host to allocate and manage the whole PASID table, so as
to cover possible combinations of vfio-pci passthrough and vfio-mdev
passthrough.

> 
> Thanks,
> Jean

Regards,
Yi Liu
Jean-Philippe Brucker Sept. 27, 2018, 1:37 p.m. UTC | #12
On 27/09/2018 04:22, Liu, Yi L wrote:
>> For the "classic" vfio-pci case, "SVA in guest" still means giving the
>> guest control over the whole PASID table.
> 
> No, if giving guest control over the whole PASID table, it means guest may have
> its own PASID namespace. right? And for vfio-mdev case, it gets PASID from host.
> So there would be multiple PASID namespaces. Thinking about the following scenario:
> 
> A PF/VF assigned to a VM via "classic" vfio-pci. And an assignable-device-interface
> assigned to this VM via vfio-mdev. If an application in this VM tries to manipulate
> these two "devices", it should have the same PASID programmed to them. right?
> But as the above comments mentioned, for vfio-pci case, it would get a PASID from
> its own PASID namespace. While the vfio-mdev case would get a PASID from host.
> This would result in conflict.

Ah I see, if the host assigns an ADI via vfio-mdev and a PCI function
via vfio-pci to the same VM, the guest needs to use the paravirtualized
PASID allocator for the PCI device as well, not just the ADI. In fact
all guest PASIDs need to be allocated through one PV channel, even if
the VM has other vIOMMUs that don't support PV. But I suppose that kind
of VM is unrealistic. However for SMMUv3 we'll still use the
bind_pasid_table for vfio-pci and let the guest allocate PASIDs, since
the PASID table lives in guest-physical space.

In any case, for the patch series at hand, it means that iommu-sva will
need assistance from the vt-d driver to allocate PASIDs: host uses the
generic allocator, guest uses the PV one. I guess the mm_alloc() op
could do that?

Thanks,
Jean
Jean-Philippe Brucker Oct. 3, 2018, 5:52 p.m. UTC | #13
On 26/09/2018 23:35, Jacob Pan wrote:
> On Thu, 20 Sep 2018 18:00:39 +0100
> Jean-Philippe Brucker <jean-philippe.brucker@arm.com> wrote:
> 
>> +
>> +static int io_mm_attach(struct iommu_domain *domain, struct device *dev,
>> +			struct io_mm *io_mm, void *drvdata)
>> +{
>> +	int ret;
>> +	bool attach_domain = true;
>> +	int pasid = io_mm->pasid;
>> +	struct iommu_bond *bond, *tmp;
>> +	struct iommu_sva_param *param = dev->iommu_param->sva_param;
>> +
>> +	if (!domain->ops->mm_attach || !domain->ops->mm_detach)
>> +		return -ENODEV;
>> +
>> +	if (pasid > param->max_pasid || pasid < param->min_pasid)
>> +		return -ERANGE;
>> +
>> +	bond = kzalloc(sizeof(*bond), GFP_KERNEL);
>> +	if (!bond)
>> +		return -ENOMEM;
>> +
>> +	bond->domain		= domain;
>> +	bond->io_mm		= io_mm;
>> +	bond->dev		= dev;
>> +	bond->drvdata		= drvdata;
>> +
>> +	spin_lock(&iommu_sva_lock);
>> +	/*
>> +	 * Check if this io_mm is already bound to the domain. In which case the
>> +	 * IOMMU driver doesn't have to install the PASID table entry.
>> +	 */
>> +	list_for_each_entry(tmp, &domain->mm_list, domain_head) {
>> +		if (tmp->io_mm == io_mm) {
>> +			attach_domain = false;
>> +			break;
>> +		}
>> +	}
>> +
>> +	ret = domain->ops->mm_attach(domain, dev, io_mm, attach_domain);
>> +	if (ret) {
>> +		kfree(bond);
>> +		goto out_unlock;
>> +	}
>> +
>> +	list_add(&bond->mm_head, &io_mm->devices);
>> +	list_add(&bond->domain_head, &domain->mm_list);
>> +	list_add(&bond->dev_head, &param->mm_list);
>> +
> 
> I am trying to understand if mm_list is needed for both per device and
> per domain. Do you always unbind and detach domain? Seems device could
> use the domain->mm_list to track all mm's, true?

We need to track bonds per device, since the bind/unbind() user interface
is on devices. Tracking per domain is just a helper, so IOMMU drivers that
have a single PASID table per domain know when they need to install a new
entry (the attach_domain parameter above) and remove it. I think my code
is wrong here: if binding two devices that are in the same domain to the
same process, we shouldn't add the io_mm to domain->mm_list twice.
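
A minimal sketch of that fix, for illustration (the detach side would need a
matching update, since io_mm_detach_locked() currently does an unconditional
list_del() on domain_head):

	INIT_LIST_HEAD(&bond->domain_head);
	list_add(&bond->mm_head, &io_mm->devices);
	/* Only the first bond of this io_mm goes on the domain's list */
	if (attach_domain)
		list_add(&bond->domain_head, &domain->mm_list);
	list_add(&bond->dev_head, &param->mm_list);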

I'm still not sure if I should remove domains handling here though, could
you confirm if you're planning to support iommu_get_domain_for_dev for vt-d?

Thanks,
Jean
Yi Liu Oct. 8, 2018, 8:29 a.m. UTC | #14
Hi Jean,

> From: Jean-Philippe Brucker [mailto:jean-philippe.brucker@arm.com]
> Sent: Thursday, September 27, 2018 9:38 PM
> To: Liu, Yi L <yi.l.liu@intel.com>; Joerg Roedel <joro@8bytes.org>
> Subject: Re: [PATCH v3 03/10] iommu/sva: Manage process address spaces
> 
> On 27/09/2018 04:22, Liu, Yi L wrote:
> >> For the "classic" vfio-pci case, "SVA in guest" still means giving the
> >> guest control over the whole PASID table.
> >
> > No, if giving guest control over the whole PASID table, it means guest may have
> > its own PASID namespace. right? And for vfio-mdev case, it gets PASID from host.
> > So there would be multiple PASID namespaces. Thinking about the following
> scenario:
> >
> > A PF/VF assigned to a VM via "classic" vfio-pci. And an assignable-device-interface
> > assigned to this VM via vfio-mdev. If an application in this VM tries to manipulate
> > these two "devices", it should have the same PASID programmed to them. right?
> > But as the above comments mentioned, for vfio-pci case, it would get a PASID
> from
> > its own PASID namespace. While the vfio-mdev case would get a PASID from host.
> > This would result in conflict.
> 
> Ah I see, if the host assigns an ADI via vfio-mdev and a PCI function
> via vfio-pci to the same VM, the guest needs to use the paravirtualized
> PASID allocator for the PCI device as well, not just the ADI. In fact
> all guest PASIDs need to be allocated through one PV channel, even if
> the VM has other vIOMMUs that don't support PV. But I suppose that kind
> of VM is unrealistic.

Yes, that kind of VM is unrealistic. :)

> However for SMMUv3 we'll still use the
> bind_pasid_table for vfio-pci and let the guest allocate PASIDs, since
> the PASID table lives in guest-physical space.

I think it's ok. This doesn’t result in any conflict.

> 
> In any case, for the patch series at hand, it means that iommu-sva will
> need assistance from the vt-d driver to allocate PASIDs: host uses the
> generic allocator, guest uses the PV one.

Exactly.

> I guess the mm_alloc() op could do that?

Do you mean the io_mm_alloc() in your SVA patch series? We've got
some patches for the PV one. Allen (Baolu Lu) is preparing to send them
out for review. I guess we can have more alignment during that patch
review.

Thanks,
Yi Liu
Jacob Pan Oct. 15, 2018, 8:53 p.m. UTC | #15
On Wed, 3 Oct 2018 18:52:16 +0100
Jean-Philippe Brucker <jean-philippe.brucker@arm.com> wrote:

> On 26/09/2018 23:35, Jacob Pan wrote:
> > On Thu, 20 Sep 2018 18:00:39 +0100
> > Jean-Philippe Brucker <jean-philippe.brucker@arm.com> wrote:
> >   
> >> +
> >> +static int io_mm_attach(struct iommu_domain *domain, struct device *dev,
> >> +			struct io_mm *io_mm, void *drvdata)
> >> +{
> >> +	int ret;
> >> +	bool attach_domain = true;
> >> +	int pasid = io_mm->pasid;
> >> +	struct iommu_bond *bond, *tmp;
> >> +	struct iommu_sva_param *param = dev->iommu_param->sva_param;
> >> +
> >> +	if (!domain->ops->mm_attach || !domain->ops->mm_detach)
> >> +		return -ENODEV;
> >> +
> >> +	if (pasid > param->max_pasid || pasid < param->min_pasid)
> >> +		return -ERANGE;
> >> +
> >> +	bond = kzalloc(sizeof(*bond), GFP_KERNEL);
> >> +	if (!bond)
> >> +		return -ENOMEM;
> >> +
> >> +	bond->domain		= domain;
> >> +	bond->io_mm		= io_mm;
> >> +	bond->dev		= dev;
> >> +	bond->drvdata		= drvdata;
> >> +
> >> +	spin_lock(&iommu_sva_lock);
> >> +	/*
> >> +	 * Check if this io_mm is already bound to the domain. In which case the
> >> +	 * IOMMU driver doesn't have to install the PASID table entry.
> >> +	 */
> >> +	list_for_each_entry(tmp, &domain->mm_list, domain_head) {
> >> +		if (tmp->io_mm == io_mm) {
> >> +			attach_domain = false;
> >> +			break;
> >> +		}
> >> +	}
> >> +
> >> +	ret = domain->ops->mm_attach(domain, dev, io_mm, attach_domain);
> >> +	if (ret) {
> >> +		kfree(bond);
> >> +		goto out_unlock;
> >> +	}
> >> +
> >> +	list_add(&bond->mm_head, &io_mm->devices);
> >> +	list_add(&bond->domain_head, &domain->mm_list);
> >> +	list_add(&bond->dev_head, &param->mm_list);
> >> +  
> > 
> > I am trying to understand if mm_list is needed for both per device
> > and per domain. Do you always unbind and detach domain? Seems
> > device could use the domain->mm_list to track all mm's, true?  
> 
> We need to track bonds per devices, since the bind/unbind() user
> interface in on devices. Tracking per domain is just a helper, so
> IOMMU drivers that have a single PASID table per domain know when
> they need to install a new entry (the attach_domain parameter above)
> and remove it. I think my code is wrong here: if binding two devices
> that are in the same domain to the same process we shouldn't add the
> io_mm to domain->mm_list twice.
> 
> I'm still not sure if I should remove domains handling here though,
> could you confirm if you're planning to support
> iommu_get_domain_for_dev for vt-d?
> 
Yes. I am working on getting vt-d onto the same behavior in terms of the
default domain. I have a patch being tested; we need to respect RMRRs
(reserved regions) that are set up before iommu_get_domain_for_dev().
> Thanks,
> Jean

[Jacob Pan]

Patch

diff --git a/drivers/iommu/iommu-sva.c b/drivers/iommu/iommu-sva.c
index d60d4f0bb89e..a486bc947335 100644
--- a/drivers/iommu/iommu-sva.c
+++ b/drivers/iommu/iommu-sva.c
@@ -5,25 +5,415 @@ 
  * Copyright (C) 2018 ARM Ltd.
  */
 
+#include <linux/idr.h>
 #include <linux/iommu.h>
+#include <linux/sched/mm.h>
 #include <linux/slab.h>
+#include <linux/spinlock.h>
+
+/**
+ * DOC: io_mm model
+ *
+ * The io_mm keeps track of process address spaces shared between CPU and IOMMU.
+ * The following example illustrates the relation between structures
+ * iommu_domain, io_mm and iommu_bond. An iommu_bond is a link between io_mm and
+ * device. A device can have multiple io_mm and an io_mm may be bound to
+ * multiple devices.
+ *              ___________________________
+ *             |  IOMMU domain A           |
+ *             |  ________________         |
+ *             | |  IOMMU group   |        +------- io_pgtables
+ *             | |                |        |
+ *             | |   dev 00:00.0 ----+------- bond --- io_mm X
+ *             | |________________|   \    |
+ *             |                       '----- bond ---.
+ *             |___________________________|           \
+ *              ___________________________             \
+ *             |  IOMMU domain B           |           io_mm Y
+ *             |  ________________         |           / /
+ *             | |  IOMMU group   |        |          / /
+ *             | |                |        |         / /
+ *             | |   dev 00:01.0 ------------ bond -' /
+ *             | |   dev 00:01.1 ------------ bond --'
+ *             | |________________|        |
+ *             |                           +------- io_pgtables
+ *             |___________________________|
+ *
+ * In this example, device 00:00.0 is in domain A, devices 00:01.* are in domain
+ * B. All devices within the same domain access the same address spaces. Device
+ * 00:00.0 accesses address spaces X and Y, each corresponding to an mm_struct.
+ * Devices 00:01.* only access address space Y. In addition each
+ * IOMMU_DOMAIN_DMA domain has a private address space, io_pgtable, that is
+ * managed with iommu_map()/iommu_unmap(), and isn't shared with the CPU MMU.
+ *
+ * To obtain the above configuration, users would for instance issue the
+ * following calls:
+ *
+ *     iommu_sva_bind_device(dev 00:00.0, mm X, ...) -> PASID 1
+ *     iommu_sva_bind_device(dev 00:00.0, mm Y, ...) -> PASID 2
+ *     iommu_sva_bind_device(dev 00:01.0, mm Y, ...) -> PASID 2
+ *     iommu_sva_bind_device(dev 00:01.1, mm Y, ...) -> PASID 2
+ *
+ * A single Process Address Space ID (PASID) is allocated for each mm. In the
+ * example, devices use PASID 1 to read/write into address space X and PASID 2
+ * to read/write into address space Y.
+ *
+ * Hardware tables describing this configuration in the IOMMU would typically
+ * look like this:
+ *
+ *                                PASID tables
+ *                                 of domain A
+ *                              .->+--------+
+ *                             / 0 |        |-------> io_pgtable
+ *                            /    +--------+
+ *            Device tables  /   1 |        |-------> pgd X
+ *              +--------+  /      +--------+
+ *      00:00.0 |      A |-'     2 |        |--.
+ *              +--------+         +--------+   \
+ *              :        :       3 |        |    \
+ *              +--------+         +--------+     --> pgd Y
+ *      00:01.0 |      B |--.                    /
+ *              +--------+   \                  |
+ *      00:01.1 |      B |----+   PASID tables  |
+ *              +--------+     \   of domain B  |
+ *                              '->+--------+   |
+ *                               0 |        |-- | --> io_pgtable
+ *                                 +--------+   |
+ *                               1 |        |   |
+ *                                 +--------+   |
+ *                               2 |        |---'
+ *                                 +--------+
+ *                               3 |        |
+ *                                 +--------+
+ *
+ * With this model, a single call binds all devices in a given domain to an
+ * address space. Other devices in the domain will get the same bond implicitly.
+ * However, users must issue one bind() for each device, because IOMMUs may
+ * implement SVA differently. Furthermore, mandating one bind() per device
+ * allows the driver to perform sanity-checks on device capabilities.
+ *
+ * In some IOMMUs, one entry (typically the first one) of the PASID table can be
+ * used to hold non-PASID translations. In this case PASID #0 is reserved and
+ * the first entry points to the io_pgtable pointer. In other IOMMUs the
+ * io_pgtable pointer is held in the device table and PASID #0 is available to
+ * the allocator.
+ */
+
+struct iommu_bond {
+	struct io_mm		*io_mm;
+	struct device		*dev;
+	struct iommu_domain	*domain;
+
+	struct list_head	mm_head;
+	struct list_head	dev_head;
+	struct list_head	domain_head;
+
+	void			*drvdata;
+};
+
+/*
+ * Because we're using an IDR, PASIDs are limited to 31 bits (the sign bit is
+ * used for returning errors). In practice implementations will use at most 20
+ * bits, which is the PCI limit.
+ */
+static DEFINE_IDR(iommu_pasid_idr);
+
+/*
+ * For the moment this is an all-purpose lock. It serializes
+ * access/modifications to bonds, access/modifications to the PASID IDR, and
+ * changes to io_mm refcount as well.
+ */
+static DEFINE_SPINLOCK(iommu_sva_lock);
+
+static struct io_mm *
+io_mm_alloc(struct iommu_domain *domain, struct device *dev,
+	    struct mm_struct *mm, unsigned long flags)
+{
+	int ret;
+	int pasid;
+	struct io_mm *io_mm;
+	struct iommu_sva_param *param = dev->iommu_param->sva_param;
+
+	if (!domain->ops->mm_alloc || !domain->ops->mm_free)
+		return ERR_PTR(-ENODEV);
+
+	io_mm = domain->ops->mm_alloc(domain, mm, flags);
+	if (IS_ERR(io_mm))
+		return io_mm;
+	if (!io_mm)
+		return ERR_PTR(-ENOMEM);
+
+	/*
+	 * The mm must not be freed until after the driver frees the io_mm
+	 * (which may involve unpinning the CPU ASID for instance, requiring a
+	 * valid mm struct.)
+	 */
+	mmgrab(mm);
+
+	io_mm->flags		= flags;
+	io_mm->mm		= mm;
+	io_mm->release		= domain->ops->mm_free;
+	INIT_LIST_HEAD(&io_mm->devices);
+	/* Leave kref as zero until the io_mm is fully initialized */
+
+	idr_preload(GFP_KERNEL);
+	spin_lock(&iommu_sva_lock);
+	pasid = idr_alloc(&iommu_pasid_idr, io_mm, param->min_pasid,
+			  param->max_pasid + 1, GFP_ATOMIC);
+	io_mm->pasid = pasid;
+	spin_unlock(&iommu_sva_lock);
+	idr_preload_end();
+
+	if (pasid < 0) {
+		ret = pasid;
+		goto err_free_mm;
+	}
+
+	/* TODO: keep track of mm. For the moment, abort. */
+	ret = -ENOSYS;
+	spin_lock(&iommu_sva_lock);
+	idr_remove(&iommu_pasid_idr, io_mm->pasid);
+	spin_unlock(&iommu_sva_lock);
+
+err_free_mm:
+	io_mm->release(io_mm);
+	mmdrop(mm);
+
+	return ERR_PTR(ret);
+}
+
+static void io_mm_free(struct io_mm *io_mm)
+{
+	struct mm_struct *mm = io_mm->mm;
+
+	io_mm->release(io_mm);
+	mmdrop(mm);
+}
+
+static void io_mm_release(struct kref *kref)
+{
+	struct io_mm *io_mm;
+
+	io_mm = container_of(kref, struct io_mm, kref);
+	WARN_ON(!list_empty(&io_mm->devices));
+
+	/* The PASID can now be reallocated for another mm... */
+	idr_remove(&iommu_pasid_idr, io_mm->pasid);
+	/* ... but this mm is freed after a grace period (TODO) */
+	io_mm_free(io_mm);
+}
+
+/*
+ * Returns non-zero if a reference to the io_mm was successfully taken.
+ * Returns zero if the io_mm is being freed and should not be used.
+ */
+static int io_mm_get_locked(struct io_mm *io_mm)
+{
+	if (io_mm)
+		return kref_get_unless_zero(&io_mm->kref);
+
+	return 0;
+}
+
+static void io_mm_put_locked(struct io_mm *io_mm)
+{
+	kref_put(&io_mm->kref, io_mm_release);
+}
+
+static void io_mm_put(struct io_mm *io_mm)
+{
+	spin_lock(&iommu_sva_lock);
+	io_mm_put_locked(io_mm);
+	spin_unlock(&iommu_sva_lock);
+}
+
+static int io_mm_attach(struct iommu_domain *domain, struct device *dev,
+			struct io_mm *io_mm, void *drvdata)
+{
+	int ret;
+	bool attach_domain = true;
+	int pasid = io_mm->pasid;
+	struct iommu_bond *bond, *tmp;
+	struct iommu_sva_param *param = dev->iommu_param->sva_param;
+
+	if (!domain->ops->mm_attach || !domain->ops->mm_detach)
+		return -ENODEV;
+
+	if (pasid > param->max_pasid || pasid < param->min_pasid)
+		return -ERANGE;
+
+	bond = kzalloc(sizeof(*bond), GFP_KERNEL);
+	if (!bond)
+		return -ENOMEM;
+
+	bond->domain		= domain;
+	bond->io_mm		= io_mm;
+	bond->dev		= dev;
+	bond->drvdata		= drvdata;
+
+	spin_lock(&iommu_sva_lock);
+	/*
+	 * Check if this io_mm is already bound to the domain. In which case the
+	 * IOMMU driver doesn't have to install the PASID table entry.
+	 */
+	list_for_each_entry(tmp, &domain->mm_list, domain_head) {
+		if (tmp->io_mm == io_mm) {
+			attach_domain = false;
+			break;
+		}
+	}
+
+	ret = domain->ops->mm_attach(domain, dev, io_mm, attach_domain);
+	if (ret) {
+		kfree(bond);
+		goto out_unlock;
+	}
+
+	list_add(&bond->mm_head, &io_mm->devices);
+	list_add(&bond->domain_head, &domain->mm_list);
+	list_add(&bond->dev_head, &param->mm_list);
+
+out_unlock:
+	spin_unlock(&iommu_sva_lock);
+	return ret;
+}
+
+static void io_mm_detach_locked(struct iommu_bond *bond)
+{
+	struct iommu_bond *tmp;
+	bool detach_domain = true;
+	struct iommu_domain *domain = bond->domain;
+
+	list_for_each_entry(tmp, &domain->mm_list, domain_head) {
+		if (tmp->io_mm == bond->io_mm && tmp->dev != bond->dev) {
+			detach_domain = false;
+			break;
+		}
+	}
+
+	list_del(&bond->mm_head);
+	list_del(&bond->domain_head);
+	list_del(&bond->dev_head);
+
+	domain->ops->mm_detach(domain, bond->dev, bond->io_mm, detach_domain);
+
+	io_mm_put_locked(bond->io_mm);
+	kfree(bond);
+}
 
 int __iommu_sva_bind_device(struct device *dev, struct mm_struct *mm, int *pasid,
 			    unsigned long flags, void *drvdata)
 {
-	return -ENOSYS; /* TODO */
+	int i;
+	int ret = 0;
+	struct iommu_bond *bond;
+	struct io_mm *io_mm = NULL;
+	struct iommu_domain *domain;
+	struct iommu_sva_param *param;
+
+	domain = iommu_get_domain_for_dev(dev);
+	if (!domain)
+		return -EINVAL;
+
+	mutex_lock(&dev->iommu_param->sva_lock);
+	param = dev->iommu_param->sva_param;
+	if (!param || (flags & ~param->features)) {
+		ret = -EINVAL;
+		goto out_unlock;
+	}
+
+	/* If an io_mm already exists, use it */
+	spin_lock(&iommu_sva_lock);
+	idr_for_each_entry(&iommu_pasid_idr, io_mm, i) {
+		if (io_mm->mm == mm && io_mm_get_locked(io_mm)) {
+			/* ... Unless it's already bound to this device */
+			list_for_each_entry(bond, &io_mm->devices, mm_head) {
+				if (bond->dev == dev) {
+					ret = -EEXIST;
+					io_mm_put_locked(io_mm);
+					break;
+				}
+			}
+			break;
+		}
+	}
+	spin_unlock(&iommu_sva_lock);
+	if (ret)
+		goto out_unlock;
+
+	/* Require identical features within an io_mm for now */
+	if (io_mm && (flags != io_mm->flags)) {
+		io_mm_put(io_mm);
+		ret = -EDOM;
+		goto out_unlock;
+	}
+
+	if (!io_mm) {
+		io_mm = io_mm_alloc(domain, dev, mm, flags);
+		if (IS_ERR(io_mm)) {
+			ret = PTR_ERR(io_mm);
+			goto out_unlock;
+		}
+	}
+
+	ret = io_mm_attach(domain, dev, io_mm, drvdata);
+	if (ret)
+		io_mm_put(io_mm);
+	else
+		*pasid = io_mm->pasid;
+
+out_unlock:
+	mutex_unlock(&dev->iommu_param->sva_lock);
+	return ret;
 }
 EXPORT_SYMBOL_GPL(__iommu_sva_bind_device);
 
 int __iommu_sva_unbind_device(struct device *dev, int pasid)
 {
-	return -ENOSYS; /* TODO */
+	int ret = -ESRCH;
+	struct iommu_domain *domain;
+	struct iommu_bond *bond = NULL;
+	struct iommu_sva_param *param;
+
+	domain = iommu_get_domain_for_dev(dev);
+	if (!domain)
+		return -EINVAL;
+
+	mutex_lock(&dev->iommu_param->sva_lock);
+	param = dev->iommu_param->sva_param;
+	if (!param) {
+		ret = -EINVAL;
+		goto out_unlock;
+	}
+
+	spin_lock(&iommu_sva_lock);
+	list_for_each_entry(bond, &param->mm_list, dev_head) {
+		if (bond->io_mm->pasid == pasid) {
+			io_mm_detach_locked(bond);
+			ret = 0;
+			break;
+		}
+	}
+	spin_unlock(&iommu_sva_lock);
+
+out_unlock:
+	mutex_unlock(&dev->iommu_param->sva_lock);
+	return ret;
 }
 EXPORT_SYMBOL_GPL(__iommu_sva_unbind_device);
 
 static void __iommu_sva_unbind_device_all(struct device *dev)
 {
-	/* TODO */
+	struct iommu_sva_param *param = dev->iommu_param->sva_param;
+	struct iommu_bond *bond, *next;
+
+	if (!param)
+		return;
+
+	spin_lock(&iommu_sva_lock);
+	list_for_each_entry_safe(bond, next, &param->mm_list, dev_head)
+		io_mm_detach_locked(bond);
+	spin_unlock(&iommu_sva_lock);
 }
 
 /**
@@ -82,6 +472,7 @@  int iommu_sva_init_device(struct device *dev, unsigned long features,
 	param->features		= features;
 	param->min_pasid	= min_pasid;
 	param->max_pasid	= max_pasid;
+	INIT_LIST_HEAD(&param->mm_list);
 
 	mutex_lock(&dev->iommu_param->sva_lock);
 	if (dev->iommu_param->sva_param) {
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index aba3bf15d46c..7113fe398b70 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1525,6 +1525,7 @@  static struct iommu_domain *__iommu_domain_alloc(struct bus_type *bus,
 	domain->type = type;
 	/* Assume all sizes by default; the driver may override this later */
 	domain->pgsize_bitmap  = bus->iommu_ops->pgsize_bitmap;
+	INIT_LIST_HEAD(&domain->mm_list);
 
 	return domain;
 }
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 9c49877e37a5..6a3ced6a5aa1 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -99,6 +99,20 @@  struct iommu_domain {
 	void *handler_token;
 	struct iommu_domain_geometry geometry;
 	void *iova_cookie;
+
+	struct list_head mm_list;
+};
+
+struct io_mm {
+	int			pasid;
+	/* IOMMU_SVA_FEAT_* */
+	unsigned long		flags;
+	struct list_head	devices;
+	struct kref		kref;
+	struct mm_struct	*mm;
+
+	/* Release callback for this mm */
+	void (*release)(struct io_mm *io_mm);
 };
 
 enum iommu_cap {
@@ -201,6 +215,7 @@  struct iommu_sva_param {
 	unsigned long features;
 	unsigned int min_pasid;
 	unsigned int max_pasid;
+	struct list_head mm_list;
 };
 
 /**
@@ -212,6 +227,12 @@  struct iommu_sva_param {
  * @detach_dev: detach device from an iommu domain
  * @sva_init_device: initialize Shared Virtual Addressing for a device
  * @sva_shutdown_device: shutdown Shared Virtual Addressing for a device
+ * @mm_alloc: allocate io_mm
+ * @mm_free: free io_mm
+ * @mm_attach: attach io_mm to a device. Install PASID entry if necessary. Must
+ *             not sleep.
+ * @mm_detach: detach io_mm from a device. Remove PASID entry and
+ *             flush associated TLB entries if necessary. Must not sleep.
  * @map: map a physically contiguous memory region to an iommu domain
  * @unmap: unmap a physically contiguous memory region from an iommu domain
  * @flush_tlb_all: Synchronously flush all hardware TLBs for this domain
@@ -249,6 +270,14 @@  struct iommu_ops {
 	void (*detach_dev)(struct iommu_domain *domain, struct device *dev);
 	int (*sva_init_device)(struct device *dev, struct iommu_sva_param *param);
 	void (*sva_shutdown_device)(struct device *dev);
+	struct io_mm *(*mm_alloc)(struct iommu_domain *domain,
+				  struct mm_struct *mm,
+				  unsigned long flags);
+	void (*mm_free)(struct io_mm *io_mm);
+	int (*mm_attach)(struct iommu_domain *domain, struct device *dev,
+			 struct io_mm *io_mm, bool attach_domain);
+	void (*mm_detach)(struct iommu_domain *domain, struct device *dev,
+			  struct io_mm *io_mm, bool detach_domain);
 	int (*map)(struct iommu_domain *domain, unsigned long iova,
 		   phys_addr_t paddr, size_t size, int prot);
 	size_t (*unmap)(struct iommu_domain *domain, unsigned long iova,