[v2,03/40] iommu/sva: Manage process address spaces

Message ID	20180511190641.23008-4-jean-philippe.brucker@arm.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <kvm-owner@kernel.org> From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> To: linux-arm-kernel@lists.infradead.org, linux-pci@vger.kernel.org, linux-acpi@vger.kernel.org, devicetree@vger.kernel.org, iommu@lists.linux-foundation.org, kvm@vger.kernel.org, linux-mm@kvack.org Cc: joro@8bytes.org, will.deacon@arm.com, robin.murphy@arm.com, alex.williamson@redhat.com, tn@semihalf.com, liubo95@huawei.com, thunder.leizhen@huawei.com, xieyisheng1@huawei.com, xuzaibo@huawei.com, ilias.apalodimas@linaro.org, jonathan.cameron@huawei.com, liudongdong3@huawei.com, shunyong.yang@hxt-semitech.com, nwatters@codeaurora.org, okaya@codeaurora.org, jcrouse@codeaurora.org, rfranz@cavium.com, dwmw2@infradead.org, jacob.jun.pan@linux.intel.com, yi.l.liu@intel.com, ashok.raj@intel.com, kevin.tian@intel.com, baolu.lu@linux.intel.com, robdclark@gmail.com, christian.koenig@amd.com, bharatku@xilinx.com, rgummal@xilinx.com Subject: [PATCH v2 03/40] iommu/sva: Manage process address spaces Date: Fri, 11 May 2018 20:06:04 +0100 Message-Id: <20180511190641.23008-4-jean-philippe.brucker@arm.com> In-Reply-To: <20180511190641.23008-1-jean-philippe.brucker@arm.com> References: <20180511190641.23008-1-jean-philippe.brucker@arm.com> Sender: kvm-owner@vger.kernel.org Precedence: bulk

Jean-Philippe Brucker May 11, 2018, 7:06 p.m. UTC

Allocate IOMMU mm structures and binding them to devices. Four operations
are added to IOMMU drivers:

* mm_alloc(): to create an io_mm structure and perform architecture-
  specific operations required to grab the process (for instance on ARM,
  pin down the CPU ASID so that the process doesn't get assigned a new
  ASID on rollover).

  There is a single valid io_mm structure per Linux mm. Future extensions
  may also use io_mm for kernel-managed address spaces, populated with
  map()/unmap() calls instead of bound to process address spaces. This
  patch focuses on "shared" io_mm.

* mm_attach(): attach an mm to a device. The IOMMU driver checks that the
  device is capable of sharing an address space, and writes the PASID
  table entry to install the pgd.

  Some IOMMU drivers will have a single PASID table per domain, for
  convenience. Other can implement it differently but to help these
  drivers, mm_attach and mm_detach take 'attach_domain' and
  'detach_domain' parameters, that tell whether they need to set and clear
  the PASID entry or only send the required TLB invalidations.

* mm_detach(): detach an mm from a device. The IOMMU driver removes the
  PASID table entry and invalidates the IOTLBs.

* mm_free(): free a structure allocated by mm_alloc(), and let arch
  release the process.

mm_attach and mm_detach operations are serialized with a spinlock. When
trying to optimize this code, we should at least prevent concurrent
attach()/detach() on the same domain (so multi-level PASID table code can
allocate tables lazily). mm_alloc() can sleep, but mm_free must not
(because we'll have to call it from call_srcu later on).

At the moment we use an IDR for allocating PASIDs and retrieving contexts.
We also use a single spinlock. These can be refined and optimized later (a
custom allocator will be needed for top-down PASID allocation).

Keeping track of address spaces requires the use of MMU notifiers.
Handling process exit with regard to unbind() is tricky, so it is left for
another patch and we explicitly fail mm_alloc() for the moment.

Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>

---
v1->v2: sanity-check of flags
---
 drivers/iommu/iommu-sva.c | 380 +++++++++++++++++++++++++++++++++++++-
 drivers/iommu/iommu.c     |   1 +
 include/linux/iommu.h     |  28 +++
 3 files changed, 406 insertions(+), 3 deletions(-)

Jacob Pan May 16, 2018, 11:31 p.m. UTC | #1

On Fri, 11 May 2018 20:06:04 +0100
Jean-Philippe Brucker <jean-philippe.brucker@arm.com> wrote:

> Allocate IOMMU mm structures and binding them to devices. Four
> operations are added to IOMMU drivers:
> 
> * mm_alloc(): to create an io_mm structure and perform architecture-
>   specific operations required to grab the process (for instance on
> ARM, pin down the CPU ASID so that the process doesn't get assigned a
> new ASID on rollover).
> 
>   There is a single valid io_mm structure per Linux mm. Future
> extensions may also use io_mm for kernel-managed address spaces,
> populated with map()/unmap() calls instead of bound to process
> address spaces. This patch focuses on "shared" io_mm.
> 
> * mm_attach(): attach an mm to a device. The IOMMU driver checks that
> the device is capable of sharing an address space, and writes the
> PASID table entry to install the pgd.
> 
>   Some IOMMU drivers will have a single PASID table per domain, for
>   convenience. Other can implement it differently but to help these
>   drivers, mm_attach and mm_detach take 'attach_domain' and
>   'detach_domain' parameters, that tell whether they need to set and
> clear the PASID entry or only send the required TLB invalidations.
> 
> * mm_detach(): detach an mm from a device. The IOMMU driver removes
> the PASID table entry and invalidates the IOTLBs.
> 
> * mm_free(): free a structure allocated by mm_alloc(), and let arch
>   release the process.
> 
> mm_attach and mm_detach operations are serialized with a spinlock.
> When trying to optimize this code, we should at least prevent
> concurrent attach()/detach() on the same domain (so multi-level PASID
> table code can allocate tables lazily). mm_alloc() can sleep, but
> mm_free must not (because we'll have to call it from call_srcu later
> on).
> 
> At the moment we use an IDR for allocating PASIDs and retrieving
> contexts. We also use a single spinlock. These can be refined and
> optimized later (a custom allocator will be needed for top-down PASID
> allocation).
> 
> Keeping track of address spaces requires the use of MMU notifiers.
> Handling process exit with regard to unbind() is tricky, so it is
> left for another patch and we explicitly fail mm_alloc() for the
> moment.
> 
> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> 
> ---
> v1->v2: sanity-check of flags
> ---
>  drivers/iommu/iommu-sva.c | 380
> +++++++++++++++++++++++++++++++++++++- drivers/iommu/iommu.c     |
> 1 + include/linux/iommu.h     |  28 +++
>  3 files changed, 406 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/iommu/iommu-sva.c b/drivers/iommu/iommu-sva.c
> index 8d98f9c09864..6ac679c48f3c 100644
> --- a/drivers/iommu/iommu-sva.c
> +++ b/drivers/iommu/iommu-sva.c
> @@ -5,8 +5,298 @@
>   * Copyright (C) 2018 ARM Ltd.
>   */
>  
> +#include <linux/idr.h>
>  #include <linux/iommu.h>
> +#include <linux/sched/mm.h>
>  #include <linux/slab.h>
> +#include <linux/spinlock.h>
> +
> +/**
> + * DOC: io_mm model
> + *
> + * The io_mm keeps track of process address spaces shared between
> CPU and IOMMU.
> + * The following example illustrates the relation between structures
> + * iommu_domain, io_mm and iommu_bond. An iommu_bond is a link
> between io_mm and
> + * device. A device can have multiple io_mm and an io_mm may be
> bound to
> + * multiple devices.
> + *              ___________________________
> + *             |  IOMMU domain A           |
> + *             |  ________________         |
> + *             | |  IOMMU group   |        +------- io_pgtables
> + *             | |                |        |
> + *             | |   dev 00:00.0 ----+------- bond --- io_mm X
> + *             | |________________|   \    |
> + *             |                       '----- bond ---.
> + *             |___________________________|           \
> + *              ___________________________             \
> + *             |  IOMMU domain B           |           io_mm Y
> + *             |  ________________         |           / /
> + *             | |  IOMMU group   |        |          / /
> + *             | |                |        |         / /
> + *             | |   dev 00:01.0 ------------ bond -' /
> + *             | |   dev 00:01.1 ------------ bond --'
> + *             | |________________|        |
> + *             |                           +------- io_pgtables
> + *             |___________________________|
> + *
> + * In this example, device 00:00.0 is in domain A, devices 00:01.*
> are in domain
> + * B. All devices within the same domain access the same address
> spaces. Device
> + * 00:00.0 accesses address spaces X and Y, each corresponding to an
> mm_struct.
> + * Devices 00:01.* only access address space Y. In addition each
> + * IOMMU_DOMAIN_DMA domain has a private address space, io_pgtable,
> that is
> + * managed with iommu_map()/iommu_unmap(), and isn't shared with the
> CPU MMU.
> + *
> + * To obtain the above configuration, users would for instance issue
> the
> + * following calls:
> + *
> + *     iommu_sva_bind_device(dev 00:00.0, mm X, ...) -> PASID 1
> + *     iommu_sva_bind_device(dev 00:00.0, mm Y, ...) -> PASID 2
> + *     iommu_sva_bind_device(dev 00:01.0, mm Y, ...) -> PASID 2
> + *     iommu_sva_bind_device(dev 00:01.1, mm Y, ...) -> PASID 2
> + *
> + * A single Process Address Space ID (PASID) is allocated for each
> mm. In the
> + * example, devices use PASID 1 to read/write into address space X
> and PASID 2
> + * to read/write into address space Y.
> + *
> + * Hardware tables describing this configuration in the IOMMU would
> typically
> + * look like this:
> + *
> + *                                PASID tables
> + *                                 of domain A
> + *                              .->+--------+
> + *                             / 0 |        |-------> io_pgtable
> + *                            /    +--------+
> + *            Device tables  /   1 |        |-------> pgd X
> + *              +--------+  /      +--------+
> + *      00:00.0 |      A |-'     2 |        |--.
> + *              +--------+         +--------+   \
> + *              :        :       3 |        |    \
> + *              +--------+         +--------+     --> pgd Y
> + *      00:01.0 |      B |--.                    /
> + *              +--------+   \                  |
> + *      00:01.1 |      B |----+   PASID tables  |
> + *              +--------+     \   of domain B  |
> + *                              '->+--------+   |
> + *                               0 |        |-- | --> io_pgtable
> + *                                 +--------+   |
> + *                               1 |        |   |
> + *                                 +--------+   |
> + *                               2 |        |---'
> + *                                 +--------+
> + *                               3 |        |
> + *                                 +--------+
> + *
I am a little confused about domain vs. pasid relationship. If
each domain represents a address space, should there be a domain for
each pasid?

> + * With this model, a single call binds all devices in a given
> domain to an
> + * address space. Other devices in the domain will get the same bond
> implicitly.
> + * However, users must issue one bind() for each device, because
> IOMMUs may
> + * implement SVA differently. Furthermore, mandating one bind() per
> device
> + * allows the driver to perform sanity-checks on device capabilities.
> + *
> + * On Arm and AMD IOMMUs, entry 0 of the PASID table can be used to
> hold
> + * non-PASID translations. In this case PASID 0 is reserved and
> entry 0 points
> + * to the io_pgtable base. On Intel IOMMU, the io_pgtable base would
> be held in
> + * the device table and PASID 0 would be available to the allocator.
> + */
> +
> +struct iommu_bond {
> +	struct io_mm		*io_mm;
> +	struct device		*dev;
> +	struct iommu_domain	*domain;
> +
> +	struct list_head	mm_head;
> +	struct list_head	dev_head;
> +	struct list_head	domain_head;
> +
> +	void			*drvdata;
> +};
> +
> +/*
> + * Because we're using an IDR, PASIDs are limited to 31 bits (the
> sign bit is
> + * used for returning errors). In practice implementations will use
> at most 20
> + * bits, which is the PCI limit.
> + */
> +static DEFINE_IDR(iommu_pasid_idr);
> +
> +/*
> + * For the moment this is an all-purpose lock. It serializes
> + * access/modifications to bonds, access/modifications to the PASID
> IDR, and
> + * changes to io_mm refcount as well.
> + */
> +static DEFINE_SPINLOCK(iommu_sva_lock);
> +
> +static struct io_mm *
> +io_mm_alloc(struct iommu_domain *domain, struct device *dev,
> +	    struct mm_struct *mm, unsigned long flags)
> +{
> +	int ret;
> +	int pasid;
> +	struct io_mm *io_mm;
> +	struct iommu_sva_param *param = dev->iommu_param->sva_param;
> +
> +	if (!domain->ops->mm_alloc || !domain->ops->mm_free)
> +		return ERR_PTR(-ENODEV);
> +
> +	io_mm = domain->ops->mm_alloc(domain, mm, flags);
> +	if (IS_ERR(io_mm))
> +		return io_mm;
> +	if (!io_mm)
> +		return ERR_PTR(-ENOMEM);
> +
> +	/*
> +	 * The mm must not be freed until after the driver frees the
> io_mm
> +	 * (which may involve unpinning the CPU ASID for instance,
> requiring a
> +	 * valid mm struct.)
> +	 */
> +	mmgrab(mm);
> +
> +	io_mm->flags		= flags;
> +	io_mm->mm		= mm;
> +	io_mm->release		= domain->ops->mm_free;
> +	INIT_LIST_HEAD(&io_mm->devices);
> +
> +	idr_preload(GFP_KERNEL);
> +	spin_lock(&iommu_sva_lock);
> +	pasid = idr_alloc(&iommu_pasid_idr, io_mm, param->min_pasid,
> +			  param->max_pasid + 1, GFP_ATOMIC);
> +	io_mm->pasid = pasid;
> +	spin_unlock(&iommu_sva_lock);
> +	idr_preload_end();
> +
> +	if (pasid < 0) {
> +		ret = pasid;
> +		goto err_free_mm;
> +	}
> +
> +	/* TODO: keep track of mm. For the moment, abort. */
> +	ret = -ENOSYS;
> +	spin_lock(&iommu_sva_lock);
> +	idr_remove(&iommu_pasid_idr, io_mm->pasid);
> +	spin_unlock(&iommu_sva_lock);
> +
> +err_free_mm:
> +	domain->ops->mm_free(io_mm);
> +	mmdrop(mm);
> +
> +	return ERR_PTR(ret);
> +}
> +
> +static void io_mm_free(struct io_mm *io_mm)
> +{
> +	struct mm_struct *mm = io_mm->mm;
> +
> +	io_mm->release(io_mm);
> +	mmdrop(mm);
> +}
> +
> +static void io_mm_release(struct kref *kref)
> +{
> +	struct io_mm *io_mm;
> +
> +	io_mm = container_of(kref, struct io_mm, kref);
> +	WARN_ON(!list_empty(&io_mm->devices));
> +
> +	idr_remove(&iommu_pasid_idr, io_mm->pasid);
> +
> +	io_mm_free(io_mm);
> +}
> +
> +/*
> + * Returns non-zero if a reference to the io_mm was successfully
> taken.
> + * Returns zero if the io_mm is being freed and should not be used.
> + */
> +static int io_mm_get_locked(struct io_mm *io_mm)
> +{
> +	if (io_mm)
> +		return kref_get_unless_zero(&io_mm->kref);
> +
> +	return 0;
> +}
> +
> +static void io_mm_put_locked(struct io_mm *io_mm)
> +{
> +	kref_put(&io_mm->kref, io_mm_release);
> +}
> +
> +static void io_mm_put(struct io_mm *io_mm)
> +{
> +	spin_lock(&iommu_sva_lock);
> +	io_mm_put_locked(io_mm);
> +	spin_unlock(&iommu_sva_lock);
> +}
> +
> +static int io_mm_attach(struct iommu_domain *domain, struct device
> *dev,
> +			struct io_mm *io_mm, void *drvdata)
> +{
> +	int ret;
> +	bool attach_domain = true;
> +	int pasid = io_mm->pasid;
> +	struct iommu_bond *bond, *tmp;
> +	struct iommu_sva_param *param = dev->iommu_param->sva_param;
> +
> +	if (!domain->ops->mm_attach || !domain->ops->mm_detach)
> +		return -ENODEV;
> +
> +	if (pasid > param->max_pasid || pasid < param->min_pasid)
> +		return -ERANGE;
> +
> +	bond = kzalloc(sizeof(*bond), GFP_KERNEL);
> +	if (!bond)
> +		return -ENOMEM;
> +
> +	bond->domain		= domain;
> +	bond->io_mm		= io_mm;
> +	bond->dev		= dev;
> +	bond->drvdata		= drvdata;
> +
> +	spin_lock(&iommu_sva_lock);
> +	/*
> +	 * Check if this io_mm is already bound to the domain. In
> which case the
> +	 * IOMMU driver doesn't have to install the PASID table
> entry.
> +	 */
> +	list_for_each_entry(tmp, &domain->mm_list, domain_head) {
> +		if (tmp->io_mm == io_mm) {
> +			attach_domain = false;
> +			break;
> +		}
> +	}
> +
> +	ret = domain->ops->mm_attach(domain, dev, io_mm,
> attach_domain);
> +	if (ret) {
> +		kfree(bond);
> +		spin_unlock(&iommu_sva_lock);
> +		return ret;
> +	}
> +
> +	list_add(&bond->mm_head, &io_mm->devices);
> +	list_add(&bond->domain_head, &domain->mm_list);
> +	list_add(&bond->dev_head, &param->mm_list);
> +	spin_unlock(&iommu_sva_lock);
> +
> +	return 0;
> +}
> +
> +static void io_mm_detach_locked(struct iommu_bond *bond)
> +{
> +	struct iommu_bond *tmp;
> +	bool detach_domain = true;
> +	struct iommu_domain *domain = bond->domain;
> +
> +	list_for_each_entry(tmp, &domain->mm_list, domain_head) {
> +		if (tmp->io_mm == bond->io_mm && tmp->dev !=
> bond->dev) {
> +			detach_domain = false;
> +			break;
> +		}
> +	}
> +
> +	domain->ops->mm_detach(domain, bond->dev, bond->io_mm,
> detach_domain); +
> +	list_del(&bond->mm_head);
> +	list_del(&bond->domain_head);
> +	list_del(&bond->dev_head);
> +	io_mm_put_locked(bond->io_mm);
> +
> +	kfree(bond);
> +}
>  
>  /**
>   * iommu_sva_device_init() - Initialize Shared Virtual Addressing
> for a device @@ -47,6 +337,7 @@ int iommu_sva_device_init(struct
> device *dev, unsigned long features, 
>  	param->features		= features;
>  	param->max_pasid	= max_pasid;
> +	INIT_LIST_HEAD(&param->mm_list);
>  
>  	/*
>  	 * IOMMU driver updates the limits depending on the IOMMU
> and device @@ -114,13 +405,87 @@
> EXPORT_SYMBOL_GPL(iommu_sva_device_shutdown); int
> __iommu_sva_bind_device(struct device *dev, struct mm_struct *mm, int
> *pasid, unsigned long flags, void *drvdata) {
> -	return -ENOSYS; /* TODO */
> +	int i, ret = 0;
> +	struct io_mm *io_mm = NULL;
> +	struct iommu_domain *domain;
> +	struct iommu_bond *bond = NULL, *tmp;
> +	struct iommu_sva_param *param = dev->iommu_param->sva_param;
> +
> +	domain = iommu_get_domain_for_dev(dev);
> +	if (!domain)
> +		return -EINVAL;
> +
> +	/*
> +	 * The device driver does not call sva_device_init/shutdown
> and
> +	 * bind/unbind concurrently, so no need to take the param
> lock.
> +	 */
> +	if (WARN_ON_ONCE(!param) || (flags & ~param->features))
> +		return -EINVAL;
> +
> +	/* If an io_mm already exists, use it */
> +	spin_lock(&iommu_sva_lock);
> +	idr_for_each_entry(&iommu_pasid_idr, io_mm, i) {
> +		if (io_mm->mm == mm && io_mm_get_locked(io_mm)) {
> +			/* ... Unless it's already bound to this
> device */
> +			list_for_each_entry(tmp, &io_mm->devices,
> mm_head) {
> +				if (tmp->dev == dev) {
> +					bond = tmp;
> +					io_mm_put_locked(io_mm);
> +					break;
> +				}
> +			}
> +			break;
> +		}
> +	}
> +	spin_unlock(&iommu_sva_lock);
> +
> +	if (bond)
> +		return -EEXIST;
> +
> +	/* Require identical features within an io_mm for now */
> +	if (io_mm && (flags != io_mm->flags)) {
> +		io_mm_put(io_mm);
> +		return -EDOM;
> +	}
> +
> +	if (!io_mm) {
> +		io_mm = io_mm_alloc(domain, dev, mm, flags);
> +		if (IS_ERR(io_mm))
> +			return PTR_ERR(io_mm);
> +	}
> +
> +	ret = io_mm_attach(domain, dev, io_mm, drvdata);
> +	if (ret)
> +		io_mm_put(io_mm);
> +	else
> +		*pasid = io_mm->pasid;
> +
> +	return ret;
>  }
>  EXPORT_SYMBOL_GPL(__iommu_sva_bind_device);
>  
>  int __iommu_sva_unbind_device(struct device *dev, int pasid)
>  {
> -	return -ENOSYS; /* TODO */
> +	int ret = -ESRCH;
> +	struct iommu_domain *domain;
> +	struct iommu_bond *bond = NULL;
> +	struct iommu_sva_param *param = dev->iommu_param->sva_param;
> +
> +	domain = iommu_get_domain_for_dev(dev);
> +	if (!param || WARN_ON(!domain))
> +		return -EINVAL;
> +
> +	spin_lock(&iommu_sva_lock);
> +	list_for_each_entry(bond, &param->mm_list, dev_head) {
> +		if (bond->io_mm->pasid == pasid) {
> +			io_mm_detach_locked(bond);
> +			ret = 0;
> +			break;
> +		}
> +	}
> +	spin_unlock(&iommu_sva_lock);
> +
> +	return ret;
>  }
>  EXPORT_SYMBOL_GPL(__iommu_sva_unbind_device);
>  
> @@ -132,6 +497,15 @@ EXPORT_SYMBOL_GPL(__iommu_sva_unbind_device);
>   */
>  void __iommu_sva_unbind_dev_all(struct device *dev)
>  {
> -	/* TODO */
> +	struct iommu_sva_param *param;
> +	struct iommu_bond *bond, *next;
> +
> +	param = dev->iommu_param->sva_param;
> +	if (param) {
> +		spin_lock(&iommu_sva_lock);
> +		list_for_each_entry_safe(bond, next,
> &param->mm_list, dev_head)
> +			io_mm_detach_locked(bond);
> +		spin_unlock(&iommu_sva_lock);
> +	}
>  }
>  EXPORT_SYMBOL_GPL(__iommu_sva_unbind_dev_all);
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index bd2819deae5b..333801e1519c 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -1463,6 +1463,7 @@ static struct iommu_domain
> *__iommu_domain_alloc(struct bus_type *bus, domain->type = type;
>  	/* Assume all sizes by default; the driver may override this
> later */ domain->pgsize_bitmap  = bus->iommu_ops->pgsize_bitmap;
> +	INIT_LIST_HEAD(&domain->mm_list);
>  
>  	return domain;
>  }
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index da59c20c4f12..d5f21719a5a0 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -100,6 +100,20 @@ struct iommu_domain {
>  	void *handler_token;
>  	struct iommu_domain_geometry geometry;
>  	void *iova_cookie;
> +
> +	struct list_head mm_list;
> +};
> +
> +struct io_mm {
> +	int			pasid;
> +	/* IOMMU_SVA_FEAT_* */
> +	unsigned long		flags;
> +	struct list_head	devices;
> +	struct kref		kref;
> +	struct mm_struct	*mm;
> +
> +	/* Release callback for this mm */
> +	void (*release)(struct io_mm *io_mm);
>  };
>  
>  enum iommu_cap {
> @@ -216,6 +230,7 @@ struct iommu_sva_param {
>  	unsigned long features;
>  	unsigned int min_pasid;
>  	unsigned int max_pasid;
> +	struct list_head mm_list;
>  };
>  
>  /**
> @@ -227,6 +242,11 @@ struct iommu_sva_param {
>   * @detach_dev: detach device from an iommu domain
>   * @sva_device_init: initialize Shared Virtual Adressing for a device
>   * @sva_device_shutdown: shutdown Shared Virtual Adressing for a
> device
> + * @mm_alloc: allocate io_mm
> + * @mm_free: free io_mm
> + * @mm_attach: attach io_mm to a device. Install PASID entry if
> necessary
> + * @mm_detach: detach io_mm from a device. Remove PASID entry and
> + *             flush associated TLB entries.
>   * @map: map a physically contiguous memory region to an iommu domain
>   * @unmap: unmap a physically contiguous memory region from an iommu
> domain
>   * @map_sg: map a scatter-gather list of physically contiguous
> memory chunks @@ -268,6 +288,14 @@ struct iommu_ops {
>  			       struct iommu_sva_param *param);
>  	void (*sva_device_shutdown)(struct device *dev,
>  				    struct iommu_sva_param *param);
> +	struct io_mm *(*mm_alloc)(struct iommu_domain *domain,
> +				  struct mm_struct *mm,
> +				  unsigned long flags);
> +	void (*mm_free)(struct io_mm *io_mm);
> +	int (*mm_attach)(struct iommu_domain *domain, struct device
> *dev,
> +			 struct io_mm *io_mm, bool attach_domain);
> +	void (*mm_detach)(struct iommu_domain *domain, struct device
> *dev,
> +			  struct io_mm *io_mm, bool detach_domain);
>  	int (*map)(struct iommu_domain *domain, unsigned long iova,
>  		   phys_addr_t paddr, size_t size, int prot);
>  	size_t (*unmap)(struct iommu_domain *domain, unsigned long
> iova,

[Jacob Pan]

Jean-Philippe Brucker May 17, 2018, 10:02 a.m. UTC | #2

On 17/05/18 00:31, Jacob Pan wrote:
> On Fri, 11 May 2018 20:06:04 +0100
> I am a little confused about domain vs. pasid relationship. If
> each domain represents a address space, should there be a domain for
> each pasid?

I don't think there is a formal definition, but from previous discussion
the consensus seems to be: domains are a collection of devices that have
the same virtual address spaces (one or many).

Keeping that definition makes things easier, in my opinion. Some time
ago, I did try to represent PASIDs using "subdomains" (introducing a
hierarchy of struct iommu_domain), but it required invasive changes in
the IOMMU subsystem and probably all over the tree.

You do need some kind of "root domain" for each device, so that
"iommu_get_domain_for_dev()" still makes sense. That root domain doesn't
have a single address space but a collection of subdomains. If you need
this anyway, representing a PASID with an iommu_domain doesn't seem
preferable than using a different structure (io_mm), because they don't
have anything in common.

Thanks,
Jean

Jonathan Cameron May 17, 2018, 2:25 p.m. UTC | #3

On Fri, 11 May 2018 20:06:04 +0100
Jean-Philippe Brucker <jean-philippe.brucker@arm.com> wrote:

> Allocate IOMMU mm structures and binding them to devices. Four operations
> are added to IOMMU drivers:
> 
> * mm_alloc(): to create an io_mm structure and perform architecture-
>   specific operations required to grab the process (for instance on ARM,
>   pin down the CPU ASID so that the process doesn't get assigned a new
>   ASID on rollover).
> 
>   There is a single valid io_mm structure per Linux mm. Future extensions
>   may also use io_mm for kernel-managed address spaces, populated with
>   map()/unmap() calls instead of bound to process address spaces. This
>   patch focuses on "shared" io_mm.
> 
> * mm_attach(): attach an mm to a device. The IOMMU driver checks that the
>   device is capable of sharing an address space, and writes the PASID
>   table entry to install the pgd.
> 
>   Some IOMMU drivers will have a single PASID table per domain, for
>   convenience. Other can implement it differently but to help these
>   drivers, mm_attach and mm_detach take 'attach_domain' and
>   'detach_domain' parameters, that tell whether they need to set and clear
>   the PASID entry or only send the required TLB invalidations.
> 
> * mm_detach(): detach an mm from a device. The IOMMU driver removes the
>   PASID table entry and invalidates the IOTLBs.
> 
> * mm_free(): free a structure allocated by mm_alloc(), and let arch
>   release the process.
> 
> mm_attach and mm_detach operations are serialized with a spinlock. When
> trying to optimize this code, we should at least prevent concurrent
> attach()/detach() on the same domain (so multi-level PASID table code can
> allocate tables lazily). mm_alloc() can sleep, but mm_free must not
> (because we'll have to call it from call_srcu later on).
> 
> At the moment we use an IDR for allocating PASIDs and retrieving contexts.
> We also use a single spinlock. These can be refined and optimized later (a
> custom allocator will be needed for top-down PASID allocation).
> 
> Keeping track of address spaces requires the use of MMU notifiers.
> Handling process exit with regard to unbind() is tricky, so it is left for
> another patch and we explicitly fail mm_alloc() for the moment.
> 
> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>

A few minor bits and bobs inline.  Looks good in general + nice diags!

Thanks,

Jonathan

> 
> ---
> v1->v2: sanity-check of flags
> ---
>  drivers/iommu/iommu-sva.c | 380 +++++++++++++++++++++++++++++++++++++-
>  drivers/iommu/iommu.c     |   1 +
>  include/linux/iommu.h     |  28 +++
>  3 files changed, 406 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/iommu/iommu-sva.c b/drivers/iommu/iommu-sva.c
> index 8d98f9c09864..6ac679c48f3c 100644
> --- a/drivers/iommu/iommu-sva.c
> +++ b/drivers/iommu/iommu-sva.c
> @@ -5,8 +5,298 @@
>   * Copyright (C) 2018 ARM Ltd.
>   */
>  
> +#include <linux/idr.h>
>  #include <linux/iommu.h>
> +#include <linux/sched/mm.h>
>  #include <linux/slab.h>
> +#include <linux/spinlock.h>
> +
> +/**
> + * DOC: io_mm model
> + *
> + * The io_mm keeps track of process address spaces shared between CPU and IOMMU.
> + * The following example illustrates the relation between structures
> + * iommu_domain, io_mm and iommu_bond. An iommu_bond is a link between io_mm and
> + * device. A device can have multiple io_mm and an io_mm may be bound to
> + * multiple devices.
> + *              ___________________________
> + *             |  IOMMU domain A           |
> + *             |  ________________         |
> + *             | |  IOMMU group   |        +------- io_pgtables
> + *             | |                |        |
> + *             | |   dev 00:00.0 ----+------- bond --- io_mm X
> + *             | |________________|   \    |
> + *             |                       '----- bond ---.
> + *             |___________________________|           \
> + *              ___________________________             \
> + *             |  IOMMU domain B           |           io_mm Y
> + *             |  ________________         |           / /
> + *             | |  IOMMU group   |        |          / /
> + *             | |                |        |         / /
> + *             | |   dev 00:01.0 ------------ bond -' /
> + *             | |   dev 00:01.1 ------------ bond --'
> + *             | |________________|        |
> + *             |                           +------- io_pgtables
> + *             |___________________________|
> + *
> + * In this example, device 00:00.0 is in domain A, devices 00:01.* are in domain
> + * B. All devices within the same domain access the same address spaces. Device
> + * 00:00.0 accesses address spaces X and Y, each corresponding to an mm_struct.
> + * Devices 00:01.* only access address space Y. In addition each
> + * IOMMU_DOMAIN_DMA domain has a private address space, io_pgtable, that is
> + * managed with iommu_map()/iommu_unmap(), and isn't shared with the CPU MMU.
> + *
> + * To obtain the above configuration, users would for instance issue the
> + * following calls:
> + *
> + *     iommu_sva_bind_device(dev 00:00.0, mm X, ...) -> PASID 1
> + *     iommu_sva_bind_device(dev 00:00.0, mm Y, ...) -> PASID 2
> + *     iommu_sva_bind_device(dev 00:01.0, mm Y, ...) -> PASID 2
> + *     iommu_sva_bind_device(dev 00:01.1, mm Y, ...) -> PASID 2
> + *
> + * A single Process Address Space ID (PASID) is allocated for each mm. In the
> + * example, devices use PASID 1 to read/write into address space X and PASID 2
> + * to read/write into address space Y.
> + *
> + * Hardware tables describing this configuration in the IOMMU would typically
> + * look like this:
> + *
> + *                                PASID tables
> + *                                 of domain A
> + *                              .->+--------+
> + *                             / 0 |        |-------> io_pgtable
> + *                            /    +--------+
> + *            Device tables  /   1 |        |-------> pgd X
> + *              +--------+  /      +--------+
> + *      00:00.0 |      A |-'     2 |        |--.
> + *              +--------+         +--------+   \
> + *              :        :       3 |        |    \
> + *              +--------+         +--------+     --> pgd Y
> + *      00:01.0 |      B |--.                    /
> + *              +--------+   \                  |
> + *      00:01.1 |      B |----+   PASID tables  |
> + *              +--------+     \   of domain B  |
> + *                              '->+--------+   |
> + *                               0 |        |-- | --> io_pgtable
> + *                                 +--------+   |
> + *                               1 |        |   |
> + *                                 +--------+   |
> + *                               2 |        |---'
> + *                                 +--------+
> + *                               3 |        |
> + *                                 +--------+
> + *
> + * With this model, a single call binds all devices in a given domain to an
> + * address space. Other devices in the domain will get the same bond implicitly.
> + * However, users must issue one bind() for each device, because IOMMUs may
> + * implement SVA differently. Furthermore, mandating one bind() per device
> + * allows the driver to perform sanity-checks on device capabilities.
> + *
> + * On Arm and AMD IOMMUs, entry 0 of the PASID table can be used to hold
> + * non-PASID translations. In this case PASID 0 is reserved and entry 0 points
> + * to the io_pgtable base. On Intel IOMMU, the io_pgtable base would be held in
> + * the device table and PASID 0 would be available to the allocator.
> + */
> +
> +struct iommu_bond {
> +	struct io_mm		*io_mm;
> +	struct device		*dev;
> +	struct iommu_domain	*domain;
> +
> +	struct list_head	mm_head;
> +	struct list_head	dev_head;
> +	struct list_head	domain_head;
> +
> +	void			*drvdata;
> +};
> +
> +/*
> + * Because we're using an IDR, PASIDs are limited to 31 bits (the sign bit is
> + * used for returning errors). In practice implementations will use at most 20
> + * bits, which is the PCI limit.
> + */
> +static DEFINE_IDR(iommu_pasid_idr);
> +
> +/*
> + * For the moment this is an all-purpose lock. It serializes
> + * access/modifications to bonds, access/modifications to the PASID IDR, and
> + * changes to io_mm refcount as well.
> + */
> +static DEFINE_SPINLOCK(iommu_sva_lock);
> +
> +static struct io_mm *
> +io_mm_alloc(struct iommu_domain *domain, struct device *dev,
> +	    struct mm_struct *mm, unsigned long flags)
> +{
> +	int ret;
> +	int pasid;
> +	struct io_mm *io_mm;
> +	struct iommu_sva_param *param = dev->iommu_param->sva_param;
> +
> +	if (!domain->ops->mm_alloc || !domain->ops->mm_free)
> +		return ERR_PTR(-ENODEV);
> +
> +	io_mm = domain->ops->mm_alloc(domain, mm, flags);
> +	if (IS_ERR(io_mm))
> +		return io_mm;
> +	if (!io_mm)
> +		return ERR_PTR(-ENOMEM);
> +
> +	/*
> +	 * The mm must not be freed until after the driver frees the io_mm
> +	 * (which may involve unpinning the CPU ASID for instance, requiring a
> +	 * valid mm struct.)
> +	 */
> +	mmgrab(mm);
> +
> +	io_mm->flags		= flags;
> +	io_mm->mm		= mm;
> +	io_mm->release		= domain->ops->mm_free;
> +	INIT_LIST_HEAD(&io_mm->devices);
> +
> +	idr_preload(GFP_KERNEL);
> +	spin_lock(&iommu_sva_lock);
> +	pasid = idr_alloc(&iommu_pasid_idr, io_mm, param->min_pasid,
> +			  param->max_pasid + 1, GFP_ATOMIC);

I'd expect the IDR cleanup to be in io_mm_free as that would 'match'
against io_mm_alloc but it's in io_mm_release just before the io_mm_free
call, perhaps move it or am I missing something?

Hmm. This is reworked in patch 5 to use call rcu to do the free.  I suppose
we may be burning an idr entry if we take a while to get round to the
free..  If so a comment to explain this would be great.

> +	io_mm->pasid = pasid;
> +	spin_unlock(&iommu_sva_lock);
> +	idr_preload_end();
> +
> +	if (pasid < 0) {
> +		ret = pasid;
> +		goto err_free_mm;
> +	}
> +
> +	/* TODO: keep track of mm. For the moment, abort. */

From later patches, I can now see why we didn't init the kref
here, but perhaps a comment would make that clear rather than
people checking it is correctly used throughout?  Actually just grab
the comment from patch 5 and put it in this one and that will
do the job nicely.

> +	ret = -ENOSYS;
> +	spin_lock(&iommu_sva_lock);
> +	idr_remove(&iommu_pasid_idr, io_mm->pasid);
> +	spin_unlock(&iommu_sva_lock);
> +
> +err_free_mm:
> +	domain->ops->mm_free(io_mm);

Really minor, but you now have io_mm->release set so to keep
this obviously the same as the io_mm_free path, perhaps call
that rather than mm_free directly.

> +	mmdrop(mm);
> +
> +	return ERR_PTR(ret);
> +}
> +
> +static void io_mm_free(struct io_mm *io_mm)
> +{
> +	struct mm_struct *mm = io_mm->mm;
> +
> +	io_mm->release(io_mm);
> +	mmdrop(mm);
> +}
> +
> +static void io_mm_release(struct kref *kref)
> +{
> +	struct io_mm *io_mm;
> +
> +	io_mm = container_of(kref, struct io_mm, kref);
> +	WARN_ON(!list_empty(&io_mm->devices));
> +
> +	idr_remove(&iommu_pasid_idr, io_mm->pasid);
> +
> +	io_mm_free(io_mm);
> +}
> +
> +/*
> + * Returns non-zero if a reference to the io_mm was successfully taken.
> + * Returns zero if the io_mm is being freed and should not be used.
> + */
> +static int io_mm_get_locked(struct io_mm *io_mm)
> +{
> +	if (io_mm)
> +		return kref_get_unless_zero(&io_mm->kref);
> +
> +	return 0;
> +}
> +
> +static void io_mm_put_locked(struct io_mm *io_mm)
> +{
> +	kref_put(&io_mm->kref, io_mm_release);
> +}
> +
> +static void io_mm_put(struct io_mm *io_mm)
> +{
> +	spin_lock(&iommu_sva_lock);
> +	io_mm_put_locked(io_mm);
> +	spin_unlock(&iommu_sva_lock);
> +}
> +
> +static int io_mm_attach(struct iommu_domain *domain, struct device *dev,
> +			struct io_mm *io_mm, void *drvdata)
> +{
> +	int ret;
> +	bool attach_domain = true;
> +	int pasid = io_mm->pasid;
> +	struct iommu_bond *bond, *tmp;
> +	struct iommu_sva_param *param = dev->iommu_param->sva_param;
> +
> +	if (!domain->ops->mm_attach || !domain->ops->mm_detach)
> +		return -ENODEV;
> +
> +	if (pasid > param->max_pasid || pasid < param->min_pasid)
> +		return -ERANGE;
> +
> +	bond = kzalloc(sizeof(*bond), GFP_KERNEL);
> +	if (!bond)
> +		return -ENOMEM;
> +
> +	bond->domain		= domain;
> +	bond->io_mm		= io_mm;
> +	bond->dev		= dev;
> +	bond->drvdata		= drvdata;
> +
> +	spin_lock(&iommu_sva_lock);
> +	/*
> +	 * Check if this io_mm is already bound to the domain. In which case the
> +	 * IOMMU driver doesn't have to install the PASID table entry.
> +	 */
> +	list_for_each_entry(tmp, &domain->mm_list, domain_head) {
> +		if (tmp->io_mm == io_mm) {
> +			attach_domain = false;
> +			break;
> +		}
> +	}
> +
> +	ret = domain->ops->mm_attach(domain, dev, io_mm, attach_domain);
> +	if (ret) {
> +		kfree(bond);
> +		spin_unlock(&iommu_sva_lock);
> +		return ret;
> +	}
> +
> +	list_add(&bond->mm_head, &io_mm->devices);
> +	list_add(&bond->domain_head, &domain->mm_list);
> +	list_add(&bond->dev_head, &param->mm_list);
> +	spin_unlock(&iommu_sva_lock);
> +
> +	return 0;
> +}
> +
> +static void io_mm_detach_locked(struct iommu_bond *bond)
> +{
> +	struct iommu_bond *tmp;
> +	bool detach_domain = true;
> +	struct iommu_domain *domain = bond->domain;
> +
> +	list_for_each_entry(tmp, &domain->mm_list, domain_head) {
> +		if (tmp->io_mm == bond->io_mm && tmp->dev != bond->dev) {
> +			detach_domain = false;
> +			break;
> +		}
> +	}
> +
> +	domain->ops->mm_detach(domain, bond->dev, bond->io_mm, detach_domain);
> +

I can't see an immediate reason to have a different order in her to the reverse of
the attach above.   So I think you should be detaching after the list_del calls.
If there is a reason, can we have a comment so I don't ask on v10.

> +	list_del(&bond->mm_head);
> +	list_del(&bond->domain_head);
> +	list_del(&bond->dev_head);
> +	io_mm_put_locked(bond->io_mm);
> +
> +	kfree(bond);
> +}
>  
>  /**
>   * iommu_sva_device_init() - Initialize Shared Virtual Addressing for a device
> @@ -47,6 +337,7 @@ int iommu_sva_device_init(struct device *dev, unsigned long features,
>  
>  	param->features		= features;
>  	param->max_pasid	= max_pasid;
> +	INIT_LIST_HEAD(&param->mm_list);
>  
>  	/*
>  	 * IOMMU driver updates the limits depending on the IOMMU and device
> @@ -114,13 +405,87 @@ EXPORT_SYMBOL_GPL(iommu_sva_device_shutdown);
>  int __iommu_sva_bind_device(struct device *dev, struct mm_struct *mm,
>  			    int *pasid, unsigned long flags, void *drvdata)
>  {
> -	return -ENOSYS; /* TODO */
> +	int i, ret = 0;
> +	struct io_mm *io_mm = NULL;
> +	struct iommu_domain *domain;
> +	struct iommu_bond *bond = NULL, *tmp;
> +	struct iommu_sva_param *param = dev->iommu_param->sva_param;
> +
> +	domain = iommu_get_domain_for_dev(dev);
> +	if (!domain)
> +		return -EINVAL;
> +
> +	/*
> +	 * The device driver does not call sva_device_init/shutdown and
> +	 * bind/unbind concurrently, so no need to take the param lock.
> +	 */
> +	if (WARN_ON_ONCE(!param) || (flags & ~param->features))
> +		return -EINVAL;
> +
> +	/* If an io_mm already exists, use it */
> +	spin_lock(&iommu_sva_lock);
> +	idr_for_each_entry(&iommu_pasid_idr, io_mm, i) {
> +		if (io_mm->mm == mm && io_mm_get_locked(io_mm)) {
> +			/* ... Unless it's already bound to this device */
> +			list_for_each_entry(tmp, &io_mm->devices, mm_head) {
> +				if (tmp->dev == dev) {
> +					bond = tmp;

Using bond for this is clear in a sense, but can we not just use ret
so it is obvious here that we are going to return -EEXIST?
At first glance I thought you were going to carry on with this bond
and couldn't work out why it would ever make sense to have two bonds
between a device an an io_mm (which it doesn't!)

> +					io_mm_put_locked(io_mm);
> +					break;
> +				}
> +			}
> +			break;
> +		}
> +	}
> +	spin_unlock(&iommu_sva_lock);
> +
> +	if (bond)
> +		return -EEXIST;
> +
> +	/* Require identical features within an io_mm for now */
> +	if (io_mm && (flags != io_mm->flags)) {
> +		io_mm_put(io_mm);
> +		return -EDOM;
> +	}
> +
> +	if (!io_mm) {
> +		io_mm = io_mm_alloc(domain, dev, mm, flags);
> +		if (IS_ERR(io_mm))
> +			return PTR_ERR(io_mm);
> +	}
> +
> +	ret = io_mm_attach(domain, dev, io_mm, drvdata);
> +	if (ret)
> +		io_mm_put(io_mm);
> +	else
> +		*pasid = io_mm->pasid;
> +
> +	return ret;
>  }
>  EXPORT_SYMBOL_GPL(__iommu_sva_bind_device);
>  
>  int __iommu_sva_unbind_device(struct device *dev, int pasid)
>  {
> -	return -ENOSYS; /* TODO */
> +	int ret = -ESRCH;
> +	struct iommu_domain *domain;
> +	struct iommu_bond *bond = NULL;
> +	struct iommu_sva_param *param = dev->iommu_param->sva_param;
> +
> +	domain = iommu_get_domain_for_dev(dev);
> +	if (!param || WARN_ON(!domain))
> +		return -EINVAL;
> +
> +	spin_lock(&iommu_sva_lock);
> +	list_for_each_entry(bond, &param->mm_list, dev_head) {
> +		if (bond->io_mm->pasid == pasid) {
> +			io_mm_detach_locked(bond);
> +			ret = 0;
> +			break;
> +		}
> +	}
> +	spin_unlock(&iommu_sva_lock);
> +
> +	return ret;
>  }
>  EXPORT_SYMBOL_GPL(__iommu_sva_unbind_device);
>  
> @@ -132,6 +497,15 @@ EXPORT_SYMBOL_GPL(__iommu_sva_unbind_device);
>   */
>  void __iommu_sva_unbind_dev_all(struct device *dev)
>  {
> -	/* TODO */
> +	struct iommu_sva_param *param;
> +	struct iommu_bond *bond, *next;
> +
> +	param = dev->iommu_param->sva_param;
> +	if (param) {
> +		spin_lock(&iommu_sva_lock);
> +		list_for_each_entry_safe(bond, next, &param->mm_list, dev_head)
> +			io_mm_detach_locked(bond);
> +		spin_unlock(&iommu_sva_lock);
> +	}
>  }
>  EXPORT_SYMBOL_GPL(__iommu_sva_unbind_dev_all);
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index bd2819deae5b..333801e1519c 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -1463,6 +1463,7 @@ static struct iommu_domain *__iommu_domain_alloc(struct bus_type *bus,
>  	domain->type = type;
>  	/* Assume all sizes by default; the driver may override this later */
>  	domain->pgsize_bitmap  = bus->iommu_ops->pgsize_bitmap;
> +	INIT_LIST_HEAD(&domain->mm_list);
>  
>  	return domain;
>  }
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index da59c20c4f12..d5f21719a5a0 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -100,6 +100,20 @@ struct iommu_domain {
>  	void *handler_token;
>  	struct iommu_domain_geometry geometry;
>  	void *iova_cookie;
> +
> +	struct list_head mm_list;
> +};
> +
> +struct io_mm {
> +	int			pasid;
> +	/* IOMMU_SVA_FEAT_* */
> +	unsigned long		flags;
> +	struct list_head	devices;
> +	struct kref		kref;
> +	struct mm_struct	*mm;
> +
> +	/* Release callback for this mm */
> +	void (*release)(struct io_mm *io_mm);
>  };
>  
>  enum iommu_cap {
> @@ -216,6 +230,7 @@ struct iommu_sva_param {
>  	unsigned long features;
>  	unsigned int min_pasid;
>  	unsigned int max_pasid;
> +	struct list_head mm_list;
>  };
>  
>  /**
> @@ -227,6 +242,11 @@ struct iommu_sva_param {
>   * @detach_dev: detach device from an iommu domain
>   * @sva_device_init: initialize Shared Virtual Adressing for a device
>   * @sva_device_shutdown: shutdown Shared Virtual Adressing for a device
> + * @mm_alloc: allocate io_mm
> + * @mm_free: free io_mm
> + * @mm_attach: attach io_mm to a device. Install PASID entry if necessary
> + * @mm_detach: detach io_mm from a device. Remove PASID entry and
> + *             flush associated TLB entries.
>   * @map: map a physically contiguous memory region to an iommu domain
>   * @unmap: unmap a physically contiguous memory region from an iommu domain
>   * @map_sg: map a scatter-gather list of physically contiguous memory chunks
> @@ -268,6 +288,14 @@ struct iommu_ops {
>  			       struct iommu_sva_param *param);
>  	void (*sva_device_shutdown)(struct device *dev,
>  				    struct iommu_sva_param *param);
> +	struct io_mm *(*mm_alloc)(struct iommu_domain *domain,
> +				  struct mm_struct *mm,
> +				  unsigned long flags);
> +	void (*mm_free)(struct io_mm *io_mm);
> +	int (*mm_attach)(struct iommu_domain *domain, struct device *dev,
> +			 struct io_mm *io_mm, bool attach_domain);
> +	void (*mm_detach)(struct iommu_domain *domain, struct device *dev,
> +			  struct io_mm *io_mm, bool detach_domain);
>  	int (*map)(struct iommu_domain *domain, unsigned long iova,
>  		   phys_addr_t paddr, size_t size, int prot);
>  	size_t (*unmap)(struct iommu_domain *domain, unsigned long iova,

Jean-Philippe Brucker May 21, 2018, 2:44 p.m. UTC | #4

On 17/05/18 15:25, Jonathan Cameron wrote:
>> +static struct io_mm *
>> +io_mm_alloc(struct iommu_domain *domain, struct device *dev,
>> +	    struct mm_struct *mm, unsigned long flags)
>> +{
>> +	int ret;
>> +	int pasid;
>> +	struct io_mm *io_mm;
>> +	struct iommu_sva_param *param = dev->iommu_param->sva_param;
>> +
>> +	if (!domain->ops->mm_alloc || !domain->ops->mm_free)
>> +		return ERR_PTR(-ENODEV);
>> +
>> +	io_mm = domain->ops->mm_alloc(domain, mm, flags);
>> +	if (IS_ERR(io_mm))
>> +		return io_mm;
>> +	if (!io_mm)
>> +		return ERR_PTR(-ENOMEM);
>> +
>> +	/*
>> +	 * The mm must not be freed until after the driver frees the io_mm
>> +	 * (which may involve unpinning the CPU ASID for instance, requiring a
>> +	 * valid mm struct.)
>> +	 */
>> +	mmgrab(mm);
>> +
>> +	io_mm->flags		= flags;
>> +	io_mm->mm		= mm;
>> +	io_mm->release		= domain->ops->mm_free;
>> +	INIT_LIST_HEAD(&io_mm->devices);
>> +
>> +	idr_preload(GFP_KERNEL);
>> +	spin_lock(&iommu_sva_lock);
>> +	pasid = idr_alloc(&iommu_pasid_idr, io_mm, param->min_pasid,
>> +			  param->max_pasid + 1, GFP_ATOMIC);
> 
> I'd expect the IDR cleanup to be in io_mm_free as that would 'match'
> against io_mm_alloc but it's in io_mm_release just before the io_mm_free
> call, perhaps move it or am I missing something?
> 
> Hmm. This is reworked in patch 5 to use call rcu to do the free.  I suppose
> we may be burning an idr entry if we take a while to get round to the
> free..  If so a comment to explain this would be great.

Ok, I'll see if I can come up with some comments for both patch 3 and 5.

>> +	io_mm->pasid = pasid;
>> +	spin_unlock(&iommu_sva_lock);
>> +	idr_preload_end();
>> +
>> +	if (pasid < 0) {
>> +		ret = pasid;
>> +		goto err_free_mm;
>> +	}
>> +
>> +	/* TODO: keep track of mm. For the moment, abort. */
> 
> From later patches, I can now see why we didn't init the kref
> here, but perhaps a comment would make that clear rather than
> people checking it is correctly used throughout?  Actually just grab
> the comment from patch 5 and put it in this one and that will
> do the job nicely.

Ok

>> +	ret = -ENOSYS;
>> +	spin_lock(&iommu_sva_lock);
>> +	idr_remove(&iommu_pasid_idr, io_mm->pasid);
>> +	spin_unlock(&iommu_sva_lock);
>> +
>> +err_free_mm:
>> +	domain->ops->mm_free(io_mm);
> 
> Really minor, but you now have io_mm->release set so to keep
> this obviously the same as the io_mm_free path, perhaps call
> that rather than mm_free directly.

Yes, makes sense

>> +static void io_mm_detach_locked(struct iommu_bond *bond)
>> +{
>> +	struct iommu_bond *tmp;
>> +	bool detach_domain = true;
>> +	struct iommu_domain *domain = bond->domain;
>> +
>> +	list_for_each_entry(tmp, &domain->mm_list, domain_head) {
>> +		if (tmp->io_mm == bond->io_mm && tmp->dev != bond->dev) {
>> +			detach_domain = false;
>> +			break;
>> +		}
>> +	}
>> +
>> +	domain->ops->mm_detach(domain, bond->dev, bond->io_mm, detach_domain);
>> +
> 
> I can't see an immediate reason to have a different order in her to the reverse of
> the attach above.   So I think you should be detaching after the list_del calls.
> If there is a reason, can we have a comment so I don't ask on v10.

I don't see a reason either right now, I'll see if it can be moved

> 
>> +	list_del(&bond->mm_head);
>> +	list_del(&bond->domain_head);
>> +	list_del(&bond->dev_head);
>> +	io_mm_put_locked(bond->io_mm);


>> +	/* If an io_mm already exists, use it */
>> +	spin_lock(&iommu_sva_lock);
>> +	idr_for_each_entry(&iommu_pasid_idr, io_mm, i) {
>> +		if (io_mm->mm == mm && io_mm_get_locked(io_mm)) {
>> +			/* ... Unless it's already bound to this device */
>> +			list_for_each_entry(tmp, &io_mm->devices, mm_head) {
>> +				if (tmp->dev == dev) {
>> +					bond = tmp;
> 
> Using bond for this is clear in a sense, but can we not just use ret
> so it is obvious here that we are going to return -EEXIST?
> At first glance I thought you were going to carry on with this bond
> and couldn't work out why it would ever make sense to have two bonds
> between a device an an io_mm (which it doesn't!)

Yes, using ret is nicer

Thanks,
Jean

Jacob Pan May 22, 2018, 4:43 p.m. UTC | #5

On Thu, 17 May 2018 11:02:42 +0100
Jean-Philippe Brucker <jean-philippe.brucker@arm.com> wrote:

> On 17/05/18 00:31, Jacob Pan wrote:
> > On Fri, 11 May 2018 20:06:04 +0100
> > I am a little confused about domain vs. pasid relationship. If
> > each domain represents a address space, should there be a domain for
> > each pasid?  
> 
> I don't think there is a formal definition, but from previous
> discussion the consensus seems to be: domains are a collection of
> devices that have the same virtual address spaces (one or many).
> 
> Keeping that definition makes things easier, in my opinion. Some time
> ago, I did try to represent PASIDs using "subdomains" (introducing a
> hierarchy of struct iommu_domain), but it required invasive changes in
> the IOMMU subsystem and probably all over the tree.
> 
> You do need some kind of "root domain" for each device, so that
> "iommu_get_domain_for_dev()" still makes sense. That root domain
> doesn't have a single address space but a collection of subdomains.
> If you need this anyway, representing a PASID with an iommu_domain
> doesn't seem preferable than using a different structure (io_mm),
> because they don't have anything in common.
> 
My main concern is the PASID table storage. If PASID table storage
is tied to domain, it is ok to scale up, i.e. multiple devices in a
domain share a single PASID table. But to scale down, e.g. further
partition device with VFIO mdev for assignment, each mdev may get its
own domain via vfio. But there is no IOMMU storage for PASID table at
mdev device level. Perhaps we do need root domain or some parent-child
relationship to locate PASID table.

> Thanks,
> Jean

[Jacob Pan]

Jean-Philippe Brucker May 24, 2018, 11:44 a.m. UTC | #6

On 22/05/18 17:43, Jacob Pan wrote:
> On Thu, 17 May 2018 11:02:42 +0100
> Jean-Philippe Brucker <jean-philippe.brucker@arm.com> wrote:
> 
>> On 17/05/18 00:31, Jacob Pan wrote:
>>> On Fri, 11 May 2018 20:06:04 +0100
>>> I am a little confused about domain vs. pasid relationship. If
>>> each domain represents a address space, should there be a domain for
>>> each pasid?  
>>
>> I don't think there is a formal definition, but from previous
>> discussion the consensus seems to be: domains are a collection of
>> devices that have the same virtual address spaces (one or many).
>>
>> Keeping that definition makes things easier, in my opinion. Some time
>> ago, I did try to represent PASIDs using "subdomains" (introducing a
>> hierarchy of struct iommu_domain), but it required invasive changes in
>> the IOMMU subsystem and probably all over the tree.
>>
>> You do need some kind of "root domain" for each device, so that
>> "iommu_get_domain_for_dev()" still makes sense. That root domain
>> doesn't have a single address space but a collection of subdomains.
>> If you need this anyway, representing a PASID with an iommu_domain
>> doesn't seem preferable than using a different structure (io_mm),
>> because they don't have anything in common.
>>
> My main concern is the PASID table storage. If PASID table storage
> is tied to domain, it is ok to scale up, i.e. multiple devices in a
> domain share a single PASID table. But to scale down, e.g. further
> partition device with VFIO mdev for assignment, each mdev may get its
> own domain via vfio. But there is no IOMMU storage for PASID table at
> mdev device level. Perhaps we do need root domain or some parent-child
> relationship to locate PASID table.

Interesting, I hadn't thought about this use-case before. At first I
thought you were talking about mdev devices assigned to VMs, but I think
you're referring to mdevs assigned to userspace drivers instead? Out of
curiosity, is it only theoretical or does someone actually need this?

I don't think mdev for VM assignment are compatible with PASID, at least
not when the IOMMU is involved. I usually ignore mdev in my reasoning
because, as far as I know, currently they only affect devices that have
their own MMU, and IOMMU domains don't come into play. However, if a
device was backed by the IOMMU, and the device driver wanted to
partition it into mdevs, then users would be tempted to assign mdev1 to
VM1 and mdev2 to VM2.

It doesn't work with PASID, because the PASID spaces of VM1 and VM2 will
conflict. If both VM1 and VM2 allocate PASID #1, then the host has to
somehow arbitrate device accesses, for example scheduling first mdev1
then mdev2. That's possible if the device driver is in charge of the
MMU, but not really compatible with the IOMMU.

So in the IOMMU subsystem, for assigning devices to VMs the only
model that makes sense is SR-IOV, where each VF/mdev has its own RID and
its own PASID table. In that case you'd get one IOMMU domain per VF.

But considering userspace drivers in the host alone, it might make sense
to partition a device into mdevs and assign them to multiple processes.
Interestingly this scheme still doesn't work with the classic MAP/UNMAP
ioctl: since there is a single device context entry for all mdevs, the
mediating driver would have to catch all MAP/UNMAP ioctls and reject
those with IOVAs that overlap those of another mdev. It's doesn't seem
viable. But when using PASID then each mdev has its own address space,
and since PASIDs are allocated by the kernel there is no such conflict.

Anyway, I think this use-case can work with the current structures, if
mediating driver does the bind() instead of VFIO. That's necessary
because you can't let userspace program the PASID into the device, or
they would be able to access address spaces owned by other mdevs. Then
the mediating driver does the bind(), and keeps internal structures to
associate the process to the given mdev.

Thanks,
Jean

Ilias Apalodimas May 24, 2018, 11:50 a.m. UTC | #7

> Interesting, I hadn't thought about this use-case before. At first I
> thought you were talking about mdev devices assigned to VMs, but I think
> you're referring to mdevs assigned to userspace drivers instead? Out of
> curiosity, is it only theoretical or does someone actually need this?

There has been some non upstreamed efforts to have mdev and produce userspace
drivers. Huawei is using it on what they call "wrapdrive" for crypto devices and
we did a proof of concept for ethernet interfaces. At the time we choose not to
involve the IOMMU for the reason you mentioned, but having it there would be
good.

Thanks
Ilias

Jean-Philippe Brucker May 24, 2018, 3:04 p.m. UTC | #8

On 24/05/18 12:50, Ilias Apalodimas wrote:
>> Interesting, I hadn't thought about this use-case before. At first I
>> thought you were talking about mdev devices assigned to VMs, but I think
>> you're referring to mdevs assigned to userspace drivers instead? Out of
>> curiosity, is it only theoretical or does someone actually need this?
> 
> There has been some non upstreamed efforts to have mdev and produce userspace
> drivers. Huawei is using it on what they call "wrapdrive" for crypto devices and
> we did a proof of concept for ethernet interfaces. At the time we choose not to
> involve the IOMMU for the reason you mentioned, but having it there would be
> good.

I'm guessing there were good reasons to do it that way but I wonder, is
it not simpler to just have the kernel driver create a /dev/foo, with a
standard ioctl/mmap/poll interface? Here VFIO adds a layer of
indirection, and since the mediating driver has to implement these
operations already, what is gained?

Thanks,
Jean

Ilias Apalodimas May 25, 2018, 6:33 a.m. UTC | #9

On Thu, May 24, 2018 at 04:04:39PM +0100, Jean-Philippe Brucker wrote:
> On 24/05/18 12:50, Ilias Apalodimas wrote:
> >> Interesting, I hadn't thought about this use-case before. At first I
> >> thought you were talking about mdev devices assigned to VMs, but I think
> >> you're referring to mdevs assigned to userspace drivers instead? Out of
> >> curiosity, is it only theoretical or does someone actually need this?
> > 
> > There has been some non upstreamed efforts to have mdev and produce userspace
> > drivers. Huawei is using it on what they call "wrapdrive" for crypto devices and
> > we did a proof of concept for ethernet interfaces. At the time we choose not to
> > involve the IOMMU for the reason you mentioned, but having it there would be
> > good.
> 
> I'm guessing there were good reasons to do it that way but I wonder, is
> it not simpler to just have the kernel driver create a /dev/foo, with a
> standard ioctl/mmap/poll interface? Here VFIO adds a layer of
> indirection, and since the mediating driver has to implement these
> operations already, what is gained?
The best reason i can come up with is "common code". You already have one API
doing that for you so we replicate it in a /dev file?
The mdev approach still needs extentions to support what we tried to do (i.e
mdev bus might need yo have access on iommu_ops), but as far as i undestand it's
a possible case.
> 
> Thanks,
> Jean

Jonathan Cameron May 25, 2018, 8:39 a.m. UTC | #10

+CC Kenneth Lee

On Fri, 25 May 2018 09:33:11 +0300
Ilias Apalodimas <ilias.apalodimas@linaro.org> wrote:

> On Thu, May 24, 2018 at 04:04:39PM +0100, Jean-Philippe Brucker wrote:
> > On 24/05/18 12:50, Ilias Apalodimas wrote:  
> > >> Interesting, I hadn't thought about this use-case before. At first I
> > >> thought you were talking about mdev devices assigned to VMs, but I think
> > >> you're referring to mdevs assigned to userspace drivers instead? Out of
> > >> curiosity, is it only theoretical or does someone actually need this?  
> > > 
> > > There has been some non upstreamed efforts to have mdev and produce userspace
> > > drivers. Huawei is using it on what they call "wrapdrive" for crypto devices and
> > > we did a proof of concept for ethernet interfaces. At the time we choose not to
> > > involve the IOMMU for the reason you mentioned, but having it there would be
> > > good.  
> > 
> > I'm guessing there were good reasons to do it that way but I wonder, is
> > it not simpler to just have the kernel driver create a /dev/foo, with a
> > standard ioctl/mmap/poll interface? Here VFIO adds a layer of
> > indirection, and since the mediating driver has to implement these
> > operations already, what is gained?  
> The best reason i can come up with is "common code". You already have one API
> doing that for you so we replicate it in a /dev file?
> The mdev approach still needs extentions to support what we tried to do (i.e
> mdev bus might need yo have access on iommu_ops), but as far as i undestand it's
> a possible case.
> > 
> > Thanks,
> > Jean

Kenneth Lee June 11, 2018, 4:32 p.m. UTC | #11

On Fri, May 25, 2018 at 09:39:59AM +0100, Jonathan Cameron wrote:
> Date: Fri, 25 May 2018 09:39:59 +0100
> From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> To: Ilias Apalodimas <ilias.apalodimas@linaro.org>
> CC: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>,
>  "xieyisheng1@huawei.com" <xieyisheng1@huawei.com>, "kvm@vger.kernel.org"
>  <kvm@vger.kernel.org>, "linux-pci@vger.kernel.org"
>  <linux-pci@vger.kernel.org>, "xuzaibo@huawei.com" <xuzaibo@huawei.com>,
>  Will Deacon <Will.Deacon@arm.com>, "okaya@codeaurora.org"
>  <okaya@codeaurora.org>, "linux-mm@kvack.org" <linux-mm@kvack.org>,
>  "yi.l.liu@intel.com" <yi.l.liu@intel.com>, "ashok.raj@intel.com"
>  <ashok.raj@intel.com>, "tn@semihalf.com" <tn@semihalf.com>,
>  "joro@8bytes.org" <joro@8bytes.org>, "robdclark@gmail.com"
>  <robdclark@gmail.com>, "bharatku@xilinx.com" <bharatku@xilinx.com>,
>  "linux-acpi@vger.kernel.org" <linux-acpi@vger.kernel.org>,
>  "liudongdong3@huawei.com" <liudongdong3@huawei.com>, "rfranz@cavium.com"
>  <rfranz@cavium.com>, "devicetree@vger.kernel.org"
>  <devicetree@vger.kernel.org>, "kevin.tian@intel.com"
>  <kevin.tian@intel.com>, Jacob Pan <jacob.jun.pan@linux.intel.com>,
>  "alex.williamson@redhat.com" <alex.williamson@redhat.com>,
>  "rgummal@xilinx.com" <rgummal@xilinx.com>, "thunder.leizhen@huawei.com"
>  <thunder.leizhen@huawei.com>, "linux-arm-kernel@lists.infradead.org"
>  <linux-arm-kernel@lists.infradead.org>, "shunyong.yang@hxt-semitech.com"
>  <shunyong.yang@hxt-semitech.com>, "dwmw2@infradead.org"
>  <dwmw2@infradead.org>, "liubo95@huawei.com" <liubo95@huawei.com>,
>  "jcrouse@codeaurora.org" <jcrouse@codeaurora.org>,
>  "iommu@lists.linux-foundation.org" <iommu@lists.linux-foundation.org>,
>  Robin Murphy <Robin.Murphy@arm.com>, "christian.koenig@amd.com"
>  <christian.koenig@amd.com>, "nwatters@codeaurora.org"
>  <nwatters@codeaurora.org>, "baolu.lu@linux.intel.com"
>  <baolu.lu@linux.intel.com>, liguozhu@hisilicon.com,
>  kenneth-lee-2012@foxmail.com
> Subject: Re: [PATCH v2 03/40] iommu/sva: Manage process address spaces
> Message-ID: <20180525093959.000040a7@huawei.com>
> Organization: Huawei
> X-Mailer: Claws Mail 3.15.0 (GTK+ 2.24.31; x86_64-w64-mingw32)
> 
> +CC Kenneth Lee
> 
> On Fri, 25 May 2018 09:33:11 +0300
> Ilias Apalodimas <ilias.apalodimas@linaro.org> wrote:
> 
> > On Thu, May 24, 2018 at 04:04:39PM +0100, Jean-Philippe Brucker wrote:
> > > On 24/05/18 12:50, Ilias Apalodimas wrote:  
> > > >> Interesting, I hadn't thought about this use-case before. At first I
> > > >> thought you were talking about mdev devices assigned to VMs, but I think
> > > >> you're referring to mdevs assigned to userspace drivers instead? Out of
> > > >> curiosity, is it only theoretical or does someone actually need this?  
> > > > 
> > > > There has been some non upstreamed efforts to have mdev and produce userspace
> > > > drivers. Huawei is using it on what they call "wrapdrive" for crypto devices and
> > > > we did a proof of concept for ethernet interfaces. At the time we choose not to
> > > > involve the IOMMU for the reason you mentioned, but having it there would be
> > > > good.  
> > > 
> > > I'm guessing there were good reasons to do it that way but I wonder, is
> > > it not simpler to just have the kernel driver create a /dev/foo, with a
> > > standard ioctl/mmap/poll interface? Here VFIO adds a layer of
> > > indirection, and since the mediating driver has to implement these
> > > operations already, what is gained?  
> > The best reason i can come up with is "common code". You already have one API
> > doing that for you so we replicate it in a /dev file?
> > The mdev approach still needs extentions to support what we tried to do (i.e
> > mdev bus might need yo have access on iommu_ops), but as far as i undestand it's
> > a possible case.

Hi, Jean, Please allow me to share my understanding here:
https://zhuanlan.zhihu.com/p/35489035

The reason we do not use the /dev/foo scheme is that the devices to be
shared are programmable accelerators. We cannot fix up the kernel driver for
them.

> > > 
> > > Thanks,
> > > Jean  
> 
> 

(p.s. I sent this mail on May 26 from my public email count. But it
seems the email server is blocked. I resent it from my company count until my
colleague told me just now. Sorry for inconvenience)

Eric Auger Sept. 5, 2018, 12:14 p.m. UTC | #12

Hi Jean-Philippe,

On 05/11/2018 09:06 PM, Jean-Philippe Brucker wrote:
> Allocate IOMMU mm structures and binding them to devices. Four operations
s/binding/bind
> are added to IOMMU drivers:
> 
> * mm_alloc(): to create an io_mm structure and perform architecture-
>   specific operations required to grab the process (for instance on ARM,
>   pin down the CPU ASID so that the process doesn't get assigned a new
>   ASID on rollover).
> 
>   There is a single valid io_mm structure per Linux mm. Future extensions
>   may also use io_mm for kernel-managed address spaces, populated with
>   map()/unmap() calls instead of bound to process address spaces. This
>   patch focuses on "shared" io_mm.
> 
> * mm_attach(): attach an mm to a device. The IOMMU driver checks that the
>   device is capable of sharing an address space, and writes the PASID
>   table entry to install the pgd.
> 
>   Some IOMMU drivers will have a single PASID table per domain, for
>   convenience. Other can implement it differently but to help these
>   drivers, mm_attach and mm_detach take 'attach_domain' and
>   'detach_domain' parameters, that tell whether they need to set and clear
>   the PASID entry or only send the required TLB invalidations.
> 
> * mm_detach(): detach an mm from a device. The IOMMU driver removes the
>   PASID table entry and invalidates the IOTLBs.
> 
> * mm_free(): free a structure allocated by mm_alloc(), and let arch
>   release the process.
> 
> mm_attach and mm_detach operations are serialized with a spinlock. When
> trying to optimize this code, we should at least prevent concurrent
> attach()/detach() on the same domain (so multi-level PASID table code can
> allocate tables lazily). mm_alloc() can sleep, but mm_free must not
> (because we'll have to call it from call_srcu later on).
> 
> At the moment we use an IDR for allocating PASIDs and retrieving contexts.
> We also use a single spinlock. These can be refined and optimized later (a
> custom allocator will be needed for top-down PASID allocation).
> 
> Keeping track of address spaces requires the use of MMU notifiers.
> Handling process exit with regard to unbind() is tricky, so it is left for
> another patch and we explicitly fail mm_alloc() for the moment.
> 
> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> 
> ---
> v1->v2: sanity-check of flags
> ---
>  drivers/iommu/iommu-sva.c | 380 +++++++++++++++++++++++++++++++++++++-
>  drivers/iommu/iommu.c     |   1 +
>  include/linux/iommu.h     |  28 +++
>  3 files changed, 406 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/iommu/iommu-sva.c b/drivers/iommu/iommu-sva.c
> index 8d98f9c09864..6ac679c48f3c 100644
> --- a/drivers/iommu/iommu-sva.c
> +++ b/drivers/iommu/iommu-sva.c
> @@ -5,8 +5,298 @@
>   * Copyright (C) 2018 ARM Ltd.
>   */
>  
> +#include <linux/idr.h>
>  #include <linux/iommu.h>
> +#include <linux/sched/mm.h>
>  #include <linux/slab.h>
> +#include <linux/spinlock.h>
> +
> +/**
> + * DOC: io_mm model
> + *
> + * The io_mm keeps track of process address spaces shared between CPU and IOMMU.
> + * The following example illustrates the relation between structures
> + * iommu_domain, io_mm and iommu_bond. An iommu_bond is a link between io_mm and
> + * device. A device can have multiple io_mm and an io_mm may be bound to
> + * multiple devices.
> + *              ___________________________
> + *             |  IOMMU domain A           |
> + *             |  ________________         |
> + *             | |  IOMMU group   |        +------- io_pgtables
> + *             | |                |        |
> + *             | |   dev 00:00.0 ----+------- bond --- io_mm X
> + *             | |________________|   \    |
> + *             |                       '----- bond ---.
> + *             |___________________________|           \
> + *              ___________________________             \
> + *             |  IOMMU domain B           |           io_mm Y
> + *             |  ________________         |           / /
> + *             | |  IOMMU group   |        |          / /
> + *             | |                |        |         / /
> + *             | |   dev 00:01.0 ------------ bond -' /
> + *             | |   dev 00:01.1 ------------ bond --'
> + *             | |________________|        |
> + *             |                           +------- io_pgtables
> + *             |___________________________|
> + *
> + * In this example, device 00:00.0 is in domain A, devices 00:01.* are in domain
> + * B. All devices within the same domain access the same address spaces. Device
> + * 00:00.0 accesses address spaces X and Y, each corresponding to an mm_struct.
> + * Devices 00:01.* only access address space Y. In addition each
> + * IOMMU_DOMAIN_DMA domain has a private address space, io_pgtable, that is
> + * managed with iommu_map()/iommu_unmap(), and isn't shared with the CPU MMU.
> + *
> + * To obtain the above configuration, users would for instance issue the
> + * following calls:
> + *
> + *     iommu_sva_bind_device(dev 00:00.0, mm X, ...) -> PASID 1
> + *     iommu_sva_bind_device(dev 00:00.0, mm Y, ...) -> PASID 2
> + *     iommu_sva_bind_device(dev 00:01.0, mm Y, ...) -> PASID 2
> + *     iommu_sva_bind_device(dev 00:01.1, mm Y, ...) -> PASID 2
> + *
> + * A single Process Address Space ID (PASID) is allocated for each mm. In the
> + * example, devices use PASID 1 to read/write into address space X and PASID 2
> + * to read/write into address space Y.
> + *
> + * Hardware tables describing this configuration in the IOMMU would typically
> + * look like this:
> + *
> + *                                PASID tables
> + *                                 of domain A
> + *                              .->+--------+
> + *                             / 0 |        |-------> io_pgtable
> + *                            /    +--------+
> + *            Device tables  /   1 |        |-------> pgd X
> + *              +--------+  /      +--------+
> + *      00:00.0 |      A |-'     2 |        |--.
> + *              +--------+         +--------+   \
> + *              :        :       3 |        |    \
> + *              +--------+         +--------+     --> pgd Y
> + *      00:01.0 |      B |--.                    /
> + *              +--------+   \                  |
> + *      00:01.1 |      B |----+   PASID tables  |
> + *              +--------+     \   of domain B  |
> + *                              '->+--------+   |
> + *                               0 |        |-- | --> io_pgtable
> + *                                 +--------+   |
> + *                               1 |        |   |
> + *                                 +--------+   |
> + *                               2 |        |---'
> + *                                 +--------+
> + *                               3 |        |
> + *                                 +--------+
> + *
> + * With this model, a single call binds all devices in a given domain to an
> + * address space. Other devices in the domain will get the same bond implicitly.
> + * However, users must issue one bind() for each device, because IOMMUs may
> + * implement SVA differently. Furthermore, mandating one bind() per device
> + * allows the driver to perform sanity-checks on device capabilities.
> + *
> + * On Arm and AMD IOMMUs, entry 0 of the PASID table can be used to hold
> + * non-PASID translations. In this case PASID 0 is reserved and entry 0 points
> + * to the io_pgtable base. On Intel IOMMU, the io_pgtable base would be held in
> + * the device table and PASID 0 would be available to the allocator.
> + */
very nice explanation
> +
> +struct iommu_bond {
> +	struct io_mm		*io_mm;
> +	struct device		*dev;
> +	struct iommu_domain	*domain;
> +
> +	struct list_head	mm_head;
> +	struct list_head	dev_head;
> +	struct list_head	domain_head;
> +
> +	void			*drvdata;
> +};
> +
> +/*
> + * Because we're using an IDR, PASIDs are limited to 31 bits (the sign bit is
> + * used for returning errors). In practice implementations will use at most 20
> + * bits, which is the PCI limit.
> + */
> +static DEFINE_IDR(iommu_pasid_idr);
> +
> +/*
> + * For the moment this is an all-purpose lock. It serializes
> + * access/modifications to bonds, access/modifications to the PASID IDR, and
> + * changes to io_mm refcount as well.
> + */
> +static DEFINE_SPINLOCK(iommu_sva_lock);
> +
> +static struct io_mm *
> +io_mm_alloc(struct iommu_domain *domain, struct device *dev,
> +	    struct mm_struct *mm, unsigned long flags)
> +{
> +	int ret;
> +	int pasid;
> +	struct io_mm *io_mm;
> +	struct iommu_sva_param *param = dev->iommu_param->sva_param;
> +
don't you need to check param != NULL and flags are compatible with
those set at init?
> +	if (!domain->ops->mm_alloc || !domain->ops->mm_free)
> +		return ERR_PTR(-ENODEV);
> +
> +	io_mm = domain->ops->mm_alloc(domain, mm, flags);
> +	if (IS_ERR(io_mm))
> +		return io_mm;
> +	if (!io_mm)
> +		return ERR_PTR(-ENOMEM);
> +
> +	/*
> +	 * The mm must not be freed until after the driver frees the io_mm
> +	 * (which may involve unpinning the CPU ASID for instance, requiring a
> +	 * valid mm struct.)
> +	 */
> +	mmgrab(mm);
> +
> +	io_mm->flags		= flags;
> +	io_mm->mm		= mm;
> +	io_mm->release		= domain->ops->mm_free;
> +	INIT_LIST_HEAD(&io_mm->devices);
> +
> +	idr_preload(GFP_KERNEL);
> +	spin_lock(&iommu_sva_lock);
> +	pasid = idr_alloc(&iommu_pasid_idr, io_mm, param->min_pasid,
> +			  param->max_pasid + 1, GFP_ATOMIC);
isn't it param->max_pasid - 1?
> +	io_mm->pasid = pasid;
> +	spin_unlock(&iommu_sva_lock);
> +	idr_preload_end();
> +
> +	if (pasid < 0) {
> +		ret = pasid;
> +		goto err_free_mm;
> +	}
> +
> +	/* TODO: keep track of mm. For the moment, abort. */I don't get where you return with success?
> +	ret = -ENOSYS;
> +	spin_lock(&iommu_sva_lock);
> +	idr_remove(&iommu_pasid_idr, io_mm->pasid);
> +	spin_unlock(&iommu_sva_lock);
> +
> +err_free_mm:
> +	domain->ops->mm_free(io_mm);
> +	mmdrop(mm);
> +
> +	return ERR_PTR(ret)
> +}
> +
> +static void io_mm_free(struct io_mm *io_mm)
> +{
> +	struct mm_struct *mm = io_mm->mm;
> +
> +	io_mm->release(io_mm);
> +	mmdrop(mm);
> +}
> +
> +static void io_mm_release(struct kref *kref)
> +{
> +	struct io_mm *io_mm;
> +
> +	io_mm = container_of(kref, struct io_mm, kref);
> +	WARN_ON(!list_empty(&io_mm->devices));
> +
> +	idr_remove(&iommu_pasid_idr, io_mm->pasid);
> +
> +	io_mm_free(io_mm);
> +}
> +
> +/*
> + * Returns non-zero if a reference to the io_mm was successfully taken.
> + * Returns zero if the io_mm is being freed and should not be used.
> + */
> +static int io_mm_get_locked(struct io_mm *io_mm)
> +{
> +	if (io_mm)
> +		return kref_get_unless_zero(&io_mm->kref);
> +
> +	return 0;
> +}
> +
> +static void io_mm_put_locked(struct io_mm *io_mm)
> +{
> +	kref_put(&io_mm->kref, io_mm_release);
> +}
> +
> +static void io_mm_put(struct io_mm *io_mm)
> +{
> +	spin_lock(&iommu_sva_lock);
> +	io_mm_put_locked(io_mm);
> +	spin_unlock(&iommu_sva_lock);
> +}
> +
> +static int io_mm_attach(struct iommu_domain *domain, struct device *dev,
> +			struct io_mm *io_mm, void *drvdata)
> +{
> +	int ret;
> +	bool attach_domain = true;
> +	int pasid = io_mm->pasid;
> +	struct iommu_bond *bond, *tmp;
> +	struct iommu_sva_param *param = dev->iommu_param->sva_param;
> +
> +	if (!domain->ops->mm_attach || !domain->ops->mm_detach)
> +		return -ENODEV;
don't you need to check param is not NULL?
> +
> +	if (pasid > param->max_pasid || pasid < param->min_pasid)
pasid >= param->max_pasid ?
> +		return -ERANGE;
> +
> +	bond = kzalloc(sizeof(*bond), GFP_KERNEL);
> +	if (!bond)
> +		return -ENOMEM;
> +
> +	bond->domain		= domain;
> +	bond->io_mm		= io_mm;
> +	bond->dev		= dev;
> +	bond->drvdata		= drvdata;
> +
> +	spin_lock(&iommu_sva_lock);
> +	/*
> +	 * Check if this io_mm is already bound to the domain. In which case the
> +	 * IOMMU driver doesn't have to install the PASID table entry.
> +	 */
> +	list_for_each_entry(tmp, &domain->mm_list, domain_head) {
> +		if (tmp->io_mm == io_mm) {
> +			attach_domain = false;
> +			break;
> +		}
> +	}
> +
> +	ret = domain->ops->mm_attach(domain, dev, io_mm, attach_domain);
the fact the mm_attach/detach() must not sleep may be documented in the
API doc.
> +	if (ret) {
> +		kfree(bond);
> +		spin_unlock(&iommu_sva_lock);
> +		return ret;
nit: goto unlock simplification
> +	}
> +
> +	list_add(&bond->mm_head, &io_mm->devices);
> +	list_add(&bond->domain_head, &domain->mm_list);
> +	list_add(&bond->dev_head, &param->mm_list);
> +	spin_unlock(&iommu_sva_lock);
> +
> +	return 0;
> +}
> +
> +static void io_mm_detach_locked(struct iommu_bond *bond)
> +{
> +	struct iommu_bond *tmp;
> +	bool detach_domain = true;
> +	struct iommu_domain *domain = bond->domain;
> +
> +	list_for_each_entry(tmp, &domain->mm_list, domain_head) {
> +		if (tmp->io_mm == bond->io_mm && tmp->dev != bond->dev) {
> +			detach_domain = false;
> +			break;
> +		}
> +	}
> +
> +	domain->ops->mm_detach(domain, bond->dev, bond->io_mm, detach_domain);
> +
> +	list_del(&bond->mm_head);
> +	list_del(&bond->domain_head);
> +	list_del(&bond->dev_head);
> +	io_mm_put_locked(bond->io_mm);
> +
> +	kfree(bond);
> +}
>  
>  /**
>   * iommu_sva_device_init() - Initialize Shared Virtual Addressing for a device
> @@ -47,6 +337,7 @@ int iommu_sva_device_init(struct device *dev, unsigned long features,
>  
>  	param->features		= features;
>  	param->max_pasid	= max_pasid;
> +	INIT_LIST_HEAD(&param->mm_list);
>  
>  	/*
>  	 * IOMMU driver updates the limits depending on the IOMMU and device
> @@ -114,13 +405,87 @@ EXPORT_SYMBOL_GPL(iommu_sva_device_shutdown);
>  int __iommu_sva_bind_device(struct device *dev, struct mm_struct *mm,
>  			    int *pasid, unsigned long flags, void *drvdata)
>  {
> -	return -ENOSYS; /* TODO */
> +	int i, ret = 0;
> +	struct io_mm *io_mm = NULL;
> +	struct iommu_domain *domain;
> +	struct iommu_bond *bond = NULL, *tmp;
> +	struct iommu_sva_param *param = dev->iommu_param->sva_param;
> +
> +	domain = iommu_get_domain_for_dev(dev);
> +	if (!domain)
> +		return -EINVAL;
> +
> +	/*
> +	 * The device driver does not call sva_device_init/shutdown and
> +	 * bind/unbind concurrently, so no need to take the param lock.
> +	 */
what does guarantee that?
> +	if (WARN_ON_ONCE(!param) || (flags & ~param->features))
> +		return -EINVAL;
> +
> +	/* If an io_mm already exists, use it */
> +	spin_lock(&iommu_sva_lock);
> +	idr_for_each_entry(&iommu_pasid_idr, io_mm, i) {
> +		if (io_mm->mm == mm && io_mm_get_locked(io_mm)) {
> +			/* ... Unless it's already bound to this device */
> +			list_for_each_entry(tmp, &io_mm->devices, mm_head) {
> +				if (tmp->dev == dev) {
> +					bond = tmp;
> +					io_mm_put_locked(io_mm);
> +					break;
> +				}
> +			}
> +			break;
> +		}
> +	}
> +	spin_unlock(&iommu_sva_lock);
> +
> +	if (bond)
> +		return -EEXIST;
> +
> +	/* Require identical features within an io_mm for now */
> +	if (io_mm && (flags != io_mm->flags)) {
> +		io_mm_put(io_mm);
> +		return -EDOM;
> +	}
> +
> +	if (!io_mm) {
> +		io_mm = io_mm_alloc(domain, dev, mm, flags);
> +		if (IS_ERR(io_mm))
> +			return PTR_ERR(io_mm);
> +	}
> +
> +	ret = io_mm_attach(domain, dev, io_mm, drvdata);
> +	if (ret)
> +		io_mm_put(io_mm);
dont't you want to free the io_mm if just allocated?
> +	else
> +		*pasid = io_mm->pasid;
> +
> +	return ret;
>  }
>  EXPORT_SYMBOL_GPL(__iommu_sva_bind_device);
>  
>  int __iommu_sva_unbind_device(struct device *dev, int pasid)
>  {
> -	return -ENOSYS; /* TODO */
> +	int ret = -ESRCH;
> +	struct iommu_domain *domain;
> +	struct iommu_bond *bond = NULL;
> +	struct iommu_sva_param *param = dev->iommu_param->sva_param;
> +
> +	domain = iommu_get_domain_for_dev(dev);
> +	if (!param || WARN_ON(!domain))
> +		return -EINVAL;
> +
> +	spin_lock(&iommu_sva_lock);
> +	list_for_each_entry(bond, &param->mm_list, dev_head) {
> +		if (bond->io_mm->pasid == pasid) {
> +			io_mm_detach_locked(bond);
> +			ret = 0;
> +			break;
> +		}
> +	}
> +	spin_unlock(&iommu_sva_lock);
> +
> +	return ret;
>  }
>  EXPORT_SYMBOL_GPL(__iommu_sva_unbind_device);
>  
> @@ -132,6 +497,15 @@ EXPORT_SYMBOL_GPL(__iommu_sva_unbind_device);
>   */
>  void __iommu_sva_unbind_dev_all(struct device *dev)
>  {
> -	/* TODO */
> +	struct iommu_sva_param *param;
> +	struct iommu_bond *bond, *next;
> +
> +	param = dev->iommu_param->sva_param;
> +	if (param) {
> +		spin_lock(&iommu_sva_lock);
> +		list_for_each_entry_safe(bond, next, &param->mm_list, dev_head)
> +			io_mm_detach_locked(bond);
> +		spin_unlock(&iommu_sva_lock);
> +	}
>  }
>  EXPORT_SYMBOL_GPL(__iommu_sva_unbind_dev_all);
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index bd2819deae5b..333801e1519c 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -1463,6 +1463,7 @@ static struct iommu_domain *__iommu_domain_alloc(struct bus_type *bus,
>  	domain->type = type;
>  	/* Assume all sizes by default; the driver may override this later */
>  	domain->pgsize_bitmap  = bus->iommu_ops->pgsize_bitmap;
> +	INIT_LIST_HEAD(&domain->mm_list);
>  
>  	return domain;
>  }
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index da59c20c4f12..d5f21719a5a0 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -100,6 +100,20 @@ struct iommu_domain {
>  	void *handler_token;
>  	struct iommu_domain_geometry geometry;
>  	void *iova_cookie;
> +
> +	struct list_head mm_list;
> +};
> +
> +struct io_mm {
> +	int			pasid;
> +	/* IOMMU_SVA_FEAT_* */
> +	unsigned long		flags;
> +	struct list_head	devices;
> +	struct kref		kref;
> +	struct mm_struct	*mm;
> +
> +	/* Release callback for this mm */
> +	void (*release)(struct io_mm *io_mm);
>  };
>  
>  enum iommu_cap {
> @@ -216,6 +230,7 @@ struct iommu_sva_param {
>  	unsigned long features;
>  	unsigned int min_pasid;
>  	unsigned int max_pasid;
> +	struct list_head mm_list;
>  };
>  
>  /**
> @@ -227,6 +242,11 @@ struct iommu_sva_param {
>   * @detach_dev: detach device from an iommu domain
>   * @sva_device_init: initialize Shared Virtual Adressing for a device
>   * @sva_device_shutdown: shutdown Shared Virtual Adressing for a device
> + * @mm_alloc: allocate io_mm
> + * @mm_free: free io_mm
> + * @mm_attach: attach io_mm to a device. Install PASID entry if necessary
> + * @mm_detach: detach io_mm from a device. Remove PASID entry and
> + *             flush associated TLB entries.
if necessary too?
>   * @map: map a physically contiguous memory region to an iommu domain
>   * @unmap: unmap a physically contiguous memory region from an iommu domain
>   * @map_sg: map a scatter-gather list of physically contiguous memory chunks
> @@ -268,6 +288,14 @@ struct iommu_ops {
>  			       struct iommu_sva_param *param);
>  	void (*sva_device_shutdown)(struct device *dev,
>  				    struct iommu_sva_param *param);
> +	struct io_mm *(*mm_alloc)(struct iommu_domain *domain,
> +				  struct mm_struct *mm,
> +				  unsigned long flags);
> +	void (*mm_free)(struct io_mm *io_mm);
> +	int (*mm_attach)(struct iommu_domain *domain, struct device *dev,
> +			 struct io_mm *io_mm, bool attach_domain);
> +	void (*mm_detach)(struct iommu_domain *domain, struct device *dev,
> +			  struct io_mm *io_mm, bool detach_domain);
>  	int (*map)(struct iommu_domain *domain, unsigned long iova,
>  		   phys_addr_t paddr, size_t size, int prot);
>  	size_t (*unmap)(struct iommu_domain *domain, unsigned long iova,
> 
Thanks

Eric

Jacob Pan Sept. 5, 2018, 6:18 p.m. UTC | #13

On Wed, 5 Sep 2018 14:14:12 +0200
Auger Eric <eric.auger@redhat.com> wrote:

> > + *
> > + * On Arm and AMD IOMMUs, entry 0 of the PASID table can be used
> > to hold
> > + * non-PASID translations. In this case PASID 0 is reserved and
> > entry 0 points
> > + * to the io_pgtable base. On Intel IOMMU, the io_pgtable base
> > would be held in
> > + * the device table and PASID 0 would be available to the
> > allocator.
> > + */  
> very nice explanation
With the new Vt-d 3.0 spec., 2nd level IO page table base is no longer
held in the device context table. Instead it is held in the PASID table
entry pointed by the RID_PASID field in the device context entry. If
RID_PASID = 0, then it is the same as ARM and AMD IOMMUs.
You can refer to ch3.4.3 of the VT-d spec.

Jean-Philippe Brucker Sept. 6, 2018, 11:10 a.m. UTC | #14

On 05/09/2018 13:14, Auger Eric wrote:
>> +static struct io_mm *
>> +io_mm_alloc(struct iommu_domain *domain, struct device *dev,
>> +	    struct mm_struct *mm, unsigned long flags)
>> +{
>> +	int ret;
>> +	int pasid;
>> +	struct io_mm *io_mm;
>> +	struct iommu_sva_param *param = dev->iommu_param->sva_param;
>> +
> don't you need to check param != NULL and flags are compatible with
> those set at init?

It would be redundant, parameters are already checked by bind().
Following your comment below I think this function should also be called
under iommu_param->lock

>> +	idr_preload(GFP_KERNEL);
>> +	spin_lock(&iommu_sva_lock);
>> +	pasid = idr_alloc(&iommu_pasid_idr, io_mm, param->min_pasid,
>> +			  param->max_pasid + 1, GFP_ATOMIC);
> isn't it param->max_pasid - 1?

max_pasid is the last allocatable PASID, and the 'end' parameter of
idr_alloc is exclusive, so this needs to be max_pasid + 1.

>> +static int io_mm_attach(struct iommu_domain *domain, struct device *dev,
>> +			struct io_mm *io_mm, void *drvdata)
>> +{
>> +	int ret;
>> +	bool attach_domain = true;
>> +	int pasid = io_mm->pasid;
>> +	struct iommu_bond *bond, *tmp;
>> +	struct iommu_sva_param *param = dev->iommu_param->sva_param;
>> +
>> +	if (!domain->ops->mm_attach || !domain->ops->mm_detach)
>> +		return -ENODEV;
> don't you need to check param is not NULL?

As mm_alloc, this is called by bind() which already performs argument checks

>> +
>> +	if (pasid > param->max_pasid || pasid < param->min_pasid)
> pasid >= param->max_pasid ?

max_pasid is inclusive

>> +	ret = domain->ops->mm_attach(domain, dev, io_mm, attach_domain);
> the fact the mm_attach/detach() must not sleep may be documented in the
> API doc.

Ok

>>  int __iommu_sva_bind_device(struct device *dev, struct mm_struct *mm,
>>  			    int *pasid, unsigned long flags, void *drvdata)
>>  {
>> -	return -ENOSYS; /* TODO */
>> +	int i, ret = 0;
>> +	struct io_mm *io_mm = NULL;
>> +	struct iommu_domain *domain;
>> +	struct iommu_bond *bond = NULL, *tmp;
>> +	struct iommu_sva_param *param = dev->iommu_param->sva_param;
>> +
>> +	domain = iommu_get_domain_for_dev(dev);
>> +	if (!domain)
>> +		return -EINVAL;
>> +
>> +	/*
>> +	 * The device driver does not call sva_device_init/shutdown and
>> +	 * bind/unbind concurrently, so no need to take the param lock.
>> +	 */
> what does guarantee that?

The doc for iommu_sva_bind_device mandates that iommu_sva_device_init()
is called before bind(), but nothing is said about unbind and shutdown.
I think that was just based on the assumption that the device driver
doesn't have any reason to call unbind and shutdown concurrently, but
taking the lock here and in unbind is probably safer.

>> +	ret = io_mm_attach(domain, dev, io_mm, drvdata);
>> +	if (ret)
>> +		io_mm_put(io_mm);
> dont't you want to free the io_mm if just allocated?

We do: if the io_mm has just been allocated, it has a single reference
so io_mm_put frees it.

>> + * @mm_attach: attach io_mm to a device. Install PASID entry if necessary
>> + * @mm_detach: detach io_mm from a device. Remove PASID entry and
>> + *             flush associated TLB entries.
> if necessary too?

Right

Thanks,
Jean

Jean-Philippe Brucker Sept. 6, 2018, 5:40 p.m. UTC | #15

On 05/09/2018 19:18, Jacob Pan wrote:
> On Wed, 5 Sep 2018 14:14:12 +0200
> Auger Eric <eric.auger@redhat.com> wrote:
> 
>>> + *
>>> + * On Arm and AMD IOMMUs, entry 0 of the PASID table can be used
>>> to hold
>>> + * non-PASID translations. In this case PASID 0 is reserved and
>>> entry 0 points
>>> + * to the io_pgtable base. On Intel IOMMU, the io_pgtable base
>>> would be held in
>>> + * the device table and PASID 0 would be available to the
>>> allocator.
>>> + */  
>> very nice explanation
> With the new Vt-d 3.0 spec., 2nd level IO page table base is no longer
> held in the device context table. Instead it is held in the PASID table
> entry pointed by the RID_PASID field in the device context entry. If
> RID_PASID = 0, then it is the same as ARM and AMD IOMMUs.
> You can refer to ch3.4.3 of the VT-d spec.

I could simplify that paragraph by removing the specific implementations:

"In some IOMMUs, entry 0 of the PASID table can be used to hold
non-PASID translations. In this case PASID 0 is reserved and entry 0
points to the io_pgtable base. In other IOMMUs the io_pgtable base is
held in the device table and PASID 0 is available to the allocator."

I guess in Linux there isn't any reason to set RID_PASID to a non-zero
value? Otherwise the iommu-sva allocator will need minor changes.

Thanks,
Jean

[v2,03/40] iommu/sva: Manage process address spaces

Commit Message

Comments

Patch