[v2,01/40] iommu: Introduce Shared Virtual Addressing API

Message ID	20180511190641.23008-2-jean-philippe.brucker@arm.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <owner-linux-mm@kvack.org> Received-SPF: pass (google.com: domain of jean-philippe.brucker@arm.com designates 217.140.101.70 as permitted sender) client-ip=217.140.101.70; From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> To: linux-arm-kernel@lists.infradead.org, linux-pci@vger.kernel.org, linux-acpi@vger.kernel.org, devicetree@vger.kernel.org, iommu@lists.linux-foundation.org, kvm@vger.kernel.org, linux-mm@kvack.org Cc: joro@8bytes.org, will.deacon@arm.com, robin.murphy@arm.com, alex.williamson@redhat.com, tn@semihalf.com, liubo95@huawei.com, thunder.leizhen@huawei.com, xieyisheng1@huawei.com, xuzaibo@huawei.com, ilias.apalodimas@linaro.org, jonathan.cameron@huawei.com, liudongdong3@huawei.com, shunyong.yang@hxt-semitech.com, nwatters@codeaurora.org, okaya@codeaurora.org, jcrouse@codeaurora.org, rfranz@cavium.com, dwmw2@infradead.org, jacob.jun.pan@linux.intel.com, yi.l.liu@intel.com, ashok.raj@intel.com, kevin.tian@intel.com, baolu.lu@linux.intel.com, robdclark@gmail.com, christian.koenig@amd.com, bharatku@xilinx.com, rgummal@xilinx.com Subject: [PATCH v2 01/40] iommu: Introduce Shared Virtual Addressing API Date: Fri, 11 May 2018 20:06:02 +0100 Message-Id: <20180511190641.23008-2-jean-philippe.brucker@arm.com> In-Reply-To: <20180511190641.23008-1-jean-philippe.brucker@arm.com> References: <20180511190641.23008-1-jean-philippe.brucker@arm.com> Sender: owner-linux-mm@kvack.org Precedence: bulk

Jean-Philippe Brucker May 11, 2018, 7:06 p.m. UTC

Shared Virtual Addressing (SVA) provides a way for device drivers to bind
process address spaces to devices. This requires the IOMMU to support page
table format and features compatible with the CPUs, and usually requires
the system to support I/O Page Faults (IOPF) and Process Address Space ID
(PASID). When all of these are available, DMA can access virtual addresses
of a process. A PASID is allocated for each process, and the device driver
programs it into the device in an implementation-specific way.

Add a new API for sharing process page tables with devices. Introduce two
IOMMU operations, sva_device_init() and sva_device_shutdown(), that
prepare the IOMMU driver for SVA. For example allocate PASID tables and
fault queues. Subsequent patches will implement the bind() and unbind()
operations.

Support for I/O Page Faults will be added in a later patch using a new
feature bit (IOMMU_SVA_FEAT_IOPF). With the current API users must pin
down all shared mappings. Other feature bits that may be added in the
future are IOMMU_SVA_FEAT_PRIVATE, to support private PASID address
spaces, and IOMMU_SVA_FEAT_NO_PASID, to bind the whole device address
space to a process.

Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>

---
v1->v2:
* Add sva_param structure to iommu_param
* CONFIG option is only selectable by IOMMU drivers
---
 drivers/iommu/Kconfig     |   4 ++
 drivers/iommu/Makefile    |   1 +
 drivers/iommu/iommu-sva.c | 110 ++++++++++++++++++++++++++++++++++++++
 include/linux/iommu.h     |  32 +++++++++++
 4 files changed, 147 insertions(+)
 create mode 100644 drivers/iommu/iommu-sva.c

Jacob Pan May 16, 2018, 8:41 p.m. UTC | #1

On Fri, 11 May 2018 20:06:02 +0100
Jean-Philippe Brucker <jean-philippe.brucker@arm.com> wrote:

> Shared Virtual Addressing (SVA) provides a way for device drivers to
> bind process address spaces to devices. This requires the IOMMU to
> support page table format and features compatible with the CPUs, and
> usually requires the system to support I/O Page Faults (IOPF) and
> Process Address Space ID (PASID). When all of these are available,
> DMA can access virtual addresses of a process. A PASID is allocated
> for each process, and the device driver programs it into the device
> in an implementation-specific way.
> 
> Add a new API for sharing process page tables with devices. Introduce
> two IOMMU operations, sva_device_init() and sva_device_shutdown(),
> that prepare the IOMMU driver for SVA. For example allocate PASID
> tables and fault queues. Subsequent patches will implement the bind()
> and unbind() operations.
> 
> Support for I/O Page Faults will be added in a later patch using a new
> feature bit (IOMMU_SVA_FEAT_IOPF). With the current API users must pin
> down all shared mappings. Other feature bits that may be added in the
> future are IOMMU_SVA_FEAT_PRIVATE, to support private PASID address
> spaces, and IOMMU_SVA_FEAT_NO_PASID, to bind the whole device address
> space to a process.
> 
> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> 
> ---
> v1->v2:
> * Add sva_param structure to iommu_param
> * CONFIG option is only selectable by IOMMU drivers
> ---
>  drivers/iommu/Kconfig     |   4 ++
>  drivers/iommu/Makefile    |   1 +
>  drivers/iommu/iommu-sva.c | 110
> ++++++++++++++++++++++++++++++++++++++ include/linux/iommu.h     |
> 32 +++++++++++ 4 files changed, 147 insertions(+)
>  create mode 100644 drivers/iommu/iommu-sva.c
> 
> diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> index 7564237f788d..cca8e06903c7 100644
> --- a/drivers/iommu/Kconfig
> +++ b/drivers/iommu/Kconfig
> @@ -74,6 +74,10 @@ config IOMMU_DMA
>  	select IOMMU_IOVA
>  	select NEED_SG_DMA_LENGTH
>  
> +config IOMMU_SVA
> +	bool
> +	select IOMMU_API
> +
>  config FSL_PAMU
>  	bool "Freescale IOMMU support"
>  	depends on PCI
> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
> index 1fb695854809..1dbcc89ebe4c 100644
> --- a/drivers/iommu/Makefile
> +++ b/drivers/iommu/Makefile
> @@ -3,6 +3,7 @@ obj-$(CONFIG_IOMMU_API) += iommu.o
>  obj-$(CONFIG_IOMMU_API) += iommu-traces.o
>  obj-$(CONFIG_IOMMU_API) += iommu-sysfs.o
>  obj-$(CONFIG_IOMMU_DMA) += dma-iommu.o
> +obj-$(CONFIG_IOMMU_SVA) += iommu-sva.o
>  obj-$(CONFIG_IOMMU_IO_PGTABLE) += io-pgtable.o
>  obj-$(CONFIG_IOMMU_IO_PGTABLE_ARMV7S) += io-pgtable-arm-v7s.o
>  obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE) += io-pgtable-arm.o
> diff --git a/drivers/iommu/iommu-sva.c b/drivers/iommu/iommu-sva.c
> new file mode 100644
> index 000000000000..8b4afb7c63ae
> --- /dev/null
> +++ b/drivers/iommu/iommu-sva.c
> @@ -0,0 +1,110 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Manage PASIDs and bind process address spaces to devices.
> + *
> + * Copyright (C) 2018 ARM Ltd.
> + */
> +
> +#include <linux/iommu.h>
> +#include <linux/slab.h>
> +
> +/**
> + * iommu_sva_device_init() - Initialize Shared Virtual Addressing
> for a device
> + * @dev: the device
> + * @features: bitmask of features that need to be initialized
> + * @max_pasid: max PASID value supported by the device
> + *
> + * Users of the bind()/unbind() API must call this function to
> initialize all
> + * features required for SVA.
> + *
> + * The device must support multiple address spaces (e.g. PCI PASID).
> By default
> + * the PASID allocated during bind() is limited by the IOMMU
> capacity, and by
> + * the device PASID width defined in the PCI capability or in the
> firmware
> + * description. Setting @max_pasid to a non-zero value smaller than
> this limit
> + * overrides it.
> + *
seems the min_pasid never gets used. do you really need it?
 
> + * The device should not be performing any DMA while this function
> is running,
> + * otherwise the behavior is undefined.
> + *
> + * Return 0 if initialization succeeded, or an error.
> + */
> +int iommu_sva_device_init(struct device *dev, unsigned long features,
> +			  unsigned int max_pasid)
> +{
> +	int ret;
> +	struct iommu_sva_param *param;
> +	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
> +
> +	if (!domain || !domain->ops->sva_device_init)
> +		return -ENODEV;
> +
> +	if (features)
> +		return -EINVAL;
should it be !features?

> +
> +	param = kzalloc(sizeof(*param), GFP_KERNEL);
> +	if (!param)
> +		return -ENOMEM;
> +
> +	param->features		= features;
> +	param->max_pasid	= max_pasid;
> +
> +	/*
> +	 * IOMMU driver updates the limits depending on the IOMMU
> and device
> +	 * capabilities.
> +	 */
> +	ret = domain->ops->sva_device_init(dev, param);
> +	if (ret)
> +		goto err_free_param;
> +
> +	mutex_lock(&dev->iommu_param->lock);
> +	if (dev->iommu_param->sva_param)
> +		ret = -EEXIST;
> +	else
> +		dev->iommu_param->sva_param = param;
> +	mutex_unlock(&dev->iommu_param->lock);
> +	if (ret)
> +		goto err_device_shutdown;
> +
> +	return 0;
> +
> +err_device_shutdown:
> +	if (domain->ops->sva_device_shutdown)
> +		domain->ops->sva_device_shutdown(dev, param);
> +
> +err_free_param:
> +	kfree(param);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(iommu_sva_device_init);
> +
> +/**
> + * iommu_sva_device_shutdown() - Shutdown Shared Virtual Addressing
> for a device
> + * @dev: the device
> + *
> + * Disable SVA. Device driver should ensure that the device isn't
> performing any
> + * DMA while this function is running.
> + */
> +int iommu_sva_device_shutdown(struct device *dev)
> +{
> +	struct iommu_sva_param *param;
> +	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
> +
> +	if (!domain)
> +		return -ENODEV;
> +
> +	mutex_lock(&dev->iommu_param->lock);
> +	param = dev->iommu_param->sva_param;
> +	dev->iommu_param->sva_param = NULL;
> +	mutex_unlock(&dev->iommu_param->lock);
> +	if (!param)
> +		return -ENODEV;
> +
> +	if (domain->ops->sva_device_shutdown)
> +		domain->ops->sva_device_shutdown(dev, param);
seems a little mismatch here, do you need pass the param. I don't think
there is anything else model specific iommu driver need to do for the
param.
> +
> +	kfree(param);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(iommu_sva_device_shutdown);
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 0933f726d2e6..2efe7738bedb 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -212,6 +212,12 @@ struct page_response_msg {
>  	u64 private_data;
>  };
>  
> +struct iommu_sva_param {
> +	unsigned long features;
> +	unsigned int min_pasid;
> +	unsigned int max_pasid;
> +};
> +
>  /**
>   * struct iommu_ops - iommu ops and capabilities
>   * @capable: check capability
> @@ -219,6 +225,8 @@ struct page_response_msg {
>   * @domain_free: free iommu domain
>   * @attach_dev: attach device to an iommu domain
>   * @detach_dev: detach device from an iommu domain
> + * @sva_device_init: initialize Shared Virtual Adressing for a device
> + * @sva_device_shutdown: shutdown Shared Virtual Adressing for a
> device
>   * @map: map a physically contiguous memory region to an iommu domain
>   * @unmap: unmap a physically contiguous memory region from an iommu
> domain
>   * @map_sg: map a scatter-gather list of physically contiguous
> memory chunks @@ -256,6 +264,10 @@ struct iommu_ops {
>  
>  	int (*attach_dev)(struct iommu_domain *domain, struct device
> *dev); void (*detach_dev)(struct iommu_domain *domain, struct device
> *dev);
> +	int (*sva_device_init)(struct device *dev,
> +			       struct iommu_sva_param *param);
> +	void (*sva_device_shutdown)(struct device *dev,
> +				    struct iommu_sva_param *param);
>  	int (*map)(struct iommu_domain *domain, unsigned long iova,
>  		   phys_addr_t paddr, size_t size, int prot);
>  	size_t (*unmap)(struct iommu_domain *domain, unsigned long
> iova, @@ -413,6 +425,7 @@ struct iommu_fault_param {
>   * struct iommu_param - collection of per-device IOMMU data
>   *
>   * @fault_param: IOMMU detected device fault reporting data
> + * @sva_param: SVA parameters
>   *
>   * TODO: migrate other per device data pointers under
> iommu_dev_data, e.g.
>   *	struct iommu_group	*iommu_group;
> @@ -421,6 +434,7 @@ struct iommu_fault_param {
>  struct iommu_param {
>  	struct mutex lock;
>  	struct iommu_fault_param *fault_param;
> +	struct iommu_sva_param *sva_param;
>  };
>  
>  int  iommu_device_register(struct iommu_device *iommu);
> @@ -920,4 +934,22 @@ static inline int iommu_sva_invalidate(struct
> iommu_domain *domain, 
>  #endif /* CONFIG_IOMMU_API */
>  
> +#ifdef CONFIG_IOMMU_SVA
> +extern int iommu_sva_device_init(struct device *dev, unsigned long
> features,
> +				 unsigned int max_pasid);
> +extern int iommu_sva_device_shutdown(struct device *dev);
> +#else /* CONFIG_IOMMU_SVA */
> +static inline int iommu_sva_device_init(struct device *dev,
> +					unsigned long features,
> +					unsigned int max_pasid)
> +{
> +	return -ENODEV;
> +}
> +
> +static inline int iommu_sva_device_shutdown(struct device *dev)
> +{
> +	return -ENODEV;
> +}
> +#endif /* CONFIG_IOMMU_SVA */
> +
>  #endif /* __LINUX_IOMMU_H */

[Jacob Pan]

Jean-Philippe Brucker May 17, 2018, 10:02 a.m. UTC | #2

Hi Jacob,

Thanks for reviewing this

On 16/05/18 21:41, Jacob Pan wrote:
>> + * The device must support multiple address spaces (e.g. PCI PASID).
>> By default
>> + * the PASID allocated during bind() is limited by the IOMMU
>> capacity, and by
>> + * the device PASID width defined in the PCI capability or in the
>> firmware
>> + * description. Setting @max_pasid to a non-zero value smaller than
>> this limit
>> + * overrides it.
>> + *
> seems the min_pasid never gets used. do you really need it?

Yes, the SMMU sets it to 1 in patch 28/40, because it needs to reserve
PASID 0

>> + * The device should not be performing any DMA while this function
>> is running,
>> + * otherwise the behavior is undefined.
>> + *
>> + * Return 0 if initialization succeeded, or an error.
>> + */
>> +int iommu_sva_device_init(struct device *dev, unsigned long features,
>> +			  unsigned int max_pasid)
>> +{
>> +	int ret;
>> +	struct iommu_sva_param *param;
>> +	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
>> +
>> +	if (!domain || !domain->ops->sva_device_init)
>> +		return -ENODEV;
>> +
>> +	if (features)
>> +		return -EINVAL;
> should it be !features?

This checks if the user sets any unsupported bit in features. No feature
is supported right now, but patch 09/40 adds IOMMU_SVA_FEAT_IOPF, and
changes this line to "features & ~IOMMU_SVA_FEAT_IOPF"

>> +	mutex_lock(&dev->iommu_param->lock);
>> +	param = dev->iommu_param->sva_param;
>> +	dev->iommu_param->sva_param = NULL;
>> +	mutex_unlock(&dev->iommu_param->lock);
>> +	if (!param)
>> +		return -ENODEV;
>> +
>> +	if (domain->ops->sva_device_shutdown)
>> +		domain->ops->sva_device_shutdown(dev, param);
> seems a little mismatch here, do you need pass the param. I don't think
> there is anything else model specific iommu driver need to do for the
> param.

SMMU doesn't use it, but maybe it would remind other IOMMU driver which
features were enabled, so they don't have to keep track of that
themselves? I can remove it if it isn't useful

Thanks,
Jean

Jacob Pan May 17, 2018, 5 p.m. UTC | #3

On Thu, 17 May 2018 11:02:02 +0100
Jean-Philippe Brucker <jean-philippe.brucker@arm.com> wrote:

> Hi Jacob,
> 
> Thanks for reviewing this
> 
> On 16/05/18 21:41, Jacob Pan wrote:
>  [...]  
> > seems the min_pasid never gets used. do you really need it?  
> 
> Yes, the SMMU sets it to 1 in patch 28/40, because it needs to reserve
> PASID 0
> 
>  [...]  
> > should it be !features?  
> 
> This checks if the user sets any unsupported bit in features. No
> feature is supported right now, but patch 09/40 adds
> IOMMU_SVA_FEAT_IOPF, and changes this line to "features &
> ~IOMMU_SVA_FEAT_IOPF"
> 
> >> +	mutex_lock(&dev->iommu_param->lock);
> >> +	param = dev->iommu_param->sva_param;
> >> +	dev->iommu_param->sva_param = NULL;
> >> +	mutex_unlock(&dev->iommu_param->lock);
> >> +	if (!param)
> >> +		return -ENODEV;
> >> +
> >> +	if (domain->ops->sva_device_shutdown)
> >> +		domain->ops->sva_device_shutdown(dev, param);  
> > seems a little mismatch here, do you need pass the param. I don't
> > think there is anything else model specific iommu driver need to do
> > for the param.  
> 
> SMMU doesn't use it, but maybe it would remind other IOMMU driver
> which features were enabled, so they don't have to keep track of that
> themselves? I can remove it if it isn't useful
> 
If there is a use case, I guess iommu driver can always retrieve the
param from struct device.
> Thanks,
> Jean

[Jacob Pan]

Eric Auger Sept. 5, 2018, 11:29 a.m. UTC | #4

Hi Jean-Philippe,
On 05/11/2018 09:06 PM, Jean-Philippe Brucker wrote:
> Shared Virtual Addressing (SVA) provides a way for device drivers to bind
> process address spaces to devices. This requires the IOMMU to support page
> table format and features compatible with the CPUs, and usually requires
> the system to support I/O Page Faults (IOPF) and Process Address Space ID
> (PASID). When all of these are available, DMA can access virtual addresses
> of a process. A PASID is allocated for each process, and the device driver
> programs it into the device in an implementation-specific way.
> 
> Add a new API for sharing process page tables with devices. Introduce two
> IOMMU operations, sva_device_init() and sva_device_shutdown(), that
> prepare the IOMMU driver for SVA. For example allocate PASID tables and
> fault queues. Subsequent patches will implement the bind() and unbind()
> operations.
> 
> Support for I/O Page Faults will be added in a later patch using a new
> feature bit (IOMMU_SVA_FEAT_IOPF). With the current API users must pin
> down all shared mappings. Other feature bits that may be added in the
> future are IOMMU_SVA_FEAT_PRIVATE, to support private PASID address
> spaces, and IOMMU_SVA_FEAT_NO_PASID, to bind the whole device address
> space to a process.
> 
> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> 
> ---
> v1->v2:
> * Add sva_param structure to iommu_param
> * CONFIG option is only selectable by IOMMU drivers
> ---
>  drivers/iommu/Kconfig     |   4 ++
>  drivers/iommu/Makefile    |   1 +
>  drivers/iommu/iommu-sva.c | 110 ++++++++++++++++++++++++++++++++++++++
>  include/linux/iommu.h     |  32 +++++++++++
>  4 files changed, 147 insertions(+)
>  create mode 100644 drivers/iommu/iommu-sva.c
> 
> diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> index 7564237f788d..cca8e06903c7 100644
> --- a/drivers/iommu/Kconfig
> +++ b/drivers/iommu/Kconfig
> @@ -74,6 +74,10 @@ config IOMMU_DMA
>  	select IOMMU_IOVA
>  	select NEED_SG_DMA_LENGTH
>  
> +config IOMMU_SVA
> +	bool
> +	select IOMMU_API
> +
>  config FSL_PAMU
>  	bool "Freescale IOMMU support"
>  	depends on PCI
> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
> index 1fb695854809..1dbcc89ebe4c 100644
> --- a/drivers/iommu/Makefile
> +++ b/drivers/iommu/Makefile
> @@ -3,6 +3,7 @@ obj-$(CONFIG_IOMMU_API) += iommu.o
>  obj-$(CONFIG_IOMMU_API) += iommu-traces.o
>  obj-$(CONFIG_IOMMU_API) += iommu-sysfs.o
>  obj-$(CONFIG_IOMMU_DMA) += dma-iommu.o
> +obj-$(CONFIG_IOMMU_SVA) += iommu-sva.o
>  obj-$(CONFIG_IOMMU_IO_PGTABLE) += io-pgtable.o
>  obj-$(CONFIG_IOMMU_IO_PGTABLE_ARMV7S) += io-pgtable-arm-v7s.o
>  obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE) += io-pgtable-arm.o
> diff --git a/drivers/iommu/iommu-sva.c b/drivers/iommu/iommu-sva.c
> new file mode 100644
> index 000000000000..8b4afb7c63ae
> --- /dev/null
> +++ b/drivers/iommu/iommu-sva.c
> @@ -0,0 +1,110 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Manage PASIDs and bind process address spaces to devices.
> + *
> + * Copyright (C) 2018 ARM Ltd.
> + */
> +
> +#include <linux/iommu.h>
> +#include <linux/slab.h>
> +
> +/**
> + * iommu_sva_device_init() - Initialize Shared Virtual Addressing for a device
> + * @dev: the device
> + * @features: bitmask of features that need to be initialized
> + * @max_pasid: max PASID value supported by the device
> + *
> + * Users of the bind()/unbind() API must call this function to initialize all
> + * features required for SVA.
> + *
> + * The device must support multiple address spaces (e.g. PCI PASID). By default
> + * the PASID allocated during bind() is limited by the IOMMU capacity, and by
> + * the device PASID width defined in the PCI capability or in the firmware
> + * description. Setting @max_pasid to a non-zero value smaller than this limit
> + * overrides it.
> + *
> + * The device should not be performing any DMA while this function is running,
> + * otherwise the behavior is undefined.
> + *
> + * Return 0 if initialization succeeded, or an error.
> + */
> +int iommu_sva_device_init(struct device *dev, unsigned long features,
> +			  unsigned int max_pasid)
what about min_pasid?
> +{
> +	int ret;
> +	struct iommu_sva_param *param;
> +	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
> +
> +	if (!domain || !domain->ops->sva_device_init)
> +		return -ENODEV;
> +
> +	if (features)
> +		return -EINVAL;
> +
> +	param = kzalloc(sizeof(*param), GFP_KERNEL);
> +	if (!param)
> +		return -ENOMEM;
> +
> +	param->features		= features;
> +	param->max_pasid	= max_pasid;
> +
> +	/*
> +	 * IOMMU driver updates the limits depending on the IOMMU and device
> +	 * capabilities.
> +	 */
> +	ret = domain->ops->sva_device_init(dev, param);
> +	if (ret)
> +		goto err_free_param;
So you are likely to call sva_device_init even if it was already called
(ie. dev->iommu_param->sva_param is already set). Can't you test whether
dev->iommu_param->sva_param is already set first?
> +
> +	mutex_lock(&dev->iommu_param->lock);
> +	if (dev->iommu_param->sva_param)
> +		ret = -EEXIST;
> +	else
> +		dev->iommu_param->sva_param = param;
> +	mutex_unlock(&dev->iommu_param->lock);
> +	if (ret)
> +		goto err_device_shutdown;
> +
> +	return 0;
> +
> +err_device_shutdown:
> +	if (domain->ops->sva_device_shutdown)
> +		domain->ops->sva_device_shutdown(dev, param);
> +
> +err_free_param:
> +	kfree(param);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(iommu_sva_device_init);
> +
> +/**
> + * iommu_sva_device_shutdown() - Shutdown Shared Virtual Addressing for a device
> + * @dev: the device
> + *
> + * Disable SVA. Device driver should ensure that the device isn't performing any
> + * DMA while this function is running.
> + */
> +int iommu_sva_device_shutdown(struct device *dev)
Not sure the returned value is required for a shutdown operation.
> +{
> +	struct iommu_sva_param *param;
> +	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
> +
> +	if (!domain)
> +		return -ENODEV;
> +
> +	mutex_lock(&dev->iommu_param->lock);
> +	param = dev->iommu_param->sva_param;
> +	dev->iommu_param->sva_param = NULL;
> +	mutex_unlock(&dev->iommu_param->lock);
> +	if (!param)
> +		return -ENODEV;
> +
> +	if (domain->ops->sva_device_shutdown)
> +		domain->ops->sva_device_shutdown(dev, param);
> +
> +	kfree(param);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(iommu_sva_device_shutdown);
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 0933f726d2e6..2efe7738bedb 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -212,6 +212,12 @@ struct page_response_msg {
>  	u64 private_data;
>  };
>  
> +struct iommu_sva_param {
What are the feature values?
> +	unsigned long features;
> +	unsigned int min_pasid;
> +	unsigned int max_pasid;
> +};
> +
>  /**
>   * struct iommu_ops - iommu ops and capabilities
>   * @capable: check capability
> @@ -219,6 +225,8 @@ struct page_response_msg {
>   * @domain_free: free iommu domain
>   * @attach_dev: attach device to an iommu domain
>   * @detach_dev: detach device from an iommu domain
> + * @sva_device_init: initialize Shared Virtual Adressing for a device
Addressing
> + * @sva_device_shutdown: shutdown Shared Virtual Adressing for a device
Addressing
>   * @map: map a physically contiguous memory region to an iommu domain
>   * @unmap: unmap a physically contiguous memory region from an iommu domain
>   * @map_sg: map a scatter-gather list of physically contiguous memory chunks
> @@ -256,6 +264,10 @@ struct iommu_ops {
>  
>  	int (*attach_dev)(struct iommu_domain *domain, struct device *dev);
>  	void (*detach_dev)(struct iommu_domain *domain, struct device *dev);
> +	int (*sva_device_init)(struct device *dev,
> +			       struct iommu_sva_param *param);
> +	void (*sva_device_shutdown)(struct device *dev,
> +				    struct iommu_sva_param *param);
>  	int (*map)(struct iommu_domain *domain, unsigned long iova,
>  		   phys_addr_t paddr, size_t size, int prot);
>  	size_t (*unmap)(struct iommu_domain *domain, unsigned long iova,
> @@ -413,6 +425,7 @@ struct iommu_fault_param {
>   * struct iommu_param - collection of per-device IOMMU data
>   *
>   * @fault_param: IOMMU detected device fault reporting data
> + * @sva_param: SVA parameters
>   *
>   * TODO: migrate other per device data pointers under iommu_dev_data, e.g.
>   *	struct iommu_group	*iommu_group;
> @@ -421,6 +434,7 @@ struct iommu_fault_param {
>  struct iommu_param {
>  	struct mutex lock;
>  	struct iommu_fault_param *fault_param;
> +	struct iommu_sva_param *sva_param;
>  };
>  
>  int  iommu_device_register(struct iommu_device *iommu);
> @@ -920,4 +934,22 @@ static inline int iommu_sva_invalidate(struct iommu_domain *domain,
>  
>  #endif /* CONFIG_IOMMU_API */
>  
> +#ifdef CONFIG_IOMMU_SVA
> +extern int iommu_sva_device_init(struct device *dev, unsigned long features,
> +				 unsigned int max_pasid);
> +extern int iommu_sva_device_shutdown(struct device *dev);
> +#else /* CONFIG_IOMMU_SVA */
> +static inline int iommu_sva_device_init(struct device *dev,
> +					unsigned long features,
> +					unsigned int max_pasid)
> +{
> +	return -ENODEV;
> +}
> +
> +static inline int iommu_sva_device_shutdown(struct device *dev)
> +{
> +	return -ENODEV;
> +}
> +#endif /* CONFIG_IOMMU_SVA */
> +
>  #endif /* __LINUX_IOMMU_H */
> 

Thanks

Eric

Jean-Philippe Brucker Sept. 6, 2018, 11:09 a.m. UTC | #5

Hi Eric,

Thanks for reviewing

On 05/09/2018 12:29, Auger Eric wrote:
>> +int iommu_sva_device_init(struct device *dev, unsigned long features,
>> +			  unsigned int max_pasid)
> what about min_pasid?

No one asked for it... The max_pasid parameter is here for drivers that
have vendor-specific PASID size limits, such as AMD KFD (see
kfd_iommu_device_init and
https://patchwork.kernel.org/patch/9989307/#21389571). But in most cases
the PASID size will only depend on the PCI PASID capability and the
IOMMU limits, both known by the IOMMU driver, so device drivers won't
have to set max_pasid.

IOMMU drivers need to set min_pasid in the sva_device_init callback
because it may be either 1 (e.g. Arm where PASID #0 is reserved) or 0
(Intel Vt-d rev2), but at the moment I can't see a reason for device
drivers to override min_pasid

>> +	/*
>> +	 * IOMMU driver updates the limits depending on the IOMMU and device
>> +	 * capabilities.
>> +	 */
>> +	ret = domain->ops->sva_device_init(dev, param);
>> +	if (ret)
>> +		goto err_free_param;
> So you are likely to call sva_device_init even if it was already called
> (ie. dev->iommu_param->sva_param is already set). Can't you test whether
> dev->iommu_param->sva_param is already set first?

Right, that's probably better

>> +/**
>> + * iommu_sva_device_shutdown() - Shutdown Shared Virtual Addressing for a device
>> + * @dev: the device
>> + *
>> + * Disable SVA. Device driver should ensure that the device isn't performing any
>> + * DMA while this function is running.
>> + */
>> +int iommu_sva_device_shutdown(struct device *dev)
> Not sure the returned value is required for a shutdown operation.

I don't know either. I like them because they help me debug, but are
otherwise rather useless if we don't describe precise semantics. The
caller cannot do anything with it. Given that the corresponding IOMMU op
is already void, I can change this function to void as well

>> +struct iommu_sva_param {
> What are the feature values?

At the moment only IOMMU_SVA_FEAT_IOPF, introduced by patch 09

>> +	unsigned long features;
>> +	unsigned int min_pasid;
>> +	unsigned int max_pasid;
>> +};
>> +
>>  /**
>>   * struct iommu_ops - iommu ops and capabilities
>>   * @capable: check capability
>> @@ -219,6 +225,8 @@ struct page_response_msg {
>>   * @domain_free: free iommu domain
>>   * @attach_dev: attach device to an iommu domain
>>   * @detach_dev: detach device from an iommu domain
>> + * @sva_device_init: initialize Shared Virtual Adressing for a device
> Addressing
>> + * @sva_device_shutdown: shutdown Shared Virtual Adressing for a device
> Addressing

Nice catch

Thanks,
Jean

Christian König Sept. 6, 2018, 11:12 a.m. UTC | #6

Am 06.09.2018 um 13:09 schrieb Jean-Philippe Brucker:
> Hi Eric,
>
> Thanks for reviewing
>
> On 05/09/2018 12:29, Auger Eric wrote:
>>> +int iommu_sva_device_init(struct device *dev, unsigned long features,
>>> +			  unsigned int max_pasid)
>> what about min_pasid?
> No one asked for it... The max_pasid parameter is here for drivers that
> have vendor-specific PASID size limits, such as AMD KFD (see
> kfd_iommu_device_init and
> https://patchwork.kernel.org/patch/9989307/#21389571). But in most cases
> the PASID size will only depend on the PCI PASID capability and the
> IOMMU limits, both known by the IOMMU driver, so device drivers won't
> have to set max_pasid.
>
> IOMMU drivers need to set min_pasid in the sva_device_init callback
> because it may be either 1 (e.g. Arm where PASID #0 is reserved) or 0
> (Intel Vt-d rev2), but at the moment I can't see a reason for device
> drivers to override min_pasid

Sorry to ruin your day, but if I'm not completely mistaken PASID zero is 
reserved in the AMD KFD as well.

Regards,
Christian.

>
>>> +	/*
>>> +	 * IOMMU driver updates the limits depending on the IOMMU and device
>>> +	 * capabilities.
>>> +	 */
>>> +	ret = domain->ops->sva_device_init(dev, param);
>>> +	if (ret)
>>> +		goto err_free_param;
>> So you are likely to call sva_device_init even if it was already called
>> (ie. dev->iommu_param->sva_param is already set). Can't you test whether
>> dev->iommu_param->sva_param is already set first?
> Right, that's probably better
>
>>> +/**
>>> + * iommu_sva_device_shutdown() - Shutdown Shared Virtual Addressing for a device
>>> + * @dev: the device
>>> + *
>>> + * Disable SVA. Device driver should ensure that the device isn't performing any
>>> + * DMA while this function is running.
>>> + */
>>> +int iommu_sva_device_shutdown(struct device *dev)
>> Not sure the returned value is required for a shutdown operation.
> I don't know either. I like them because they help me debug, but are
> otherwise rather useless if we don't describe precise semantics. The
> caller cannot do anything with it. Given that the corresponding IOMMU op
> is already void, I can change this function to void as well
>
>>> +struct iommu_sva_param {
>> What are the feature values?
> At the moment only IOMMU_SVA_FEAT_IOPF, introduced by patch 09
>
>>> +	unsigned long features;
>>> +	unsigned int min_pasid;
>>> +	unsigned int max_pasid;
>>> +};
>>> +
>>>   /**
>>>    * struct iommu_ops - iommu ops and capabilities
>>>    * @capable: check capability
>>> @@ -219,6 +225,8 @@ struct page_response_msg {
>>>    * @domain_free: free iommu domain
>>>    * @attach_dev: attach device to an iommu domain
>>>    * @detach_dev: detach device from an iommu domain
>>> + * @sva_device_init: initialize Shared Virtual Adressing for a device
>> Addressing
>>> + * @sva_device_shutdown: shutdown Shared Virtual Adressing for a device
>> Addressing
> Nice catch
>
> Thanks,
> Jean

Jean-Philippe Brucker Sept. 6, 2018, 12:45 p.m. UTC | #7

On 06/09/2018 12:12, Christian König wrote:
> Am 06.09.2018 um 13:09 schrieb Jean-Philippe Brucker:
>> Hi Eric,
>>
>> Thanks for reviewing
>>
>> On 05/09/2018 12:29, Auger Eric wrote:
>>>> +int iommu_sva_device_init(struct device *dev, unsigned long features,
>>>> +              unsigned int max_pasid)
>>> what about min_pasid?
>> No one asked for it... The max_pasid parameter is here for drivers that
>> have vendor-specific PASID size limits, such as AMD KFD (see
>> kfd_iommu_device_init and
>> https://patchwork.kernel.org/patch/9989307/#21389571). But in most cases
>> the PASID size will only depend on the PCI PASID capability and the
>> IOMMU limits, both known by the IOMMU driver, so device drivers won't
>> have to set max_pasid.
>>
>> IOMMU drivers need to set min_pasid in the sva_device_init callback
>> because it may be either 1 (e.g. Arm where PASID #0 is reserved) or 0
>> (Intel Vt-d rev2), but at the moment I can't see a reason for device
>> drivers to override min_pasid
> 
> Sorry to ruin your day, but if I'm not completely mistaken PASID zero is
> reserved in the AMD KFD as well.

Heh, fair enough. I'll add the min_pasid parameter

Thanks,
Jean

Christian König Sept. 7, 2018, 8:55 a.m. UTC | #8

Am 06.09.2018 um 14:45 schrieb Jean-Philippe Brucker:
> On 06/09/2018 12:12, Christian König wrote:
>> Am 06.09.2018 um 13:09 schrieb Jean-Philippe Brucker:
>>> Hi Eric,
>>>
>>> Thanks for reviewing
>>>
>>> On 05/09/2018 12:29, Auger Eric wrote:
>>>>> +int iommu_sva_device_init(struct device *dev, unsigned long features,
>>>>> +              unsigned int max_pasid)
>>>> what about min_pasid?
>>> No one asked for it... The max_pasid parameter is here for drivers that
>>> have vendor-specific PASID size limits, such as AMD KFD (see
>>> kfd_iommu_device_init and
>>> https://patchwork.kernel.org/patch/9989307/#21389571). But in most cases
>>> the PASID size will only depend on the PCI PASID capability and the
>>> IOMMU limits, both known by the IOMMU driver, so device drivers won't
>>> have to set max_pasid.
>>>
>>> IOMMU drivers need to set min_pasid in the sva_device_init callback
>>> because it may be either 1 (e.g. Arm where PASID #0 is reserved) or 0
>>> (Intel Vt-d rev2), but at the moment I can't see a reason for device
>>> drivers to override min_pasid
>> Sorry to ruin your day, but if I'm not completely mistaken PASID zero is
>> reserved in the AMD KFD as well.
> Heh, fair enough. I'll add the min_pasid parameter

I will take this as an opportunity to summarize some of the requirements 
we have for PASID management from the amdgpu driver point of view:

1. We need to be able to allocate PASID between 1 and some maximum. Zero 
is reserved as far as I know, but we don't necessary need a minimum.

2. We need to be able to allocate PASIDs without a process address space 
backing it. E.g. our hardware uses PASIDs even without Shared Virtual 
Addressing enabled to distinct clients from each other.
         Would be a pity if we need to still have a separate PASID 
handling because the system wide is only available when IOMMU is turned on.

3. Even after destruction of a process address space we need some grace 
period before a PASID is reused because it can be that the specific 
PASID is still in some hardware queues etc...
         At bare minimum all device drivers using process binding need 
to explicitly note to the core when they are done with a PASID.

4. It would be nice to have to be able to set a "void *" for each 
PASID/device combination while binding to a process which then can be 
queried later on based on the PASID.
         E.g. when you have a per PASID/device structure around anyway, 
just add an extra field.

5. It would be nice to have to allocate multiple PASIDs for the same 
process address space.
         E.g. some teams at AMD want to use a separate GPU address space 
for their userspace client library. I'm still trying to avoid that, but 
it is perfectly possible that we are going to need that.
         Additional to that it is sometimes quite useful for debugging 
to isolate where exactly an incorrect access (segfault) is coming from.

Let me know if there are some problems with that, especially I want to 
know if there is pushback on #5 so that I can forward that :)

Thanks,
Christian.

>
> Thanks,
> Jean

Jean-Philippe Brucker Sept. 7, 2018, 3:45 p.m. UTC | #9

On 07/09/2018 09:55, Christian König wrote:
> I will take this as an opportunity to summarize some of the requirements 
> we have for PASID management from the amdgpu driver point of view:

That's incredibly useful, thanks :)

> 1. We need to be able to allocate PASID between 1 and some maximum. Zero 
> is reserved as far as I know, but we don't necessary need a minimum.

Should be fine. The PASID range is restricted by the PCI PASID
capability, firmware description (for non-PCI devices), the IOMMU
capacity, and what the device driver passes to iommu_sva_device_init.
Not all IOMMUs reserve PASID 0 (AMD IOMMU without GIoSup doesn't, if I'm
not mistaken), so the KFD driver will need to pass min_pasid=1 to make
sure that 0 isn't allocated.

> 2. We need to be able to allocate PASIDs without a process address space 
> backing it. E.g. our hardware uses PASIDs even without Shared Virtual 
> Addressing enabled to distinct clients from each other.
>          Would be a pity if we need to still have a separate PASID 
> handling because the system wide is only available when IOMMU is turned on.

I'm still not sure about this one. From my point of view we shouldn't
add to the IOMMU subsystem helpers for devices without an IOMMU.
iommu-sva expects everywhere that the device has an iommu_domain, it's
the first thing we check on entry. Bypassing all of this would call
idr_alloc() directly, and wouldn't have any code in common with the
current iommu-sva. So it seems like you need a layer on top of iommu-sva
calling idr_alloc() when an IOMMU isn't present, but I don't think it
should be in drivers/iommu/

> 3. Even after destruction of a process address space we need some grace 
> period before a PASID is reused because it can be that the specific 
> PASID is still in some hardware queues etc...
>          At bare minimum all device drivers using process binding need 
> to explicitly note to the core when they are done with a PASID.

Right, much of the horribleness in iommu-sva deals with this:

The process dies, iommu-sva is notified and calls the mm_exit() function
passed by the device driver to iommu_sva_device_init(). In mm_exit() the
device driver needs to clear any reference to the PASID in hardware and
in its own structures. When the device driver returns from mm_exit(), it
effectively tells the core that it has finished using the PASID, and
iommu-sva can reuse the PASID for another process. mm_exit() is allowed
to block, so the device driver has time to clean up and flush the queues.

If the device driver finishes using the PASID before the process exits,
it just calls unbind().

> 4. It would be nice to have to be able to set a "void *" for each 
> PASID/device combination while binding to a process which then can be 
> queried later on based on the PASID.
>          E.g. when you have a per PASID/device structure around anyway, 
> just add an extra field.

iommu_sva_bind_device() takes a "drvdata" pointer that is stored
internally for the PASID/device combination (iommu_bond). It is passed
to mm_exit(), but I haven't added anything for the device driver to
query it back.

> 5. It would be nice to have to allocate multiple PASIDs for the same 
> process address space.
>          E.g. some teams at AMD want to use a separate GPU address space 
> for their userspace client library. I'm still trying to avoid that, but 
> it is perfectly possible that we are going to need that.

Two PASIDs pointing to the same process pgd? At first glance it seems
feasible, maybe with a flag passed to bind() and a few changes to
internal structures. It will duplicate ATC invalidation commands for
each process address space change (munmap etc) so you might take a
performance hit.

Intel's SVM code has the SVM_FLAG_PRIVATE_PASID which seems similar to
what you describe, but I don't plan to support it in this series (the
io_mm model is already pretty complicated). I think it can be added
without too much effort in a future series, though with a different flag
name since we'd like to use "private PASID" for something else
(https://www.spinics.net/lists/dri-devel/msg177007.html).

Thanks,
Jean

>          Additional to that it is sometimes quite useful for debugging 
> to isolate where exactly an incorrect access (segfault) is coming from.
> 
> Let me know if there are some problems with that, especially I want to 
> know if there is pushback on #5 so that I can forward that :)
> 
> Thanks,
> Christian.
> 
>>
>> Thanks,
>> Jean
>

Christian König Sept. 7, 2018, 6:02 p.m. UTC | #10

Am 07.09.2018 um 17:45 schrieb Jean-Philippe Brucker:
> On 07/09/2018 09:55, Christian König wrote:
>> I will take this as an opportunity to summarize some of the requirements
>> we have for PASID management from the amdgpu driver point of view:
> That's incredibly useful, thanks :)
>
>> 1. We need to be able to allocate PASID between 1 and some maximum. Zero
>> is reserved as far as I know, but we don't necessary need a minimum.
> Should be fine. The PASID range is restricted by the PCI PASID
> capability, firmware description (for non-PCI devices), the IOMMU
> capacity, and what the device driver passes to iommu_sva_device_init.
> Not all IOMMUs reserve PASID 0 (AMD IOMMU without GIoSup doesn't, if I'm
> not mistaken), so the KFD driver will need to pass min_pasid=1 to make
> sure that 0 isn't allocated.
>
>> 2. We need to be able to allocate PASIDs without a process address space
>> backing it. E.g. our hardware uses PASIDs even without Shared Virtual
>> Addressing enabled to distinct clients from each other.
>>           Would be a pity if we need to still have a separate PASID
>> handling because the system wide is only available when IOMMU is turned on.
> I'm still not sure about this one. From my point of view we shouldn't
> add to the IOMMU subsystem helpers for devices without an IOMMU.

I agree on that.

> iommu-sva expects everywhere that the device has an iommu_domain, it's
> the first thing we check on entry. Bypassing all of this would call
> idr_alloc() directly, and wouldn't have any code in common with the
> current iommu-sva. So it seems like you need a layer on top of iommu-sva
> calling idr_alloc() when an IOMMU isn't present, but I don't think it
> should be in drivers/iommu/

In this case I question if the PASID handling should be under 
drivers/iommu at all.

See I can have a mix of VM context which are bound to processes (some 
few) and VM contexts which are standalone and doesn't care for a process 
address space. But for each VM context I need a distinct PASID for the 
hardware to work.

I can live if we say if IOMMU is completely disabled we use a simple ida 
to allocate them, but when IOMMU is enabled I certainly need a way to 
reserve a PASID without an associated process.

>> 3. Even after destruction of a process address space we need some grace
>> period before a PASID is reused because it can be that the specific
>> PASID is still in some hardware queues etc...
>>           At bare minimum all device drivers using process binding need
>> to explicitly note to the core when they are done with a PASID.
> Right, much of the horribleness in iommu-sva deals with this:
>
> The process dies, iommu-sva is notified and calls the mm_exit() function
> passed by the device driver to iommu_sva_device_init(). In mm_exit() the
> device driver needs to clear any reference to the PASID in hardware and
> in its own structures. When the device driver returns from mm_exit(), it
> effectively tells the core that it has finished using the PASID, and
> iommu-sva can reuse the PASID for another process. mm_exit() is allowed
> to block, so the device driver has time to clean up and flush the queues.
>
> If the device driver finishes using the PASID before the process exits,
> it just calls unbind().

Exactly that's what Michal Hocko is probably going to not like at all.

Can we have a different approach where each driver is informed by the 
mm_exit(), but needs to explicitly call unbind() before a PASID is reused?

During that teardown transition it would be ideal if that PASID only 
points to a dummy root page directory with only invalid entries.

>
>> 4. It would be nice to have to be able to set a "void *" for each
>> PASID/device combination while binding to a process which then can be
>> queried later on based on the PASID.
>>           E.g. when you have a per PASID/device structure around anyway,
>> just add an extra field.
> iommu_sva_bind_device() takes a "drvdata" pointer that is stored
> internally for the PASID/device combination (iommu_bond). It is passed
> to mm_exit(), but I haven't added anything for the device driver to
> query it back.

Nice! Looks like all we need additionally is a function to retrieve that 
based on the PASID.

>> 5. It would be nice to have to allocate multiple PASIDs for the same
>> process address space.
>>           E.g. some teams at AMD want to use a separate GPU address space
>> for their userspace client library. I'm still trying to avoid that, but
>> it is perfectly possible that we are going to need that.
> Two PASIDs pointing to the same process pgd? At first glance it seems
> feasible, maybe with a flag passed to bind() and a few changes to
> internal structures. It will duplicate ATC invalidation commands for
> each process address space change (munmap etc) so you might take a
> performance hit.
>
> Intel's SVM code has the SVM_FLAG_PRIVATE_PASID which seems similar to
> what you describe, but I don't plan to support it in this series (the
> io_mm model is already pretty complicated). I think it can be added
> without too much effort in a future series, though with a different flag
> name since we'd like to use "private PASID" for something else
> (https://www.spinics.net/lists/dri-devel/msg177007.html).

To be honest I hoped that you would say: No never! So that I have a good 
argument to pushback on such requirements :)

But if it's doable it would be at least nice to have for debugging.

Thanks a lot for working on that,
Christian.

>
> Thanks,
> Jean
>
>>           Additional to that it is sometimes quite useful for debugging
>> to isolate where exactly an incorrect access (segfault) is coming from.
>>
>> Let me know if there are some problems with that, especially I want to
>> know if there is pushback on #5 so that I can forward that :)
>>
>> Thanks,
>> Christian.
>>
>>> Thanks,
>>> Jean

Jacob Pan Sept. 7, 2018, 9:25 p.m. UTC | #11

On Fri, 7 Sep 2018 20:02:54 +0200
Christian König <christian.koenig@amd.com> wrote:

> Am 07.09.2018 um 17:45 schrieb Jean-Philippe Brucker:
> > On 07/09/2018 09:55, Christian König wrote:  
> >> I will take this as an opportunity to summarize some of the
> >> requirements we have for PASID management from the amdgpu driver
> >> point of view:  
> > That's incredibly useful, thanks :)
> >  
> >> 1. We need to be able to allocate PASID between 1 and some
> >> maximum. Zero is reserved as far as I know, but we don't necessary
> >> need a minimum.  
>  [...]  
> >> 2. We need to be able to allocate PASIDs without a process address
> >> space backing it. E.g. our hardware uses PASIDs even without
> >> Shared Virtual Addressing enabled to distinct clients from each
> >> other. Would be a pity if we need to still have a separate PASID
> >> handling because the system wide is only available when IOMMU is
> >> turned on.  
>  [...]  
> 
> I agree on that.
> 
> > iommu-sva expects everywhere that the device has an iommu_domain,
> > it's the first thing we check on entry. Bypassing all of this would
> > call idr_alloc() directly, and wouldn't have any code in common
> > with the current iommu-sva. So it seems like you need a layer on
> > top of iommu-sva calling idr_alloc() when an IOMMU isn't present,
> > but I don't think it should be in drivers/iommu/  
> 
> In this case I question if the PASID handling should be under 
> drivers/iommu at all.
> 
> See I can have a mix of VM context which are bound to processes (some 
> few) and VM contexts which are standalone and doesn't care for a
> process address space. But for each VM context I need a distinct
> PASID for the hardware to work.
> 
> I can live if we say if IOMMU is completely disabled we use a simple
> ida to allocate them, but when IOMMU is enabled I certainly need a
> way to reserve a PASID without an associated process.
> 
VT-d would also have such requirement. There is a virtual command
register for allocate and free PASID for VM use. When that PASID
allocation request gets propagated to the host IOMMU driver, we need to
allocate PASID w/o mm.

If the PASID allocation is done via VFIO, can we have FD to track PASID
life cycle instead of mm_exit()? i.e. all FDs get closed before
mm_exit, I assume?

> >> 3. Even after destruction of a process address space we need some
> >> grace period before a PASID is reused because it can be that the
> >> specific PASID is still in some hardware queues etc...
> >>           At bare minimum all device drivers using process binding
> >> need to explicitly note to the core when they are done with a
> >> PASID.  
> > Right, much of the horribleness in iommu-sva deals with this:
> >
> > The process dies, iommu-sva is notified and calls the mm_exit()
> > function passed by the device driver to iommu_sva_device_init(). In
> > mm_exit() the device driver needs to clear any reference to the
> > PASID in hardware and in its own structures. When the device driver
> > returns from mm_exit(), it effectively tells the core that it has
> > finished using the PASID, and iommu-sva can reuse the PASID for
> > another process. mm_exit() is allowed to block, so the device
> > driver has time to clean up and flush the queues.
> >
> > If the device driver finishes using the PASID before the process
> > exits, it just calls unbind().  
> 
> Exactly that's what Michal Hocko is probably going to not like at all.
> 
> Can we have a different approach where each driver is informed by the 
> mm_exit(), but needs to explicitly call unbind() before a PASID is
> reused?
> 
> During that teardown transition it would be ideal if that PASID only 
> points to a dummy root page directory with only invalid entries.
> 
I guess this can be vendor specific, In VT-d I plan to mark PASID
entry not present and disable fault reporting while draining remaining
activities.

> >  
> >> 4. It would be nice to have to be able to set a "void *" for each
> >> PASID/device combination while binding to a process which then can
> >> be queried later on based on the PASID.
> >>           E.g. when you have a per PASID/device structure around
> >> anyway, just add an extra field.  
> > iommu_sva_bind_device() takes a "drvdata" pointer that is stored
> > internally for the PASID/device combination (iommu_bond). It is
> > passed to mm_exit(), but I haven't added anything for the device
> > driver to query it back.  
> 
> Nice! Looks like all we need additionally is a function to retrieve
> that based on the PASID.
> 
> >> 5. It would be nice to have to allocate multiple PASIDs for the
> >> same process address space.
> >>           E.g. some teams at AMD want to use a separate GPU
> >> address space for their userspace client library. I'm still trying
> >> to avoid that, but it is perfectly possible that we are going to
> >> need that.  
> > Two PASIDs pointing to the same process pgd? At first glance it
> > seems feasible, maybe with a flag passed to bind() and a few
> > changes to internal structures. It will duplicate ATC invalidation
> > commands for each process address space change (munmap etc) so you
> > might take a performance hit.
> >
> > Intel's SVM code has the SVM_FLAG_PRIVATE_PASID which seems similar
> > to what you describe, but I don't plan to support it in this series
> > (the io_mm model is already pretty complicated). I think it can be
> > added without too much effort in a future series, though with a
> > different flag name since we'd like to use "private PASID" for
> > something else
> > (https://www.spinics.net/lists/dri-devel/msg177007.html).  
> 
> To be honest I hoped that you would say: No never! So that I have a
> good argument to pushback on such requirements :)
> 
> But if it's doable it would be at least nice to have for debugging.
> 
> Thanks a lot for working on that,
> Christian.
> 
> >
> > Thanks,
> > Jean
> >  
> >>           Additional to that it is sometimes quite useful for
> >> debugging to isolate where exactly an incorrect access (segfault)
> >> is coming from.
> >>
> >> Let me know if there are some problems with that, especially I
> >> want to know if there is pushback on #5 so that I can forward
> >> that :)
> >>
> >> Thanks,
> >> Christian.
> >>  
> >>> Thanks,
> >>> Jean  
> 

[Jacob Pan]

Christian König Sept. 8, 2018, 7:29 a.m. UTC | #12

Am 07.09.2018 um 23:25 schrieb Jacob Pan:
> On Fri, 7 Sep 2018 20:02:54 +0200
> Christian König <christian.koenig@amd.com> wrote:
>> [SNIP]
>>> iommu-sva expects everywhere that the device has an iommu_domain,
>>> it's the first thing we check on entry. Bypassing all of this would
>>> call idr_alloc() directly, and wouldn't have any code in common
>>> with the current iommu-sva. So it seems like you need a layer on
>>> top of iommu-sva calling idr_alloc() when an IOMMU isn't present,
>>> but I don't think it should be in drivers/iommu/
>> In this case I question if the PASID handling should be under
>> drivers/iommu at all.
>>
>> See I can have a mix of VM context which are bound to processes (some
>> few) and VM contexts which are standalone and doesn't care for a
>> process address space. But for each VM context I need a distinct
>> PASID for the hardware to work.
>>
>> I can live if we say if IOMMU is completely disabled we use a simple
>> ida to allocate them, but when IOMMU is enabled I certainly need a
>> way to reserve a PASID without an associated process.
>>
> VT-d would also have such requirement. There is a virtual command
> register for allocate and free PASID for VM use. When that PASID
> allocation request gets propagated to the host IOMMU driver, we need to
> allocate PASID w/o mm.
>
> If the PASID allocation is done via VFIO, can we have FD to track PASID
> life cycle instead of mm_exit()? i.e. all FDs get closed before
> mm_exit, I assume?

Yes, exactly. I just need a PASID which is never used by the OS for a 
process and we can easily give that back when the last FD reference is 
closed.

>>>> 3. Even after destruction of a process address space we need some
>>>> grace period before a PASID is reused because it can be that the
>>>> specific PASID is still in some hardware queues etc...
>>>>            At bare minimum all device drivers using process binding
>>>> need to explicitly note to the core when they are done with a
>>>> PASID.
>>> Right, much of the horribleness in iommu-sva deals with this:
>>>
>>> The process dies, iommu-sva is notified and calls the mm_exit()
>>> function passed by the device driver to iommu_sva_device_init(). In
>>> mm_exit() the device driver needs to clear any reference to the
>>> PASID in hardware and in its own structures. When the device driver
>>> returns from mm_exit(), it effectively tells the core that it has
>>> finished using the PASID, and iommu-sva can reuse the PASID for
>>> another process. mm_exit() is allowed to block, so the device
>>> driver has time to clean up and flush the queues.
>>>
>>> If the device driver finishes using the PASID before the process
>>> exits, it just calls unbind().
>> Exactly that's what Michal Hocko is probably going to not like at all.
>>
>> Can we have a different approach where each driver is informed by the
>> mm_exit(), but needs to explicitly call unbind() before a PASID is
>> reused?
>>
>> During that teardown transition it would be ideal if that PASID only
>> points to a dummy root page directory with only invalid entries.
>>
> I guess this can be vendor specific, In VT-d I plan to mark PASID
> entry not present and disable fault reporting while draining remaining
> activities.

Sounds good to me.

Point is at least in the case where the process was killed by the OOM 
killer we should not block in mm_exit().

Instead operations issued by the process to a device driver which uses 
SVA needs to be terminated as soon as possible to make sure that the OOM 
killer can advance.

Thanks,
Christian.

Jean-Philippe Brucker Sept. 12, 2018, 12:40 p.m. UTC | #13

On 08/09/2018 08:29, Christian König wrote:
> Yes, exactly. I just need a PASID which is never used by the OS for a 
> process and we can easily give that back when the last FD reference is 
> closed.

Alright, iommu-sva can get its PASID from this external allocator as
well, as long as it has an interface similar to idr. Where would it go,
drivers/base/, mm/, kernel/...?

>>>> The process dies, iommu-sva is notified and calls the mm_exit()
>>>> function passed by the device driver to iommu_sva_device_init(). In
>>>> mm_exit() the device driver needs to clear any reference to the
>>>> PASID in hardware and in its own structures. When the device driver
>>>> returns from mm_exit(), it effectively tells the core that it has
>>>> finished using the PASID, and iommu-sva can reuse the PASID for
>>>> another process. mm_exit() is allowed to block, so the device
>>>> driver has time to clean up and flush the queues.
>>>>
>>>> If the device driver finishes using the PASID before the process
>>>> exits, it just calls unbind().
>>> Exactly that's what Michal Hocko is probably going to not like at all.
>>>
>>> Can we have a different approach where each driver is informed by the
>>> mm_exit(), but needs to explicitly call unbind() before a PASID is
>>> reused?

It's awful from the IOMMU driver perspective. In addition to "enabled"
and "disabled" PASID states, you add "disabled but DMA still running
normally". Between that new state and "disabled", the IOMMU will be
flooded by translation faults (non-recoverable ones), which it needs to
ignore instead of reporting to the kernel. Not all IOMMUs can deal with
this in hardware (SMMU and VT-d can quiesce translation faults
per-PASID, but I don't think AMD IOMMU can.) Some drivers will have to
filter fault events themselves, depending on the PASID state.

>>> During that teardown transition it would be ideal if that PASID only
>>> points to a dummy root page directory with only invalid entries.
>>>
>> I guess this can be vendor specific, In VT-d I plan to mark PASID
>> entry not present and disable fault reporting while draining remaining
>> activities.
>
> Sounds good to me.
> 
> Point is at least in the case where the process was killed by the OOM 
> killer we should not block in mm_exit().
>
> Instead operations issued by the process to a device driver which uses 
> SVA needs to be terminated as soon as possible to make sure that the OOM 
> killer can advance.

I don't see how we're preventing the OOM killer from advancing, so I'm
looking for a stronger argument that justifies adding this complexity to
IOMMU drivers. Time limit of the release MMU notifier, locking
requirement, a concrete example where things break, even a comment
somewhere in mm/ would do...

In my tests I can't manage to disturb the OOM killer, but I could be
missing some special case. Even if the mm_exit() callback
(unrealistically) sleeps for 60 seconds, the OOM killer isn't affected
because oom_reap_task_mm() wipes the victim's address space in another
thread, either before or while the release notifier is running.

Thanks,
Jean

Christian König Sept. 12, 2018, 12:56 p.m. UTC | #14

Am 12.09.2018 um 14:40 schrieb Jean-Philippe Brucker:
> On 08/09/2018 08:29, Christian König wrote:
>> Yes, exactly. I just need a PASID which is never used by the OS for a
>> process and we can easily give that back when the last FD reference is
>> closed.
> Alright, iommu-sva can get its PASID from this external allocator as
> well, as long as it has an interface similar to idr. Where would it go,
> drivers/base/, mm/, kernel/...?

Good question, my initial instinct was to put it under drivers/pci.

But AFAIKS now you are supporting SVA implementations which are not 
based on PCI.

So drivers/base sounds like a good place to me.

>
>>>>> The process dies, iommu-sva is notified and calls the mm_exit()
>>>>> function passed by the device driver to iommu_sva_device_init(). In
>>>>> mm_exit() the device driver needs to clear any reference to the
>>>>> PASID in hardware and in its own structures. When the device driver
>>>>> returns from mm_exit(), it effectively tells the core that it has
>>>>> finished using the PASID, and iommu-sva can reuse the PASID for
>>>>> another process. mm_exit() is allowed to block, so the device
>>>>> driver has time to clean up and flush the queues.
>>>>>
>>>>> If the device driver finishes using the PASID before the process
>>>>> exits, it just calls unbind().
>>>> Exactly that's what Michal Hocko is probably going to not like at all.
>>>>
>>>> Can we have a different approach where each driver is informed by the
>>>> mm_exit(), but needs to explicitly call unbind() before a PASID is
>>>> reused?
> It's awful from the IOMMU driver perspective. In addition to "enabled"
> and "disabled" PASID states, you add "disabled but DMA still running
> normally". Between that new state and "disabled", the IOMMU will be
> flooded by translation faults (non-recoverable ones), which it needs to
> ignore instead of reporting to the kernel. Not all IOMMUs can deal with
> this in hardware (SMMU and VT-d can quiesce translation faults
> per-PASID, but I don't think AMD IOMMU can.) Some drivers will have to
> filter fault events themselves, depending on the PASID state.

Puh, yeah that is probably true.

Ok let us skip that for a moment, we just need to invest more work in 
killing DMA operations quickly when the process address space is teared 
down.

>>>> During that teardown transition it would be ideal if that PASID only
>>>> points to a dummy root page directory with only invalid entries.
>>>>
>>> I guess this can be vendor specific, In VT-d I plan to mark PASID
>>> entry not present and disable fault reporting while draining remaining
>>> activities.
>> Sounds good to me.
>>
>> Point is at least in the case where the process was killed by the OOM
>> killer we should not block in mm_exit().
>>
>> Instead operations issued by the process to a device driver which uses
>> SVA needs to be terminated as soon as possible to make sure that the OOM
>> killer can advance.
> I don't see how we're preventing the OOM killer from advancing, so I'm
> looking for a stronger argument that justifies adding this complexity to
> IOMMU drivers. Time limit of the release MMU notifier, locking
> requirement, a concrete example where things break, even a comment
> somewhere in mm/ would do...
>
> In my tests I can't manage to disturb the OOM killer, but I could be
> missing some special case. Even if the mm_exit() callback
> (unrealistically) sleeps for 60 seconds,

Well you are *COMPLETELY* under estimating this. A compute task with a 
huge wave launch can take multiple minutes to tear down.

That's why I'm so concerned about that, but to be honest I think that 
just the hardware needs to become better and we need to be able to block 
dead tasks from spawning threads again.

Regards,
Christian.

>   the OOM killer isn't affected
> because oom_reap_task_mm() wipes the victim's address space in another
> thread, either before or while the release notifier is running.
>
> Thanks,
> Jean

Tian, Kevin Sept. 13, 2018, 7:15 a.m. UTC | #15

> From: Jacob Pan [mailto:jacob.jun.pan@linux.intel.com]
> Sent: Saturday, September 8, 2018 5:25 AM
> > > iommu-sva expects everywhere that the device has an iommu_domain,
> > > it's the first thing we check on entry. Bypassing all of this would
> > > call idr_alloc() directly, and wouldn't have any code in common
> > > with the current iommu-sva. So it seems like you need a layer on
> > > top of iommu-sva calling idr_alloc() when an IOMMU isn't present,
> > > but I don't think it should be in drivers/iommu/
> >
> > In this case I question if the PASID handling should be under
> > drivers/iommu at all.
> >
> > See I can have a mix of VM context which are bound to processes (some
> > few) and VM contexts which are standalone and doesn't care for a
> > process address space. But for each VM context I need a distinct
> > PASID for the hardware to work.

I'm confused about VM context vs. process. Is VM referring to Virtual
Machine or something else? If yes, I don't understand the binding part
- what VM context is bound to (host?) process?

> >
> > I can live if we say if IOMMU is completely disabled we use a simple
> > ida to allocate them, but when IOMMU is enabled I certainly need a
> > way to reserve a PASID without an associated process.
> >
> VT-d would also have such requirement. There is a virtual command
> register for allocate and free PASID for VM use. When that PASID
> allocation request gets propagated to the host IOMMU driver, we need to
> allocate PASID w/o mm.
> 

VT-d is a bit different. In host side, PASID allocation always happens in
Qemu's context, so those PASIDs are recorded with Qemu process, 
though the entries may point to guest page tables instead of host mm
of Qemu.

Thanks
Kevin

Tian, Kevin Sept. 13, 2018, 7:26 a.m. UTC | #16

> From: Christian König
> Sent: Friday, September 7, 2018 4:56 PM
> 
> 5. It would be nice to have to allocate multiple PASIDs for the same
> process address space.
>          E.g. some teams at AMD want to use a separate GPU address space
> for their userspace client library. I'm still trying to avoid that, but
> it is perfectly possible that we are going to need that.
>          Additional to that it is sometimes quite useful for debugging
> to isolate where exactly an incorrect access (segfault) is coming from.
> 
> Let me know if there are some problems with that, especially I want to
> know if there is pushback on #5 so that I can forward that :)
> 

We have similar requirement, except that it is "multiple PASIDs for
same process" instead of "for same process address space".

Intel VT-d goes to a 'true' system-wide PASID allocation policy, 
cross both host processes and guest processes. As Jacob explains,
there will be a virtual cmd register on virtual vtd, through which
guest IOMMU driver requests to get system-wide PASIDs allocated
by host IOMMU driver.

with that design, Qemu represents all guest processes in host
side, thus will get "multiple PASIDs allocated for same process".
However instead of binding all PASIDs to same host address space
of  Qemu, each of PASID entry points to guest address space if 
used by guest process.

Thanks
Kevin

[v2,01/40] iommu: Introduce Shared Virtual Addressing API

Commit Message

Comments

Patch