diff mbox

[RFC,29/30] vfio: Add support for Shared Virtual Memory

Message ID 20170227195441.5170-30-jean-philippe.brucker@arm.com (mailing list archive)
State New, archived
Headers show

Commit Message

Jean-Philippe Brucker Feb. 27, 2017, 7:54 p.m. UTC
Add two new ioctl for VFIO devices. VFIO_DEVICE_BIND_TASK creates a bond
between a device and a process address space, identified by a
device-specific ID named PASID. This allows the device to target DMA
transactions at the process virtual addresses without a need for mapping
and unmapping buffers explicitly in the IOMMU. The process page tables are
shared with the IOMMU, and mechanisms such as PCI ATS/PRI may be used to
handle faults. VFIO_DEVICE_UNBIND_TASK removed a bond identified by a
PASID.

Also add a capability flag in device info to detect whether the system and
the device support SVM.

Users need to specify the state of a PASID when unbinding, with flags
VFIO_PASID_RELEASE_FLUSHED and VFIO_PASID_RELEASE_CLEAN. Even for PCI,
PASID invalidation is specific to each device and only partially covered
by the specification:

* Device must have an implementation-defined mechanism for stopping the
  use of a PASID. When this mechanism finishes, the device has stopped
  issuing transactions for this PASID and all transactions for this PASID
  have been flushed to the IOMMU.

* Device may either wait for all outstanding PRI requests for this PASID
  to finish, or issue a Stop Marker message, a barrier that separates PRI
  requests affecting this instance of the PASID from PRI requests
  affecting the next instance. In the first case, we say that the PASID is
  "clean", in the second case it is "flushed" (and the IOMMU has to wait
  for the Stop Marker before reassigning the PASID.)

We expect similar distinctions for platform devices. Ideally there should
be a callback for each PCI device, allowing the IOMMU to ask the device to
stop using a PASID. When the callback returns, the PASID is either flushed
or clean and the return value tells which.

For the moment I don't know how to implement this callback for PCI, so if
the user forgets to call unbind with either "clean" or "flushed", the
PASID is never reused. For platform devices, it might be simpler to
implement since we could associate an invalidate_pasid callback to a DT
compatible string, as is currently done for reset.

Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
---
 drivers/vfio/pci/vfio_pci.c |  24 ++++++++++
 drivers/vfio/vfio.c         | 104 ++++++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/vfio.h   |  55 +++++++++++++++++++++++
 3 files changed, 183 insertions(+)

Comments

Alex Williamson Feb. 28, 2017, 3:54 a.m. UTC | #1
On Mon, 27 Feb 2017 19:54:40 +0000
Jean-Philippe Brucker <jean-philippe.brucker@arm.com> wrote:

> Add two new ioctl for VFIO devices. VFIO_DEVICE_BIND_TASK creates a bond
> between a device and a process address space, identified by a
> device-specific ID named PASID. This allows the device to target DMA
> transactions at the process virtual addresses without a need for mapping
> and unmapping buffers explicitly in the IOMMU. The process page tables are
> shared with the IOMMU, and mechanisms such as PCI ATS/PRI may be used to
> handle faults. VFIO_DEVICE_UNBIND_TASK removed a bond identified by a
> PASID.
> 
> Also add a capability flag in device info to detect whether the system and
> the device support SVM.
> 
> Users need to specify the state of a PASID when unbinding, with flags
> VFIO_PASID_RELEASE_FLUSHED and VFIO_PASID_RELEASE_CLEAN. Even for PCI,
> PASID invalidation is specific to each device and only partially covered
> by the specification:
> 
> * Device must have an implementation-defined mechanism for stopping the
>   use of a PASID. When this mechanism finishes, the device has stopped
>   issuing transactions for this PASID and all transactions for this PASID
>   have been flushed to the IOMMU.
> 
> * Device may either wait for all outstanding PRI requests for this PASID
>   to finish, or issue a Stop Marker message, a barrier that separates PRI
>   requests affecting this instance of the PASID from PRI requests
>   affecting the next instance. In the first case, we say that the PASID is
>   "clean", in the second case it is "flushed" (and the IOMMU has to wait
>   for the Stop Marker before reassigning the PASID.)
> 
> We expect similar distinctions for platform devices. Ideally there should
> be a callback for each PCI device, allowing the IOMMU to ask the device to
> stop using a PASID. When the callback returns, the PASID is either flushed
> or clean and the return value tells which.
> 
> For the moment I don't know how to implement this callback for PCI, so if
> the user forgets to call unbind with either "clean" or "flushed", the
> PASID is never reused. For platform devices, it might be simpler to
> implement since we could associate an invalidate_pasid callback to a DT
> compatible string, as is currently done for reset.
> 
> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> ---
>  drivers/vfio/pci/vfio_pci.c |  24 ++++++++++
>  drivers/vfio/vfio.c         | 104 ++++++++++++++++++++++++++++++++++++++++++++
>  include/uapi/linux/vfio.h   |  55 +++++++++++++++++++++++
>  3 files changed, 183 insertions(+)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 324c52e3a1a4..3d7733f94891 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -22,6 +22,7 @@
>  #include <linux/mutex.h>
>  #include <linux/notifier.h>
>  #include <linux/pci.h>
> +#include <linux/pci-ats.h>
>  #include <linux/pm_runtime.h>
>  #include <linux/slab.h>
>  #include <linux/types.h>
> @@ -623,6 +624,26 @@ int vfio_pci_register_dev_region(struct vfio_pci_device *vdev,
>  	return 0;
>  }
>  
> +static bool vfio_pci_supports_svm(struct vfio_pci_device *vdev)
> +{
> +	struct pci_dev *pdev = vdev->pdev;
> +
> +	if (!pdev->ats_enabled)
> +		return false;
> +
> +	if (!pdev->pasid_enabled || pci_max_pasids(pdev) <= 1)
> +		return false;
> +
> +	if (!pdev->pri_enabled)
> +		return false;
> +
> +	/*
> +	 * If the IOMMU driver enabled all of these, then it supports PCI SVM
> +	 * for this device.
> +	 */
> +	return true;
> +}
> +
>  static long vfio_pci_ioctl(void *device_data,
>  			   unsigned int cmd, unsigned long arg)
>  {
> @@ -642,6 +663,9 @@ static long vfio_pci_ioctl(void *device_data,
>  
>  		info.flags = VFIO_DEVICE_FLAGS_PCI;
>  
> +		if (vfio_pci_supports_svm(vdev))
> +			info.flags |= VFIO_DEVICE_FLAGS_SVM;
> +
>  		if (vdev->reset_works)
>  			info.flags |= VFIO_DEVICE_FLAGS_RESET;
>  
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index 609f4f982c74..c4505d8f4c61 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -97,6 +97,14 @@ struct vfio_device {
>  	struct vfio_group		*group;
>  	struct list_head		group_next;
>  	void				*device_data;
> +
> +	struct mutex			tasks_lock;
> +	struct list_head		tasks;
> +};
> +
> +struct vfio_task {
> +	int				pasid;
> +	struct list_head		list;
>  };
>  
>  #ifdef CONFIG_VFIO_NOIOMMU
> @@ -520,6 +528,9 @@ struct vfio_device *vfio_group_create_device(struct vfio_group *group,
>  	device->device_data = device_data;
>  	dev_set_drvdata(dev, device);
>  
> +	mutex_init(&device->tasks_lock);
> +	INIT_LIST_HEAD(&device->tasks);
> +
>  	/* No need to get group_lock, caller has group reference */
>  	vfio_group_get(group);
>  
> @@ -532,6 +543,8 @@ struct vfio_device *vfio_group_create_device(struct vfio_group *group,
>  
>  static void vfio_device_release(struct kref *kref)
>  {
> +	int ret;
> +	struct vfio_task *tmp, *task;
>  	struct vfio_device *device = container_of(kref,
>  						  struct vfio_device, kref);
>  	struct vfio_group *group = device->group;
> @@ -539,6 +552,22 @@ static void vfio_device_release(struct kref *kref)
>  	list_del(&device->group_next);
>  	mutex_unlock(&group->device_lock);
>  
> +	mutex_lock(&device->tasks_lock);
> +	list_for_each_entry_safe(task, tmp, &device->tasks, list) {
> +		/*
> +		 * This might leak the PASID, since the IOMMU won't know
> +		 * if it is safe to reuse.
> +		 */
> +		ret = iommu_unbind_task(device->dev, task->pasid, 0);
> +		if (ret)
> +			dev_warn(device->dev, "failed to unbind PASID %u\n",
> +				 task->pasid);
> +
> +		list_del(&task->list);
> +		kfree(task);
> +	}
> +	mutex_unlock(&device->tasks_lock);
> +
>  	dev_set_drvdata(device->dev, NULL);
>  
>  	kfree(device);
> @@ -1622,6 +1651,75 @@ static int vfio_device_fops_release(struct inode *inode, struct file *filep)
>  	return 0;
>  }
>  
> +static long vfio_svm_ioctl(struct vfio_device *device, unsigned int cmd,
> +			   unsigned long arg)
> +{
> +	int ret;
> +	unsigned long minsz;
> +
> +	struct vfio_device_svm svm;
> +	struct vfio_task *vfio_task;
> +
> +	minsz = offsetofend(struct vfio_device_svm, pasid);
> +
> +	if (copy_from_user(&svm, (void __user *)arg, minsz))
> +		return -EFAULT;
> +
> +	if (svm.argsz < minsz)
> +		return -EINVAL;
> +
> +	if (cmd == VFIO_DEVICE_BIND_TASK) {
> +		struct task_struct *task = current;

Seems like SVM should be in the name of these ioctls.

svm.flags needs to be validated here or else we lose the field for
future use... you add this in the next patch, but see compatibility
comment there.

> +
> +		ret = iommu_bind_task(device->dev, task, &svm.pasid, 0, NULL);
> +		if (ret)
> +			return ret;

vfio-pci advertises the device feature, but vfio intercepts the ioctl
and attempts to handle it regardless of device support.

We also need to be careful of using, or even referencing iommu_ops
without regard to the device or IOMMU backend.  SPAPR doesn't fully
implement IOMMU API, vfio-noiommu devices don't have iommu_ops, mdev
devices don't either.  I agree with your comments in the cover letter,
it's not entirely clear that the device fd is the right place to host
this.

> +
> +		vfio_task = kzalloc(sizeof(*vfio_task), GFP_KERNEL);
> +		if (!vfio_task) {
> +			iommu_unbind_task(device->dev, svm.pasid,
> +					  IOMMU_PASID_CLEAN);
> +			return -ENOMEM;
> +		}
> +
> +		vfio_task->pasid = svm.pasid;
> +
> +		mutex_lock(&device->tasks_lock);
> +		list_add(&vfio_task->list, &device->tasks);
> +		mutex_unlock(&device->tasks_lock);
> +
> +	} else {
> +		int flags = 0;
> +
> +		if (svm.flags & ~(VFIO_SVM_PASID_RELEASE_FLUSHED |
> +				  VFIO_SVM_PASID_RELEASE_CLEAN))
> +			return -EINVAL;
> +
> +		if (svm.flags & VFIO_SVM_PASID_RELEASE_FLUSHED)
> +			flags = IOMMU_PASID_FLUSHED;
> +		else if (svm.flags & VFIO_SVM_PASID_RELEASE_CLEAN)
> +			flags = IOMMU_PASID_CLEAN;
> +
> +		mutex_lock(&device->tasks_lock);
> +		list_for_each_entry(vfio_task, &device->tasks, list) {
> +			if (vfio_task->pasid != svm.pasid)
> +				continue;
> +
> +			ret = iommu_unbind_task(device->dev, svm.pasid, flags);
> +			if (ret)
> +				dev_warn(device->dev, "failed to unbind PASID %u\n",
> +					 vfio_task->pasid);
> +
> +			list_del(&vfio_task->list);
> +			kfree(vfio_task);
> +			break;
> +		}
> +		mutex_unlock(&device->tasks_lock);
> +	}
> +
> +	return copy_to_user((void __user *)arg, &svm, minsz) ? -EFAULT : 0;
> +}
> +
>  static long vfio_device_fops_unl_ioctl(struct file *filep,
>  				       unsigned int cmd, unsigned long arg)
>  {
> @@ -1630,6 +1728,12 @@ static long vfio_device_fops_unl_ioctl(struct file *filep,
>  	if (unlikely(!device->ops->ioctl))
>  		return -EINVAL;
>  
> +	switch (cmd) {
> +	case VFIO_DEVICE_BIND_TASK:
> +	case VFIO_DEVICE_UNBIND_TASK:
> +		return vfio_svm_ioctl(device, cmd, arg);
> +	}
> +
>  	return device->ops->ioctl(device->device_data, cmd, arg);
>  }
>  
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 519eff362c1c..3fe4197a5ea0 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -198,6 +198,7 @@ struct vfio_device_info {
>  #define VFIO_DEVICE_FLAGS_PCI	(1 << 1)	/* vfio-pci device */
>  #define VFIO_DEVICE_FLAGS_PLATFORM (1 << 2)	/* vfio-platform device */
>  #define VFIO_DEVICE_FLAGS_AMBA  (1 << 3)	/* vfio-amba device */
> +#define VFIO_DEVICE_FLAGS_SVM	(1 << 4)	/* Device supports bind/unbind */

We could also define one of the bits in vfio_device_svm.flags to be
"probe" (ie. no-op, return success).  Using an SVM flag follows the
model we used for RESET support, but I'm not convinced that's a great
model to follow.

>  	__u32	num_regions;	/* Max region index + 1 */
>  	__u32	num_irqs;	/* Max IRQ index + 1 */
>  };
> @@ -409,6 +410,60 @@ struct vfio_irq_set {
>   */
>  #define VFIO_DEVICE_RESET		_IO(VFIO_TYPE, VFIO_BASE + 11)
>  
> +struct vfio_device_svm {
> +	__u32	argsz;
> +	__u32	flags;
> +#define VFIO_SVM_PASID_RELEASE_FLUSHED	(1 << 0)
> +#define VFIO_SVM_PASID_RELEASE_CLEAN	(1 << 1)
> +	__u32	pasid;
> +};
> +/*
> + * VFIO_DEVICE_BIND_TASK - _IOWR(VFIO_TYPE, VFIO_BASE + 22,
> + *                               struct vfio_device_svm)
> + *
> + * Share a process' virtual address space with the device.
> + *
> + * This feature creates a new address space for the device, which is not
> + * affected by VFIO_IOMMU_MAP/UNMAP_DMA. Instead, the device can tag its DMA
> + * traffic with the given @pasid to perform transactions on the associated
> + * virtual address space. Mapping and unmapping of buffers is performed by
> + * standard functions such as mmap and malloc.
> + *
> + * On success, VFIO writes a Process Address Space ID (PASID) into @pasid. This
> + * ID is unique to a device.
> + *
> + * The bond between device and process must be removed with
> + * VFIO_DEVICE_UNBIND_TASK before exiting.

I'm not sure I understand this since we do a pass of unbinds on
release.  Certainly we can't rely on the user for cleanup.

> + *
> + * On fork, the child inherits the device fd and can use the bonds setup by its
> + * parent. Consequently, the child has R/W access on the address spaces bound by
> + * its parent. After an execv, the device fd is closed and the child doesn't
> + * have access to the address space anymore.
> + *
> + * Availability of this feature depends on the device, its bus, the underlying
> + * IOMMU and the CPU architecture. All of these are guaranteed when the device
> + * has VFIO_DEVICE_FLAGS_SVM flag set.
> + *
> + * returns: 0 on success, -errno on failure.
> + */
> +#define VFIO_DEVICE_BIND_TASK	_IO(VFIO_TYPE, VFIO_BASE + 22)
> +
> +/*
> + * VFIO_DEVICE_UNBIND_TASK - _IOWR(VFIO_TYPE, VFIO_BASE + 23,
> + *                                 struct vfio_device_svm)
> + *
> + * Unbind address space identified by @pasid from device. Device must have
> + * stopped issuing any DMA transaction for the PASID and flushed any reference
> + * to this PASID upstream. Some IOMMUs need to know when a PASID is safe to
> + * reuse, in which case one of the following must be present in @flags
> + *
> + * VFIO_PASID_RELEASE_FLUSHED: the PASID is safe to reassign after the IOMMU
> + *       receives an invalidation message from the device.
> + *
> + * VFIO_PASID_RELEASE_CLEAN: the PASID is safe to reassign immediately.
> + */
> +#define VFIO_DEVICE_UNBIND_TASK	_IO(VFIO_TYPE, VFIO_BASE + 23)
> +
>  /*
>   * The VFIO-PCI bus driver makes use of the following fixed region and
>   * IRQ index mapping.  Unimplemented regions return a size of zero.
Jean-Philippe Brucker Feb. 28, 2017, 3:17 p.m. UTC | #2
Hi Alex,

Thanks for the feedback!

On Mon, Feb 27, 2017 at 08:54:09PM -0700, Alex Williamson wrote:
> On Mon, 27 Feb 2017 19:54:40 +0000
> Jean-Philippe Brucker <jean-philippe.brucker@arm.com> wrote:
[...]
> >  
> > +static long vfio_svm_ioctl(struct vfio_device *device, unsigned int cmd,
> > +			   unsigned long arg)
> > +{
> > +	int ret;
> > +	unsigned long minsz;
> > +
> > +	struct vfio_device_svm svm;
> > +	struct vfio_task *vfio_task;
> > +
> > +	minsz = offsetofend(struct vfio_device_svm, pasid);
> > +
> > +	if (copy_from_user(&svm, (void __user *)arg, minsz))
> > +		return -EFAULT;
> > +
> > +	if (svm.argsz < minsz)
> > +		return -EINVAL;
> > +
> > +	if (cmd == VFIO_DEVICE_BIND_TASK) {
> > +		struct task_struct *task = current;
> 
> Seems like SVM should be in the name of these ioctls.
> 
> svm.flags needs to be validated here or else we lose the field for
> future use... you add this in the next patch, but see compatibility
> comment there.

Agreed, I'll be more careful with the flags.

> > +
> > +		ret = iommu_bind_task(device->dev, task, &svm.pasid, 0, NULL);
> > +		if (ret)
> > +			return ret;
> 
> vfio-pci advertises the device feature, but vfio intercepts the ioctl
> and attempts to handle it regardless of device support.
> 
> We also need to be careful of using, or even referencing iommu_ops
> without regard to the device or IOMMU backend.  SPAPR doesn't fully
> implement IOMMU API, vfio-noiommu devices don't have iommu_ops, mdev
> devices don't either.  I agree with your comments in the cover letter,
> it's not entirely clear that the device fd is the right place to host
> this.

Yes, and I like the idea of moving the ioctl into type1 IOMMU.
Something like VFIO_IOMMU_BIND_TASK (or perhaps VFIO_IOMMU_SVM_BIND?),
applied on the container instead of the device might be better. The
semantics are tricky to define though, both for VFIO and IOMMU, because
devices in a container or an IOMMU group might have different SVM
capabilities.

When this ioctl successfully returns with a PASID, two possibilities:
A. either it implies that all devices attached to the container are now
   able to perform DMA with this PASID,
B. or some devices in the container do not support SVM, but those that
   support it can all use the PASID. The user needs to inspect device
   flags individually to know which can support SVM. When user is a
   userspace device driver, it is familiar with the device it's driving
   and knows whether is supports SVM or not, but a VMM wouldn't.

After binding the container to the task and obtaining a PASID, user
wants to add a group to the container. So we need to replay the binding
on the new group, by telling the IOMMU to use that particular PASID. If
the device supports less PASID bits, I guess we should reject the
attach? If the device doesn't support SVM, for case A we should reject
the attach, for case B we accept it. Alternatively, we could simply
forbid to add groups to containers after a bind.

The problem is similar for adding devices to IOMMU groups. If a group is
bound to an address space, and a less capable device is added to the
group, we probably don't want to reject the device altogether, nor do we
want to unbind the PASID.

> > +
> > +		vfio_task = kzalloc(sizeof(*vfio_task), GFP_KERNEL);
> > +		if (!vfio_task) {
> > +			iommu_unbind_task(device->dev, svm.pasid,
> > +					  IOMMU_PASID_CLEAN);
> > +			return -ENOMEM;
> > +		}
> > +
> > +		vfio_task->pasid = svm.pasid;
> > +
> > +		mutex_lock(&device->tasks_lock);
> > +		list_add(&vfio_task->list, &device->tasks);
> > +		mutex_unlock(&device->tasks_lock);
> > +
> > +	} else {
> > +		int flags = 0;
> > +
> > +		if (svm.flags & ~(VFIO_SVM_PASID_RELEASE_FLUSHED |
> > +				  VFIO_SVM_PASID_RELEASE_CLEAN))
> > +			return -EINVAL;
> > +
> > +		if (svm.flags & VFIO_SVM_PASID_RELEASE_FLUSHED)
> > +			flags = IOMMU_PASID_FLUSHED;
> > +		else if (svm.flags & VFIO_SVM_PASID_RELEASE_CLEAN)
> > +			flags = IOMMU_PASID_CLEAN;
> > +
> > +		mutex_lock(&device->tasks_lock);
> > +		list_for_each_entry(vfio_task, &device->tasks, list) {
> > +			if (vfio_task->pasid != svm.pasid)
> > +				continue;
> > +
> > +			ret = iommu_unbind_task(device->dev, svm.pasid, flags);
> > +			if (ret)
> > +				dev_warn(device->dev, "failed to unbind PASID %u\n",
> > +					 vfio_task->pasid);
> > +
> > +			list_del(&vfio_task->list);
> > +			kfree(vfio_task);
> > +			break;
> > +		}
> > +		mutex_unlock(&device->tasks_lock);
> > +	}
> > +
> > +	return copy_to_user((void __user *)arg, &svm, minsz) ? -EFAULT : 0;
> > +}
> > +
> >  static long vfio_device_fops_unl_ioctl(struct file *filep,
> >  				       unsigned int cmd, unsigned long arg)
> >  {
> > @@ -1630,6 +1728,12 @@ static long vfio_device_fops_unl_ioctl(struct file *filep,
> >  	if (unlikely(!device->ops->ioctl))
> >  		return -EINVAL;
> >  
> > +	switch (cmd) {
> > +	case VFIO_DEVICE_BIND_TASK:
> > +	case VFIO_DEVICE_UNBIND_TASK:
> > +		return vfio_svm_ioctl(device, cmd, arg);
> > +	}
> > +
> >  	return device->ops->ioctl(device->device_data, cmd, arg);
> >  }
> >  
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index 519eff362c1c..3fe4197a5ea0 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -198,6 +198,7 @@ struct vfio_device_info {
> >  #define VFIO_DEVICE_FLAGS_PCI	(1 << 1)	/* vfio-pci device */
> >  #define VFIO_DEVICE_FLAGS_PLATFORM (1 << 2)	/* vfio-platform device */
> >  #define VFIO_DEVICE_FLAGS_AMBA  (1 << 3)	/* vfio-amba device */
> > +#define VFIO_DEVICE_FLAGS_SVM	(1 << 4)	/* Device supports bind/unbind */
> 
> We could also define one of the bits in vfio_device_svm.flags to be
> "probe" (ie. no-op, return success).  Using an SVM flag follows the
> model we used for RESET support, but I'm not convinced that's a great
> model to follow.
> 
> >  	__u32	num_regions;	/* Max region index + 1 */
> >  	__u32	num_irqs;	/* Max IRQ index + 1 */
> >  };
> > @@ -409,6 +410,60 @@ struct vfio_irq_set {
> >   */
> >  #define VFIO_DEVICE_RESET		_IO(VFIO_TYPE, VFIO_BASE + 11)
> >  
> > +struct vfio_device_svm {
> > +	__u32	argsz;
> > +	__u32	flags;
> > +#define VFIO_SVM_PASID_RELEASE_FLUSHED	(1 << 0)
> > +#define VFIO_SVM_PASID_RELEASE_CLEAN	(1 << 1)
> > +	__u32	pasid;
> > +};
> > +/*
> > + * VFIO_DEVICE_BIND_TASK - _IOWR(VFIO_TYPE, VFIO_BASE + 22,
> > + *                               struct vfio_device_svm)
> > + *
> > + * Share a process' virtual address space with the device.
> > + *
> > + * This feature creates a new address space for the device, which is not
> > + * affected by VFIO_IOMMU_MAP/UNMAP_DMA. Instead, the device can tag its DMA
> > + * traffic with the given @pasid to perform transactions on the associated
> > + * virtual address space. Mapping and unmapping of buffers is performed by
> > + * standard functions such as mmap and malloc.
> > + *
> > + * On success, VFIO writes a Process Address Space ID (PASID) into @pasid. This
> > + * ID is unique to a device.
> > + *
> > + * The bond between device and process must be removed with
> > + * VFIO_DEVICE_UNBIND_TASK before exiting.
> 
> I'm not sure I understand this since we do a pass of unbinds on
> release.  Certainly we can't rely on the user for cleanup.

We probably shouldn't rely on the user for cleanup, but we need its
assistance. My concern is about PASID state when unbinding. By letting
the user tell via flag "PASID_RELEASE_CLEAN" that it waited for
transactions to finish, we know that the PASID can be recycled and
reused for another task. Otherwise VFIO cannot guarantee on release that
the PASID is safe to reuse. If it did, a pending page fault in the IOMMU
or the downstream bus might hit the next address space that uses this
PASID.

So for the moment, if user doesn't explicitly call unbind with PASID
state flags, the SMMU driver considers that it isn't safe to reuse and
the PASID is never re-allocated.

We could get rid of this concern by having a PCI driver provide VFIO (or
rather the IOMMU driver) with a PASID invalidation callback.

Thanks,
Jean-Philippe

> > + *
> > + * On fork, the child inherits the device fd and can use the bonds setup by its
> > + * parent. Consequently, the child has R/W access on the address spaces bound by
> > + * its parent. After an execv, the device fd is closed and the child doesn't
> > + * have access to the address space anymore.
> > + *
> > + * Availability of this feature depends on the device, its bus, the underlying
> > + * IOMMU and the CPU architecture. All of these are guaranteed when the device
> > + * has VFIO_DEVICE_FLAGS_SVM flag set.
> > + *
> > + * returns: 0 on success, -errno on failure.
> > + */
> > +#define VFIO_DEVICE_BIND_TASK	_IO(VFIO_TYPE, VFIO_BASE + 22)
> > +
> > +/*
> > + * VFIO_DEVICE_UNBIND_TASK - _IOWR(VFIO_TYPE, VFIO_BASE + 23,
> > + *                                 struct vfio_device_svm)
> > + *
> > + * Unbind address space identified by @pasid from device. Device must have
> > + * stopped issuing any DMA transaction for the PASID and flushed any reference
> > + * to this PASID upstream. Some IOMMUs need to know when a PASID is safe to
> > + * reuse, in which case one of the following must be present in @flags
> > + *
> > + * VFIO_PASID_RELEASE_FLUSHED: the PASID is safe to reassign after the IOMMU
> > + *       receives an invalidation message from the device.
> > + *
> > + * VFIO_PASID_RELEASE_CLEAN: the PASID is safe to reassign immediately.
> > + */
> > +#define VFIO_DEVICE_UNBIND_TASK	_IO(VFIO_TYPE, VFIO_BASE + 23)
> > +
> >  /*
> >   * The VFIO-PCI bus driver makes use of the following fixed region and
> >   * IRQ index mapping.  Unimplemented regions return a size of zero.
>
Yi Liu March 21, 2017, 7:04 a.m. UTC | #3
Hi Jean,

I'm working on virtual SVM, and have some comments on the VFIO channel
definition.

> -----Original Message-----
> From: iommu-bounces@lists.linux-foundation.org [mailto:iommu-
> bounces@lists.linux-foundation.org] On Behalf Of Jean-Philippe Brucker
> Sent: Tuesday, February 28, 2017 3:55 AM
> Cc: Shanker Donthineni <shankerd@qti.qualcomm.com>; kvm@vger.kernel.org;
> Catalin Marinas <catalin.marinas@arm.com>; Sinan Kaya
> <okaya@qti.qualcomm.com>; Will Deacon <will.deacon@arm.com>;
> iommu@lists.linux-foundation.org; Harv Abdulhamid <harba@qti.qualcomm.com>;
> linux-pci@vger.kernel.org; Bjorn Helgaas <bhelgaas@google.com>; David
> Woodhouse <dwmw2@infradead.org>; linux-arm-kernel@lists.infradead.org; Nate
> Watterson <nwatters@qti.qualcomm.com>
> Subject: [RFC PATCH 29/30] vfio: Add support for Shared Virtual Memory
> 
> Add two new ioctl for VFIO devices. VFIO_DEVICE_BIND_TASK creates a bond
> between a device and a process address space, identified by a device-specific ID
> named PASID. This allows the device to target DMA transactions at the process
> virtual addresses without a need for mapping and unmapping buffers explicitly in the
> IOMMU. The process page tables are shared with the IOMMU, and mechanisms such
> as PCI ATS/PRI may be used to handle faults. VFIO_DEVICE_UNBIND_TASK removed
> a bond identified by a PASID.
> 
> Also add a capability flag in device info to detect whether the system and the device
> support SVM.
> 
> Users need to specify the state of a PASID when unbinding, with flags
> VFIO_PASID_RELEASE_FLUSHED and VFIO_PASID_RELEASE_CLEAN. Even for PCI,
> PASID invalidation is specific to each device and only partially covered by the
> specification:
> 
> * Device must have an implementation-defined mechanism for stopping the
>   use of a PASID. When this mechanism finishes, the device has stopped
>   issuing transactions for this PASID and all transactions for this PASID
>   have been flushed to the IOMMU.
> 
> * Device may either wait for all outstanding PRI requests for this PASID
>   to finish, or issue a Stop Marker message, a barrier that separates PRI
>   requests affecting this instance of the PASID from PRI requests
>   affecting the next instance. In the first case, we say that the PASID is
>   "clean", in the second case it is "flushed" (and the IOMMU has to wait
>   for the Stop Marker before reassigning the PASID.)
> 
> We expect similar distinctions for platform devices. Ideally there should be a callback
> for each PCI device, allowing the IOMMU to ask the device to stop using a PASID.
> When the callback returns, the PASID is either flushed or clean and the return value
> tells which.
> 
> For the moment I don't know how to implement this callback for PCI, so if the user
> forgets to call unbind with either "clean" or "flushed", the PASID is never reused. For
> platform devices, it might be simpler to implement since we could associate an
> invalidate_pasid callback to a DT compatible string, as is currently done for reset.
> 
> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>

[...]

>  drivers/vfio/pci/vfio_pci.c |  24 ++++++++++
>  drivers/vfio/vfio.c         | 104 ++++++++++++++++++++++++++++++++++++++++++++
>  include/uapi/linux/vfio.h   |  55 +++++++++++++++++++++++
>  3 files changed, 183 insertions(+)
> 
...
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h index
> 519eff362c1c..3fe4197a5ea0 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -198,6 +198,7 @@ struct vfio_device_info {
>  #define VFIO_DEVICE_FLAGS_PCI	(1 << 1)	/* vfio-pci device */
>  #define VFIO_DEVICE_FLAGS_PLATFORM (1 << 2)	/* vfio-platform device */
>  #define VFIO_DEVICE_FLAGS_AMBA  (1 << 3)	/* vfio-amba device */
> +#define VFIO_DEVICE_FLAGS_SVM	(1 << 4)	/* Device supports bind/unbind */
>  	__u32	num_regions;	/* Max region index + 1 */
>  	__u32	num_irqs;	/* Max IRQ index + 1 */
>  };
> @@ -409,6 +410,60 @@ struct vfio_irq_set {
>   */
>  #define VFIO_DEVICE_RESET		_IO(VFIO_TYPE, VFIO_BASE + 11)
> 
> +struct vfio_device_svm {
> +	__u32	argsz;
> +	__u32	flags;
> +#define VFIO_SVM_PASID_RELEASE_FLUSHED	(1 << 0)
> +#define VFIO_SVM_PASID_RELEASE_CLEAN	(1 << 1)
> +	__u32	pasid;
> +};

For virtual SVM work, the VFIO channel would be used to passdown guest
PASID tale PTR and invalidation information. And may have further usage
except the above.

Here is the virtual SVM design doc which illustrates the VFIO usage.
https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg05311.html

For the guest PASID table ptr passdown, I've following message in pseudo code.
struct pasid_table_info {
        __u64 ptr;
        __u32 size;
 };

For invalidation, I've following info in in pseudo code.
struct iommu_svm_tlb_invalidate_info
{
       __u32 inv_type;
#define IOTLB_INV			(1 << 0)
#define EXTENDED_IOTLB_INV		(1 << 1)
#define DEVICE_IOTLB_INV		(1 << 2)
#define EXTENDED_DEVICE_IOTLB_INV	(1 << 3)
#define PASID_CACHE_INV		(1 << 4)
       __u32 pasid;
       __u64 addr;
       __u64 size;
       __u8 granularity;
#define DEFAULT_INV_GRN        0
#define PAGE_SELECTIVE_INV     (1 << 0)
#define PASID_SELECVIVE_INV    (1 << 1)
       __u64 flags;
#define INVALIDATE_HINT_BIT    (1 << 0)
#define GLOBAL_HINT_BIT        (1 << 1)
#define DRAIN_READ_BIT         (1 << 2)
#define DRAIN_WRITE_BIT        (1 << 3)
#define DEVICE_TLB_GLOBAL_BIT  (1 << 4)
       __u8 mip;
       __u16 pfsid;
};

Although your proposal is for userspace driver SVM usage while mine is
for  SVM usage in virtual machine, there should be a chance to make the
channel meet our request. And I think it would be more acceptable. So I'd
like to see your comments if we define the channel as following definition.
If any better solution, pls feel free let me know.

struct vfio_device_svm {
       __u32   argsz;
#define VFIO_SVM_BIND_PASIDTP           (1 << 0)
#define VFIO_SVM_PASSDOWN_INVALIDATE    (1 << 1)
#define VFIO_SVM_PASID_RELEASE_FLUSHED	(1 << 2)
#define VFIO_SVM_PASID_RELEASE_CLEAN	  (1 << 3)
       __u32   flags;
       __u32   length;
       __u8    data[];
};

Thanks,
Yi L

> + * VFIO_DEVICE_BIND_TASK - _IOWR(VFIO_TYPE, VFIO_BASE + 22,
> + *                               struct vfio_device_svm)
> + *
> + * Share a process' virtual address space with the device.
> + *
> + * This feature creates a new address space for the device, which is
> +not
> + * affected by VFIO_IOMMU_MAP/UNMAP_DMA. Instead, the device can tag
> +its DMA
> + * traffic with the given @pasid to perform transactions on the
> +associated
> + * virtual address space. Mapping and unmapping of buffers is performed
> +by
> + * standard functions such as mmap and malloc.
> + *
> + * On success, VFIO writes a Process Address Space ID (PASID) into
> +@pasid. This
> + * ID is unique to a device.
> + *
> + * The bond between device and process must be removed with
> + * VFIO_DEVICE_UNBIND_TASK before exiting.
> + *
> + * On fork, the child inherits the device fd and can use the bonds
> +setup by its
> + * parent. Consequently, the child has R/W access on the address spaces
> +bound by
> + * its parent. After an execv, the device fd is closed and the child
> +doesn't
> + * have access to the address space anymore.
> + *
> + * Availability of this feature depends on the device, its bus, the
> +underlying
> + * IOMMU and the CPU architecture. All of these are guaranteed when the
> +device
> + * has VFIO_DEVICE_FLAGS_SVM flag set.
> + *
> + * returns: 0 on success, -errno on failure.
> + */
> +#define VFIO_DEVICE_BIND_TASK	_IO(VFIO_TYPE, VFIO_BASE + 22)
> +
> +/*
> + * VFIO_DEVICE_UNBIND_TASK - _IOWR(VFIO_TYPE, VFIO_BASE + 23,
> + *                                 struct vfio_device_svm)
> + *
> + * Unbind address space identified by @pasid from device. Device must
> +have
> + * stopped issuing any DMA transaction for the PASID and flushed any
> +reference
> + * to this PASID upstream. Some IOMMUs need to know when a PASID is
> +safe to
> + * reuse, in which case one of the following must be present in @flags
> + *
> + * VFIO_PASID_RELEASE_FLUSHED: the PASID is safe to reassign after the IOMMU
> + *       receives an invalidation message from the device.
> + *
> + * VFIO_PASID_RELEASE_CLEAN: the PASID is safe to reassign immediately.
> + */
> +#define VFIO_DEVICE_UNBIND_TASK	_IO(VFIO_TYPE, VFIO_BASE + 23)
> +
>  /*
>   * The VFIO-PCI bus driver makes use of the following fixed region and
>   * IRQ index mapping.  Unimplemented regions return a size of zero.
> --
> 2.11.0
> 
> _______________________________________________
> iommu mailing list
> iommu@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/iommu
Jean-Philippe Brucker March 21, 2017, 7:37 p.m. UTC | #4
On 21/03/17 07:04, Liu, Yi L wrote:
> Hi Jean,
> 
> I'm working on virtual SVM, and have some comments on the VFIO channel
> definition.

Thanks a lot for the comments, this is quite interesting to me. I just
have some concerns about portability so I'm proposing a way to be slightly
more generic below.

>> -----Original Message-----
>> From: iommu-bounces@lists.linux-foundation.org [mailto:iommu-
>> bounces@lists.linux-foundation.org] On Behalf Of Jean-Philippe Brucker
>> Sent: Tuesday, February 28, 2017 3:55 AM
>> Cc: Shanker Donthineni <shankerd@qti.qualcomm.com>; kvm@vger.kernel.org;
>> Catalin Marinas <catalin.marinas@arm.com>; Sinan Kaya
>> <okaya@qti.qualcomm.com>; Will Deacon <will.deacon@arm.com>;
>> iommu@lists.linux-foundation.org; Harv Abdulhamid <harba@qti.qualcomm.com>;
>> linux-pci@vger.kernel.org; Bjorn Helgaas <bhelgaas@google.com>; David
>> Woodhouse <dwmw2@infradead.org>; linux-arm-kernel@lists.infradead.org; Nate
>> Watterson <nwatters@qti.qualcomm.com>
>> Subject: [RFC PATCH 29/30] vfio: Add support for Shared Virtual Memory
>>
>> Add two new ioctl for VFIO devices. VFIO_DEVICE_BIND_TASK creates a bond
>> between a device and a process address space, identified by a device-specific ID
>> named PASID. This allows the device to target DMA transactions at the process
>> virtual addresses without a need for mapping and unmapping buffers explicitly in the
>> IOMMU. The process page tables are shared with the IOMMU, and mechanisms such
>> as PCI ATS/PRI may be used to handle faults. VFIO_DEVICE_UNBIND_TASK removed
>> a bond identified by a PASID.
>>
>> Also add a capability flag in device info to detect whether the system and the device
>> support SVM.
>>
>> Users need to specify the state of a PASID when unbinding, with flags
>> VFIO_PASID_RELEASE_FLUSHED and VFIO_PASID_RELEASE_CLEAN. Even for PCI,
>> PASID invalidation is specific to each device and only partially covered by the
>> specification:
>>
>> * Device must have an implementation-defined mechanism for stopping the
>>   use of a PASID. When this mechanism finishes, the device has stopped
>>   issuing transactions for this PASID and all transactions for this PASID
>>   have been flushed to the IOMMU.
>>
>> * Device may either wait for all outstanding PRI requests for this PASID
>>   to finish, or issue a Stop Marker message, a barrier that separates PRI
>>   requests affecting this instance of the PASID from PRI requests
>>   affecting the next instance. In the first case, we say that the PASID is
>>   "clean", in the second case it is "flushed" (and the IOMMU has to wait
>>   for the Stop Marker before reassigning the PASID.)
>>
>> We expect similar distinctions for platform devices. Ideally there should be a callback
>> for each PCI device, allowing the IOMMU to ask the device to stop using a PASID.
>> When the callback returns, the PASID is either flushed or clean and the return value
>> tells which.
>>
>> For the moment I don't know how to implement this callback for PCI, so if the user
>> forgets to call unbind with either "clean" or "flushed", the PASID is never reused. For
>> platform devices, it might be simpler to implement since we could associate an
>> invalidate_pasid callback to a DT compatible string, as is currently done for reset.
>>
>> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> 
> [...]
> 
>>  drivers/vfio/pci/vfio_pci.c |  24 ++++++++++
>>  drivers/vfio/vfio.c         | 104 ++++++++++++++++++++++++++++++++++++++++++++
>>  include/uapi/linux/vfio.h   |  55 +++++++++++++++++++++++
>>  3 files changed, 183 insertions(+)
>>
> ...
>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h index
>> 519eff362c1c..3fe4197a5ea0 100644
>> --- a/include/uapi/linux/vfio.h
>> +++ b/include/uapi/linux/vfio.h
>> @@ -198,6 +198,7 @@ struct vfio_device_info {
>>  #define VFIO_DEVICE_FLAGS_PCI	(1 << 1)	/* vfio-pci device */
>>  #define VFIO_DEVICE_FLAGS_PLATFORM (1 << 2)	/* vfio-platform device */
>>  #define VFIO_DEVICE_FLAGS_AMBA  (1 << 3)	/* vfio-amba device */
>> +#define VFIO_DEVICE_FLAGS_SVM	(1 << 4)	/* Device supports bind/unbind */
>>  	__u32	num_regions;	/* Max region index + 1 */
>>  	__u32	num_irqs;	/* Max IRQ index + 1 */
>>  };
>> @@ -409,6 +410,60 @@ struct vfio_irq_set {
>>   */
>>  #define VFIO_DEVICE_RESET		_IO(VFIO_TYPE, VFIO_BASE + 11)
>>
>> +struct vfio_device_svm {
>> +	__u32	argsz;
>> +	__u32	flags;
>> +#define VFIO_SVM_PASID_RELEASE_FLUSHED	(1 << 0)
>> +#define VFIO_SVM_PASID_RELEASE_CLEAN	(1 << 1)
>> +	__u32	pasid;
>> +};
> 
> For virtual SVM work, the VFIO channel would be used to passdown guest
> PASID tale PTR and invalidation information. And may have further usage
> except the above.
> 
> Here is the virtual SVM design doc which illustrates the VFIO usage.
> https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg05311.html
> 
> For the guest PASID table ptr passdown, I've following message in pseudo code.
> struct pasid_table_info {
>         __u64 ptr;
>         __u32 size;
>  };

There should probably be a way to specify the table format, so that the
pIOMMU driver can check that it recognizes the format used by the vIOMMU
before attaching it. This would allow to reuse the structure for other
IOMMU architectures. If, for instance, the host has an intel IOMMU and
someone decides to emulate an ARM SMMU with Qemu (their loss :), it can
certainly use VFIO for passing-through devices with MAP/UNMAP. But if Qemu
then attempts to passdown a PASID table in SMMU format, the Intel driver
should have a way to reject it, as the SMMU format isn't compatible.

I'm tackling a similar problem at the moment, but for passing a single
page directory instead of full PASID table to the IOMMU.

So we need some kind of high-level classification that the vIOMMU must
communicate to the physical one. Each IOMMU flavor would get a unique,
global identifier, simply to make sure that vIOMMU and pIOMMU speak the
same language. For example:

0x65776886 "AMDV" AMD IOMMU
0x73788476 "INTL" Intel IOMMU
0x83515748 "S390" s390 IOMMU
0x83777785 "SMMU" ARM SMMU
etc.

It needs to be a global magic number that everyone can recognize. Could be
as simple as 32-bit numbers allocated from 0. Once we have a global magic
number, we can use it to differentiate architecture-specific details.

struct pasid_table_info {
	__u64 ptr;
	__u64 size;		/* Is it number of entry or size in
				   bytes? */

	__u32 model;		/* magic number */
	__u32 variant;		/* version of the IOMMU architecture,
				   maybe? IOMMU-specific. */
	__u8 opaque[];		/* IOMMU-specific details */
};

And then each IOMMU or page-table code can do low-level validation of the
format, by reading the details in 'opaque'. I assume that for Intel this
would be empty. But for instance on ARM SMMUv3, PASID table can have
either one or two levels, and vIOMMU would specify which one of the three
available formats it is using.

struct pasid_table_info_smmu {
	/*
	 * In 'opaque', architecture details only the IOMMU driver should
	 * be caring about.
	 */
	__u8 s1fmt;
	__u8 s1dss;
}

If the physical SMMU doesn't implement the particular PASID table format,
it should reject the bind.

This would allow to keep architecture details outside of VFIO core (as
well as virtio in my case), and only have vIOMMU and pIOMMU understand
those details.

> 
> For invalidation, I've following info in in pseudo code.
> struct iommu_svm_tlb_invalidate_info
> {
>        __u32 inv_type;
> #define IOTLB_INV			(1 << 0)
> #define EXTENDED_IOTLB_INV		(1 << 1)
> #define DEVICE_IOTLB_INV		(1 << 2)
> #define EXTENDED_DEVICE_IOTLB_INV	(1 << 3)
> #define PASID_CACHE_INV		(1 << 4)
>        __u32 pasid;
>        __u64 addr;
>        __u64 size;
>        __u8 granularity;
> #define DEFAULT_INV_GRN        0
> #define PAGE_SELECTIVE_INV     (1 << 0)
> #define PASID_SELECVIVE_INV    (1 << 1)
>        __u64 flags;
> #define INVALIDATE_HINT_BIT    (1 << 0)
> #define GLOBAL_HINT_BIT        (1 << 1)
> #define DRAIN_READ_BIT         (1 << 2)
> #define DRAIN_WRITE_BIT        (1 << 3)
> #define DEVICE_TLB_GLOBAL_BIT  (1 << 4)
>        __u8 mip;
>        __u16 pfsid;
> };

This would also benefit from being split into generic and architectural
parts. Former would be defined in VFIO, latter would be in the IOMMU driver.

struct tlb_invalidate_info
{
	__u8 granularity
#define DEFAULT_INV_GRN		0	/* What is default? */
#define PAGE_SELECTIVE_INV	(1 << 0)
#define PASID_SELECTIVE_INV	(1 << 1)
	__u32 pasid;
	__u64 addr;
	__u64 size;

	/* Since IOMMU format has already been validated for this table,
	   the IOMMU driver knows that the following structure is in a
	   format it knows */
	__u8 opaque[];
};

struct tlb_invalidate_info_intel
{
	__u32 inv_type;
	...
	__u64 flags;
	...
	__u8 mip;
	__u16 pfsid;
};

> Although your proposal is for userspace driver SVM usage while mine is
> for  SVM usage in virtual machine, there should be a chance to make the
> channel meet our request. And I think it would be more acceptable. So I'd
> like to see your comments if we define the channel as following definition.
> If any better solution, pls feel free let me know.
> 
> struct vfio_device_svm {
>        __u32   argsz;
> #define VFIO_SVM_BIND_PASIDTP           (1 << 0)

To check we're on the same page: the absence of BIND_PASIDTP flag would
mean "bind a single PASID" and in that case, data[] would be a "u32 pasid"?

> #define VFIO_SVM_PASSDOWN_INVALIDATE    (1 << 1)

Using the vfio_device_svm structure for invalidate operations is a bit
odd, it might be nicer to add a new VFIO_SVM_INVALIDATE ioctl, that takes
the above iommu_svm_tlb_invalidate_info as argument (with an added argsz.)

> #define VFIO_SVM_PASID_RELEASE_FLUSHED	(1 << 2)
> #define VFIO_SVM_PASID_RELEASE_CLEAN	  (1 << 3)
>        __u32   flags;
>        __u32   length;

If length is the size of data[], I guess we can already deduce this info
from argsz.

>        __u8    data[];
> };

In general, I think that your proposal would work fine along mine. Note
that for my next version of this patch, I would like to move the
BIND/UNBIND SVM operations onto a VFIO container fd, instead of a VFIO
device fd, if possible.

---
As an aside, it also aligns with the one I'm working on at the moment, for
virtual SVM at a finer granularity, where the BIND call is for a page
table. I would add this flag to vfio_device_svm:

#define VFIO_SVM_BIND_PGTABLE		(1 << x)

Which would indicate the following data (just a draft, I'm oscillating
between this and a generic PASID table solution, which would instead reuse
your proposal):

struct pgtable_info {
	__u32 pasid;

	__u64 pgd;

	__u32 model;
	__u32 variant;

	__u8 opaque[];
};

On ARM SMMU we would need to specify an io-pgtable variant and the opaque
structure would be bits of io_pgtable_cfg (tcr, mair, etc.)

The problem of portability is slightly different here, because while PASID
table format is specific to an IOMMU, page table format might be the same
across multiple flavors of IOMMUs. For instance, the PASID table format I
use in this series can only be found in the ARM SMMUv3 architecture, but
the page table format is the same as ARM SMMUv2, and other MMUs. I'd like
to implement an IOMMU independent of any page-table formats, that could
use whatever the host offers (not necessarily MMU PT). The model numbers
described above wouldn't be suitable, so we'd need another set of magic
numbers for page table formats.
---

Thanks,
Jean-Philippe

>> + * VFIO_DEVICE_BIND_TASK - _IOWR(VFIO_TYPE, VFIO_BASE + 22,
>> + *                               struct vfio_device_svm)
>> + *
>> + * Share a process' virtual address space with the device.
>> + *
>> + * This feature creates a new address space for the device, which is
>> +not
>> + * affected by VFIO_IOMMU_MAP/UNMAP_DMA. Instead, the device can tag
>> +its DMA
>> + * traffic with the given @pasid to perform transactions on the
>> +associated
>> + * virtual address space. Mapping and unmapping of buffers is performed
>> +by
>> + * standard functions such as mmap and malloc.
>> + *
>> + * On success, VFIO writes a Process Address Space ID (PASID) into
>> +@pasid. This
>> + * ID is unique to a device.
>> + *
>> + * The bond between device and process must be removed with
>> + * VFIO_DEVICE_UNBIND_TASK before exiting.
>> + *
>> + * On fork, the child inherits the device fd and can use the bonds
>> +setup by its
>> + * parent. Consequently, the child has R/W access on the address spaces
>> +bound by
>> + * its parent. After an execv, the device fd is closed and the child
>> +doesn't
>> + * have access to the address space anymore.
>> + *
>> + * Availability of this feature depends on the device, its bus, the
>> +underlying
>> + * IOMMU and the CPU architecture. All of these are guaranteed when the
>> +device
>> + * has VFIO_DEVICE_FLAGS_SVM flag set.
>> + *
>> + * returns: 0 on success, -errno on failure.
>> + */
>> +#define VFIO_DEVICE_BIND_TASK	_IO(VFIO_TYPE, VFIO_BASE + 22)
>> +
>> +/*
>> + * VFIO_DEVICE_UNBIND_TASK - _IOWR(VFIO_TYPE, VFIO_BASE + 23,
>> + *                                 struct vfio_device_svm)
>> + *
>> + * Unbind address space identified by @pasid from device. Device must
>> +have
>> + * stopped issuing any DMA transaction for the PASID and flushed any
>> +reference
>> + * to this PASID upstream. Some IOMMUs need to know when a PASID is
>> +safe to
>> + * reuse, in which case one of the following must be present in @flags
>> + *
>> + * VFIO_PASID_RELEASE_FLUSHED: the PASID is safe to reassign after the IOMMU
>> + *       receives an invalidation message from the device.
>> + *
>> + * VFIO_PASID_RELEASE_CLEAN: the PASID is safe to reassign immediately.
>> + */
>> +#define VFIO_DEVICE_UNBIND_TASK	_IO(VFIO_TYPE, VFIO_BASE + 23)
>> +
>>  /*
>>   * The VFIO-PCI bus driver makes use of the following fixed region and
>>   * IRQ index mapping.  Unimplemented regions return a size of zero.
>> --
>> 2.11.0
>>
>> _______________________________________________
>> iommu mailing list
>> iommu@lists.linux-foundation.org
>> https://lists.linuxfoundation.org/mailman/listinfo/iommu
Jacob Pan March 21, 2017, 8:56 p.m. UTC | #5
On Tue, 21 Mar 2017 19:37:56 +0000
Jean-Philippe Brucker <jean-philippe.brucker@arm.com> wrote:

> > For invalidation, I've following info in in pseudo code.
> > struct iommu_svm_tlb_invalidate_info
> > {
> >        __u32 inv_type;
> > #define IOTLB_INV			(1 << 0)
> > #define EXTENDED_IOTLB_INV		(1 << 1)
> > #define DEVICE_IOTLB_INV		(1 << 2)
> > #define EXTENDED_DEVICE_IOTLB_INV	(1 << 3)
> > #define PASID_CACHE_INV		(1 << 4)
> >        __u32 pasid;
> >        __u64 addr;
> >        __u64 size;
> >        __u8 granularity;
> > #define DEFAULT_INV_GRN        0
> > #define PAGE_SELECTIVE_INV     (1 << 0)
> > #define PASID_SELECVIVE_INV    (1 << 1)
> >        __u64 flags;
> > #define INVALIDATE_HINT_BIT    (1 << 0)
> > #define GLOBAL_HINT_BIT        (1 << 1)
> > #define DRAIN_READ_BIT         (1 << 2)
> > #define DRAIN_WRITE_BIT        (1 << 3)
> > #define DEVICE_TLB_GLOBAL_BIT  (1 << 4)
> >        __u8 mip;
> >        __u16 pfsid;
> > };  
> 
> This would also benefit from being split into generic and
> architectural parts. Former would be defined in VFIO, latter would be
> in the IOMMU driver.
> 
> struct tlb_invalidate_info
> {
> 	__u8 granularity
> #define DEFAULT_INV_GRN		0	/* What is default? */
> #define PAGE_SELECTIVE_INV	(1 << 0)
> #define PASID_SELECTIVE_INV	(1 << 1)
> 	__u32 pasid;
> 	__u64 addr;
> 	__u64 size;
> 
> 	/* Since IOMMU format has already been validated for this
> table, the IOMMU driver knows that the following structure is in a
> 	   format it knows */
> 	__u8 opaque[];
> };
> 
> struct tlb_invalidate_info_intel
> {
> 	__u32 inv_type;
> 	...
> 	__u64 flags;
> 	...
> 	__u8 mip;
> 	__u16 pfsid;
> };

I presume for Intel/arch specific part, the opaque data can simply be
the descriptor itself. i.e. struct qi_desc. There is no need to
generalize since it has already been validated to be architecture
specific data.

Arch specific IOMMU driver can take further action on the opaque data,
e.g. replace guest DID with host DID based on other info passed down
from the caller.
Yi Liu March 23, 2017, 8:39 a.m. UTC | #6
Hi Jean,

Thx for the excellent ideas. Pls refer to comments inline.

[...]

> > Hi Jean,
> >
> > I'm working on virtual SVM, and have some comments on the VFIO channel
> > definition.
> 
> Thanks a lot for the comments, this is quite interesting to me. I just have some
> concerns about portability so I'm proposing a way to be slightly more generic below.
> 

yes, portability is what need to consider.

[...]

> >> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> >> index
> >> 519eff362c1c..3fe4197a5ea0 100644
> >> --- a/include/uapi/linux/vfio.h
> >> +++ b/include/uapi/linux/vfio.h
> >> @@ -198,6 +198,7 @@ struct vfio_device_info {
> >>  #define VFIO_DEVICE_FLAGS_PCI	(1 << 1)	/* vfio-pci device */
> >>  #define VFIO_DEVICE_FLAGS_PLATFORM (1 << 2)	/* vfio-platform device */
> >>  #define VFIO_DEVICE_FLAGS_AMBA  (1 << 3)	/* vfio-amba device */
> >> +#define VFIO_DEVICE_FLAGS_SVM	(1 << 4)	/* Device supports bind/unbind */
> >>  	__u32	num_regions;	/* Max region index + 1 */
> >>  	__u32	num_irqs;	/* Max IRQ index + 1 */
> >>  };
> >> @@ -409,6 +410,60 @@ struct vfio_irq_set {
> >>   */
> >>  #define VFIO_DEVICE_RESET		_IO(VFIO_TYPE, VFIO_BASE + 11)
> >>
> >> +struct vfio_device_svm {
> >> +	__u32	argsz;
> >> +	__u32	flags;
> >> +#define VFIO_SVM_PASID_RELEASE_FLUSHED	(1 << 0)
> >> +#define VFIO_SVM_PASID_RELEASE_CLEAN	(1 << 1)
> >> +	__u32	pasid;
> >> +};
> >
> > For virtual SVM work, the VFIO channel would be used to passdown guest
> > PASID tale PTR and invalidation information. And may have further
> > usage except the above.
> >
> > Here is the virtual SVM design doc which illustrates the VFIO usage.
> > https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg05311.html
> >
> > For the guest PASID table ptr passdown, I've following message in pseudo code.
> > struct pasid_table_info {
> >         __u64 ptr;
> >         __u32 size;
> >  };
> 
> There should probably be a way to specify the table format, so that the pIOMMU
> driver can check that it recognizes the format used by the vIOMMU before attaching
> it. This would allow to reuse the structure for other IOMMU architectures. If, for
> instance, the host has an intel IOMMU and someone decides to emulate an ARM
> SMMU with Qemu (their loss :), it can certainly use VFIO for passing-through devices
> with MAP/UNMAP. But if Qemu then attempts to passdown a PASID table in SMMU
> format, the Intel driver should have a way to reject it, as the SMMU format isn't
> compatible.

Exactly, it would be grt if we can have the API defined as generic as MAP/UNMAP. The
case you mentioned to emulate an ARM SMMU on an Intel platform is representative.
For such cases, the problem is different vendors may have different PASID table format
and also different page table format. In my understanding, these incompatible things
may just result in failure if users try such emulation. What's your opinion here?
Anyhow, better to listen to different voices.

> 
> I'm tackling a similar problem at the moment, but for passing a single page directory
> instead of full PASID table to the IOMMU.

For, Intel IOMMU, passing the whole guest PASID table is enough and it also avoids 
too much pgd passing. However, I'm open on this idea. You may just add a new flag
in "struct vfio_device_svm" and pass the single pgd down to host.

> 
> So we need some kind of high-level classification that the vIOMMU must
> communicate to the physical one. Each IOMMU flavor would get a unique, global
> identifier, simply to make sure that vIOMMU and pIOMMU speak the same language.
> For example:
> 
> 0x65776886 "AMDV" AMD IOMMU
> 0x73788476 "INTL" Intel IOMMU
> 0x83515748 "S390" s390 IOMMU
> 0x83777785 "SMMU" ARM SMMU
> etc.
> 
> It needs to be a global magic number that everyone can recognize. Could be as
> simple as 32-bit numbers allocated from 0. Once we have a global magic number, we
> can use it to differentiate architecture-specific details.

I may need to think more on this part.
 
> struct pasid_table_info {
> 	__u64 ptr;
> 	__u64 size;		/* Is it number of entry or size in
> 				   bytes? */

For Intel platform, it's encoded. But I can make it in bytes. Here, I'd like
to check with you if whole guest PASID info is also needed on ARM?

> 
> 	__u32 model;		/* magic number */
> 	__u32 variant;		/* version of the IOMMU architecture,
> 				   maybe? IOMMU-specific. */
> 	__u8 opaque[];		/* IOMMU-specific details */
> };
> 
> And then each IOMMU or page-table code can do low-level validation of the format,
> by reading the details in 'opaque'. I assume that for Intel this would be empty. But

yes, for Intel, if the PASID ptr is in the definition, opaque would be empty.

> for instance on ARM SMMUv3, PASID table can have either one or two levels, and
> vIOMMU would specify which one of the three available formats it is using.

Yes it is. PASID table could also be multi-level. I agree to have it into consideration.


> struct pasid_table_info_smmu {
> 	/*
> 	 * In 'opaque', architecture details only the IOMMU driver should
> 	 * be caring about.
> 	 */
> 	__u8 s1fmt;
> 	__u8 s1dss;
> }
> 
> If the physical SMMU doesn't implement the particular PASID table format, it should
> reject the bind.

So far, I think reject may be the best policy. I can't assume that different have consistent
format for the PASID table and page table as previous comments.

> 
> This would allow to keep architecture details outside of VFIO core (as well as virtio in
> my case), and only have vIOMMU and pIOMMU understand those details.
> 
> >
> > For invalidation, I've following info in in pseudo code.
> > struct iommu_svm_tlb_invalidate_info
> > {
> >        __u32 inv_type;
> > #define IOTLB_INV			(1 << 0)
> > #define EXTENDED_IOTLB_INV		(1 << 1)
> > #define DEVICE_IOTLB_INV		(1 << 2)
> > #define EXTENDED_DEVICE_IOTLB_INV	(1 << 3)
> > #define PASID_CACHE_INV		(1 << 4)
> >        __u32 pasid;
> >        __u64 addr;
> >        __u64 size;
> >        __u8 granularity;
> > #define DEFAULT_INV_GRN        0
> > #define PAGE_SELECTIVE_INV     (1 << 0)
> > #define PASID_SELECVIVE_INV    (1 << 1)
> >        __u64 flags;
> > #define INVALIDATE_HINT_BIT    (1 << 0)
> > #define GLOBAL_HINT_BIT        (1 << 1)
> > #define DRAIN_READ_BIT         (1 << 2)
> > #define DRAIN_WRITE_BIT        (1 << 3)
> > #define DEVICE_TLB_GLOBAL_BIT  (1 << 4)
> >        __u8 mip;
> >        __u16 pfsid;
> > };
> 
> This would also benefit from being split into generic and architectural parts. Former
> would be defined in VFIO, latter would be in the IOMMU driver.

For invalidation part, I'm trying to have a generic definition by including all possible information
for a TLB invalidation. Anyhow, I would  split the information when I send out my RFC patch
for virtual SVM.

> 
> struct tlb_invalidate_info
> {
> 	__u8 granularity
> #define DEFAULT_INV_GRN		0	/* What is default? */

It's device selective. Since all invalidation from guest should be at least device-selective, so I
name it as default. Would rename it to make it clear.

> #define PAGE_SELECTIVE_INV	(1 << 0)
> #define PASID_SELECTIVE_INV	(1 << 1)
> 	__u32 pasid;
> 	__u64 addr;
> 	__u64 size;
> 
> 	/* Since IOMMU format has already been validated for this table,
> 	   the IOMMU driver knows that the following structure is in a
> 	   format it knows */
> 	__u8 opaque[];
> };
> 
> struct tlb_invalidate_info_intel
> {
> 	__u32 inv_type;
> 	...
> 	__u64 flags;
> 	...
> 	__u8 mip;
> 	__u16 pfsid;
> };
> 
> > Although your proposal is for userspace driver SVM usage while mine is
> > for  SVM usage in virtual machine, there should be a chance to make
> > the channel meet our request. And I think it would be more acceptable.
> > So I'd like to see your comments if we define the channel as following definition.
> > If any better solution, pls feel free let me know.
> >
> > struct vfio_device_svm {
> >        __u32   argsz;
> > #define VFIO_SVM_BIND_PASIDTP           (1 << 0)
> 
> To check we're on the same page: the absence of BIND_PASIDTP flag would mean
> "bind a single PASID" and in that case, data[] would be a "u32 pasid"?

Actually, I planned to use a single channel for both guest PASID table ptr passdown
and invalidation info passdown. So it is defined in this way.

VFIO_SVM_BIND_PASIDTP   -> data[] includes guest PASID table ptr and table size
VFIO_SVM_PASSDOWN_INVALIDATE    -> data[] includes infos for invalidataion

Now, we want to have it shared by different vendors. So I would remove invalidate
definition from it. Regards to your example, yes it would be a "u32 pasid" if you are
passing a PASID value from guest. I think we are on the same page for the usage? 

> 
> > #define VFIO_SVM_PASSDOWN_INVALIDATE    (1 << 1)
> 
> Using the vfio_device_svm structure for invalidate operations is a bit odd, it might be
> nicer to add a new VFIO_SVM_INVALIDATE ioctl, that takes the above
> iommu_svm_tlb_invalidate_info as argument (with an added argsz.)

Agree, would add a separate IOCTL for invalidation.

> 
> > #define VFIO_SVM_PASID_RELEASE_FLUSHED	(1 << 2)
> > #define VFIO_SVM_PASID_RELEASE_CLEAN	  (1 << 3)
> >        __u32   flags;
> >        __u32   length;
> 
> If length is the size of data[], I guess we can already deduce this info from argsz.

yes, it is size of data. Maybe remove argsz. How about your opinion?
 
> >        __u8    data[];
> > };
> 
> In general, I think that your proposal would work fine along mine. Note that for my
> next version of this patch, I would like to move the BIND/UNBIND SVM operations
> onto a VFIO container fd, instead of a VFIO device fd, if possible.

Attach the BIND/UNBIND operation onto VFIO container fd is practical.

BTW. Before you send out your next version, we'd better have a consensus on the
vfio_device_svm definition. So that we can continue to drive our own work separately.

> ---
> As an aside, it also aligns with the one I'm working on at the moment, for virtual
> SVM at a finer granularity, where the BIND call is for a page table. I would add this
> flag to vfio_device_svm:
> 
> #define VFIO_SVM_BIND_PGTABLE		(1 << x)

Sure. I think it may also be a requirement from other vendors. I think you've mentioned
it in the above statements.

> Which would indicate the following data (just a draft, I'm oscillating between this
> and a generic PASID table solution, which would instead reuse your proposal):
> 
> struct pgtable_info {
> 	__u32 pasid;
> 
> 	__u64 pgd;
> 
> 	__u32 model;
> 	__u32 variant;
> 
> 	__u8 opaque[];
> };
> 
> On ARM SMMU we would need to specify an io-pgtable variant and the opaque
> structure would be bits of io_pgtable_cfg (tcr, mair, etc.)
> 
> The problem of portability is slightly different here, because while PASID table
> format is specific to an IOMMU, page table format might be the same across
> multiple flavors of IOMMUs. For instance, the PASID table format I use in this series
> can only be found in the ARM SMMUv3 architecture, but the page table format is the
> same as ARM SMMUv2, and other MMUs. I'd like to implement an IOMMU
> independent of any page-table formats, that could use whatever the host offers (not
> necessarily MMU PT). The model numbers described above wouldn't be suitable, so
> we'd need another set of magic numbers for page table formats.

Not sure if I totally got your points. We may assume PASID table format differs from
vendor to vendor. For the page table format, I assume you mean the page table of
process. Or in other words MMU page table.

I'm a little bit confused about the following statements. Could you speak a little bit more?
Is it for virtual SVM or user-space driver SVM usage?
" I'd like to implement an IOMMU independent of any page-table formats, that could
use whatever the host offers (not necessarily MMU PT)."

I's nice to have such discussion. Let's co-work and have a well-defined API.

Thanks,
Yi L
Jean-Philippe Brucker March 23, 2017, 1:38 p.m. UTC | #7
On 23/03/17 08:39, Liu, Yi L wrote:
> Hi Jean,
> 
> Thx for the excellent ideas. Pls refer to comments inline.
> 
> [...]
> 
>>> Hi Jean,
>>>
>>> I'm working on virtual SVM, and have some comments on the VFIO channel
>>> definition.
>>
>> Thanks a lot for the comments, this is quite interesting to me. I just have some
>> concerns about portability so I'm proposing a way to be slightly more generic below.
>>
> 
> yes, portability is what need to consider.
> 
> [...]
> 
>>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>>>> index
>>>> 519eff362c1c..3fe4197a5ea0 100644
>>>> --- a/include/uapi/linux/vfio.h
>>>> +++ b/include/uapi/linux/vfio.h
>>>> @@ -198,6 +198,7 @@ struct vfio_device_info {
>>>>  #define VFIO_DEVICE_FLAGS_PCI	(1 << 1)	/* vfio-pci device */
>>>>  #define VFIO_DEVICE_FLAGS_PLATFORM (1 << 2)	/* vfio-platform device */
>>>>  #define VFIO_DEVICE_FLAGS_AMBA  (1 << 3)	/* vfio-amba device */
>>>> +#define VFIO_DEVICE_FLAGS_SVM	(1 << 4)	/* Device supports bind/unbind */
>>>>  	__u32	num_regions;	/* Max region index + 1 */
>>>>  	__u32	num_irqs;	/* Max IRQ index + 1 */
>>>>  };
>>>> @@ -409,6 +410,60 @@ struct vfio_irq_set {
>>>>   */
>>>>  #define VFIO_DEVICE_RESET		_IO(VFIO_TYPE, VFIO_BASE + 11)
>>>>
>>>> +struct vfio_device_svm {
>>>> +	__u32	argsz;
>>>> +	__u32	flags;
>>>> +#define VFIO_SVM_PASID_RELEASE_FLUSHED	(1 << 0)
>>>> +#define VFIO_SVM_PASID_RELEASE_CLEAN	(1 << 1)
>>>> +	__u32	pasid;
>>>> +};
>>>
>>> For virtual SVM work, the VFIO channel would be used to passdown guest
>>> PASID tale PTR and invalidation information. And may have further
>>> usage except the above.
>>>
>>> Here is the virtual SVM design doc which illustrates the VFIO usage.
>>> https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg05311.html
>>>
>>> For the guest PASID table ptr passdown, I've following message in pseudo code.
>>> struct pasid_table_info {
>>>         __u64 ptr;
>>>         __u32 size;
>>>  };
>>
>> There should probably be a way to specify the table format, so that the pIOMMU
>> driver can check that it recognizes the format used by the vIOMMU before attaching
>> it. This would allow to reuse the structure for other IOMMU architectures. If, for
>> instance, the host has an intel IOMMU and someone decides to emulate an ARM
>> SMMU with Qemu (their loss :), it can certainly use VFIO for passing-through devices
>> with MAP/UNMAP. But if Qemu then attempts to passdown a PASID table in SMMU
>> format, the Intel driver should have a way to reject it, as the SMMU format isn't
>> compatible.
> 
> Exactly, it would be grt if we can have the API defined as generic as MAP/UNMAP. The
> case you mentioned to emulate an ARM SMMU on an Intel platform is representative.
> For such cases, the problem is different vendors may have different PASID table format
> and also different page table format. In my understanding, these incompatible things
> may just result in failure if users try such emulation. What's your opinion here?
> Anyhow, better to listen to different voices.

Yes, in case the vIOMMU and pIOMMU implement different PASID table
formats, there is simply no way to virtualize SVM, the physical IOMMU will
fault when reading the foreign PASID table, since it won't find the pgd
pointers at the right location. Rejecting the bind gracefully seems much
preferable than letting the pIOMMU fault, so the vIOMMU can always use the
generic MAP/UNMAP API as a fallback.

>> I'm tackling a similar problem at the moment, but for passing a single page directory
>> instead of full PASID table to the IOMMU.
> 
> For, Intel IOMMU, passing the whole guest PASID table is enough and it also avoids 
> too much pgd passing. However, I'm open on this idea. You may just add a new flag
> in "struct vfio_device_svm" and pass the single pgd down to host.
> 
>>
>> So we need some kind of high-level classification that the vIOMMU must
>> communicate to the physical one. Each IOMMU flavor would get a unique, global
>> identifier, simply to make sure that vIOMMU and pIOMMU speak the same language.
>> For example:
>>
>> 0x65776886 "AMDV" AMD IOMMU
>> 0x73788476 "INTL" Intel IOMMU
>> 0x83515748 "S390" s390 IOMMU
>> 0x83777785 "SMMU" ARM SMMU
>> etc.
>>
>> It needs to be a global magic number that everyone can recognize. Could be as
>> simple as 32-bit numbers allocated from 0. Once we have a global magic number, we
>> can use it to differentiate architecture-specific details.
> 
> I may need to think more on this part.
>  
>> struct pasid_table_info {
>> 	__u64 ptr;
>> 	__u64 size;		/* Is it number of entry or size in
>> 				   bytes? */
> 
> For Intel platform, it's encoded. But I can make it in bytes. Here, I'd like
> to check with you if whole guest PASID info is also needed on ARM?

It will be needed on ARM if someone ever emulates the SMMU with SVM.
Though I'm not planning on doing that myself, it is unavoidable. And it
would be a shame for the next SVM virtualization solution to have to
introduce a new flag "VFIO_SVM_BIND_PASIDPT_2" if they could reuse most of
the BIND_PASIDPT interface but simply needed to add one or two
configuration fields specific to their IOMMU.

>>
>> 	__u32 model;		/* magic number */
>> 	__u32 variant;		/* version of the IOMMU architecture,
>> 				   maybe? IOMMU-specific. */
>> 	__u8 opaque[];		/* IOMMU-specific details */
>> };
>>
>> And then each IOMMU or page-table code can do low-level validation of the format,
>> by reading the details in 'opaque'. I assume that for Intel this would be empty. But
> 
> yes, for Intel, if the PASID ptr is in the definition, opaque would be empty.
> 
>> for instance on ARM SMMUv3, PASID table can have either one or two levels, and
>> vIOMMU would specify which one of the three available formats it is using.
> 
> Yes it is. PASID table could also be multi-level. I agree to have it into consideration.
> 
> 
>> struct pasid_table_info_smmu {
>> 	/*
>> 	 * In 'opaque', architecture details only the IOMMU driver should
>> 	 * be caring about.
>> 	 */
>> 	__u8 s1fmt;
>> 	__u8 s1dss;
>> }
>>
>> If the physical SMMU doesn't implement the particular PASID table format, it should
>> reject the bind.
> 
> So far, I think reject may be the best policy. I can't assume that different have consistent
> format for the PASID table and page table as previous comments.

Yes, for example AMD and Intel formats could be compatible, but the SMMU
format is quite different, using 64 bytes per PASID entry instead of 8 bytes.

>>
>> This would allow to keep architecture details outside of VFIO core (as well as virtio in
>> my case), and only have vIOMMU and pIOMMU understand those details.
>>
>>>
>>> For invalidation, I've following info in in pseudo code.
>>> struct iommu_svm_tlb_invalidate_info
>>> {
>>>        __u32 inv_type;
>>> #define IOTLB_INV			(1 << 0)
>>> #define EXTENDED_IOTLB_INV		(1 << 1)
>>> #define DEVICE_IOTLB_INV		(1 << 2)
>>> #define EXTENDED_DEVICE_IOTLB_INV	(1 << 3)
>>> #define PASID_CACHE_INV		(1 << 4)
>>>        __u32 pasid;
>>>        __u64 addr;
>>>        __u64 size;
>>>        __u8 granularity;
>>> #define DEFAULT_INV_GRN        0
>>> #define PAGE_SELECTIVE_INV     (1 << 0)
>>> #define PASID_SELECVIVE_INV    (1 << 1)
>>>        __u64 flags;
>>> #define INVALIDATE_HINT_BIT    (1 << 0)
>>> #define GLOBAL_HINT_BIT        (1 << 1)
>>> #define DRAIN_READ_BIT         (1 << 2)
>>> #define DRAIN_WRITE_BIT        (1 << 3)
>>> #define DEVICE_TLB_GLOBAL_BIT  (1 << 4)
>>>        __u8 mip;
>>>        __u16 pfsid;
>>> };
>>
>> This would also benefit from being split into generic and architectural parts. Former
>> would be defined in VFIO, latter would be in the IOMMU driver.
> 
> For invalidation part, I'm trying to have a generic definition by including all possible information
> for a TLB invalidation. Anyhow, I would  split the information when I send out my RFC patch
> for virtual SVM.
> 
>>
>> struct tlb_invalidate_info
>> {
>> 	__u8 granularity
>> #define DEFAULT_INV_GRN		0	/* What is default? */
> 
> It's device selective. Since all invalidation from guest should be at least device-selective, so I
> name it as default. Would rename it to make it clear.
> 
>> #define PAGE_SELECTIVE_INV	(1 << 0)
>> #define PASID_SELECTIVE_INV	(1 << 1)
>> 	__u32 pasid;
>> 	__u64 addr;
>> 	__u64 size;
>>
>> 	/* Since IOMMU format has already been validated for this table,
>> 	   the IOMMU driver knows that the following structure is in a
>> 	   format it knows */
>> 	__u8 opaque[];
>> };
>>
>> struct tlb_invalidate_info_intel
>> {
>> 	__u32 inv_type;
>> 	...
>> 	__u64 flags;
>> 	...
>> 	__u8 mip;
>> 	__u16 pfsid;
>> };
>>
>>> Although your proposal is for userspace driver SVM usage while mine is
>>> for  SVM usage in virtual machine, there should be a chance to make
>>> the channel meet our request. And I think it would be more acceptable.
>>> So I'd like to see your comments if we define the channel as following definition.
>>> If any better solution, pls feel free let me know.
>>>
>>> struct vfio_device_svm {
>>>        __u32   argsz;
>>> #define VFIO_SVM_BIND_PASIDTP           (1 << 0)
>>
>> To check we're on the same page: the absence of BIND_PASIDTP flag would mean
>> "bind a single PASID" and in that case, data[] would be a "u32 pasid"?
> 
> Actually, I planned to use a single channel for both guest PASID table ptr passdown
> and invalidation info passdown. So it is defined in this way.
> 
> VFIO_SVM_BIND_PASIDTP   -> data[] includes guest PASID table ptr and table size
> VFIO_SVM_PASSDOWN_INVALIDATE    -> data[] includes infos for invalidataion
> 
> Now, we want to have it shared by different vendors. So I would remove invalidate
> definition from it. Regards to your example, yes it would be a "u32 pasid" if you are
> passing a PASID value from guest. I think we are on the same page for the usage? 
> 

Yes, that seems sensible. I could add an explicit VFIO_BIND_PASID flags to
make it explicit that data[] is "u32 pasid" and avoid having any default.

>>
>>> #define VFIO_SVM_PASSDOWN_INVALIDATE    (1 << 1)
>>
>> Using the vfio_device_svm structure for invalidate operations is a bit odd, it might be
>> nicer to add a new VFIO_SVM_INVALIDATE ioctl, that takes the above
>> iommu_svm_tlb_invalidate_info as argument (with an added argsz.)
> 
> Agree, would add a separate IOCTL for invalidation.
> 
>>
>>> #define VFIO_SVM_PASID_RELEASE_FLUSHED	(1 << 2)
>>> #define VFIO_SVM_PASID_RELEASE_CLEAN	  (1 << 3)
>>>        __u32   flags;
>>>        __u32   length;
>>
>> If length is the size of data[], I guess we can already deduce this info from argsz.
> 
> yes, it is size of data. Maybe remove argsz. How about your opinion?

Argsz as first argument is standard across all VFIO ioctls, it allows the
kernel to check that the structure passed by the guest is consistent with
the structure it knows (see the comment at the beginning of
include/uapi/linux/vfio.h) So argsz would be the preferred way to
communicate the size of data[]

>>>        __u8    data[];
>>> };
>>
>> In general, I think that your proposal would work fine along mine. Note that for my
>> next version of this patch, I would like to move the BIND/UNBIND SVM operations
>> onto a VFIO container fd, instead of a VFIO device fd, if possible.
> 
> Attach the BIND/UNBIND operation onto VFIO container fd is practical.
> 
> BTW. Before you send out your next version, we'd better have a consensus on the
> vfio_device_svm definition. So that we can continue to drive our own work separately.
> 
>> ---
>> As an aside, it also aligns with the one I'm working on at the moment, for virtual
>> SVM at a finer granularity, where the BIND call is for a page table. I would add this
>> flag to vfio_device_svm:
>>
>> #define VFIO_SVM_BIND_PGTABLE		(1 << x)
> 
> Sure. I think it may also be a requirement from other vendors. I think you've mentioned
> it in the above statements.
> 
>> Which would indicate the following data (just a draft, I'm oscillating between this
>> and a generic PASID table solution, which would instead reuse your proposal):
>>
>> struct pgtable_info {
>> 	__u32 pasid;
>>
>> 	__u64 pgd;
>>
>> 	__u32 model;
>> 	__u32 variant;
>>
>> 	__u8 opaque[];
>> };
>>
>> On ARM SMMU we would need to specify an io-pgtable variant and the opaque
>> structure would be bits of io_pgtable_cfg (tcr, mair, etc.)
>>
>> The problem of portability is slightly different here, because while PASID table
>> format is specific to an IOMMU, page table format might be the same across
>> multiple flavors of IOMMUs. For instance, the PASID table format I use in this series
>> can only be found in the ARM SMMUv3 architecture, but the page table format is the
>> same as ARM SMMUv2, and other MMUs. I'd like to implement an IOMMU
>> independent of any page-table formats, that could use whatever the host offers (not
>> necessarily MMU PT). The model numbers described above wouldn't be suitable, so
>> we'd need another set of magic numbers for page table formats.
> 
> Not sure if I totally got your points. We may assume PASID table format differs from
> vendor to vendor. For the page table format, I assume you mean the page table of
> process. Or in other words MMU page table.
> 
> I'm a little bit confused about the following statements. Could you speak a little bit more?
> Is it for virtual SVM or user-space driver SVM usage?
> " I'd like to implement an IOMMU independent of any page-table formats, that could
> use whatever the host offers (not necessarily MMU PT)."

The IOMMU might be using a different page table format than the MMU. I'm
currently toying with the idea of having a portable paravirtualized IOMMU
that can query the page table format used by pIOMMU, and adopt it if the
page table handling code is available in the guest. This would be for
non-SVM nested translation, using VFIO to bind page tables private to the
IOMMU. I'll send a separate RFC to discuss this.

> I's nice to have such discussion. Let's co-work and have a well-defined API.

I agree. I don't plan to send a new version immediately since it will take
some time rework, but we can sync-up here or in private to avoid
conflicting proposals.

Thanks,
Jean-Philippe
Yi Liu March 24, 2017, 7:46 a.m. UTC | #8
> -----Original Message-----
> From: kvm-owner@vger.kernel.org [mailto:kvm-owner@vger.kernel.org] On Behalf
> Of Jean-Philippe Brucker
> Sent: Thursday, March 23, 2017 9:38 PM
> To: Liu, Yi L <yi.l.liu@intel.com>; Alex Williamson <alex.williamson@redhat.com>
> Cc: Shanker Donthineni <shankerd@qti.qualcomm.com>; kvm@vger.kernel.org;
> Catalin Marinas <catalin.marinas@arm.com>; Sinan Kaya
> <okaya@qti.qualcomm.com>; Will Deacon <will.deacon@arm.com>;
> iommu@lists.linux-foundation.org; Harv Abdulhamid <harba@qti.qualcomm.com>;
> linux-pci@vger.kernel.org; Bjorn Helgaas <bhelgaas@google.com>; David
> Woodhouse <dwmw2@infradead.org>; linux-arm-kernel@lists.infradead.org; Nate
> Watterson <nwatters@qti.qualcomm.com>; Tian, Kevin <kevin.tian@intel.com>;
> Lan, Tianyu <tianyu.lan@intel.com>; Raj, Ashok <ashok.raj@intel.com>; Pan, Jacob
> jun <jacob.jun.pan@intel.com>; Joerg Roedel <joro@8bytes.org>; Robin Murphy
> <robin.murphy@arm.com>
> Subject: Re: [RFC PATCH 29/30] vfio: Add support for Shared Virtual Memory
> 
> On 23/03/17 08:39, Liu, Yi L wrote:
> > Hi Jean,
> >
> > Thx for the excellent ideas. Pls refer to comments inline.
> >
> > [...]
> >
> >>> Hi Jean,
> >>>
> >>> I'm working on virtual SVM, and have some comments on the VFIO
> >>> channel definition.
> >>
> >> Thanks a lot for the comments, this is quite interesting to me. I
> >> just have some concerns about portability so I'm proposing a way to be slightly
> more generic below.
> >>
> >
> > yes, portability is what need to consider.
> >
> > [...]
> >
> >>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> >>>> index
> >>>> 519eff362c1c..3fe4197a5ea0 100644
> >>>> --- a/include/uapi/linux/vfio.h
> >>>> +++ b/include/uapi/linux/vfio.h
> >>>> @@ -198,6 +198,7 @@ struct vfio_device_info {
> >>>>  #define VFIO_DEVICE_FLAGS_PCI	(1 << 1)	/* vfio-pci device */
> >>>>  #define VFIO_DEVICE_FLAGS_PLATFORM (1 << 2)	/* vfio-platform device */
> >>>>  #define VFIO_DEVICE_FLAGS_AMBA  (1 << 3)	/* vfio-amba device */
> >>>> +#define VFIO_DEVICE_FLAGS_SVM	(1 << 4)	/* Device supports
> bind/unbind */
> >>>>  	__u32	num_regions;	/* Max region index + 1 */
> >>>>  	__u32	num_irqs;	/* Max IRQ index + 1 */
> >>>>  };
> >>>> @@ -409,6 +410,60 @@ struct vfio_irq_set {
> >>>>   */
> >>>>  #define VFIO_DEVICE_RESET		_IO(VFIO_TYPE, VFIO_BASE + 11)
> >>>>
> >>>> +struct vfio_device_svm {
> >>>> +	__u32	argsz;
> >>>> +	__u32	flags;
> >>>> +#define VFIO_SVM_PASID_RELEASE_FLUSHED	(1 << 0)
> >>>> +#define VFIO_SVM_PASID_RELEASE_CLEAN	(1 << 1)
> >>>> +	__u32	pasid;
> >>>> +};
> >>>
> >>> For virtual SVM work, the VFIO channel would be used to passdown
> >>> guest PASID tale PTR and invalidation information. And may have
> >>> further usage except the above.
> >>>
> >>> Here is the virtual SVM design doc which illustrates the VFIO usage.
> >>> https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg05311.html
> >>>
> >>> For the guest PASID table ptr passdown, I've following message in pseudo code.
> >>> struct pasid_table_info {
> >>>         __u64 ptr;
> >>>         __u32 size;
> >>>  };
> >>
> >> There should probably be a way to specify the table format, so that
> >> the pIOMMU driver can check that it recognizes the format used by the
> >> vIOMMU before attaching it. This would allow to reuse the structure
> >> for other IOMMU architectures. If, for instance, the host has an
> >> intel IOMMU and someone decides to emulate an ARM SMMU with Qemu
> >> (their loss :), it can certainly use VFIO for passing-through devices
> >> with MAP/UNMAP. But if Qemu then attempts to passdown a PASID table
> >> in SMMU format, the Intel driver should have a way to reject it, as the SMMU
> format isn't compatible.
> >
> > Exactly, it would be grt if we can have the API defined as generic as
> > MAP/UNMAP. The case you mentioned to emulate an ARM SMMU on an Intel
> platform is representative.
> > For such cases, the problem is different vendors may have different
> > PASID table format and also different page table format. In my
> > understanding, these incompatible things may just result in failure if users try such
> emulation. What's your opinion here?
> > Anyhow, better to listen to different voices.
> 
> Yes, in case the vIOMMU and pIOMMU implement different PASID table formats,
> there is simply no way to virtualize SVM, the physical IOMMU will fault when reading
> the foreign PASID table, since it won't find the pgd pointers at the right location.
> Rejecting the bind gracefully seems much preferable than letting the pIOMMU fault,
> so the vIOMMU can always use the generic MAP/UNMAP API as a fallback.
> 
> >> I'm tackling a similar problem at the moment, but for passing a
> >> single page directory instead of full PASID table to the IOMMU.
> >
> > For, Intel IOMMU, passing the whole guest PASID table is enough and it
> > also avoids too much pgd passing. However, I'm open on this idea. You
> > may just add a new flag in "struct vfio_device_svm" and pass the single pgd down
> to host.
> >
> >>
> >> So we need some kind of high-level classification that the vIOMMU
> >> must communicate to the physical one. Each IOMMU flavor would get a
> >> unique, global identifier, simply to make sure that vIOMMU and pIOMMU speak
> the same language.
> >> For example:
> >>
> >> 0x65776886 "AMDV" AMD IOMMU
> >> 0x73788476 "INTL" Intel IOMMU
> >> 0x83515748 "S390" s390 IOMMU
> >> 0x83777785 "SMMU" ARM SMMU
> >> etc.
> >>
> >> It needs to be a global magic number that everyone can recognize.
> >> Could be as simple as 32-bit numbers allocated from 0. Once we have a
> >> global magic number, we can use it to differentiate architecture-specific details.

I prefer simple numbers to stand for each vendor.

> > I may need to think more on this part.
> >
> >> struct pasid_table_info {
> >> 	__u64 ptr;
> >> 	__u64 size;		/* Is it number of entry or size in
> >> 				   bytes? */
> >
> > For Intel platform, it's encoded. But I can make it in bytes. Here,
> > I'd like to check with you if whole guest PASID info is also needed on ARM?
> 
> It will be needed on ARM if someone ever emulates the SMMU with SVM.
> Though I'm not planning on doing that myself, it is unavoidable. And it would be a
> shame for the next SVM virtualization solution to have to introduce a new flag
> "VFIO_SVM_BIND_PASIDPT_2" if they could reuse most of the BIND_PASIDPT
> interface but simply needed to add one or two configuration fields specific to their
> IOMMU.

So you are totally fine with putting PASID table ptr and size in the generic part? Maybe
we have different usage for it. For me, it's a guest PASID table ptr. For you, it may be
different.

> >>
> >> 	__u32 model;		/* magic number */
> >> 	__u32 variant;		/* version of the IOMMU architecture,
> >> 				   maybe? IOMMU-specific. */

For variant, it will be combined with model to do sanity check. Am I right?
Maybe it could be moved to opaque.
 
> >> 	__u8 opaque[];		/* IOMMU-specific details */
> >> };
> >>
> >> And then each IOMMU or page-table code can do low-level validation of
> >> the format, by reading the details in 'opaque'. I assume that for
> >> Intel this would be empty. But
> >
> > yes, for Intel, if the PASID ptr is in the definition, opaque would be empty.
> >
> >> for instance on ARM SMMUv3, PASID table can have either one or two
> >> levels, and vIOMMU would specify which one of the three available formats it is
> using.
> >
> > Yes it is. PASID table could also be multi-level. I agree to have it into consideration.
> >
> >
> >> struct pasid_table_info_smmu {
> >> 	/*
> >> 	 * In 'opaque', architecture details only the IOMMU driver should
> >> 	 * be caring about.
> >> 	 */
> >> 	__u8 s1fmt;
> >> 	__u8 s1dss;
> >> }
> >>
> >> If the physical SMMU doesn't implement the particular PASID table
> >> format, it should reject the bind.
> >
> > So far, I think reject may be the best policy. I can't assume that
> > different have consistent format for the PASID table and page table as previous
> comments.
> 
> Yes, for example AMD and Intel formats could be compatible, but the SMMU format
> is quite different, using 64 bytes per PASID entry instead of 8 bytes.
> 
> >>
> >> This would allow to keep architecture details outside of VFIO core
> >> (as well as virtio in my case), and only have vIOMMU and pIOMMU understand
> those details.
> >>
> >>>
> >>> For invalidation, I've following info in in pseudo code.
> >>> struct iommu_svm_tlb_invalidate_info {
> >>>        __u32 inv_type;
> >>> #define IOTLB_INV			(1 << 0)
> >>> #define EXTENDED_IOTLB_INV		(1 << 1)
> >>> #define DEVICE_IOTLB_INV		(1 << 2)
> >>> #define EXTENDED_DEVICE_IOTLB_INV	(1 << 3)
> >>> #define PASID_CACHE_INV		(1 << 4)
> >>>        __u32 pasid;
> >>>        __u64 addr;
> >>>        __u64 size;
> >>>        __u8 granularity;
> >>> #define DEFAULT_INV_GRN        0
> >>> #define PAGE_SELECTIVE_INV     (1 << 0)
> >>> #define PASID_SELECVIVE_INV    (1 << 1)
> >>>        __u64 flags;
> >>> #define INVALIDATE_HINT_BIT    (1 << 0)
> >>> #define GLOBAL_HINT_BIT        (1 << 1)
> >>> #define DRAIN_READ_BIT         (1 << 2)
> >>> #define DRAIN_WRITE_BIT        (1 << 3)
> >>> #define DEVICE_TLB_GLOBAL_BIT  (1 << 4)
> >>>        __u8 mip;
> >>>        __u16 pfsid;
> >>> };
> >>
> >> This would also benefit from being split into generic and
> >> architectural parts. Former would be defined in VFIO, latter would be in the
> IOMMU driver.
> >
> > For invalidation part, I'm trying to have a generic definition by
> > including all possible information for a TLB invalidation. Anyhow, I
> > would  split the information when I send out my RFC patch for virtual SVM.
> >
> >>
> >> struct tlb_invalidate_info
> >> {
> >> 	__u8 granularity
> >> #define DEFAULT_INV_GRN		0	/* What is default? */
> >
> > It's device selective. Since all invalidation from guest should be at
> > least device-selective, so I name it as default. Would rename it to make it clear.
> >
> >> #define PAGE_SELECTIVE_INV	(1 << 0)
> >> #define PASID_SELECTIVE_INV	(1 << 1)
> >> 	__u32 pasid;
> >> 	__u64 addr;
> >> 	__u64 size;
> >>
> >> 	/* Since IOMMU format has already been validated for this table,
> >> 	   the IOMMU driver knows that the following structure is in a
> >> 	   format it knows */
> >> 	__u8 opaque[];
> >> };
> >>
> >> struct tlb_invalidate_info_intel
> >> {
> >> 	__u32 inv_type;
> >> 	...
> >> 	__u64 flags;
> >> 	...
> >> 	__u8 mip;
> >> 	__u16 pfsid;
> >> };
> >>
> >>> Although your proposal is for userspace driver SVM usage while mine
> >>> is for  SVM usage in virtual machine, there should be a chance to
> >>> make the channel meet our request. And I think it would be more acceptable.
> >>> So I'd like to see your comments if we define the channel as following definition.
> >>> If any better solution, pls feel free let me know.
> >>>
> >>> struct vfio_device_svm {
> >>>        __u32   argsz;
> >>> #define VFIO_SVM_BIND_PASIDTP           (1 << 0)
> >>
> >> To check we're on the same page: the absence of BIND_PASIDTP flag
> >> would mean "bind a single PASID" and in that case, data[] would be a "u32 pasid"?
> >
> > Actually, I planned to use a single channel for both guest PASID table
> > ptr passdown and invalidation info passdown. So it is defined in this way.
> >
> > VFIO_SVM_BIND_PASIDTP   -> data[] includes guest PASID table ptr and table size
> > VFIO_SVM_PASSDOWN_INVALIDATE    -> data[] includes infos for invalidataion
> >
> > Now, we want to have it shared by different vendors. So I would remove
> > invalidate definition from it. Regards to your example, yes it would
> > be a "u32 pasid" if you are passing a PASID value from guest. I think we are on the
> same page for the usage?
> >
> 
> Yes, that seems sensible. I could add an explicit VFIO_BIND_PASID flags to make it
> explicit that data[] is "u32 pasid" and avoid having any default.

Add it in the comment I suppose. The length is 4 byes, it could be deduced from argsz.

> 
> >>
> >>> #define VFIO_SVM_PASSDOWN_INVALIDATE    (1 << 1)
> >>
> >> Using the vfio_device_svm structure for invalidate operations is a
> >> bit odd, it might be nicer to add a new VFIO_SVM_INVALIDATE ioctl,
> >> that takes the above iommu_svm_tlb_invalidate_info as argument (with
> >> an added argsz.)
> >
> > Agree, would add a separate IOCTL for invalidation.
> >
> >>
> >>> #define VFIO_SVM_PASID_RELEASE_FLUSHED	(1 << 2)
> >>> #define VFIO_SVM_PASID_RELEASE_CLEAN	  (1 << 3)
> >>>        __u32   flags;
> >>>        __u32   length;
> >>
> >> If length is the size of data[], I guess we can already deduce this info from argsz.
> >
> > yes, it is size of data. Maybe remove argsz. How about your opinion?
> 
> Argsz as first argument is standard across all VFIO ioctls, it allows the kernel to check
> that the structure passed by the guest is consistent with the structure it knows (see
> the comment at the beginning of
> include/uapi/linux/vfio.h) So argsz would be the preferred way to communicate the
> size of data[]

yes, it makes sense. BTW. I think it is still possible to leave the "length" since there is
similar usage. e.g. struct vfio_irq_set, count tells the size of data[] in that structure.
Anyhow, it's no big deal here.

> >>>        __u8    data[];
> >>> };
> >>
> >> In general, I think that your proposal would work fine along mine.
> >> Note that for my next version of this patch, I would like to move the
> >> BIND/UNBIND SVM operations onto a VFIO container fd, instead of a VFIO device
> fd, if possible.
> >
> > Attach the BIND/UNBIND operation onto VFIO container fd is practical.
> >
> > BTW. Before you send out your next version, we'd better have a
> > consensus on the vfio_device_svm definition. So that we can continue to drive our
> own work separately.
> >
> >> ---
> >> As an aside, it also aligns with the one I'm working on at the
> >> moment, for virtual SVM at a finer granularity, where the BIND call
> >> is for a page table. I would add this flag to vfio_device_svm:
> >>
> >> #define VFIO_SVM_BIND_PGTABLE		(1 << x)
> >
> > Sure. I think it may also be a requirement from other vendors. I think
> > you've mentioned it in the above statements.
> >
> >> Which would indicate the following data (just a draft, I'm
> >> oscillating between this and a generic PASID table solution, which would instead
> reuse your proposal):
> >>
> >> struct pgtable_info {
> >> 	__u32 pasid;
> >>
> >> 	__u64 pgd;
> >>
> >> 	__u32 model;
> >> 	__u32 variant;
> >>
> >> 	__u8 opaque[];
> >> };

I think you would have this definition as an ARM specific one. Would be filled in
the data[] of vfio_device_svm. Is it?

> >>
> >> On ARM SMMU we would need to specify an io-pgtable variant and the
> >> opaque structure would be bits of io_pgtable_cfg (tcr, mair, etc.)
> >>
> >> The problem of portability is slightly different here, because while
> >> PASID table format is specific to an IOMMU, page table format might
> >> be the same across multiple flavors of IOMMUs. For instance, the
> >> PASID table format I use in this series can only be found in the ARM
> >> SMMUv3 architecture, but the page table format is the same as ARM
> >> SMMUv2, and other MMUs. I'd like to implement an IOMMU independent of
> >> any page-table formats, that could use whatever the host offers (not
> >> necessarily MMU PT). The model numbers described above wouldn't be suitable,
> so we'd need another set of magic numbers for page table formats.
> >
> > Not sure if I totally got your points. We may assume PASID table
> > format differs from vendor to vendor. For the page table format, I
> > assume you mean the page table of process. Or in other words MMU page table.
> >
> > I'm a little bit confused about the following statements. Could you speak a little bit
> more?
> > Is it for virtual SVM or user-space driver SVM usage?
> > " I'd like to implement an IOMMU independent of any page-table
> > formats, that could use whatever the host offers (not necessarily MMU PT)."
> 
> The IOMMU might be using a different page table format than the MMU. I'm
> currently toying with the idea of having a portable paravirtualized IOMMU that can
> query the page table format used by pIOMMU, and adopt it if the page table
> handling code is available in the guest. This would be for non-SVM nested translation,
> using VFIO to bind page tables private to the IOMMU. I'll send a separate RFC to
> discuss this.

sounds interesting. Look forward to your further work.

> 
> > I's nice to have such discussion. Let's co-work and have a well-defined API.
> 
> I agree. I don't plan to send a new version immediately since it will take some time
> rework, but we can sync-up here or in private to avoid conflicting proposals.

yes, let's keep in touch. For "struct vfio_device_svm", how about define it in such way:

struct vfio_device_svm {
       __u32   argsz;
     /* data length would be sizeof(pasid_table_info) */
#define VFIO_SVM_BIND_PASIDTP                            (1 << 0) 
     /* data length would be 4 byte */
#define VFIO_BIND_PASID                                           (1 << 1) 
     /* data length would be sizeof(x) */
#define VFIO_SVM_PASID_RELEASE_FLUSHED	 (1 << 2) 
     /* data length would be sizeof(y) */
#define VFIO_SVM_PASID_RELEASE_CLEAN	 (1 << 3) 
    /* data length would be sizeof(pgtable_info) */
#define VFIO_SVM_BIND_PGTABLE		 (1 << 4) 
       __u32   flags;
       __u32   length;  /*can remove it if you think it's better */
       __u8    data[];   /* payload data for different bind/unbind request*/
};

struct pasid_table_info {
               __u16 sid;             /* It's BDF of device. */
	__u64 ptr;            /* PASID table ptr */
	__u64 size;	/* PASID table size in bytes */
	__u32 model;	/* magic number */
	__u32 variant;	/* YiL: maybe move it to opaque? */
	__u8 opaque[];	/* IOMMU-specific details */
};

I added a new field as "sid". The reason is as below:
Since the IOCTL is going to be on container fd, then needs to tell where does
the bind request come from. SID could be used to filter the target core IOMMU.
Also, it's needed to find the corresponding context entry with the SID on Intel
platform. I assume you also need an equivalent on your work. Pls let me know
if you prefer different definition.

Thanks,
Yi L
Jean-Philippe Brucker March 27, 2017, 10:13 a.m. UTC | #9
On 24/03/17 07:46, Liu, Yi L wrote:
[...]
>>>>
>>>> So we need some kind of high-level classification that the vIOMMU
>>>> must communicate to the physical one. Each IOMMU flavor would get a
>>>> unique, global identifier, simply to make sure that vIOMMU and pIOMMU speak
>> the same language.
>>>> For example:
>>>>
>>>> 0x65776886 "AMDV" AMD IOMMU
>>>> 0x73788476 "INTL" Intel IOMMU
>>>> 0x83515748 "S390" s390 IOMMU
>>>> 0x83777785 "SMMU" ARM SMMU
>>>> etc.
>>>>
>>>> It needs to be a global magic number that everyone can recognize.
>>>> Could be as simple as 32-bit numbers allocated from 0. Once we have a
>>>> global magic number, we can use it to differentiate architecture-specific details.
> 
> I prefer simple numbers to stand for each vendor.

Sure, I don't have any preference. Simple numbers could be easier to allocate.

>>> I may need to think more on this part.
>>>
>>>> struct pasid_table_info {
>>>> 	__u64 ptr;
>>>> 	__u64 size;		/* Is it number of entry or size in
>>>> 				   bytes? */
>>>
>>> For Intel platform, it's encoded. But I can make it in bytes. Here,
>>> I'd like to check with you if whole guest PASID info is also needed on ARM?
>>
>> It will be needed on ARM if someone ever emulates the SMMU with SVM.
>> Though I'm not planning on doing that myself, it is unavoidable. And it would be a
>> shame for the next SVM virtualization solution to have to introduce a new flag
>> "VFIO_SVM_BIND_PASIDPT_2" if they could reuse most of the BIND_PASIDPT
>> interface but simply needed to add one or two configuration fields specific to their
>> IOMMU.
> 
> So you are totally fine with putting PASID table ptr and size in the generic part? Maybe
> we have different usage for it. For me, it's a guest PASID table ptr. For you, it may be
> different.

It's the same for SMMU, with some added format specifiers that would go in
'opaque[]'. I think that table pointer and size (in bytes, or number of
entries) is generic enough for a "bind table" call and can be reused by
future implementations.

>>>>
>>>> 	__u32 model;		/* magic number */
>>>> 	__u32 variant;		/* version of the IOMMU architecture,
>>>> 				   maybe? IOMMU-specific. */
> 
> For variant, it will be combined with model to do sanity check. Am I right?
> Maybe it could be moved to opaque.

Yes I guess it could be moved to opaque. It would be a version of the
model used, so we wouldn't have to allocate a new model number whenever an
architecture updates the fields of its PASID descriptors, but we can let
IOMMU drivers decide if they need it and what to put in there.

>>>> 	__u8 opaque[];		/* IOMMU-specific details */
>>>> };
>>>>
[...]
>>
>> Yes, that seems sensible. I could add an explicit VFIO_BIND_PASID flags to make it
>> explicit that data[] is "u32 pasid" and avoid having any default.
> 
> Add it in the comment I suppose. The length is 4 byes, it could be deduced from argsz.
> 
>>
>>>>
>>>>> #define VFIO_SVM_PASSDOWN_INVALIDATE    (1 << 1)
>>>>
>>>> Using the vfio_device_svm structure for invalidate operations is a
>>>> bit odd, it might be nicer to add a new VFIO_SVM_INVALIDATE ioctl,
>>>> that takes the above iommu_svm_tlb_invalidate_info as argument (with
>>>> an added argsz.)
>>>
>>> Agree, would add a separate IOCTL for invalidation.
>>>
>>>>
>>>>> #define VFIO_SVM_PASID_RELEASE_FLUSHED	(1 << 2)
>>>>> #define VFIO_SVM_PASID_RELEASE_CLEAN	  (1 << 3)
>>>>>        __u32   flags;
>>>>>        __u32   length;
>>>>
>>>> If length is the size of data[], I guess we can already deduce this info from argsz.
>>>
>>> yes, it is size of data. Maybe remove argsz. How about your opinion?
>>
>> Argsz as first argument is standard across all VFIO ioctls, it allows the kernel to check
>> that the structure passed by the guest is consistent with the structure it knows (see
>> the comment at the beginning of
>> include/uapi/linux/vfio.h) So argsz would be the preferred way to communicate the
>> size of data[]
> 
> yes, it makes sense. BTW. I think it is still possible to leave the "length" since there is
> similar usage. e.g. struct vfio_irq_set, count tells the size of data[] in that structure.
> Anyhow, it's no big deal here.
> 
>>>>>        __u8    data[];
>>>>> };
>>>>
>>>> In general, I think that your proposal would work fine along mine.
>>>> Note that for my next version of this patch, I would like to move the
>>>> BIND/UNBIND SVM operations onto a VFIO container fd, instead of a VFIO device
>> fd, if possible.
>>>
>>> Attach the BIND/UNBIND operation onto VFIO container fd is practical.
>>>
>>> BTW. Before you send out your next version, we'd better have a
>>> consensus on the vfio_device_svm definition. So that we can continue to drive our
>> own work separately.
>>>
>>>> ---
>>>> As an aside, it also aligns with the one I'm working on at the
>>>> moment, for virtual SVM at a finer granularity, where the BIND call
>>>> is for a page table. I would add this flag to vfio_device_svm:
>>>>
>>>> #define VFIO_SVM_BIND_PGTABLE		(1 << x)
>>>
>>> Sure. I think it may also be a requirement from other vendors. I think
>>> you've mentioned it in the above statements.
>>>
>>>> Which would indicate the following data (just a draft, I'm
>>>> oscillating between this and a generic PASID table solution, which would instead
>> reuse your proposal):
>>>>
>>>> struct pgtable_info {
>>>> 	__u32 pasid;
>>>>
>>>> 	__u64 pgd;
>>>>
>>>> 	__u32 model;
>>>> 	__u32 variant;
>>>>
>>>> 	__u8 opaque[];
>>>> };
> 
> I think you would have this definition as an ARM specific one. Would be filled in
> the data[] of vfio_device_svm. Is it?

Yes it would be another kind of data[] for vfio_device_svm.

>>>>
>>>> On ARM SMMU we would need to specify an io-pgtable variant and the
>>>> opaque structure would be bits of io_pgtable_cfg (tcr, mair, etc.)
>>>>
>>>> The problem of portability is slightly different here, because while
>>>> PASID table format is specific to an IOMMU, page table format might
>>>> be the same across multiple flavors of IOMMUs. For instance, the
>>>> PASID table format I use in this series can only be found in the ARM
>>>> SMMUv3 architecture, but the page table format is the same as ARM
>>>> SMMUv2, and other MMUs. I'd like to implement an IOMMU independent of
>>>> any page-table formats, that could use whatever the host offers (not
>>>> necessarily MMU PT). The model numbers described above wouldn't be suitable,
>> so we'd need another set of magic numbers for page table formats.
>>>
>>> Not sure if I totally got your points. We may assume PASID table
>>> format differs from vendor to vendor. For the page table format, I
>>> assume you mean the page table of process. Or in other words MMU page table.
>>>
>>> I'm a little bit confused about the following statements. Could you speak a little bit
>> more?
>>> Is it for virtual SVM or user-space driver SVM usage?
>>> " I'd like to implement an IOMMU independent of any page-table
>>> formats, that could use whatever the host offers (not necessarily MMU PT)."
>>
>> The IOMMU might be using a different page table format than the MMU. I'm
>> currently toying with the idea of having a portable paravirtualized IOMMU that can
>> query the page table format used by pIOMMU, and adopt it if the page table
>> handling code is available in the guest. This would be for non-SVM nested translation,
>> using VFIO to bind page tables private to the IOMMU. I'll send a separate RFC to
>> discuss this.
> 
> sounds interesting. Look forward to your further work.
> 
>>
>>> I's nice to have such discussion. Let's co-work and have a well-defined API.
>>
>> I agree. I don't plan to send a new version immediately since it will take some time
>> rework, but we can sync-up here or in private to avoid conflicting proposals.
> 
> yes, let's keep in touch. For "struct vfio_device_svm", how about define it in such way:
> 
> struct vfio_device_svm {
>        __u32   argsz;
>      /* data length would be sizeof(pasid_table_info) */
> #define VFIO_SVM_BIND_PASIDTP                            (1 << 0) 
>      /* data length would be 4 byte */
> #define VFIO_BIND_PASID                                           (1 << 1) 
>      /* data length would be sizeof(x) */
> #define VFIO_SVM_PASID_RELEASE_FLUSHED	 (1 << 2) 
>      /* data length would be sizeof(y) */
> #define VFIO_SVM_PASID_RELEASE_CLEAN	 (1 << 3) 

(Note that I'll probably drop these two flags in next version.)

>     /* data length would be sizeof(pgtable_info) */
> #define VFIO_SVM_BIND_PGTABLE		 (1 << 4) 
>        __u32   flags;
>        __u32   length;  /*can remove it if you think it's better */
>        __u8    data[];   /* payload data for different bind/unbind request*/
> };
> 
> struct pasid_table_info {
>                __u16 sid;             /* It's BDF of device. */
> 	__u64 ptr;            /* PASID table ptr */
> 	__u64 size;	/* PASID table size in bytes */
> 	__u32 model;	/* magic number */
> 	__u32 variant;	/* YiL: maybe move it to opaque? */

Yes, I think it could be in opaque, not all formats would need that. ptr,
size, model and opaque seem sufficient.

> 	__u8 opaque[];	/* IOMMU-specific details */
> };
> 
> I added a new field as "sid". The reason is as below:
> Since the IOCTL is going to be on container fd, then needs to tell where does
> the bind request come from. SID could be used to filter the target core IOMMU.
> Also, it's needed to find the corresponding context entry with the SID on Intel
> platform. I assume you also need an equivalent on your work. Pls let me know
> if you prefer different definition.

Actually moving the ioctl to the container changes the BIND model: the
PASID table would be attached to all devices in the container, so there
shouldn't be any device selector in the ioctl. I think it is more in line
with the existing MAP/UNMAP API, where a container represents an address
space, and all devices in the container access the same virtual addresses.
It adds complexity on the driver side, but seems more clear from the user
point of view.

Thanks,
Jean-Philippe
Yi Liu March 29, 2017, 6:17 a.m. UTC | #10
> -----Original Message-----
> From: Jean-Philippe Brucker [mailto:jean-philippe.brucker@arm.com]
> Sent: Monday, March 27, 2017 6:14 PM
> To: Liu, Yi L <yi.l.liu@intel.com>; Alex Williamson <alex.williamson@redhat.com>
> Cc: Shanker Donthineni <shankerd@qti.qualcomm.com>; kvm@vger.kernel.org;
> Catalin Marinas <catalin.marinas@arm.com>; Sinan Kaya
> <okaya@qti.qualcomm.com>; Will Deacon <will.deacon@arm.com>;
> iommu@lists.linux-foundation.org; Harv Abdulhamid <harba@qti.qualcomm.com>;
> linux-pci@vger.kernel.org; Bjorn Helgaas <bhelgaas@google.com>; David
> Woodhouse <dwmw2@infradead.org>; linux-arm-kernel@lists.infradead.org; Nate
> Watterson <nwatters@qti.qualcomm.com>; Tian, Kevin <kevin.tian@intel.com>;
> Lan, Tianyu <tianyu.lan@intel.com>; Raj, Ashok <ashok.raj@intel.com>; Pan, Jacob
> jun <jacob.jun.pan@intel.com>; Joerg Roedel <joro@8bytes.org>; Robin Murphy
> <robin.murphy@arm.com>
> Subject: Re: [RFC PATCH 29/30] vfio: Add support for Shared Virtual Memory
> 
> On 24/03/17 07:46, Liu, Yi L wrote:
> [...]
> >>>>
> >>>> So we need some kind of high-level classification that the vIOMMU
> >>>> must communicate to the physical one. Each IOMMU flavor would get a
> >>>> unique, global identifier, simply to make sure that vIOMMU and
> >>>> pIOMMU speak
> >> the same language.
> >>>> For example:
> >>>>
> >>>> 0x65776886 "AMDV" AMD IOMMU
> >>>> 0x73788476 "INTL" Intel IOMMU
> >>>> 0x83515748 "S390" s390 IOMMU
> >>>> 0x83777785 "SMMU" ARM SMMU
> >>>> etc.
> >>>>
> >>>> It needs to be a global magic number that everyone can recognize.
> >>>> Could be as simple as 32-bit numbers allocated from 0. Once we have
> >>>> a global magic number, we can use it to differentiate architecture-specific
> details.
> >
> > I prefer simple numbers to stand for each vendor.
> 
> Sure, I don't have any preference. Simple numbers could be easier to allocate.
> 
> >>> I may need to think more on this part.
> >>>
> >>>> struct pasid_table_info {
> >>>> 	__u64 ptr;
> >>>> 	__u64 size;		/* Is it number of entry or size in
> >>>> 				   bytes? */
> >>>
> >>> For Intel platform, it's encoded. But I can make it in bytes. Here,
> >>> I'd like to check with you if whole guest PASID info is also needed on ARM?
> >>
> >> It will be needed on ARM if someone ever emulates the SMMU with SVM.
> >> Though I'm not planning on doing that myself, it is unavoidable. And
> >> it would be a shame for the next SVM virtualization solution to have
> >> to introduce a new flag "VFIO_SVM_BIND_PASIDPT_2" if they could reuse
> >> most of the BIND_PASIDPT interface but simply needed to add one or
> >> two configuration fields specific to their IOMMU.
> >
> > So you are totally fine with putting PASID table ptr and size in the
> > generic part? Maybe we have different usage for it. For me, it's a
> > guest PASID table ptr. For you, it may be different.
> 
> It's the same for SMMU, with some added format specifiers that would go in
> 'opaque[]'. I think that table pointer and size (in bytes, or number of
> entries) is generic enough for a "bind table" call and can be reused by future
> implementations.
> 
> >>>>
> >>>> 	__u32 model;		/* magic number */
> >>>> 	__u32 variant;		/* version of the IOMMU architecture,
> >>>> 				   maybe? IOMMU-specific. */
> >
> > For variant, it will be combined with model to do sanity check. Am I right?
> > Maybe it could be moved to opaque.
> 
> Yes I guess it could be moved to opaque. It would be a version of the model used, so
> we wouldn't have to allocate a new model number whenever an architecture
> updates the fields of its PASID descriptors, but we can let IOMMU drivers decide if
> they need it and what to put in there.
> 
> >>>> 	__u8 opaque[];		/* IOMMU-specific details */
> >>>> };
> >>>>
> [...]
> >>
> >> Yes, that seems sensible. I could add an explicit VFIO_BIND_PASID
> >> flags to make it explicit that data[] is "u32 pasid" and avoid having any default.
> >
> > Add it in the comment I suppose. The length is 4 byes, it could be deduced from
> argsz.
> >
> >>
> >>>>
> >>>>> #define VFIO_SVM_PASSDOWN_INVALIDATE    (1 << 1)
> >>>>
> >>>> Using the vfio_device_svm structure for invalidate operations is a
> >>>> bit odd, it might be nicer to add a new VFIO_SVM_INVALIDATE ioctl,
> >>>> that takes the above iommu_svm_tlb_invalidate_info as argument
> >>>> (with an added argsz.)
> >>>
> >>> Agree, would add a separate IOCTL for invalidation.
> >>>
> >>>>
> >>>>> #define VFIO_SVM_PASID_RELEASE_FLUSHED	(1 << 2)
> >>>>> #define VFIO_SVM_PASID_RELEASE_CLEAN	  (1 << 3)
> >>>>>        __u32   flags;
> >>>>>        __u32   length;
> >>>>
> >>>> If length is the size of data[], I guess we can already deduce this info from argsz.
> >>>
> >>> yes, it is size of data. Maybe remove argsz. How about your opinion?
> >>
> >> Argsz as first argument is standard across all VFIO ioctls, it allows
> >> the kernel to check that the structure passed by the guest is
> >> consistent with the structure it knows (see the comment at the
> >> beginning of
> >> include/uapi/linux/vfio.h) So argsz would be the preferred way to
> >> communicate the size of data[]
> >
> > yes, it makes sense. BTW. I think it is still possible to leave the
> > "length" since there is similar usage. e.g. struct vfio_irq_set, count tells the size of
> data[] in that structure.
> > Anyhow, it's no big deal here.
> >
> >>>>>        __u8    data[];
> >>>>> };
> >>>>
> >>>> In general, I think that your proposal would work fine along mine.
> >>>> Note that for my next version of this patch, I would like to move
> >>>> the BIND/UNBIND SVM operations onto a VFIO container fd, instead of
> >>>> a VFIO device
> >> fd, if possible.
> >>>
> >>> Attach the BIND/UNBIND operation onto VFIO container fd is practical.
> >>>
> >>> BTW. Before you send out your next version, we'd better have a
> >>> consensus on the vfio_device_svm definition. So that we can continue
> >>> to drive our
> >> own work separately.
> >>>
> >>>> ---
> >>>> As an aside, it also aligns with the one I'm working on at the
> >>>> moment, for virtual SVM at a finer granularity, where the BIND call
> >>>> is for a page table. I would add this flag to vfio_device_svm:
> >>>>
> >>>> #define VFIO_SVM_BIND_PGTABLE		(1 << x)
> >>>
> >>> Sure. I think it may also be a requirement from other vendors. I
> >>> think you've mentioned it in the above statements.
> >>>
> >>>> Which would indicate the following data (just a draft, I'm
> >>>> oscillating between this and a generic PASID table solution, which
> >>>> would instead
> >> reuse your proposal):
> >>>>
> >>>> struct pgtable_info {
> >>>> 	__u32 pasid;
> >>>>
> >>>> 	__u64 pgd;
> >>>>
> >>>> 	__u32 model;
> >>>> 	__u32 variant;
> >>>>
> >>>> 	__u8 opaque[];
> >>>> };
> >
> > I think you would have this definition as an ARM specific one. Would
> > be filled in the data[] of vfio_device_svm. Is it?
> 
> Yes it would be another kind of data[] for vfio_device_svm.
> 
> >>>>
> >>>> On ARM SMMU we would need to specify an io-pgtable variant and the
> >>>> opaque structure would be bits of io_pgtable_cfg (tcr, mair, etc.)
> >>>>
> >>>> The problem of portability is slightly different here, because
> >>>> while PASID table format is specific to an IOMMU, page table format
> >>>> might be the same across multiple flavors of IOMMUs. For instance,
> >>>> the PASID table format I use in this series can only be found in
> >>>> the ARM
> >>>> SMMUv3 architecture, but the page table format is the same as ARM
> >>>> SMMUv2, and other MMUs. I'd like to implement an IOMMU independent
> >>>> of any page-table formats, that could use whatever the host offers
> >>>> (not necessarily MMU PT). The model numbers described above
> >>>> wouldn't be suitable,
> >> so we'd need another set of magic numbers for page table formats.
> >>>
> >>> Not sure if I totally got your points. We may assume PASID table
> >>> format differs from vendor to vendor. For the page table format, I
> >>> assume you mean the page table of process. Or in other words MMU page table.
> >>>
> >>> I'm a little bit confused about the following statements. Could you
> >>> speak a little bit
> >> more?
> >>> Is it for virtual SVM or user-space driver SVM usage?
> >>> " I'd like to implement an IOMMU independent of any page-table
> >>> formats, that could use whatever the host offers (not necessarily MMU PT)."
> >>
> >> The IOMMU might be using a different page table format than the MMU.
> >> I'm currently toying with the idea of having a portable
> >> paravirtualized IOMMU that can query the page table format used by
> >> pIOMMU, and adopt it if the page table handling code is available in
> >> the guest. This would be for non-SVM nested translation, using VFIO
> >> to bind page tables private to the IOMMU. I'll send a separate RFC to discuss this.
> >
> > sounds interesting. Look forward to your further work.
> >
> >>
> >>> I's nice to have such discussion. Let's co-work and have a well-defined API.
> >>
> >> I agree. I don't plan to send a new version immediately since it will
> >> take some time rework, but we can sync-up here or in private to avoid conflicting
> proposals.
> >
> > yes, let's keep in touch. For "struct vfio_device_svm", how about define it in such
> way:
> >
> > struct vfio_device_svm {
> >        __u32   argsz;
> >      /* data length would be sizeof(pasid_table_info) */
> > #define VFIO_SVM_BIND_PASIDTP                            (1 << 0)
> >      /* data length would be 4 byte */
> > #define VFIO_BIND_PASID                                           (1 << 1)
> >      /* data length would be sizeof(x) */
> > #define VFIO_SVM_PASID_RELEASE_FLUSHED	 (1 << 2)
> >      /* data length would be sizeof(y) */
> > #define VFIO_SVM_PASID_RELEASE_CLEAN	 (1 << 3)
> 
> (Note that I'll probably drop these two flags in next version.)
> 
> >     /* data length would be sizeof(pgtable_info) */
> > #define VFIO_SVM_BIND_PGTABLE		 (1 << 4)
> >        __u32   flags;
> >        __u32   length;  /*can remove it if you think it's better */
> >        __u8    data[];   /* payload data for different bind/unbind request*/
> > };
> >
> > struct pasid_table_info {
> >                __u16 sid;             /* It's BDF of device. */
> > 	__u64 ptr;            /* PASID table ptr */
> > 	__u64 size;	/* PASID table size in bytes */
> > 	__u32 model;	/* magic number */
> > 	__u32 variant;	/* YiL: maybe move it to opaque? */
> 
> Yes, I think it could be in opaque, not all formats would need that. ptr, size, model
> and opaque seem sufficient.
> 
> > 	__u8 opaque[];	/* IOMMU-specific details */
> > };
> >
> > I added a new field as "sid". The reason is as below:
> > Since the IOCTL is going to be on container fd, then needs to tell
> > where does the bind request come from. SID could be used to filter the target core
> IOMMU.
> > Also, it's needed to find the corresponding context entry with the SID
> > on Intel platform. I assume you also need an equivalent on your work.
> > Pls let me know if you prefer different definition.
> 
> Actually moving the ioctl to the container changes the BIND model: the PASID table
> would be attached to all devices in the container, so there shouldn't be any device
> selector in the ioctl. I think it is more in line with the existing MAP/UNMAP API,
> where a container represents an address space, and all devices in the container
> access the same virtual addresses.
> It adds complexity on the driver side, but seems more clear from the user point of
> view.

Hi Jean,

I do need a device selector here. Bind whole guest PASID table in host thus it needs 
to modify a context entry which is per-device. In other words, it is to replace the PASID
table in host instead of modifying some entries. Maybe I can move it to opaque. Then
it won't bother other user. Since my op is actually per-device, so it may be an open
if I need to place IOCTL in vfio-pci. Anyhow, iommu ops usage mainly happen in type1,
I don't want to break the rule so far. So I would go on with the consensus we had here.

Regards,
Yi L
Tomasz Nowicki April 26, 2017, 6:53 a.m. UTC | #11
Hi Jean,

On 27.02.2017 20:54, Jean-Philippe Brucker wrote:
> Add two new ioctl for VFIO devices. VFIO_DEVICE_BIND_TASK creates a bond
> between a device and a process address space, identified by a
> device-specific ID named PASID. This allows the device to target DMA
> transactions at the process virtual addresses without a need for mapping
> and unmapping buffers explicitly in the IOMMU. The process page tables are
> shared with the IOMMU, and mechanisms such as PCI ATS/PRI may be used to
> handle faults. VFIO_DEVICE_UNBIND_TASK removed a bond identified by a
> PASID.
>
> Also add a capability flag in device info to detect whether the system and
> the device support SVM.
>
> Users need to specify the state of a PASID when unbinding, with flags
> VFIO_PASID_RELEASE_FLUSHED and VFIO_PASID_RELEASE_CLEAN. Even for PCI,
> PASID invalidation is specific to each device and only partially covered
> by the specification:
>
> * Device must have an implementation-defined mechanism for stopping the
>   use of a PASID. When this mechanism finishes, the device has stopped
>   issuing transactions for this PASID and all transactions for this PASID
>   have been flushed to the IOMMU.
>
> * Device may either wait for all outstanding PRI requests for this PASID
>   to finish, or issue a Stop Marker message, a barrier that separates PRI
>   requests affecting this instance of the PASID from PRI requests
>   affecting the next instance. In the first case, we say that the PASID is
>   "clean", in the second case it is "flushed" (and the IOMMU has to wait
>   for the Stop Marker before reassigning the PASID.)
>
> We expect similar distinctions for platform devices. Ideally there should
> be a callback for each PCI device, allowing the IOMMU to ask the device to
> stop using a PASID. When the callback returns, the PASID is either flushed
> or clean and the return value tells which.
>
> For the moment I don't know how to implement this callback for PCI, so if
> the user forgets to call unbind with either "clean" or "flushed", the
> PASID is never reused. For platform devices, it might be simpler to
> implement since we could associate an invalidate_pasid callback to a DT
> compatible string, as is currently done for reset.
>
> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> ---
>  drivers/vfio/pci/vfio_pci.c |  24 ++++++++++
>  drivers/vfio/vfio.c         | 104 ++++++++++++++++++++++++++++++++++++++++++++
>  include/uapi/linux/vfio.h   |  55 +++++++++++++++++++++++
>  3 files changed, 183 insertions(+)
>
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 324c52e3a1a4..3d7733f94891 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -22,6 +22,7 @@
>  #include <linux/mutex.h>
>  #include <linux/notifier.h>
>  #include <linux/pci.h>
> +#include <linux/pci-ats.h>
>  #include <linux/pm_runtime.h>
>  #include <linux/slab.h>
>  #include <linux/types.h>
> @@ -623,6 +624,26 @@ int vfio_pci_register_dev_region(struct vfio_pci_device *vdev,
>  	return 0;
>  }
>

[...]

>
>  	kfree(device);
> @@ -1622,6 +1651,75 @@ static int vfio_device_fops_release(struct inode *inode, struct file *filep)
>  	return 0;
>  }
>
> +static long vfio_svm_ioctl(struct vfio_device *device, unsigned int cmd,
> +			   unsigned long arg)
> +{
> +	int ret;
> +	unsigned long minsz;
> +
> +	struct vfio_device_svm svm;
> +	struct vfio_task *vfio_task;
> +
> +	minsz = offsetofend(struct vfio_device_svm, pasid);
> +
> +	if (copy_from_user(&svm, (void __user *)arg, minsz))
> +		return -EFAULT;
> +
> +	if (svm.argsz < minsz)
> +		return -EINVAL;
> +
> +	if (cmd == VFIO_DEVICE_BIND_TASK) {
> +		struct task_struct *task = current;
> +
> +		ret = iommu_bind_task(device->dev, task, &svm.pasid, 0, NULL);
> +		if (ret)
> +			return ret;
> +
> +		vfio_task = kzalloc(sizeof(*vfio_task), GFP_KERNEL);
> +		if (!vfio_task) {
> +			iommu_unbind_task(device->dev, svm.pasid,
> +					  IOMMU_PASID_CLEAN);
> +			return -ENOMEM;
> +		}
> +
> +		vfio_task->pasid = svm.pasid;
> +
> +		mutex_lock(&device->tasks_lock);
> +		list_add(&vfio_task->list, &device->tasks);
> +		mutex_unlock(&device->tasks_lock);
> +
> +	} else {
> +		int flags = 0;
> +
> +		if (svm.flags & ~(VFIO_SVM_PASID_RELEASE_FLUSHED |
> +				  VFIO_SVM_PASID_RELEASE_CLEAN))
> +			return -EINVAL;
> +
> +		if (svm.flags & VFIO_SVM_PASID_RELEASE_FLUSHED)
> +			flags = IOMMU_PASID_FLUSHED;
> +		else if (svm.flags & VFIO_SVM_PASID_RELEASE_CLEAN)
> +			flags = IOMMU_PASID_CLEAN;
> +
> +		mutex_lock(&device->tasks_lock);
> +		list_for_each_entry(vfio_task, &device->tasks, list) {
> +			if (vfio_task->pasid != svm.pasid)
> +				continue;
> +
> +			ret = iommu_unbind_task(device->dev, svm.pasid, flags);
> +			if (ret)
> +				dev_warn(device->dev, "failed to unbind PASID %u\n",
> +					 vfio_task->pasid);
> +
> +			list_del(&vfio_task->list);
> +			kfree(vfio_task);

Please use list_for_each_entry_safe.

Thanks,
Tomasz
Jean-Philippe Brucker April 26, 2017, 10:08 a.m. UTC | #12
On 26/04/17 07:53, Tomasz Nowicki wrote:
>> +        mutex_lock(&device->tasks_lock);
>> +        list_for_each_entry(vfio_task, &device->tasks, list) {
>> +            if (vfio_task->pasid != svm.pasid)
>> +                continue;
>> +
>> +            ret = iommu_unbind_task(device->dev, svm.pasid, flags);
>> +            if (ret)
>> +                dev_warn(device->dev, "failed to unbind PASID %u\n",
>> +                     vfio_task->pasid);
>> +
>> +            list_del(&vfio_task->list);
>> +            kfree(vfio_task);
> 
> Please use list_for_each_entry_safe.

There is:

+            break;

right after kfree, so we'd never follow vfio_task->list.next after freeing
vfio_task. The code searches for the _only_ task matching the PASID,
removes it and leaves the loop.

Thanks,
Jean-Philippe
Tomasz Nowicki April 26, 2017, 11:01 a.m. UTC | #13
On 26.04.2017 12:08, Jean-Philippe Brucker wrote:
> On 26/04/17 07:53, Tomasz Nowicki wrote:
>>> +        mutex_lock(&device->tasks_lock);
>>> +        list_for_each_entry(vfio_task, &device->tasks, list) {
>>> +            if (vfio_task->pasid != svm.pasid)
>>> +                continue;
>>> +
>>> +            ret = iommu_unbind_task(device->dev, svm.pasid, flags);
>>> +            if (ret)
>>> +                dev_warn(device->dev, "failed to unbind PASID %u\n",
>>> +                     vfio_task->pasid);
>>> +
>>> +            list_del(&vfio_task->list);
>>> +            kfree(vfio_task);
>>
>> Please use list_for_each_entry_safe.
>
> There is:
>
> +            break;
>
> right after kfree, so we'd never follow vfio_task->list.next after freeing
> vfio_task. The code searches for the _only_ task matching the PASID,
> removes it and leaves the loop.
>

Ah right. Sorry for the noise.

Tomasz
diff mbox

Patch

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 324c52e3a1a4..3d7733f94891 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -22,6 +22,7 @@ 
 #include <linux/mutex.h>
 #include <linux/notifier.h>
 #include <linux/pci.h>
+#include <linux/pci-ats.h>
 #include <linux/pm_runtime.h>
 #include <linux/slab.h>
 #include <linux/types.h>
@@ -623,6 +624,26 @@  int vfio_pci_register_dev_region(struct vfio_pci_device *vdev,
 	return 0;
 }
 
+static bool vfio_pci_supports_svm(struct vfio_pci_device *vdev)
+{
+	struct pci_dev *pdev = vdev->pdev;
+
+	if (!pdev->ats_enabled)
+		return false;
+
+	if (!pdev->pasid_enabled || pci_max_pasids(pdev) <= 1)
+		return false;
+
+	if (!pdev->pri_enabled)
+		return false;
+
+	/*
+	 * If the IOMMU driver enabled all of these, then it supports PCI SVM
+	 * for this device.
+	 */
+	return true;
+}
+
 static long vfio_pci_ioctl(void *device_data,
 			   unsigned int cmd, unsigned long arg)
 {
@@ -642,6 +663,9 @@  static long vfio_pci_ioctl(void *device_data,
 
 		info.flags = VFIO_DEVICE_FLAGS_PCI;
 
+		if (vfio_pci_supports_svm(vdev))
+			info.flags |= VFIO_DEVICE_FLAGS_SVM;
+
 		if (vdev->reset_works)
 			info.flags |= VFIO_DEVICE_FLAGS_RESET;
 
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 609f4f982c74..c4505d8f4c61 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -97,6 +97,14 @@  struct vfio_device {
 	struct vfio_group		*group;
 	struct list_head		group_next;
 	void				*device_data;
+
+	struct mutex			tasks_lock;
+	struct list_head		tasks;
+};
+
+struct vfio_task {
+	int				pasid;
+	struct list_head		list;
 };
 
 #ifdef CONFIG_VFIO_NOIOMMU
@@ -520,6 +528,9 @@  struct vfio_device *vfio_group_create_device(struct vfio_group *group,
 	device->device_data = device_data;
 	dev_set_drvdata(dev, device);
 
+	mutex_init(&device->tasks_lock);
+	INIT_LIST_HEAD(&device->tasks);
+
 	/* No need to get group_lock, caller has group reference */
 	vfio_group_get(group);
 
@@ -532,6 +543,8 @@  struct vfio_device *vfio_group_create_device(struct vfio_group *group,
 
 static void vfio_device_release(struct kref *kref)
 {
+	int ret;
+	struct vfio_task *tmp, *task;
 	struct vfio_device *device = container_of(kref,
 						  struct vfio_device, kref);
 	struct vfio_group *group = device->group;
@@ -539,6 +552,22 @@  static void vfio_device_release(struct kref *kref)
 	list_del(&device->group_next);
 	mutex_unlock(&group->device_lock);
 
+	mutex_lock(&device->tasks_lock);
+	list_for_each_entry_safe(task, tmp, &device->tasks, list) {
+		/*
+		 * This might leak the PASID, since the IOMMU won't know
+		 * if it is safe to reuse.
+		 */
+		ret = iommu_unbind_task(device->dev, task->pasid, 0);
+		if (ret)
+			dev_warn(device->dev, "failed to unbind PASID %u\n",
+				 task->pasid);
+
+		list_del(&task->list);
+		kfree(task);
+	}
+	mutex_unlock(&device->tasks_lock);
+
 	dev_set_drvdata(device->dev, NULL);
 
 	kfree(device);
@@ -1622,6 +1651,75 @@  static int vfio_device_fops_release(struct inode *inode, struct file *filep)
 	return 0;
 }
 
+static long vfio_svm_ioctl(struct vfio_device *device, unsigned int cmd,
+			   unsigned long arg)
+{
+	int ret;
+	unsigned long minsz;
+
+	struct vfio_device_svm svm;
+	struct vfio_task *vfio_task;
+
+	minsz = offsetofend(struct vfio_device_svm, pasid);
+
+	if (copy_from_user(&svm, (void __user *)arg, minsz))
+		return -EFAULT;
+
+	if (svm.argsz < minsz)
+		return -EINVAL;
+
+	if (cmd == VFIO_DEVICE_BIND_TASK) {
+		struct task_struct *task = current;
+
+		ret = iommu_bind_task(device->dev, task, &svm.pasid, 0, NULL);
+		if (ret)
+			return ret;
+
+		vfio_task = kzalloc(sizeof(*vfio_task), GFP_KERNEL);
+		if (!vfio_task) {
+			iommu_unbind_task(device->dev, svm.pasid,
+					  IOMMU_PASID_CLEAN);
+			return -ENOMEM;
+		}
+
+		vfio_task->pasid = svm.pasid;
+
+		mutex_lock(&device->tasks_lock);
+		list_add(&vfio_task->list, &device->tasks);
+		mutex_unlock(&device->tasks_lock);
+
+	} else {
+		int flags = 0;
+
+		if (svm.flags & ~(VFIO_SVM_PASID_RELEASE_FLUSHED |
+				  VFIO_SVM_PASID_RELEASE_CLEAN))
+			return -EINVAL;
+
+		if (svm.flags & VFIO_SVM_PASID_RELEASE_FLUSHED)
+			flags = IOMMU_PASID_FLUSHED;
+		else if (svm.flags & VFIO_SVM_PASID_RELEASE_CLEAN)
+			flags = IOMMU_PASID_CLEAN;
+
+		mutex_lock(&device->tasks_lock);
+		list_for_each_entry(vfio_task, &device->tasks, list) {
+			if (vfio_task->pasid != svm.pasid)
+				continue;
+
+			ret = iommu_unbind_task(device->dev, svm.pasid, flags);
+			if (ret)
+				dev_warn(device->dev, "failed to unbind PASID %u\n",
+					 vfio_task->pasid);
+
+			list_del(&vfio_task->list);
+			kfree(vfio_task);
+			break;
+		}
+		mutex_unlock(&device->tasks_lock);
+	}
+
+	return copy_to_user((void __user *)arg, &svm, minsz) ? -EFAULT : 0;
+}
+
 static long vfio_device_fops_unl_ioctl(struct file *filep,
 				       unsigned int cmd, unsigned long arg)
 {
@@ -1630,6 +1728,12 @@  static long vfio_device_fops_unl_ioctl(struct file *filep,
 	if (unlikely(!device->ops->ioctl))
 		return -EINVAL;
 
+	switch (cmd) {
+	case VFIO_DEVICE_BIND_TASK:
+	case VFIO_DEVICE_UNBIND_TASK:
+		return vfio_svm_ioctl(device, cmd, arg);
+	}
+
 	return device->ops->ioctl(device->device_data, cmd, arg);
 }
 
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 519eff362c1c..3fe4197a5ea0 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -198,6 +198,7 @@  struct vfio_device_info {
 #define VFIO_DEVICE_FLAGS_PCI	(1 << 1)	/* vfio-pci device */
 #define VFIO_DEVICE_FLAGS_PLATFORM (1 << 2)	/* vfio-platform device */
 #define VFIO_DEVICE_FLAGS_AMBA  (1 << 3)	/* vfio-amba device */
+#define VFIO_DEVICE_FLAGS_SVM	(1 << 4)	/* Device supports bind/unbind */
 	__u32	num_regions;	/* Max region index + 1 */
 	__u32	num_irqs;	/* Max IRQ index + 1 */
 };
@@ -409,6 +410,60 @@  struct vfio_irq_set {
  */
 #define VFIO_DEVICE_RESET		_IO(VFIO_TYPE, VFIO_BASE + 11)
 
+struct vfio_device_svm {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_SVM_PASID_RELEASE_FLUSHED	(1 << 0)
+#define VFIO_SVM_PASID_RELEASE_CLEAN	(1 << 1)
+	__u32	pasid;
+};
+/*
+ * VFIO_DEVICE_BIND_TASK - _IOWR(VFIO_TYPE, VFIO_BASE + 22,
+ *                               struct vfio_device_svm)
+ *
+ * Share a process' virtual address space with the device.
+ *
+ * This feature creates a new address space for the device, which is not
+ * affected by VFIO_IOMMU_MAP/UNMAP_DMA. Instead, the device can tag its DMA
+ * traffic with the given @pasid to perform transactions on the associated
+ * virtual address space. Mapping and unmapping of buffers is performed by
+ * standard functions such as mmap and malloc.
+ *
+ * On success, VFIO writes a Process Address Space ID (PASID) into @pasid. This
+ * ID is unique to a device.
+ *
+ * The bond between device and process must be removed with
+ * VFIO_DEVICE_UNBIND_TASK before exiting.
+ *
+ * On fork, the child inherits the device fd and can use the bonds setup by its
+ * parent. Consequently, the child has R/W access on the address spaces bound by
+ * its parent. After an execv, the device fd is closed and the child doesn't
+ * have access to the address space anymore.
+ *
+ * Availability of this feature depends on the device, its bus, the underlying
+ * IOMMU and the CPU architecture. All of these are guaranteed when the device
+ * has VFIO_DEVICE_FLAGS_SVM flag set.
+ *
+ * returns: 0 on success, -errno on failure.
+ */
+#define VFIO_DEVICE_BIND_TASK	_IO(VFIO_TYPE, VFIO_BASE + 22)
+
+/*
+ * VFIO_DEVICE_UNBIND_TASK - _IOWR(VFIO_TYPE, VFIO_BASE + 23,
+ *                                 struct vfio_device_svm)
+ *
+ * Unbind address space identified by @pasid from device. Device must have
+ * stopped issuing any DMA transaction for the PASID and flushed any reference
+ * to this PASID upstream. Some IOMMUs need to know when a PASID is safe to
+ * reuse, in which case one of the following must be present in @flags
+ *
+ * VFIO_PASID_RELEASE_FLUSHED: the PASID is safe to reassign after the IOMMU
+ *       receives an invalidation message from the device.
+ *
+ * VFIO_PASID_RELEASE_CLEAN: the PASID is safe to reassign immediately.
+ */
+#define VFIO_DEVICE_UNBIND_TASK	_IO(VFIO_TYPE, VFIO_BASE + 23)
+
 /*
  * The VFIO-PCI bus driver makes use of the following fixed region and
  * IRQ index mapping.  Unimplemented regions return a size of zero.