diff mbox series

[v2,10/10] vfio/pci: Add VFIO_DEVICE_GET_PCI_HOT_RESET_GROUP_INFO

Message ID 20230327093458.44939-11-yi.l.liu@intel.com (mailing list archive)
State New, archived
Headers show
Series Introduce new methods for verifying ownership in vfio PCI hot reset | expand

Commit Message

Yi Liu March 27, 2023, 9:34 a.m. UTC
This is a preparation for vfio device cdev as cdev gives userspace the
capability to open device cdev fd and management stack (e.g. libvirt)
could pass the device fd to the actual user (e.g. QEMU). As a result,
the actual user has no idea about the device's bus:devfn information.
This is a problem when user uses VFIO_DEVICE_GET_PCI_HOT_RESET_INFO to
know the hot reset affected device scope as this ioctl returns bus:devfn
info. For the fd passing usage, the acutal user cannot map the bus:devfn
to the devices it has opened via the fd passed from management stack. So
a new ioctl is required.

This new ioctl reports the list of iommufd dev_id that is opened by the
user. If there is affected device that is not bound to vfio driver or
opened by another user, this command shall fail with -EPERM. For the
noiommu mode in the vfio device cdev path, this shall fail as no dev_id
would be generated, hence nothing to report.

This ioctl is useless to the users that open vfio group as such users
have no idea about the iommufd dev_id and it can use the existing
VFIO_DEVICE_GET_PCI_HOT_RESET_INFO. The user that uses the traditional
mode vfio group/container would be failed if invoking this ioctl. But
the user that uses the iommufd compat mode vfio group/container shall
succeed. This is harmless as long as user cannot make use of it and
should use VFIO_DEVICE_GET_PCI_HOT_RESET_INFO.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 98 ++++++++++++++++++++++++++++++++
 include/uapi/linux/vfio.h        | 47 +++++++++++++++
 2 files changed, 145 insertions(+)

Comments

Alex Williamson March 27, 2023, 7:26 p.m. UTC | #1
On Mon, 27 Mar 2023 02:34:58 -0700
Yi Liu <yi.l.liu@intel.com> wrote:

> This is a preparation for vfio device cdev as cdev gives userspace the
> capability to open device cdev fd and management stack (e.g. libvirt)
> could pass the device fd to the actual user (e.g. QEMU). As a result,
> the actual user has no idea about the device's bus:devfn information.
> This is a problem when user uses VFIO_DEVICE_GET_PCI_HOT_RESET_INFO to
> know the hot reset affected device scope as this ioctl returns bus:devfn
> info. For the fd passing usage, the acutal user cannot map the bus:devfn
> to the devices it has opened via the fd passed from management stack. So
> a new ioctl is required.
> 
> This new ioctl reports the list of iommufd dev_id that is opened by the
> user. If there is affected device that is not bound to vfio driver or
> opened by another user, this command shall fail with -EPERM. For the
> noiommu mode in the vfio device cdev path, this shall fail as no dev_id
> would be generated, hence nothing to report.
> 
> This ioctl is useless to the users that open vfio group as such users
> have no idea about the iommufd dev_id and it can use the existing
> VFIO_DEVICE_GET_PCI_HOT_RESET_INFO. The user that uses the traditional
> mode vfio group/container would be failed if invoking this ioctl. But
> the user that uses the iommufd compat mode vfio group/container shall
> succeed. This is harmless as long as user cannot make use of it and
> should use VFIO_DEVICE_GET_PCI_HOT_RESET_INFO.


So VFIO_DEVICE_GET_PCI_HOT_RESET_INFO reports a group and bdf, but
VFIO_DEVICE_GET_PCI_HOT_RESET_*GROUP*_INFO is meant for the non-group,
cdev use case and returns a dev_id rather than a group???

Additionally, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO has a flags arg that
isn't used, why do we need a new ioctl vs defining
VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID.  In fact, we could define
vfio_dependent_device as:

struct vfio_pci_dependent_device {
	union {
	        __u32   group_id;
		__u32	dev_id;
	}
        __u16   segment;
        __u8    bus;
        __u8    devfn;
};

If the user calls with the above flag, dev_id is valid, otherwise
group_id.  Perhaps segment:buus:devfn could still be filled in with a
NULL/invalid dev_id if the user doesn't have permissions for the device
so that debugging from userspace isn't so opaque.  Thanks,

Alex
 
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> ---
>  drivers/vfio/pci/vfio_pci_core.c | 98 ++++++++++++++++++++++++++++++++
>  include/uapi/linux/vfio.h        | 47 +++++++++++++++
>  2 files changed, 145 insertions(+)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 19f5b075d70a..45edf4e9b98b 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -1181,6 +1181,102 @@ static int vfio_pci_ioctl_reset(struct vfio_pci_core_device *vdev,
>  	return ret;
>  }
>  
> +static struct pci_dev *
> +vfio_pci_dev_set_resettable(struct vfio_device_set *dev_set);
> +
> +static int vfio_pci_ioctl_get_pci_hot_reset_group_info(
> +	struct vfio_pci_core_device *vdev,
> +	struct vfio_pci_hot_reset_group_info __user *arg)
> +{
> +	unsigned long minsz =
> +		offsetofend(struct vfio_pci_hot_reset_group_info, count);
> +	struct vfio_pci_hot_reset_group_info hdr;
> +	struct iommufd_ctx *iommufd, *cur_iommufd;
> +	u32 count = 0, index = 0, *devices = NULL;
> +	struct vfio_pci_core_device *cur;
> +	bool slot = false;
> +	int ret = 0;
> +
> +	if (copy_from_user(&hdr, arg, minsz))
> +		return -EFAULT;
> +
> +	if (hdr.argsz < minsz)
> +		return -EINVAL;
> +
> +	hdr.flags = 0;
> +
> +	/* Can we do a slot or bus reset or neither? */
> +	if (!pci_probe_reset_slot(vdev->pdev->slot))
> +		slot = true;
> +	else if (pci_probe_reset_bus(vdev->pdev->bus))
> +		return -ENODEV;
> +
> +	mutex_lock(&vdev->vdev.dev_set->lock);
> +	if (!vfio_pci_dev_set_resettable(vdev->vdev.dev_set)) {
> +		ret = -EPERM;
> +		goto out_unlock;
> +	}
> +
> +	iommufd = vfio_iommufd_physical_ictx(&vdev->vdev);
> +	if (!iommufd) {
> +		ret = -EPERM;
> +		goto out_unlock;
> +	}
> +
> +	/* How many devices are affected? */
> +	ret = vfio_pci_for_each_slot_or_bus(vdev->pdev, vfio_pci_count_devs,
> +					    &count, slot);
> +	if (ret)
> +		goto out_unlock;
> +
> +	WARN_ON(!count); /* Should always be at least one */
> +
> +	/*
> +	 * If there's enough space, fill it now, otherwise return -ENOSPC and
> +	 * the number of devices affected.
> +	 */
> +	if (hdr.argsz < sizeof(hdr) + (count * sizeof(*devices))) {
> +		ret = -ENOSPC;
> +		hdr.count = count;
> +		goto reset_info_exit;
> +	}
> +
> +	devices = kcalloc(count, sizeof(*devices), GFP_KERNEL);
> +	if (!devices) {
> +		ret = -ENOMEM;
> +		goto reset_info_exit;
> +	}
> +
> +	list_for_each_entry(cur, &vdev->vdev.dev_set->device_list, vdev.dev_set_list) {
> +		cur_iommufd = vfio_iommufd_physical_ictx(&cur->vdev);
> +		if (cur->vdev.open_count) {
> +			if (cur_iommufd != iommufd) {
> +				ret = -EPERM;
> +				break;
> +			}
> +			ret = vfio_iommufd_physical_devid(&cur->vdev, &devices[index]);
> +			if (ret)
> +				break;
> +			index++;
> +		}
> +	}
> +
> +reset_info_exit:
> +	if (copy_to_user(arg, &hdr, minsz))
> +		ret = -EFAULT;
> +
> +	if (!ret) {
> +		if (copy_to_user(&arg->devices, devices,
> +				 hdr.count * sizeof(*devices)))
> +			ret = -EFAULT;
> +	}
> +
> +	kfree(devices);
> +out_unlock:
> +	mutex_unlock(&vdev->vdev.dev_set->lock);
> +	return ret;
> +}
> +
>  static int vfio_pci_ioctl_get_pci_hot_reset_info(
>  	struct vfio_pci_core_device *vdev,
>  	struct vfio_pci_hot_reset_info __user *arg)
> @@ -1404,6 +1500,8 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
>  		return vfio_pci_ioctl_get_irq_info(vdev, uarg);
>  	case VFIO_DEVICE_GET_PCI_HOT_RESET_INFO:
>  		return vfio_pci_ioctl_get_pci_hot_reset_info(vdev, uarg);
> +	case VFIO_DEVICE_GET_PCI_HOT_RESET_GROUP_INFO:
> +		return vfio_pci_ioctl_get_pci_hot_reset_group_info(vdev, uarg);
>  	case VFIO_DEVICE_GET_REGION_INFO:
>  		return vfio_pci_ioctl_get_region_info(vdev, uarg);
>  	case VFIO_DEVICE_IOEVENTFD:
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 25432ef213ee..61b801dfd40b 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -669,6 +669,53 @@ struct vfio_pci_hot_reset_info {
>  
>  #define VFIO_DEVICE_GET_PCI_HOT_RESET_INFO	_IO(VFIO_TYPE, VFIO_BASE + 12)
>  
> +/**
> + * VFIO_DEVICE_GET_PCI_HOT_RESET_GROUP_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 12,
> + *						    struct vfio_pci_hot_reset_group_info)
> + *
> + * This is used in the vfio device cdev mode.  It returns the list of
> + * affected devices (represented by iommufd dev_id) when hot reset is
> + * issued on the current device with which this ioctl is invoked.  It
> + * only includes the devices that are opened by the current user in the
> + * time of this command is invoked.  This list may change when user opens
> + * new device or close opened device, hence user should re-invoke it
> + * after open/close devices.  This command has no guarantee on the result
> + * of VFIO_DEVICE_PCI_HOT_RESET since the not-opened affected device can
> + * be by other users in the window between the two ioctls.  If the affected
> + * devices are opened by multiple users, the VFIO_DEVICE_PCI_HOT_RESET
> + * shall fail, detail can check the description of VFIO_DEVICE_PCI_HOT_RESET.
> + *
> + * For the users that open vfio group/container, this ioctl is useless as
> + * they have no idea about the iommufd dev_id returned by this ioctl.  For
> + * the users of the traditional mode vfio group/container, this ioctl will
> + * fail as this mode does not use iommufd hence no dev_id to report back.
> + * For the users of the iommufd compat mode vfio group/container, this ioctl
> + * would succeed as this mode uses iommufd as container fd.  But such users
> + * still have no idea about the iommufd dev_id as the dev_id is only stored
> + * in kernel in this mode.  For the users of the vfio group/container, the
> + * VFIO_DEVICE_GET_PCI_HOT_RESET_INFO should be used to know the hot reset
> + * affected devices.
> + *
> + * Return: 0 on success, -errno on failure:
> + *	-enospc = insufficient buffer;
> + *	-enodev = unsupported for device;
> + *	-eperm = no permission for device, this error comes:
> + *		 - when there are affected devices that are opened but
> + *		   not bound to the same iommufd with the current device
> + *		   with which this ioctl is invoked,
> + *		 - there are affected devices that are not bound to vfio
> + *		   driver yet.
> + *		 - no valid iommufd is bound (e.g. noiommu mode)
> + */
> +struct vfio_pci_hot_reset_group_info {
> +	__u32	argsz;
> +	__u32	flags;
> +	__u32	count;
> +	__u32	devices[];
> +};
> +
> +#define VFIO_DEVICE_GET_PCI_HOT_RESET_GROUP_INFO	_IO(VFIO_TYPE, VFIO_BASE + 18)
> +
>  /**
>   * VFIO_DEVICE_PCI_HOT_RESET - _IOW(VFIO_TYPE, VFIO_BASE + 13,
>   *				    struct vfio_pci_hot_reset)
Alex Williamson March 27, 2023, 8:40 p.m. UTC | #2
On Mon, 27 Mar 2023 13:26:19 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Mon, 27 Mar 2023 02:34:58 -0700
> Yi Liu <yi.l.liu@intel.com> wrote:
> 
> > This is a preparation for vfio device cdev as cdev gives userspace the
> > capability to open device cdev fd and management stack (e.g. libvirt)
> > could pass the device fd to the actual user (e.g. QEMU). As a result,
> > the actual user has no idea about the device's bus:devfn information.
> > This is a problem when user uses VFIO_DEVICE_GET_PCI_HOT_RESET_INFO to
> > know the hot reset affected device scope as this ioctl returns bus:devfn
> > info. For the fd passing usage, the acutal user cannot map the bus:devfn
> > to the devices it has opened via the fd passed from management stack. So
> > a new ioctl is required.
> > 
> > This new ioctl reports the list of iommufd dev_id that is opened by the
> > user. If there is affected device that is not bound to vfio driver or
> > opened by another user, this command shall fail with -EPERM. For the
> > noiommu mode in the vfio device cdev path, this shall fail as no dev_id
> > would be generated, hence nothing to report.
> > 
> > This ioctl is useless to the users that open vfio group as such users
> > have no idea about the iommufd dev_id and it can use the existing
> > VFIO_DEVICE_GET_PCI_HOT_RESET_INFO. The user that uses the traditional
> > mode vfio group/container would be failed if invoking this ioctl. But
> > the user that uses the iommufd compat mode vfio group/container shall
> > succeed. This is harmless as long as user cannot make use of it and
> > should use VFIO_DEVICE_GET_PCI_HOT_RESET_INFO.  
> 
> 
> So VFIO_DEVICE_GET_PCI_HOT_RESET_INFO reports a group and bdf, but
> VFIO_DEVICE_GET_PCI_HOT_RESET_*GROUP*_INFO is meant for the non-group,
> cdev use case and returns a dev_id rather than a group???
> 
> Additionally, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO has a flags arg that
> isn't used, why do we need a new ioctl vs defining
> VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID.  In fact, we could define
> vfio_dependent_device as:
> 
> struct vfio_pci_dependent_device {
> 	union {
> 	        __u32   group_id;
> 		__u32	dev_id;
> 	}
>         __u16   segment;
>         __u8    bus;
>         __u8    devfn;
> };
> 
> If the user calls with the above flag, dev_id is valid, otherwise
> group_id.  Perhaps segment:buus:devfn could still be filled in with a
> NULL/invalid dev_id if the user doesn't have permissions for the device
> so that debugging from userspace isn't so opaque.  Thanks,
> 
> Alex
>  
> > Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> > ---
> >  drivers/vfio/pci/vfio_pci_core.c | 98 ++++++++++++++++++++++++++++++++
> >  include/uapi/linux/vfio.h        | 47 +++++++++++++++
> >  2 files changed, 145 insertions(+)
> > 
> > diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> > index 19f5b075d70a..45edf4e9b98b 100644
> > --- a/drivers/vfio/pci/vfio_pci_core.c
> > +++ b/drivers/vfio/pci/vfio_pci_core.c
> > @@ -1181,6 +1181,102 @@ static int vfio_pci_ioctl_reset(struct vfio_pci_core_device *vdev,
> >  	return ret;
> >  }
> >  
> > +static struct pci_dev *
> > +vfio_pci_dev_set_resettable(struct vfio_device_set *dev_set);
> > +
> > +static int vfio_pci_ioctl_get_pci_hot_reset_group_info(
> > +	struct vfio_pci_core_device *vdev,
> > +	struct vfio_pci_hot_reset_group_info __user *arg)
> > +{
> > +	unsigned long minsz =
> > +		offsetofend(struct vfio_pci_hot_reset_group_info, count);
> > +	struct vfio_pci_hot_reset_group_info hdr;
> > +	struct iommufd_ctx *iommufd, *cur_iommufd;
> > +	u32 count = 0, index = 0, *devices = NULL;
> > +	struct vfio_pci_core_device *cur;
> > +	bool slot = false;
> > +	int ret = 0;
> > +
> > +	if (copy_from_user(&hdr, arg, minsz))
> > +		return -EFAULT;
> > +
> > +	if (hdr.argsz < minsz)
> > +		return -EINVAL;
> > +
> > +	hdr.flags = 0;
> > +
> > +	/* Can we do a slot or bus reset or neither? */
> > +	if (!pci_probe_reset_slot(vdev->pdev->slot))
> > +		slot = true;
> > +	else if (pci_probe_reset_bus(vdev->pdev->bus))
> > +		return -ENODEV;
> > +
> > +	mutex_lock(&vdev->vdev.dev_set->lock);
> > +	if (!vfio_pci_dev_set_resettable(vdev->vdev.dev_set)) {
> > +		ret = -EPERM;
> > +		goto out_unlock;
> > +	}
> > +
> > +	iommufd = vfio_iommufd_physical_ictx(&vdev->vdev);
> > +	if (!iommufd) {
> > +		ret = -EPERM;
> > +		goto out_unlock;
> > +	}
> > +
> > +	/* How many devices are affected? */
> > +	ret = vfio_pci_for_each_slot_or_bus(vdev->pdev, vfio_pci_count_devs,
> > +					    &count, slot);
> > +	if (ret)
> > +		goto out_unlock;
> > +
> > +	WARN_ON(!count); /* Should always be at least one */
> > +
> > +	/*
> > +	 * If there's enough space, fill it now, otherwise return -ENOSPC and
> > +	 * the number of devices affected.
> > +	 */
> > +	if (hdr.argsz < sizeof(hdr) + (count * sizeof(*devices))) {
> > +		ret = -ENOSPC;
> > +		hdr.count = count;
> > +		goto reset_info_exit;
> > +	}
> > +
> > +	devices = kcalloc(count, sizeof(*devices), GFP_KERNEL);
> > +	if (!devices) {
> > +		ret = -ENOMEM;
> > +		goto reset_info_exit;
> > +	}
> > +
> > +	list_for_each_entry(cur, &vdev->vdev.dev_set->device_list, vdev.dev_set_list) {
> > +		cur_iommufd = vfio_iommufd_physical_ictx(&cur->vdev);
> > +		if (cur->vdev.open_count) {
> > +			if (cur_iommufd != iommufd) {
> > +				ret = -EPERM;
> > +				break;
> > +			}
> > +			ret = vfio_iommufd_physical_devid(&cur->vdev, &devices[index]);
> > +			if (ret)
> > +				break;
> > +			index++;
> > +		}
> > +	}

This also makes use of vfio_pci_for_each_slot_or_bus() to iterate
affected devices, but then still assumes those affected devices are
necessarily represented in the dev_set.  For example, a two-function
device with ACS isolation can have function 0 bound to vfio and
function 1 bound to a native host driver.  The above code requires the
user to pass a buffer large enough for both functions, but only
function 0 is part of the dev_set, so function 1 is not reported as
affected, nor does it generate an error.

It looks like we also fail to set hdr.count except in the error path
above, so the below copy_to_user() copies a user specified range beyond
the end the allocated devices buffer out to userspace!  Thanks,

Alex

> > +
> > +reset_info_exit:
> > +	if (copy_to_user(arg, &hdr, minsz))
> > +		ret = -EFAULT;
> > +
> > +	if (!ret) {
> > +		if (copy_to_user(&arg->devices, devices,
> > +				 hdr.count * sizeof(*devices)))
> > +			ret = -EFAULT;
> > +	}
> > +
> > +	kfree(devices);
> > +out_unlock:
> > +	mutex_unlock(&vdev->vdev.dev_set->lock);
> > +	return ret;
> > +}
> > +
> >  static int vfio_pci_ioctl_get_pci_hot_reset_info(
> >  	struct vfio_pci_core_device *vdev,
> >  	struct vfio_pci_hot_reset_info __user *arg)
> > @@ -1404,6 +1500,8 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
> >  		return vfio_pci_ioctl_get_irq_info(vdev, uarg);
> >  	case VFIO_DEVICE_GET_PCI_HOT_RESET_INFO:
> >  		return vfio_pci_ioctl_get_pci_hot_reset_info(vdev, uarg);
> > +	case VFIO_DEVICE_GET_PCI_HOT_RESET_GROUP_INFO:
> > +		return vfio_pci_ioctl_get_pci_hot_reset_group_info(vdev, uarg);
> >  	case VFIO_DEVICE_GET_REGION_INFO:
> >  		return vfio_pci_ioctl_get_region_info(vdev, uarg);
> >  	case VFIO_DEVICE_IOEVENTFD:
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index 25432ef213ee..61b801dfd40b 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -669,6 +669,53 @@ struct vfio_pci_hot_reset_info {
> >  
> >  #define VFIO_DEVICE_GET_PCI_HOT_RESET_INFO	_IO(VFIO_TYPE, VFIO_BASE + 12)
> >  
> > +/**
> > + * VFIO_DEVICE_GET_PCI_HOT_RESET_GROUP_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 12,
> > + *						    struct vfio_pci_hot_reset_group_info)
> > + *
> > + * This is used in the vfio device cdev mode.  It returns the list of
> > + * affected devices (represented by iommufd dev_id) when hot reset is
> > + * issued on the current device with which this ioctl is invoked.  It
> > + * only includes the devices that are opened by the current user in the
> > + * time of this command is invoked.  This list may change when user opens
> > + * new device or close opened device, hence user should re-invoke it
> > + * after open/close devices.  This command has no guarantee on the result
> > + * of VFIO_DEVICE_PCI_HOT_RESET since the not-opened affected device can
> > + * be by other users in the window between the two ioctls.  If the affected
> > + * devices are opened by multiple users, the VFIO_DEVICE_PCI_HOT_RESET
> > + * shall fail, detail can check the description of VFIO_DEVICE_PCI_HOT_RESET.
> > + *
> > + * For the users that open vfio group/container, this ioctl is useless as
> > + * they have no idea about the iommufd dev_id returned by this ioctl.  For
> > + * the users of the traditional mode vfio group/container, this ioctl will
> > + * fail as this mode does not use iommufd hence no dev_id to report back.
> > + * For the users of the iommufd compat mode vfio group/container, this ioctl
> > + * would succeed as this mode uses iommufd as container fd.  But such users
> > + * still have no idea about the iommufd dev_id as the dev_id is only stored
> > + * in kernel in this mode.  For the users of the vfio group/container, the
> > + * VFIO_DEVICE_GET_PCI_HOT_RESET_INFO should be used to know the hot reset
> > + * affected devices.
> > + *
> > + * Return: 0 on success, -errno on failure:
> > + *	-enospc = insufficient buffer;
> > + *	-enodev = unsupported for device;
> > + *	-eperm = no permission for device, this error comes:
> > + *		 - when there are affected devices that are opened but
> > + *		   not bound to the same iommufd with the current device
> > + *		   with which this ioctl is invoked,
> > + *		 - there are affected devices that are not bound to vfio
> > + *		   driver yet.
> > + *		 - no valid iommufd is bound (e.g. noiommu mode)
> > + */
> > +struct vfio_pci_hot_reset_group_info {
> > +	__u32	argsz;
> > +	__u32	flags;
> > +	__u32	count;
> > +	__u32	devices[];
> > +};
> > +
> > +#define VFIO_DEVICE_GET_PCI_HOT_RESET_GROUP_INFO	_IO(VFIO_TYPE, VFIO_BASE + 18)
> > +
> >  /**
> >   * VFIO_DEVICE_PCI_HOT_RESET - _IOW(VFIO_TYPE, VFIO_BASE + 13,
> >   *				    struct vfio_pci_hot_reset)  
>
Yi Liu March 28, 2023, 3:32 a.m. UTC | #3
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Tuesday, March 28, 2023 3:26 AM
> 
> On Mon, 27 Mar 2023 02:34:58 -0700
> Yi Liu <yi.l.liu@intel.com> wrote:
> 
> > This is a preparation for vfio device cdev as cdev gives userspace the
> > capability to open device cdev fd and management stack (e.g. libvirt)
> > could pass the device fd to the actual user (e.g. QEMU). As a result,
> > the actual user has no idea about the device's bus:devfn information.
> > This is a problem when user uses VFIO_DEVICE_GET_PCI_HOT_RESET_INFO to
> > know the hot reset affected device scope as this ioctl returns bus:devfn
> > info. For the fd passing usage, the acutal user cannot map the bus:devfn
> > to the devices it has opened via the fd passed from management stack. So
> > a new ioctl is required.
> >
> > This new ioctl reports the list of iommufd dev_id that is opened by the
> > user. If there is affected device that is not bound to vfio driver or
> > opened by another user, this command shall fail with -EPERM. For the
> > noiommu mode in the vfio device cdev path, this shall fail as no dev_id
> > would be generated, hence nothing to report.
> >
> > This ioctl is useless to the users that open vfio group as such users
> > have no idea about the iommufd dev_id and it can use the existing
> > VFIO_DEVICE_GET_PCI_HOT_RESET_INFO. The user that uses the traditional
> > mode vfio group/container would be failed if invoking this ioctl. But
> > the user that uses the iommufd compat mode vfio group/container shall
> > succeed. This is harmless as long as user cannot make use of it and
> > should use VFIO_DEVICE_GET_PCI_HOT_RESET_INFO.
> 
> 
> So VFIO_DEVICE_GET_PCI_HOT_RESET_INFO reports a group and bdf, but
> VFIO_DEVICE_GET_PCI_HOT_RESET_*GROUP*_INFO is meant for the non-
> group,
> cdev use case and returns a dev_id rather than a group???

Yes, this is the meaning, but poor naming here ☹ I also struggled on it.
Perhaps your below Suggestion makes more sense. Introducing a flag and
reuse the existing _INFO ioctl.

> Additionally, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO has a flags arg that
> isn't used, why do we need a new ioctl vs defining
> VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID.

Sure. I can follow this suggestion. BTW. I have a doubt here. This new flag
is set by user. What if in the future kernel has new extensions and needs
to report something new to the user and add new flags to tell user? Such
flag is set by kernel. Then the flags field may have two kinds of flags (some
set by user while some set by kernel). Will it mess up the flags space?

>  In fact, we could define vfio_dependent_device as:
> 
> struct vfio_pci_dependent_device {
> 	union {
> 	        __u32   group_id;
> 		__u32	dev_id;
> 	}
>         __u16   segment;
>         __u8    bus;
>         __u8    devfn;
> };
> 
> If the user calls with the above flag, dev_id is valid, otherwise
> group_id.  Perhaps segment:buus:devfn could still be filled in with a
> NULL/invalid dev_id if the user doesn't have permissions for the device
> so that debugging from userspace isn't so opaque.  Thanks,

Also, have one question here. Should the invalid dev_id be defined in
the vfio uapi or iommufd uapi? Maybe the latter one since dev_id is
generated by iommufd subsystem.

Regards,
Yi Liu
> 
> > Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> > ---
> >  drivers/vfio/pci/vfio_pci_core.c | 98
> ++++++++++++++++++++++++++++++++
> >  include/uapi/linux/vfio.h        | 47 +++++++++++++++
> >  2 files changed, 145 insertions(+)
> >
> > diff --git a/drivers/vfio/pci/vfio_pci_core.c
> b/drivers/vfio/pci/vfio_pci_core.c
> > index 19f5b075d70a..45edf4e9b98b 100644
> > --- a/drivers/vfio/pci/vfio_pci_core.c
> > +++ b/drivers/vfio/pci/vfio_pci_core.c
> > @@ -1181,6 +1181,102 @@ static int vfio_pci_ioctl_reset(struct
> vfio_pci_core_device *vdev,
> >  	return ret;
> >  }
> >
> > +static struct pci_dev *
> > +vfio_pci_dev_set_resettable(struct vfio_device_set *dev_set);
> > +
> > +static int vfio_pci_ioctl_get_pci_hot_reset_group_info(
> > +	struct vfio_pci_core_device *vdev,
> > +	struct vfio_pci_hot_reset_group_info __user *arg)
> > +{
> > +	unsigned long minsz =
> > +		offsetofend(struct vfio_pci_hot_reset_group_info, count);
> > +	struct vfio_pci_hot_reset_group_info hdr;
> > +	struct iommufd_ctx *iommufd, *cur_iommufd;
> > +	u32 count = 0, index = 0, *devices = NULL;
> > +	struct vfio_pci_core_device *cur;
> > +	bool slot = false;
> > +	int ret = 0;
> > +
> > +	if (copy_from_user(&hdr, arg, minsz))
> > +		return -EFAULT;
> > +
> > +	if (hdr.argsz < minsz)
> > +		return -EINVAL;
> > +
> > +	hdr.flags = 0;
> > +
> > +	/* Can we do a slot or bus reset or neither? */
> > +	if (!pci_probe_reset_slot(vdev->pdev->slot))
> > +		slot = true;
> > +	else if (pci_probe_reset_bus(vdev->pdev->bus))
> > +		return -ENODEV;
> > +
> > +	mutex_lock(&vdev->vdev.dev_set->lock);
> > +	if (!vfio_pci_dev_set_resettable(vdev->vdev.dev_set)) {
> > +		ret = -EPERM;
> > +		goto out_unlock;
> > +	}
> > +
> > +	iommufd = vfio_iommufd_physical_ictx(&vdev->vdev);
> > +	if (!iommufd) {
> > +		ret = -EPERM;
> > +		goto out_unlock;
> > +	}
> > +
> > +	/* How many devices are affected? */
> > +	ret = vfio_pci_for_each_slot_or_bus(vdev->pdev,
> vfio_pci_count_devs,
> > +					    &count, slot);
> > +	if (ret)
> > +		goto out_unlock;
> > +
> > +	WARN_ON(!count); /* Should always be at least one */
> > +
> > +	/*
> > +	 * If there's enough space, fill it now, otherwise return -ENOSPC and
> > +	 * the number of devices affected.
> > +	 */
> > +	if (hdr.argsz < sizeof(hdr) + (count * sizeof(*devices))) {
> > +		ret = -ENOSPC;
> > +		hdr.count = count;
> > +		goto reset_info_exit;
> > +	}
> > +
> > +	devices = kcalloc(count, sizeof(*devices), GFP_KERNEL);
> > +	if (!devices) {
> > +		ret = -ENOMEM;
> > +		goto reset_info_exit;
> > +	}
> > +
> > +	list_for_each_entry(cur, &vdev->vdev.dev_set->device_list,
> vdev.dev_set_list) {
> > +		cur_iommufd = vfio_iommufd_physical_ictx(&cur->vdev);
> > +		if (cur->vdev.open_count) {
> > +			if (cur_iommufd != iommufd) {
> > +				ret = -EPERM;
> > +				break;
> > +			}
> > +			ret = vfio_iommufd_physical_devid(&cur->vdev,
> &devices[index]);
> > +			if (ret)
> > +				break;
> > +			index++;
> > +		}
> > +	}
> > +
> > +reset_info_exit:
> > +	if (copy_to_user(arg, &hdr, minsz))
> > +		ret = -EFAULT;
> > +
> > +	if (!ret) {
> > +		if (copy_to_user(&arg->devices, devices,
> > +				 hdr.count * sizeof(*devices)))
> > +			ret = -EFAULT;
> > +	}
> > +
> > +	kfree(devices);
> > +out_unlock:
> > +	mutex_unlock(&vdev->vdev.dev_set->lock);
> > +	return ret;
> > +}
> > +
> >  static int vfio_pci_ioctl_get_pci_hot_reset_info(
> >  	struct vfio_pci_core_device *vdev,
> >  	struct vfio_pci_hot_reset_info __user *arg)
> > @@ -1404,6 +1500,8 @@ long vfio_pci_core_ioctl(struct vfio_device
> *core_vdev, unsigned int cmd,
> >  		return vfio_pci_ioctl_get_irq_info(vdev, uarg);
> >  	case VFIO_DEVICE_GET_PCI_HOT_RESET_INFO:
> >  		return vfio_pci_ioctl_get_pci_hot_reset_info(vdev, uarg);
> > +	case VFIO_DEVICE_GET_PCI_HOT_RESET_GROUP_INFO:
> > +		return vfio_pci_ioctl_get_pci_hot_reset_group_info(vdev,
> uarg);
> >  	case VFIO_DEVICE_GET_REGION_INFO:
> >  		return vfio_pci_ioctl_get_region_info(vdev, uarg);
> >  	case VFIO_DEVICE_IOEVENTFD:
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index 25432ef213ee..61b801dfd40b 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -669,6 +669,53 @@ struct vfio_pci_hot_reset_info {
> >
> >  #define VFIO_DEVICE_GET_PCI_HOT_RESET_INFO	_IO(VFIO_TYPE,
> VFIO_BASE + 12)
> >
> > +/**
> > + * VFIO_DEVICE_GET_PCI_HOT_RESET_GROUP_INFO - _IOWR(VFIO_TYPE,
> VFIO_BASE + 12,
> > + *						    struct
> vfio_pci_hot_reset_group_info)
> > + *
> > + * This is used in the vfio device cdev mode.  It returns the list of
> > + * affected devices (represented by iommufd dev_id) when hot reset is
> > + * issued on the current device with which this ioctl is invoked.  It
> > + * only includes the devices that are opened by the current user in the
> > + * time of this command is invoked.  This list may change when user
> opens
> > + * new device or close opened device, hence user should re-invoke it
> > + * after open/close devices.  This command has no guarantee on the
> result
> > + * of VFIO_DEVICE_PCI_HOT_RESET since the not-opened affected device
> can
> > + * be by other users in the window between the two ioctls.  If the
> affected
> > + * devices are opened by multiple users, the
> VFIO_DEVICE_PCI_HOT_RESET
> > + * shall fail, detail can check the description of
> VFIO_DEVICE_PCI_HOT_RESET.
> > + *
> > + * For the users that open vfio group/container, this ioctl is useless as
> > + * they have no idea about the iommufd dev_id returned by this ioctl.
> For
> > + * the users of the traditional mode vfio group/container, this ioctl will
> > + * fail as this mode does not use iommufd hence no dev_id to report
> back.
> > + * For the users of the iommufd compat mode vfio group/container, this
> ioctl
> > + * would succeed as this mode uses iommufd as container fd.  But such
> users
> > + * still have no idea about the iommufd dev_id as the dev_id is only
> stored
> > + * in kernel in this mode.  For the users of the vfio group/container, the
> > + * VFIO_DEVICE_GET_PCI_HOT_RESET_INFO should be used to know the
> hot reset
> > + * affected devices.
> > + *
> > + * Return: 0 on success, -errno on failure:
> > + *	-enospc = insufficient buffer;
> > + *	-enodev = unsupported for device;
> > + *	-eperm = no permission for device, this error comes:
> > + *		 - when there are affected devices that are opened but
> > + *		   not bound to the same iommufd with the current device
> > + *		   with which this ioctl is invoked,
> > + *		 - there are affected devices that are not bound to vfio
> > + *		   driver yet.
> > + *		 - no valid iommufd is bound (e.g. noiommu mode)
> > + */
> > +struct vfio_pci_hot_reset_group_info {
> > +	__u32	argsz;
> > +	__u32	flags;
> > +	__u32	count;
> > +	__u32	devices[];
> > +};
> > +
> > +#define VFIO_DEVICE_GET_PCI_HOT_RESET_GROUP_INFO
> 	_IO(VFIO_TYPE, VFIO_BASE + 18)
> > +
> >  /**
> >   * VFIO_DEVICE_PCI_HOT_RESET - _IOW(VFIO_TYPE, VFIO_BASE + 13,
> >   *				    struct vfio_pci_hot_reset)
Yi Liu March 28, 2023, 3:45 a.m. UTC | #4
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Tuesday, March 28, 2023 4:41 AM
> 
> On Mon, 27 Mar 2023 13:26:19 -0600
> Alex Williamson <alex.williamson@redhat.com> wrote:
> 
[...]
> > > diff --git a/drivers/vfio/pci/vfio_pci_core.c
> b/drivers/vfio/pci/vfio_pci_core.c
> > > index 19f5b075d70a..45edf4e9b98b 100644
> > > --- a/drivers/vfio/pci/vfio_pci_core.c
> > > +++ b/drivers/vfio/pci/vfio_pci_core.c
> > > @@ -1181,6 +1181,102 @@ static int vfio_pci_ioctl_reset(struct
> vfio_pci_core_device *vdev,
> > >  	return ret;
> > >  }
> > >
> > > +static struct pci_dev *
> > > +vfio_pci_dev_set_resettable(struct vfio_device_set *dev_set);
> > > +
> > > +static int vfio_pci_ioctl_get_pci_hot_reset_group_info(
> > > +	struct vfio_pci_core_device *vdev,
> > > +	struct vfio_pci_hot_reset_group_info __user *arg)
> > > +{
> > > +	unsigned long minsz =
> > > +		offsetofend(struct vfio_pci_hot_reset_group_info, count);
> > > +	struct vfio_pci_hot_reset_group_info hdr;
> > > +	struct iommufd_ctx *iommufd, *cur_iommufd;
> > > +	u32 count = 0, index = 0, *devices = NULL;
> > > +	struct vfio_pci_core_device *cur;
> > > +	bool slot = false;
> > > +	int ret = 0;
> > > +
> > > +	if (copy_from_user(&hdr, arg, minsz))
> > > +		return -EFAULT;
> > > +
> > > +	if (hdr.argsz < minsz)
> > > +		return -EINVAL;
> > > +
> > > +	hdr.flags = 0;
> > > +
> > > +	/* Can we do a slot or bus reset or neither? */
> > > +	if (!pci_probe_reset_slot(vdev->pdev->slot))
> > > +		slot = true;
> > > +	else if (pci_probe_reset_bus(vdev->pdev->bus))
> > > +		return -ENODEV;
> > > +
> > > +	mutex_lock(&vdev->vdev.dev_set->lock);
> > > +	if (!vfio_pci_dev_set_resettable(vdev->vdev.dev_set)) {

[1]

> > > +		ret = -EPERM;
> > > +		goto out_unlock;
> > > +	}
> > > +
> > > +	iommufd = vfio_iommufd_physical_ictx(&vdev->vdev);
> > > +	if (!iommufd) {
> > > +		ret = -EPERM;
> > > +		goto out_unlock;
> > > +	}
> > > +
> > > +	/* How many devices are affected? */
> > > +	ret = vfio_pci_for_each_slot_or_bus(vdev->pdev,
> vfio_pci_count_devs,
> > > +					    &count, slot);
> > > +	if (ret)
> > > +		goto out_unlock;
> > > +
> > > +	WARN_ON(!count); /* Should always be at least one */
> > > +
> > > +	/*
> > > +	 * If there's enough space, fill it now, otherwise return -ENOSPC and
> > > +	 * the number of devices affected.
> > > +	 */
> > > +	if (hdr.argsz < sizeof(hdr) + (count * sizeof(*devices))) {
> > > +		ret = -ENOSPC;
> > > +		hdr.count = count;
> > > +		goto reset_info_exit;
> > > +	}
> > > +
> > > +	devices = kcalloc(count, sizeof(*devices), GFP_KERNEL);
> > > +	if (!devices) {
> > > +		ret = -ENOMEM;
> > > +		goto reset_info_exit;
> > > +	}
> > > +
> > > +	list_for_each_entry(cur, &vdev->vdev.dev_set->device_list,
> vdev.dev_set_list) {
> > > +		cur_iommufd = vfio_iommufd_physical_ictx(&cur->vdev);
> > > +		if (cur->vdev.open_count) {
> > > +			if (cur_iommufd != iommufd) {
> > > +				ret = -EPERM;
> > > +				break;
> > > +			}
> > > +			ret = vfio_iommufd_physical_devid(&cur->vdev,
> &devices[index]);
> > > +			if (ret)
> > > +				break;
> > > +			index++;
> > > +		}
> > > +	}
> 
> This also makes use of vfio_pci_for_each_slot_or_bus() to iterate
> affected devices, but then still assumes those affected devices are
> necessarily represented in the dev_set.  For example, a two-function
> device with ACS isolation can have function 0 bound to vfio and
> function 1 bound to a native host driver.  The above code requires the
> user to pass a buffer large enough for both functions, but only
> function 0 is part of the dev_set, so function 1 is not reported as
> affected, nor does it generate an error.

The vfio_pci_dev_set_resettable() is used at [1] to check if all the affected
devices are in the dev_set. If not, then it fails at the first place. So in the
following code, looping the devices in the dev_set is equivalent to looping
devices with vfio_pci_for_each_slot_or_bus(). I need to loop dev_set as
dev_set has the vfio_device which is more convenient to get dev_id.

> 
> It looks like we also fail to set hdr.count except in the error path
> above, so the below copy_to_user() copies a user specified range beyond
> the end the allocated devices buffer out to userspace!  Thanks,

Oops, yes, it is. My test only has one device affected, so it does not hit
the problem. ☹ 

Thanks,
Yi Liu

> Alex
> 
> > > +
> > > +reset_info_exit:
> > > +	if (copy_to_user(arg, &hdr, minsz))
> > > +		ret = -EFAULT;
> > > +
> > > +	if (!ret) {
> > > +		if (copy_to_user(&arg->devices, devices,
> > > +				 hdr.count * sizeof(*devices)))
> > > +			ret = -EFAULT;
> > > +	}
> > > +
> > > +	kfree(devices);
> > > +out_unlock:
> > > +	mutex_unlock(&vdev->vdev.dev_set->lock);
> > > +	return ret;
> > > +}
> > > +
> > >  static int vfio_pci_ioctl_get_pci_hot_reset_info(
> > >  	struct vfio_pci_core_device *vdev,
> > >  	struct vfio_pci_hot_reset_info __user *arg)
> > > @@ -1404,6 +1500,8 @@ long vfio_pci_core_ioctl(struct vfio_device
> *core_vdev, unsigned int cmd,
> > >  		return vfio_pci_ioctl_get_irq_info(vdev, uarg);
> > >  	case VFIO_DEVICE_GET_PCI_HOT_RESET_INFO:
> > >  		return vfio_pci_ioctl_get_pci_hot_reset_info(vdev, uarg);
> > > +	case VFIO_DEVICE_GET_PCI_HOT_RESET_GROUP_INFO:
> > > +		return vfio_pci_ioctl_get_pci_hot_reset_group_info(vdev,
> uarg);
> > >  	case VFIO_DEVICE_GET_REGION_INFO:
> > >  		return vfio_pci_ioctl_get_region_info(vdev, uarg);
> > >  	case VFIO_DEVICE_IOEVENTFD:
> > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > > index 25432ef213ee..61b801dfd40b 100644
> > > --- a/include/uapi/linux/vfio.h
> > > +++ b/include/uapi/linux/vfio.h
> > > @@ -669,6 +669,53 @@ struct vfio_pci_hot_reset_info {
> > >
> > >  #define VFIO_DEVICE_GET_PCI_HOT_RESET_INFO	_IO(VFIO_TYPE,
> VFIO_BASE + 12)
> > >
> > > +/**
> > > + * VFIO_DEVICE_GET_PCI_HOT_RESET_GROUP_INFO -
> _IOWR(VFIO_TYPE, VFIO_BASE + 12,
> > > + *						    struct
> vfio_pci_hot_reset_group_info)
> > > + *
> > > + * This is used in the vfio device cdev mode.  It returns the list of
> > > + * affected devices (represented by iommufd dev_id) when hot reset is
> > > + * issued on the current device with which this ioctl is invoked.  It
> > > + * only includes the devices that are opened by the current user in the
> > > + * time of this command is invoked.  This list may change when user
> opens
> > > + * new device or close opened device, hence user should re-invoke it
> > > + * after open/close devices.  This command has no guarantee on the
> result
> > > + * of VFIO_DEVICE_PCI_HOT_RESET since the not-opened affected
> device can
> > > + * be by other users in the window between the two ioctls.  If the
> affected
> > > + * devices are opened by multiple users, the
> VFIO_DEVICE_PCI_HOT_RESET
> > > + * shall fail, detail can check the description of
> VFIO_DEVICE_PCI_HOT_RESET.
> > > + *
> > > + * For the users that open vfio group/container, this ioctl is useless as
> > > + * they have no idea about the iommufd dev_id returned by this ioctl.
> For
> > > + * the users of the traditional mode vfio group/container, this ioctl will
> > > + * fail as this mode does not use iommufd hence no dev_id to report
> back.
> > > + * For the users of the iommufd compat mode vfio group/container,
> this ioctl
> > > + * would succeed as this mode uses iommufd as container fd.  But such
> users
> > > + * still have no idea about the iommufd dev_id as the dev_id is only
> stored
> > > + * in kernel in this mode.  For the users of the vfio group/container, the
> > > + * VFIO_DEVICE_GET_PCI_HOT_RESET_INFO should be used to know
> the hot reset
> > > + * affected devices.
> > > + *
> > > + * Return: 0 on success, -errno on failure:
> > > + *	-enospc = insufficient buffer;
> > > + *	-enodev = unsupported for device;
> > > + *	-eperm = no permission for device, this error comes:
> > > + *		 - when there are affected devices that are opened but
> > > + *		   not bound to the same iommufd with the current device
> > > + *		   with which this ioctl is invoked,
> > > + *		 - there are affected devices that are not bound to vfio
> > > + *		   driver yet.
> > > + *		 - no valid iommufd is bound (e.g. noiommu mode)
> > > + */
> > > +struct vfio_pci_hot_reset_group_info {
> > > +	__u32	argsz;
> > > +	__u32	flags;
> > > +	__u32	count;
> > > +	__u32	devices[];
> > > +};
> > > +
> > > +#define VFIO_DEVICE_GET_PCI_HOT_RESET_GROUP_INFO
> 	_IO(VFIO_TYPE, VFIO_BASE + 18)
> > > +
> > >  /**
> > >   * VFIO_DEVICE_PCI_HOT_RESET - _IOW(VFIO_TYPE, VFIO_BASE + 13,
> > >   *				    struct vfio_pci_hot_reset)
> >
Tian, Kevin March 28, 2023, 6:19 a.m. UTC | #5
> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Tuesday, March 28, 2023 11:32 AM
> 
> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Tuesday, March 28, 2023 3:26 AM
> >
> > Additionally, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO has a flags arg that
> > isn't used, why do we need a new ioctl vs defining
> > VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID.
> 
> Sure. I can follow this suggestion. BTW. I have a doubt here. This new flag
> is set by user. What if in the future kernel has new extensions and needs
> to report something new to the user and add new flags to tell user? Such
> flag is set by kernel. Then the flags field may have two kinds of flags (some
> set by user while some set by kernel). Will it mess up the flags space?
> 

flags in a GET_INFO ioctl is for output.

if user needs to use flags as input to select different type of info then it should
be split into multiple GET_INFO cmds.
Jason Gunthorpe March 28, 2023, 12:40 p.m. UTC | #6
On Mon, Mar 27, 2023 at 02:34:58AM -0700, Yi Liu wrote:

> +	devices = kcalloc(count, sizeof(*devices), GFP_KERNEL);
> +	if (!devices) {
> +		ret = -ENOMEM;
> +		goto reset_info_exit;
> +	}

This doesn't need to be so complicated

> +	list_for_each_entry(cur, &vdev->vdev.dev_set->device_list, vdev.dev_set_list) {
> +		cur_iommufd = vfio_iommufd_physical_ictx(&cur->vdev);
> +		if (cur->vdev.open_count) {
> +			if (cur_iommufd != iommufd) {
> +				ret = -EPERM;
> +				break;
> +			}
> +			ret = vfio_iommufd_physical_devid(&cur->vdev, &devices[index]);


u32 device;

if (index >= hdr.count)
   return -ENOSPC;

ret = vfio_iommufd_physical_devid(&cur->vdev, &devices);
...

if (put_user(&arg->devices[index], device))
   -EFAULT

index++;

Jason
Alex Williamson March 28, 2023, 2:25 p.m. UTC | #7
On Tue, 28 Mar 2023 06:19:06 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Tuesday, March 28, 2023 11:32 AM
> >   
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Tuesday, March 28, 2023 3:26 AM
> > >
> > > Additionally, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO has a flags arg that
> > > isn't used, why do we need a new ioctl vs defining
> > > VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID.  
> > 
> > Sure. I can follow this suggestion. BTW. I have a doubt here. This new flag
> > is set by user. What if in the future kernel has new extensions and needs
> > to report something new to the user and add new flags to tell user? Such
> > flag is set by kernel. Then the flags field may have two kinds of flags (some
> > set by user while some set by kernel). Will it mess up the flags space?
> >   
> 
> flags in a GET_INFO ioctl is for output.
> 
> if user needs to use flags as input to select different type of info then it should
> be split into multiple GET_INFO cmds.

I don't know that that's actually a rule, however we don't currently
test flags is zero for input, so in this case I think we are stuck with
it only being for output.

Alternatively, should VFIO_DEVICE_GET_PCI_HOT_RESET_INFO automatically
return the dev_id variant of the output and set a flag to indicate this
is the case when called on a device fd opened as a cdev?  Thanks,

Alex
Yi Liu March 28, 2023, 2:38 p.m. UTC | #8
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Tuesday, March 28, 2023 10:26 PM
> 
> On Tue, 28 Mar 2023 06:19:06 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > Sent: Tuesday, March 28, 2023 11:32 AM
> > >
> > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > Sent: Tuesday, March 28, 2023 3:26 AM
> > > >
> > > > Additionally, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO has a flags arg
> that
> > > > isn't used, why do we need a new ioctl vs defining
> > > > VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID.
> > >
> > > Sure. I can follow this suggestion. BTW. I have a doubt here. This new
> flag
> > > is set by user. What if in the future kernel has new extensions and needs
> > > to report something new to the user and add new flags to tell user? Such
> > > flag is set by kernel. Then the flags field may have two kinds of flags
> (some
> > > set by user while some set by kernel). Will it mess up the flags space?
> > >
> >
> > flags in a GET_INFO ioctl is for output.
> >
> > if user needs to use flags as input to select different type of info then it
> should
> > be split into multiple GET_INFO cmds.
> 
> I don't know that that's actually a rule, however we don't currently
> test flags is zero for input, so in this case I think we are stuck with
> it only being for output.
> 
> Alternatively, should VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
> automatically
> return the dev_id variant of the output and set a flag to indicate this
> is the case when called on a device fd opened as a cdev?  Thanks,

Personally I prefer that user asks for dev_id info explicitly. The major reason
that we return dev_id is that the group/bdf info is not enough for the device
fd passing case. But if qemu opens device by itself, the group/bdf info is still
enough. So a device opened as a cdev doesn't mean it should return dev_id,
it depends on if user has the bdf knowledge.

Regards,
Yi Liu
Yi Liu March 28, 2023, 2:45 p.m. UTC | #9
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, March 28, 2023 8:40 PM
> 
> On Mon, Mar 27, 2023 at 02:34:58AM -0700, Yi Liu wrote:
> 
> > +	devices = kcalloc(count, sizeof(*devices), GFP_KERNEL);
> > +	if (!devices) {
> > +		ret = -ENOMEM;
> > +		goto reset_info_exit;
> > +	}
> 
> This doesn't need to be so complicated
> 
> > +	list_for_each_entry(cur, &vdev->vdev.dev_set->device_list,
> vdev.dev_set_list) {
> > +		cur_iommufd = vfio_iommufd_physical_ictx(&cur->vdev);
> > +		if (cur->vdev.open_count) {
> > +			if (cur_iommufd != iommufd) {
> > +				ret = -EPERM;
> > +				break;
> > +			}
> > +			ret = vfio_iommufd_physical_devid(&cur->vdev,
> &devices[index]);
> 
> 
> u32 device;
> 
> if (index >= hdr.count)
>    return -ENOSPC;
> 
> ret = vfio_iommufd_physical_devid(&cur->vdev, &devices);

Ok, so the whole point is that if  cur->vdev->iommufd_ctx==null, then the
vfio_iommufd_physical_devid() shall fail as well.

> ...
> 
> if (put_user(&arg->devices[index], device))

Will modify it. let's close the "separate ioctl" v.s. "reuse ioctl + new flag" open with
Alex first.

Thanks,
Yi Liu

>    -EFAULT
> 
> index++;
> 
> Jason
Alex Williamson March 28, 2023, 2:46 p.m. UTC | #10
On Tue, 28 Mar 2023 14:38:12 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Tuesday, March 28, 2023 10:26 PM
> > 
> > On Tue, 28 Mar 2023 06:19:06 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >   
> > > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > > Sent: Tuesday, March 28, 2023 11:32 AM
> > > >  
> > > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > > Sent: Tuesday, March 28, 2023 3:26 AM
> > > > >
> > > > > Additionally, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO has a flags arg  
> > that  
> > > > > isn't used, why do we need a new ioctl vs defining
> > > > > VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID.  
> > > >
> > > > Sure. I can follow this suggestion. BTW. I have a doubt here. This new  
> > flag  
> > > > is set by user. What if in the future kernel has new extensions and needs
> > > > to report something new to the user and add new flags to tell user? Such
> > > > flag is set by kernel. Then the flags field may have two kinds of flags  
> > (some  
> > > > set by user while some set by kernel). Will it mess up the flags space?
> > > >  
> > >
> > > flags in a GET_INFO ioctl is for output.
> > >
> > > if user needs to use flags as input to select different type of info then it  
> > should  
> > > be split into multiple GET_INFO cmds.  
> > 
> > I don't know that that's actually a rule, however we don't currently
> > test flags is zero for input, so in this case I think we are stuck with
> > it only being for output.
> > 
> > Alternatively, should VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
> > automatically
> > return the dev_id variant of the output and set a flag to indicate this
> > is the case when called on a device fd opened as a cdev?  Thanks,  
> 
> Personally I prefer that user asks for dev_id info explicitly. The major reason
> that we return dev_id is that the group/bdf info is not enough for the device
> fd passing case. But if qemu opens device by itself, the group/bdf info is still
> enough. So a device opened as a cdev doesn't mean it should return dev_id,
> it depends on if user has the bdf knowledge.

But if QEMU opens the cdev, vs getting it from the group, does it make
any sense to return a set of group-ids + bdf in the host-reset info?
I'm inclined to think the answer is no.

Per my previous suggestion, I think we should always return the bdf. We
can't know if the user is accessing through an fd they opened
themselves or were passed, but it allows an actually useful debugging
report if userspace can know "I can't perform a hot reset of this
device because my iommufd context doesn't know about device <bdf>", vs
an opaque -EPERM.  Therefore I think we're only discussing the data
conveyed in the u32, a group-id or dev_id.  Thanks,

Alex
Yi Liu March 28, 2023, 3 p.m. UTC | #11
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Tuesday, March 28, 2023 10:46 PM
> 
> On Tue, 28 Mar 2023 14:38:12 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Tuesday, March 28, 2023 10:26 PM
> > >
> > > On Tue, 28 Mar 2023 06:19:06 +0000
> > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > >
> > > > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > > > Sent: Tuesday, March 28, 2023 11:32 AM
> > > > >
> > > > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > > > Sent: Tuesday, March 28, 2023 3:26 AM
> > > > > >
> > > > > > Additionally, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO has a flags
> arg
> > > that
> > > > > > isn't used, why do we need a new ioctl vs defining
> > > > > > VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID.
> > > > >
> > > > > Sure. I can follow this suggestion. BTW. I have a doubt here. This
> new
> > > flag
> > > > > is set by user. What if in the future kernel has new extensions and
> needs
> > > > > to report something new to the user and add new flags to tell user?
> Such
> > > > > flag is set by kernel. Then the flags field may have two kinds of flags
> > > (some
> > > > > set by user while some set by kernel). Will it mess up the flags space?
> > > > >
> > > >
> > > > flags in a GET_INFO ioctl is for output.
> > > >
> > > > if user needs to use flags as input to select different type of info then it
> > > should
> > > > be split into multiple GET_INFO cmds.
> > >
> > > I don't know that that's actually a rule, however we don't currently
> > > test flags is zero for input, so in this case I think we are stuck with
> > > it only being for output.
> > >
> > > Alternatively, should VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
> > > automatically
> > > return the dev_id variant of the output and set a flag to indicate this
> > > is the case when called on a device fd opened as a cdev?  Thanks,
> >
> > Personally I prefer that user asks for dev_id info explicitly. The major
> reason
> > that we return dev_id is that the group/bdf info is not enough for the
> device
> > fd passing case. But if qemu opens device by itself, the group/bdf info is
> still
> > enough. So a device opened as a cdev doesn't mean it should return
> dev_id,
> > it depends on if user has the bdf knowledge.
> 
> But if QEMU opens the cdev, vs getting it from the group, does it make
> any sense to return a set of group-ids + bdf in the host-reset info?
> I'm inclined to think the answer is no.
> 
> Per my previous suggestion, I think we should always return the bdf. We
> can't know if the user is accessing through an fd they opened
> themselves or were passed,

Oh, yes. I'm convinced by this reason since only cdev mode supports device fd
passing. So I'll reuse the existing _INFO and let kernel set a flag to mark the returned
info is dev_id+bdf.

A check. If the device that the _INFIO is invoked is opened via cdev, but there
are devices in the dev_set that are got via VFIO_GROUP_GET_DEVICE_FD, should
I fail it or allow it?

> but it allows an actually useful debugging
> report if userspace can know "I can't perform a hot reset of this
> device because my iommufd context doesn't know about device <bdf>", vs
> an opaque -EPERM.  Therefore I think we're only discussing the data
> conveyed in the u32, a group-id or dev_id.  Thanks,

Sure.

Regards,
Yi Liu
Alex Williamson March 28, 2023, 3:18 p.m. UTC | #12
On Tue, 28 Mar 2023 15:00:42 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Tuesday, March 28, 2023 10:46 PM
> > 
> > On Tue, 28 Mar 2023 14:38:12 +0000
> > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> >   
> > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > Sent: Tuesday, March 28, 2023 10:26 PM
> > > >
> > > > On Tue, 28 Mar 2023 06:19:06 +0000
> > > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > >  
> > > > > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > > > > Sent: Tuesday, March 28, 2023 11:32 AM
> > > > > >  
> > > > > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > > > > Sent: Tuesday, March 28, 2023 3:26 AM
> > > > > > >
> > > > > > > Additionally, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO has a flags  
> > arg  
> > > > that  
> > > > > > > isn't used, why do we need a new ioctl vs defining
> > > > > > > VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID.  
> > > > > >
> > > > > > Sure. I can follow this suggestion. BTW. I have a doubt here. This  
> > new  
> > > > flag  
> > > > > > is set by user. What if in the future kernel has new extensions and  
> > needs  
> > > > > > to report something new to the user and add new flags to tell user?  
> > Such  
> > > > > > flag is set by kernel. Then the flags field may have two kinds of flags  
> > > > (some  
> > > > > > set by user while some set by kernel). Will it mess up the flags space?
> > > > > >  
> > > > >
> > > > > flags in a GET_INFO ioctl is for output.
> > > > >
> > > > > if user needs to use flags as input to select different type of info then it  
> > > > should  
> > > > > be split into multiple GET_INFO cmds.  
> > > >
> > > > I don't know that that's actually a rule, however we don't currently
> > > > test flags is zero for input, so in this case I think we are stuck with
> > > > it only being for output.
> > > >
> > > > Alternatively, should VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
> > > > automatically
> > > > return the dev_id variant of the output and set a flag to indicate this
> > > > is the case when called on a device fd opened as a cdev?  Thanks,  
> > >
> > > Personally I prefer that user asks for dev_id info explicitly. The major  
> > reason  
> > > that we return dev_id is that the group/bdf info is not enough for the  
> > device  
> > > fd passing case. But if qemu opens device by itself, the group/bdf info is  
> > still  
> > > enough. So a device opened as a cdev doesn't mean it should return  
> > dev_id,  
> > > it depends on if user has the bdf knowledge.  
> > 
> > But if QEMU opens the cdev, vs getting it from the group, does it make
> > any sense to return a set of group-ids + bdf in the host-reset info?
> > I'm inclined to think the answer is no.
> > 
> > Per my previous suggestion, I think we should always return the bdf. We
> > can't know if the user is accessing through an fd they opened
> > themselves or were passed,  
> 
> Oh, yes. I'm convinced by this reason since only cdev mode supports device fd
> passing. So I'll reuse the existing _INFO and let kernel set a flag to mark the returned
> info is dev_id+bdf.
> 
> A check. If the device that the _INFIO is invoked is opened via cdev, but there
> are devices in the dev_set that are got via VFIO_GROUP_GET_DEVICE_FD, should
> I fail it or allow it?

It's a niche case, but I think it needs to be allowed.  We'd still
report the bdf for those devices, but make use of the invalid/null
dev-id.  I think this empowers userspace that they could make the same
call on a group opened fd if necessary.  An alternative would be to
redefine the returned data structure entirely with a flag per entry
describing the output, but then I think we need to invent a kernel
policy of which gets reported, which seems overly complicated if our
goal is to phase out group usage.  Make sense, or will this bite us?
Thanks,

Alex
Yi Liu March 28, 2023, 3:45 p.m. UTC | #13
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Tuesday, March 28, 2023 11:18 PM
> 
> On Tue, 28 Mar 2023 15:00:42 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Tuesday, March 28, 2023 10:46 PM
> > >
> > > On Tue, 28 Mar 2023 14:38:12 +0000
> > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > >
> > > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > > Sent: Tuesday, March 28, 2023 10:26 PM
> > > > >
> > > > > On Tue, 28 Mar 2023 06:19:06 +0000
> > > > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > > >
> > > > > > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > > > > > Sent: Tuesday, March 28, 2023 11:32 AM
> > > > > > >
> > > > > > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > > > > > Sent: Tuesday, March 28, 2023 3:26 AM
> > > > > > > >
> > > > > > > > Additionally, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO has a
> flags
> > > arg
> > > > > that
> > > > > > > > isn't used, why do we need a new ioctl vs defining
> > > > > > > > VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID.
> > > > > > >
> > > > > > > Sure. I can follow this suggestion. BTW. I have a doubt here. This
> > > new
> > > > > flag
> > > > > > > is set by user. What if in the future kernel has new extensions and
> > > needs
> > > > > > > to report something new to the user and add new flags to tell
> user?
> > > Such
> > > > > > > flag is set by kernel. Then the flags field may have two kinds of
> flags
> > > > > (some
> > > > > > > set by user while some set by kernel). Will it mess up the flags
> space?
> > > > > > >
> > > > > >
> > > > > > flags in a GET_INFO ioctl is for output.
> > > > > >
> > > > > > if user needs to use flags as input to select different type of info
> then it
> > > > > should
> > > > > > be split into multiple GET_INFO cmds.
> > > > >
> > > > > I don't know that that's actually a rule, however we don't currently
> > > > > test flags is zero for input, so in this case I think we are stuck with
> > > > > it only being for output.
> > > > >
> > > > > Alternatively, should VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
> > > > > automatically
> > > > > return the dev_id variant of the output and set a flag to indicate this
> > > > > is the case when called on a device fd opened as a cdev?  Thanks,
> > > >
> > > > Personally I prefer that user asks for dev_id info explicitly. The major
> > > reason
> > > > that we return dev_id is that the group/bdf info is not enough for the
> > > device
> > > > fd passing case. But if qemu opens device by itself, the group/bdf info
> is
> > > still
> > > > enough. So a device opened as a cdev doesn't mean it should return
> > > dev_id,
> > > > it depends on if user has the bdf knowledge.
> > >
> > > But if QEMU opens the cdev, vs getting it from the group, does it make
> > > any sense to return a set of group-ids + bdf in the host-reset info?
> > > I'm inclined to think the answer is no.
> > >
> > > Per my previous suggestion, I think we should always return the bdf. We
> > > can't know if the user is accessing through an fd they opened
> > > themselves or were passed,
> >
> > Oh, yes. I'm convinced by this reason since only cdev mode supports
> device fd
> > passing. So I'll reuse the existing _INFO and let kernel set a flag to mark
> the returned
> > info is dev_id+bdf.
> >
> > A check. If the device that the _INFIO is invoked is opened via cdev, but
> there
> > are devices in the dev_set that are got via VFIO_GROUP_GET_DEVICE_FD,
> should
> > I fail it or allow it?
> 
> It's a niche case, but I think it needs to be allowed. 

I'm also wondering if it is common in the future. Actually, a user should be
preferred to either use the group or cdev, but not both. Otherwise, it looks
to be bad programming.:-)

Also, as an earlier remark from Jason. If there are affected devices that are
opened by other users, the new _INFO should fail with -EPERM. I know this
remark was for the new _INFO ioctl. But now, we are going to reuse the
existing _INFO, so I'd also want to check if we still need this policy? If yes,
then it is a problem to check the owner of the devices that are opened by
the group path.

https://lore.kernel.org/kvm/ZBsF950laMs2ldFc@nvidia.com/


> We'd still
> report the bdf for those devices, but make use of the invalid/null
> dev-id.  I think this empowers userspace that they could make the same
> call on a group opened fd if necessary.

For the devices opened via group path, it should have an entry that
includes invalid_dev_id+bdf. So user can map it to the device. But
there is no group_id, this may be fine since group is just a shortcut
to find the device. Is it?


> An alternative would be to
> redefine the returned data structure entirely with a flag per entry
> describing the output, but then I think we need to invent a kernel
> policy of which gets reported, which seems overly complicated if our
> goal is to phase out group usage.  Make sense, or will this bite us?

This does smell complicated. 
Alex Williamson March 28, 2023, 4 p.m. UTC | #14
On Tue, 28 Mar 2023 15:45:38 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Tuesday, March 28, 2023 11:18 PM
> > 
> > On Tue, 28 Mar 2023 15:00:42 +0000
> > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> >   
> > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > Sent: Tuesday, March 28, 2023 10:46 PM
> > > >
> > > > On Tue, 28 Mar 2023 14:38:12 +0000
> > > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > > >  
> > > > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > > > Sent: Tuesday, March 28, 2023 10:26 PM
> > > > > >
> > > > > > On Tue, 28 Mar 2023 06:19:06 +0000
> > > > > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > > > >  
> > > > > > > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > > > > > > Sent: Tuesday, March 28, 2023 11:32 AM
> > > > > > > >  
> > > > > > > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > > > > > > Sent: Tuesday, March 28, 2023 3:26 AM
> > > > > > > > >
> > > > > > > > > Additionally, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO has a  
> > flags  
> > > > arg  
> > > > > > that  
> > > > > > > > > isn't used, why do we need a new ioctl vs defining
> > > > > > > > > VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID.  
> > > > > > > >
> > > > > > > > Sure. I can follow this suggestion. BTW. I have a doubt here. This  
> > > > new  
> > > > > > flag  
> > > > > > > > is set by user. What if in the future kernel has new extensions and  
> > > > needs  
> > > > > > > > to report something new to the user and add new flags to tell  
> > user?  
> > > > Such  
> > > > > > > > flag is set by kernel. Then the flags field may have two kinds of  
> > flags  
> > > > > > (some  
> > > > > > > > set by user while some set by kernel). Will it mess up the flags  
> > space?  
> > > > > > > >  
> > > > > > >
> > > > > > > flags in a GET_INFO ioctl is for output.
> > > > > > >
> > > > > > > if user needs to use flags as input to select different type of info  
> > then it  
> > > > > > should  
> > > > > > > be split into multiple GET_INFO cmds.  
> > > > > >
> > > > > > I don't know that that's actually a rule, however we don't currently
> > > > > > test flags is zero for input, so in this case I think we are stuck with
> > > > > > it only being for output.
> > > > > >
> > > > > > Alternatively, should VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
> > > > > > automatically
> > > > > > return the dev_id variant of the output and set a flag to indicate this
> > > > > > is the case when called on a device fd opened as a cdev?  Thanks,  
> > > > >
> > > > > Personally I prefer that user asks for dev_id info explicitly. The major  
> > > > reason  
> > > > > that we return dev_id is that the group/bdf info is not enough for the  
> > > > device  
> > > > > fd passing case. But if qemu opens device by itself, the group/bdf info  
> > is  
> > > > still  
> > > > > enough. So a device opened as a cdev doesn't mean it should return  
> > > > dev_id,  
> > > > > it depends on if user has the bdf knowledge.  
> > > >
> > > > But if QEMU opens the cdev, vs getting it from the group, does it make
> > > > any sense to return a set of group-ids + bdf in the host-reset info?
> > > > I'm inclined to think the answer is no.
> > > >
> > > > Per my previous suggestion, I think we should always return the bdf. We
> > > > can't know if the user is accessing through an fd they opened
> > > > themselves or were passed,  
> > >
> > > Oh, yes. I'm convinced by this reason since only cdev mode supports  
> > device fd  
> > > passing. So I'll reuse the existing _INFO and let kernel set a flag to mark  
> > the returned  
> > > info is dev_id+bdf.
> > >
> > > A check. If the device that the _INFIO is invoked is opened via cdev, but  
> > there  
> > > are devices in the dev_set that are got via VFIO_GROUP_GET_DEVICE_FD,  
> > should  
> > > I fail it or allow it?  
> > 
> > It's a niche case, but I think it needs to be allowed.   
> 
> I'm also wondering if it is common in the future. Actually, a user should be
> preferred to either use the group or cdev, but not both. Otherwise, it looks
> to be bad programming.:-)
> 
> Also, as an earlier remark from Jason. If there are affected devices that are
> opened by other users, the new _INFO should fail with -EPERM. I know this
> remark was for the new _INFO ioctl. But now, we are going to reuse the
> existing _INFO, so I'd also want to check if we still need this policy? If yes,
> then it is a problem to check the owner of the devices that are opened by
> the group path.
> 
> https://lore.kernel.org/kvm/ZBsF950laMs2ldFc@nvidia.com/

Personally I don't like the suggestion to fail with -EPERM if the user
doesn't own all the affected devices.  This isn't a "probe if I can do
a reset" ioctl, it's a "provide information about the devices affected
by a reset to know how to call the hot-reset ioctl".  We're returning
the bdf to the cdev version of this ioctl for exactly this debugging
purpose when the devices are not owned, that becomes useless if we give
up an return -EPERM if ownership doesn't align.

> > We'd still
> > report the bdf for those devices, but make use of the invalid/null
> > dev-id.  I think this empowers userspace that they could make the same
> > call on a group opened fd if necessary.  
> 
> For the devices opened via group path, it should have an entry that
> includes invalid_dev_id+bdf. So user can map it to the device. But
> there is no group_id, this may be fine since group is just a shortcut
> to find the device. Is it?

Yes, it could be argued that the group-id itself is superfluous, the
user can determine the group via the bdf, but it also aligns with the
hot-reset ioctl, which currently requires the group fd.  Thanks,

Alex
Jason Gunthorpe March 28, 2023, 4:29 p.m. UTC | #15
On Tue, Mar 28, 2023 at 09:18:01AM -0600, Alex Williamson wrote:

> It's a niche case, but I think it needs to be allowed.  We'd still
> report the bdf for those devices, but make use of the invalid/null
> dev-id.

IDK, it makes the whole implementation much more complicated. Instead
of just copying the current dev_set to the output and calling
vfio_pci_dev_set_resettable() we need to do something more complex..

Keeping the current ioctl as-is means this IOCTL can be used to do any
debugging by getting the actual BDF list.

It means we can make the a new ioctl simple and just return the dev_id
array without these edge complications. I don't think merging two
different ioctls is helping make things simple..

It seems like it does what qemu wants: call the new IOCTL, if it
fails, call the old IOCTL and print out the BDF list to help debug and
then exit.

On success use the data in the new ioctl to generate the machine
configuration to pass the reset grouping into the VM.

When reset actually comes in just trigger it.

Jason
Alex Williamson March 28, 2023, 7:09 p.m. UTC | #16
On Tue, 28 Mar 2023 13:29:23 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Mar 28, 2023 at 09:18:01AM -0600, Alex Williamson wrote:
> 
> > It's a niche case, but I think it needs to be allowed.  We'd still
> > report the bdf for those devices, but make use of the invalid/null
> > dev-id.  
> 
> IDK, it makes the whole implementation much more complicated. Instead
> of just copying the current dev_set to the output and calling
> vfio_pci_dev_set_resettable() we need to do something more complex..
> 
> Keeping the current ioctl as-is means this IOCTL can be used to do any
> debugging by getting the actual BDF list.
> 
> It means we can make the a new ioctl simple and just return the dev_id
> array without these edge complications. I don't think merging two
> different ioctls is helping make things simple..

OTOH, we already have an ioctl that essentially "does the right thing",
we just want to swap out one return field for another.  So which is
more complicated, adding another ioctl that does not quite the same
thing but still needing to maintain the old ioctl for detailed
information, or making the old ioctl bi-modal to return the appropriate
information for the type of device used to access it?

> It seems like it does what qemu wants: call the new IOCTL, if it
> fails, call the old IOCTL and print out the BDF list to help debug and
> then exit.

Userspace is already dealing with a variable length array for the
return value, why would it ever want to repeat that process to get
debugging info.  Besides, wouldn't QEMU prefer the similarity of making
the same call for groups and cdev and simply keying on the data type of
one field?

> On success use the data in the new ioctl to generate the machine
> configuration to pass the reset grouping into the VM.
> 
> When reset actually comes in just trigger it.

"Just trigger it" is the same in either case.  It seems bold to play
the complexity argument when we already have a function that does 90%
the correct thing where we can share much of the implementation and
userspace code without duplicating, but still relying on a legacy
interface for debugging.  Thanks,

Alex
Jason Gunthorpe March 28, 2023, 7:22 p.m. UTC | #17
On Tue, Mar 28, 2023 at 01:09:49PM -0600, Alex Williamson wrote:

> "Just trigger it" is the same in either case.  It seems bold to play
> the complexity argument when we already have a function that does 90%
> the correct thing where we can share much of the implementation and
> userspace code without duplicating, but still relying on a legacy
> interface for debugging. 

It just doesn't seem like we are sharing a lot, this patch is a whole
new function and you are saying the unique implementation needs to be more
complex still.

Maybe the next version will be able to share more??

Like can we just patch the existing code that sets the group_id/dev_id
or somethiing???

Still, I'm not sure that qemu will share really share much here,
because if it runs in group mode then it has to parse the result in a
totally different way than if it runs in dev_id mode. The ioctl call
is only a line or two. I imagine qemu will end up with two different
functions that both return some kind of internal list of qemu devices
in the reset group.

Jason
Yi Liu March 29, 2023, 3:13 a.m. UTC | #18
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Wednesday, March 29, 2023 12:00 AM
> 
[...]
> > > > A check. If the device that the _INFIO is invoked is opened via cdev,
> but
> > > there
> > > > are devices in the dev_set that are got via
> VFIO_GROUP_GET_DEVICE_FD,
> > > should
> > > > I fail it or allow it?
> > >
> > > It's a niche case, but I think it needs to be allowed.
> >
> > I'm also wondering if it is common in the future. Actually, a user should be
> > preferred to either use the group or cdev, but not both. Otherwise, it
> looks
> > to be bad programming.:-)
> >
> > Also, as an earlier remark from Jason. If there are affected devices that
> are
> > opened by other users, the new _INFO should fail with -EPERM. I know
> this
> > remark was for the new _INFO ioctl. But now, we are going to reuse the
> > existing _INFO, so I'd also want to check if we still need this policy? If yes,
> > then it is a problem to check the owner of the devices that are opened by
> > the group path.
> >
> > https://lore.kernel.org/kvm/ZBsF950laMs2ldFc@nvidia.com/
> 
> Personally I don't like the suggestion to fail with -EPERM if the user
> doesn't own all the affected devices.  This isn't a "probe if I can do
> a reset" ioctl, it's a "provide information about the devices affected
> by a reset to know how to call the hot-reset ioctl".  We're returning
> the bdf to the cdev version of this ioctl for exactly this debugging
> purpose when the devices are not owned, that becomes useless if we give
> up an return -EPERM if ownership doesn't align.

Jason's suggestion makes sense for returning the case of returning dev_id
as dev_id is local to iommufd. If there are devices in the same dev_set are
opened by multiple users, multiple iommufd would be used. Then the
dev_id would have overlap. e.g. a dev_set has three devices. Device A and
B are opened by the current user as cdev, dev_id #1 and #2 are generated.
While device C opened by another user as cdev, dev_id #n is generated for
it. If dev_id #n happens to be #1, then user gets two info entries that have
the same dev_id.

I know the existing _INFO does not have such policy since the group/bdf
info are unique in the system. But for the dev_id case, seems really necessary
to fail if the dev_set is not only opened by one user. From this angle, will it be
good to have two ioctls. Sorry for the back-forth here. ☹

> > > We'd still
> > > report the bdf for those devices, but make use of the invalid/null
> > > dev-id.  I think this empowers userspace that they could make the same
> > > call on a group opened fd if necessary.
> >
> > For the devices opened via group path, it should have an entry that
> > includes invalid_dev_id+bdf. So user can map it to the device. But
> > there is no group_id, this may be fine since group is just a shortcut
> > to find the device. Is it?
> 
> Yes, it could be argued that the group-id itself is superfluous, the
> user can determine the group via the bdf, but it also aligns with the
> hot-reset ioctl, which currently requires the group fd.  Thanks,

I see. For the existing _INFO and HOT_RESET ioctl, I have no doubt.
Both of them use the group as a short-cut.

Regards,
Yi Liu
Tian, Kevin March 29, 2023, 9:41 a.m. UTC | #19
> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Wednesday, March 29, 2023 11:14 AM
> 
> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Wednesday, March 29, 2023 12:00 AM
> >
> >
> > Personally I don't like the suggestion to fail with -EPERM if the user
> > doesn't own all the affected devices.  This isn't a "probe if I can do
> > a reset" ioctl, it's a "provide information about the devices affected
> > by a reset to know how to call the hot-reset ioctl".  We're returning
> > the bdf to the cdev version of this ioctl for exactly this debugging
> > purpose when the devices are not owned, that becomes useless if we give
> > up an return -EPERM if ownership doesn't align.
> 
> Jason's suggestion makes sense for returning the case of returning dev_id
> as dev_id is local to iommufd. If there are devices in the same dev_set are
> opened by multiple users, multiple iommufd would be used. Then the
> dev_id would have overlap. e.g. a dev_set has three devices. Device A and
> B are opened by the current user as cdev, dev_id #1 and #2 are generated.
> While device C opened by another user as cdev, dev_id #n is generated for
> it. If dev_id #n happens to be #1, then user gets two info entries that have
> the same dev_id.
> 

In Alex's proposal you'll set a invalid dev_id for device C so the user can
still get the info for diagnostic purpose instead of seeing an -EPERM error.

btw I found an open about fd pass scheme which may affect the choice here.

In concept even with cdev we still expect the userspace to maintain the
group knowledge so it won't inadvertently attempt to assign devices in
the same group to different IOAS's. It also needs such knowledge when
constructing guest topology.

with fd passed in Qemu has no way to associate the fd to a group.

We could extend bind_iommufd to return the group id or introduce a
new ioctl to query it per dev_id.

Once that is in place looks we don't need a new _INFO ioctl?

Thanks
Kevin
Alex Williamson March 29, 2023, 3:49 p.m. UTC | #20
On Wed, 29 Mar 2023 09:41:26 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Wednesday, March 29, 2023 11:14 AM
> >   
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Wednesday, March 29, 2023 12:00 AM
> > >
> > >
> > > Personally I don't like the suggestion to fail with -EPERM if the user
> > > doesn't own all the affected devices.  This isn't a "probe if I can do
> > > a reset" ioctl, it's a "provide information about the devices affected
> > > by a reset to know how to call the hot-reset ioctl".  We're returning
> > > the bdf to the cdev version of this ioctl for exactly this debugging
> > > purpose when the devices are not owned, that becomes useless if we give
> > > up an return -EPERM if ownership doesn't align.  
> > 
> > Jason's suggestion makes sense for returning the case of returning dev_id
> > as dev_id is local to iommufd. If there are devices in the same dev_set are
> > opened by multiple users, multiple iommufd would be used. Then the
> > dev_id would have overlap. e.g. a dev_set has three devices. Device A and
> > B are opened by the current user as cdev, dev_id #1 and #2 are generated.
> > While device C opened by another user as cdev, dev_id #n is generated for
> > it. If dev_id #n happens to be #1, then user gets two info entries that have
> > the same dev_id.
> >   
> 
> In Alex's proposal you'll set a invalid dev_id for device C so the user can
> still get the info for diagnostic purpose instead of seeing an -EPERM error.

Yes, we shouldn't be reporting dev_ids outside of the user's iommufd
context.

> btw I found an open about fd pass scheme which may affect the choice here.
> 
> In concept even with cdev we still expect the userspace to maintain the
> group knowledge so it won't inadvertently attempt to assign devices in
> the same group to different IOAS's. It also needs such knowledge when
> constructing guest topology.
> 
> with fd passed in Qemu has no way to associate the fd to a group.

Hmm, QEMU tries to get the group for the device address space in the
guest, so finding an existing group with a different address space
indeed allows QEMU to know of this conflict since the group is the
fundamental unit IOMMU context in the legacy vfio model.

> We could extend bind_iommufd to return the group id or introduce a
> new ioctl to query it per dev_id.

That would be ironic to go to all this trouble to remove groups from
the API only to have them show up here.  But with a cdev interface,
don't we break that model of conflating isolation and address-ability?

For example, devices within a group cannot be bound to separate
iommufds due to lack of isolation, which is handled via DMA ownership,
but barring DMA aliasing issues, due to conventional PCI buses or
quirks, cdev could allow devices within the same group to be managed by
separate IOAS's.  So the group information really isn't enough for
userspace to infer address space restrictions with cdev anyway.

Therefore aren't we expecting this to be denied at attach_ioas() and
QEMU shouldn't be making these sorts of assumptions for cdev anyway?
Thanks,

Alex
Jason Gunthorpe March 29, 2023, 3:50 p.m. UTC | #21
On Wed, Mar 29, 2023 at 09:41:26AM +0000, Tian, Kevin wrote:

> We could extend bind_iommufd to return the group id or introduce a
> new ioctl to query it per dev_id.

> Once that is in place looks we don't need a new _INFO ioctl?

The iommu_group and the reset group are different things

The issue is processing the BDF strings, not the group ID.

Probably we should have some way for iommufd to report the group_id
from the dev_id?

Jason
Jason Gunthorpe March 29, 2023, 3:57 p.m. UTC | #22
On Wed, Mar 29, 2023 at 09:49:44AM -0600, Alex Williamson wrote:
 
> > We could extend bind_iommufd to return the group id or introduce a
> > new ioctl to query it per dev_id.
> 
> That would be ironic to go to all this trouble to remove groups from
> the API only to have them show up here.

Groups always had to be part of the API for advanced cases like qemu -
the point was to make them a small side bit of information not front
and center in control of everything.

> For example, devices within a group cannot be bound to separate
> iommufds due to lack of isolation, which is handled via DMA ownership,
> but barring DMA aliasing issues, due to conventional PCI buses or
> quirks, cdev could allow devices within the same group to be managed by
> separate IOAS's.  

Maybe some future kernel could do this, the API allows it at least..

> So the group information really isn't enough for
> userspace to infer address space restrictions with cdev anyway.
> 
> Therefore aren't we expecting this to be denied at attach_ioas() and
> QEMU shouldn't be making these sorts of assumptions for cdev anyway?

I guess we could make an API specifically to report same-iommu_domina
information?

I was assuming qemu would use the group for now as I don't see a
likely future when we would relax that restriction.. So I was keeping
a "add it when we need it" attitude here.

Jason
Tian, Kevin March 30, 2023, 1:10 a.m. UTC | #23
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, March 29, 2023 11:50 PM
> 
> On Wed, Mar 29, 2023 at 09:41:26AM +0000, Tian, Kevin wrote:
> 
> > We could extend bind_iommufd to return the group id or introduce a
> > new ioctl to query it per dev_id.
> 
> > Once that is in place looks we don't need a new _INFO ioctl?
> 
> The iommu_group and the reset group are different things
> 
> The issue is processing the BDF strings, not the group ID.
> 
> Probably we should have some way for iommufd to report the group_id
> from the dev_id?
> 

Yes, that is my thought. Though iommu_group and reset group are
different things we could still leverage existing _INFO ioctl once there
is a way to associated dev_id to group_id.
Tian, Kevin March 30, 2023, 1:17 a.m. UTC | #24
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, March 29, 2023 11:58 PM
> 
> On Wed, Mar 29, 2023 at 09:49:44AM -0600, Alex Williamson wrote:
> 
> > > We could extend bind_iommufd to return the group id or introduce a
> > > new ioctl to query it per dev_id.
> >
> > That would be ironic to go to all this trouble to remove groups from
> > the API only to have them show up here.
> 
> Groups always had to be part of the API for advanced cases like qemu -
> the point was to make them a small side bit of information not front
> and center in control of everything.

Agree.

> 
> > For example, devices within a group cannot be bound to separate
> > iommufds due to lack of isolation, which is handled via DMA ownership,
> > but barring DMA aliasing issues, due to conventional PCI buses or
> > quirks, cdev could allow devices within the same group to be managed by
> > separate IOAS's.
> 
> Maybe some future kernel could do this, the API allows it at least..
> 
> > So the group information really isn't enough for
> > userspace to infer address space restrictions with cdev anyway.
> >
> > Therefore aren't we expecting this to be denied at attach_ioas() and
> > QEMU shouldn't be making these sorts of assumptions for cdev anyway?
> 
> I guess we could make an API specifically to report same-iommu_domina
> information?
> 
> I was assuming qemu would use the group for now as I don't see a
> likely future when we would relax that restriction.. So I was keeping
> a "add it when we need it" attitude here.
> 

IIRC we discussed this subgroup concept in the thread of reviewing my
high level design proposal 2yrs ago. The consensus at the moment was
that subgroup is architecturally allowed w/o DMA aliasing issues but
we're yet to see a real demand of relaxing current group restriction to
support it. Also with time moving newer platforms should have less
multi-devices group so the need of subgroup is further decreased.

So I'm also inclined to laying the existing group restriction with cdev
for now.

Then can we make a decision how this group_id might be reported?

In nesting series we'll have a GET_INFO ioctl per dev_id. It could be
extended to report group_id too.

Or alternatively just return it in BIND_IOMMUFD together with dev_id.
Tian, Kevin March 30, 2023, 1:33 a.m. UTC | #25
> From: Tian, Kevin <kevin.tian@intel.com>
> Sent: Thursday, March 30, 2023 9:10 AM
> 
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, March 29, 2023 11:50 PM
> >
> > On Wed, Mar 29, 2023 at 09:41:26AM +0000, Tian, Kevin wrote:
> >
> > > We could extend bind_iommufd to return the group id or introduce a
> > > new ioctl to query it per dev_id.
> >
> > > Once that is in place looks we don't need a new _INFO ioctl?
> >
> > The iommu_group and the reset group are different things
> >
> > The issue is processing the BDF strings, not the group ID.
> >
> > Probably we should have some way for iommufd to report the group_id
> > from the dev_id?
> >
> 
> Yes, that is my thought. Though iommu_group and reset group are
> different things we could still leverage existing _INFO ioctl once there
> is a way to associated dev_id to group_id.

Please ignore this comment. Yes they are different things so even if
a dev_id is in a group_id reported on a reset BDF string it doesn't mean
this dev_id is in the reset group.

Qemu can know that all affected devices are either owned by itself or
not used by other processes if dev_id's opened by itself can be
associated to all group_id's reported in the BDF strings. But it still lacks
of information to tell the reset dependency within those opened devices
within Qemu.

So we do need a new _INFO ioctl for cdev. :/
Yi Liu March 30, 2023, 12:48 p.m. UTC | #26
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Wednesday, March 29, 2023 11:50 PM
> 
> On Wed, 29 Mar 2023 09:41:26 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > Sent: Wednesday, March 29, 2023 11:14 AM
> > >
> > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > Sent: Wednesday, March 29, 2023 12:00 AM
> > > >
> > > >
> > > > Personally I don't like the suggestion to fail with -EPERM if the user
> > > > doesn't own all the affected devices.  This isn't a "probe if I can do
> > > > a reset" ioctl, it's a "provide information about the devices affected
> > > > by a reset to know how to call the hot-reset ioctl".  We're returning
> > > > the bdf to the cdev version of this ioctl for exactly this debugging
> > > > purpose when the devices are not owned, that becomes useless if we give
> > > > up an return -EPERM if ownership doesn't align.
> > >
> > > Jason's suggestion makes sense for returning the case of returning dev_id
> > > as dev_id is local to iommufd. If there are devices in the same dev_set are
> > > opened by multiple users, multiple iommufd would be used. Then the
> > > dev_id would have overlap. e.g. a dev_set has three devices. Device A and
> > > B are opened by the current user as cdev, dev_id #1 and #2 are generated.
> > > While device C opened by another user as cdev, dev_id #n is generated for
> > > it. If dev_id #n happens to be #1, then user gets two info entries that have
> > > the same dev_id.
> > >
> >
> > In Alex's proposal you'll set a invalid dev_id for device C so the user can
> > still get the info for diagnostic purpose instead of seeing an -EPERM error.
> 
> Yes, we shouldn't be reporting dev_ids outside of the user's iommufd
> context.

Following Alex's suggestion, here are two commits to extend existing _INFO
to report dev_id.

From ad5c81366813c5effd707a0b5f5e797f5fdb3115 Mon Sep 17 00:00:00 2001
From: Yi Liu <yi.l.liu@intel.com>
Date: Thu, 30 Mar 2023 05:29:36 -0700
Subject: [PATCH] vfio: Mark cdev usage in vfio_device

There are users that needs to check if vfio_device is opened as cdev.
e.g. vfio-pci.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/vfio/device_cdev.c |  2 ++
 include/linux/vfio.h       | 14 ++++++++++++++
 2 files changed, 16 insertions(+)

diff --git a/drivers/vfio/device_cdev.c b/drivers/vfio/device_cdev.c
index b5de997bff6d..56f3bbe34680 100644
--- a/drivers/vfio/device_cdev.c
+++ b/drivers/vfio/device_cdev.c
@@ -66,6 +66,7 @@ void vfio_device_cdev_close(struct vfio_device_file *df)
 		return;
 
 	mutex_lock(&device->dev_set->lock);
+	device->cdev_opened = false;
 	vfio_device_close(df);
 	vfio_device_put_kvm(device);
 	if (df->iommufd)
@@ -180,6 +181,7 @@ long vfio_device_ioctl_bind_iommufd(struct vfio_device_file *df,
 	 * read/write/mmap
 	 */
 	smp_store_release(&df->access_granted, true);
+	device->cdev_opened = true;
 	mutex_unlock(&device->dev_set->lock);
 
 	return 0;
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 1367605d617c..86efc1640940 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -58,6 +58,7 @@ struct vfio_device {
 	struct device device;	/* device.kref covers object life circle */
 #if IS_ENABLED(CONFIG_VFIO_DEVICE_CDEV)
 	struct cdev cdev;
+	bool cdev_opened;
 #endif
 	refcount_t refcount;	/* user count on registered device*/
 	unsigned int open_count;
@@ -167,6 +168,19 @@ static inline int vfio_iommufd_physical_devid(struct vfio_device *vdev, u32 *id)
 	((void (*)(struct vfio_device *vdev)) NULL)
 #endif
 
+#if IS_ENABLED(CONFIG_VFIO_DEVICE_CDEV)
+static inline bool vfio_device_cdev_opened(struct vfio_device *device)
+{
+	lockdep_assert_held(&device->dev_set->lock);
+	return device->cdev_opened;
+}
+#else
+static inline bool vfio_device_cdev_opened(struct vfio_device *device)
+{
+	return false;
+}
+#endif
+
 /**
  * @migration_set_state: Optional callback to change the migration state for
  *         devices that support migration. It's mandatory for
Yi Liu March 30, 2023, 12:56 p.m. UTC | #27
> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Thursday, March 30, 2023 8:48 PM

>  	if (fill->cur == fill->max)
>  		return -EAGAIN; /* Something changed, try again */
> @@ -791,7 +812,24 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
>  	if (!iommu_group)
>  		return -EPERM; /* Cannot reset non-isolated devices */
> 
> -	fill->devices[fill->cur].group_id = iommu_group_id(iommu_group);
> +	/*
> +	 * If dev_id is needed, fill in the dev_id field, otherwise
> +	 * fill in group_id.
> +	 */
> +	if (fill->require_devid) {
> +		/*
> +		 * Report the devices that are opened as cdev and have
> +		 * the same iommufd with the fill->iommufd.  Otherwise,
> +		 * just fill in an IOMMUFD_INVALID_ID.
> +		 */
> +		vdev = vfio_pci_find_device_in_devset(dev_set, pdev);
> +		if (vdev && !vfio_device_cdev_opened(vdev) &&

a typo..it should be
		if (vdev && vfio_device_cdev_opened(vdev) &&

> +		    fill->iommufd == vfio_iommufd_physical_ictx(vdev))
> +			vfio_iommufd_physical_devid(vdev, &fill->devices[fill-
> >cur].dev_id);
> +		fill->devices[fill->cur].dev_id = IOMMUFD_INVALID_ID;
> +	} else {
> +		fill->devices[fill->cur].group_id = iommu_group_id(iommu_group);
> +	}
>  	fill->devices[fill->cur].segment = pci_domain_nr(pdev->bus);
>  	fill->devices[fill->cur].bus = pdev->bus->number;
>  	fill->devices[fill->cur].devfn = pdev->devfn;

Regards,
Yi Liu
Jason Gunthorpe March 30, 2023, 10:38 p.m. UTC | #28
On Thu, Mar 30, 2023 at 01:17:03AM +0000, Tian, Kevin wrote:

> In nesting series we'll have a GET_INFO ioctl per dev_id. It could be
> extended to report group_id too.

I like the GET_INFO because it would work with VDPA, is it possible?

Jason
Jason Gunthorpe March 30, 2023, 10:44 p.m. UTC | #29
On Thu, Mar 30, 2023 at 12:48:03PM +0000, Liu, Yi L wrote:
> +	/*
> +	 * If dev_id is needed, fill in the dev_id field, otherwise
> +	 * fill in group_id.
> +	 */
> +	if (fill->require_devid) {
> +		/*
> +		 * Report the devices that are opened as cdev and have
> +		 * the same iommufd with the fill->iommufd.  Otherwise,
> +		 * just fill in an IOMMUFD_INVALID_ID.
> +		 */
> +		vdev = vfio_pci_find_device_in_devset(dev_set, pdev);
> +		if (vdev && !vfio_device_cdev_opened(vdev) &&
> +		    fill->iommufd == vfio_iommufd_physical_ictx(vdev))
> +			vfio_iommufd_physical_devid(vdev, &fill->devices[fill->cur].dev_id);
> +		fill->devices[fill->cur].dev_id = IOMMUFD_INVALID_ID;

This needs an else?

I suggest to check for VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID on input
as well. I know the old kernels don't enforce this but at least we
could start enforcing it going forward so that the group path would
reject it to catch userspace bugs.

May as well fix it up to fully validate the flags

Jason
Alex Williamson March 30, 2023, 11:05 p.m. UTC | #30
On Thu, 30 Mar 2023 19:44:55 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Thu, Mar 30, 2023 at 12:48:03PM +0000, Liu, Yi L wrote:
> > +	/*
> > +	 * If dev_id is needed, fill in the dev_id field, otherwise
> > +	 * fill in group_id.
> > +	 */
> > +	if (fill->require_devid) {
> > +		/*
> > +		 * Report the devices that are opened as cdev and have
> > +		 * the same iommufd with the fill->iommufd.  Otherwise,
> > +		 * just fill in an IOMMUFD_INVALID_ID.
> > +		 */
> > +		vdev = vfio_pci_find_device_in_devset(dev_set, pdev);
> > +		if (vdev && !vfio_device_cdev_opened(vdev) &&
> > +		    fill->iommufd == vfio_iommufd_physical_ictx(vdev))
> > +			vfio_iommufd_physical_devid(vdev, &fill->devices[fill->cur].dev_id);
> > +		fill->devices[fill->cur].dev_id = IOMMUFD_INVALID_ID;  
> 
> This needs an else?
> 
> I suggest to check for VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID on input
> as well. I know the old kernels don't enforce this but at least we
> could start enforcing it going forward so that the group path would
> reject it to catch userspace bugs.
> 
> May as well fix it up to fully validate the flags

Is this under the guise of "if nobody complains it's ok, otherwise
revert" plan?  We report dev-id based on the nature of the device, not
the provided flags, so I'm not sure I follow how this protects the group
path, unless we've failed to clear the output flags on that path with
this change.  Thanks,

Alex
Jason Gunthorpe March 30, 2023, 11:18 p.m. UTC | #31
On Thu, Mar 30, 2023 at 05:05:31PM -0600, Alex Williamson wrote:
> > I suggest to check for VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID on input
> > as well. I know the old kernels don't enforce this but at least we
> > could start enforcing it going forward so that the group path would
> > reject it to catch userspace bugs.
> > 
> > May as well fix it up to fully validate the flags
> 
> Is this under the guise of "if nobody complains it's ok, otherwise
> revert" plan? 

Yah, assuming people don't pass in uninited structs.

Jason
diff mbox series

Patch

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 19f5b075d70a..45edf4e9b98b 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1181,6 +1181,102 @@  static int vfio_pci_ioctl_reset(struct vfio_pci_core_device *vdev,
 	return ret;
 }
 
+static struct pci_dev *
+vfio_pci_dev_set_resettable(struct vfio_device_set *dev_set);
+
+static int vfio_pci_ioctl_get_pci_hot_reset_group_info(
+	struct vfio_pci_core_device *vdev,
+	struct vfio_pci_hot_reset_group_info __user *arg)
+{
+	unsigned long minsz =
+		offsetofend(struct vfio_pci_hot_reset_group_info, count);
+	struct vfio_pci_hot_reset_group_info hdr;
+	struct iommufd_ctx *iommufd, *cur_iommufd;
+	u32 count = 0, index = 0, *devices = NULL;
+	struct vfio_pci_core_device *cur;
+	bool slot = false;
+	int ret = 0;
+
+	if (copy_from_user(&hdr, arg, minsz))
+		return -EFAULT;
+
+	if (hdr.argsz < minsz)
+		return -EINVAL;
+
+	hdr.flags = 0;
+
+	/* Can we do a slot or bus reset or neither? */
+	if (!pci_probe_reset_slot(vdev->pdev->slot))
+		slot = true;
+	else if (pci_probe_reset_bus(vdev->pdev->bus))
+		return -ENODEV;
+
+	mutex_lock(&vdev->vdev.dev_set->lock);
+	if (!vfio_pci_dev_set_resettable(vdev->vdev.dev_set)) {
+		ret = -EPERM;
+		goto out_unlock;
+	}
+
+	iommufd = vfio_iommufd_physical_ictx(&vdev->vdev);
+	if (!iommufd) {
+		ret = -EPERM;
+		goto out_unlock;
+	}
+
+	/* How many devices are affected? */
+	ret = vfio_pci_for_each_slot_or_bus(vdev->pdev, vfio_pci_count_devs,
+					    &count, slot);
+	if (ret)
+		goto out_unlock;
+
+	WARN_ON(!count); /* Should always be at least one */
+
+	/*
+	 * If there's enough space, fill it now, otherwise return -ENOSPC and
+	 * the number of devices affected.
+	 */
+	if (hdr.argsz < sizeof(hdr) + (count * sizeof(*devices))) {
+		ret = -ENOSPC;
+		hdr.count = count;
+		goto reset_info_exit;
+	}
+
+	devices = kcalloc(count, sizeof(*devices), GFP_KERNEL);
+	if (!devices) {
+		ret = -ENOMEM;
+		goto reset_info_exit;
+	}
+
+	list_for_each_entry(cur, &vdev->vdev.dev_set->device_list, vdev.dev_set_list) {
+		cur_iommufd = vfio_iommufd_physical_ictx(&cur->vdev);
+		if (cur->vdev.open_count) {
+			if (cur_iommufd != iommufd) {
+				ret = -EPERM;
+				break;
+			}
+			ret = vfio_iommufd_physical_devid(&cur->vdev, &devices[index]);
+			if (ret)
+				break;
+			index++;
+		}
+	}
+
+reset_info_exit:
+	if (copy_to_user(arg, &hdr, minsz))
+		ret = -EFAULT;
+
+	if (!ret) {
+		if (copy_to_user(&arg->devices, devices,
+				 hdr.count * sizeof(*devices)))
+			ret = -EFAULT;
+	}
+
+	kfree(devices);
+out_unlock:
+	mutex_unlock(&vdev->vdev.dev_set->lock);
+	return ret;
+}
+
 static int vfio_pci_ioctl_get_pci_hot_reset_info(
 	struct vfio_pci_core_device *vdev,
 	struct vfio_pci_hot_reset_info __user *arg)
@@ -1404,6 +1500,8 @@  long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
 		return vfio_pci_ioctl_get_irq_info(vdev, uarg);
 	case VFIO_DEVICE_GET_PCI_HOT_RESET_INFO:
 		return vfio_pci_ioctl_get_pci_hot_reset_info(vdev, uarg);
+	case VFIO_DEVICE_GET_PCI_HOT_RESET_GROUP_INFO:
+		return vfio_pci_ioctl_get_pci_hot_reset_group_info(vdev, uarg);
 	case VFIO_DEVICE_GET_REGION_INFO:
 		return vfio_pci_ioctl_get_region_info(vdev, uarg);
 	case VFIO_DEVICE_IOEVENTFD:
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 25432ef213ee..61b801dfd40b 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -669,6 +669,53 @@  struct vfio_pci_hot_reset_info {
 
 #define VFIO_DEVICE_GET_PCI_HOT_RESET_INFO	_IO(VFIO_TYPE, VFIO_BASE + 12)
 
+/**
+ * VFIO_DEVICE_GET_PCI_HOT_RESET_GROUP_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 12,
+ *						    struct vfio_pci_hot_reset_group_info)
+ *
+ * This is used in the vfio device cdev mode.  It returns the list of
+ * affected devices (represented by iommufd dev_id) when hot reset is
+ * issued on the current device with which this ioctl is invoked.  It
+ * only includes the devices that are opened by the current user in the
+ * time of this command is invoked.  This list may change when user opens
+ * new device or close opened device, hence user should re-invoke it
+ * after open/close devices.  This command has no guarantee on the result
+ * of VFIO_DEVICE_PCI_HOT_RESET since the not-opened affected device can
+ * be by other users in the window between the two ioctls.  If the affected
+ * devices are opened by multiple users, the VFIO_DEVICE_PCI_HOT_RESET
+ * shall fail, detail can check the description of VFIO_DEVICE_PCI_HOT_RESET.
+ *
+ * For the users that open vfio group/container, this ioctl is useless as
+ * they have no idea about the iommufd dev_id returned by this ioctl.  For
+ * the users of the traditional mode vfio group/container, this ioctl will
+ * fail as this mode does not use iommufd hence no dev_id to report back.
+ * For the users of the iommufd compat mode vfio group/container, this ioctl
+ * would succeed as this mode uses iommufd as container fd.  But such users
+ * still have no idea about the iommufd dev_id as the dev_id is only stored
+ * in kernel in this mode.  For the users of the vfio group/container, the
+ * VFIO_DEVICE_GET_PCI_HOT_RESET_INFO should be used to know the hot reset
+ * affected devices.
+ *
+ * Return: 0 on success, -errno on failure:
+ *	-enospc = insufficient buffer;
+ *	-enodev = unsupported for device;
+ *	-eperm = no permission for device, this error comes:
+ *		 - when there are affected devices that are opened but
+ *		   not bound to the same iommufd with the current device
+ *		   with which this ioctl is invoked,
+ *		 - there are affected devices that are not bound to vfio
+ *		   driver yet.
+ *		 - no valid iommufd is bound (e.g. noiommu mode)
+ */
+struct vfio_pci_hot_reset_group_info {
+	__u32	argsz;
+	__u32	flags;
+	__u32	count;
+	__u32	devices[];
+};
+
+#define VFIO_DEVICE_GET_PCI_HOT_RESET_GROUP_INFO	_IO(VFIO_TYPE, VFIO_BASE + 18)
+
 /**
  * VFIO_DEVICE_PCI_HOT_RESET - _IOW(VFIO_TYPE, VFIO_BASE + 13,
  *				    struct vfio_pci_hot_reset)