[v7,17/22] vfio-pci/zdev: add open/close device hooks

Message ID	20220513191509.272897-18-mjrosato@linux.ibm.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <kvm-owner@kernel.org> From: Matthew Rosato <mjrosato@linux.ibm.com> To: linux-s390@vger.kernel.org Cc: alex.williamson@redhat.com, cohuck@redhat.com, schnelle@linux.ibm.com, farman@linux.ibm.com, pmorel@linux.ibm.com, borntraeger@linux.ibm.com, hca@linux.ibm.com, gor@linux.ibm.com, gerald.schaefer@linux.ibm.com, agordeev@linux.ibm.com, svens@linux.ibm.com, frankja@linux.ibm.com, david@redhat.com, imbrenda@linux.ibm.com, vneethv@linux.ibm.com, oberpar@linux.ibm.com, freude@linux.ibm.com, thuth@redhat.com, pasic@linux.ibm.com, pbonzini@redhat.com, corbet@lwn.net, jgg@nvidia.com, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org Subject: [PATCH v7 17/22] vfio-pci/zdev: add open/close device hooks Date: Fri, 13 May 2022 15:15:04 -0400 Message-Id: <20220513191509.272897-18-mjrosato@linux.ibm.com> In-Reply-To: <20220513191509.272897-1-mjrosato@linux.ibm.com> References: <20220513191509.272897-1-mjrosato@linux.ibm.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	KVM: s390: enable zPCI for interpretive execution \| expand [v7,00/22] KVM: s390: enable zPCI for interpretive execution [v7,01/22] s390/sclp: detect the zPCI load/store interpretation facility [v7,02/22] s390/sclp: detect the AISII facility [v7,03/22] s390/sclp: detect the AENI facility [v7,04/22] s390/sclp: detect the AISI facility [v7,05/22] s390/airq: pass more TPI info to airq handlers [v7,06/22] s390/airq: allow for airq structure that uses an input vector [v7,07/22] s390/pci: externalize the SIC operation controls and routine [v7,08/22] s390/pci: stash associated GISA designation [v7,09/22] s390/pci: stash dtsm and maxstbl [v7,10/22] vfio/pci: introduce CONFIG_VFIO_PCI_ZDEV_KVM [v7,11/22] KVM: s390: pci: add basic kvm_zdev structure [v7,12/22] KVM: s390: pci: do initial setup for AEN interpretation [v7,13/22] KVM: s390: pci: enable host forwarding of Adapter Event Notifications [v7,14/22] KVM: s390: mechanism to enable guest zPCI Interpretation [v7,15/22] KVM: s390: pci: provide routines for enabling/disabling interrupt forwarding [v7,16/22] KVM: s390: pci: add routines to start/stop interpretive execution [v7,17/22] vfio-pci/zdev: add open/close device hooks [v7,18/22] vfio-pci/zdev: add function handle to clp base capability [v7,19/22] vfio-pci/zdev: different maxstbl for interpreted devices [v7,20/22] KVM: s390: add KVM_S390_ZPCI_OP to manage guest zPCI devices [v7,21/22] KVM: s390: introduce CPU feature for zPCI Interpretation [v7,22/22] MAINTAINERS: additional files related kvm s390 pci passthrough

Matthew Rosato May 13, 2022, 7:15 p.m. UTC

During vfio-pci open_device, register a notifier for the zPCI device to
catch KVM registration.  This is needed in order to pass a special
indicator (GISA) to firmware to allow zPCI interpretation facilities to be
used for only the specific KVM associated with the vfio-pci device.
During vfio-pci close_device, unregister the notifier.

Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
---
 arch/s390/include/asm/pci.h      |  2 ++
 drivers/vfio/pci/vfio_pci_core.c |  2 ++
 drivers/vfio/pci/vfio_pci_zdev.c | 54 ++++++++++++++++++++++++++++++++
 include/linux/vfio_pci_core.h    | 10 ++++++
 4 files changed, 68 insertions(+)

Jason Gunthorpe May 16, 2022, 5:27 p.m. UTC | #1

On Fri, May 13, 2022 at 03:15:04PM -0400, Matthew Rosato wrote:
> @@ -136,3 +137,56 @@ int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
>  
>  	return ret;
>  }
> +
> +static int vfio_pci_zdev_group_notifier(struct notifier_block *nb,
> +					unsigned long action, void *data)
> +{
> +	struct zpci_dev *zdev = container_of(nb, struct zpci_dev, nb);
> +
> +	if (action == VFIO_GROUP_NOTIFY_SET_KVM) {
> +		if (!zdev)
> +			return NOTIFY_DONE;
> +
> +		if (data) {
> +			if (kvm_s390_pci_register_kvm(zdev, (struct kvm *)data))
> +				return NOTIFY_BAD;

The error codes are all ignored for this notifier chains, so this
seems like a problem.

> +		} else {
> +			if (kvm_s390_pci_unregister_kvm(zdev))
> +				return NOTIFY_BAD;

unregister really shouldn't fail.


> +		}
> +
> +	}
> +
> +	return NOTIFY_OK;
> +}
> +
> +void vfio_pci_zdev_open(struct vfio_pci_core_device *vdev)
> +{
> +	unsigned long events = VFIO_GROUP_NOTIFY_SET_KVM;
> +	struct zpci_dev *zdev = to_zpci(vdev->pdev);
> +
> +	if (!zdev)
> +		return;
> +
> +	zdev->nb.notifier_call = vfio_pci_zdev_group_notifier;
> +
> +	vfio_register_notifier(vdev->vdev.dev, VFIO_GROUP_NOTIFY,
> +			       &events, &zdev->nb);

Normally you'd want to do what is kvm_s390_pci_register_kvm() here,
where a failure can be propogated but then you have a race condition
with the kvm.

Blech, maybe it is time to just fix this race condition permanently,
what do you think? (I didn't even compile it)

diff --git a/drivers/gpu/drm/i915/gvt/gvt.h b/drivers/gpu/drm/i915/gvt/gvt.h
index 2af4c83e733c6c..633acfcf76bf23 100644
--- a/drivers/gpu/drm/i915/gvt/gvt.h
+++ b/drivers/gpu/drm/i915/gvt/gvt.h
@@ -227,8 +227,6 @@ struct intel_vgpu {
 	struct mutex cache_lock;
 
 	struct notifier_block iommu_notifier;
-	struct notifier_block group_notifier;
-	struct kvm *kvm;
 	struct work_struct release_work;
 	atomic_t released;
 
diff --git a/drivers/gpu/drm/i915/gvt/kvmgt.c b/drivers/gpu/drm/i915/gvt/kvmgt.c
index 7655ffa97d5116..655d47c65470d5 100644
--- a/drivers/gpu/drm/i915/gvt/kvmgt.c
+++ b/drivers/gpu/drm/i915/gvt/kvmgt.c
@@ -761,23 +761,6 @@ static int intel_vgpu_iommu_notifier(struct notifier_block *nb,
 	return NOTIFY_OK;
 }
 
-static int intel_vgpu_group_notifier(struct notifier_block *nb,
-				     unsigned long action, void *data)
-{
-	struct intel_vgpu *vgpu =
-		container_of(nb, struct intel_vgpu, group_notifier);
-
-	/* the only action we care about */
-	if (action == VFIO_GROUP_NOTIFY_SET_KVM) {
-		vgpu->kvm = data;
-
-		if (!data)
-			schedule_work(&vgpu->release_work);
-	}
-
-	return NOTIFY_OK;
-}
-
 static bool __kvmgt_vgpu_exist(struct intel_vgpu *vgpu)
 {
 	struct intel_vgpu *itr;
@@ -789,7 +772,7 @@ static bool __kvmgt_vgpu_exist(struct intel_vgpu *vgpu)
 		if (!itr->attached)
 			continue;
 
-		if (vgpu->kvm == itr->kvm) {
+		if (vgpu->vfio_device.kvm == itr->vfio_device.kvm) {
 			ret = true;
 			goto out;
 		}
@@ -806,7 +789,6 @@ static int intel_vgpu_open_device(struct vfio_device *vfio_dev)
 	int ret;
 
 	vgpu->iommu_notifier.notifier_call = intel_vgpu_iommu_notifier;
-	vgpu->group_notifier.notifier_call = intel_vgpu_group_notifier;
 
 	events = VFIO_IOMMU_NOTIFY_DMA_UNMAP;
 	ret = vfio_register_notifier(vfio_dev, VFIO_IOMMU_NOTIFY, &events,
@@ -817,38 +799,30 @@ static int intel_vgpu_open_device(struct vfio_device *vfio_dev)
 		goto out;
 	}
 
-	events = VFIO_GROUP_NOTIFY_SET_KVM;
-	ret = vfio_register_notifier(vfio_dev, VFIO_GROUP_NOTIFY, &events,
-				     &vgpu->group_notifier);
-	if (ret != 0) {
-		gvt_vgpu_err("vfio_register_notifier for group failed: %d\n",
-			ret);
-		goto undo_iommu;
-	}
-
 	ret = -EEXIST;
 	if (vgpu->attached)
-		goto undo_register;
+		goto undo_iommu;
 
 	ret = -ESRCH;
-	if (!vgpu->kvm || vgpu->kvm->mm != current->mm) {
+	if (!vgpu->vfio_device.kvm ||
+	    vgpu->vfio_device.kvm->mm != current->mm) {
 		gvt_vgpu_err("KVM is required to use Intel vGPU\n");
-		goto undo_register;
+		goto undo_iommu;
 	}
 
 	ret = -EEXIST;
 	if (__kvmgt_vgpu_exist(vgpu))
-		goto undo_register;
+		goto undo_iommu;
 
 	vgpu->attached = true;
-	kvm_get_kvm(vgpu->kvm);
 
 	kvmgt_protect_table_init(vgpu);
 	gvt_cache_init(vgpu);
 
 	vgpu->track_node.track_write = kvmgt_page_track_write;
 	vgpu->track_node.track_flush_slot = kvmgt_page_track_flush_slot;
-	kvm_page_track_register_notifier(vgpu->kvm, &vgpu->track_node);
+	kvm_page_track_register_notifier(vgpu->vfio_device.kvm,
+					 &vgpu->track_node);
 
 	debugfs_create_ulong(KVMGT_DEBUGFS_FILENAME, 0444, vgpu->debugfs,
 			     &vgpu->nr_cache_entries);
@@ -858,10 +832,6 @@ static int intel_vgpu_open_device(struct vfio_device *vfio_dev)
 	atomic_set(&vgpu->released, 0);
 	return 0;
 
-undo_register:
-	vfio_unregister_notifier(vfio_dev, VFIO_GROUP_NOTIFY,
-				 &vgpu->group_notifier);
-
 undo_iommu:
 	vfio_unregister_notifier(vfio_dev, VFIO_IOMMU_NOTIFY,
 				 &vgpu->iommu_notifier);
@@ -898,21 +868,15 @@ static void __intel_vgpu_release(struct intel_vgpu *vgpu)
 	drm_WARN(&i915->drm, ret,
 		 "vfio_unregister_notifier for iommu failed: %d\n", ret);
 
-	ret = vfio_unregister_notifier(&vgpu->vfio_device, VFIO_GROUP_NOTIFY,
-				       &vgpu->group_notifier);
-	drm_WARN(&i915->drm, ret,
-		 "vfio_unregister_notifier for group failed: %d\n", ret);
-
 	debugfs_remove(debugfs_lookup(KVMGT_DEBUGFS_FILENAME, vgpu->debugfs));
 
-	kvm_page_track_unregister_notifier(vgpu->kvm, &vgpu->track_node);
-	kvm_put_kvm(vgpu->kvm);
+	kvm_page_track_unregister_notifier(vgpu->vfio_device.kvm,
+					   &vgpu->track_node);
 	kvmgt_protect_table_destroy(vgpu);
 	gvt_cache_destroy(vgpu);
 
 	intel_vgpu_release_msi_eventfd_ctx(vgpu);
 
-	vgpu->kvm = NULL;
 	vgpu->attached = false;
 }
 
@@ -1649,6 +1613,7 @@ static const struct attribute_group *intel_vgpu_groups[] = {
 };
 
 static const struct vfio_device_ops intel_vgpu_dev_ops = {
+	.flags		= VFIO_DEVICE_NEEDS_KVM,
 	.open_device	= intel_vgpu_open_device,
 	.close_device	= intel_vgpu_close_device,
 	.read		= intel_vgpu_read,
@@ -1713,7 +1678,7 @@ static struct mdev_driver intel_vgpu_mdev_driver = {
 
 int intel_gvt_page_track_add(struct intel_vgpu *info, u64 gfn)
 {
-	struct kvm *kvm = info->kvm;
+	struct kvm *kvm = info->vfio_device.kvm;
 	struct kvm_memory_slot *slot;
 	int idx;
 
@@ -1743,7 +1708,7 @@ int intel_gvt_page_track_add(struct intel_vgpu *info, u64 gfn)
 
 int intel_gvt_page_track_remove(struct intel_vgpu *info, u64 gfn)
 {
-	struct kvm *kvm = info->kvm;
+	struct kvm *kvm = info->vfio_device.kvm;
 	struct kvm_memory_slot *slot;
 	int idx;
 
diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
index e8914024f5b1af..f378f809d8a00d 100644
--- a/drivers/s390/crypto/vfio_ap_ops.c
+++ b/drivers/s390/crypto/vfio_ap_ops.c
@@ -1284,25 +1284,6 @@ static void vfio_ap_mdev_unset_kvm(struct ap_matrix_mdev *matrix_mdev)
 	}
 }
 
-static int vfio_ap_mdev_group_notifier(struct notifier_block *nb,
-				       unsigned long action, void *data)
-{
-	int notify_rc = NOTIFY_OK;
-	struct ap_matrix_mdev *matrix_mdev;
-
-	if (action != VFIO_GROUP_NOTIFY_SET_KVM)
-		return NOTIFY_OK;
-
-	matrix_mdev = container_of(nb, struct ap_matrix_mdev, group_notifier);
-
-	if (!data)
-		vfio_ap_mdev_unset_kvm(matrix_mdev);
-	else if (vfio_ap_mdev_set_kvm(matrix_mdev, data))
-		notify_rc = NOTIFY_DONE;
-
-	return notify_rc;
-}
-
 static struct vfio_ap_queue *vfio_ap_find_queue(int apqn)
 {
 	struct device *dev;
@@ -1402,11 +1383,7 @@ static int vfio_ap_mdev_open_device(struct vfio_device *vdev)
 	unsigned long events;
 	int ret;
 
-	matrix_mdev->group_notifier.notifier_call = vfio_ap_mdev_group_notifier;
-	events = VFIO_GROUP_NOTIFY_SET_KVM;
-
-	ret = vfio_register_notifier(vdev, VFIO_GROUP_NOTIFY, &events,
-				     &matrix_mdev->group_notifier);
+	ret = vfio_ap_mdev_set_kvm(matrix_mdev, vdev->vdev.kvm);
 	if (ret)
 		return ret;
 
@@ -1415,12 +1392,11 @@ static int vfio_ap_mdev_open_device(struct vfio_device *vdev)
 	ret = vfio_register_notifier(vdev, VFIO_IOMMU_NOTIFY, &events,
 				     &matrix_mdev->iommu_notifier);
 	if (ret)
-		goto out_unregister_group;
+		goto err_kvm;
 	return 0;
 
-out_unregister_group:
-	vfio_unregister_notifier(vdev, VFIO_GROUP_NOTIFY,
-				 &matrix_mdev->group_notifier);
+err_kvm:
+	vfio_ap_mdev_unset_kvm(matrix_mdev);
 	return ret;
 }
 
@@ -1431,8 +1407,6 @@ static void vfio_ap_mdev_close_device(struct vfio_device *vdev)
 
 	vfio_unregister_notifier(vdev, VFIO_IOMMU_NOTIFY,
 				 &matrix_mdev->iommu_notifier);
-	vfio_unregister_notifier(vdev, VFIO_GROUP_NOTIFY,
-				 &matrix_mdev->group_notifier);
 	vfio_ap_mdev_unset_kvm(matrix_mdev);
 }
 
@@ -1481,6 +1455,7 @@ static ssize_t vfio_ap_mdev_ioctl(struct vfio_device *vdev,
 }
 
 static const struct vfio_device_ops vfio_ap_matrix_dev_ops = {
+	.flags = VFIO_DEVICE_NEEDS_KVM,
 	.open_device = vfio_ap_mdev_open_device,
 	.close_device = vfio_ap_mdev_close_device,
 	.ioctl = vfio_ap_mdev_ioctl,
diff --git a/drivers/s390/crypto/vfio_ap_private.h b/drivers/s390/crypto/vfio_ap_private.h
index 648fcaf8104abb..a26efd804d0df3 100644
--- a/drivers/s390/crypto/vfio_ap_private.h
+++ b/drivers/s390/crypto/vfio_ap_private.h
@@ -81,8 +81,6 @@ struct ap_matrix {
  * @node:	allows the ap_matrix_mdev struct to be added to a list
  * @matrix:	the adapters, usage domains and control domains assigned to the
  *		mediated matrix device.
- * @group_notifier: notifier block used for specifying callback function for
- *		    handling the VFIO_GROUP_NOTIFY_SET_KVM event
  * @iommu_notifier: notifier block used for specifying callback function for
  *		    handling the VFIO_IOMMU_NOTIFY_DMA_UNMAP even
  * @kvm:	the struct holding guest's state
@@ -94,7 +92,6 @@ struct ap_matrix_mdev {
 	struct vfio_device vdev;
 	struct list_head node;
 	struct ap_matrix matrix;
-	struct notifier_block group_notifier;
 	struct notifier_block iommu_notifier;
 	struct kvm *kvm;
 	crypto_hook pqap_hook;
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index cfd797629a21ab..1c20bb5484afde 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -10,6 +10,7 @@
  * Author: Tom Lyon, pugs@cisco.com
  */
 
+#include "linux/kvm_host.h"
 #include <linux/cdev.h>
 #include <linux/compat.h>
 #include <linux/device.h>
@@ -1097,13 +1098,25 @@ static struct file *vfio_device_open(struct vfio_device *device)
 
 	down_write(&device->group->group_rwsem);
 	ret = vfio_device_assign_container(device);
-	up_write(&device->group->group_rwsem);
-	if (ret)
+	if (ret) {
+		up_write(&device->group->group_rwsem);
 		return ERR_PTR(ret);
+	}
+
+	if (device->ops->flags & VFIO_DEVICE_NEEDS_KVM)
+	{
+		if (!device->group->kvm) {
+			up_write(&device->group->group_rwsem);
+			goto err_unassign_container;
+		}
+		device->kvm = device->group->kvm;
+		kvm_get_kvm(device->kvm);
+	}
+	up_write(&device->group->group_rwsem);
 
 	if (!try_module_get(device->dev->driver->owner)) {
 		ret = -ENODEV;
-		goto err_unassign_container;
+		goto err_put_kvm;
 	}
 
 	mutex_lock(&device->dev_set->lock);
@@ -1147,6 +1160,11 @@ static struct file *vfio_device_open(struct vfio_device *device)
 	device->open_count--;
 	mutex_unlock(&device->dev_set->lock);
 	module_put(device->dev->driver->owner);
+err_put_kvm:
+	if (device->kvm) {
+		kvm_put_kvm(device->kvm);
+		device->kvm = NULL;
+	}
 err_unassign_container:
 	vfio_device_unassign_container(device);
 	return ERR_PTR(ret);
@@ -1344,6 +1362,10 @@ static int vfio_device_fops_release(struct inode *inode, struct file *filep)
 
 	module_put(device->dev->driver->owner);
 
+	if (device->kvm) {
+		kvm_put_kvm(device->kvm);
+		device->kvm = NULL;
+	}
 	vfio_device_unassign_container(device);
 
 	vfio_device_put(device);
@@ -1748,8 +1770,8 @@ EXPORT_SYMBOL_GPL(vfio_file_enforced_coherent);
  * @file: VFIO group file
  * @kvm: KVM to link
  *
- * The kvm pointer will be forwarded to all the vfio_device's attached to the
- * VFIO file via the VFIO_GROUP_NOTIFY_SET_KVM notifier.
+ * When a VFIO device is first opened the KVM will be available in
+ * device->kvm if VFIO_DEVICE_NEEDS_KVM is set.
  */
 void vfio_file_set_kvm(struct file *file, struct kvm *kvm)
 {
@@ -1760,8 +1782,6 @@ void vfio_file_set_kvm(struct file *file, struct kvm *kvm)
 
 	down_write(&group->group_rwsem);
 	group->kvm = kvm;
-	blocking_notifier_call_chain(&group->notifier,
-				     VFIO_GROUP_NOTIFY_SET_KVM, kvm);
 	up_write(&group->group_rwsem);
 }
 EXPORT_SYMBOL_GPL(vfio_file_set_kvm);
@@ -2061,42 +2081,6 @@ static int vfio_unregister_iommu_notifier(struct vfio_group *group,
 	return ret;
 }
 
-static int vfio_register_group_notifier(struct vfio_group *group,
-					unsigned long *events,
-					struct notifier_block *nb)
-{
-	int ret;
-	bool set_kvm = false;
-
-	if (*events & VFIO_GROUP_NOTIFY_SET_KVM)
-		set_kvm = true;
-
-	/* clear known events */
-	*events &= ~VFIO_GROUP_NOTIFY_SET_KVM;
-
-	/* refuse to continue if still events remaining */
-	if (*events)
-		return -EINVAL;
-
-	ret = blocking_notifier_chain_register(&group->notifier, nb);
-	if (ret)
-		return ret;
-
-	/*
-	 * The attaching of kvm and vfio_group might already happen, so
-	 * here we replay once upon registration.
-	 */
-	if (set_kvm) {
-		down_read(&group->group_rwsem);
-		if (group->kvm)
-			blocking_notifier_call_chain(&group->notifier,
-						     VFIO_GROUP_NOTIFY_SET_KVM,
-						     group->kvm);
-		up_read(&group->group_rwsem);
-	}
-	return 0;
-}
-
 int vfio_register_notifier(struct vfio_device *device,
 			   enum vfio_notify_type type, unsigned long *events,
 			   struct notifier_block *nb)
@@ -2112,9 +2096,6 @@ int vfio_register_notifier(struct vfio_device *device,
 	case VFIO_IOMMU_NOTIFY:
 		ret = vfio_register_iommu_notifier(group, events, nb);
 		break;
-	case VFIO_GROUP_NOTIFY:
-		ret = vfio_register_group_notifier(group, events, nb);
-		break;
 	default:
 		ret = -EINVAL;
 	}
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 45b287826ce686..aaf120b9c080b7 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -36,6 +36,7 @@ struct vfio_device {
 	struct vfio_device_set *dev_set;
 	struct list_head dev_set_list;
 	unsigned int migration_flags;
+	struct kvm *kvm;
 
 	/* Members below here are private, not for driver use */
 	refcount_t refcount;
@@ -44,6 +45,10 @@ struct vfio_device {
 	struct list_head group_next;
 };
 
+enum {
+	VFIO_DEVICE_NEEDS_KVM = 1,
+};
+
 /**
  * struct vfio_device_ops - VFIO bus driver device callbacks
  *
@@ -72,6 +77,7 @@ struct vfio_device {
  */
 struct vfio_device_ops {
 	char	*name;
+	unsigned int flags;
 	int	(*open_device)(struct vfio_device *vdev);
 	void	(*close_device)(struct vfio_device *vdev);
 	ssize_t	(*read)(struct vfio_device *vdev, char __user *buf,
@@ -155,15 +161,11 @@ extern int vfio_dma_rw(struct vfio_device *device, dma_addr_t user_iova,
 /* each type has independent events */
 enum vfio_notify_type {
 	VFIO_IOMMU_NOTIFY = 0,
-	VFIO_GROUP_NOTIFY = 1,
 };
 
 /* events for VFIO_IOMMU_NOTIFY */
 #define VFIO_IOMMU_NOTIFY_DMA_UNMAP	BIT(0)
 
-/* events for VFIO_GROUP_NOTIFY */
-#define VFIO_GROUP_NOTIFY_SET_KVM	BIT(0)
-
 extern int vfio_register_notifier(struct vfio_device *device,
 				  enum vfio_notify_type type,
 				  unsigned long *required_events,

Matthew Rosato May 16, 2022, 6:30 p.m. UTC | #2

On 5/16/22 1:27 PM, Jason Gunthorpe wrote:
> On Fri, May 13, 2022 at 03:15:04PM -0400, Matthew Rosato wrote:
>> @@ -136,3 +137,56 @@ int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
>>   
>>   	return ret;
>>   }
>> +
>> +static int vfio_pci_zdev_group_notifier(struct notifier_block *nb,
>> +					unsigned long action, void *data)
>> +{
>> +	struct zpci_dev *zdev = container_of(nb, struct zpci_dev, nb);
>> +
>> +	if (action == VFIO_GROUP_NOTIFY_SET_KVM) {
>> +		if (!zdev)
>> +			return NOTIFY_DONE;
>> +
>> +		if (data) {
>> +			if (kvm_s390_pci_register_kvm(zdev, (struct kvm *)data))
>> +				return NOTIFY_BAD;
> 
> The error codes are all ignored for this notifier chains, so this
> seems like a problem.
> 
>> +		} else {
>> +			if (kvm_s390_pci_unregister_kvm(zdev))
>> +				return NOTIFY_BAD;
> 
> unregister really shouldn't fail.
> 
> 
>> +		}
>> +
>> +	}
>> +
>> +	return NOTIFY_OK;
>> +}
>> +
>> +void vfio_pci_zdev_open(struct vfio_pci_core_device *vdev)
>> +{
>> +	unsigned long events = VFIO_GROUP_NOTIFY_SET_KVM;
>> +	struct zpci_dev *zdev = to_zpci(vdev->pdev);
>> +
>> +	if (!zdev)
>> +		return;
>> +
>> +	zdev->nb.notifier_call = vfio_pci_zdev_group_notifier;
>> +
>> +	vfio_register_notifier(vdev->vdev.dev, VFIO_GROUP_NOTIFY,
>> +			       &events, &zdev->nb);
> 
> Normally you'd want to do what is kvm_s390_pci_register_kvm() here,
> where a failure can be propogated but then you have a race condition
> with the kvm.
> 
> Blech, maybe it is time to just fix this race condition permanently,
> what do you think? (I didn't even compile it)

Conceptually I think this would work for QEMU anyway (it always sets the 
kvm before we open the device).  I tried to test the idea quickly but 
couldn't get the following to apply on vfio-next or your 
vfio_group_locking -- but I understand what you're trying to do so I'll 
re-work and try it out.

@Alex can you think of any usecase/reason why we would need to be able 
to set the KVM sometime after the device was opened?  Doing something 
like below would break that, as this introduces the assumption that the 
group is associated with the KVM before the device is opened (and if 
it's not, the device open fails).


> 
> diff --git a/drivers/gpu/drm/i915/gvt/gvt.h b/drivers/gpu/drm/i915/gvt/gvt.h
> index 2af4c83e733c6c..633acfcf76bf23 100644
> --- a/drivers/gpu/drm/i915/gvt/gvt.h
> +++ b/drivers/gpu/drm/i915/gvt/gvt.h
> @@ -227,8 +227,6 @@ struct intel_vgpu {
>   	struct mutex cache_lock;
>   
>   	struct notifier_block iommu_notifier;
> -	struct notifier_block group_notifier;
> -	struct kvm *kvm;
>   	struct work_struct release_work;
>   	atomic_t released;
>   
> diff --git a/drivers/gpu/drm/i915/gvt/kvmgt.c b/drivers/gpu/drm/i915/gvt/kvmgt.c
> index 7655ffa97d5116..655d47c65470d5 100644
> --- a/drivers/gpu/drm/i915/gvt/kvmgt.c
> +++ b/drivers/gpu/drm/i915/gvt/kvmgt.c
> @@ -761,23 +761,6 @@ static int intel_vgpu_iommu_notifier(struct notifier_block *nb,
>   	return NOTIFY_OK;
>   }
>   
> -static int intel_vgpu_group_notifier(struct notifier_block *nb,
> -				     unsigned long action, void *data)
> -{
> -	struct intel_vgpu *vgpu =
> -		container_of(nb, struct intel_vgpu, group_notifier);
> -
> -	/* the only action we care about */
> -	if (action == VFIO_GROUP_NOTIFY_SET_KVM) {
> -		vgpu->kvm = data;
> -
> -		if (!data)
> -			schedule_work(&vgpu->release_work);
> -	}
> -
> -	return NOTIFY_OK;
> -}
> -
>   static bool __kvmgt_vgpu_exist(struct intel_vgpu *vgpu)
>   {
>   	struct intel_vgpu *itr;
> @@ -789,7 +772,7 @@ static bool __kvmgt_vgpu_exist(struct intel_vgpu *vgpu)
>   		if (!itr->attached)
>   			continue;
>   
> -		if (vgpu->kvm == itr->kvm) {
> +		if (vgpu->vfio_device.kvm == itr->vfio_device.kvm) {
>   			ret = true;
>   			goto out;
>   		}
> @@ -806,7 +789,6 @@ static int intel_vgpu_open_device(struct vfio_device *vfio_dev)
>   	int ret;
>   
>   	vgpu->iommu_notifier.notifier_call = intel_vgpu_iommu_notifier;
> -	vgpu->group_notifier.notifier_call = intel_vgpu_group_notifier;
>   
>   	events = VFIO_IOMMU_NOTIFY_DMA_UNMAP;
>   	ret = vfio_register_notifier(vfio_dev, VFIO_IOMMU_NOTIFY, &events,
> @@ -817,38 +799,30 @@ static int intel_vgpu_open_device(struct vfio_device *vfio_dev)
>   		goto out;
>   	}
>   
> -	events = VFIO_GROUP_NOTIFY_SET_KVM;
> -	ret = vfio_register_notifier(vfio_dev, VFIO_GROUP_NOTIFY, &events,
> -				     &vgpu->group_notifier);
> -	if (ret != 0) {
> -		gvt_vgpu_err("vfio_register_notifier for group failed: %d\n",
> -			ret);
> -		goto undo_iommu;
> -	}
> -
>   	ret = -EEXIST;
>   	if (vgpu->attached)
> -		goto undo_register;
> +		goto undo_iommu;
>   
>   	ret = -ESRCH;
> -	if (!vgpu->kvm || vgpu->kvm->mm != current->mm) {
> +	if (!vgpu->vfio_device.kvm ||
> +	    vgpu->vfio_device.kvm->mm != current->mm) {
>   		gvt_vgpu_err("KVM is required to use Intel vGPU\n");
> -		goto undo_register;
> +		goto undo_iommu;
>   	}
>   
>   	ret = -EEXIST;
>   	if (__kvmgt_vgpu_exist(vgpu))
> -		goto undo_register;
> +		goto undo_iommu;
>   
>   	vgpu->attached = true;
> -	kvm_get_kvm(vgpu->kvm);
>   
>   	kvmgt_protect_table_init(vgpu);
>   	gvt_cache_init(vgpu);
>   
>   	vgpu->track_node.track_write = kvmgt_page_track_write;
>   	vgpu->track_node.track_flush_slot = kvmgt_page_track_flush_slot;
> -	kvm_page_track_register_notifier(vgpu->kvm, &vgpu->track_node);
> +	kvm_page_track_register_notifier(vgpu->vfio_device.kvm,
> +					 &vgpu->track_node);
>   
>   	debugfs_create_ulong(KVMGT_DEBUGFS_FILENAME, 0444, vgpu->debugfs,
>   			     &vgpu->nr_cache_entries);
> @@ -858,10 +832,6 @@ static int intel_vgpu_open_device(struct vfio_device *vfio_dev)
>   	atomic_set(&vgpu->released, 0);
>   	return 0;
>   
> -undo_register:
> -	vfio_unregister_notifier(vfio_dev, VFIO_GROUP_NOTIFY,
> -				 &vgpu->group_notifier);
> -
>   undo_iommu:
>   	vfio_unregister_notifier(vfio_dev, VFIO_IOMMU_NOTIFY,
>   				 &vgpu->iommu_notifier);
> @@ -898,21 +868,15 @@ static void __intel_vgpu_release(struct intel_vgpu *vgpu)
>   	drm_WARN(&i915->drm, ret,
>   		 "vfio_unregister_notifier for iommu failed: %d\n", ret);
>   
> -	ret = vfio_unregister_notifier(&vgpu->vfio_device, VFIO_GROUP_NOTIFY,
> -				       &vgpu->group_notifier);
> -	drm_WARN(&i915->drm, ret,
> -		 "vfio_unregister_notifier for group failed: %d\n", ret);
> -
>   	debugfs_remove(debugfs_lookup(KVMGT_DEBUGFS_FILENAME, vgpu->debugfs));
>   
> -	kvm_page_track_unregister_notifier(vgpu->kvm, &vgpu->track_node);
> -	kvm_put_kvm(vgpu->kvm);
> +	kvm_page_track_unregister_notifier(vgpu->vfio_device.kvm,
> +					   &vgpu->track_node);
>   	kvmgt_protect_table_destroy(vgpu);
>   	gvt_cache_destroy(vgpu);
>   
>   	intel_vgpu_release_msi_eventfd_ctx(vgpu);
>   
> -	vgpu->kvm = NULL;
>   	vgpu->attached = false;
>   }
>   
> @@ -1649,6 +1613,7 @@ static const struct attribute_group *intel_vgpu_groups[] = {
>   };
>   
>   static const struct vfio_device_ops intel_vgpu_dev_ops = {
> +	.flags		= VFIO_DEVICE_NEEDS_KVM,
>   	.open_device	= intel_vgpu_open_device,
>   	.close_device	= intel_vgpu_close_device,
>   	.read		= intel_vgpu_read,
> @@ -1713,7 +1678,7 @@ static struct mdev_driver intel_vgpu_mdev_driver = {
>   
>   int intel_gvt_page_track_add(struct intel_vgpu *info, u64 gfn)
>   {
> -	struct kvm *kvm = info->kvm;
> +	struct kvm *kvm = info->vfio_device.kvm;
>   	struct kvm_memory_slot *slot;
>   	int idx;
>   
> @@ -1743,7 +1708,7 @@ int intel_gvt_page_track_add(struct intel_vgpu *info, u64 gfn)
>   
>   int intel_gvt_page_track_remove(struct intel_vgpu *info, u64 gfn)
>   {
> -	struct kvm *kvm = info->kvm;
> +	struct kvm *kvm = info->vfio_device.kvm;
>   	struct kvm_memory_slot *slot;
>   	int idx;
>   
> diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
> index e8914024f5b1af..f378f809d8a00d 100644
> --- a/drivers/s390/crypto/vfio_ap_ops.c
> +++ b/drivers/s390/crypto/vfio_ap_ops.c
> @@ -1284,25 +1284,6 @@ static void vfio_ap_mdev_unset_kvm(struct ap_matrix_mdev *matrix_mdev)
>   	}
>   }
>   
> -static int vfio_ap_mdev_group_notifier(struct notifier_block *nb,
> -				       unsigned long action, void *data)
> -{
> -	int notify_rc = NOTIFY_OK;
> -	struct ap_matrix_mdev *matrix_mdev;
> -
> -	if (action != VFIO_GROUP_NOTIFY_SET_KVM)
> -		return NOTIFY_OK;
> -
> -	matrix_mdev = container_of(nb, struct ap_matrix_mdev, group_notifier);
> -
> -	if (!data)
> -		vfio_ap_mdev_unset_kvm(matrix_mdev);
> -	else if (vfio_ap_mdev_set_kvm(matrix_mdev, data))
> -		notify_rc = NOTIFY_DONE;
> -
> -	return notify_rc;
> -}
> -
>   static struct vfio_ap_queue *vfio_ap_find_queue(int apqn)
>   {
>   	struct device *dev;
> @@ -1402,11 +1383,7 @@ static int vfio_ap_mdev_open_device(struct vfio_device *vdev)
>   	unsigned long events;
>   	int ret;
>   
> -	matrix_mdev->group_notifier.notifier_call = vfio_ap_mdev_group_notifier;
> -	events = VFIO_GROUP_NOTIFY_SET_KVM;
> -
> -	ret = vfio_register_notifier(vdev, VFIO_GROUP_NOTIFY, &events,
> -				     &matrix_mdev->group_notifier);
> +	ret = vfio_ap_mdev_set_kvm(matrix_mdev, vdev->vdev.kvm);
>   	if (ret)
>   		return ret;
>   
> @@ -1415,12 +1392,11 @@ static int vfio_ap_mdev_open_device(struct vfio_device *vdev)
>   	ret = vfio_register_notifier(vdev, VFIO_IOMMU_NOTIFY, &events,
>   				     &matrix_mdev->iommu_notifier);
>   	if (ret)
> -		goto out_unregister_group;
> +		goto err_kvm;
>   	return 0;
>   
> -out_unregister_group:
> -	vfio_unregister_notifier(vdev, VFIO_GROUP_NOTIFY,
> -				 &matrix_mdev->group_notifier);
> +err_kvm:
> +	vfio_ap_mdev_unset_kvm(matrix_mdev);
>   	return ret;
>   }
>   
> @@ -1431,8 +1407,6 @@ static void vfio_ap_mdev_close_device(struct vfio_device *vdev)
>   
>   	vfio_unregister_notifier(vdev, VFIO_IOMMU_NOTIFY,
>   				 &matrix_mdev->iommu_notifier);
> -	vfio_unregister_notifier(vdev, VFIO_GROUP_NOTIFY,
> -				 &matrix_mdev->group_notifier);
>   	vfio_ap_mdev_unset_kvm(matrix_mdev);
>   }
>   
> @@ -1481,6 +1455,7 @@ static ssize_t vfio_ap_mdev_ioctl(struct vfio_device *vdev,
>   }
>   
>   static const struct vfio_device_ops vfio_ap_matrix_dev_ops = {
> +	.flags = VFIO_DEVICE_NEEDS_KVM,
>   	.open_device = vfio_ap_mdev_open_device,
>   	.close_device = vfio_ap_mdev_close_device,
>   	.ioctl = vfio_ap_mdev_ioctl,
> diff --git a/drivers/s390/crypto/vfio_ap_private.h b/drivers/s390/crypto/vfio_ap_private.h
> index 648fcaf8104abb..a26efd804d0df3 100644
> --- a/drivers/s390/crypto/vfio_ap_private.h
> +++ b/drivers/s390/crypto/vfio_ap_private.h
> @@ -81,8 +81,6 @@ struct ap_matrix {
>    * @node:	allows the ap_matrix_mdev struct to be added to a list
>    * @matrix:	the adapters, usage domains and control domains assigned to the
>    *		mediated matrix device.
> - * @group_notifier: notifier block used for specifying callback function for
> - *		    handling the VFIO_GROUP_NOTIFY_SET_KVM event
>    * @iommu_notifier: notifier block used for specifying callback function for
>    *		    handling the VFIO_IOMMU_NOTIFY_DMA_UNMAP even
>    * @kvm:	the struct holding guest's state
> @@ -94,7 +92,6 @@ struct ap_matrix_mdev {
>   	struct vfio_device vdev;
>   	struct list_head node;
>   	struct ap_matrix matrix;
> -	struct notifier_block group_notifier;
>   	struct notifier_block iommu_notifier;
>   	struct kvm *kvm;
>   	crypto_hook pqap_hook;
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index cfd797629a21ab..1c20bb5484afde 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -10,6 +10,7 @@
>    * Author: Tom Lyon, pugs@cisco.com
>    */
>   
> +#include "linux/kvm_host.h"
>   #include <linux/cdev.h>
>   #include <linux/compat.h>
>   #include <linux/device.h>
> @@ -1097,13 +1098,25 @@ static struct file *vfio_device_open(struct vfio_device *device)
>   
>   	down_write(&device->group->group_rwsem);
>   	ret = vfio_device_assign_container(device);
> -	up_write(&device->group->group_rwsem);
> -	if (ret)
> +	if (ret) {
> +		up_write(&device->group->group_rwsem);
>   		return ERR_PTR(ret);
> +	}
> +
> +	if (device->ops->flags & VFIO_DEVICE_NEEDS_KVM)
> +	{
> +		if (!device->group->kvm) {
> +			up_write(&device->group->group_rwsem);
> +			goto err_unassign_container;
> +		}
> +		device->kvm = device->group->kvm;
> +		kvm_get_kvm(device->kvm);
> +	}
> +	up_write(&device->group->group_rwsem);
>   
>   	if (!try_module_get(device->dev->driver->owner)) {
>   		ret = -ENODEV;
> -		goto err_unassign_container;
> +		goto err_put_kvm;
>   	}
>   
>   	mutex_lock(&device->dev_set->lock);
> @@ -1147,6 +1160,11 @@ static struct file *vfio_device_open(struct vfio_device *device)
>   	device->open_count--;
>   	mutex_unlock(&device->dev_set->lock);
>   	module_put(device->dev->driver->owner);
> +err_put_kvm:
> +	if (device->kvm) {
> +		kvm_put_kvm(device->kvm);
> +		device->kvm = NULL;
> +	}
>   err_unassign_container:
>   	vfio_device_unassign_container(device);
>   	return ERR_PTR(ret);
> @@ -1344,6 +1362,10 @@ static int vfio_device_fops_release(struct inode *inode, struct file *filep)
>   
>   	module_put(device->dev->driver->owner);
>   
> +	if (device->kvm) {
> +		kvm_put_kvm(device->kvm);
> +		device->kvm = NULL;
> +	}
>   	vfio_device_unassign_container(device);
>   
>   	vfio_device_put(device);
> @@ -1748,8 +1770,8 @@ EXPORT_SYMBOL_GPL(vfio_file_enforced_coherent);
>    * @file: VFIO group file
>    * @kvm: KVM to link
>    *
> - * The kvm pointer will be forwarded to all the vfio_device's attached to the
> - * VFIO file via the VFIO_GROUP_NOTIFY_SET_KVM notifier.
> + * When a VFIO device is first opened the KVM will be available in
> + * device->kvm if VFIO_DEVICE_NEEDS_KVM is set.
>    */
>   void vfio_file_set_kvm(struct file *file, struct kvm *kvm)
>   {
> @@ -1760,8 +1782,6 @@ void vfio_file_set_kvm(struct file *file, struct kvm *kvm)
>   
>   	down_write(&group->group_rwsem);
>   	group->kvm = kvm;
> -	blocking_notifier_call_chain(&group->notifier,
> -				     VFIO_GROUP_NOTIFY_SET_KVM, kvm);
>   	up_write(&group->group_rwsem);
>   }
>   EXPORT_SYMBOL_GPL(vfio_file_set_kvm);
> @@ -2061,42 +2081,6 @@ static int vfio_unregister_iommu_notifier(struct vfio_group *group,
>   	return ret;
>   }
>   
> -static int vfio_register_group_notifier(struct vfio_group *group,
> -					unsigned long *events,
> -					struct notifier_block *nb)
> -{
> -	int ret;
> -	bool set_kvm = false;
> -
> -	if (*events & VFIO_GROUP_NOTIFY_SET_KVM)
> -		set_kvm = true;
> -
> -	/* clear known events */
> -	*events &= ~VFIO_GROUP_NOTIFY_SET_KVM;
> -
> -	/* refuse to continue if still events remaining */
> -	if (*events)
> -		return -EINVAL;
> -
> -	ret = blocking_notifier_chain_register(&group->notifier, nb);
> -	if (ret)
> -		return ret;
> -
> -	/*
> -	 * The attaching of kvm and vfio_group might already happen, so
> -	 * here we replay once upon registration.
> -	 */
> -	if (set_kvm) {
> -		down_read(&group->group_rwsem);
> -		if (group->kvm)
> -			blocking_notifier_call_chain(&group->notifier,
> -						     VFIO_GROUP_NOTIFY_SET_KVM,
> -						     group->kvm);
> -		up_read(&group->group_rwsem);
> -	}
> -	return 0;
> -}
> -
>   int vfio_register_notifier(struct vfio_device *device,
>   			   enum vfio_notify_type type, unsigned long *events,
>   			   struct notifier_block *nb)
> @@ -2112,9 +2096,6 @@ int vfio_register_notifier(struct vfio_device *device,
>   	case VFIO_IOMMU_NOTIFY:
>   		ret = vfio_register_iommu_notifier(group, events, nb);
>   		break;
> -	case VFIO_GROUP_NOTIFY:
> -		ret = vfio_register_group_notifier(group, events, nb);
> -		break;
>   	default:
>   		ret = -EINVAL;
>   	}
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 45b287826ce686..aaf120b9c080b7 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -36,6 +36,7 @@ struct vfio_device {
>   	struct vfio_device_set *dev_set;
>   	struct list_head dev_set_list;
>   	unsigned int migration_flags;
> +	struct kvm *kvm;
>   
>   	/* Members below here are private, not for driver use */
>   	refcount_t refcount;
> @@ -44,6 +45,10 @@ struct vfio_device {
>   	struct list_head group_next;
>   };
>   
> +enum {
> +	VFIO_DEVICE_NEEDS_KVM = 1,
> +};
> +
>   /**
>    * struct vfio_device_ops - VFIO bus driver device callbacks
>    *
> @@ -72,6 +77,7 @@ struct vfio_device {
>    */
>   struct vfio_device_ops {
>   	char	*name;
> +	unsigned int flags;
>   	int	(*open_device)(struct vfio_device *vdev);
>   	void	(*close_device)(struct vfio_device *vdev);
>   	ssize_t	(*read)(struct vfio_device *vdev, char __user *buf,
> @@ -155,15 +161,11 @@ extern int vfio_dma_rw(struct vfio_device *device, dma_addr_t user_iova,
>   /* each type has independent events */
>   enum vfio_notify_type {
>   	VFIO_IOMMU_NOTIFY = 0,
> -	VFIO_GROUP_NOTIFY = 1,
>   };
>   
>   /* events for VFIO_IOMMU_NOTIFY */
>   #define VFIO_IOMMU_NOTIFY_DMA_UNMAP	BIT(0)
>   
> -/* events for VFIO_GROUP_NOTIFY */
> -#define VFIO_GROUP_NOTIFY_SET_KVM	BIT(0)
> -
>   extern int vfio_register_notifier(struct vfio_device *device,
>   				  enum vfio_notify_type type,
>   				  unsigned long *required_events,

Jason Gunthorpe May 16, 2022, 6:35 p.m. UTC | #3

On Mon, May 16, 2022 at 02:30:46PM -0400, Matthew Rosato wrote:

> Conceptually I think this would work for QEMU anyway (it always sets the kvm
> before we open the device).  I tried to test the idea quickly but couldn't
> get the following to apply on vfio-next or your vfio_group_locking -- but I
> understand what you're trying to do so I'll re-work and try it out.

I created it on 8c9350e9bf43de1ebab3cc8a80703671e6495ab4 which is the
vfio_group_locking.. I can send you a github if it helps
https://github.com/jgunthorpe/linux/commits/vfio_group_lockin

> @Alex can you think of any usecase/reason why we would need to be able to
> set the KVM sometime after the device was opened?  Doing something like
> below would break that, as this introduces the assumption that the group is
> associated with the KVM before the device is opened (and if it's not, the
> device open fails).

Keep in mind that GVT already hard requires this ordering to even
allow open_device to work - so it already sets a floor for what
userspace can do..

Jason

Alex Williamson May 16, 2022, 7:38 p.m. UTC | #4

On Mon, 16 May 2022 15:35:58 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, May 16, 2022 at 02:30:46PM -0400, Matthew Rosato wrote:
> 
> > Conceptually I think this would work for QEMU anyway (it always sets the kvm
> > before we open the device).  I tried to test the idea quickly but couldn't
> > get the following to apply on vfio-next or your vfio_group_locking -- but I
> > understand what you're trying to do so I'll re-work and try it out.  
> 
> I created it on 8c9350e9bf43de1ebab3cc8a80703671e6495ab4 which is the
> vfio_group_locking.. I can send you a github if it helps
> https://github.com/jgunthorpe/linux/commits/vfio_group_lockin
> 
> > @Alex can you think of any usecase/reason why we would need to be able to
> > set the KVM sometime after the device was opened?  Doing something like
> > below would break that, as this introduces the assumption that the group is
> > associated with the KVM before the device is opened (and if it's not, the
> > device open fails).  
> 
> Keep in mind that GVT already hard requires this ordering to even
> allow open_device to work - so it already sets a floor for what
> userspace can do..

How is this going to work when vfio devices are exposed directly?  We
currently have a strict ordering through the group to get to the
device, and it's therefore a reasonable requirement for userspace to
register the group with kvm before opening the device.  Is the notifier
and async KVM registration necessary to support this dependency with
direct device access?  I don't have as clear a picture of the ordering
with with direct access.  Thanks,

Alex

Matthew Rosato May 16, 2022, 9:59 p.m. UTC | #5

On 5/16/22 2:35 PM, Jason Gunthorpe wrote:
> On Mon, May 16, 2022 at 02:30:46PM -0400, Matthew Rosato wrote:
> 
>> Conceptually I think this would work for QEMU anyway (it always sets the kvm
>> before we open the device).  I tried to test the idea quickly but couldn't
>> get the following to apply on vfio-next or your vfio_group_locking -- but I
>> understand what you're trying to do so I'll re-work and try it out.
> 
> I created it on 8c9350e9bf43de1ebab3cc8a80703671e6495ab4 which is the
> vfio_group_locking.. I can send you a github if it helps
> https://github.com/jgunthorpe/linux/commits/vfio_group_lockin
> 
Thanks -- I was able to successfully test your proposed idea (+ some 
changes to make it compile :)) on top of vfio_group_locking along with a 
modified version of my zdev series.  I also tried it out with vfio-ap 
successfully, but have nothing to test GVT with.

That said, this has caused me to realize that 'iommu: 
iommu_group_claim_dma_owner() must always assign a domain' breaks s390x 
vfio-pci :( I wonder if it is due to the way s390x PCI currently 
switches between dma ops and iommu ops.  It looks like it breaks vfio-ap 
mdevs too, but I know less about that --  I will have to investigate 
both more tomorrow.

Jason Gunthorpe May 16, 2022, 11:05 p.m. UTC | #6

On Mon, May 16, 2022 at 05:59:01PM -0400, Matthew Rosato wrote:
> On 5/16/22 2:35 PM, Jason Gunthorpe wrote:
> > On Mon, May 16, 2022 at 02:30:46PM -0400, Matthew Rosato wrote:
> > 
> > > Conceptually I think this would work for QEMU anyway (it always sets the kvm
> > > before we open the device).  I tried to test the idea quickly but couldn't
> > > get the following to apply on vfio-next or your vfio_group_locking -- but I
> > > understand what you're trying to do so I'll re-work and try it out.
> > 
> > I created it on 8c9350e9bf43de1ebab3cc8a80703671e6495ab4 which is the
> > vfio_group_locking.. I can send you a github if it helps
> > https://github.com/jgunthorpe/linux/commits/vfio_group_lockin
> > 
> Thanks -- I was able to successfully test your proposed idea (+ some changes
> to make it compile :)) on top of vfio_group_locking along with a modified
> version of my zdev series.  I also tried it out with vfio-ap successfully,
> but have nothing to test GVT with.
> 
> That said, this has caused me to realize that 'iommu:
> iommu_group_claim_dma_owner() must always assign a domain' breaks s390x
> vfio-pci :( I wonder if it is due to the way s390x PCI currently switches
> between dma ops and iommu ops.  It looks like it breaks vfio-ap mdevs too,
> but I know less about that --  I will have to investigate both more
> tomorrow.

At the very least it appears that s390 thinks that attach/detach_dev
are strictly paired and that isn't how the iommu ops are defined.

attach_dev can be called multiple times without any detach_dev.

Possibly the unbalance of zpci_register_ioat / zpci_unregister_ioat
causes the trouble?

Jason

Jason Gunthorpe May 16, 2022, 11:11 p.m. UTC | #7

On Mon, May 16, 2022 at 01:38:39PM -0600, Alex Williamson wrote:
> On Mon, 16 May 2022 15:35:58 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Mon, May 16, 2022 at 02:30:46PM -0400, Matthew Rosato wrote:
> > 
> > > Conceptually I think this would work for QEMU anyway (it always sets the kvm
> > > before we open the device).  I tried to test the idea quickly but couldn't
> > > get the following to apply on vfio-next or your vfio_group_locking -- but I
> > > understand what you're trying to do so I'll re-work and try it out.  
> > 
> > I created it on 8c9350e9bf43de1ebab3cc8a80703671e6495ab4 which is the
> > vfio_group_locking.. I can send you a github if it helps
> > https://github.com/jgunthorpe/linux/commits/vfio_group_lockin
> > 
> > > @Alex can you think of any usecase/reason why we would need to be able to
> > > set the KVM sometime after the device was opened?  Doing something like
> > > below would break that, as this introduces the assumption that the group is
> > > associated with the KVM before the device is opened (and if it's not, the
> > > device open fails).  
> > 
> > Keep in mind that GVT already hard requires this ordering to even
> > allow open_device to work - so it already sets a floor for what
> > userspace can do..
> 
> How is this going to work when vfio devices are exposed directly?  We
> currently have a strict ordering through the group to get to the
> device, and it's therefore a reasonable requirement for userspace to
> register the group with kvm before opening the device.  Is the notifier
> and async KVM registration necessary to support this dependency with
> direct device access?  I don't have as clear a picture of the ordering
> with with direct access.  Thanks,

With the device FD there is already a zombie state after open("/dev/...") but
before the internal open_device op is called. This state ends after
the iommufd is assigned, then we can go to open_device.

It is reasonable that the KVM would have to be setup before assigning
the iommfd to the device fd.

Jason

Christoph Hellwig May 17, 2022, 6:21 a.m. UTC | #8

On Mon, May 16, 2022 at 02:27:34PM -0300, Jason Gunthorpe wrote:
> Normally you'd want to do what is kvm_s390_pci_register_kvm() here,
> where a failure can be propogated but then you have a race condition
> with the kvm.
> 
> Blech, maybe it is time to just fix this race condition permanently,
> what do you think? (I didn't even compile it)

This is roughly were I was planning to get to, with one difference:
I don't think we need or even want the VFIO_DEVICE_NEEDS_KVM flag.
Instead just propagation ->kvm to the device whenever it is set and
let drivers that have a hard requirements on it like gvt fail if it
isn't there.  This could still allow using vfio for userspace PCI
drivers on s390 for example or in general allow expressing a soft
requirement, just without the whole notifier mess.

The other question is if we even need an extra reference per device,
can't we hold the group reference until all devices are gone
anyway?  That would remove the need to include kvm_host.h in the
vfio code.

Jason Gunthorpe May 17, 2022, 12:01 p.m. UTC | #9

On Mon, May 16, 2022 at 11:21:19PM -0700, Christoph Hellwig wrote:
> On Mon, May 16, 2022 at 02:27:34PM -0300, Jason Gunthorpe wrote:
> > Normally you'd want to do what is kvm_s390_pci_register_kvm() here,
> > where a failure can be propogated but then you have a race condition
> > with the kvm.
> > 
> > Blech, maybe it is time to just fix this race condition permanently,
> > what do you think? (I didn't even compile it)
> 
> This is roughly were I was planning to get to, with one difference:
> I don't think we need or even want the VFIO_DEVICE_NEEDS_KVM flag.
> Instead just propagation ->kvm to the device whenever it is set and
> let drivers that have a hard requirements on it like gvt fail if it
> isn't there.

I did it so we didn't uselessly hold a ref on the kvm object, but
maybe that is not relevant.

> The other question is if we even need an extra reference per device,
> can't we hold the group reference until all devices are gone
> anyway?  That would remove the need to include kvm_host.h in the
> vfio code.

The device does now hold a reference on the group fd after this patch
series:

https://lore.kernel.org/r/0-v2-d035a1842d81+1bf-vfio_group_locking_jgg@nvidia.com

However the group does not hold a reference on the KVM, it has a
set/remove interface toward KVM and can have its group->kvm pointer
NULL'd via an ioctl at any time.

So, the semantic here is that the KVM is captured when the device FD
opens and then is immutable for the lifetime of that device FD even if
the group FD's KVM is reassigned or removed.

And I realize that it is all botched, this needs to check and respect
the open_count which requires nesting the locks..

Jason

[v7,17/22] vfio-pci/zdev: add open/close device hooks

Commit Message

Comments

Patch