diff mbox series

[v5,3/3] vfio/pci: Check the device set open count on reset

Message ID 20221105224458.8180-4-ajderossi@gmail.com (mailing list archive)
State Superseded
Headers show
Series vfio/pci: Check the device set open count on reset | expand

Commit Message

Anthony DeRossi Nov. 5, 2022, 10:44 p.m. UTC
vfio_pci_dev_set_needs_reset() inspects the open_count of every device
in the set to determine whether a reset is allowed. The current device
always has open_count == 1 within vfio_pci_core_disable(), effectively
disabling the reset logic. This field is also documented as private in
vfio_device, so it should not be used to determine whether other devices
in the set are open.

Checking for vfio_device_set_open_count() > 1 on the device set fixes
both issues.

After commit 2cd8b14aaa66 ("vfio/pci: Move to the device set
infrastructure"), failure to create a new file for a device would cause
the reset to be skipped due to open_count being decremented after
calling close_device() in the error path.

After commit eadd86f835c6 ("vfio: Remove calls to
vfio_group_add_container_user()"), releasing a device would always skip
the reset due to an ordering change in vfio_device_fops_release().

Failing to reset the device leaves it in an unknown state, potentially
causing errors when it is accessed later or bound to a different driver.

This issue was observed with a Radeon RX Vega 56 [1002:687f] (rev c3)
assigned to a Windows guest. After shutting down the guest, unbinding
the device from vfio-pci, and binding the device to amdgpu:

[  548.007102] [drm:psp_hw_start [amdgpu]] *ERROR* PSP create ring failed!
[  548.027174] [drm:psp_hw_init [amdgpu]] *ERROR* PSP firmware loading failed
[  548.027242] [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* hw_init of IP block <psp> failed -22
[  548.027306] amdgpu 0000:0a:00.0: amdgpu: amdgpu_device_ip_init failed
[  548.027308] amdgpu 0000:0a:00.0: amdgpu: Fatal error during GPU init

Fixes: 2cd8b14aaa66 ("vfio/pci: Move to the device set infrastructure")
Fixes: eadd86f835c6 ("vfio: Remove calls to vfio_group_add_container_user()")
Signed-off-by: Anthony DeRossi <ajderossi@gmail.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

Comments

Jason Gunthorpe Nov. 9, 2022, 12:53 a.m. UTC | #1
On Sat, Nov 05, 2022 at 03:44:58PM -0700, Anthony DeRossi wrote:
> vfio_pci_dev_set_needs_reset() inspects the open_count of every device
> in the set to determine whether a reset is allowed. The current device
> always has open_count == 1 within vfio_pci_core_disable(), effectively
> disabling the reset logic. This field is also documented as private in
> vfio_device, so it should not be used to determine whether other devices
> in the set are open.
> 
> Checking for vfio_device_set_open_count() > 1 on the device set fixes
> both issues.
> 
> After commit 2cd8b14aaa66 ("vfio/pci: Move to the device set
> infrastructure"), failure to create a new file for a device would cause
> the reset to be skipped due to open_count being decremented after
> calling close_device() in the error path.
> 
> After commit eadd86f835c6 ("vfio: Remove calls to
> vfio_group_add_container_user()"), releasing a device would always skip
> the reset due to an ordering change in vfio_device_fops_release().
> 
> Failing to reset the device leaves it in an unknown state, potentially
> causing errors when it is accessed later or bound to a different driver.
> 
> This issue was observed with a Radeon RX Vega 56 [1002:687f] (rev c3)
> assigned to a Windows guest. After shutting down the guest, unbinding
> the device from vfio-pci, and binding the device to amdgpu:
> 
> [  548.007102] [drm:psp_hw_start [amdgpu]] *ERROR* PSP create ring failed!
> [  548.027174] [drm:psp_hw_init [amdgpu]] *ERROR* PSP firmware loading failed
> [  548.027242] [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* hw_init of IP block <psp> failed -22
> [  548.027306] amdgpu 0000:0a:00.0: amdgpu: amdgpu_device_ip_init failed
> [  548.027308] amdgpu 0000:0a:00.0: amdgpu: Fatal error during GPU init
> 
> Fixes: 2cd8b14aaa66 ("vfio/pci: Move to the device set infrastructure")
> Fixes: eadd86f835c6 ("vfio: Remove calls to vfio_group_add_container_user()")
> Signed-off-by: Anthony DeRossi <ajderossi@gmail.com>
> ---
>  drivers/vfio/pci/vfio_pci_core.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason
Tian, Kevin Nov. 9, 2022, 3:38 a.m. UTC | #2
> From: Anthony DeRossi <ajderossi@gmail.com>
> Sent: Sunday, November 6, 2022 6:45 AM
> 
> vfio_pci_dev_set_needs_reset() inspects the open_count of every device
> in the set to determine whether a reset is allowed. The current device
> always has open_count == 1 within vfio_pci_core_disable(), effectively
> disabling the reset logic. This field is also documented as private in
> vfio_device, so it should not be used to determine whether other devices
> in the set are open.
> 
> Checking for vfio_device_set_open_count() > 1 on the device set fixes
> both issues.
> 
> After commit 2cd8b14aaa66 ("vfio/pci: Move to the device set
> infrastructure"), failure to create a new file for a device would cause
> the reset to be skipped due to open_count being decremented after
> calling close_device() in the error path.
> 
> After commit eadd86f835c6 ("vfio: Remove calls to
> vfio_group_add_container_user()"), releasing a device would always skip
> the reset due to an ordering change in vfio_device_fops_release().
> 
> Failing to reset the device leaves it in an unknown state, potentially
> causing errors when it is accessed later or bound to a different driver.
> 
> This issue was observed with a Radeon RX Vega 56 [1002:687f] (rev c3)
> assigned to a Windows guest. After shutting down the guest, unbinding
> the device from vfio-pci, and binding the device to amdgpu:
> 
> [  548.007102] [drm:psp_hw_start [amdgpu]] *ERROR* PSP create ring failed!
> [  548.027174] [drm:psp_hw_init [amdgpu]] *ERROR* PSP firmware loading
> failed
> [  548.027242] [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* hw_init
> of IP block <psp> failed -22
> [  548.027306] amdgpu 0000:0a:00.0: amdgpu: amdgpu_device_ip_init failed
> [  548.027308] amdgpu 0000:0a:00.0: amdgpu: Fatal error during GPU init
> 
> Fixes: 2cd8b14aaa66 ("vfio/pci: Move to the device set infrastructure")
> Fixes: eadd86f835c6 ("vfio: Remove calls to
> vfio_group_add_container_user()")
> Signed-off-by: Anthony DeRossi <ajderossi@gmail.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
diff mbox series

Patch

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index badc9d828cac..e030c2120183 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -2488,12 +2488,12 @@  static bool vfio_pci_dev_set_needs_reset(struct vfio_device_set *dev_set)
 	struct vfio_pci_core_device *cur;
 	bool needs_reset = false;
 
-	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
-		/* No VFIO device in the set can have an open device FD */
-		if (cur->vdev.open_count)
-			return false;
+	/* No other VFIO device in the set can be open. */
+	if (vfio_device_set_open_count(dev_set) > 1)
+		return false;
+
+	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list)
 		needs_reset |= cur->needs_reset;
-	}
 	return needs_reset;
 }