
[v2,0/4] Make the iommu driver no-snoop block feature consistent

Message ID 0-v2-f090ae795824+6ad-intel_no_snoop_jgg@nvidia.com (mailing list archive)

Message

Jason Gunthorpe April 7, 2022, 3:23 p.m. UTC
PCIe defines a 'no-snoop' bit in each TLP which is usually implemented
by a platform as bypassing elements in the DMA coherent CPU cache
hierarchy. A driver can command a device to set this bit on some of its
transactions as a micro-optimization.

However, the driver is then responsible for synchronizing the CPU cache
with the DMA that bypassed it. On x86 this may be done through the wbinvd
instruction, and the i915 GPU driver is the only Linux DMA driver that
calls it.
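
For a sense of scale, the flush in question is a heavyweight machine-wide
operation. A minimal illustrative fragment is below; the helper name is
hypothetical, but wbinvd_on_all_cpus() is the x86 primitive i915 ends up
using:

#include <asm/smp.h>

/*
 * Hypothetical helper, for illustration only: what a driver that lets its
 * device issue no-snoop TLPs must do on x86 before trusting DMA'd data in
 * memory - write back and invalidate every CPU's caches.
 */
static void sync_cpu_caches_after_nosnoop_dma(void)
{
	wbinvd_on_all_cpus();
}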

The problem is that KVM on x86 will normally disable the wbinvd
instruction in the guest and render it a NOP. As the driver running in the
guest is not aware that wbinvd doesn't work, it may still cause the device
to set the no-snoop bit, and the platform will bypass the CPU cache.
Without a working wbinvd there is no way to re-synchronize the CPU cache,
and the driver in the VM suffers data corruption.

Thus, we see a general direction on x86 that the IOMMU HW is able to block
the no-snoop bit in the TLP. This NOPs the optimization and allows KVM to
NOP the wbinvd without causing any data corruption.

This control for the Intel IOMMU was exposed by using IOMMU_CACHE and
IOMMU_CAP_CACHE_COHERENCY; however, these two values now have multiple
meanings and usages beyond blocking no-snoop, and the whole thing has
become confused. The AMD IOMMU has the same feature and the same IOPTE
bits, however it unconditionally blocks no-snoop.

Change it so that:
 - IOMMU_CACHE is only about the DMA coherence of normal DMAs from a
   device. It is used by the DMA API/VFIO/etc when the user of the
   iommu_domain will not be doing manual cache coherency operations.

 - IOMMU_CAP_CACHE_COHERENCY indicates if IOMMU_CACHE can be used with the
   device.

 - The new optional domain op enforce_cache_coherency() will cause the
   entire domain to block no-snoop requests - ie there is no way for any
   device attached to the domain to opt out of the IOMMU_CACHE behavior.
   This is permanent on the domain and must apply to any future devices
   attached to it.

Ideally an iommu driver should implement enforce_cache_coherency() so
that, by default, DMA API domains allow the no-snoop optimization. This
leaves it available to kernel drivers like i915. VFIO will call
enforce_cache_coherency() before establishing any mappings and the domain
should then permanently block no-snoop.
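
A rough sketch of the shape this takes is below - simplified from the
series, so treat the exact names and placement as illustrative rather than
the final diff:

/* include/linux/iommu.h: new optional per-domain op */
struct iommu_domain_ops {
	/* ... existing ops ... */
	bool (*enforce_cache_coherency)(struct iommu_domain *domain);
};

static inline bool iommu_enforce_cache_coherency(struct iommu_domain *domain)
{
	return domain->ops->enforce_cache_coherency ?
		       domain->ops->enforce_cache_coherency(domain) : false;
}

/* vfio_iommu_type1 call site: request the no-snoop block before any
 * mappings are established and remember whether it succeeded.
 */
domain->enforce_cache_coherency =
	iommu_enforce_cache_coherency(domain->domain);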

If enforce_cache_coherency() fails, VFIO will communicate this back to
KVM, which calls into the arch code via kvm_arch_register_noncoherent_dma()
(only implemented by x86); this causes a working wbinvd to be made
available to the VM.
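
The KVM side of that plumbing already exists; paraphrased below (helper
names are approximate and locking is omitted), the kvm-vfio device
re-checks coherency whenever a group is added or removed:

/*
 * Paraphrase of virt/kvm/vfio.c: if any VFIO group attached to the VM sits
 * behind an IOMMU that cannot block no-snoop, register non-coherent DMA
 * with the arch code so the guest gets a working wbinvd.
 */
static void kvm_vfio_update_coherency(struct kvm_device *dev)
{
	struct kvm_vfio *kv = dev->private;
	struct kvm_vfio_group *kvg;
	bool noncoherent = false;

	list_for_each_entry(kvg, &kv->group_list, node) {
		if (!kvm_vfio_group_is_coherent(kvg->vfio_group)) {
			noncoherent = true;
			break;
		}
	}

	if (noncoherent != kv->noncoherent) {
		kv->noncoherent = noncoherent;
		if (noncoherent)
			kvm_arch_register_noncoherent_dma(dev->kvm);
		else
			kvm_arch_unregister_noncoherent_dma(dev->kvm);
	}
}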

While other iommu drivers are certainly welcome to implement
enforce_cache_coherency(), it is not clear there is any benefit in doing
so right now.

This is on github: https://github.com/jgunthorpe/linux/commits/intel_no_snoop

v2:
 - Abandon removing IOMMU_CAP_CACHE_COHERENCY - instead make it the cap
   flag that indicates IOMMU_CACHE is supported
 - Put the VFIO tests for IOMMU_CACHE at VFIO device registration
 - In the Intel driver remove the domain->iommu_snooping value, this is
   global not per-domain
v1: https://lore.kernel.org/r/0-v1-ef02c60ddb76+12ca2-intel_no_snoop_jgg@nvidia.com

Jason Gunthorpe (4):
  iommu: Introduce the domain op enforce_cache_coherency()
  vfio: Move the Intel no-snoop control off of IOMMU_CACHE
  iommu: Redefine IOMMU_CAP_CACHE_COHERENCY as the cap flag for
    IOMMU_CACHE
  vfio: Require that devices support DMA cache coherence

 drivers/iommu/amd/iommu.c       |  7 +++++++
 drivers/iommu/intel/iommu.c     | 17 +++++++++++++----
 drivers/vfio/vfio.c             |  7 +++++++
 drivers/vfio/vfio_iommu_type1.c | 30 +++++++++++++++++++-----------
 include/linux/intel-iommu.h     |  2 +-
 include/linux/iommu.h           |  7 +++++--
 6 files changed, 52 insertions(+), 18 deletions(-)


base-commit: 3123109284176b1532874591f7c81f3837bbdc17

Comments

Robin Murphy April 7, 2022, 5:03 p.m. UTC | #1
On 2022-04-07 16:23, Jason Gunthorpe wrote:
> PCIe defines a 'no-snoop' bit in each the TLP which is usually implemented
> by a platform as bypassing elements in the DMA coherent CPU cache
> hierarchy. A driver can command a device to set this bit on some of its
> transactions as a micro-optimization.
> 
> However, the driver is now responsible to synchronize the CPU cache with
> the DMA that bypassed it. On x86 this may be done through the wbinvd
> instruction, and the i915 GPU driver is the only Linux DMA driver that
> calls it.
> 
> The problem comes that KVM on x86 will normally disable the wbinvd
> instruction in the guest and render it a NOP. As the driver running in the
> guest is not aware the wbinvd doesn't work it may still cause the device
> to set the no-snoop bit and the platform will bypass the CPU cache.
> Without a working wbinvd there is no way to re-synchronize the CPU cache
> and the driver in the VM has data corruption.
> 
> Thus, we see a general direction on x86 that the IOMMU HW is able to block
> the no-snoop bit in the TLP. This NOP's the optimization and allows KVM to
> to NOP the wbinvd without causing any data corruption.
> 
> This control for Intel IOMMU was exposed by using IOMMU_CACHE and
> IOMMU_CAP_CACHE_COHERENCY, however these two values now have multiple
> meanings and usages beyond blocking no-snoop and the whole thing has
> become confused. AMD IOMMU has the same feature and same IOPTE bits
> however it unconditionally blocks no-snoop.
> 
> Change it so that:
>   - IOMMU_CACHE is only about the DMA coherence of normal DMAs from a
>     device. It is used by the DMA API/VFIO/etc when the user of the
>     iommu_domain will not be doing manual cache coherency operations.
> 
>   - IOMMU_CAP_CACHE_COHERENCY indicates if IOMMU_CACHE can be used with the
>     device.
> 
>   - The new optional domain op enforce_cache_coherency() will cause the
>     entire domain to block no-snoop requests - ie there is no way for any
>     device attached to the domain to opt out of the IOMMU_CACHE behavior.
>     This is permanent on the domain and must apply to any future devices
>     attached to it.
> 
> Ideally an iommu driver should implement enforce_cache_coherency() so that
> by DMA API domains allow the no-snoop optimization. This leaves it
> available to kernel drivers like i915. VFIO will call
> enforce_cache_coherency() before establishing any mappings and the domain
> should then permanently block no-snoop.
> 
> If enforce_cache_coherency() fails VFIO will communicate back through to
> KVM into the arch code via kvm_arch_register_noncoherent_dma()
> (only implemented by x86) which triggers a working wbinvd to be made
> available to the VM.
> 
> While other iommu drivers are certainly welcome to implement
> enforce_cache_coherency(), it is not clear there is any benefit in doing
> so right now.
> 
> This is on github: https://github.com/jgunthorpe/linux/commits/intel_no_snoop
> 
> v2:
>   - Abandon removing IOMMU_CAP_CACHE_COHERENCY - instead make it the cap
>     flag that indicates IOMMU_CACHE is supported
>   - Put the VFIO tests for IOMMU_CACHE at VFIO device registration
>   - In the Intel driver remove the domain->iommu_snooping value, this is
>     global not per-domain

At a glance, this all looks about the right shape to me now, thanks!

Ideally I'd hope patch #4 could go straight to device_iommu_capable() 
from my Thunderbolt series, but we can figure that out in a couple of 
weeks once Joerg starts queueing 5.19 material. I've got another VFIO 
patch waiting for the DMA ownership series to land anyway, so it's 
hardly the end of the world if I have to come back to follow up on this 
one too.

For the series,

Acked-by: Robin Murphy <robin.murphy@arm.com>

> v1: https://lore.kernel.org/r/0-v1-ef02c60ddb76+12ca2-intel_no_snoop_jgg@nvidia.com
> 
> Jason Gunthorpe (4):
>    iommu: Introduce the domain op enforce_cache_coherency()
>    vfio: Move the Intel no-snoop control off of IOMMU_CACHE
>    iommu: Redefine IOMMU_CAP_CACHE_COHERENCY as the cap flag for
>      IOMMU_CACHE
>    vfio: Require that devices support DMA cache coherence
> 
>   drivers/iommu/amd/iommu.c       |  7 +++++++
>   drivers/iommu/intel/iommu.c     | 17 +++++++++++++----
>   drivers/vfio/vfio.c             |  7 +++++++
>   drivers/vfio/vfio_iommu_type1.c | 30 +++++++++++++++++++-----------
>   include/linux/intel-iommu.h     |  2 +-
>   include/linux/iommu.h           |  7 +++++--
>   6 files changed, 52 insertions(+), 18 deletions(-)
> 
> 
> base-commit: 3123109284176b1532874591f7c81f3837bbdc17
Jason Gunthorpe April 7, 2022, 5:43 p.m. UTC | #2
On Thu, Apr 07, 2022 at 06:03:37PM +0100, Robin Murphy wrote:
> At a glance, this all looks about the right shape to me now, thanks!

Thanks!

> Ideally I'd hope patch #4 could go straight to device_iommu_capable() from
> my Thunderbolt series, but we can figure that out in a couple of weeks once

Yes, this does help that because now the only iommu_capable() call is
in a context where a device is available :)

> Joerg starts queueing 5.19 material. I've got another VFIO patch waiting for
> the DMA ownership series to land anyway, so it's hardly the end of the world
> if I have to come back to follow up on this one too.

Hopefully Joerg will start soon, I also have patches written waiting
for the DMA ownership series.

Jason
Robin Murphy April 7, 2022, 6:02 p.m. UTC | #3
On 2022-04-07 18:43, Jason Gunthorpe wrote:
> On Thu, Apr 07, 2022 at 06:03:37PM +0100, Robin Murphy wrote:
>> At a glance, this all looks about the right shape to me now, thanks!
> 
> Thanks!
> 
>> Ideally I'd hope patch #4 could go straight to device_iommu_capable() from
>> my Thunderbolt series, but we can figure that out in a couple of weeks once
> 
> Yes, this does helps that because now the only iommu_capable call is
> in a context where a device is available :)

Derp, of course I have *two* VFIO patches waiting, the other one 
touching the iommu_capable() calls (there's still IOMMU_CAP_INTR_REMAP, 
which, much as I hate it and would love to boot all that stuff over to 
drivers/irqchip, it's not in my way so I'm leaving it be for now). I'll 
have to rebase that anyway, so merging this as-is is absolutely fine!

Cheers,
Robin.
Jason Gunthorpe April 7, 2022, 7:08 p.m. UTC | #4
On Thu, Apr 07, 2022 at 07:02:03PM +0100, Robin Murphy wrote:
> On 2022-04-07 18:43, Jason Gunthorpe wrote:
> > On Thu, Apr 07, 2022 at 06:03:37PM +0100, Robin Murphy wrote:
> > > At a glance, this all looks about the right shape to me now, thanks!
> > 
> > Thanks!
> > 
> > > Ideally I'd hope patch #4 could go straight to device_iommu_capable() from
> > > my Thunderbolt series, but we can figure that out in a couple of weeks once
> > 
> > Yes, this does helps that because now the only iommu_capable call is
> > in a context where a device is available :)
> 
> Derp, of course I have *two* VFIO patches waiting, the other one touching
> the iommu_capable() calls (there's still IOMMU_CAP_INTR_REMAP, which, much
> as I hate it and would love to boot all that stuff over to
> drivers/irqchip,

Oh me too...

> it's not in my way so I'm leaving it be for now). I'll have to rebase that
> anyway, so merging this as-is is absolutely fine!

This might help your effort - after this series and the patch below there
are no 'bus' users of iommu_capable() left at all.

From 55d72be40bc0a031711e318c8dd1cb60673d9eca Mon Sep 17 00:00:00 2001
From: Jason Gunthorpe <jgg@nvidia.com>
Date: Thu, 7 Apr 2022 16:00:50 -0300
Subject: [PATCH] vfio: Move the IOMMU_CAP_INTR_REMAP to a context with a
 struct device

This check is done to ensure that the device we want to use is fully
isolated and the platform does not allow the device's MemWr TLPs to
trigger unauthorized MSIs.

Instead of doing it in the container context where we only have a group,
move the check to open_device where we actually know the device.

This is still security safe as userspace cannot trigger MemWr TLPs
without obtaining a device fd.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/vfio/vfio.c             |  9 +++++++++
 drivers/vfio/vfio.h             |  1 +
 drivers/vfio/vfio_iommu_type1.c | 28 +++++++++++++++++-----------
 3 files changed, 27 insertions(+), 11 deletions(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 9edad767cfdad3..8db5cea1dc1d75 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1355,6 +1355,15 @@ static int vfio_group_get_device_fd(struct vfio_group *group, char *buf)
 	if (IS_ERR(device))
 		return PTR_ERR(device);
 
+	/* Confirm this device is compatible with the container */
+	if (group->type == VFIO_IOMMU &&
+	    group->container->iommu_driver->ops->device_ok) {
+		ret = group->container->iommu_driver->ops->device_ok(
+			group->container->iommu_data, device->dev);
+		if (ret)
+			goto err_device_put;
+	}
+
 	if (!try_module_get(device->dev->driver->owner)) {
 		ret = -ENODEV;
 		goto err_device_put;
diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
index a6713022115155..3db60de71d42eb 100644
--- a/drivers/vfio/vfio.h
+++ b/drivers/vfio/vfio.h
@@ -66,6 +66,7 @@ struct vfio_iommu_driver_ops {
 						   struct iommu_group *group);
 	void		(*notify)(void *iommu_data,
 				  enum vfio_iommu_notify_type event);
+	int		(*device_ok)(void *iommu_data, struct device *device);
 };
 
 int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops);
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index c13b9290e35759..5e966fb0ab9202 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -2153,6 +2153,21 @@ static void vfio_iommu_iova_insert_copy(struct vfio_iommu *iommu,
 	list_splice_tail(iova_copy, iova);
 }
 
+static int vfio_iommu_device_ok(void *iommu_data, struct device *device)
+{
+	bool msi_remap;
+
+	msi_remap = irq_domain_check_msi_remap() ||
+		    iommu_capable(device->bus, IOMMU_CAP_INTR_REMAP);
+
+	if (!allow_unsafe_interrupts && !msi_remap) {
+		pr_warn("%s: No interrupt remapping support.  Use the module param \"allow_unsafe_interrupts\" to enable VFIO IOMMU support on this platform\n",
+			__func__);
+		return -EPERM;
+	}
+	return 0;
+}
+
 static int vfio_iommu_type1_attach_group(void *iommu_data,
 		struct iommu_group *iommu_group, enum vfio_group_type type)
 {
@@ -2160,7 +2175,7 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 	struct vfio_iommu_group *group;
 	struct vfio_domain *domain, *d;
 	struct bus_type *bus = NULL;
-	bool resv_msi, msi_remap;
+	bool resv_msi;
 	phys_addr_t resv_msi_base = 0;
 	struct iommu_domain_geometry *geo;
 	LIST_HEAD(iova_copy);
@@ -2257,16 +2272,6 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 	INIT_LIST_HEAD(&domain->group_list);
 	list_add(&group->next, &domain->group_list);
 
-	msi_remap = irq_domain_check_msi_remap() ||
-		    iommu_capable(bus, IOMMU_CAP_INTR_REMAP);
-
-	if (!allow_unsafe_interrupts && !msi_remap) {
-		pr_warn("%s: No interrupt remapping support.  Use the module param \"allow_unsafe_interrupts\" to enable VFIO IOMMU support on this platform\n",
-		       __func__);
-		ret = -EPERM;
-		goto out_detach;
-	}
-
 	/*
 	 * If the IOMMU can block non-coherent operations (ie PCIe TLPs with
 	 * no-snoop set) then VFIO always turns this feature on because on Intel
@@ -3159,6 +3164,7 @@ static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
 	.open			= vfio_iommu_type1_open,
 	.release		= vfio_iommu_type1_release,
 	.ioctl			= vfio_iommu_type1_ioctl,
+	.device_ok		= vfio_iommu_device_ok,
 	.attach_group		= vfio_iommu_type1_attach_group,
 	.detach_group		= vfio_iommu_type1_detach_group,
 	.pin_pages		= vfio_iommu_type1_pin_pages,
Robin Murphy April 7, 2022, 7:27 p.m. UTC | #5
On 2022-04-07 20:08, Jason Gunthorpe wrote:
> On Thu, Apr 07, 2022 at 07:02:03PM +0100, Robin Murphy wrote:
>> On 2022-04-07 18:43, Jason Gunthorpe wrote:
>>> On Thu, Apr 07, 2022 at 06:03:37PM +0100, Robin Murphy wrote:
>>>> At a glance, this all looks about the right shape to me now, thanks!
>>>
>>> Thanks!
>>>
>>>> Ideally I'd hope patch #4 could go straight to device_iommu_capable() from
>>>> my Thunderbolt series, but we can figure that out in a couple of weeks once
>>>
>>> Yes, this does helps that because now the only iommu_capable call is
>>> in a context where a device is available :)
>>
>> Derp, of course I have *two* VFIO patches waiting, the other one touching
>> the iommu_capable() calls (there's still IOMMU_CAP_INTR_REMAP, which, much
>> as I hate it and would love to boot all that stuff over to
>> drivers/irqchip,
> 
> Oh me too...
> 
>> it's not in my way so I'm leaving it be for now). I'll have to rebase that
>> anyway, so merging this as-is is absolutely fine!
> 
> This might help your effort - after this series and this below there
> are no 'bus' users of iommu_capable left at all.

Thanks, but I still need a device for the iommu_domain_alloc() as well, 
so at that point the interrupt check is OK to stay where it is. I 
figured out a locking strategy per my original idea that seems pretty 
clean, it just needs vfio_group_viable() to go away first:

https://gitlab.arm.com/linux-arm/linux-rm/-/commit/c6057da9f6b5f4b0fb67c6e647d2f8f76a6177fc

Cheers,
Robin.
Tian, Kevin April 8, 2022, 9:08 a.m. UTC | #6
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, April 8, 2022 3:08 AM
> On Thu, Apr 07, 2022 at 07:02:03PM +0100, Robin Murphy wrote:
> > On 2022-04-07 18:43, Jason Gunthorpe wrote:
> > > On Thu, Apr 07, 2022 at 06:03:37PM +0100, Robin Murphy wrote:
> > > > At a glance, this all looks about the right shape to me now, thanks!
> > >
> > > Thanks!
> > >
> > > > Ideally I'd hope patch #4 could go straight to device_iommu_capable()
> from
> > > > my Thunderbolt series, but we can figure that out in a couple of weeks
> once
> > >
> > > Yes, this does helps that because now the only iommu_capable call is
> > > in a context where a device is available :)
> >
> > Derp, of course I have *two* VFIO patches waiting, the other one touching
> > the iommu_capable() calls (there's still IOMMU_CAP_INTR_REMAP, which,
> much
> > as I hate it and would love to boot all that stuff over to
> > drivers/irqchip,
> 
> Oh me too...
> 
> > it's not in my way so I'm leaving it be for now). I'll have to rebase that
> > anyway, so merging this as-is is absolutely fine!
> 
> This might help your effort - after this series and this below there
> are no 'bus' users of iommu_capable left at all.
> 

Out of curiosity, while iommu_capable() is being moved to a per-device
interface, what about irq_domain_check_msi_remap() below (which
is also a global check)?

> +static int vfio_iommu_device_ok(void *iommu_data, struct device *device)
> +{
> +	bool msi_remap;
> +
> +	msi_remap = irq_domain_check_msi_remap() ||
> +		    iommu_capable(device->bus, IOMMU_CAP_INTR_REMAP);
> +
Robin Murphy April 8, 2022, 10:11 a.m. UTC | #7
On 2022-04-08 10:08, Tian, Kevin wrote:
>> From: Jason Gunthorpe <jgg@nvidia.com>
>> Sent: Friday, April 8, 2022 3:08 AM
>> On Thu, Apr 07, 2022 at 07:02:03PM +0100, Robin Murphy wrote:
>>> On 2022-04-07 18:43, Jason Gunthorpe wrote:
>>>> On Thu, Apr 07, 2022 at 06:03:37PM +0100, Robin Murphy wrote:
>>>>> At a glance, this all looks about the right shape to me now, thanks!
>>>>
>>>> Thanks!
>>>>
>>>>> Ideally I'd hope patch #4 could go straight to device_iommu_capable()
>> from
>>>>> my Thunderbolt series, but we can figure that out in a couple of weeks
>> once
>>>>
>>>> Yes, this does helps that because now the only iommu_capable call is
>>>> in a context where a device is available :)
>>>
>>> Derp, of course I have *two* VFIO patches waiting, the other one touching
>>> the iommu_capable() calls (there's still IOMMU_CAP_INTR_REMAP, which,
>> much
>>> as I hate it and would love to boot all that stuff over to
>>> drivers/irqchip,
>>
>> Oh me too...
>>
>>> it's not in my way so I'm leaving it be for now). I'll have to rebase that
>>> anyway, so merging this as-is is absolutely fine!
>>
>> This might help your effort - after this series and this below there
>> are no 'bus' users of iommu_capable left at all.
>>
> 
> Out of curiosity, while iommu_capable is being moved to a per-device
> interface what about irq_domain_check_msi_remap() below (which
> is also a global check)?

I suppose it could if anyone cared enough to make the effort - probably 
a case of resolving specific MSI domains for every device in the group, 
and potentially having to deal with hotplug later as well. 
Realistically, though, I wouldn't expect systems to have mixed 
capabilities in that regard (i.e. where the check would return false 
even though *some* domains support remapping), so there doesn't seem to 
be any pressing need to relax it.

Cheers,
Robin.

>> +static int vfio_iommu_device_ok(void *iommu_data, struct device *device)
>> +{
>> +	bool msi_remap;
>> +
>> +	msi_remap = irq_domain_check_msi_remap() ||
>> +		    iommu_capable(device->bus, IOMMU_CAP_INTR_REMAP);
>> +
>
Jason Gunthorpe April 8, 2022, 12:18 p.m. UTC | #8
On Thu, Apr 07, 2022 at 08:27:05PM +0100, Robin Murphy wrote:
> On 2022-04-07 20:08, Jason Gunthorpe wrote:
> > On Thu, Apr 07, 2022 at 07:02:03PM +0100, Robin Murphy wrote:
> > > On 2022-04-07 18:43, Jason Gunthorpe wrote:
> > > > On Thu, Apr 07, 2022 at 06:03:37PM +0100, Robin Murphy wrote:
> > > > > At a glance, this all looks about the right shape to me now, thanks!
> > > > 
> > > > Thanks!
> > > > 
> > > > > Ideally I'd hope patch #4 could go straight to device_iommu_capable() from
> > > > > my Thunderbolt series, but we can figure that out in a couple of weeks once
> > > > 
> > > > Yes, this does helps that because now the only iommu_capable call is
> > > > in a context where a device is available :)
> > > 
> > > Derp, of course I have *two* VFIO patches waiting, the other one touching
> > > the iommu_capable() calls (there's still IOMMU_CAP_INTR_REMAP, which, much
> > > as I hate it and would love to boot all that stuff over to
> > > drivers/irqchip,
> > 
> > Oh me too...
> > 
> > > it's not in my way so I'm leaving it be for now). I'll have to rebase that
> > > anyway, so merging this as-is is absolutely fine!
> > 
> > This might help your effort - after this series and this below there
> > are no 'bus' users of iommu_capable left at all.
> 
> Thanks, but I still need a device for the iommu_domain_alloc() as well, so
> at that point the interrupt check is OK to stay where it is. 

It is a simple enough change that it could perhaps avoid introducing
device_iommu_capable() at all.

> I figured out a locking strategy per my original idea that seems
> pretty clean, it just needs vfio_group_viable() to go away first:

I think this should be more like:

	struct vfio_device *vdev;

	mutex_lock(&group->device_lock);
	vdev = list_first_entry(&group->device_list,
				struct vfio_device, group_next);
	ret = driver->ops->attach_group(data, group->iommu_group,
					group->type,
					vdev->dev);
	mutex_unlock(&group->device_lock);

Then don't do the iommu_group_for_each_dev() at all.

The problem with iommu_group_for_each_dev() is that it may select a
struct device that does not have a vfio_device bound to it, so we
would be using a random struct device that is not protected by any
VFIO device_driver.

However, this creates an oddball situation where the vfio_device and
its struct device could become unplugged from the system while the
domain that the struct device spawned continues to exist and remains
attached to other devices in the same group, i.e. the iommu driver has
to be careful not to retain the struct device input.

I suppose that is inevitable to have sharing of domains across
devices, so the iommu drivers will have to accommodate this.

Jason
Robin Murphy April 8, 2022, 1:11 p.m. UTC | #9
On 2022-04-08 13:18, Jason Gunthorpe wrote:
> On Thu, Apr 07, 2022 at 08:27:05PM +0100, Robin Murphy wrote:
>> On 2022-04-07 20:08, Jason Gunthorpe wrote:
>>> On Thu, Apr 07, 2022 at 07:02:03PM +0100, Robin Murphy wrote:
>>>> On 2022-04-07 18:43, Jason Gunthorpe wrote:
>>>>> On Thu, Apr 07, 2022 at 06:03:37PM +0100, Robin Murphy wrote:
>>>>>> At a glance, this all looks about the right shape to me now, thanks!
>>>>>
>>>>> Thanks!
>>>>>
>>>>>> Ideally I'd hope patch #4 could go straight to device_iommu_capable() from
>>>>>> my Thunderbolt series, but we can figure that out in a couple of weeks once
>>>>>
>>>>> Yes, this does helps that because now the only iommu_capable call is
>>>>> in a context where a device is available :)
>>>>
>>>> Derp, of course I have *two* VFIO patches waiting, the other one touching
>>>> the iommu_capable() calls (there's still IOMMU_CAP_INTR_REMAP, which, much
>>>> as I hate it and would love to boot all that stuff over to
>>>> drivers/irqchip,
>>>
>>> Oh me too...
>>>
>>>> it's not in my way so I'm leaving it be for now). I'll have to rebase that
>>>> anyway, so merging this as-is is absolutely fine!
>>>
>>> This might help your effort - after this series and this below there
>>> are no 'bus' users of iommu_capable left at all.
>>
>> Thanks, but I still need a device for the iommu_domain_alloc() as well, so
>> at that point the interrupt check is OK to stay where it is.
> 
> It is a simple enough change that could avoid introducing the
> device_iommu_capable() at all perhaps.
> 
>> I figured out a locking strategy per my original idea that seems
>> pretty clean, it just needs vfio_group_viable() to go away first:
> 
> I think this should be more like:
> 
>    	        struct vfio_device *vdev;
> 
> 		mutex_lock(&group->device_lock);
> 		vdev = list_first_entry(group->device_list, struct vfio_device, group_next);
> 		ret = driver->ops->attach_group(data, group->iommu_group,
> 						group->type,
> 						vdev->dev);
> 		mutex_unlock(&group->device_lock);
> 
> Then don't do the iommu_group_for_each_dev() at all.
> 
> The problem with iommu_group_for_each_dev() is that it may select a
> struct device that does not have a vfio_device bound to it, so we
> would be using a random struct device that is not protected by any
> VFIO device_driver.

Yeah, I was just looking at the final puzzle piece of making sure to nab 
the actual VFIO device rather than some unbound device that's just along 
for the ride... If I can't come up with anything more self-contained 
I'll take this suggestion, thanks.

> However, this creates an oddball situation where the vfio_device and
> it's struct device could become unplugged from the system while the
> domain that the struct device spawned continues to exist and remains
> attached to other devices in the same group. ie the iommu driver has
> to be careful not to retain the struct device input..

Oh, I rather assumed that VFIO might automatically tear down the 
container/domain when the last real user disappears. I don't see there 
being an obvious problem if the domain does hang around with residual 
non-VFIO devices attached until userspace explicitly closes all the 
relevant handles, as long as we take care not to release DMA ownership 
until that point also. As you say, it just looks a bit funny.

> I suppose that is inevitable to have sharing of domains across
> devices, so the iommu drivers will have to accommodate this.

I think domain lifecycle management is already entirely up to the users 
and not something that IOMMU drivers need to worry about. Drivers should 
only need to look at per-device data in attach/detach (and, once I've 
finished, alloc) from the device argument which can be assumed to be 
valid at that point. Otherwise, all the relevant internal data for 
domain ops should belong to the domain already.

Cheers,
Robin.
Jason Gunthorpe April 8, 2022, 1:35 p.m. UTC | #10
On Fri, Apr 08, 2022 at 02:11:10PM +0100, Robin Murphy wrote:

> > However, this creates an oddball situation where the vfio_device and
> > it's struct device could become unplugged from the system while the
> > domain that the struct device spawned continues to exist and remains
> > attached to other devices in the same group. ie the iommu driver has
> > to be careful not to retain the struct device input..
> 
> Oh, I rather assumed that VFIO might automatically tear down the
> container/domain when the last real user disappears. 

It does, but that isn't quite what I mean.

Let's say we have a simple case with two groups and two devices.

Open a VFIO container FD

We open group A and SET_CONTAINER it. This results in:
   domain_A = iommu_domain_alloc(device_A)
   iommu_attach_group(domain_A, device_A->group)

We open group B and SET_CONTAINER it. Using the sharing logic we end
up doing
   iommu_attach_group(domain_A, device_B->group)

Now we close the group A FD, detach device_A->group from domain_A, and the
driver core hot-unplugs device A completely from the system.

However, domain_A remains in the system used by group B's open FD.

It is a bit funny, at least. I think it is just something to document,
and iommu driver writers should be aware that they probably shouldn't
try to store the allocation device in their domain struct.

IMHO the only purpose of the allocation device is to crystallize the
configuration of the iommu_domain at allocation time.

> as long as we take care not to release DMA ownership until that point also.
> As you say, it just looks a bit funny.

The DMA ownership should be OK as we take ownership on each group FD
open.
 
> > I suppose that is inevitable to have sharing of domains across
> > devices, so the iommu drivers will have to accommodate this.
> 
> I think domain lifecycle management is already entirely up to the users and
> not something that IOMMU drivers need to worry about. Drivers should only
> need to look at per-device data in attach/detach (and, once I've finished,
> alloc) from the device argument which can be assumed to be valid at that
> point. Otherwise, all the relevant internal data for domain ops should
> belong to the domain already.

Making attach/detach take a struct device would be nice - but I would
expect the attach/detach to use a strictly paired struct device, and I
don't think this trick of selecting an arbitrary vfio_device will
achieve that.

So, I suppose VFIO would want to attach/detach on every vfio_device
individually and it would iterate over the group instead of doing a
list_first_entry() like above. This would not be hard to do in VFIO.

Not sure what the iommu layer would have to do to accommodate this..

Jason
Robin Murphy April 8, 2022, 5:44 p.m. UTC | #11
On 2022-04-08 14:35, Jason Gunthorpe wrote:
> On Fri, Apr 08, 2022 at 02:11:10PM +0100, Robin Murphy wrote:
> 
>>> However, this creates an oddball situation where the vfio_device and
>>> it's struct device could become unplugged from the system while the
>>> domain that the struct device spawned continues to exist and remains
>>> attached to other devices in the same group. ie the iommu driver has
>>> to be careful not to retain the struct device input..
>>
>> Oh, I rather assumed that VFIO might automatically tear down the
>> container/domain when the last real user disappears.
> 
> It does, that isn't quite what I mean..
> 
> Lets say a simple case with two groups and two devices.
> 
> Open a VFIO container FD
> 
> We open group A and SET_CONTAINER it. This results in an
>     domain_A = iommu_domain_alloc(device_A)
>     iommu_attach_group(domain_A, device_A->group)
> 
> We open group B and SET_CONTAINER it. Using the sharing logic we end
> up doing
>     iommu_attach_group(domain_A, device_B->group)
> 
> Now we close group A FD, detatch device_A->group from domain_A and the
> driver core hot-unplugs device A completely from the system.
> 
> However, domain_A remains in the system used by group B's open FD.
> 
> It is a bit funny at least.. I think it is just something to document
> and be aware of for iommu driver writers that they probably shouldn't
> try to store the allocation device in their domain struct.
> 
> IHMO the only purpose of the allocation device is to crystalize the
> configuration of the iommu_domain at allocation time.

Oh, for sure. When I implement the API switch, I can certainly try to 
document it as clearly as possible that the device argument is only for 
resolving the correct IOMMU ops and target instance, and the resulting 
domain is still not in any way tied to that specific device.

I hadn't thought about how it might look to future developers who aren't 
already familiar with all the history here, so thanks for the nudge!

>> as long as we take care not to release DMA ownership until that point also.
>> As you say, it just looks a bit funny.
> 
> The DMA ownership should be OK as we take ownership on each group FD
> open
>   
>>> I suppose that is inevitable to have sharing of domains across
>>> devices, so the iommu drivers will have to accommodate this.
>>
>> I think domain lifecycle management is already entirely up to the users and
>> not something that IOMMU drivers need to worry about. Drivers should only
>> need to look at per-device data in attach/detach (and, once I've finished,
>> alloc) from the device argument which can be assumed to be valid at that
>> point. Otherwise, all the relevant internal data for domain ops should
>> belong to the domain already.
> 
> Making attach/detach take a struct device would be nice - but I would
> expect the attach/detatch to use a strictly paired struct device and I
> don't think this trick of selecting an arbitary vfio_device will
> achieve that.
> 
> So, I suppose VFIO would want to attach/detatch on every vfio_device
> individually and it would iterate over the group instead of doing a
> list_first_entry() like above. This would not be hard to do in VFIO.

It feels like we've already beaten that discussion to death in other 
threads; regardless of what exact argument the iommu_attach/detach 
operations end up taking, they have to operate on the whole (explicit or 
implicit) iommu_group at once, because doing anything else would defeat 
the point of isolation groups, and be impossible for alias groups.

> Not sure what the iommu layer would have to do to accommodate this..

If it's significantly easier for VFIO to just run through a whole list 
of devices and attach each one without having to keep track of whether 
they might share an iommu_group which has already been attached, then we 
can probably relax the API a little such that attaching to a domain 
which is already the current domain becomes a no-op instead of returning 
-EBUSY, but I'd rather not create an expectation that anyone *has* to do 
that. For any other callers that would be forcing *more* iommu_group 
implementation details onto them, when we all want less.
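
Purely as a hypothetical illustration of that relaxation (the surrounding
code approximates the drivers/iommu/iommu.c internals of the time; nothing
like this has actually been posted):

static int __iommu_attach_group(struct iommu_domain *domain,
				struct iommu_group *group)
{
	int ret;

	/*
	 * Hypothetical relaxation: re-attaching the group's current domain
	 * becomes a no-op instead of returning -EBUSY, so a caller need not
	 * track which groups already share the domain.
	 */
	if (group->domain == domain)
		return 0;

	if (group->default_domain && group->domain != group->default_domain)
		return -EBUSY;

	ret = __iommu_group_for_each_dev(group, domain,
					 iommu_group_do_attach_device);
	if (ret == 0)
		group->domain = domain;

	return ret;
}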

Cheers,
Robin.
Tian, Kevin April 12, 2022, 2:49 a.m. UTC | #12
> From: Robin Murphy <robin.murphy@arm.com>
> Sent: Friday, April 8, 2022 6:12 PM
> 
> On 2022-04-08 10:08, Tian, Kevin wrote:
> >> From: Jason Gunthorpe <jgg@nvidia.com>
> >> Sent: Friday, April 8, 2022 3:08 AM
> >> On Thu, Apr 07, 2022 at 07:02:03PM +0100, Robin Murphy wrote:
> >>> On 2022-04-07 18:43, Jason Gunthorpe wrote:
> >>>> On Thu, Apr 07, 2022 at 06:03:37PM +0100, Robin Murphy wrote:
> >>>>> At a glance, this all looks about the right shape to me now, thanks!
> >>>>
> >>>> Thanks!
> >>>>
> >>>>> Ideally I'd hope patch #4 could go straight to device_iommu_capable()
> >> from
> >>>>> my Thunderbolt series, but we can figure that out in a couple of weeks
> >> once
> >>>>
> >>>> Yes, this does helps that because now the only iommu_capable call is
> >>>> in a context where a device is available :)
> >>>
> >>> Derp, of course I have *two* VFIO patches waiting, the other one
> touching
> >>> the iommu_capable() calls (there's still IOMMU_CAP_INTR_REMAP,
> which,
> >> much
> >>> as I hate it and would love to boot all that stuff over to
> >>> drivers/irqchip,
> >>
> >> Oh me too...
> >>
> >>> it's not in my way so I'm leaving it be for now). I'll have to rebase that
> >>> anyway, so merging this as-is is absolutely fine!
> >>
> >> This might help your effort - after this series and this below there
> >> are no 'bus' users of iommu_capable left at all.
> >>
> >
> > Out of curiosity, while iommu_capable is being moved to a per-device
> > interface what about irq_domain_check_msi_remap() below (which
> > is also a global check)?
> 
> I suppose it could if anyone cared enough to make the effort - probably
> a case of resolving specific MSI domains for every device in the group,
> and potentially having to deal with hotplug later as well.
> Realistically, though, I wouldn't expect systems to have mixed
> capabilities in that regard (i.e. where the check would return false
> even though *some* domains support remapping), so there doesn't seem to
> be any pressing need to relax it.
> 

Yes, that makes sense if there is no such example so far.
Tian, Kevin April 12, 2022, 2:51 a.m. UTC | #13
> From: Robin Murphy <robin.murphy@arm.com>
> Sent: Saturday, April 9, 2022 1:45 AM
> 
> >
> > So, I suppose VFIO would want to attach/detatch on every vfio_device
> > individually and it would iterate over the group instead of doing a
> > list_first_entry() like above. This would not be hard to do in VFIO.
> 
> It feels like we've already beaten that discussion to death in other
> threads; regardless of what exact argument the iommu_attach/detach
> operations end up taking, they have to operate on the whole (explicit or
> implicit) iommu_group at once, because doing anything else would defeat
> the point of isolation groups, and be impossible for alias groups.
> 

Could you add a few words about what an 'alias group' is?