diff mbox series

[4/4] iommu: Fix ordering of iommu_release_device()

Message ID 4-v1-ef00ffecea52+2cb-iommu_group_lifetime_jgg@nvidia.com (mailing list archive)
State New, archived
Headers show
Series Fix splats releated to using the iommu_group after destroying devices | expand

Commit Message

Jason Gunthorpe Sept. 8, 2022, 6:45 p.m. UTC
default domains created a situation where the device is always connected
to a domain of some kind. When the device is idle it is connected to one
of the two pre-existing domains in the group, blocking_domain or
default_domain. In this way we have a continuous assertion of what state
the transation is in.

When this is all destructed then we need to remove all the devices from
their domains via the ops->release_device() call before the domain can be
freed. This is the bug recognized in commit 9ac8545199a1 ("iommu: Fix
use-after-free in iommu_release_device").

However, we must also stop any concurrent access to the iommu driver for
this device before we destroy it. This is done by:

 1) Drivers only using the iommu API while they have a device driver
    attached to the device. This directly prevents release from happening.

 2) Removing the device from the group list so any lingering group
    references no longer refer to the device. This is done by
    iommu_group_remove_device()

Since iommu_group_remove_device() has been moved this breaks #2 and
triggers an WARN when VFIO races group activities with the release of the
device:

   iommu driver failed to attach the default/blocking domain
   WARNING: CPU: 0 PID: 5082 at drivers/iommu/iommu.c:1961 iommu_detach_group+0x6c/0x80
   Modules linked in: macvtap macvlan tap vfio_pci vfio_pci_core irqbypass vfio_virqfd kvm nft_fib_inet nft_fib_ipv4 nft_fib_ipv6>
   CPU: 0 PID: 5082 Comm: qemu-system-s39 Tainted: G        W          6.0.0-rc3 #5
   Hardware name: IBM 3931 A01 782 (LPAR)
   Krnl PSW : 0704c00180000000 000000095bb10d28 (iommu_detach_group+0x70/0x80)
	      R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
   Krnl GPRS: 0000000000000001 0000000900000027 0000000000000039 000000095c97ffe0
	      00000000fffeffff 00000009fc290000 00000000af1fda50 00000000af590b58
	      00000000af1fdaf0 0000000135c7a320 0000000135e52258 0000000135e52200
	      00000000a29e8000 00000000af590b40 000000095bb10d24 0000038004b13c98
   Krnl Code: 000000095bb10d18: c020003d56fc        larl    %r2,000000095c2bbb10
			  000000095bb10d1e: c0e50019d901        brasl   %r14,000000095be4bf20
			 #000000095bb10d24: af000000            mc      0,0
			 >000000095bb10d28: b904002a            lgr     %r2,%r10
			  000000095bb10d2c: ebaff0a00004        lmg     %r10,%r15,160(%r15)
			  000000095bb10d32: c0f4001aa867        brcl    15,000000095be65e00
			  000000095bb10d38: c004002168e0        brcl    0,000000095bf3def8
			  000000095bb10d3e: eb6ff0480024        stmg    %r6,%r15,72(%r15)
   Call Trace:
    [<000000095bb10d28>] iommu_detach_group+0x70/0x80
   ([<000000095bb10d24>] iommu_detach_group+0x6c/0x80)
    [<000003ff80243b0e>] vfio_iommu_type1_detach_group+0x136/0x6c8 [vfio_iommu_type1]
    [<000003ff80137780>] __vfio_group_unset_container+0x58/0x158 [vfio]
    [<000003ff80138a16>] vfio_group_fops_unl_ioctl+0x1b6/0x210 [vfio]
   pci 0004:00:00.0: Removing from iommu group 4
    [<000000095b5b62e8>] __s390x_sys_ioctl+0xc0/0x100
    [<000000095be5d3b4>] __do_syscall+0x1d4/0x200
    [<000000095be6c072>] system_call+0x82/0xb0
   Last Breaking-Event-Address:
    [<000000095be4bf80>] __warn_printk+0x60/0x68

So, put things in the right order:
 - Remove the device from the group's list
 - Release the device from the iommu driver to drop all domain references
 - Free the domains

This is done by splitting out the kobject_put(), which triggers
iommu_group_release(), from the rest of iommu_group_remove_device() and
placing it after release is called.

Fixes: 9ac8545199a1 ("iommu: Fix use-after-free in iommu_release_device")
Reported-by: Matthew Rosato <mjrosato@linux.ibm.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/iommu.c | 36 +++++++++++++++++++++++++++---------
 1 file changed, 27 insertions(+), 9 deletions(-)

Comments

Robin Murphy Sept. 8, 2022, 9:05 p.m. UTC | #1
On 2022-09-08 19:45, Jason Gunthorpe wrote:
> default domains created a situation where the device is always connected
> to a domain of some kind. When the device is idle it is connected to one
> of the two pre-existing domains in the group, blocking_domain or
> default_domain. In this way we have a continuous assertion of what state
> the transation is in.
> 
> When this is all destructed then we need to remove all the devices from
> their domains via the ops->release_device() call before the domain can be
> freed. This is the bug recognized in commit 9ac8545199a1 ("iommu: Fix
> use-after-free in iommu_release_device").
> 
> However, we must also stop any concurrent access to the iommu driver for
> this device before we destroy it. This is done by:
> 
>   1) Drivers only using the iommu API while they have a device driver
>      attached to the device. This directly prevents release from happening.
> 
>   2) Removing the device from the group list so any lingering group
>      references no longer refer to the device. This is done by
>      iommu_group_remove_device()
> 
> Since iommu_group_remove_device() has been moved this breaks #2 and
> triggers an WARN when VFIO races group activities with the release of the
> device:
> 
>     iommu driver failed to attach the default/blocking domain
>     WARNING: CPU: 0 PID: 5082 at drivers/iommu/iommu.c:1961 iommu_detach_group+0x6c/0x80
>     Modules linked in: macvtap macvlan tap vfio_pci vfio_pci_core irqbypass vfio_virqfd kvm nft_fib_inet nft_fib_ipv4 nft_fib_ipv6>
>     CPU: 0 PID: 5082 Comm: qemu-system-s39 Tainted: G        W          6.0.0-rc3 #5
>     Hardware name: IBM 3931 A01 782 (LPAR)
>     Krnl PSW : 0704c00180000000 000000095bb10d28 (iommu_detach_group+0x70/0x80)
> 	      R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
>     Krnl GPRS: 0000000000000001 0000000900000027 0000000000000039 000000095c97ffe0
> 	      00000000fffeffff 00000009fc290000 00000000af1fda50 00000000af590b58
> 	      00000000af1fdaf0 0000000135c7a320 0000000135e52258 0000000135e52200
> 	      00000000a29e8000 00000000af590b40 000000095bb10d24 0000038004b13c98
>     Krnl Code: 000000095bb10d18: c020003d56fc        larl    %r2,000000095c2bbb10
> 			  000000095bb10d1e: c0e50019d901        brasl   %r14,000000095be4bf20
> 			 #000000095bb10d24: af000000            mc      0,0
> 			 >000000095bb10d28: b904002a            lgr     %r2,%r10
> 			  000000095bb10d2c: ebaff0a00004        lmg     %r10,%r15,160(%r15)
> 			  000000095bb10d32: c0f4001aa867        brcl    15,000000095be65e00
> 			  000000095bb10d38: c004002168e0        brcl    0,000000095bf3def8
> 			  000000095bb10d3e: eb6ff0480024        stmg    %r6,%r15,72(%r15)
>     Call Trace:
>      [<000000095bb10d28>] iommu_detach_group+0x70/0x80
>     ([<000000095bb10d24>] iommu_detach_group+0x6c/0x80)
>      [<000003ff80243b0e>] vfio_iommu_type1_detach_group+0x136/0x6c8 [vfio_iommu_type1]
>      [<000003ff80137780>] __vfio_group_unset_container+0x58/0x158 [vfio]
>      [<000003ff80138a16>] vfio_group_fops_unl_ioctl+0x1b6/0x210 [vfio]
>     pci 0004:00:00.0: Removing from iommu group 4
>      [<000000095b5b62e8>] __s390x_sys_ioctl+0xc0/0x100
>      [<000000095be5d3b4>] __do_syscall+0x1d4/0x200
>      [<000000095be6c072>] system_call+0x82/0xb0
>     Last Breaking-Event-Address:
>      [<000000095be4bf80>] __warn_printk+0x60/0x68
> 
> So, put things in the right order:
>   - Remove the device from the group's list
>   - Release the device from the iommu driver to drop all domain references
>   - Free the domains
> 
> This is done by splitting out the kobject_put(), which triggers
> iommu_group_release(), from the rest of iommu_group_remove_device() and
> placing it after release is called.

So simple... now how did I fail to think of that? :)

> Fixes: 9ac8545199a1 ("iommu: Fix use-after-free in iommu_release_device")
> Reported-by: Matthew Rosato <mjrosato@linux.ibm.com>
> Cc: Robin Murphy <robin.murphy@arm.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>   drivers/iommu/iommu.c | 36 +++++++++++++++++++++++++++---------
>   1 file changed, 27 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 780fb70715770d..c451bf715182ac 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -90,6 +90,7 @@ static int iommu_create_device_direct_mappings(struct iommu_group *group,
>   static struct iommu_group *iommu_group_get_for_dev(struct device *dev);
>   static ssize_t iommu_group_store_type(struct iommu_group *group,
>   				      const char *buf, size_t count);
> +static void __iommu_group_remove_device(struct device *dev);
>   
>   #define IOMMU_GROUP_ATTR(_name, _mode, _show, _store)		\
>   struct iommu_group_attribute iommu_group_attr_##_name =		\
> @@ -330,6 +331,7 @@ int iommu_probe_device(struct device *dev)
>   
>   void iommu_release_device(struct device *dev)
>   {
> +	struct iommu_group *group = dev->iommu_group;
>   	const struct iommu_ops *ops;
>   
>   	if (!dev->iommu)
> @@ -337,11 +339,20 @@ void iommu_release_device(struct device *dev)
>   
>   	iommu_device_unlink(dev->iommu->iommu_dev, dev);
>   

In fact, now that you've made it obvious, could we not simply do an 
extra kobject_get() here before calling regular 
iommu_group_remove_device(), and avoid having to split that up at all? 
That should delay any default domain teardown just as definitively as 
holding the original reference for longer, no?

Thanks,
Robin.

> +	__iommu_group_remove_device(dev);
>   	ops = dev_iommu_ops(dev);
>   	if (ops->release_device)
>   		ops->release_device(dev);
>   
> -	iommu_group_remove_device(dev);
> +	/*
> +	 * This will eventually call iommu_group_release() which will free the
> +	 * iommu_domains. Up until the release_device() above the iommu_domains
> +	 * may still have been associated with the device, and we cannot free
> +	 * them until the have been detached. release_device() is expected to
> +	 * detach all domains connected to the dev.
> +	 */
> +	kobject_put(group->devices_kobj);
> +
>   	module_put(ops->owner);
>   	dev_iommu_free(dev);
>   }
> @@ -939,14 +950,7 @@ int iommu_group_add_device(struct iommu_group *group, struct device *dev)
>   }
>   EXPORT_SYMBOL_GPL(iommu_group_add_device);
>   
> -/**
> - * iommu_group_remove_device - remove a device from it's current group
> - * @dev: device to be removed
> - *
> - * This function is called by an iommu driver to remove the device from
> - * it's current group.  This decrements the iommu group reference count.
> - */
> -void iommu_group_remove_device(struct device *dev)
> +static void __iommu_group_remove_device(struct device *dev)
>   {
>   	struct iommu_group *group = dev->iommu_group;
>   	struct group_device *tmp_device, *device = NULL;
> @@ -977,6 +981,20 @@ void iommu_group_remove_device(struct device *dev)
>   	kfree(device->name);
>   	kfree(device);
>   	dev->iommu_group = NULL;
> +}
> +
> +/**
> + * iommu_group_remove_device - remove a device from it's current group
> + * @dev: device to be removed
> + *
> + * This function is called by an iommu driver to remove the device from
> + * it's current group.  This decrements the iommu group reference count.
> + */
> +void iommu_group_remove_device(struct device *dev)
> +{
> +	struct iommu_group *group = dev->iommu_group;
> +
> +	__iommu_group_remove_device(dev);
>   	kobject_put(group->devices_kobj);
>   }
>   EXPORT_SYMBOL_GPL(iommu_group_remove_device);
Robin Murphy Sept. 8, 2022, 9:27 p.m. UTC | #2
On 2022-09-08 22:05, Robin Murphy wrote:
> On 2022-09-08 19:45, Jason Gunthorpe wrote:
>> default domains created a situation where the device is always connected
>> to a domain of some kind. When the device is idle it is connected to one
>> of the two pre-existing domains in the group, blocking_domain or
>> default_domain. In this way we have a continuous assertion of what state
>> the transation is in.
>>
>> When this is all destructed then we need to remove all the devices from
>> their domains via the ops->release_device() call before the domain can be
>> freed. This is the bug recognized in commit 9ac8545199a1 ("iommu: Fix
>> use-after-free in iommu_release_device").
>>
>> However, we must also stop any concurrent access to the iommu driver for
>> this device before we destroy it. This is done by:
>>
>>   1) Drivers only using the iommu API while they have a device driver
>>      attached to the device. This directly prevents release from 
>> happening.
>>
>>   2) Removing the device from the group list so any lingering group
>>      references no longer refer to the device. This is done by
>>      iommu_group_remove_device()
>>
>> Since iommu_group_remove_device() has been moved this breaks #2 and
>> triggers an WARN when VFIO races group activities with the release of the
>> device:
>>
>>     iommu driver failed to attach the default/blocking domain
>>     WARNING: CPU: 0 PID: 5082 at drivers/iommu/iommu.c:1961 
>> iommu_detach_group+0x6c/0x80
>>     Modules linked in: macvtap macvlan tap vfio_pci vfio_pci_core 
>> irqbypass vfio_virqfd kvm nft_fib_inet nft_fib_ipv4 nft_fib_ipv6>
>>     CPU: 0 PID: 5082 Comm: qemu-system-s39 Tainted: G        
>> W          6.0.0-rc3 #5
>>     Hardware name: IBM 3931 A01 782 (LPAR)
>>     Krnl PSW : 0704c00180000000 000000095bb10d28 
>> (iommu_detach_group+0x70/0x80)
>>           R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
>>     Krnl GPRS: 0000000000000001 0000000900000027 0000000000000039 
>> 000000095c97ffe0
>>           00000000fffeffff 00000009fc290000 00000000af1fda50 
>> 00000000af590b58
>>           00000000af1fdaf0 0000000135c7a320 0000000135e52258 
>> 0000000135e52200
>>           00000000a29e8000 00000000af590b40 000000095bb10d24 
>> 0000038004b13c98
>>     Krnl Code: 000000095bb10d18: c020003d56fc        larl    
>> %r2,000000095c2bbb10
>>               000000095bb10d1e: c0e50019d901        brasl   
>> %r14,000000095be4bf20
>>              #000000095bb10d24: af000000            mc      0,0
>>              >000000095bb10d28: b904002a            lgr     %r2,%r10
>>               000000095bb10d2c: ebaff0a00004        lmg     
>> %r10,%r15,160(%r15)
>>               000000095bb10d32: c0f4001aa867        brcl    
>> 15,000000095be65e00
>>               000000095bb10d38: c004002168e0        brcl    
>> 0,000000095bf3def8
>>               000000095bb10d3e: eb6ff0480024        stmg    
>> %r6,%r15,72(%r15)
>>     Call Trace:
>>      [<000000095bb10d28>] iommu_detach_group+0x70/0x80
>>     ([<000000095bb10d24>] iommu_detach_group+0x6c/0x80)
>>      [<000003ff80243b0e>] vfio_iommu_type1_detach_group+0x136/0x6c8 
>> [vfio_iommu_type1]
>>      [<000003ff80137780>] __vfio_group_unset_container+0x58/0x158 [vfio]
>>      [<000003ff80138a16>] vfio_group_fops_unl_ioctl+0x1b6/0x210 [vfio]
>>     pci 0004:00:00.0: Removing from iommu group 4
>>      [<000000095b5b62e8>] __s390x_sys_ioctl+0xc0/0x100
>>      [<000000095be5d3b4>] __do_syscall+0x1d4/0x200
>>      [<000000095be6c072>] system_call+0x82/0xb0
>>     Last Breaking-Event-Address:
>>      [<000000095be4bf80>] __warn_printk+0x60/0x68
>>
>> So, put things in the right order:
>>   - Remove the device from the group's list
>>   - Release the device from the iommu driver to drop all domain 
>> references
>>   - Free the domains
>>
>> This is done by splitting out the kobject_put(), which triggers
>> iommu_group_release(), from the rest of iommu_group_remove_device() and
>> placing it after release is called.
> 
> So simple... now how did I fail to think of that? :)

Oh, because s390 is using iommu_get_domain_for_dev() in its 
release_device callback, which needs to dereference the group to work, 
and the current domain may also be a non-default one which we can't 
prevent from disappearing racily, that was why :(

I think we may be back to looking at s390 having to keep track of 
domains internally like SMMUv3 does, and both drivers synchronising 
between their domain_free and release_device to to do their internal 
detach from either place... unless possibly the core code starts 
refcounting domains as well?

Robin.

>> Fixes: 9ac8545199a1 ("iommu: Fix use-after-free in iommu_release_device")
>> Reported-by: Matthew Rosato <mjrosato@linux.ibm.com>
>> Cc: Robin Murphy <robin.murphy@arm.com>
>> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
>> ---
>>   drivers/iommu/iommu.c | 36 +++++++++++++++++++++++++++---------
>>   1 file changed, 27 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>> index 780fb70715770d..c451bf715182ac 100644
>> --- a/drivers/iommu/iommu.c
>> +++ b/drivers/iommu/iommu.c
>> @@ -90,6 +90,7 @@ static int 
>> iommu_create_device_direct_mappings(struct iommu_group *group,
>>   static struct iommu_group *iommu_group_get_for_dev(struct device *dev);
>>   static ssize_t iommu_group_store_type(struct iommu_group *group,
>>                         const char *buf, size_t count);
>> +static void __iommu_group_remove_device(struct device *dev);
>>   #define IOMMU_GROUP_ATTR(_name, _mode, _show, _store)        \
>>   struct iommu_group_attribute iommu_group_attr_##_name =        \
>> @@ -330,6 +331,7 @@ int iommu_probe_device(struct device *dev)
>>   void iommu_release_device(struct device *dev)
>>   {
>> +    struct iommu_group *group = dev->iommu_group;
>>       const struct iommu_ops *ops;
>>       if (!dev->iommu)
>> @@ -337,11 +339,20 @@ void iommu_release_device(struct device *dev)
>>       iommu_device_unlink(dev->iommu->iommu_dev, dev);
> 
> In fact, now that you've made it obvious, could we not simply do an 
> extra kobject_get() here before calling regular 
> iommu_group_remove_device(), and avoid having to split that up at all? 
> That should delay any default domain teardown just as definitively as 
> holding the original reference for longer, no?
> 
> Thanks,
> Robin.
> 
>> +    __iommu_group_remove_device(dev);
>>       ops = dev_iommu_ops(dev);
>>       if (ops->release_device)
>>           ops->release_device(dev);
>> -    iommu_group_remove_device(dev);
>> +    /*
>> +     * This will eventually call iommu_group_release() which will 
>> free the
>> +     * iommu_domains. Up until the release_device() above the 
>> iommu_domains
>> +     * may still have been associated with the device, and we cannot 
>> free
>> +     * them until the have been detached. release_device() is 
>> expected to
>> +     * detach all domains connected to the dev.
>> +     */
>> +    kobject_put(group->devices_kobj);
>> +
>>       module_put(ops->owner);
>>       dev_iommu_free(dev);
>>   }
>> @@ -939,14 +950,7 @@ int iommu_group_add_device(struct iommu_group 
>> *group, struct device *dev)
>>   }
>>   EXPORT_SYMBOL_GPL(iommu_group_add_device);
>> -/**
>> - * iommu_group_remove_device - remove a device from it's current group
>> - * @dev: device to be removed
>> - *
>> - * This function is called by an iommu driver to remove the device from
>> - * it's current group.  This decrements the iommu group reference count.
>> - */
>> -void iommu_group_remove_device(struct device *dev)
>> +static void __iommu_group_remove_device(struct device *dev)
>>   {
>>       struct iommu_group *group = dev->iommu_group;
>>       struct group_device *tmp_device, *device = NULL;
>> @@ -977,6 +981,20 @@ void iommu_group_remove_device(struct device *dev)
>>       kfree(device->name);
>>       kfree(device);
>>       dev->iommu_group = NULL;
>> +}
>> +
>> +/**
>> + * iommu_group_remove_device - remove a device from it's current group
>> + * @dev: device to be removed
>> + *
>> + * This function is called by an iommu driver to remove the device from
>> + * it's current group.  This decrements the iommu group reference count.
>> + */
>> +void iommu_group_remove_device(struct device *dev)
>> +{
>> +    struct iommu_group *group = dev->iommu_group;
>> +
>> +    __iommu_group_remove_device(dev);
>>       kobject_put(group->devices_kobj);
>>   }
>>   EXPORT_SYMBOL_GPL(iommu_group_remove_device);
>
Jason Gunthorpe Sept. 8, 2022, 9:43 p.m. UTC | #3
On Thu, Sep 08, 2022 at 10:27:06PM +0100, Robin Murphy wrote:

> Oh, because s390 is using iommu_get_domain_for_dev() in its release_device
> callback, which needs to dereference the group to work, and the current
> domain may also be a non-default one which we can't prevent from
> disappearing racily, that was why :(

Hum, the issue there is the use of device->iommu_group - but that just
means I didn't split properly. How about this incremental:

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index c451bf715182ac..99ef799f3fe6b5 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -351,6 +351,7 @@ void iommu_release_device(struct device *dev)
 	 * them until the have been detached. release_device() is expected to
 	 * detach all domains connected to the dev.
 	 */
+	dev->iommu_group = NULL;
 	kobject_put(group->devices_kobj);
 
 	module_put(ops->owner);
@@ -980,7 +981,6 @@ static void __iommu_group_remove_device(struct device *dev)
 
 	kfree(device->name);
 	kfree(device);
-	dev->iommu_group = NULL;
 }
 
 /**
@@ -995,6 +995,7 @@ void iommu_group_remove_device(struct device *dev)
 	struct iommu_group *group = dev->iommu_group;
 
 	__iommu_group_remove_device(dev);
+	dev->iommu_group = NULL;
 	kobject_put(group->devices_kobj);
 }
 EXPORT_SYMBOL_GPL(iommu_group_remove_device);

To me it makes sense that the driver should be able to continue to
query the iommu_group during release anyhow..

And to your other question, the reason I split the function is because
I couldn't really say WTF iommu_group_remove_device() was supposed to
do. The __ version make ssense as part of the remove_device, due to
the sequencing with ops->release()

But the other one doesn't have that. So I want to put in a:

   WARN_ON(group->blocking_domain || group->default_domain);

Because calling it after those domains are allocated looks broken to
me.

Jason
Robin Murphy Sept. 9, 2022, 9:05 a.m. UTC | #4
On 2022-09-08 22:43, Jason Gunthorpe wrote:
> On Thu, Sep 08, 2022 at 10:27:06PM +0100, Robin Murphy wrote:
> 
>> Oh, because s390 is using iommu_get_domain_for_dev() in its release_device
>> callback, which needs to dereference the group to work, and the current
>> domain may also be a non-default one which we can't prevent from
>> disappearing racily, that was why :(
> 
> Hum, the issue there is the use of device->iommu_group - but that just
> means I didn't split properly. How about this incremental:

That did cross my mind, but it's a bit grim. In the light of the 
morning, I'm not sure s390 actually *needs* the group anyway - AFAICS if 
iommu_group_remove_device() has been processed first, that will have 
synchronised against any concurrent attach/detach, so zdev->s390_domain 
can be assumed to be up to date and used directly without the round trip 
through iommu_get_domain_for_dev(). That then only leaves the issue that 
that domain may still become invalid at any point after the group mutex 
has been dropped.

> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index c451bf715182ac..99ef799f3fe6b5 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -351,6 +351,7 @@ void iommu_release_device(struct device *dev)
>   	 * them until the have been detached. release_device() is expected to
>   	 * detach all domains connected to the dev.
>   	 */
> +	dev->iommu_group = NULL;
>   	kobject_put(group->devices_kobj);
>   
>   	module_put(ops->owner);
> @@ -980,7 +981,6 @@ static void __iommu_group_remove_device(struct device *dev)
>   
>   	kfree(device->name);
>   	kfree(device);
> -	dev->iommu_group = NULL;
>   }
>   
>   /**
> @@ -995,6 +995,7 @@ void iommu_group_remove_device(struct device *dev)
>   	struct iommu_group *group = dev->iommu_group;
>   
>   	__iommu_group_remove_device(dev);
> +	dev->iommu_group = NULL;
>   	kobject_put(group->devices_kobj);
>   }
>   EXPORT_SYMBOL_GPL(iommu_group_remove_device);
> 
> To me it makes sense that the driver should be able to continue to
> query the iommu_group during release anyhow..

I'm not so sure, release shouldn't be depending on a group since there 
may never have been one anyway. Perhaps the answer is an extra 
pre-release step to balance probe_finalize?

> And to your other question, the reason I split the function is because
> I couldn't really say WTF iommu_group_remove_device() was supposed to
> do. The __ version make ssense as part of the remove_device, due to
> the sequencing with ops->release()
> 
> But the other one doesn't have that. So I want to put in a:
> 
>     WARN_ON(group->blocking_domain || group->default_domain);
> 
> Because calling it after those domains are allocated looks broken to
> me.

I might be misunderstanding, but that sounds backwards - if a real 
device is being hotplugged out, we absolutely expect that to happen 
*after* its default domain has been set up. The external callers are 
using fake groups where default domains aren't relevant, and I have no 
idea what PAMU is doing but it's been doing it for long enough that it 
most likely isn't a problem. Thus wherever that check would be it would 
seem either wrong or unnecessary.

Thanks,
Robin.
Jason Gunthorpe Sept. 9, 2022, 1:25 p.m. UTC | #5
On Fri, Sep 09, 2022 at 10:05:58AM +0100, Robin Murphy wrote:
> On 2022-09-08 22:43, Jason Gunthorpe wrote:
> > On Thu, Sep 08, 2022 at 10:27:06PM +0100, Robin Murphy wrote:
> > 
> > > Oh, because s390 is using iommu_get_domain_for_dev() in its release_device
> > > callback, which needs to dereference the group to work, and the current
> > > domain may also be a non-default one which we can't prevent from
> > > disappearing racily, that was why :(
> > 
> > Hum, the issue there is the use of device->iommu_group - but that just
> > means I didn't split properly. How about this incremental:
> 
> That did cross my mind, but it's a bit grim.

Actually, also in my morning, I think it may not even be necessary.

Keep in mind the start of the series fixes VFIO.

The bug that S390 is trying to fix is that VFIO didn't put back the
group ownership, it just left its own iommu_domain attached and called
release().

But now, at least for single device groups, VFIO will put owenership
back and zdev->s390_domain == NULL when we get to release_device()

> That then only leaves the issue that that domain may still become
> invalid at any point after the group mutex has been dropped.

So that is this race:

        CPU0                         CPU1 
   iommu_release_device(a)
      __iommu_group_remove_device(a)
			         iommu_device_use_default_domain(b)
                                 iommu_domain_free(domain)
                                 iommu_release_device(b)
                                      ops->release_device(b)
      ops->release_device(a) 
        // Boom, a is still attached to domain :(

I can't think of how to solve this other than holding the group mutex
across release_device. See below.

> > And to your other question, the reason I split the function is because
> > I couldn't really say WTF iommu_group_remove_device() was supposed to
> > do. The __ version make ssense as part of the remove_device, due to
> > the sequencing with ops->release()
> > 
> > But the other one doesn't have that. So I want to put in a:
> > 
> >     WARN_ON(group->blocking_domain || group->default_domain);
> > 
> > Because calling it after those domains are allocated looks broken to
> > me.
> 
> I might be misunderstanding, but that sounds backwards - if a real device is
> being hotplugged out, we absolutely expect that to happen *after* its
> default domain has been set up.

See below for what I mean

iommu_group_remove_device() doesn't work as an API because it has no
way to tell the device to stop using the domain we are about to free.

So it should assert that there is no domain to worry about. For the
vfio and power case there is no domain because they don't use iommu
drivers

For FSL it does not use default domains so it will also be OK.

Also, I think FSL is broken and it should not be trying to remove the
"PCI controller" from a group. Every PCI device behind an IOMMU should
be linked to a group.

The only reason I can think someone would wanted to do this is because
they ran into trouble with the VFIO viability checks. But we have a
robust solution to that now that doesn't require abusing the group
like this.

Jason

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 780fb70715770d..cb83576b1877d5 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -90,6 +90,10 @@ static int iommu_create_device_direct_mappings(struct iommu_group *group,
 static struct iommu_group *iommu_group_get_for_dev(struct device *dev);
 static ssize_t iommu_group_store_type(struct iommu_group *group,
 				      const char *buf, size_t count);
+static struct group_device *
+__iommu_group_remove_device(struct iommu_group *group, struct device *dev);
+static void __iommu_group_remove_device_sysfs(struct iommu_group *group,
+					      struct group_device *device);
 
 #define IOMMU_GROUP_ATTR(_name, _mode, _show, _store)		\
 struct iommu_group_attribute iommu_group_attr_##_name =		\
@@ -330,18 +334,50 @@ int iommu_probe_device(struct device *dev)
 
 void iommu_release_device(struct device *dev)
 {
+	struct iommu_group *group = dev->iommu_group;
 	const struct iommu_ops *ops;
+	struct group_device *device;
 
 	if (!dev->iommu)
 		return;
 
 	iommu_device_unlink(dev->iommu->iommu_dev, dev);
 
+	mutex_lock(&group->mutex);
+	device = __iommu_group_remove_device(group, dev);
 	ops = dev_iommu_ops(dev);
+
+	/*
+	 * If the group has become empty then ownership must have been released,
+	 * and the current domain must be set back to NULL or the default
+	 * domain.
+	 */
+	if (list_empty(&group->devices))
+		WARN_ON(group->owner_cnt ||
+			group->domain != group->default_domain);
+
+	/*
+	 * release_device() must stop using any attached domain on the device.
+	 * If there are still other devices in the group they are not effected
+	 * by this callback.
+	 *
+	 * The IOMMU driver must set the device to either an identity or
+	 * blocking translation and stop using any domain pointer, as it is
+	 * going to be freed.
+	 */
 	if (ops->release_device)
 		ops->release_device(dev);
+	mutex_unlock(&group->mutex);
+
+	__iommu_group_remove_device_sysfs(group, device);
+
+	/*
+	 * This will eventually call iommu_group_release() which will free the
+	 * iommu_domains.
+	 */
+	dev->iommu_group = NULL;
+	kobject_put(group->devices_kobj);
 
-	iommu_group_remove_device(dev);
 	module_put(ops->owner);
 	dev_iommu_free(dev);
 }
@@ -939,44 +975,69 @@ int iommu_group_add_device(struct iommu_group *group, struct device *dev)
 }
 EXPORT_SYMBOL_GPL(iommu_group_add_device);
 
-/**
- * iommu_group_remove_device - remove a device from it's current group
- * @dev: device to be removed
- *
- * This function is called by an iommu driver to remove the device from
- * it's current group.  This decrements the iommu group reference count.
- */
-void iommu_group_remove_device(struct device *dev)
+static struct group_device *
+__iommu_group_remove_device(struct iommu_group *group, struct device *dev)
 {
-	struct iommu_group *group = dev->iommu_group;
-	struct group_device *tmp_device, *device = NULL;
+	struct group_device *device;
+
+	lockdep_assert_held(&group->mutex);
 
 	if (!group)
-		return;
+		return NULL;
 
 	dev_info(dev, "Removing from iommu group %d\n", group->id);
 
-	mutex_lock(&group->mutex);
-	list_for_each_entry(tmp_device, &group->devices, list) {
-		if (tmp_device->dev == dev) {
-			device = tmp_device;
+	list_for_each_entry(device, &group->devices, list) {
+		if (device->dev == dev) {
 			list_del(&device->list);
-			break;
+			return device;
 		}
 	}
-	mutex_unlock(&group->mutex);
+	return NULL;
+}
 
+static void __iommu_group_remove_device_sysfs(struct iommu_group *group,
+					      struct group_device *device)
+{
 	if (!device)
 		return;
 
 	sysfs_remove_link(group->devices_kobj, device->name);
-	sysfs_remove_link(&dev->kobj, "iommu_group");
+	sysfs_remove_link(&device->dev->kobj, "iommu_group");
 
-	trace_remove_device_from_group(group->id, dev);
+	trace_remove_device_from_group(group->id, device->dev);
 
 	kfree(device->name);
 	kfree(device);
-	dev->iommu_group = NULL;
+}
+
+/**
+ * iommu_group_remove_device - remove a device from it's current group
+ * @dev: device to be removed
+ *
+ * This function is called by an iommu driver to remove the device from
+ * it's current group.  This decrements the iommu group reference count.
+ */
+void iommu_group_remove_device(struct device *dev)
+{
+	struct iommu_group *group = dev->iommu_group;
+	struct group_device *device;
+
+	/*
+	 * Since we don't do ops->release_device() there is no way for the
+	 * driver to stop using any attached domain before we free it. If a
+	 * domain is attached while this function is called it will eventually
+	 * UAF.
+	 *
+	 * Thus it is only useful for cases like VFIO/SPAPR that don't use an
+	 * iommu driver, or for cases like FSL that don't use default domains.
+	 */
+	WARN_ON(group->domain);
+
+	mutex_lock(&group->mutex);
+	device = __iommu_group_remove_device(group, dev);
+	mutex_unlock(&group->mutex);
+	__iommu_group_remove_device_sysfs(group, device);
 	kobject_put(group->devices_kobj);
 }
 EXPORT_SYMBOL_GPL(iommu_group_remove_device);
Robin Murphy Sept. 9, 2022, 5:57 p.m. UTC | #6
On 2022-09-09 14:25, Jason Gunthorpe wrote:
> On Fri, Sep 09, 2022 at 10:05:58AM +0100, Robin Murphy wrote:
>> On 2022-09-08 22:43, Jason Gunthorpe wrote:
>>> On Thu, Sep 08, 2022 at 10:27:06PM +0100, Robin Murphy wrote:
>>>
>>>> Oh, because s390 is using iommu_get_domain_for_dev() in its release_device
>>>> callback, which needs to dereference the group to work, and the current
>>>> domain may also be a non-default one which we can't prevent from
>>>> disappearing racily, that was why :(
>>>
>>> Hum, the issue there is the use of device->iommu_group - but that just
>>> means I didn't split properly. How about this incremental:
>>
>> That did cross my mind, but it's a bit grim.
> 
> Actually, also in my morning, I think it may not even be necessary.
> 
> Keep in mind the start of the series fixes VFIO.
> 
> The bug that S390 is trying to fix is that VFIO didn't put back the
> group ownership, it just left its own iommu_domain attached and called
> release().
> 
> But now, at least for single device groups, VFIO will put owenership
> back and zdev->s390_domain == NULL when we get to release_device()
> 
>> That then only leaves the issue that that domain may still become
>> invalid at any point after the group mutex has been dropped.
> 
> So that is this race:
> 
>          CPU0                         CPU1
>     iommu_release_device(a)
>        __iommu_group_remove_device(a)
> 			         iommu_device_use_default_domain(b)
>                                   iommu_domain_free(domain)
>                                   iommu_release_device(b)
>                                        ops->release_device(b)
>        ops->release_device(a)
>          // Boom, a is still attached to domain :(
> 
> I can't think of how to solve this other than holding the group mutex
> across release_device. See below.

I see a few possibilities:

- Backtrack slightly on its removal, and instead repurpose detach_dev
into a specialised domain cleanup callback, called before or during
iommu_group_remove_device(), with the group mutex held.

- Drivers that hold any kind of internal per-device references to
domains - which is generally the root of this issue in the first place -
can implement proper reference counting, so even if a domain is "freed"
with a device still attached as above, it doesn't actually go away until
release_device(a) cleans up the final dangling reference. I suggested
the core doing this generically, but on reflection I think it's actually
a lot more straightforward as a driver-internal thing.

- Drivers that basically just keep a list of devices in the domain and
need to do a list_del() in release_device, can also list_del_init() any
still-attached devices in domain_free, with a simple per-instance or
global lock to serialise the two.

>>> And to your other question, the reason I split the function is because
>>> I couldn't really say WTF iommu_group_remove_device() was supposed to
>>> do. The __ version make ssense as part of the remove_device, due to
>>> the sequencing with ops->release()
>>>
>>> But the other one doesn't have that. So I want to put in a:
>>>
>>>      WARN_ON(group->blocking_domain || group->default_domain);
>>>
>>> Because calling it after those domains are allocated looks broken to
>>> me.
>>
>> I might be misunderstanding, but that sounds backwards - if a real device is
>> being hotplugged out, we absolutely expect that to happen *after* its
>> default domain has been set up.
> 
> See below for what I mean
> 
> iommu_group_remove_device() doesn't work as an API because it has no
> way to tell the device to stop using the domain we are about to free.
> 
> So it should assert that there is no domain to worry about. For the
> vfio and power case there is no domain because they don't use iommu
> drivers

Ah, I see it now - if we think it's a usage error for any current API
user to allow a device to be removed while still attached to a non-
default domain, then we can just throw our hands up at that, and
mitigate for the default domain case that we *can* control. I'm not 100%
convinced there might not be some niche non-uAPI case for skipping a
detach because you know you're tearing down your device and domain at the
same time, but I'm inclined to agree that we can worry about that if and
when it does ever come up.

If so, I reckon it should be about as as easy as this (untested).

Cheers,
Robin.

----->8-----
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 9fbe5d067473..760d9bd3ad66 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -396,17 +396,25 @@ int iommu_probe_device(struct device *dev)
  void iommu_release_device(struct device *dev)
  {
  	const struct iommu_ops *ops;
+	struct iommu_group *group;
  
  	if (!dev->iommu)
  		return;
  
  	iommu_device_unlink(dev->iommu->iommu_dev, dev);
  
+	/*
+	 * Some drivers track a device's current domain internally and may
+	 * dereference it to clean up in release_device. If a default domain
+	 * exists, hold a reference to ensure it stays around long enough.
+	 */
+	group = iommu_group_get(dev);
+	iommu_group_remove_device(dev);
  	ops = dev_iommu_ops(dev);
  	if (ops->release_device)
  		ops->release_device(dev);
  
-	iommu_group_remove_device(dev);
+	iommu_group_put(group);
  	module_put(ops->owner);
  	dev_iommu_free(dev);
  }
@@ -1022,6 +1030,14 @@ void iommu_group_remove_device(struct device *dev)
  	dev_info(dev, "Removing from iommu group %d\n", group->id);
  
  	mutex_lock(&group->mutex);
+	if (WARN_ON(group->domain != group->default_domain &&
+		    group->domain != group->blocking_domain)) {
+		if (group->default_domain)
+			__iommu_attach_device(group->default_domain, dev);
+		else
+			__iommu_detach_device(group->domain, dev);
+	}
+
  	list_for_each_entry(tmp_device, &group->devices, list) {
  		if (tmp_device->dev == dev) {
  			device = tmp_device;
Jason Gunthorpe Sept. 9, 2022, 6:30 p.m. UTC | #7
On Fri, Sep 09, 2022 at 06:57:59PM +0100, Robin Murphy wrote:

> > > That then only leaves the issue that that domain may still become
> > > invalid at any point after the group mutex has been dropped.
> > 
> > So that is this race:
> > 
> >          CPU0                         CPU1
> >     iommu_release_device(a)
> >        __iommu_group_remove_device(a)
> > 			         iommu_device_use_default_domain(b)
> >                                   iommu_domain_free(domain)
> >                                   iommu_release_device(b)
> >                                        ops->release_device(b)
> >        ops->release_device(a)
> >          // Boom, a is still attached to domain :(
> > 
> > I can't think of how to solve this other than holding the group mutex
> > across release_device. See below.
> 
> I see a few possibilities:
> 
> - Backtrack slightly on its removal, and instead repurpose detach_dev
> into a specialised domain cleanup callback, called before or during
> iommu_group_remove_device(), with the group mutex held.

See below for why that is somewhat troublesome..
 
> - Drivers that hold any kind of internal per-device references to
> domains - which is generally the root of this issue in the first place -
> can implement proper reference counting, so even if a domain is "freed"
> with a device still attached as above, it doesn't actually go away until
> release_device(a) cleans up the final dangling reference. I suggested
> the core doing this generically, but on reflection I think it's actually
> a lot more straightforward as a driver-internal thing.

Isn't this every driver though? Like every single driver implementing
an UNMANAGED/DMA/DMA_FQ domain has a hidden reference to the
iommu_domain - minimally to point the HW to the IOPTEs it stores.

> - Drivers that basically just keep a list of devices in the domain and
> need to do a list_del() in release_device, can also list_del_init() any
> still-attached devices in domain_free, with a simple per-instance or
> global lock to serialise the two.

Compared to just locking ops->release_device() these all seem more
complicated?

IMHO the core code should try to protect the driver from racing
release with anything else.

Do you know a reason not to hold the group mutex across
release_device? I think that is the most straightforward and
future proof.

Arguably all the device ops should be serialized under the group
mutex.

> @@ -1022,6 +1030,14 @@ void iommu_group_remove_device(struct device *dev)
>  	dev_info(dev, "Removing from iommu group %d\n", group->id);
>  	mutex_lock(&group->mutex);
> +	if (WARN_ON(group->domain != group->default_domain &&
> +		    group->domain != group->blocking_domain)) {

This will false trigger, if there are two VFIO devices then the group
will remained owned when we unplug one just of them, but the group's domain
will be a VFIO owned domain. 

It is why I put the list_empty() protection, as the test only works if
it is the last device.

> +		if (group->default_domain)
> +			__iommu_attach_device(group->default_domain, dev);
> +		else
> +			__iommu_detach_device(group->domain, dev);
> +	}

This was very appealing, but I rejected it because it is too difficult
to support multi-device groups that share the RID.

In that case we expect that the first attach/detach of a device on the
shared RID will reconfigure the iommu and the attach/deatch of all the
other devices on the group with the same parameters will be a NOP.

So in a VFIO configuration where two drivers are bound to a single
group with shared RID and we unplug one device, this will rebind the
shared RID and thus the entire group to blocking/default and break the
still running VFIO on the second device.

The device centric interface only works if we always apply the
operation to every device in the group..

This is why I kept it as ops->release_device() with an implicit detach
of the current domain inside the driver. release_device() has that
special meaning of 'detach the domain but do not change a shared RID'

And it misses the logic to WARN_ON if a domain is set and an external
entity wrongly uses iommu_group_remove_device()..

Jason
Robin Murphy Sept. 9, 2022, 7:55 p.m. UTC | #8
On 2022-09-09 19:30, Jason Gunthorpe wrote:
> On Fri, Sep 09, 2022 at 06:57:59PM +0100, Robin Murphy wrote:
> 
>>>> That then only leaves the issue that that domain may still become
>>>> invalid at any point after the group mutex has been dropped.
>>>
>>> So that is this race:
>>>
>>>           CPU0                         CPU1
>>>      iommu_release_device(a)
>>>         __iommu_group_remove_device(a)
>>> 			         iommu_device_use_default_domain(b)
>>>                                    iommu_domain_free(domain)
>>>                                    iommu_release_device(b)
>>>                                         ops->release_device(b)
>>>         ops->release_device(a)
>>>           // Boom, a is still attached to domain :(
>>>
>>> I can't think of how to solve this other than holding the group mutex
>>> across release_device. See below.
>>
>> I see a few possibilities:
>>
>> - Backtrack slightly on its removal, and instead repurpose detach_dev
>> into a specialised domain cleanup callback, called before or during
>> iommu_group_remove_device(), with the group mutex held.
> 
> See below for why that is somewhat troublesome..
>   
>> - Drivers that hold any kind of internal per-device references to
>> domains - which is generally the root of this issue in the first place -
>> can implement proper reference counting, so even if a domain is "freed"
>> with a device still attached as above, it doesn't actually go away until
>> release_device(a) cleans up the final dangling reference. I suggested
>> the core doing this generically, but on reflection I think it's actually
>> a lot more straightforward as a driver-internal thing.
> 
> Isn't this every driver though? Like every single driver implementing
> an UNMANAGED/DMA/DMA_FQ domain has a hidden reference to the
> iommu_domain - minimally to point the HW to the IOPTEs it stores.

Um, no? Domain ops get the domain passed in as an argument, which is far 
from hidden, and if any driver implemented them to ignore that argument 
and operate on something else it would be stupid and broken. Note I said 
"per-device reference", meaning things like s390's zpci_dev->s390_domain 
and SMMUv3's dev->iommu->priv->domain. It's only those references that 
are reachable from release_device - outside the normal domain lifecycle 
- which are problematic.

>> - Drivers that basically just keep a list of devices in the domain and
>> need to do a list_del() in release_device, can also list_del_init() any
>> still-attached devices in domain_free, with a simple per-instance or
>> global lock to serialise the two.
> 
> Compared to just locking ops->release_device() these all seem more
> complicated?

Well, yes, but they are still potentially-viable examples of the 
alternative solutions you said you couldn't think of ;)

> IMHO the core code should try to protect the driver from racing
> release with anything else.
> 
> Do you know a reason not to hold the group mutex across
> release_device? I think that is the most straightforward and
> future proof.

Yes, the ones documented in the code and already discussed here. The 
current functional ones aren't particularly *good* reasons, but unless 
and until they can all be cleaned up they are what they are.

> Arguably all the device ops should be serialized under the group
> mutex.

Maybe once groups and default domains are used consistently everywhere. 
And notwithstanding that half the ops have no association with a group, 
are needed before or as part of obtaining a group, or were explicitly 
intended to allow calling back into other APIs that might lock the 
relevant group :P

>> @@ -1022,6 +1030,14 @@ void iommu_group_remove_device(struct device *dev)
>>   	dev_info(dev, "Removing from iommu group %d\n", group->id);
>>   	mutex_lock(&group->mutex);
>> +	if (WARN_ON(group->domain != group->default_domain &&
>> +		    group->domain != group->blocking_domain)) {
> 
> This will false trigger, if there are two VFIO devices then the group
> will remained owned when we unplug one just of them, but the group's domain
> will be a VFIO owned domain.

As opposed to currently, where most drivers' release_device will blindly 
detach/disable the RID in some fashion so the other device would 
suddenly blow up anyway? A warning of the impending disaster might be 
quite informative, I reckon.

> It is why I put the list_empty() protection, as the test only works if
> it is the last device.
> 
>> +		if (group->default_domain)
>> +			__iommu_attach_device(group->default_domain, dev);
>> +		else
>> +			__iommu_detach_device(group->domain, dev);
>> +	}
> 
> This was very appealing, but I rejected it because it is too difficult
> to support multi-device groups that share the RID.
> 
> In that case we expect that the first attach/detach of a device on the
> shared RID will reconfigure the iommu and the attach/deatch of all the
> other devices on the group with the same parameters will be a NOP.
> 
> So in a VFIO configuration where two drivers are bound to a single
> group with shared RID and we unplug one device, this will rebind the
> shared RID and thus the entire group to blocking/default and break the
> still running VFIO on the second device.

As above, I am supremely confident that nobody has ever done that 
because it is already broken on everything that matters.

(It *will* actually work on SMMUv2 because SMMUv2 comprehensively 
handles StreamID-level aliasing beyond what pci_device_group() covers, 
which I remain rather proud of)

> The device centric interface only works if we always apply the
> operation to every device in the group..
> 
> This is why I kept it as ops->release_device() with an implicit detach
> of the current domain inside the driver. release_device() has that
> special meaning of 'detach the domain but do not change a shared RID'

If you want to rely on that notion, you'll need to tell all the major 
drivers about it first, I'm afraid.

> And it misses the logic to WARN_ON if a domain is set and an external
> entity wrongly uses iommu_group_remove_device()..

Huh? An external fake group couldn't have a default domain or blocking 
domain, thus any non-NULL domain will not compare equal to either, so if 
that could happen it will warn, and then most likely crash. I did think 
briefly about trying to make it not crash, but then I remembered that 
fake groups from external callers also aren't backed by IOMMU API 
drivers so have no way to allocate or attach domains either, so in fact 
it cannot happen at all under any circumstances that are worth reasoning 
about.

Yes, some nefarious driver could call iommu_group_remove_device() on 
random devices. It could also call kfree(dev->iommu_group). Or 
kfree(iommu_group_remove_device). Fundamentally-broken kernel code can 
crash the kernel, whoop de do.

Thanks,
Robin.
Jason Gunthorpe Sept. 9, 2022, 11:45 p.m. UTC | #9
On Fri, Sep 09, 2022 at 08:55:07PM +0100, Robin Murphy wrote:

> > Isn't this every driver though? Like every single driver implementing
> > an UNMANAGED/DMA/DMA_FQ domain has a hidden reference to the
> > iommu_domain - minimally to point the HW to the IOPTEs it stores.
> 
> Um, no? Domain ops get the domain passed in as an argument, which is far
> from hidden, and if any driver implemented them to ignore that argument and
> operate on something else it would be stupid and broken. Note I said
> "per-device reference", meaning things like s390's zpci_dev->s390_domain and
> SMMUv3's dev->iommu->priv->domain. It's only those references that are
> reachable from release_device - outside the normal domain lifecycle - which
> are problematic.

If the plan is to make the domain refcounted and then allow a 'put' on
it before we reach release_device() then it means every driver needs
to hold a 'get' on the domain while it is programmed into the HW.

Because the hw will still be touching memory that could be freed by an
iommu_domain_free(). By "hidden" reference I mean the HW walkers are
touching memory that would be freed - ie kasn won't see it.

> > Do you know a reason not to hold the group mutex across
> > release_device? I think that is the most straightforward and
> > future proof.
> 
> Yes, the ones documented in the code and already discussed here. The current
> functional ones aren't particularly *good* reasons, but unless and until
> they can all be cleaned up they are what they are.

Uh, I feel like I missed part of the conversation - I don't know what
this list is..

I did look. I'm looking for a call chain that goes from
release_device() into a core function that grabs the group->mutex.

There is a comment in iommu_group_store_type() that suggest there is a
recursion but doesn't say what it is. It was an Intel person who wrote
the comment so I more carefully checked the intel driver and didn't
find a call path, but it is big and complicated..

There is a commment in iommu_change_dev_def_domain() about recursion
on probe_finalize(), not relevant here, AFAICT.

So, I did two approaches, one I checked quickly through every
release_device looking for something.

Then I looked across the entire exported driver facing API and focused
on callchains going back toward the core from APIs that might be
trouble and audited them almost completely.

These APIs do not take the lock, so completely safe:
 iommu_group_alloc
 iommu_group_set_iommudata
 iommu_group_set_name
 iommu_group_get
 iommu_group_ref_get
 iommu_group_put
 iommu_get_domain_for_dev
 iommu_fwspec_free
 iommu_fwspec_init
 iommu_fwspec_add_ids
 iommu_put_resv_regions (called from release_device)

Does take the lock. Checked all of these in all the drivers, didn't
find an obvious path to release_device:
 iommu_group_remove_device
 iommu_group_for_each_dev
 iommu_attach_device
 iommu_detach_device
 iommu_attach_group
 iommu_detach_group
 bus_set_iommu

Can't tell if these take lock due to driver callbacks, but couldn't
find them in release, and doesn't make sense they would be there:
 iommu_device_register
 iommu_device_unregister
 iommu_domain_alloc
 iommu_domain_free

Rest of the exported interface touching the drivers - didn't carefully
check if they are using the lock - however by name seems unlikely
these are in release_device():
 iommu_register_device_fault_handler
 iommu_unregister_device_fault_handler
 iommu_report_device_fault
 iommu_page_response
 report_iommu_fault
 iommu_iova_to_phys
 iommu_map
 iommu_unmap
 iommu_alloc_resv_region
 iommu_present
 iommu_capable
 iommu_default_passthrough

It is big and complicated, so I wouldn't stake my life on it, but it
seems worth investigating further.

Could the recursion have been cleaned up already?

> > > @@ -1022,6 +1030,14 @@ void iommu_group_remove_device(struct device *dev)
> > >   	dev_info(dev, "Removing from iommu group %d\n", group->id);
> > >   	mutex_lock(&group->mutex);
> > > +	if (WARN_ON(group->domain != group->default_domain &&
> > > +		    group->domain != group->blocking_domain)) {
> > 
> > This will false trigger, if there are two VFIO devices then the group
> > will remained owned when we unplug one just of them, but the group's domain
> > will be a VFIO owned domain.
> 
> As opposed to currently, where most drivers' release_device will blindly
> detach/disable the RID in some fashion so the other device would suddenly
> blow up anyway? 

Er, I think it is OK today, in the non-shared case. If the RID isn't
shared then each device in the group is independent, so most drivers,
most of the time, should only effect the RID release_device() is
called on, while this warning will always trigger for any multi-device
group.

> (It *will* actually work on SMMUv2 because SMMUv2 comprehensively handles
> StreamID-level aliasing beyond what pci_device_group() covers, which I
> remain rather proud of)

This is why I prefered not to explicitly change the domain, because at
least if someone did write a non-buggy driver it doesn't get wrecked -
and making a non-buggy driver is at least allowed by the API.

> > And it misses the logic to WARN_ON if a domain is set and an external
> > entity wrongly uses iommu_group_remove_device()..
> 
> Huh? An external fake group couldn't have a default domain or blocking
> domain, thus any non-NULL domain will not compare equal to either, so if
> that could happen it will warn, and then most likely crash. I did think
> briefly about trying to make it not crash, but then I remembered that fake
> groups from external callers also aren't backed by IOMMU API drivers so have
> no way to allocate or attach domains either, so in fact it cannot happen at
> all under any circumstances that are worth reasoning about.

I mean specificaly thing like FSL is doing where it is a real driver
calling this API and the test of 'group->domain == NULL' is the more
robust precondition.

So, IDK, I would perfer to understand where we hit a group mutex
recursion before rejecting that path... If you know specifics please
share, otherwise maybe we should stick in a lockdep check there and
see if anything hits?

But I'm off to LPC so I probably won't write anything more thoughtful
on this for a while.

Thanks,
Jason
Robin Murphy Sept. 12, 2022, 11:13 a.m. UTC | #10
On 2022-09-10 00:45, Jason Gunthorpe wrote:
> On Fri, Sep 09, 2022 at 08:55:07PM +0100, Robin Murphy wrote:
> 
>>> Isn't this every driver though? Like every single driver implementing
>>> an UNMANAGED/DMA/DMA_FQ domain has a hidden reference to the
>>> iommu_domain - minimally to point the HW to the IOPTEs it stores.
>>
>> Um, no? Domain ops get the domain passed in as an argument, which is far
>> from hidden, and if any driver implemented them to ignore that argument and
>> operate on something else it would be stupid and broken. Note I said
>> "per-device reference", meaning things like s390's zpci_dev->s390_domain and
>> SMMUv3's dev->iommu->priv->domain. It's only those references that are
>> reachable from release_device - outside the normal domain lifecycle - which
>> are problematic.
> 
> If the plan is to make the domain refcounted and then allow a 'put' on
> it before we reach release_device() then it means every driver needs
> to hold a 'get' on the domain while it is programmed into the HW.
> 
> Because the hw will still be touching memory that could be freed by an
> iommu_domain_free(). By "hidden" reference I mean the HW walkers are
> touching memory that would be freed - ie kasn won't see it.

As far as I'm concerned we're dealing purely with the case where 
release_device races with attaching back to the default domain *and* the 
driver has some reason for release_device to poke at what it thinks the 
currently-attached domain is. Anyone who frees a domain while it's still 
actually live deserves whatever they get; it would be thoroughly 
impractical to attempt to mitigate for that kind of silliness.

>>> Do you know a reason not to hold the group mutex across
>>> release_device? I think that is the most straightforward and
>>> future proof.
>>
>> Yes, the ones documented in the code and already discussed here. The current
>> functional ones aren't particularly *good* reasons, but unless and until
>> they can all be cleaned up they are what they are.
> 
> Uh, I feel like I missed part of the conversation - I don't know what
> this list is..

s390 (remember how we got here?) calls iommu_get_domain_for_dev(). 
ipmmu-vmsa calls arm_iommu_detach_device() (mtk_v1 doesn't, but perhaps 
technically should), to undo the corresponding attach from 
probe_finalize - apologies for misremembering which way round the 
comments were.

>>>> @@ -1022,6 +1030,14 @@ void iommu_group_remove_device(struct device *dev)
>>>>    	dev_info(dev, "Removing from iommu group %d\n", group->id);
>>>>    	mutex_lock(&group->mutex);
>>>> +	if (WARN_ON(group->domain != group->default_domain &&
>>>> +		    group->domain != group->blocking_domain)) {
>>>
>>> This will false trigger, if there are two VFIO devices then the group
>>> will remained owned when we unplug one just of them, but the group's domain
>>> will be a VFIO owned domain.
>>
>> As opposed to currently, where most drivers' release_device will blindly
>> detach/disable the RID in some fashion so the other device would suddenly
>> blow up anyway?
> 
> Er, I think it is OK today, in the non-shared case. If the RID isn't
> shared then each device in the group is independent, so most drivers,
> most of the time, should only effect the RID release_device() is
> called on, while this warning will always trigger for any multi-device
> group.

Oh, apparently I managed to misinterpret this as the two *aliasing* 
devices case, sorry. Indeed it is overly conservative for that. I think 
the robust way to detect bad usage is actually not via the group at all, 
but for iommu_device_{un}use_default_domain() to also maintain a 
per-device ownership flag, then we warn if a device is released with 
that still set.

>> (It *will* actually work on SMMUv2 because SMMUv2 comprehensively handles
>> StreamID-level aliasing beyond what pci_device_group() covers, which I
>> remain rather proud of)
> 
> This is why I prefered not to explicitly change the domain, because at
> least if someone did write a non-buggy driver it doesn't get wrecked -
> and making a non-buggy driver is at least allowed by the API.

Detaching back to the default domain still seems like it's *always* the 
right thing to do at this point, even when it should not be warned about 
as above. As I say it *does* work on non-buggy drivers, and making this 
whole domain use-after-free race a fundamental non-issue is attractive.

>>> And it misses the logic to WARN_ON if a domain is set and an external
>>> entity wrongly uses iommu_group_remove_device()..
>>
>> Huh? An external fake group couldn't have a default domain or blocking
>> domain, thus any non-NULL domain will not compare equal to either, so if
>> that could happen it will warn, and then most likely crash. I did think
>> briefly about trying to make it not crash, but then I remembered that fake
>> groups from external callers also aren't backed by IOMMU API drivers so have
>> no way to allocate or attach domains either, so in fact it cannot happen at
>> all under any circumstances that are worth reasoning about.
> 
> I mean specificaly thing like FSL is doing where it is a real driver
> calling this API and the test of 'group->domain == NULL' is the more
> robust precondition.

Having looked a bit closer, I think I get what PAMU is doing - kind of 
impedance-matching between pci_device_group() and its own non-ACS form 
of isolation, and possibly also a rather roundabout way of propagating 
DT data from the PCI controller node up into the PCI hierarchy. Both 
could quite likely be done in a more straightforward manner these days 
(and TBH I'm not convinced it works at all since it doesn't appear to 
match the actual DT binding), but either way I'm fairly confident we 
needn't worry about it.

Thanks,
Robin.
Jason Gunthorpe Sept. 22, 2022, 4:56 p.m. UTC | #11
On Mon, Sep 12, 2022 at 12:13:25PM +0100, Robin Murphy wrote:

> > > Yes, the ones documented in the code and already discussed here. The current
> > > functional ones aren't particularly *good* reasons, but unless and until
> > > they can all be cleaned up they are what they are.
> > 
> > Uh, I feel like I missed part of the conversation - I don't know what
> > this list is..
> 
> s390 (remember how we got here?) calls iommu_get_domain_for_dev().

iommu_get_domain_for_dev() doesn't take a lock, and the last try I
showed (see below) ordered things so that it would succeed when called
under release(). AFIACT it is already not a problem.

> ipmmu-vmsa calls arm_iommu_detach_device() (mtk_v1 doesn't, but perhaps
> technically should), to undo the corresponding attach from probe_finalize -
> apologies for misremembering which way round the comments were.

This does eventually deadlock on the group->mutex, so yes this is a
problem!

And I agree mtk_v1 does looks like it has a memory leak.

But, this looks easy enough to fix up. See below

> Oh, apparently I managed to misinterpret this as the two *aliasing* devices
> case, sorry. Indeed it is overly conservative for that. I think the robust
> way to detect bad usage is actually not via the group at all, but for
> iommu_device_{un}use_default_domain() to also maintain a per-device
> ownership flag, then we warn if a device is released with that still set.

It seems a bit hard to implement a per-device flag with VFIO?

> > > (It *will* actually work on SMMUv2 because SMMUv2 comprehensively handles
> > > StreamID-level aliasing beyond what pci_device_group() covers, which I
> > > remain rather proud of)
> > 
> > This is why I prefered not to explicitly change the domain, because at
> > least if someone did write a non-buggy driver it doesn't get wrecked -
> > and making a non-buggy driver is at least allowed by the API.
> 
> Detaching back to the default domain still seems like it's *always* the
> right thing to do at this point, 

If the RID is aliased it is the wrong thing to do. Calling attach will
wreck the entire alias set.

release is not supposed to do that, buggy drivers excepted.

> Having looked a bit closer, I think I get what PAMU is doing - kind of
> impedance-matching between pci_device_group() and its own non-ACS form of
> isolation, and possibly also a rather roundabout way of propagating DT data
> from the PCI controller node up into the PCI hierarchy. Both could quite
> likely be done in a more straightforward manner these days (and TBH I'm not
> convinced it works at all since it doesn't appear to match the actual DT
> binding), but either way I'm fairly confident we needn't worry about it.

Yes, this is what I thought too. Arguably it is wrong to try and
rework the groups once created, it should be creating the groups to
cover what it wants from the start, and it shouldn't leave the
controller without a group.

So something like the below is what I'm thinking

Thanks,
Jason

diff --git a/arch/arm/include/asm/dma-iommu.h b/arch/arm/include/asm/dma-iommu.h
index fe9ef6f79e9cfe..ea7198a1786186 100644
--- a/arch/arm/include/asm/dma-iommu.h
+++ b/arch/arm/include/asm/dma-iommu.h
@@ -31,6 +31,7 @@ void arm_iommu_release_mapping(struct dma_iommu_mapping *mapping);
 int arm_iommu_attach_device(struct device *dev,
 					struct dma_iommu_mapping *mapping);
 void arm_iommu_detach_device(struct device *dev);
+void arm_iommu_release_device(struct device *dev);
 
 #endif /* __KERNEL__ */
 #endif
diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index 089c9c644cce2a..1694bebb3aa4dc 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -1697,13 +1697,15 @@ int arm_iommu_attach_device(struct device *dev,
 EXPORT_SYMBOL_GPL(arm_iommu_attach_device);
 
 /**
- * arm_iommu_detach_device
+ * arm_iommu_release_device
  * @dev: valid struct device pointer
  *
- * Detaches the provided device from a previously attached map.
- * This overwrites the dma_ops pointer with appropriate non-IOMMU ops.
+ * This is like arm_iommu_detach_device() except it handles the special
+ * case of being called under an iommu driver's release operation. In this
+ * case the driver must have already detached the device from any attached
+ * domain before calling this function.
  */
-void arm_iommu_detach_device(struct device *dev)
+void arm_iommu_release_device(struct device *dev)
 {
 	struct dma_iommu_mapping *mapping;
 
@@ -1713,13 +1715,34 @@ void arm_iommu_detach_device(struct device *dev)
 		return;
 	}
 
-	iommu_detach_device(mapping->domain, dev);
 	kref_put(&mapping->kref, release_iommu_mapping);
 	to_dma_iommu_mapping(dev) = NULL;
 	set_dma_ops(dev, NULL);
 
 	pr_debug("Detached IOMMU controller from %s device.\n", dev_name(dev));
 }
+EXPORT_SYMBOL_GPL(arm_iommu_release_device);
+
+/**
+ * arm_iommu_detach_device
+ * @dev: valid struct device pointer
+ *
+ * Detaches the provided device from a previously attached map.
+ * This overwrites the dma_ops pointer with appropriate non-IOMMU ops.
+ */
+void arm_iommu_detach_device(struct device *dev)
+{
+	struct dma_iommu_mapping *mapping;
+
+	mapping = to_dma_iommu_mapping(dev);
+	if (!mapping) {
+		dev_warn(dev, "Not attached\n");
+		return;
+	}
+
+	iommu_detach_device(mapping->domain, dev);
+	arm_iommu_release_device(dev);
+}
 EXPORT_SYMBOL_GPL(arm_iommu_detach_device);
 
 static void arm_setup_iommu_dma_ops(struct device *dev, u64 dma_base, u64 size,
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 780fb70715770d..61444b2a11e217 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -90,6 +90,10 @@ static int iommu_create_device_direct_mappings(struct iommu_group *group,
 static struct iommu_group *iommu_group_get_for_dev(struct device *dev);
 static ssize_t iommu_group_store_type(struct iommu_group *group,
 				      const char *buf, size_t count);
+static struct group_device *
+__iommu_group_remove_device(struct iommu_group *group, struct device *dev);
+static void __iommu_group_remove_device_sysfs(struct iommu_group *group,
+					      struct group_device *device);
 
 #define IOMMU_GROUP_ATTR(_name, _mode, _show, _store)		\
 struct iommu_group_attribute iommu_group_attr_##_name =		\
@@ -330,18 +334,50 @@ int iommu_probe_device(struct device *dev)
 
 void iommu_release_device(struct device *dev)
 {
+	struct iommu_group *group = dev->iommu_group;
 	const struct iommu_ops *ops;
+	struct group_device *device;
 
 	if (!dev->iommu)
 		return;
 
 	iommu_device_unlink(dev->iommu->iommu_dev, dev);
 
+	mutex_lock(&group->mutex);
+	device = __iommu_group_remove_device(group, dev);
 	ops = dev_iommu_ops(dev);
+
+	/*
+	 * If the group has become empty then ownership must have been released,
+	 * and the current domain must be set back to NULL or the default
+	 * domain.
+	 */
+	if (list_empty(&group->devices))
+		WARN_ON(group->owner_cnt ||
+			group->domain != group->default_domain);
+
+	/*
+	 * release_device() must stop using any attached domain on the device.
+	 * If there are still other devices in the group they are not effected
+	 * by this callback.
+	 *
+	 * The IOMMU driver must set the device to either an identity or
+	 * blocking translation and stop using any domain pointer, as it is
+	 * going to be freed.
+	 */
 	if (ops->release_device)
 		ops->release_device(dev);
+	mutex_unlock(&group->mutex);
+
+	__iommu_group_remove_device_sysfs(group, device);
+
+	/*
+	 * This will eventually call iommu_group_release() which will free the
+	 * iommu_domains.
+	 */
+	dev->iommu_group = NULL;
+	kobject_put(group->devices_kobj);
 
-	iommu_group_remove_device(dev);
 	module_put(ops->owner);
 	dev_iommu_free(dev);
 }
@@ -939,44 +975,71 @@ int iommu_group_add_device(struct iommu_group *group, struct device *dev)
 }
 EXPORT_SYMBOL_GPL(iommu_group_add_device);
 
-/**
- * iommu_group_remove_device - remove a device from it's current group
- * @dev: device to be removed
- *
- * This function is called by an iommu driver to remove the device from
- * it's current group.  This decrements the iommu group reference count.
- */
-void iommu_group_remove_device(struct device *dev)
+static struct group_device *
+__iommu_group_remove_device(struct iommu_group *group, struct device *dev)
 {
-	struct iommu_group *group = dev->iommu_group;
-	struct group_device *tmp_device, *device = NULL;
+	struct group_device *device;
+
+	lockdep_assert_held(&group->mutex);
 
 	if (!group)
-		return;
+		return NULL;
 
 	dev_info(dev, "Removing from iommu group %d\n", group->id);
 
-	mutex_lock(&group->mutex);
-	list_for_each_entry(tmp_device, &group->devices, list) {
-		if (tmp_device->dev == dev) {
-			device = tmp_device;
+	list_for_each_entry(device, &group->devices, list) {
+		if (device->dev == dev) {
 			list_del(&device->list);
-			break;
+			return device;
 		}
 	}
-	mutex_unlock(&group->mutex);
+	return NULL;
+}
 
+static void __iommu_group_remove_device_sysfs(struct iommu_group *group,
+					      struct group_device *device)
+{
 	if (!device)
 		return;
 
 	sysfs_remove_link(group->devices_kobj, device->name);
-	sysfs_remove_link(&dev->kobj, "iommu_group");
+	sysfs_remove_link(&device->dev->kobj, "iommu_group");
 
-	trace_remove_device_from_group(group->id, dev);
+	trace_remove_device_from_group(group->id, device->dev);
 
 	kfree(device->name);
 	kfree(device);
-	dev->iommu_group = NULL;
+}
+
+/**
+ * iommu_group_remove_device - remove a device from it's current group
+ * @dev: device to be removed
+ *
+ * This function is used by non-iommu drivers to create non-iommu subystem
+ * groups (eg VFIO mdev, SPAPR) Using this from inside an iommu driver is an
+ * abuse. Instead the driver should return the correct group from
+ * ops->device_group()
+ */
+void iommu_group_remove_device(struct device *dev)
+{
+	struct iommu_group *group = dev->iommu_group;
+	struct group_device *device;
+
+	/*
+	 * Since we don't do ops->release_device() there is no way for the
+	 * driver to stop using any attached domain before we free it. If a
+	 * domain is attached while this function is called it will eventually
+	 * UAF.
+	 *
+	 * Thus it is only useful for cases like VFIO/SPAPR that don't use an
+	 * iommu driver, or for cases like FSL that don't use default domains.
+	 */
+	WARN_ON(group->domain);
+
+	mutex_lock(&group->mutex);
+	device = __iommu_group_remove_device(group, dev);
+	mutex_unlock(&group->mutex);
+	__iommu_group_remove_device_sysfs(group, device);
 	kobject_put(group->devices_kobj);
 }
 EXPORT_SYMBOL_GPL(iommu_group_remove_device);
diff --git a/drivers/iommu/ipmmu-vmsa.c b/drivers/iommu/ipmmu-vmsa.c
index 1d42084d02767e..f5b9787d22420c 100644
--- a/drivers/iommu/ipmmu-vmsa.c
+++ b/drivers/iommu/ipmmu-vmsa.c
@@ -302,11 +302,8 @@ static void ipmmu_utlb_enable(struct ipmmu_vmsa_domain *domain,
 /*
  * Disable MMU translation for the microTLB.
  */
-static void ipmmu_utlb_disable(struct ipmmu_vmsa_domain *domain,
-			       unsigned int utlb)
+static void ipmmu_utlb_disable(struct ipmmu_vmsa_device *mmu, unsigned int utlb)
 {
-	struct ipmmu_vmsa_device *mmu = domain->mmu;
-
 	ipmmu_imuctr_write(mmu, utlb, 0);
 	mmu->utlb_ctx[utlb] = IPMMU_CTX_INVALID;
 }
@@ -649,11 +646,11 @@ static void ipmmu_detach_device(struct iommu_domain *io_domain,
 				struct device *dev)
 {
 	struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
-	struct ipmmu_vmsa_domain *domain = to_vmsa_domain(io_domain);
+	struct ipmmu_vmsa_device *mmu = to_ipmmu(dev);
 	unsigned int i;
 
 	for (i = 0; i < fwspec->num_ids; ++i)
-		ipmmu_utlb_disable(domain, fwspec->ids[i]);
+		ipmmu_utlb_disable(mmu, fwspec->ids[i]);
 
 	/*
 	 * TODO: Optimize by disabling the context when no device is attached.
@@ -849,7 +846,8 @@ static void ipmmu_probe_finalize(struct device *dev)
 
 static void ipmmu_release_device(struct device *dev)
 {
-	arm_iommu_detach_device(dev);
+	ipmmu_detach_device(NULL, dev);
+	arm_iommu_release_device(dev);
 }
 
 static struct iommu_group *ipmmu_find_group(struct device *dev)
diff mbox series

Patch

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 780fb70715770d..c451bf715182ac 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -90,6 +90,7 @@  static int iommu_create_device_direct_mappings(struct iommu_group *group,
 static struct iommu_group *iommu_group_get_for_dev(struct device *dev);
 static ssize_t iommu_group_store_type(struct iommu_group *group,
 				      const char *buf, size_t count);
+static void __iommu_group_remove_device(struct device *dev);
 
 #define IOMMU_GROUP_ATTR(_name, _mode, _show, _store)		\
 struct iommu_group_attribute iommu_group_attr_##_name =		\
@@ -330,6 +331,7 @@  int iommu_probe_device(struct device *dev)
 
 void iommu_release_device(struct device *dev)
 {
+	struct iommu_group *group = dev->iommu_group;
 	const struct iommu_ops *ops;
 
 	if (!dev->iommu)
@@ -337,11 +339,20 @@  void iommu_release_device(struct device *dev)
 
 	iommu_device_unlink(dev->iommu->iommu_dev, dev);
 
+	__iommu_group_remove_device(dev);
 	ops = dev_iommu_ops(dev);
 	if (ops->release_device)
 		ops->release_device(dev);
 
-	iommu_group_remove_device(dev);
+	/*
+	 * This will eventually call iommu_group_release() which will free the
+	 * iommu_domains. Up until the release_device() above the iommu_domains
+	 * may still have been associated with the device, and we cannot free
+	 * them until the have been detached. release_device() is expected to
+	 * detach all domains connected to the dev.
+	 */
+	kobject_put(group->devices_kobj);
+
 	module_put(ops->owner);
 	dev_iommu_free(dev);
 }
@@ -939,14 +950,7 @@  int iommu_group_add_device(struct iommu_group *group, struct device *dev)
 }
 EXPORT_SYMBOL_GPL(iommu_group_add_device);
 
-/**
- * iommu_group_remove_device - remove a device from it's current group
- * @dev: device to be removed
- *
- * This function is called by an iommu driver to remove the device from
- * it's current group.  This decrements the iommu group reference count.
- */
-void iommu_group_remove_device(struct device *dev)
+static void __iommu_group_remove_device(struct device *dev)
 {
 	struct iommu_group *group = dev->iommu_group;
 	struct group_device *tmp_device, *device = NULL;
@@ -977,6 +981,20 @@  void iommu_group_remove_device(struct device *dev)
 	kfree(device->name);
 	kfree(device);
 	dev->iommu_group = NULL;
+}
+
+/**
+ * iommu_group_remove_device - remove a device from it's current group
+ * @dev: device to be removed
+ *
+ * This function is called by an iommu driver to remove the device from
+ * it's current group.  This decrements the iommu group reference count.
+ */
+void iommu_group_remove_device(struct device *dev)
+{
+	struct iommu_group *group = dev->iommu_group;
+
+	__iommu_group_remove_device(dev);
 	kobject_put(group->devices_kobj);
 }
 EXPORT_SYMBOL_GPL(iommu_group_remove_device);