Message ID | 4-v1-ef00ffecea52+2cb-iommu_group_lifetime_jgg@nvidia.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Fix splats releated to using the iommu_group after destroying devices | expand |
On 2022-09-08 19:45, Jason Gunthorpe wrote: > default domains created a situation where the device is always connected > to a domain of some kind. When the device is idle it is connected to one > of the two pre-existing domains in the group, blocking_domain or > default_domain. In this way we have a continuous assertion of what state > the transation is in. > > When this is all destructed then we need to remove all the devices from > their domains via the ops->release_device() call before the domain can be > freed. This is the bug recognized in commit 9ac8545199a1 ("iommu: Fix > use-after-free in iommu_release_device"). > > However, we must also stop any concurrent access to the iommu driver for > this device before we destroy it. This is done by: > > 1) Drivers only using the iommu API while they have a device driver > attached to the device. This directly prevents release from happening. > > 2) Removing the device from the group list so any lingering group > references no longer refer to the device. This is done by > iommu_group_remove_device() > > Since iommu_group_remove_device() has been moved this breaks #2 and > triggers an WARN when VFIO races group activities with the release of the > device: > > iommu driver failed to attach the default/blocking domain > WARNING: CPU: 0 PID: 5082 at drivers/iommu/iommu.c:1961 iommu_detach_group+0x6c/0x80 > Modules linked in: macvtap macvlan tap vfio_pci vfio_pci_core irqbypass vfio_virqfd kvm nft_fib_inet nft_fib_ipv4 nft_fib_ipv6> > CPU: 0 PID: 5082 Comm: qemu-system-s39 Tainted: G W 6.0.0-rc3 #5 > Hardware name: IBM 3931 A01 782 (LPAR) > Krnl PSW : 0704c00180000000 000000095bb10d28 (iommu_detach_group+0x70/0x80) > R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3 > Krnl GPRS: 0000000000000001 0000000900000027 0000000000000039 000000095c97ffe0 > 00000000fffeffff 00000009fc290000 00000000af1fda50 00000000af590b58 > 00000000af1fdaf0 0000000135c7a320 0000000135e52258 0000000135e52200 > 00000000a29e8000 00000000af590b40 000000095bb10d24 0000038004b13c98 > Krnl Code: 000000095bb10d18: c020003d56fc larl %r2,000000095c2bbb10 > 000000095bb10d1e: c0e50019d901 brasl %r14,000000095be4bf20 > #000000095bb10d24: af000000 mc 0,0 > >000000095bb10d28: b904002a lgr %r2,%r10 > 000000095bb10d2c: ebaff0a00004 lmg %r10,%r15,160(%r15) > 000000095bb10d32: c0f4001aa867 brcl 15,000000095be65e00 > 000000095bb10d38: c004002168e0 brcl 0,000000095bf3def8 > 000000095bb10d3e: eb6ff0480024 stmg %r6,%r15,72(%r15) > Call Trace: > [<000000095bb10d28>] iommu_detach_group+0x70/0x80 > ([<000000095bb10d24>] iommu_detach_group+0x6c/0x80) > [<000003ff80243b0e>] vfio_iommu_type1_detach_group+0x136/0x6c8 [vfio_iommu_type1] > [<000003ff80137780>] __vfio_group_unset_container+0x58/0x158 [vfio] > [<000003ff80138a16>] vfio_group_fops_unl_ioctl+0x1b6/0x210 [vfio] > pci 0004:00:00.0: Removing from iommu group 4 > [<000000095b5b62e8>] __s390x_sys_ioctl+0xc0/0x100 > [<000000095be5d3b4>] __do_syscall+0x1d4/0x200 > [<000000095be6c072>] system_call+0x82/0xb0 > Last Breaking-Event-Address: > [<000000095be4bf80>] __warn_printk+0x60/0x68 > > So, put things in the right order: > - Remove the device from the group's list > - Release the device from the iommu driver to drop all domain references > - Free the domains > > This is done by splitting out the kobject_put(), which triggers > iommu_group_release(), from the rest of iommu_group_remove_device() and > placing it after release is called. So simple... now how did I fail to think of that? :) > Fixes: 9ac8545199a1 ("iommu: Fix use-after-free in iommu_release_device") > Reported-by: Matthew Rosato <mjrosato@linux.ibm.com> > Cc: Robin Murphy <robin.murphy@arm.com> > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> > --- > drivers/iommu/iommu.c | 36 +++++++++++++++++++++++++++--------- > 1 file changed, 27 insertions(+), 9 deletions(-) > > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c > index 780fb70715770d..c451bf715182ac 100644 > --- a/drivers/iommu/iommu.c > +++ b/drivers/iommu/iommu.c > @@ -90,6 +90,7 @@ static int iommu_create_device_direct_mappings(struct iommu_group *group, > static struct iommu_group *iommu_group_get_for_dev(struct device *dev); > static ssize_t iommu_group_store_type(struct iommu_group *group, > const char *buf, size_t count); > +static void __iommu_group_remove_device(struct device *dev); > > #define IOMMU_GROUP_ATTR(_name, _mode, _show, _store) \ > struct iommu_group_attribute iommu_group_attr_##_name = \ > @@ -330,6 +331,7 @@ int iommu_probe_device(struct device *dev) > > void iommu_release_device(struct device *dev) > { > + struct iommu_group *group = dev->iommu_group; > const struct iommu_ops *ops; > > if (!dev->iommu) > @@ -337,11 +339,20 @@ void iommu_release_device(struct device *dev) > > iommu_device_unlink(dev->iommu->iommu_dev, dev); > In fact, now that you've made it obvious, could we not simply do an extra kobject_get() here before calling regular iommu_group_remove_device(), and avoid having to split that up at all? That should delay any default domain teardown just as definitively as holding the original reference for longer, no? Thanks, Robin. > + __iommu_group_remove_device(dev); > ops = dev_iommu_ops(dev); > if (ops->release_device) > ops->release_device(dev); > > - iommu_group_remove_device(dev); > + /* > + * This will eventually call iommu_group_release() which will free the > + * iommu_domains. Up until the release_device() above the iommu_domains > + * may still have been associated with the device, and we cannot free > + * them until the have been detached. release_device() is expected to > + * detach all domains connected to the dev. > + */ > + kobject_put(group->devices_kobj); > + > module_put(ops->owner); > dev_iommu_free(dev); > } > @@ -939,14 +950,7 @@ int iommu_group_add_device(struct iommu_group *group, struct device *dev) > } > EXPORT_SYMBOL_GPL(iommu_group_add_device); > > -/** > - * iommu_group_remove_device - remove a device from it's current group > - * @dev: device to be removed > - * > - * This function is called by an iommu driver to remove the device from > - * it's current group. This decrements the iommu group reference count. > - */ > -void iommu_group_remove_device(struct device *dev) > +static void __iommu_group_remove_device(struct device *dev) > { > struct iommu_group *group = dev->iommu_group; > struct group_device *tmp_device, *device = NULL; > @@ -977,6 +981,20 @@ void iommu_group_remove_device(struct device *dev) > kfree(device->name); > kfree(device); > dev->iommu_group = NULL; > +} > + > +/** > + * iommu_group_remove_device - remove a device from it's current group > + * @dev: device to be removed > + * > + * This function is called by an iommu driver to remove the device from > + * it's current group. This decrements the iommu group reference count. > + */ > +void iommu_group_remove_device(struct device *dev) > +{ > + struct iommu_group *group = dev->iommu_group; > + > + __iommu_group_remove_device(dev); > kobject_put(group->devices_kobj); > } > EXPORT_SYMBOL_GPL(iommu_group_remove_device);
On 2022-09-08 22:05, Robin Murphy wrote: > On 2022-09-08 19:45, Jason Gunthorpe wrote: >> default domains created a situation where the device is always connected >> to a domain of some kind. When the device is idle it is connected to one >> of the two pre-existing domains in the group, blocking_domain or >> default_domain. In this way we have a continuous assertion of what state >> the transation is in. >> >> When this is all destructed then we need to remove all the devices from >> their domains via the ops->release_device() call before the domain can be >> freed. This is the bug recognized in commit 9ac8545199a1 ("iommu: Fix >> use-after-free in iommu_release_device"). >> >> However, we must also stop any concurrent access to the iommu driver for >> this device before we destroy it. This is done by: >> >> 1) Drivers only using the iommu API while they have a device driver >> attached to the device. This directly prevents release from >> happening. >> >> 2) Removing the device from the group list so any lingering group >> references no longer refer to the device. This is done by >> iommu_group_remove_device() >> >> Since iommu_group_remove_device() has been moved this breaks #2 and >> triggers an WARN when VFIO races group activities with the release of the >> device: >> >> iommu driver failed to attach the default/blocking domain >> WARNING: CPU: 0 PID: 5082 at drivers/iommu/iommu.c:1961 >> iommu_detach_group+0x6c/0x80 >> Modules linked in: macvtap macvlan tap vfio_pci vfio_pci_core >> irqbypass vfio_virqfd kvm nft_fib_inet nft_fib_ipv4 nft_fib_ipv6> >> CPU: 0 PID: 5082 Comm: qemu-system-s39 Tainted: G >> W 6.0.0-rc3 #5 >> Hardware name: IBM 3931 A01 782 (LPAR) >> Krnl PSW : 0704c00180000000 000000095bb10d28 >> (iommu_detach_group+0x70/0x80) >> R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3 >> Krnl GPRS: 0000000000000001 0000000900000027 0000000000000039 >> 000000095c97ffe0 >> 00000000fffeffff 00000009fc290000 00000000af1fda50 >> 00000000af590b58 >> 00000000af1fdaf0 0000000135c7a320 0000000135e52258 >> 0000000135e52200 >> 00000000a29e8000 00000000af590b40 000000095bb10d24 >> 0000038004b13c98 >> Krnl Code: 000000095bb10d18: c020003d56fc larl >> %r2,000000095c2bbb10 >> 000000095bb10d1e: c0e50019d901 brasl >> %r14,000000095be4bf20 >> #000000095bb10d24: af000000 mc 0,0 >> >000000095bb10d28: b904002a lgr %r2,%r10 >> 000000095bb10d2c: ebaff0a00004 lmg >> %r10,%r15,160(%r15) >> 000000095bb10d32: c0f4001aa867 brcl >> 15,000000095be65e00 >> 000000095bb10d38: c004002168e0 brcl >> 0,000000095bf3def8 >> 000000095bb10d3e: eb6ff0480024 stmg >> %r6,%r15,72(%r15) >> Call Trace: >> [<000000095bb10d28>] iommu_detach_group+0x70/0x80 >> ([<000000095bb10d24>] iommu_detach_group+0x6c/0x80) >> [<000003ff80243b0e>] vfio_iommu_type1_detach_group+0x136/0x6c8 >> [vfio_iommu_type1] >> [<000003ff80137780>] __vfio_group_unset_container+0x58/0x158 [vfio] >> [<000003ff80138a16>] vfio_group_fops_unl_ioctl+0x1b6/0x210 [vfio] >> pci 0004:00:00.0: Removing from iommu group 4 >> [<000000095b5b62e8>] __s390x_sys_ioctl+0xc0/0x100 >> [<000000095be5d3b4>] __do_syscall+0x1d4/0x200 >> [<000000095be6c072>] system_call+0x82/0xb0 >> Last Breaking-Event-Address: >> [<000000095be4bf80>] __warn_printk+0x60/0x68 >> >> So, put things in the right order: >> - Remove the device from the group's list >> - Release the device from the iommu driver to drop all domain >> references >> - Free the domains >> >> This is done by splitting out the kobject_put(), which triggers >> iommu_group_release(), from the rest of iommu_group_remove_device() and >> placing it after release is called. > > So simple... now how did I fail to think of that? :) Oh, because s390 is using iommu_get_domain_for_dev() in its release_device callback, which needs to dereference the group to work, and the current domain may also be a non-default one which we can't prevent from disappearing racily, that was why :( I think we may be back to looking at s390 having to keep track of domains internally like SMMUv3 does, and both drivers synchronising between their domain_free and release_device to to do their internal detach from either place... unless possibly the core code starts refcounting domains as well? Robin. >> Fixes: 9ac8545199a1 ("iommu: Fix use-after-free in iommu_release_device") >> Reported-by: Matthew Rosato <mjrosato@linux.ibm.com> >> Cc: Robin Murphy <robin.murphy@arm.com> >> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> >> --- >> drivers/iommu/iommu.c | 36 +++++++++++++++++++++++++++--------- >> 1 file changed, 27 insertions(+), 9 deletions(-) >> >> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c >> index 780fb70715770d..c451bf715182ac 100644 >> --- a/drivers/iommu/iommu.c >> +++ b/drivers/iommu/iommu.c >> @@ -90,6 +90,7 @@ static int >> iommu_create_device_direct_mappings(struct iommu_group *group, >> static struct iommu_group *iommu_group_get_for_dev(struct device *dev); >> static ssize_t iommu_group_store_type(struct iommu_group *group, >> const char *buf, size_t count); >> +static void __iommu_group_remove_device(struct device *dev); >> #define IOMMU_GROUP_ATTR(_name, _mode, _show, _store) \ >> struct iommu_group_attribute iommu_group_attr_##_name = \ >> @@ -330,6 +331,7 @@ int iommu_probe_device(struct device *dev) >> void iommu_release_device(struct device *dev) >> { >> + struct iommu_group *group = dev->iommu_group; >> const struct iommu_ops *ops; >> if (!dev->iommu) >> @@ -337,11 +339,20 @@ void iommu_release_device(struct device *dev) >> iommu_device_unlink(dev->iommu->iommu_dev, dev); > > In fact, now that you've made it obvious, could we not simply do an > extra kobject_get() here before calling regular > iommu_group_remove_device(), and avoid having to split that up at all? > That should delay any default domain teardown just as definitively as > holding the original reference for longer, no? > > Thanks, > Robin. > >> + __iommu_group_remove_device(dev); >> ops = dev_iommu_ops(dev); >> if (ops->release_device) >> ops->release_device(dev); >> - iommu_group_remove_device(dev); >> + /* >> + * This will eventually call iommu_group_release() which will >> free the >> + * iommu_domains. Up until the release_device() above the >> iommu_domains >> + * may still have been associated with the device, and we cannot >> free >> + * them until the have been detached. release_device() is >> expected to >> + * detach all domains connected to the dev. >> + */ >> + kobject_put(group->devices_kobj); >> + >> module_put(ops->owner); >> dev_iommu_free(dev); >> } >> @@ -939,14 +950,7 @@ int iommu_group_add_device(struct iommu_group >> *group, struct device *dev) >> } >> EXPORT_SYMBOL_GPL(iommu_group_add_device); >> -/** >> - * iommu_group_remove_device - remove a device from it's current group >> - * @dev: device to be removed >> - * >> - * This function is called by an iommu driver to remove the device from >> - * it's current group. This decrements the iommu group reference count. >> - */ >> -void iommu_group_remove_device(struct device *dev) >> +static void __iommu_group_remove_device(struct device *dev) >> { >> struct iommu_group *group = dev->iommu_group; >> struct group_device *tmp_device, *device = NULL; >> @@ -977,6 +981,20 @@ void iommu_group_remove_device(struct device *dev) >> kfree(device->name); >> kfree(device); >> dev->iommu_group = NULL; >> +} >> + >> +/** >> + * iommu_group_remove_device - remove a device from it's current group >> + * @dev: device to be removed >> + * >> + * This function is called by an iommu driver to remove the device from >> + * it's current group. This decrements the iommu group reference count. >> + */ >> +void iommu_group_remove_device(struct device *dev) >> +{ >> + struct iommu_group *group = dev->iommu_group; >> + >> + __iommu_group_remove_device(dev); >> kobject_put(group->devices_kobj); >> } >> EXPORT_SYMBOL_GPL(iommu_group_remove_device); >
On Thu, Sep 08, 2022 at 10:27:06PM +0100, Robin Murphy wrote: > Oh, because s390 is using iommu_get_domain_for_dev() in its release_device > callback, which needs to dereference the group to work, and the current > domain may also be a non-default one which we can't prevent from > disappearing racily, that was why :( Hum, the issue there is the use of device->iommu_group - but that just means I didn't split properly. How about this incremental: diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index c451bf715182ac..99ef799f3fe6b5 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -351,6 +351,7 @@ void iommu_release_device(struct device *dev) * them until the have been detached. release_device() is expected to * detach all domains connected to the dev. */ + dev->iommu_group = NULL; kobject_put(group->devices_kobj); module_put(ops->owner); @@ -980,7 +981,6 @@ static void __iommu_group_remove_device(struct device *dev) kfree(device->name); kfree(device); - dev->iommu_group = NULL; } /** @@ -995,6 +995,7 @@ void iommu_group_remove_device(struct device *dev) struct iommu_group *group = dev->iommu_group; __iommu_group_remove_device(dev); + dev->iommu_group = NULL; kobject_put(group->devices_kobj); } EXPORT_SYMBOL_GPL(iommu_group_remove_device); To me it makes sense that the driver should be able to continue to query the iommu_group during release anyhow.. And to your other question, the reason I split the function is because I couldn't really say WTF iommu_group_remove_device() was supposed to do. The __ version make ssense as part of the remove_device, due to the sequencing with ops->release() But the other one doesn't have that. So I want to put in a: WARN_ON(group->blocking_domain || group->default_domain); Because calling it after those domains are allocated looks broken to me. Jason
On 2022-09-08 22:43, Jason Gunthorpe wrote: > On Thu, Sep 08, 2022 at 10:27:06PM +0100, Robin Murphy wrote: > >> Oh, because s390 is using iommu_get_domain_for_dev() in its release_device >> callback, which needs to dereference the group to work, and the current >> domain may also be a non-default one which we can't prevent from >> disappearing racily, that was why :( > > Hum, the issue there is the use of device->iommu_group - but that just > means I didn't split properly. How about this incremental: That did cross my mind, but it's a bit grim. In the light of the morning, I'm not sure s390 actually *needs* the group anyway - AFAICS if iommu_group_remove_device() has been processed first, that will have synchronised against any concurrent attach/detach, so zdev->s390_domain can be assumed to be up to date and used directly without the round trip through iommu_get_domain_for_dev(). That then only leaves the issue that that domain may still become invalid at any point after the group mutex has been dropped. > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c > index c451bf715182ac..99ef799f3fe6b5 100644 > --- a/drivers/iommu/iommu.c > +++ b/drivers/iommu/iommu.c > @@ -351,6 +351,7 @@ void iommu_release_device(struct device *dev) > * them until the have been detached. release_device() is expected to > * detach all domains connected to the dev. > */ > + dev->iommu_group = NULL; > kobject_put(group->devices_kobj); > > module_put(ops->owner); > @@ -980,7 +981,6 @@ static void __iommu_group_remove_device(struct device *dev) > > kfree(device->name); > kfree(device); > - dev->iommu_group = NULL; > } > > /** > @@ -995,6 +995,7 @@ void iommu_group_remove_device(struct device *dev) > struct iommu_group *group = dev->iommu_group; > > __iommu_group_remove_device(dev); > + dev->iommu_group = NULL; > kobject_put(group->devices_kobj); > } > EXPORT_SYMBOL_GPL(iommu_group_remove_device); > > To me it makes sense that the driver should be able to continue to > query the iommu_group during release anyhow.. I'm not so sure, release shouldn't be depending on a group since there may never have been one anyway. Perhaps the answer is an extra pre-release step to balance probe_finalize? > And to your other question, the reason I split the function is because > I couldn't really say WTF iommu_group_remove_device() was supposed to > do. The __ version make ssense as part of the remove_device, due to > the sequencing with ops->release() > > But the other one doesn't have that. So I want to put in a: > > WARN_ON(group->blocking_domain || group->default_domain); > > Because calling it after those domains are allocated looks broken to > me. I might be misunderstanding, but that sounds backwards - if a real device is being hotplugged out, we absolutely expect that to happen *after* its default domain has been set up. The external callers are using fake groups where default domains aren't relevant, and I have no idea what PAMU is doing but it's been doing it for long enough that it most likely isn't a problem. Thus wherever that check would be it would seem either wrong or unnecessary. Thanks, Robin.
On Fri, Sep 09, 2022 at 10:05:58AM +0100, Robin Murphy wrote: > On 2022-09-08 22:43, Jason Gunthorpe wrote: > > On Thu, Sep 08, 2022 at 10:27:06PM +0100, Robin Murphy wrote: > > > > > Oh, because s390 is using iommu_get_domain_for_dev() in its release_device > > > callback, which needs to dereference the group to work, and the current > > > domain may also be a non-default one which we can't prevent from > > > disappearing racily, that was why :( > > > > Hum, the issue there is the use of device->iommu_group - but that just > > means I didn't split properly. How about this incremental: > > That did cross my mind, but it's a bit grim. Actually, also in my morning, I think it may not even be necessary. Keep in mind the start of the series fixes VFIO. The bug that S390 is trying to fix is that VFIO didn't put back the group ownership, it just left its own iommu_domain attached and called release(). But now, at least for single device groups, VFIO will put owenership back and zdev->s390_domain == NULL when we get to release_device() > That then only leaves the issue that that domain may still become > invalid at any point after the group mutex has been dropped. So that is this race: CPU0 CPU1 iommu_release_device(a) __iommu_group_remove_device(a) iommu_device_use_default_domain(b) iommu_domain_free(domain) iommu_release_device(b) ops->release_device(b) ops->release_device(a) // Boom, a is still attached to domain :( I can't think of how to solve this other than holding the group mutex across release_device. See below. > > And to your other question, the reason I split the function is because > > I couldn't really say WTF iommu_group_remove_device() was supposed to > > do. The __ version make ssense as part of the remove_device, due to > > the sequencing with ops->release() > > > > But the other one doesn't have that. So I want to put in a: > > > > WARN_ON(group->blocking_domain || group->default_domain); > > > > Because calling it after those domains are allocated looks broken to > > me. > > I might be misunderstanding, but that sounds backwards - if a real device is > being hotplugged out, we absolutely expect that to happen *after* its > default domain has been set up. See below for what I mean iommu_group_remove_device() doesn't work as an API because it has no way to tell the device to stop using the domain we are about to free. So it should assert that there is no domain to worry about. For the vfio and power case there is no domain because they don't use iommu drivers For FSL it does not use default domains so it will also be OK. Also, I think FSL is broken and it should not be trying to remove the "PCI controller" from a group. Every PCI device behind an IOMMU should be linked to a group. The only reason I can think someone would wanted to do this is because they ran into trouble with the VFIO viability checks. But we have a robust solution to that now that doesn't require abusing the group like this. Jason diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index 780fb70715770d..cb83576b1877d5 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -90,6 +90,10 @@ static int iommu_create_device_direct_mappings(struct iommu_group *group, static struct iommu_group *iommu_group_get_for_dev(struct device *dev); static ssize_t iommu_group_store_type(struct iommu_group *group, const char *buf, size_t count); +static struct group_device * +__iommu_group_remove_device(struct iommu_group *group, struct device *dev); +static void __iommu_group_remove_device_sysfs(struct iommu_group *group, + struct group_device *device); #define IOMMU_GROUP_ATTR(_name, _mode, _show, _store) \ struct iommu_group_attribute iommu_group_attr_##_name = \ @@ -330,18 +334,50 @@ int iommu_probe_device(struct device *dev) void iommu_release_device(struct device *dev) { + struct iommu_group *group = dev->iommu_group; const struct iommu_ops *ops; + struct group_device *device; if (!dev->iommu) return; iommu_device_unlink(dev->iommu->iommu_dev, dev); + mutex_lock(&group->mutex); + device = __iommu_group_remove_device(group, dev); ops = dev_iommu_ops(dev); + + /* + * If the group has become empty then ownership must have been released, + * and the current domain must be set back to NULL or the default + * domain. + */ + if (list_empty(&group->devices)) + WARN_ON(group->owner_cnt || + group->domain != group->default_domain); + + /* + * release_device() must stop using any attached domain on the device. + * If there are still other devices in the group they are not effected + * by this callback. + * + * The IOMMU driver must set the device to either an identity or + * blocking translation and stop using any domain pointer, as it is + * going to be freed. + */ if (ops->release_device) ops->release_device(dev); + mutex_unlock(&group->mutex); + + __iommu_group_remove_device_sysfs(group, device); + + /* + * This will eventually call iommu_group_release() which will free the + * iommu_domains. + */ + dev->iommu_group = NULL; + kobject_put(group->devices_kobj); - iommu_group_remove_device(dev); module_put(ops->owner); dev_iommu_free(dev); } @@ -939,44 +975,69 @@ int iommu_group_add_device(struct iommu_group *group, struct device *dev) } EXPORT_SYMBOL_GPL(iommu_group_add_device); -/** - * iommu_group_remove_device - remove a device from it's current group - * @dev: device to be removed - * - * This function is called by an iommu driver to remove the device from - * it's current group. This decrements the iommu group reference count. - */ -void iommu_group_remove_device(struct device *dev) +static struct group_device * +__iommu_group_remove_device(struct iommu_group *group, struct device *dev) { - struct iommu_group *group = dev->iommu_group; - struct group_device *tmp_device, *device = NULL; + struct group_device *device; + + lockdep_assert_held(&group->mutex); if (!group) - return; + return NULL; dev_info(dev, "Removing from iommu group %d\n", group->id); - mutex_lock(&group->mutex); - list_for_each_entry(tmp_device, &group->devices, list) { - if (tmp_device->dev == dev) { - device = tmp_device; + list_for_each_entry(device, &group->devices, list) { + if (device->dev == dev) { list_del(&device->list); - break; + return device; } } - mutex_unlock(&group->mutex); + return NULL; +} +static void __iommu_group_remove_device_sysfs(struct iommu_group *group, + struct group_device *device) +{ if (!device) return; sysfs_remove_link(group->devices_kobj, device->name); - sysfs_remove_link(&dev->kobj, "iommu_group"); + sysfs_remove_link(&device->dev->kobj, "iommu_group"); - trace_remove_device_from_group(group->id, dev); + trace_remove_device_from_group(group->id, device->dev); kfree(device->name); kfree(device); - dev->iommu_group = NULL; +} + +/** + * iommu_group_remove_device - remove a device from it's current group + * @dev: device to be removed + * + * This function is called by an iommu driver to remove the device from + * it's current group. This decrements the iommu group reference count. + */ +void iommu_group_remove_device(struct device *dev) +{ + struct iommu_group *group = dev->iommu_group; + struct group_device *device; + + /* + * Since we don't do ops->release_device() there is no way for the + * driver to stop using any attached domain before we free it. If a + * domain is attached while this function is called it will eventually + * UAF. + * + * Thus it is only useful for cases like VFIO/SPAPR that don't use an + * iommu driver, or for cases like FSL that don't use default domains. + */ + WARN_ON(group->domain); + + mutex_lock(&group->mutex); + device = __iommu_group_remove_device(group, dev); + mutex_unlock(&group->mutex); + __iommu_group_remove_device_sysfs(group, device); kobject_put(group->devices_kobj); } EXPORT_SYMBOL_GPL(iommu_group_remove_device);
On 2022-09-09 14:25, Jason Gunthorpe wrote: > On Fri, Sep 09, 2022 at 10:05:58AM +0100, Robin Murphy wrote: >> On 2022-09-08 22:43, Jason Gunthorpe wrote: >>> On Thu, Sep 08, 2022 at 10:27:06PM +0100, Robin Murphy wrote: >>> >>>> Oh, because s390 is using iommu_get_domain_for_dev() in its release_device >>>> callback, which needs to dereference the group to work, and the current >>>> domain may also be a non-default one which we can't prevent from >>>> disappearing racily, that was why :( >>> >>> Hum, the issue there is the use of device->iommu_group - but that just >>> means I didn't split properly. How about this incremental: >> >> That did cross my mind, but it's a bit grim. > > Actually, also in my morning, I think it may not even be necessary. > > Keep in mind the start of the series fixes VFIO. > > The bug that S390 is trying to fix is that VFIO didn't put back the > group ownership, it just left its own iommu_domain attached and called > release(). > > But now, at least for single device groups, VFIO will put owenership > back and zdev->s390_domain == NULL when we get to release_device() > >> That then only leaves the issue that that domain may still become >> invalid at any point after the group mutex has been dropped. > > So that is this race: > > CPU0 CPU1 > iommu_release_device(a) > __iommu_group_remove_device(a) > iommu_device_use_default_domain(b) > iommu_domain_free(domain) > iommu_release_device(b) > ops->release_device(b) > ops->release_device(a) > // Boom, a is still attached to domain :( > > I can't think of how to solve this other than holding the group mutex > across release_device. See below. I see a few possibilities: - Backtrack slightly on its removal, and instead repurpose detach_dev into a specialised domain cleanup callback, called before or during iommu_group_remove_device(), with the group mutex held. - Drivers that hold any kind of internal per-device references to domains - which is generally the root of this issue in the first place - can implement proper reference counting, so even if a domain is "freed" with a device still attached as above, it doesn't actually go away until release_device(a) cleans up the final dangling reference. I suggested the core doing this generically, but on reflection I think it's actually a lot more straightforward as a driver-internal thing. - Drivers that basically just keep a list of devices in the domain and need to do a list_del() in release_device, can also list_del_init() any still-attached devices in domain_free, with a simple per-instance or global lock to serialise the two. >>> And to your other question, the reason I split the function is because >>> I couldn't really say WTF iommu_group_remove_device() was supposed to >>> do. The __ version make ssense as part of the remove_device, due to >>> the sequencing with ops->release() >>> >>> But the other one doesn't have that. So I want to put in a: >>> >>> WARN_ON(group->blocking_domain || group->default_domain); >>> >>> Because calling it after those domains are allocated looks broken to >>> me. >> >> I might be misunderstanding, but that sounds backwards - if a real device is >> being hotplugged out, we absolutely expect that to happen *after* its >> default domain has been set up. > > See below for what I mean > > iommu_group_remove_device() doesn't work as an API because it has no > way to tell the device to stop using the domain we are about to free. > > So it should assert that there is no domain to worry about. For the > vfio and power case there is no domain because they don't use iommu > drivers Ah, I see it now - if we think it's a usage error for any current API user to allow a device to be removed while still attached to a non- default domain, then we can just throw our hands up at that, and mitigate for the default domain case that we *can* control. I'm not 100% convinced there might not be some niche non-uAPI case for skipping a detach because you know you're tearing down your device and domain at the same time, but I'm inclined to agree that we can worry about that if and when it does ever come up. If so, I reckon it should be about as as easy as this (untested). Cheers, Robin. ----->8----- diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index 9fbe5d067473..760d9bd3ad66 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -396,17 +396,25 @@ int iommu_probe_device(struct device *dev) void iommu_release_device(struct device *dev) { const struct iommu_ops *ops; + struct iommu_group *group; if (!dev->iommu) return; iommu_device_unlink(dev->iommu->iommu_dev, dev); + /* + * Some drivers track a device's current domain internally and may + * dereference it to clean up in release_device. If a default domain + * exists, hold a reference to ensure it stays around long enough. + */ + group = iommu_group_get(dev); + iommu_group_remove_device(dev); ops = dev_iommu_ops(dev); if (ops->release_device) ops->release_device(dev); - iommu_group_remove_device(dev); + iommu_group_put(group); module_put(ops->owner); dev_iommu_free(dev); } @@ -1022,6 +1030,14 @@ void iommu_group_remove_device(struct device *dev) dev_info(dev, "Removing from iommu group %d\n", group->id); mutex_lock(&group->mutex); + if (WARN_ON(group->domain != group->default_domain && + group->domain != group->blocking_domain)) { + if (group->default_domain) + __iommu_attach_device(group->default_domain, dev); + else + __iommu_detach_device(group->domain, dev); + } + list_for_each_entry(tmp_device, &group->devices, list) { if (tmp_device->dev == dev) { device = tmp_device;
On Fri, Sep 09, 2022 at 06:57:59PM +0100, Robin Murphy wrote: > > > That then only leaves the issue that that domain may still become > > > invalid at any point after the group mutex has been dropped. > > > > So that is this race: > > > > CPU0 CPU1 > > iommu_release_device(a) > > __iommu_group_remove_device(a) > > iommu_device_use_default_domain(b) > > iommu_domain_free(domain) > > iommu_release_device(b) > > ops->release_device(b) > > ops->release_device(a) > > // Boom, a is still attached to domain :( > > > > I can't think of how to solve this other than holding the group mutex > > across release_device. See below. > > I see a few possibilities: > > - Backtrack slightly on its removal, and instead repurpose detach_dev > into a specialised domain cleanup callback, called before or during > iommu_group_remove_device(), with the group mutex held. See below for why that is somewhat troublesome.. > - Drivers that hold any kind of internal per-device references to > domains - which is generally the root of this issue in the first place - > can implement proper reference counting, so even if a domain is "freed" > with a device still attached as above, it doesn't actually go away until > release_device(a) cleans up the final dangling reference. I suggested > the core doing this generically, but on reflection I think it's actually > a lot more straightforward as a driver-internal thing. Isn't this every driver though? Like every single driver implementing an UNMANAGED/DMA/DMA_FQ domain has a hidden reference to the iommu_domain - minimally to point the HW to the IOPTEs it stores. > - Drivers that basically just keep a list of devices in the domain and > need to do a list_del() in release_device, can also list_del_init() any > still-attached devices in domain_free, with a simple per-instance or > global lock to serialise the two. Compared to just locking ops->release_device() these all seem more complicated? IMHO the core code should try to protect the driver from racing release with anything else. Do you know a reason not to hold the group mutex across release_device? I think that is the most straightforward and future proof. Arguably all the device ops should be serialized under the group mutex. > @@ -1022,6 +1030,14 @@ void iommu_group_remove_device(struct device *dev) > dev_info(dev, "Removing from iommu group %d\n", group->id); > mutex_lock(&group->mutex); > + if (WARN_ON(group->domain != group->default_domain && > + group->domain != group->blocking_domain)) { This will false trigger, if there are two VFIO devices then the group will remained owned when we unplug one just of them, but the group's domain will be a VFIO owned domain. It is why I put the list_empty() protection, as the test only works if it is the last device. > + if (group->default_domain) > + __iommu_attach_device(group->default_domain, dev); > + else > + __iommu_detach_device(group->domain, dev); > + } This was very appealing, but I rejected it because it is too difficult to support multi-device groups that share the RID. In that case we expect that the first attach/detach of a device on the shared RID will reconfigure the iommu and the attach/deatch of all the other devices on the group with the same parameters will be a NOP. So in a VFIO configuration where two drivers are bound to a single group with shared RID and we unplug one device, this will rebind the shared RID and thus the entire group to blocking/default and break the still running VFIO on the second device. The device centric interface only works if we always apply the operation to every device in the group.. This is why I kept it as ops->release_device() with an implicit detach of the current domain inside the driver. release_device() has that special meaning of 'detach the domain but do not change a shared RID' And it misses the logic to WARN_ON if a domain is set and an external entity wrongly uses iommu_group_remove_device().. Jason
On 2022-09-09 19:30, Jason Gunthorpe wrote: > On Fri, Sep 09, 2022 at 06:57:59PM +0100, Robin Murphy wrote: > >>>> That then only leaves the issue that that domain may still become >>>> invalid at any point after the group mutex has been dropped. >>> >>> So that is this race: >>> >>> CPU0 CPU1 >>> iommu_release_device(a) >>> __iommu_group_remove_device(a) >>> iommu_device_use_default_domain(b) >>> iommu_domain_free(domain) >>> iommu_release_device(b) >>> ops->release_device(b) >>> ops->release_device(a) >>> // Boom, a is still attached to domain :( >>> >>> I can't think of how to solve this other than holding the group mutex >>> across release_device. See below. >> >> I see a few possibilities: >> >> - Backtrack slightly on its removal, and instead repurpose detach_dev >> into a specialised domain cleanup callback, called before or during >> iommu_group_remove_device(), with the group mutex held. > > See below for why that is somewhat troublesome.. > >> - Drivers that hold any kind of internal per-device references to >> domains - which is generally the root of this issue in the first place - >> can implement proper reference counting, so even if a domain is "freed" >> with a device still attached as above, it doesn't actually go away until >> release_device(a) cleans up the final dangling reference. I suggested >> the core doing this generically, but on reflection I think it's actually >> a lot more straightforward as a driver-internal thing. > > Isn't this every driver though? Like every single driver implementing > an UNMANAGED/DMA/DMA_FQ domain has a hidden reference to the > iommu_domain - minimally to point the HW to the IOPTEs it stores. Um, no? Domain ops get the domain passed in as an argument, which is far from hidden, and if any driver implemented them to ignore that argument and operate on something else it would be stupid and broken. Note I said "per-device reference", meaning things like s390's zpci_dev->s390_domain and SMMUv3's dev->iommu->priv->domain. It's only those references that are reachable from release_device - outside the normal domain lifecycle - which are problematic. >> - Drivers that basically just keep a list of devices in the domain and >> need to do a list_del() in release_device, can also list_del_init() any >> still-attached devices in domain_free, with a simple per-instance or >> global lock to serialise the two. > > Compared to just locking ops->release_device() these all seem more > complicated? Well, yes, but they are still potentially-viable examples of the alternative solutions you said you couldn't think of ;) > IMHO the core code should try to protect the driver from racing > release with anything else. > > Do you know a reason not to hold the group mutex across > release_device? I think that is the most straightforward and > future proof. Yes, the ones documented in the code and already discussed here. The current functional ones aren't particularly *good* reasons, but unless and until they can all be cleaned up they are what they are. > Arguably all the device ops should be serialized under the group > mutex. Maybe once groups and default domains are used consistently everywhere. And notwithstanding that half the ops have no association with a group, are needed before or as part of obtaining a group, or were explicitly intended to allow calling back into other APIs that might lock the relevant group :P >> @@ -1022,6 +1030,14 @@ void iommu_group_remove_device(struct device *dev) >> dev_info(dev, "Removing from iommu group %d\n", group->id); >> mutex_lock(&group->mutex); >> + if (WARN_ON(group->domain != group->default_domain && >> + group->domain != group->blocking_domain)) { > > This will false trigger, if there are two VFIO devices then the group > will remained owned when we unplug one just of them, but the group's domain > will be a VFIO owned domain. As opposed to currently, where most drivers' release_device will blindly detach/disable the RID in some fashion so the other device would suddenly blow up anyway? A warning of the impending disaster might be quite informative, I reckon. > It is why I put the list_empty() protection, as the test only works if > it is the last device. > >> + if (group->default_domain) >> + __iommu_attach_device(group->default_domain, dev); >> + else >> + __iommu_detach_device(group->domain, dev); >> + } > > This was very appealing, but I rejected it because it is too difficult > to support multi-device groups that share the RID. > > In that case we expect that the first attach/detach of a device on the > shared RID will reconfigure the iommu and the attach/deatch of all the > other devices on the group with the same parameters will be a NOP. > > So in a VFIO configuration where two drivers are bound to a single > group with shared RID and we unplug one device, this will rebind the > shared RID and thus the entire group to blocking/default and break the > still running VFIO on the second device. As above, I am supremely confident that nobody has ever done that because it is already broken on everything that matters. (It *will* actually work on SMMUv2 because SMMUv2 comprehensively handles StreamID-level aliasing beyond what pci_device_group() covers, which I remain rather proud of) > The device centric interface only works if we always apply the > operation to every device in the group.. > > This is why I kept it as ops->release_device() with an implicit detach > of the current domain inside the driver. release_device() has that > special meaning of 'detach the domain but do not change a shared RID' If you want to rely on that notion, you'll need to tell all the major drivers about it first, I'm afraid. > And it misses the logic to WARN_ON if a domain is set and an external > entity wrongly uses iommu_group_remove_device().. Huh? An external fake group couldn't have a default domain or blocking domain, thus any non-NULL domain will not compare equal to either, so if that could happen it will warn, and then most likely crash. I did think briefly about trying to make it not crash, but then I remembered that fake groups from external callers also aren't backed by IOMMU API drivers so have no way to allocate or attach domains either, so in fact it cannot happen at all under any circumstances that are worth reasoning about. Yes, some nefarious driver could call iommu_group_remove_device() on random devices. It could also call kfree(dev->iommu_group). Or kfree(iommu_group_remove_device). Fundamentally-broken kernel code can crash the kernel, whoop de do. Thanks, Robin.
On Fri, Sep 09, 2022 at 08:55:07PM +0100, Robin Murphy wrote: > > Isn't this every driver though? Like every single driver implementing > > an UNMANAGED/DMA/DMA_FQ domain has a hidden reference to the > > iommu_domain - minimally to point the HW to the IOPTEs it stores. > > Um, no? Domain ops get the domain passed in as an argument, which is far > from hidden, and if any driver implemented them to ignore that argument and > operate on something else it would be stupid and broken. Note I said > "per-device reference", meaning things like s390's zpci_dev->s390_domain and > SMMUv3's dev->iommu->priv->domain. It's only those references that are > reachable from release_device - outside the normal domain lifecycle - which > are problematic. If the plan is to make the domain refcounted and then allow a 'put' on it before we reach release_device() then it means every driver needs to hold a 'get' on the domain while it is programmed into the HW. Because the hw will still be touching memory that could be freed by an iommu_domain_free(). By "hidden" reference I mean the HW walkers are touching memory that would be freed - ie kasn won't see it. > > Do you know a reason not to hold the group mutex across > > release_device? I think that is the most straightforward and > > future proof. > > Yes, the ones documented in the code and already discussed here. The current > functional ones aren't particularly *good* reasons, but unless and until > they can all be cleaned up they are what they are. Uh, I feel like I missed part of the conversation - I don't know what this list is.. I did look. I'm looking for a call chain that goes from release_device() into a core function that grabs the group->mutex. There is a comment in iommu_group_store_type() that suggest there is a recursion but doesn't say what it is. It was an Intel person who wrote the comment so I more carefully checked the intel driver and didn't find a call path, but it is big and complicated.. There is a commment in iommu_change_dev_def_domain() about recursion on probe_finalize(), not relevant here, AFAICT. So, I did two approaches, one I checked quickly through every release_device looking for something. Then I looked across the entire exported driver facing API and focused on callchains going back toward the core from APIs that might be trouble and audited them almost completely. These APIs do not take the lock, so completely safe: iommu_group_alloc iommu_group_set_iommudata iommu_group_set_name iommu_group_get iommu_group_ref_get iommu_group_put iommu_get_domain_for_dev iommu_fwspec_free iommu_fwspec_init iommu_fwspec_add_ids iommu_put_resv_regions (called from release_device) Does take the lock. Checked all of these in all the drivers, didn't find an obvious path to release_device: iommu_group_remove_device iommu_group_for_each_dev iommu_attach_device iommu_detach_device iommu_attach_group iommu_detach_group bus_set_iommu Can't tell if these take lock due to driver callbacks, but couldn't find them in release, and doesn't make sense they would be there: iommu_device_register iommu_device_unregister iommu_domain_alloc iommu_domain_free Rest of the exported interface touching the drivers - didn't carefully check if they are using the lock - however by name seems unlikely these are in release_device(): iommu_register_device_fault_handler iommu_unregister_device_fault_handler iommu_report_device_fault iommu_page_response report_iommu_fault iommu_iova_to_phys iommu_map iommu_unmap iommu_alloc_resv_region iommu_present iommu_capable iommu_default_passthrough It is big and complicated, so I wouldn't stake my life on it, but it seems worth investigating further. Could the recursion have been cleaned up already? > > > @@ -1022,6 +1030,14 @@ void iommu_group_remove_device(struct device *dev) > > > dev_info(dev, "Removing from iommu group %d\n", group->id); > > > mutex_lock(&group->mutex); > > > + if (WARN_ON(group->domain != group->default_domain && > > > + group->domain != group->blocking_domain)) { > > > > This will false trigger, if there are two VFIO devices then the group > > will remained owned when we unplug one just of them, but the group's domain > > will be a VFIO owned domain. > > As opposed to currently, where most drivers' release_device will blindly > detach/disable the RID in some fashion so the other device would suddenly > blow up anyway? Er, I think it is OK today, in the non-shared case. If the RID isn't shared then each device in the group is independent, so most drivers, most of the time, should only effect the RID release_device() is called on, while this warning will always trigger for any multi-device group. > (It *will* actually work on SMMUv2 because SMMUv2 comprehensively handles > StreamID-level aliasing beyond what pci_device_group() covers, which I > remain rather proud of) This is why I prefered not to explicitly change the domain, because at least if someone did write a non-buggy driver it doesn't get wrecked - and making a non-buggy driver is at least allowed by the API. > > And it misses the logic to WARN_ON if a domain is set and an external > > entity wrongly uses iommu_group_remove_device().. > > Huh? An external fake group couldn't have a default domain or blocking > domain, thus any non-NULL domain will not compare equal to either, so if > that could happen it will warn, and then most likely crash. I did think > briefly about trying to make it not crash, but then I remembered that fake > groups from external callers also aren't backed by IOMMU API drivers so have > no way to allocate or attach domains either, so in fact it cannot happen at > all under any circumstances that are worth reasoning about. I mean specificaly thing like FSL is doing where it is a real driver calling this API and the test of 'group->domain == NULL' is the more robust precondition. So, IDK, I would perfer to understand where we hit a group mutex recursion before rejecting that path... If you know specifics please share, otherwise maybe we should stick in a lockdep check there and see if anything hits? But I'm off to LPC so I probably won't write anything more thoughtful on this for a while. Thanks, Jason
On 2022-09-10 00:45, Jason Gunthorpe wrote: > On Fri, Sep 09, 2022 at 08:55:07PM +0100, Robin Murphy wrote: > >>> Isn't this every driver though? Like every single driver implementing >>> an UNMANAGED/DMA/DMA_FQ domain has a hidden reference to the >>> iommu_domain - minimally to point the HW to the IOPTEs it stores. >> >> Um, no? Domain ops get the domain passed in as an argument, which is far >> from hidden, and if any driver implemented them to ignore that argument and >> operate on something else it would be stupid and broken. Note I said >> "per-device reference", meaning things like s390's zpci_dev->s390_domain and >> SMMUv3's dev->iommu->priv->domain. It's only those references that are >> reachable from release_device - outside the normal domain lifecycle - which >> are problematic. > > If the plan is to make the domain refcounted and then allow a 'put' on > it before we reach release_device() then it means every driver needs > to hold a 'get' on the domain while it is programmed into the HW. > > Because the hw will still be touching memory that could be freed by an > iommu_domain_free(). By "hidden" reference I mean the HW walkers are > touching memory that would be freed - ie kasn won't see it. As far as I'm concerned we're dealing purely with the case where release_device races with attaching back to the default domain *and* the driver has some reason for release_device to poke at what it thinks the currently-attached domain is. Anyone who frees a domain while it's still actually live deserves whatever they get; it would be thoroughly impractical to attempt to mitigate for that kind of silliness. >>> Do you know a reason not to hold the group mutex across >>> release_device? I think that is the most straightforward and >>> future proof. >> >> Yes, the ones documented in the code and already discussed here. The current >> functional ones aren't particularly *good* reasons, but unless and until >> they can all be cleaned up they are what they are. > > Uh, I feel like I missed part of the conversation - I don't know what > this list is.. s390 (remember how we got here?) calls iommu_get_domain_for_dev(). ipmmu-vmsa calls arm_iommu_detach_device() (mtk_v1 doesn't, but perhaps technically should), to undo the corresponding attach from probe_finalize - apologies for misremembering which way round the comments were. >>>> @@ -1022,6 +1030,14 @@ void iommu_group_remove_device(struct device *dev) >>>> dev_info(dev, "Removing from iommu group %d\n", group->id); >>>> mutex_lock(&group->mutex); >>>> + if (WARN_ON(group->domain != group->default_domain && >>>> + group->domain != group->blocking_domain)) { >>> >>> This will false trigger, if there are two VFIO devices then the group >>> will remained owned when we unplug one just of them, but the group's domain >>> will be a VFIO owned domain. >> >> As opposed to currently, where most drivers' release_device will blindly >> detach/disable the RID in some fashion so the other device would suddenly >> blow up anyway? > > Er, I think it is OK today, in the non-shared case. If the RID isn't > shared then each device in the group is independent, so most drivers, > most of the time, should only effect the RID release_device() is > called on, while this warning will always trigger for any multi-device > group. Oh, apparently I managed to misinterpret this as the two *aliasing* devices case, sorry. Indeed it is overly conservative for that. I think the robust way to detect bad usage is actually not via the group at all, but for iommu_device_{un}use_default_domain() to also maintain a per-device ownership flag, then we warn if a device is released with that still set. >> (It *will* actually work on SMMUv2 because SMMUv2 comprehensively handles >> StreamID-level aliasing beyond what pci_device_group() covers, which I >> remain rather proud of) > > This is why I prefered not to explicitly change the domain, because at > least if someone did write a non-buggy driver it doesn't get wrecked - > and making a non-buggy driver is at least allowed by the API. Detaching back to the default domain still seems like it's *always* the right thing to do at this point, even when it should not be warned about as above. As I say it *does* work on non-buggy drivers, and making this whole domain use-after-free race a fundamental non-issue is attractive. >>> And it misses the logic to WARN_ON if a domain is set and an external >>> entity wrongly uses iommu_group_remove_device().. >> >> Huh? An external fake group couldn't have a default domain or blocking >> domain, thus any non-NULL domain will not compare equal to either, so if >> that could happen it will warn, and then most likely crash. I did think >> briefly about trying to make it not crash, but then I remembered that fake >> groups from external callers also aren't backed by IOMMU API drivers so have >> no way to allocate or attach domains either, so in fact it cannot happen at >> all under any circumstances that are worth reasoning about. > > I mean specificaly thing like FSL is doing where it is a real driver > calling this API and the test of 'group->domain == NULL' is the more > robust precondition. Having looked a bit closer, I think I get what PAMU is doing - kind of impedance-matching between pci_device_group() and its own non-ACS form of isolation, and possibly also a rather roundabout way of propagating DT data from the PCI controller node up into the PCI hierarchy. Both could quite likely be done in a more straightforward manner these days (and TBH I'm not convinced it works at all since it doesn't appear to match the actual DT binding), but either way I'm fairly confident we needn't worry about it. Thanks, Robin.
On Mon, Sep 12, 2022 at 12:13:25PM +0100, Robin Murphy wrote: > > > Yes, the ones documented in the code and already discussed here. The current > > > functional ones aren't particularly *good* reasons, but unless and until > > > they can all be cleaned up they are what they are. > > > > Uh, I feel like I missed part of the conversation - I don't know what > > this list is.. > > s390 (remember how we got here?) calls iommu_get_domain_for_dev(). iommu_get_domain_for_dev() doesn't take a lock, and the last try I showed (see below) ordered things so that it would succeed when called under release(). AFIACT it is already not a problem. > ipmmu-vmsa calls arm_iommu_detach_device() (mtk_v1 doesn't, but perhaps > technically should), to undo the corresponding attach from probe_finalize - > apologies for misremembering which way round the comments were. This does eventually deadlock on the group->mutex, so yes this is a problem! And I agree mtk_v1 does looks like it has a memory leak. But, this looks easy enough to fix up. See below > Oh, apparently I managed to misinterpret this as the two *aliasing* devices > case, sorry. Indeed it is overly conservative for that. I think the robust > way to detect bad usage is actually not via the group at all, but for > iommu_device_{un}use_default_domain() to also maintain a per-device > ownership flag, then we warn if a device is released with that still set. It seems a bit hard to implement a per-device flag with VFIO? > > > (It *will* actually work on SMMUv2 because SMMUv2 comprehensively handles > > > StreamID-level aliasing beyond what pci_device_group() covers, which I > > > remain rather proud of) > > > > This is why I prefered not to explicitly change the domain, because at > > least if someone did write a non-buggy driver it doesn't get wrecked - > > and making a non-buggy driver is at least allowed by the API. > > Detaching back to the default domain still seems like it's *always* the > right thing to do at this point, If the RID is aliased it is the wrong thing to do. Calling attach will wreck the entire alias set. release is not supposed to do that, buggy drivers excepted. > Having looked a bit closer, I think I get what PAMU is doing - kind of > impedance-matching between pci_device_group() and its own non-ACS form of > isolation, and possibly also a rather roundabout way of propagating DT data > from the PCI controller node up into the PCI hierarchy. Both could quite > likely be done in a more straightforward manner these days (and TBH I'm not > convinced it works at all since it doesn't appear to match the actual DT > binding), but either way I'm fairly confident we needn't worry about it. Yes, this is what I thought too. Arguably it is wrong to try and rework the groups once created, it should be creating the groups to cover what it wants from the start, and it shouldn't leave the controller without a group. So something like the below is what I'm thinking Thanks, Jason diff --git a/arch/arm/include/asm/dma-iommu.h b/arch/arm/include/asm/dma-iommu.h index fe9ef6f79e9cfe..ea7198a1786186 100644 --- a/arch/arm/include/asm/dma-iommu.h +++ b/arch/arm/include/asm/dma-iommu.h @@ -31,6 +31,7 @@ void arm_iommu_release_mapping(struct dma_iommu_mapping *mapping); int arm_iommu_attach_device(struct device *dev, struct dma_iommu_mapping *mapping); void arm_iommu_detach_device(struct device *dev); +void arm_iommu_release_device(struct device *dev); #endif /* __KERNEL__ */ #endif diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c index 089c9c644cce2a..1694bebb3aa4dc 100644 --- a/arch/arm/mm/dma-mapping.c +++ b/arch/arm/mm/dma-mapping.c @@ -1697,13 +1697,15 @@ int arm_iommu_attach_device(struct device *dev, EXPORT_SYMBOL_GPL(arm_iommu_attach_device); /** - * arm_iommu_detach_device + * arm_iommu_release_device * @dev: valid struct device pointer * - * Detaches the provided device from a previously attached map. - * This overwrites the dma_ops pointer with appropriate non-IOMMU ops. + * This is like arm_iommu_detach_device() except it handles the special + * case of being called under an iommu driver's release operation. In this + * case the driver must have already detached the device from any attached + * domain before calling this function. */ -void arm_iommu_detach_device(struct device *dev) +void arm_iommu_release_device(struct device *dev) { struct dma_iommu_mapping *mapping; @@ -1713,13 +1715,34 @@ void arm_iommu_detach_device(struct device *dev) return; } - iommu_detach_device(mapping->domain, dev); kref_put(&mapping->kref, release_iommu_mapping); to_dma_iommu_mapping(dev) = NULL; set_dma_ops(dev, NULL); pr_debug("Detached IOMMU controller from %s device.\n", dev_name(dev)); } +EXPORT_SYMBOL_GPL(arm_iommu_release_device); + +/** + * arm_iommu_detach_device + * @dev: valid struct device pointer + * + * Detaches the provided device from a previously attached map. + * This overwrites the dma_ops pointer with appropriate non-IOMMU ops. + */ +void arm_iommu_detach_device(struct device *dev) +{ + struct dma_iommu_mapping *mapping; + + mapping = to_dma_iommu_mapping(dev); + if (!mapping) { + dev_warn(dev, "Not attached\n"); + return; + } + + iommu_detach_device(mapping->domain, dev); + arm_iommu_release_device(dev); +} EXPORT_SYMBOL_GPL(arm_iommu_detach_device); static void arm_setup_iommu_dma_ops(struct device *dev, u64 dma_base, u64 size, diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index 780fb70715770d..61444b2a11e217 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -90,6 +90,10 @@ static int iommu_create_device_direct_mappings(struct iommu_group *group, static struct iommu_group *iommu_group_get_for_dev(struct device *dev); static ssize_t iommu_group_store_type(struct iommu_group *group, const char *buf, size_t count); +static struct group_device * +__iommu_group_remove_device(struct iommu_group *group, struct device *dev); +static void __iommu_group_remove_device_sysfs(struct iommu_group *group, + struct group_device *device); #define IOMMU_GROUP_ATTR(_name, _mode, _show, _store) \ struct iommu_group_attribute iommu_group_attr_##_name = \ @@ -330,18 +334,50 @@ int iommu_probe_device(struct device *dev) void iommu_release_device(struct device *dev) { + struct iommu_group *group = dev->iommu_group; const struct iommu_ops *ops; + struct group_device *device; if (!dev->iommu) return; iommu_device_unlink(dev->iommu->iommu_dev, dev); + mutex_lock(&group->mutex); + device = __iommu_group_remove_device(group, dev); ops = dev_iommu_ops(dev); + + /* + * If the group has become empty then ownership must have been released, + * and the current domain must be set back to NULL or the default + * domain. + */ + if (list_empty(&group->devices)) + WARN_ON(group->owner_cnt || + group->domain != group->default_domain); + + /* + * release_device() must stop using any attached domain on the device. + * If there are still other devices in the group they are not effected + * by this callback. + * + * The IOMMU driver must set the device to either an identity or + * blocking translation and stop using any domain pointer, as it is + * going to be freed. + */ if (ops->release_device) ops->release_device(dev); + mutex_unlock(&group->mutex); + + __iommu_group_remove_device_sysfs(group, device); + + /* + * This will eventually call iommu_group_release() which will free the + * iommu_domains. + */ + dev->iommu_group = NULL; + kobject_put(group->devices_kobj); - iommu_group_remove_device(dev); module_put(ops->owner); dev_iommu_free(dev); } @@ -939,44 +975,71 @@ int iommu_group_add_device(struct iommu_group *group, struct device *dev) } EXPORT_SYMBOL_GPL(iommu_group_add_device); -/** - * iommu_group_remove_device - remove a device from it's current group - * @dev: device to be removed - * - * This function is called by an iommu driver to remove the device from - * it's current group. This decrements the iommu group reference count. - */ -void iommu_group_remove_device(struct device *dev) +static struct group_device * +__iommu_group_remove_device(struct iommu_group *group, struct device *dev) { - struct iommu_group *group = dev->iommu_group; - struct group_device *tmp_device, *device = NULL; + struct group_device *device; + + lockdep_assert_held(&group->mutex); if (!group) - return; + return NULL; dev_info(dev, "Removing from iommu group %d\n", group->id); - mutex_lock(&group->mutex); - list_for_each_entry(tmp_device, &group->devices, list) { - if (tmp_device->dev == dev) { - device = tmp_device; + list_for_each_entry(device, &group->devices, list) { + if (device->dev == dev) { list_del(&device->list); - break; + return device; } } - mutex_unlock(&group->mutex); + return NULL; +} +static void __iommu_group_remove_device_sysfs(struct iommu_group *group, + struct group_device *device) +{ if (!device) return; sysfs_remove_link(group->devices_kobj, device->name); - sysfs_remove_link(&dev->kobj, "iommu_group"); + sysfs_remove_link(&device->dev->kobj, "iommu_group"); - trace_remove_device_from_group(group->id, dev); + trace_remove_device_from_group(group->id, device->dev); kfree(device->name); kfree(device); - dev->iommu_group = NULL; +} + +/** + * iommu_group_remove_device - remove a device from it's current group + * @dev: device to be removed + * + * This function is used by non-iommu drivers to create non-iommu subystem + * groups (eg VFIO mdev, SPAPR) Using this from inside an iommu driver is an + * abuse. Instead the driver should return the correct group from + * ops->device_group() + */ +void iommu_group_remove_device(struct device *dev) +{ + struct iommu_group *group = dev->iommu_group; + struct group_device *device; + + /* + * Since we don't do ops->release_device() there is no way for the + * driver to stop using any attached domain before we free it. If a + * domain is attached while this function is called it will eventually + * UAF. + * + * Thus it is only useful for cases like VFIO/SPAPR that don't use an + * iommu driver, or for cases like FSL that don't use default domains. + */ + WARN_ON(group->domain); + + mutex_lock(&group->mutex); + device = __iommu_group_remove_device(group, dev); + mutex_unlock(&group->mutex); + __iommu_group_remove_device_sysfs(group, device); kobject_put(group->devices_kobj); } EXPORT_SYMBOL_GPL(iommu_group_remove_device); diff --git a/drivers/iommu/ipmmu-vmsa.c b/drivers/iommu/ipmmu-vmsa.c index 1d42084d02767e..f5b9787d22420c 100644 --- a/drivers/iommu/ipmmu-vmsa.c +++ b/drivers/iommu/ipmmu-vmsa.c @@ -302,11 +302,8 @@ static void ipmmu_utlb_enable(struct ipmmu_vmsa_domain *domain, /* * Disable MMU translation for the microTLB. */ -static void ipmmu_utlb_disable(struct ipmmu_vmsa_domain *domain, - unsigned int utlb) +static void ipmmu_utlb_disable(struct ipmmu_vmsa_device *mmu, unsigned int utlb) { - struct ipmmu_vmsa_device *mmu = domain->mmu; - ipmmu_imuctr_write(mmu, utlb, 0); mmu->utlb_ctx[utlb] = IPMMU_CTX_INVALID; } @@ -649,11 +646,11 @@ static void ipmmu_detach_device(struct iommu_domain *io_domain, struct device *dev) { struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev); - struct ipmmu_vmsa_domain *domain = to_vmsa_domain(io_domain); + struct ipmmu_vmsa_device *mmu = to_ipmmu(dev); unsigned int i; for (i = 0; i < fwspec->num_ids; ++i) - ipmmu_utlb_disable(domain, fwspec->ids[i]); + ipmmu_utlb_disable(mmu, fwspec->ids[i]); /* * TODO: Optimize by disabling the context when no device is attached. @@ -849,7 +846,8 @@ static void ipmmu_probe_finalize(struct device *dev) static void ipmmu_release_device(struct device *dev) { - arm_iommu_detach_device(dev); + ipmmu_detach_device(NULL, dev); + arm_iommu_release_device(dev); } static struct iommu_group *ipmmu_find_group(struct device *dev)
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index 780fb70715770d..c451bf715182ac 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -90,6 +90,7 @@ static int iommu_create_device_direct_mappings(struct iommu_group *group, static struct iommu_group *iommu_group_get_for_dev(struct device *dev); static ssize_t iommu_group_store_type(struct iommu_group *group, const char *buf, size_t count); +static void __iommu_group_remove_device(struct device *dev); #define IOMMU_GROUP_ATTR(_name, _mode, _show, _store) \ struct iommu_group_attribute iommu_group_attr_##_name = \ @@ -330,6 +331,7 @@ int iommu_probe_device(struct device *dev) void iommu_release_device(struct device *dev) { + struct iommu_group *group = dev->iommu_group; const struct iommu_ops *ops; if (!dev->iommu) @@ -337,11 +339,20 @@ void iommu_release_device(struct device *dev) iommu_device_unlink(dev->iommu->iommu_dev, dev); + __iommu_group_remove_device(dev); ops = dev_iommu_ops(dev); if (ops->release_device) ops->release_device(dev); - iommu_group_remove_device(dev); + /* + * This will eventually call iommu_group_release() which will free the + * iommu_domains. Up until the release_device() above the iommu_domains + * may still have been associated with the device, and we cannot free + * them until the have been detached. release_device() is expected to + * detach all domains connected to the dev. + */ + kobject_put(group->devices_kobj); + module_put(ops->owner); dev_iommu_free(dev); } @@ -939,14 +950,7 @@ int iommu_group_add_device(struct iommu_group *group, struct device *dev) } EXPORT_SYMBOL_GPL(iommu_group_add_device); -/** - * iommu_group_remove_device - remove a device from it's current group - * @dev: device to be removed - * - * This function is called by an iommu driver to remove the device from - * it's current group. This decrements the iommu group reference count. - */ -void iommu_group_remove_device(struct device *dev) +static void __iommu_group_remove_device(struct device *dev) { struct iommu_group *group = dev->iommu_group; struct group_device *tmp_device, *device = NULL; @@ -977,6 +981,20 @@ void iommu_group_remove_device(struct device *dev) kfree(device->name); kfree(device); dev->iommu_group = NULL; +} + +/** + * iommu_group_remove_device - remove a device from it's current group + * @dev: device to be removed + * + * This function is called by an iommu driver to remove the device from + * it's current group. This decrements the iommu group reference count. + */ +void iommu_group_remove_device(struct device *dev) +{ + struct iommu_group *group = dev->iommu_group; + + __iommu_group_remove_device(dev); kobject_put(group->devices_kobj); } EXPORT_SYMBOL_GPL(iommu_group_remove_device);