Message ID | e3b191e6fd6ca9a1e84c5e5e40044faf97abb874.1740753261.git.robin.murphy@arm.com (mailing list archive) |
---|---|
State | Handled Elsewhere |
Headers | show |
Series | iommu: Fix the longstanding probe issues | expand |
On Fri, Feb 28, 2025 at 03:46:33PM +0000, Robin Murphy wrote: > + if (!dev->driver && dev->bus->dma_configure) { > + mutex_unlock(&iommu_probe_device_lock); > + dev->bus->dma_configure(dev); > + mutex_lock(&iommu_probe_device_lock); > + } I think it would be very nice to get rid of the lock/unlock.. It makes me nervous that we continue on assuming dev->iommu was freshly allocated.. setup the dev->iommu partially, then drop the lock. There is only one other caller in: static int really_probe(struct device *dev, const struct device_driver *drv) { if (dev->bus->dma_configure) { ret = dev->bus->dma_configure(dev); if (ret) goto pinctrl_bind_failed; } Is it feasible to make it so the caller has to hold the iommu_probe_device_lock prior to calling the op? That would require moving the locking inside of_dma_configure to less inside, and using a new iommu_probe_device() wrapper. However, if you plan to turn this inside out soonish then it would not be worth the bother. Anyhow: Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Jason
On Fri, Feb 28, 2025 at 03:46:33PM +0000, Robin Murphy wrote: > In hindsight, there were some crucial subtleties overlooked when moving > {of,acpi}_dma_configure() to driver probe time to allow waiting for > IOMMU drivers with -EPROBE_DEFER, and these have become an > ever-increasing source of problems. The IOMMU API has some fundamental > assumptions that iommu_probe_device() is called for every device added > to the system, in the order in which they are added. Calling it in a > random order or not at all dependent on driver binding leads to > malformed groups, a potential lack of isolation for devices with no > driver, and all manner of unexpected concurrency and race conditions. > We've attempted to mitigate the latter with point-fix bodges like > iommu_probe_device_lock, but it's a losing battle and the time has come > to bite the bullet and address the true source of the problem instead. > > The crux of the matter is that the firmware parsing actually serves two > distinct purposes; one is identifying the IOMMU instance associated with > a device so we can check its availability, the second is actually > telling that instance about the relevant firmware-provided data for the > device. However the latter also depends on the former, and at the time > there was no good place to defer and retry that separately from the > availability check we also wanted for client driver probe. > > Nowadays, though, we have a proper notion of multiple IOMMU instances in > the core API itself, and each one gets a chance to probe its own devices > upon registration, so we can finally make that work as intended for > DT/IORT/VIOT platforms too. All we need is for iommu_probe_device() to > be able to run the iommu_fwspec machinery currently buried deep in the > wrong end of {of,acpi}_dma_configure(). Luckily it turns out to be > surprisingly straightforward to bootstrap this transformation by pretty > much just calling the same path twice. At client driver probe time, > dev->driver is obviously set; conversely at device_add(), or a > subsequent bus_iommu_probe(), any device waiting for an IOMMU really > should *not* have a driver already, so we can use that as a condition to > disambiguate the two cases, and avoid recursing back into the IOMMU core > at the wrong times. > > Obviously this isn't the nicest thing, but for now it gives us a > functional baseline to then unpick the layers in between without many > more awkward cross-subsystem patches. There are some minor side-effects > like dma_range_map potentially being created earlier, and some debug > prints being repeated, but these aren't significantly detrimental. Let's > make things work first, then deal with making them nice. > > With the basic flow finally in the right order again, the next step is > probably turning the bus->dma_configure paths inside-out, since all we > really need from bus code is its notion of which device and input ID(s) > to parse the common firmware properties with... > > Acked-by: Bjorn Helgaas <bhelgaas@google.com> # pci-driver.c > Acked-by: Rob Herring (Arm) <robh@kernel.org> # of/device.c > Signed-off-by: Robin Murphy <robin.murphy@arm.com> > --- > > v2: > - Comment bus driver changes for clarity > - Use dev->iommu as the now-robust replay condition > - Drop the device_iommu_mapped() checks in the firmware paths as they > weren't doing much - we can't replace probe_device_lock just yet... > > drivers/acpi/arm64/dma.c | 5 +++++ > drivers/acpi/scan.c | 7 ------- > drivers/amba/bus.c | 3 ++- > drivers/base/platform.c | 3 ++- > drivers/bus/fsl-mc/fsl-mc-bus.c | 3 ++- > drivers/cdx/cdx.c | 3 ++- > drivers/iommu/iommu.c | 24 +++++++++++++++++++++--- > drivers/iommu/of_iommu.c | 7 ++++++- > drivers/of/device.c | 7 ++++++- > drivers/pci/pci-driver.c | 3 ++- > 10 files changed, 48 insertions(+), 17 deletions(-) > > diff --git a/drivers/acpi/arm64/dma.c b/drivers/acpi/arm64/dma.c > index 52b2abf88689..f30f138352b7 100644 > --- a/drivers/acpi/arm64/dma.c > +++ b/drivers/acpi/arm64/dma.c > @@ -26,6 +26,11 @@ void acpi_arch_dma_setup(struct device *dev) > else > end = (1ULL << 32) - 1; > > + if (dev->dma_range_map) { > + dev_dbg(dev, "dma_range_map already set\n"); > + return; > + } > + Why are we checking that dev->dma_range_map exists here rather than at function entry ? Is that because we want to run the previous code for some reasons even if dev->dma_range_map is already set ? Just noticed, the OF counterpart does not seem to take the same approach, so I am just flagging this up if it matters at all. Other than that, for acpi/arm64: Reviewed-by: Lorenzo Pieralisi <lpieralisi@kernel.org> > ret = acpi_dma_get_range(dev, &map); > if (!ret && map) { > end = dma_range_map_max(map); > diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c > index 9f4efa8f75a6..fb1fe9f3b1a3 100644 > --- a/drivers/acpi/scan.c > +++ b/drivers/acpi/scan.c > @@ -1632,13 +1632,6 @@ static int acpi_iommu_configure_id(struct device *dev, const u32 *id_in) > err = viot_iommu_configure(dev); > mutex_unlock(&iommu_probe_device_lock); > > - /* > - * If we have reason to believe the IOMMU driver missed the initial > - * iommu_probe_device() call for dev, replay it to get things in order. > - */ > - if (!err && dev->bus) > - err = iommu_probe_device(dev); > - > return err; > } > > diff --git a/drivers/amba/bus.c b/drivers/amba/bus.c > index 8ef259b4d037..71482d639a6d 100644 > --- a/drivers/amba/bus.c > +++ b/drivers/amba/bus.c > @@ -364,7 +364,8 @@ static int amba_dma_configure(struct device *dev) > ret = acpi_dma_configure(dev, attr); > } > > - if (!ret && !drv->driver_managed_dma) { > + /* @drv may not be valid when we're called from the IOMMU layer */ > + if (!ret && dev->driver && !drv->driver_managed_dma) { > ret = iommu_device_use_default_domain(dev); > if (ret) > arch_teardown_dma_ops(dev); > diff --git a/drivers/base/platform.c b/drivers/base/platform.c > index 6f2a33722c52..1813cfd0c4bd 100644 > --- a/drivers/base/platform.c > +++ b/drivers/base/platform.c > @@ -1451,7 +1451,8 @@ static int platform_dma_configure(struct device *dev) > attr = acpi_get_dma_attr(to_acpi_device_node(fwnode)); > ret = acpi_dma_configure(dev, attr); > } > - if (ret || drv->driver_managed_dma) > + /* @drv may not be valid when we're called from the IOMMU layer */ > + if (ret || !dev->driver || drv->driver_managed_dma) > return ret; > > ret = iommu_device_use_default_domain(dev); > diff --git a/drivers/bus/fsl-mc/fsl-mc-bus.c b/drivers/bus/fsl-mc/fsl-mc-bus.c > index d1f3d327ddd1..a8be8cf246fb 100644 > --- a/drivers/bus/fsl-mc/fsl-mc-bus.c > +++ b/drivers/bus/fsl-mc/fsl-mc-bus.c > @@ -153,7 +153,8 @@ static int fsl_mc_dma_configure(struct device *dev) > else > ret = acpi_dma_configure_id(dev, DEV_DMA_COHERENT, &input_id); > > - if (!ret && !mc_drv->driver_managed_dma) { > + /* @mc_drv may not be valid when we're called from the IOMMU layer */ > + if (!ret && dev->driver && !mc_drv->driver_managed_dma) { > ret = iommu_device_use_default_domain(dev); > if (ret) > arch_teardown_dma_ops(dev); > diff --git a/drivers/cdx/cdx.c b/drivers/cdx/cdx.c > index c573ed2ee71a..780fb0c4adba 100644 > --- a/drivers/cdx/cdx.c > +++ b/drivers/cdx/cdx.c > @@ -360,7 +360,8 @@ static int cdx_dma_configure(struct device *dev) > return ret; > } > > - if (!ret && !cdx_drv->driver_managed_dma) { > + /* @cdx_drv may not be valid when we're called from the IOMMU layer */ > + if (!ret && dev->driver && !cdx_drv->driver_managed_dma) { > ret = iommu_device_use_default_domain(dev); > if (ret) > arch_teardown_dma_ops(dev); > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c > index a3b45b84f42b..1cec7074367a 100644 > --- a/drivers/iommu/iommu.c > +++ b/drivers/iommu/iommu.c > @@ -414,9 +414,21 @@ static int iommu_init_device(struct device *dev) > if (!dev_iommu_get(dev)) > return -ENOMEM; > /* > - * For FDT-based systems and ACPI IORT/VIOT, drivers register IOMMU > - * instances with non-NULL fwnodes, and client devices should have been > - * identified with a fwspec by this point. Otherwise, we can currently > + * For FDT-based systems and ACPI IORT/VIOT, the common firmware parsing > + * is buried in the bus dma_configure path. Properly unpicking that is > + * still a big job, so for now just invoke the whole thing. The device > + * already having a driver bound means dma_configure has already run and > + * either found no IOMMU to wait for, or we're in its replay call right > + * now, so either way there's no point calling it again. > + */ > + if (!dev->driver && dev->bus->dma_configure) { > + mutex_unlock(&iommu_probe_device_lock); > + dev->bus->dma_configure(dev); > + mutex_lock(&iommu_probe_device_lock); > + } > + /* > + * At this point, relevant devices either now have a fwspec which will > + * match ops registered with a non-NULL fwnode, or we can reasonably > * assume that only one of Intel, AMD, s390, PAMU or legacy SMMUv2 can > * be present, and that any of their registered instances has suitable > * ops for probing, and thus cheekily co-opt the same mechanism. > @@ -426,6 +438,12 @@ static int iommu_init_device(struct device *dev) > ret = -ENODEV; > goto err_free; > } > + /* > + * And if we do now see any replay calls, they would indicate someone > + * misusing the dma_configure path outside bus code. > + */ > + if (dev->driver) > + dev_WARN(dev, "late IOMMU probe at driver bind, something fishy here!\n"); > > if (!try_module_get(ops->owner)) { > ret = -EINVAL; > diff --git a/drivers/iommu/of_iommu.c b/drivers/iommu/of_iommu.c > index e10a68b5ffde..6b989a62def2 100644 > --- a/drivers/iommu/of_iommu.c > +++ b/drivers/iommu/of_iommu.c > @@ -155,7 +155,12 @@ int of_iommu_configure(struct device *dev, struct device_node *master_np, > dev_iommu_free(dev); > mutex_unlock(&iommu_probe_device_lock); > > - if (!err && dev->bus) > + /* > + * If we're not on the iommu_probe_device() path (as indicated by the > + * initial dev->iommu) then try to simulate it. This should no longer > + * happen unless of_dma_configure() is being misused outside bus code. > + */ > + if (!err && dev->bus && !dev_iommu_present) > err = iommu_probe_device(dev); > > if (err && err != -EPROBE_DEFER) > diff --git a/drivers/of/device.c b/drivers/of/device.c > index edf3be197265..5053e5d532cc 100644 > --- a/drivers/of/device.c > +++ b/drivers/of/device.c > @@ -99,6 +99,11 @@ int of_dma_configure_id(struct device *dev, struct device_node *np, > bool coherent, set_map = false; > int ret; > > + if (dev->dma_range_map) { > + dev_dbg(dev, "dma_range_map already set\n"); > + goto skip_map; > + } > + > if (np == dev->of_node) > bus_np = __of_get_dma_parent(np); > else > @@ -119,7 +124,7 @@ int of_dma_configure_id(struct device *dev, struct device_node *np, > end = dma_range_map_max(map); > set_map = true; > } > - > +skip_map: > /* > * If @dev is expected to be DMA-capable then the bus code that created > * it should have initialised its dma_mask pointer by this point. For > diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c > index f57ea36d125d..4468b7327cab 100644 > --- a/drivers/pci/pci-driver.c > +++ b/drivers/pci/pci-driver.c > @@ -1653,7 +1653,8 @@ static int pci_dma_configure(struct device *dev) > > pci_put_host_bridge_device(bridge); > > - if (!ret && !driver->driver_managed_dma) { > + /* @driver may not be valid when we're called from the IOMMU layer */ > + if (!ret && dev->driver && !driver->driver_managed_dma) { > ret = iommu_device_use_default_domain(dev); > if (ret) > arch_teardown_dma_ops(dev); > -- > 2.39.2.101.g768bb238c484.dirty >
On 2025-03-07 2:24 pm, Lorenzo Pieralisi wrote: [...] >> diff --git a/drivers/acpi/arm64/dma.c b/drivers/acpi/arm64/dma.c >> index 52b2abf88689..f30f138352b7 100644 >> --- a/drivers/acpi/arm64/dma.c >> +++ b/drivers/acpi/arm64/dma.c >> @@ -26,6 +26,11 @@ void acpi_arch_dma_setup(struct device *dev) >> else >> end = (1ULL << 32) - 1; >> >> + if (dev->dma_range_map) { >> + dev_dbg(dev, "dma_range_map already set\n"); >> + return; >> + } >> + > > Why are we checking that dev->dma_range_map exists here rather than > at function entry ? Is that because we want to run the previous code > for some reasons even if dev->dma_range_map is already set ? > > Just noticed, the OF counterpart does not seem to take the same > approach, so I am just flagging this up if it matters at all. The intent is just to ensure it's safe to come through this path more than once for the time being - it doesn't really matter whether or not we repeat anything apart from allocating and assigning dma_range_map, since that leaks the previous allocation. Thus this check is logically guarding the acpi_dma_get_range() call in this path, and the of_dma_get_range() call in the other. > Other than that, for acpi/arm64: > > Reviewed-by: Lorenzo Pieralisi <lpieralisi@kernel.org> Thanks! Robin.
Hi Robin, On Fri, Feb 28, 2025 at 03:46:33PM +0000, Robin Murphy wrote: > + /* > + * And if we do now see any replay calls, they would indicate someone > + * misusing the dma_configure path outside bus code. > + */ > + if (dev->driver) > + dev_WARN(dev, "late IOMMU probe at driver bind, something fishy here!\n"); This warning triggers on my workstation (with an AMD IOMMU), any ideas? ------------[ cut here ]------------ reg-dummy reg-dummy: late IOMMU probe at driver bind, something fishy here! WARNING: CPU: 0 PID: 1 at drivers/iommu/iommu.c:449 __iommu_probe_device+0x10b/0x510 Modules linked in: CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.14.0-rc6-iommu-next+ #1 1d691d7aebf343bde741cf4c8610d78a2ea2d2d9 Hardware name: System manufacturer System Product Name/PRIME X470-PRO, BIOS 5406 11/13/2019 RIP: 0010:__iommu_probe_device+0x10b/0x510 Code: 68 00 74 28 48 8b 6b 50 48 85 ed 75 03 48 8b 2b 48 89 df e8 87 71 06 00 48 89 ea 48 c7 c7 90 dd 2c 8b 48 89 c6 e8 35 83 77 ff <0f> 0b 49 8b bd a8 00 00 00 e8 77 ab 85 ff 84 c0 0f 84 ad 01 00 00 RSP: 0018:ffffafba00047c58 EFLAGS: 00010282 RAX: 0000000000000000 RBX: ffffa00481301c10 RCX: 0000000000000003 RDX: 0000000000000000 RSI: 0000000000000003 RDI: 0000000000000001 RBP: ffffa00480ffaee0 R08: 0000000000000000 R09: ffffafba00047ae8 R10: ffffa0135e93ffa8 R11: 0000000000000003 R12: ffffafba00047d18 R13: ffffffff8adac1a0 R14: 0000000000000000 R15: ffffa004802c5800 FS: 0000000000000000(0000) GS:ffffa0135ea00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffa00baca01000 CR3: 000000082b838000 CR4: 00000000003506f0 Call Trace: <TASK> ? __iommu_probe_device+0x10b/0x510 ? __warn.cold+0x93/0xf7 ? __iommu_probe_device+0x10b/0x510 ? report_bug+0xff/0x140 ? prb_read_valid+0x1b/0x30 ? handle_bug+0x58/0x90 ? exc_invalid_op+0x17/0x70 ? asm_exc_invalid_op+0x1a/0x20 ? __iommu_probe_device+0x10b/0x510 ? __iommu_probe_device+0x10b/0x510 ? __pfx_probe_iommu_group+0x10/0x10 probe_iommu_group+0x28/0x50 bus_for_each_dev+0x7e/0xd0 iommu_device_register+0xee/0x260 iommu_go_to_state+0xa2a/0x1970 ? srso_return_thunk+0x5/0x5f ? blake2s_update+0x68/0x160 ? __pfx_pci_iommu_init+0x10/0x10 amd_iommu_init+0x14/0x50 ? __pfx_pci_iommu_init+0x10/0x10 pci_iommu_init+0x12/0x40 do_one_initcall+0x46/0x300 kernel_init_freeable+0x23d/0x2a0 ? __pfx_kernel_init+0x10/0x10 kernel_init+0x1a/0x140 ret_from_fork+0x34/0x50 ? __pfx_kernel_init+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> ---[ end trace 0000000000000000 ]---
On 2025/3/12 2:42, Joerg Roedel wrote: > On Fri, Feb 28, 2025 at 03:46:33PM +0000, Robin Murphy wrote: >> + /* >> + * And if we do now see any replay calls, they would indicate someone >> + * misusing the dma_configure path outside bus code. >> + */ >> + if (dev->driver) >> + dev_WARN(dev, "late IOMMU probe at driver bind, something fishy here!\n"); > This warning triggers on my workstation (with an AMD IOMMU), any ideas? > > ------------[ cut here ]------------ > reg-dummy reg-dummy: late IOMMU probe at driver bind, something fishy here! > WARNING: CPU: 0 PID: 1 at drivers/iommu/iommu.c:449 __iommu_probe_device+0x10b/0x510 > Modules linked in: > > CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.14.0-rc6-iommu-next+ #1 1d691d7aebf343bde741cf4c8610d78a2ea2d2d9 > Hardware name: System manufacturer System Product Name/PRIME X470-PRO, BIOS 5406 11/13/2019 > RIP: 0010:__iommu_probe_device+0x10b/0x510 > Code: 68 00 74 28 48 8b 6b 50 48 85 ed 75 03 48 8b 2b 48 89 df e8 87 71 06 00 48 89 ea 48 c7 c7 90 dd 2c 8b 48 89 c6 e8 35 83 77 ff <0f> 0b 49 8b bd a8 00 00 00 e8 77 ab 85 ff 84 c0 0f 84 ad 01 00 00 > RSP: 0018:ffffafba00047c58 EFLAGS: 00010282 > RAX: 0000000000000000 RBX: ffffa00481301c10 RCX: 0000000000000003 > RDX: 0000000000000000 RSI: 0000000000000003 RDI: 0000000000000001 > RBP: ffffa00480ffaee0 R08: 0000000000000000 R09: ffffafba00047ae8 > R10: ffffa0135e93ffa8 R11: 0000000000000003 R12: ffffafba00047d18 > R13: ffffffff8adac1a0 R14: 0000000000000000 R15: ffffa004802c5800 > FS: 0000000000000000(0000) GS:ffffa0135ea00000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: ffffa00baca01000 CR3: 000000082b838000 CR4: 00000000003506f0 > Call Trace: > <TASK> > ? __iommu_probe_device+0x10b/0x510 > ? __warn.cold+0x93/0xf7 > ? __iommu_probe_device+0x10b/0x510 > ? report_bug+0xff/0x140 > ? prb_read_valid+0x1b/0x30 > ? handle_bug+0x58/0x90 > ? exc_invalid_op+0x17/0x70 > ? asm_exc_invalid_op+0x1a/0x20 > ? __iommu_probe_device+0x10b/0x510 > ? __iommu_probe_device+0x10b/0x510 > ? __pfx_probe_iommu_group+0x10/0x10 > probe_iommu_group+0x28/0x50 > bus_for_each_dev+0x7e/0xd0 > iommu_device_register+0xee/0x260 > iommu_go_to_state+0xa2a/0x1970 > ? srso_return_thunk+0x5/0x5f > ? blake2s_update+0x68/0x160 > ? __pfx_pci_iommu_init+0x10/0x10 > amd_iommu_init+0x14/0x50 > ? __pfx_pci_iommu_init+0x10/0x10 > pci_iommu_init+0x12/0x40 > do_one_initcall+0x46/0x300 > kernel_init_freeable+0x23d/0x2a0 > ? __pfx_kernel_init+0x10/0x10 > kernel_init+0x1a/0x140 > ret_from_fork+0x34/0x50 > ? __pfx_kernel_init+0x10/0x10 > ret_from_fork_asm+0x1a/0x30 > </TASK> > ---[ end trace 0000000000000000 ]--- I saw the same issue on Intel platforms. Thanks, baolu
On 2025-03-11 6:42 pm, Joerg Roedel wrote: > Hi Robin, > > On Fri, Feb 28, 2025 at 03:46:33PM +0000, Robin Murphy wrote: >> + /* >> + * And if we do now see any replay calls, they would indicate someone >> + * misusing the dma_configure path outside bus code. >> + */ >> + if (dev->driver) >> + dev_WARN(dev, "late IOMMU probe at driver bind, something fishy here!\n"); > > This warning triggers on my workstation (with an AMD IOMMU), any ideas? Argh! When I moved the dma_configure call into iommu_init_device() for v2 I moved the warning with it, but of course that needs to stay where it was, *after* the point that ops->probe_device has had a chance to filter out irrelevant devices. Does this make it behave? Thanks, Robin. ----->8----- diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index 09798ddbce9d..1da6c55a0d02 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -437,12 +437,6 @@ static int iommu_init_device(struct device *dev) ret = -ENODEV; goto err_free; } - /* - * And if we do now see any replay calls, they would indicate someone - * misusing the dma_configure path outside bus code. - */ - if (dev->driver) - dev_WARN(dev, "late IOMMU probe at driver bind, something fishy here!\n"); if (!try_module_get(ops->owner)) { ret = -EINVAL; @@ -565,6 +559,12 @@ static int __iommu_probe_device(struct device *dev, struct list_head *group_list ret = iommu_init_device(dev); if (ret) return ret; + /* + * And if we do now see any replay calls, they would indicate someone + * misusing the dma_configure path outside bus code. + */ + if (dev->driver) + dev_WARN(dev, "late IOMMU probe at driver bind, something fishy here!\n"); group = dev->iommu_group; gdev = iommu_group_alloc_device(group, dev);
On 2025/3/12 18:10, Robin Murphy wrote: > On 2025-03-11 6:42 pm, Joerg Roedel wrote: >> Hi Robin, >> >> On Fri, Feb 28, 2025 at 03:46:33PM +0000, Robin Murphy wrote: >>> + /* >>> + * And if we do now see any replay calls, they would indicate >>> someone >>> + * misusing the dma_configure path outside bus code. >>> + */ >>> + if (dev->driver) >>> + dev_WARN(dev, "late IOMMU probe at driver bind, something >>> fishy here!\n"); >> >> This warning triggers on my workstation (with an AMD IOMMU), any ideas? > > Argh! When I moved the dma_configure call into iommu_init_device() for > v2 I moved the warning with it, but of course that needs to stay where > it was, *after* the point that ops->probe_device has had a chance to > filter out irrelevant devices. Does this make it behave? Yes. It works on my end. Thanks, baolu
Hi Robin, On Wed, Mar 12, 2025 at 10:10:04AM +0000, Robin Murphy wrote: > Argh! When I moved the dma_configure call into iommu_init_device() for > v2 I moved the warning with it, but of course that needs to stay where > it was, *after* the point that ops->probe_device has had a chance to > filter out irrelevant devices. Does this make it behave? Okay, thanks for the patch. I am currently building a kernel to test it and will report back. Regards, Joerg
diff --git a/drivers/acpi/arm64/dma.c b/drivers/acpi/arm64/dma.c index 52b2abf88689..f30f138352b7 100644 --- a/drivers/acpi/arm64/dma.c +++ b/drivers/acpi/arm64/dma.c @@ -26,6 +26,11 @@ void acpi_arch_dma_setup(struct device *dev) else end = (1ULL << 32) - 1; + if (dev->dma_range_map) { + dev_dbg(dev, "dma_range_map already set\n"); + return; + } + ret = acpi_dma_get_range(dev, &map); if (!ret && map) { end = dma_range_map_max(map); diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c index 9f4efa8f75a6..fb1fe9f3b1a3 100644 --- a/drivers/acpi/scan.c +++ b/drivers/acpi/scan.c @@ -1632,13 +1632,6 @@ static int acpi_iommu_configure_id(struct device *dev, const u32 *id_in) err = viot_iommu_configure(dev); mutex_unlock(&iommu_probe_device_lock); - /* - * If we have reason to believe the IOMMU driver missed the initial - * iommu_probe_device() call for dev, replay it to get things in order. - */ - if (!err && dev->bus) - err = iommu_probe_device(dev); - return err; } diff --git a/drivers/amba/bus.c b/drivers/amba/bus.c index 8ef259b4d037..71482d639a6d 100644 --- a/drivers/amba/bus.c +++ b/drivers/amba/bus.c @@ -364,7 +364,8 @@ static int amba_dma_configure(struct device *dev) ret = acpi_dma_configure(dev, attr); } - if (!ret && !drv->driver_managed_dma) { + /* @drv may not be valid when we're called from the IOMMU layer */ + if (!ret && dev->driver && !drv->driver_managed_dma) { ret = iommu_device_use_default_domain(dev); if (ret) arch_teardown_dma_ops(dev); diff --git a/drivers/base/platform.c b/drivers/base/platform.c index 6f2a33722c52..1813cfd0c4bd 100644 --- a/drivers/base/platform.c +++ b/drivers/base/platform.c @@ -1451,7 +1451,8 @@ static int platform_dma_configure(struct device *dev) attr = acpi_get_dma_attr(to_acpi_device_node(fwnode)); ret = acpi_dma_configure(dev, attr); } - if (ret || drv->driver_managed_dma) + /* @drv may not be valid when we're called from the IOMMU layer */ + if (ret || !dev->driver || drv->driver_managed_dma) return ret; ret = iommu_device_use_default_domain(dev); diff --git a/drivers/bus/fsl-mc/fsl-mc-bus.c b/drivers/bus/fsl-mc/fsl-mc-bus.c index d1f3d327ddd1..a8be8cf246fb 100644 --- a/drivers/bus/fsl-mc/fsl-mc-bus.c +++ b/drivers/bus/fsl-mc/fsl-mc-bus.c @@ -153,7 +153,8 @@ static int fsl_mc_dma_configure(struct device *dev) else ret = acpi_dma_configure_id(dev, DEV_DMA_COHERENT, &input_id); - if (!ret && !mc_drv->driver_managed_dma) { + /* @mc_drv may not be valid when we're called from the IOMMU layer */ + if (!ret && dev->driver && !mc_drv->driver_managed_dma) { ret = iommu_device_use_default_domain(dev); if (ret) arch_teardown_dma_ops(dev); diff --git a/drivers/cdx/cdx.c b/drivers/cdx/cdx.c index c573ed2ee71a..780fb0c4adba 100644 --- a/drivers/cdx/cdx.c +++ b/drivers/cdx/cdx.c @@ -360,7 +360,8 @@ static int cdx_dma_configure(struct device *dev) return ret; } - if (!ret && !cdx_drv->driver_managed_dma) { + /* @cdx_drv may not be valid when we're called from the IOMMU layer */ + if (!ret && dev->driver && !cdx_drv->driver_managed_dma) { ret = iommu_device_use_default_domain(dev); if (ret) arch_teardown_dma_ops(dev); diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index a3b45b84f42b..1cec7074367a 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -414,9 +414,21 @@ static int iommu_init_device(struct device *dev) if (!dev_iommu_get(dev)) return -ENOMEM; /* - * For FDT-based systems and ACPI IORT/VIOT, drivers register IOMMU - * instances with non-NULL fwnodes, and client devices should have been - * identified with a fwspec by this point. Otherwise, we can currently + * For FDT-based systems and ACPI IORT/VIOT, the common firmware parsing + * is buried in the bus dma_configure path. Properly unpicking that is + * still a big job, so for now just invoke the whole thing. The device + * already having a driver bound means dma_configure has already run and + * either found no IOMMU to wait for, or we're in its replay call right + * now, so either way there's no point calling it again. + */ + if (!dev->driver && dev->bus->dma_configure) { + mutex_unlock(&iommu_probe_device_lock); + dev->bus->dma_configure(dev); + mutex_lock(&iommu_probe_device_lock); + } + /* + * At this point, relevant devices either now have a fwspec which will + * match ops registered with a non-NULL fwnode, or we can reasonably * assume that only one of Intel, AMD, s390, PAMU or legacy SMMUv2 can * be present, and that any of their registered instances has suitable * ops for probing, and thus cheekily co-opt the same mechanism. @@ -426,6 +438,12 @@ static int iommu_init_device(struct device *dev) ret = -ENODEV; goto err_free; } + /* + * And if we do now see any replay calls, they would indicate someone + * misusing the dma_configure path outside bus code. + */ + if (dev->driver) + dev_WARN(dev, "late IOMMU probe at driver bind, something fishy here!\n"); if (!try_module_get(ops->owner)) { ret = -EINVAL; diff --git a/drivers/iommu/of_iommu.c b/drivers/iommu/of_iommu.c index e10a68b5ffde..6b989a62def2 100644 --- a/drivers/iommu/of_iommu.c +++ b/drivers/iommu/of_iommu.c @@ -155,7 +155,12 @@ int of_iommu_configure(struct device *dev, struct device_node *master_np, dev_iommu_free(dev); mutex_unlock(&iommu_probe_device_lock); - if (!err && dev->bus) + /* + * If we're not on the iommu_probe_device() path (as indicated by the + * initial dev->iommu) then try to simulate it. This should no longer + * happen unless of_dma_configure() is being misused outside bus code. + */ + if (!err && dev->bus && !dev_iommu_present) err = iommu_probe_device(dev); if (err && err != -EPROBE_DEFER) diff --git a/drivers/of/device.c b/drivers/of/device.c index edf3be197265..5053e5d532cc 100644 --- a/drivers/of/device.c +++ b/drivers/of/device.c @@ -99,6 +99,11 @@ int of_dma_configure_id(struct device *dev, struct device_node *np, bool coherent, set_map = false; int ret; + if (dev->dma_range_map) { + dev_dbg(dev, "dma_range_map already set\n"); + goto skip_map; + } + if (np == dev->of_node) bus_np = __of_get_dma_parent(np); else @@ -119,7 +124,7 @@ int of_dma_configure_id(struct device *dev, struct device_node *np, end = dma_range_map_max(map); set_map = true; } - +skip_map: /* * If @dev is expected to be DMA-capable then the bus code that created * it should have initialised its dma_mask pointer by this point. For diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c index f57ea36d125d..4468b7327cab 100644 --- a/drivers/pci/pci-driver.c +++ b/drivers/pci/pci-driver.c @@ -1653,7 +1653,8 @@ static int pci_dma_configure(struct device *dev) pci_put_host_bridge_device(bridge); - if (!ret && !driver->driver_managed_dma) { + /* @driver may not be valid when we're called from the IOMMU layer */ + if (!ret && dev->driver && !driver->driver_managed_dma) { ret = iommu_device_use_default_domain(dev); if (ret) arch_teardown_dma_ops(dev);