Message ID | e3b191e6fd6ca9a1e84c5e5e40044faf97abb874.1740753261.git.robin.murphy@arm.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | iommu: Fix the longstanding probe issues | expand |
On Fri, Feb 28, 2025 at 03:46:33PM +0000, Robin Murphy wrote: > + if (!dev->driver && dev->bus->dma_configure) { > + mutex_unlock(&iommu_probe_device_lock); > + dev->bus->dma_configure(dev); > + mutex_lock(&iommu_probe_device_lock); > + } I think it would be very nice to get rid of the lock/unlock.. It makes me nervous that we continue on assuming dev->iommu was freshly allocated.. setup the dev->iommu partially, then drop the lock. There is only one other caller in: static int really_probe(struct device *dev, const struct device_driver *drv) { if (dev->bus->dma_configure) { ret = dev->bus->dma_configure(dev); if (ret) goto pinctrl_bind_failed; } Is it feasible to make it so the caller has to hold the iommu_probe_device_lock prior to calling the op? That would require moving the locking inside of_dma_configure to less inside, and using a new iommu_probe_device() wrapper. However, if you plan to turn this inside out soonish then it would not be worth the bother. Anyhow: Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Jason
On Fri, Feb 28, 2025 at 03:46:33PM +0000, Robin Murphy wrote: > In hindsight, there were some crucial subtleties overlooked when moving > {of,acpi}_dma_configure() to driver probe time to allow waiting for > IOMMU drivers with -EPROBE_DEFER, and these have become an > ever-increasing source of problems. The IOMMU API has some fundamental > assumptions that iommu_probe_device() is called for every device added > to the system, in the order in which they are added. Calling it in a > random order or not at all dependent on driver binding leads to > malformed groups, a potential lack of isolation for devices with no > driver, and all manner of unexpected concurrency and race conditions. > We've attempted to mitigate the latter with point-fix bodges like > iommu_probe_device_lock, but it's a losing battle and the time has come > to bite the bullet and address the true source of the problem instead. > > The crux of the matter is that the firmware parsing actually serves two > distinct purposes; one is identifying the IOMMU instance associated with > a device so we can check its availability, the second is actually > telling that instance about the relevant firmware-provided data for the > device. However the latter also depends on the former, and at the time > there was no good place to defer and retry that separately from the > availability check we also wanted for client driver probe. > > Nowadays, though, we have a proper notion of multiple IOMMU instances in > the core API itself, and each one gets a chance to probe its own devices > upon registration, so we can finally make that work as intended for > DT/IORT/VIOT platforms too. All we need is for iommu_probe_device() to > be able to run the iommu_fwspec machinery currently buried deep in the > wrong end of {of,acpi}_dma_configure(). Luckily it turns out to be > surprisingly straightforward to bootstrap this transformation by pretty > much just calling the same path twice. At client driver probe time, > dev->driver is obviously set; conversely at device_add(), or a > subsequent bus_iommu_probe(), any device waiting for an IOMMU really > should *not* have a driver already, so we can use that as a condition to > disambiguate the two cases, and avoid recursing back into the IOMMU core > at the wrong times. > > Obviously this isn't the nicest thing, but for now it gives us a > functional baseline to then unpick the layers in between without many > more awkward cross-subsystem patches. There are some minor side-effects > like dma_range_map potentially being created earlier, and some debug > prints being repeated, but these aren't significantly detrimental. Let's > make things work first, then deal with making them nice. > > With the basic flow finally in the right order again, the next step is > probably turning the bus->dma_configure paths inside-out, since all we > really need from bus code is its notion of which device and input ID(s) > to parse the common firmware properties with... > > Acked-by: Bjorn Helgaas <bhelgaas@google.com> # pci-driver.c > Acked-by: Rob Herring (Arm) <robh@kernel.org> # of/device.c > Signed-off-by: Robin Murphy <robin.murphy@arm.com> > --- > > v2: > - Comment bus driver changes for clarity > - Use dev->iommu as the now-robust replay condition > - Drop the device_iommu_mapped() checks in the firmware paths as they > weren't doing much - we can't replace probe_device_lock just yet... > > drivers/acpi/arm64/dma.c | 5 +++++ > drivers/acpi/scan.c | 7 ------- > drivers/amba/bus.c | 3 ++- > drivers/base/platform.c | 3 ++- > drivers/bus/fsl-mc/fsl-mc-bus.c | 3 ++- > drivers/cdx/cdx.c | 3 ++- > drivers/iommu/iommu.c | 24 +++++++++++++++++++++--- > drivers/iommu/of_iommu.c | 7 ++++++- > drivers/of/device.c | 7 ++++++- > drivers/pci/pci-driver.c | 3 ++- > 10 files changed, 48 insertions(+), 17 deletions(-) > > diff --git a/drivers/acpi/arm64/dma.c b/drivers/acpi/arm64/dma.c > index 52b2abf88689..f30f138352b7 100644 > --- a/drivers/acpi/arm64/dma.c > +++ b/drivers/acpi/arm64/dma.c > @@ -26,6 +26,11 @@ void acpi_arch_dma_setup(struct device *dev) > else > end = (1ULL << 32) - 1; > > + if (dev->dma_range_map) { > + dev_dbg(dev, "dma_range_map already set\n"); > + return; > + } > + Why are we checking that dev->dma_range_map exists here rather than at function entry ? Is that because we want to run the previous code for some reasons even if dev->dma_range_map is already set ? Just noticed, the OF counterpart does not seem to take the same approach, so I am just flagging this up if it matters at all. Other than that, for acpi/arm64: Reviewed-by: Lorenzo Pieralisi <lpieralisi@kernel.org> > ret = acpi_dma_get_range(dev, &map); > if (!ret && map) { > end = dma_range_map_max(map); > diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c > index 9f4efa8f75a6..fb1fe9f3b1a3 100644 > --- a/drivers/acpi/scan.c > +++ b/drivers/acpi/scan.c > @@ -1632,13 +1632,6 @@ static int acpi_iommu_configure_id(struct device *dev, const u32 *id_in) > err = viot_iommu_configure(dev); > mutex_unlock(&iommu_probe_device_lock); > > - /* > - * If we have reason to believe the IOMMU driver missed the initial > - * iommu_probe_device() call for dev, replay it to get things in order. > - */ > - if (!err && dev->bus) > - err = iommu_probe_device(dev); > - > return err; > } > > diff --git a/drivers/amba/bus.c b/drivers/amba/bus.c > index 8ef259b4d037..71482d639a6d 100644 > --- a/drivers/amba/bus.c > +++ b/drivers/amba/bus.c > @@ -364,7 +364,8 @@ static int amba_dma_configure(struct device *dev) > ret = acpi_dma_configure(dev, attr); > } > > - if (!ret && !drv->driver_managed_dma) { > + /* @drv may not be valid when we're called from the IOMMU layer */ > + if (!ret && dev->driver && !drv->driver_managed_dma) { > ret = iommu_device_use_default_domain(dev); > if (ret) > arch_teardown_dma_ops(dev); > diff --git a/drivers/base/platform.c b/drivers/base/platform.c > index 6f2a33722c52..1813cfd0c4bd 100644 > --- a/drivers/base/platform.c > +++ b/drivers/base/platform.c > @@ -1451,7 +1451,8 @@ static int platform_dma_configure(struct device *dev) > attr = acpi_get_dma_attr(to_acpi_device_node(fwnode)); > ret = acpi_dma_configure(dev, attr); > } > - if (ret || drv->driver_managed_dma) > + /* @drv may not be valid when we're called from the IOMMU layer */ > + if (ret || !dev->driver || drv->driver_managed_dma) > return ret; > > ret = iommu_device_use_default_domain(dev); > diff --git a/drivers/bus/fsl-mc/fsl-mc-bus.c b/drivers/bus/fsl-mc/fsl-mc-bus.c > index d1f3d327ddd1..a8be8cf246fb 100644 > --- a/drivers/bus/fsl-mc/fsl-mc-bus.c > +++ b/drivers/bus/fsl-mc/fsl-mc-bus.c > @@ -153,7 +153,8 @@ static int fsl_mc_dma_configure(struct device *dev) > else > ret = acpi_dma_configure_id(dev, DEV_DMA_COHERENT, &input_id); > > - if (!ret && !mc_drv->driver_managed_dma) { > + /* @mc_drv may not be valid when we're called from the IOMMU layer */ > + if (!ret && dev->driver && !mc_drv->driver_managed_dma) { > ret = iommu_device_use_default_domain(dev); > if (ret) > arch_teardown_dma_ops(dev); > diff --git a/drivers/cdx/cdx.c b/drivers/cdx/cdx.c > index c573ed2ee71a..780fb0c4adba 100644 > --- a/drivers/cdx/cdx.c > +++ b/drivers/cdx/cdx.c > @@ -360,7 +360,8 @@ static int cdx_dma_configure(struct device *dev) > return ret; > } > > - if (!ret && !cdx_drv->driver_managed_dma) { > + /* @cdx_drv may not be valid when we're called from the IOMMU layer */ > + if (!ret && dev->driver && !cdx_drv->driver_managed_dma) { > ret = iommu_device_use_default_domain(dev); > if (ret) > arch_teardown_dma_ops(dev); > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c > index a3b45b84f42b..1cec7074367a 100644 > --- a/drivers/iommu/iommu.c > +++ b/drivers/iommu/iommu.c > @@ -414,9 +414,21 @@ static int iommu_init_device(struct device *dev) > if (!dev_iommu_get(dev)) > return -ENOMEM; > /* > - * For FDT-based systems and ACPI IORT/VIOT, drivers register IOMMU > - * instances with non-NULL fwnodes, and client devices should have been > - * identified with a fwspec by this point. Otherwise, we can currently > + * For FDT-based systems and ACPI IORT/VIOT, the common firmware parsing > + * is buried in the bus dma_configure path. Properly unpicking that is > + * still a big job, so for now just invoke the whole thing. The device > + * already having a driver bound means dma_configure has already run and > + * either found no IOMMU to wait for, or we're in its replay call right > + * now, so either way there's no point calling it again. > + */ > + if (!dev->driver && dev->bus->dma_configure) { > + mutex_unlock(&iommu_probe_device_lock); > + dev->bus->dma_configure(dev); > + mutex_lock(&iommu_probe_device_lock); > + } > + /* > + * At this point, relevant devices either now have a fwspec which will > + * match ops registered with a non-NULL fwnode, or we can reasonably > * assume that only one of Intel, AMD, s390, PAMU or legacy SMMUv2 can > * be present, and that any of their registered instances has suitable > * ops for probing, and thus cheekily co-opt the same mechanism. > @@ -426,6 +438,12 @@ static int iommu_init_device(struct device *dev) > ret = -ENODEV; > goto err_free; > } > + /* > + * And if we do now see any replay calls, they would indicate someone > + * misusing the dma_configure path outside bus code. > + */ > + if (dev->driver) > + dev_WARN(dev, "late IOMMU probe at driver bind, something fishy here!\n"); > > if (!try_module_get(ops->owner)) { > ret = -EINVAL; > diff --git a/drivers/iommu/of_iommu.c b/drivers/iommu/of_iommu.c > index e10a68b5ffde..6b989a62def2 100644 > --- a/drivers/iommu/of_iommu.c > +++ b/drivers/iommu/of_iommu.c > @@ -155,7 +155,12 @@ int of_iommu_configure(struct device *dev, struct device_node *master_np, > dev_iommu_free(dev); > mutex_unlock(&iommu_probe_device_lock); > > - if (!err && dev->bus) > + /* > + * If we're not on the iommu_probe_device() path (as indicated by the > + * initial dev->iommu) then try to simulate it. This should no longer > + * happen unless of_dma_configure() is being misused outside bus code. > + */ > + if (!err && dev->bus && !dev_iommu_present) > err = iommu_probe_device(dev); > > if (err && err != -EPROBE_DEFER) > diff --git a/drivers/of/device.c b/drivers/of/device.c > index edf3be197265..5053e5d532cc 100644 > --- a/drivers/of/device.c > +++ b/drivers/of/device.c > @@ -99,6 +99,11 @@ int of_dma_configure_id(struct device *dev, struct device_node *np, > bool coherent, set_map = false; > int ret; > > + if (dev->dma_range_map) { > + dev_dbg(dev, "dma_range_map already set\n"); > + goto skip_map; > + } > + > if (np == dev->of_node) > bus_np = __of_get_dma_parent(np); > else > @@ -119,7 +124,7 @@ int of_dma_configure_id(struct device *dev, struct device_node *np, > end = dma_range_map_max(map); > set_map = true; > } > - > +skip_map: > /* > * If @dev is expected to be DMA-capable then the bus code that created > * it should have initialised its dma_mask pointer by this point. For > diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c > index f57ea36d125d..4468b7327cab 100644 > --- a/drivers/pci/pci-driver.c > +++ b/drivers/pci/pci-driver.c > @@ -1653,7 +1653,8 @@ static int pci_dma_configure(struct device *dev) > > pci_put_host_bridge_device(bridge); > > - if (!ret && !driver->driver_managed_dma) { > + /* @driver may not be valid when we're called from the IOMMU layer */ > + if (!ret && dev->driver && !driver->driver_managed_dma) { > ret = iommu_device_use_default_domain(dev); > if (ret) > arch_teardown_dma_ops(dev); > -- > 2.39.2.101.g768bb238c484.dirty >
On 2025-03-07 2:24 pm, Lorenzo Pieralisi wrote: [...] >> diff --git a/drivers/acpi/arm64/dma.c b/drivers/acpi/arm64/dma.c >> index 52b2abf88689..f30f138352b7 100644 >> --- a/drivers/acpi/arm64/dma.c >> +++ b/drivers/acpi/arm64/dma.c >> @@ -26,6 +26,11 @@ void acpi_arch_dma_setup(struct device *dev) >> else >> end = (1ULL << 32) - 1; >> >> + if (dev->dma_range_map) { >> + dev_dbg(dev, "dma_range_map already set\n"); >> + return; >> + } >> + > > Why are we checking that dev->dma_range_map exists here rather than > at function entry ? Is that because we want to run the previous code > for some reasons even if dev->dma_range_map is already set ? > > Just noticed, the OF counterpart does not seem to take the same > approach, so I am just flagging this up if it matters at all. The intent is just to ensure it's safe to come through this path more than once for the time being - it doesn't really matter whether or not we repeat anything apart from allocating and assigning dma_range_map, since that leaks the previous allocation. Thus this check is logically guarding the acpi_dma_get_range() call in this path, and the of_dma_get_range() call in the other. > Other than that, for acpi/arm64: > > Reviewed-by: Lorenzo Pieralisi <lpieralisi@kernel.org> Thanks! Robin.
Hi Robin, On Fri, Feb 28, 2025 at 03:46:33PM +0000, Robin Murphy wrote: > + /* > + * And if we do now see any replay calls, they would indicate someone > + * misusing the dma_configure path outside bus code. > + */ > + if (dev->driver) > + dev_WARN(dev, "late IOMMU probe at driver bind, something fishy here!\n"); This warning triggers on my workstation (with an AMD IOMMU), any ideas? ------------[ cut here ]------------ reg-dummy reg-dummy: late IOMMU probe at driver bind, something fishy here! WARNING: CPU: 0 PID: 1 at drivers/iommu/iommu.c:449 __iommu_probe_device+0x10b/0x510 Modules linked in: CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.14.0-rc6-iommu-next+ #1 1d691d7aebf343bde741cf4c8610d78a2ea2d2d9 Hardware name: System manufacturer System Product Name/PRIME X470-PRO, BIOS 5406 11/13/2019 RIP: 0010:__iommu_probe_device+0x10b/0x510 Code: 68 00 74 28 48 8b 6b 50 48 85 ed 75 03 48 8b 2b 48 89 df e8 87 71 06 00 48 89 ea 48 c7 c7 90 dd 2c 8b 48 89 c6 e8 35 83 77 ff <0f> 0b 49 8b bd a8 00 00 00 e8 77 ab 85 ff 84 c0 0f 84 ad 01 00 00 RSP: 0018:ffffafba00047c58 EFLAGS: 00010282 RAX: 0000000000000000 RBX: ffffa00481301c10 RCX: 0000000000000003 RDX: 0000000000000000 RSI: 0000000000000003 RDI: 0000000000000001 RBP: ffffa00480ffaee0 R08: 0000000000000000 R09: ffffafba00047ae8 R10: ffffa0135e93ffa8 R11: 0000000000000003 R12: ffffafba00047d18 R13: ffffffff8adac1a0 R14: 0000000000000000 R15: ffffa004802c5800 FS: 0000000000000000(0000) GS:ffffa0135ea00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffa00baca01000 CR3: 000000082b838000 CR4: 00000000003506f0 Call Trace: <TASK> ? __iommu_probe_device+0x10b/0x510 ? __warn.cold+0x93/0xf7 ? __iommu_probe_device+0x10b/0x510 ? report_bug+0xff/0x140 ? prb_read_valid+0x1b/0x30 ? handle_bug+0x58/0x90 ? exc_invalid_op+0x17/0x70 ? asm_exc_invalid_op+0x1a/0x20 ? __iommu_probe_device+0x10b/0x510 ? __iommu_probe_device+0x10b/0x510 ? __pfx_probe_iommu_group+0x10/0x10 probe_iommu_group+0x28/0x50 bus_for_each_dev+0x7e/0xd0 iommu_device_register+0xee/0x260 iommu_go_to_state+0xa2a/0x1970 ? srso_return_thunk+0x5/0x5f ? blake2s_update+0x68/0x160 ? __pfx_pci_iommu_init+0x10/0x10 amd_iommu_init+0x14/0x50 ? __pfx_pci_iommu_init+0x10/0x10 pci_iommu_init+0x12/0x40 do_one_initcall+0x46/0x300 kernel_init_freeable+0x23d/0x2a0 ? __pfx_kernel_init+0x10/0x10 kernel_init+0x1a/0x140 ret_from_fork+0x34/0x50 ? __pfx_kernel_init+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> ---[ end trace 0000000000000000 ]---
On 2025/3/12 2:42, Joerg Roedel wrote: > On Fri, Feb 28, 2025 at 03:46:33PM +0000, Robin Murphy wrote: >> + /* >> + * And if we do now see any replay calls, they would indicate someone >> + * misusing the dma_configure path outside bus code. >> + */ >> + if (dev->driver) >> + dev_WARN(dev, "late IOMMU probe at driver bind, something fishy here!\n"); > This warning triggers on my workstation (with an AMD IOMMU), any ideas? > > ------------[ cut here ]------------ > reg-dummy reg-dummy: late IOMMU probe at driver bind, something fishy here! > WARNING: CPU: 0 PID: 1 at drivers/iommu/iommu.c:449 __iommu_probe_device+0x10b/0x510 > Modules linked in: > > CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.14.0-rc6-iommu-next+ #1 1d691d7aebf343bde741cf4c8610d78a2ea2d2d9 > Hardware name: System manufacturer System Product Name/PRIME X470-PRO, BIOS 5406 11/13/2019 > RIP: 0010:__iommu_probe_device+0x10b/0x510 > Code: 68 00 74 28 48 8b 6b 50 48 85 ed 75 03 48 8b 2b 48 89 df e8 87 71 06 00 48 89 ea 48 c7 c7 90 dd 2c 8b 48 89 c6 e8 35 83 77 ff <0f> 0b 49 8b bd a8 00 00 00 e8 77 ab 85 ff 84 c0 0f 84 ad 01 00 00 > RSP: 0018:ffffafba00047c58 EFLAGS: 00010282 > RAX: 0000000000000000 RBX: ffffa00481301c10 RCX: 0000000000000003 > RDX: 0000000000000000 RSI: 0000000000000003 RDI: 0000000000000001 > RBP: ffffa00480ffaee0 R08: 0000000000000000 R09: ffffafba00047ae8 > R10: ffffa0135e93ffa8 R11: 0000000000000003 R12: ffffafba00047d18 > R13: ffffffff8adac1a0 R14: 0000000000000000 R15: ffffa004802c5800 > FS: 0000000000000000(0000) GS:ffffa0135ea00000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: ffffa00baca01000 CR3: 000000082b838000 CR4: 00000000003506f0 > Call Trace: > <TASK> > ? __iommu_probe_device+0x10b/0x510 > ? __warn.cold+0x93/0xf7 > ? __iommu_probe_device+0x10b/0x510 > ? report_bug+0xff/0x140 > ? prb_read_valid+0x1b/0x30 > ? handle_bug+0x58/0x90 > ? exc_invalid_op+0x17/0x70 > ? asm_exc_invalid_op+0x1a/0x20 > ? __iommu_probe_device+0x10b/0x510 > ? __iommu_probe_device+0x10b/0x510 > ? __pfx_probe_iommu_group+0x10/0x10 > probe_iommu_group+0x28/0x50 > bus_for_each_dev+0x7e/0xd0 > iommu_device_register+0xee/0x260 > iommu_go_to_state+0xa2a/0x1970 > ? srso_return_thunk+0x5/0x5f > ? blake2s_update+0x68/0x160 > ? __pfx_pci_iommu_init+0x10/0x10 > amd_iommu_init+0x14/0x50 > ? __pfx_pci_iommu_init+0x10/0x10 > pci_iommu_init+0x12/0x40 > do_one_initcall+0x46/0x300 > kernel_init_freeable+0x23d/0x2a0 > ? __pfx_kernel_init+0x10/0x10 > kernel_init+0x1a/0x140 > ret_from_fork+0x34/0x50 > ? __pfx_kernel_init+0x10/0x10 > ret_from_fork_asm+0x1a/0x30 > </TASK> > ---[ end trace 0000000000000000 ]--- I saw the same issue on Intel platforms. Thanks, baolu
On 2025-03-11 6:42 pm, Joerg Roedel wrote: > Hi Robin, > > On Fri, Feb 28, 2025 at 03:46:33PM +0000, Robin Murphy wrote: >> + /* >> + * And if we do now see any replay calls, they would indicate someone >> + * misusing the dma_configure path outside bus code. >> + */ >> + if (dev->driver) >> + dev_WARN(dev, "late IOMMU probe at driver bind, something fishy here!\n"); > > This warning triggers on my workstation (with an AMD IOMMU), any ideas? Argh! When I moved the dma_configure call into iommu_init_device() for v2 I moved the warning with it, but of course that needs to stay where it was, *after* the point that ops->probe_device has had a chance to filter out irrelevant devices. Does this make it behave? Thanks, Robin. ----->8----- diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index 09798ddbce9d..1da6c55a0d02 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -437,12 +437,6 @@ static int iommu_init_device(struct device *dev) ret = -ENODEV; goto err_free; } - /* - * And if we do now see any replay calls, they would indicate someone - * misusing the dma_configure path outside bus code. - */ - if (dev->driver) - dev_WARN(dev, "late IOMMU probe at driver bind, something fishy here!\n"); if (!try_module_get(ops->owner)) { ret = -EINVAL; @@ -565,6 +559,12 @@ static int __iommu_probe_device(struct device *dev, struct list_head *group_list ret = iommu_init_device(dev); if (ret) return ret; + /* + * And if we do now see any replay calls, they would indicate someone + * misusing the dma_configure path outside bus code. + */ + if (dev->driver) + dev_WARN(dev, "late IOMMU probe at driver bind, something fishy here!\n"); group = dev->iommu_group; gdev = iommu_group_alloc_device(group, dev);
On 2025/3/12 18:10, Robin Murphy wrote: > On 2025-03-11 6:42 pm, Joerg Roedel wrote: >> Hi Robin, >> >> On Fri, Feb 28, 2025 at 03:46:33PM +0000, Robin Murphy wrote: >>> + /* >>> + * And if we do now see any replay calls, they would indicate >>> someone >>> + * misusing the dma_configure path outside bus code. >>> + */ >>> + if (dev->driver) >>> + dev_WARN(dev, "late IOMMU probe at driver bind, something >>> fishy here!\n"); >> >> This warning triggers on my workstation (with an AMD IOMMU), any ideas? > > Argh! When I moved the dma_configure call into iommu_init_device() for > v2 I moved the warning with it, but of course that needs to stay where > it was, *after* the point that ops->probe_device has had a chance to > filter out irrelevant devices. Does this make it behave? Yes. It works on my end. Thanks, baolu
Hi Robin, On Wed, Mar 12, 2025 at 10:10:04AM +0000, Robin Murphy wrote: > Argh! When I moved the dma_configure call into iommu_init_device() for > v2 I moved the warning with it, but of course that needs to stay where > it was, *after* the point that ops->probe_device has had a chance to > filter out irrelevant devices. Does this make it behave? Okay, thanks for the patch. I am currently building a kernel to test it and will report back. Regards, Joerg
Hi Robin, On 28.02.2025 16:46, Robin Murphy wrote: > In hindsight, there were some crucial subtleties overlooked when moving > {of,acpi}_dma_configure() to driver probe time to allow waiting for > IOMMU drivers with -EPROBE_DEFER, and these have become an > ever-increasing source of problems. The IOMMU API has some fundamental > assumptions that iommu_probe_device() is called for every device added > to the system, in the order in which they are added. Calling it in a > random order or not at all dependent on driver binding leads to > malformed groups, a potential lack of isolation for devices with no > driver, and all manner of unexpected concurrency and race conditions. > We've attempted to mitigate the latter with point-fix bodges like > iommu_probe_device_lock, but it's a losing battle and the time has come > to bite the bullet and address the true source of the problem instead. > > The crux of the matter is that the firmware parsing actually serves two > distinct purposes; one is identifying the IOMMU instance associated with > a device so we can check its availability, the second is actually > telling that instance about the relevant firmware-provided data for the > device. However the latter also depends on the former, and at the time > there was no good place to defer and retry that separately from the > availability check we also wanted for client driver probe. > > Nowadays, though, we have a proper notion of multiple IOMMU instances in > the core API itself, and each one gets a chance to probe its own devices > upon registration, so we can finally make that work as intended for > DT/IORT/VIOT platforms too. All we need is for iommu_probe_device() to > be able to run the iommu_fwspec machinery currently buried deep in the > wrong end of {of,acpi}_dma_configure(). Luckily it turns out to be > surprisingly straightforward to bootstrap this transformation by pretty > much just calling the same path twice. At client driver probe time, > dev->driver is obviously set; conversely at device_add(), or a > subsequent bus_iommu_probe(), any device waiting for an IOMMU really > should *not* have a driver already, so we can use that as a condition to > disambiguate the two cases, and avoid recursing back into the IOMMU core > at the wrong times. > > Obviously this isn't the nicest thing, but for now it gives us a > functional baseline to then unpick the layers in between without many > more awkward cross-subsystem patches. There are some minor side-effects > like dma_range_map potentially being created earlier, and some debug > prints being repeated, but these aren't significantly detrimental. Let's > make things work first, then deal with making them nice. > > With the basic flow finally in the right order again, the next step is > probably turning the bus->dma_configure paths inside-out, since all we > really need from bus code is its notion of which device and input ID(s) > to parse the common firmware properties with... > > Acked-by: Bjorn Helgaas <bhelgaas@google.com> # pci-driver.c > Acked-by: Rob Herring (Arm) <robh@kernel.org> # of/device.c > Signed-off-by: Robin Murphy <robin.murphy@arm.com> This patch landed in yesterday's linux-next as commit bcb81ac6ae3c ("iommu: Get DT/ACPI parsing into the proper probe path"). In my tests I found it breaks booting of ARM64 RK3568-based Odroid-M1 board (arch/arm64/boot/dts/rockchip/rk3568-odroid-m1.dts). Here is the relevant kernel log: Unable to handle kernel NULL pointer dereference at virtual address 00000000000003e8 Mem abort info: ESR = 0x0000000096000004 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 FSC = 0x04: level 0 translation fault Data abort info: ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000 CM = 0, WnR = 0, TnD = 0, TagAccess = 0 GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 [00000000000003e8] user address but active_mm is swapper Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP Modules linked in: CPU: 3 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.14.0-rc3+ #15533 Hardware name: Hardkernel ODROID-M1 (DT) pstate: 00400009 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : devm_kmalloc+0x2c/0x114 lr : rk_iommu_of_xlate+0x30/0x90 ... Call trace: devm_kmalloc+0x2c/0x114 (P) rk_iommu_of_xlate+0x30/0x90 of_iommu_xlate+0x9c/0xc8 of_iommu_configure+0x1c4/0x23c of_dma_configure_id+0x244/0x3b8 platform_dma_configure+0x74/0xac __iommu_probe_device+0x258/0x49c probe_iommu_group+0x3c/0x64 bus_for_each_dev+0x74/0xd4 iommu_device_register+0xd4/0x230 rk_iommu_probe+0x1f8/0x350 platform_probe+0x68/0xdc really_probe+0xbc/0x298 __driver_probe_device+0x78/0x12c driver_probe_device+0x40/0x164 __driver_attach+0x9c/0x1ac bus_for_each_dev+0x74/0xd4 driver_attach+0x24/0x30 bus_add_driver+0xe4/0x208 driver_register+0x60/0x128 __platform_driver_register+0x24/0x30 rk_iommu_driver_init+0x1c/0x28 do_one_initcall+0x64/0x308 kernel_init_freeable+0x288/0x4f0 kernel_init+0x20/0x1d8 ret_from_fork+0x10/0x20 Code: d2801017 a90153f3 ab0102e0 aa0103f4 (b943eab8) ---[ end trace 0000000000000000 ]--- Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b SMP: stopping secondary CPUs Kernel Offset: disabled CPU features: 0x100,00000040,00901250,8200720b Memory Limit: none ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]--- Let me know if You want me to test any experimental patch or provide more information. > --- > > v2: > - Comment bus driver changes for clarity > - Use dev->iommu as the now-robust replay condition > - Drop the device_iommu_mapped() checks in the firmware paths as they > weren't doing much - we can't replace probe_device_lock just yet... > > drivers/acpi/arm64/dma.c | 5 +++++ > drivers/acpi/scan.c | 7 ------- > drivers/amba/bus.c | 3 ++- > drivers/base/platform.c | 3 ++- > drivers/bus/fsl-mc/fsl-mc-bus.c | 3 ++- > drivers/cdx/cdx.c | 3 ++- > drivers/iommu/iommu.c | 24 +++++++++++++++++++++--- > drivers/iommu/of_iommu.c | 7 ++++++- > drivers/of/device.c | 7 ++++++- > drivers/pci/pci-driver.c | 3 ++- > 10 files changed, 48 insertions(+), 17 deletions(-) > > diff --git a/drivers/acpi/arm64/dma.c b/drivers/acpi/arm64/dma.c > index 52b2abf88689..f30f138352b7 100644 > --- a/drivers/acpi/arm64/dma.c > +++ b/drivers/acpi/arm64/dma.c > @@ -26,6 +26,11 @@ void acpi_arch_dma_setup(struct device *dev) > else > end = (1ULL << 32) - 1; > > + if (dev->dma_range_map) { > + dev_dbg(dev, "dma_range_map already set\n"); > + return; > + } > + > ret = acpi_dma_get_range(dev, &map); > if (!ret && map) { > end = dma_range_map_max(map); > diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c > index 9f4efa8f75a6..fb1fe9f3b1a3 100644 > --- a/drivers/acpi/scan.c > +++ b/drivers/acpi/scan.c > @@ -1632,13 +1632,6 @@ static int acpi_iommu_configure_id(struct device *dev, const u32 *id_in) > err = viot_iommu_configure(dev); > mutex_unlock(&iommu_probe_device_lock); > > - /* > - * If we have reason to believe the IOMMU driver missed the initial > - * iommu_probe_device() call for dev, replay it to get things in order. > - */ > - if (!err && dev->bus) > - err = iommu_probe_device(dev); > - > return err; > } > > diff --git a/drivers/amba/bus.c b/drivers/amba/bus.c > index 8ef259b4d037..71482d639a6d 100644 > --- a/drivers/amba/bus.c > +++ b/drivers/amba/bus.c > @@ -364,7 +364,8 @@ static int amba_dma_configure(struct device *dev) > ret = acpi_dma_configure(dev, attr); > } > > - if (!ret && !drv->driver_managed_dma) { > + /* @drv may not be valid when we're called from the IOMMU layer */ > + if (!ret && dev->driver && !drv->driver_managed_dma) { > ret = iommu_device_use_default_domain(dev); > if (ret) > arch_teardown_dma_ops(dev); > diff --git a/drivers/base/platform.c b/drivers/base/platform.c > index 6f2a33722c52..1813cfd0c4bd 100644 > --- a/drivers/base/platform.c > +++ b/drivers/base/platform.c > @@ -1451,7 +1451,8 @@ static int platform_dma_configure(struct device *dev) > attr = acpi_get_dma_attr(to_acpi_device_node(fwnode)); > ret = acpi_dma_configure(dev, attr); > } > - if (ret || drv->driver_managed_dma) > + /* @drv may not be valid when we're called from the IOMMU layer */ > + if (ret || !dev->driver || drv->driver_managed_dma) > return ret; > > ret = iommu_device_use_default_domain(dev); > diff --git a/drivers/bus/fsl-mc/fsl-mc-bus.c b/drivers/bus/fsl-mc/fsl-mc-bus.c > index d1f3d327ddd1..a8be8cf246fb 100644 > --- a/drivers/bus/fsl-mc/fsl-mc-bus.c > +++ b/drivers/bus/fsl-mc/fsl-mc-bus.c > @@ -153,7 +153,8 @@ static int fsl_mc_dma_configure(struct device *dev) > else > ret = acpi_dma_configure_id(dev, DEV_DMA_COHERENT, &input_id); > > - if (!ret && !mc_drv->driver_managed_dma) { > + /* @mc_drv may not be valid when we're called from the IOMMU layer */ > + if (!ret && dev->driver && !mc_drv->driver_managed_dma) { > ret = iommu_device_use_default_domain(dev); > if (ret) > arch_teardown_dma_ops(dev); > diff --git a/drivers/cdx/cdx.c b/drivers/cdx/cdx.c > index c573ed2ee71a..780fb0c4adba 100644 > --- a/drivers/cdx/cdx.c > +++ b/drivers/cdx/cdx.c > @@ -360,7 +360,8 @@ static int cdx_dma_configure(struct device *dev) > return ret; > } > > - if (!ret && !cdx_drv->driver_managed_dma) { > + /* @cdx_drv may not be valid when we're called from the IOMMU layer */ > + if (!ret && dev->driver && !cdx_drv->driver_managed_dma) { > ret = iommu_device_use_default_domain(dev); > if (ret) > arch_teardown_dma_ops(dev); > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c > index a3b45b84f42b..1cec7074367a 100644 > --- a/drivers/iommu/iommu.c > +++ b/drivers/iommu/iommu.c > @@ -414,9 +414,21 @@ static int iommu_init_device(struct device *dev) > if (!dev_iommu_get(dev)) > return -ENOMEM; > /* > - * For FDT-based systems and ACPI IORT/VIOT, drivers register IOMMU > - * instances with non-NULL fwnodes, and client devices should have been > - * identified with a fwspec by this point. Otherwise, we can currently > + * For FDT-based systems and ACPI IORT/VIOT, the common firmware parsing > + * is buried in the bus dma_configure path. Properly unpicking that is > + * still a big job, so for now just invoke the whole thing. The device > + * already having a driver bound means dma_configure has already run and > + * either found no IOMMU to wait for, or we're in its replay call right > + * now, so either way there's no point calling it again. > + */ > + if (!dev->driver && dev->bus->dma_configure) { > + mutex_unlock(&iommu_probe_device_lock); > + dev->bus->dma_configure(dev); > + mutex_lock(&iommu_probe_device_lock); > + } > + /* > + * At this point, relevant devices either now have a fwspec which will > + * match ops registered with a non-NULL fwnode, or we can reasonably > * assume that only one of Intel, AMD, s390, PAMU or legacy SMMUv2 can > * be present, and that any of their registered instances has suitable > * ops for probing, and thus cheekily co-opt the same mechanism. > @@ -426,6 +438,12 @@ static int iommu_init_device(struct device *dev) > ret = -ENODEV; > goto err_free; > } > + /* > + * And if we do now see any replay calls, they would indicate someone > + * misusing the dma_configure path outside bus code. > + */ > + if (dev->driver) > + dev_WARN(dev, "late IOMMU probe at driver bind, something fishy here!\n"); > > if (!try_module_get(ops->owner)) { > ret = -EINVAL; > diff --git a/drivers/iommu/of_iommu.c b/drivers/iommu/of_iommu.c > index e10a68b5ffde..6b989a62def2 100644 > --- a/drivers/iommu/of_iommu.c > +++ b/drivers/iommu/of_iommu.c > @@ -155,7 +155,12 @@ int of_iommu_configure(struct device *dev, struct device_node *master_np, > dev_iommu_free(dev); > mutex_unlock(&iommu_probe_device_lock); > > - if (!err && dev->bus) > + /* > + * If we're not on the iommu_probe_device() path (as indicated by the > + * initial dev->iommu) then try to simulate it. This should no longer > + * happen unless of_dma_configure() is being misused outside bus code. > + */ > + if (!err && dev->bus && !dev_iommu_present) > err = iommu_probe_device(dev); > > if (err && err != -EPROBE_DEFER) > diff --git a/drivers/of/device.c b/drivers/of/device.c > index edf3be197265..5053e5d532cc 100644 > --- a/drivers/of/device.c > +++ b/drivers/of/device.c > @@ -99,6 +99,11 @@ int of_dma_configure_id(struct device *dev, struct device_node *np, > bool coherent, set_map = false; > int ret; > > + if (dev->dma_range_map) { > + dev_dbg(dev, "dma_range_map already set\n"); > + goto skip_map; > + } > + > if (np == dev->of_node) > bus_np = __of_get_dma_parent(np); > else > @@ -119,7 +124,7 @@ int of_dma_configure_id(struct device *dev, struct device_node *np, > end = dma_range_map_max(map); > set_map = true; > } > - > +skip_map: > /* > * If @dev is expected to be DMA-capable then the bus code that created > * it should have initialised its dma_mask pointer by this point. For > diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c > index f57ea36d125d..4468b7327cab 100644 > --- a/drivers/pci/pci-driver.c > +++ b/drivers/pci/pci-driver.c > @@ -1653,7 +1653,8 @@ static int pci_dma_configure(struct device *dev) > > pci_put_host_bridge_device(bridge); > > - if (!ret && !driver->driver_managed_dma) { > + /* @driver may not be valid when we're called from the IOMMU layer */ > + if (!ret && dev->driver && !driver->driver_managed_dma) { > ret = iommu_device_use_default_domain(dev); > if (ret) > arch_teardown_dma_ops(dev); Best regards
On 2025-03-13 9:56 am, Marek Szyprowski wrote: [...] > This patch landed in yesterday's linux-next as commit bcb81ac6ae3c > ("iommu: Get DT/ACPI parsing into the proper probe path"). In my tests I > found it breaks booting of ARM64 RK3568-based Odroid-M1 board > (arch/arm64/boot/dts/rockchip/rk3568-odroid-m1.dts). Here is the > relevant kernel log: ...and the bug-flushing-out begins! > Unable to handle kernel NULL pointer dereference at virtual address > 00000000000003e8 > Mem abort info: > ESR = 0x0000000096000004 > EC = 0x25: DABT (current EL), IL = 32 bits > SET = 0, FnV = 0 > EA = 0, S1PTW = 0 > FSC = 0x04: level 0 translation fault > Data abort info: > ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000 > CM = 0, WnR = 0, TnD = 0, TagAccess = 0 > GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 > [00000000000003e8] user address but active_mm is swapper > Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP > Modules linked in: > CPU: 3 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.14.0-rc3+ #15533 > Hardware name: Hardkernel ODROID-M1 (DT) > pstate: 00400009 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) > pc : devm_kmalloc+0x2c/0x114 > lr : rk_iommu_of_xlate+0x30/0x90 > ... > Call trace: > devm_kmalloc+0x2c/0x114 (P) > rk_iommu_of_xlate+0x30/0x90 Yeah, looks like this is doing something a bit questionable which can't work properly. TBH the whole dma_dev thing could probably be cleaned up now that we have proper instances, but for now does this work? (annoyingly none of my Rockchip boards are set up for testing right now, but I might have time to dig one out later) Thanks, Robin. ----->8----- Subject: [PATCH] iommu/rockchip: Allocate per-device data sensibly Now that DT-based probing is finally happening in the right order again, it reveals an issue in Rockchip's of_xlate, which can now be called during registration, but is using the global dma_dev which is only assigned later. However, this makes little sense when we're already looking up the correct IOMMU device, who should logically be the owner of the devm allocation anyway. Fixes: bcb81ac6ae3c ("iommu: Get DT/ACPI parsing into the proper probe path") Signed-off-by: Robin Murphy <robin.murphy@arm.com> --- drivers/iommu/rockchip-iommu.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/iommu/rockchip-iommu.c b/drivers/iommu/rockchip-iommu.c index 323cc665c357..48826d1ccfd8 100644 --- a/drivers/iommu/rockchip-iommu.c +++ b/drivers/iommu/rockchip-iommu.c @@ -1148,12 +1148,12 @@ static int rk_iommu_of_xlate(struct device *dev, struct platform_device *iommu_dev; struct rk_iommudata *data; - data = devm_kzalloc(dma_dev, sizeof(*data), GFP_KERNEL); + iommu_dev = of_find_device_by_node(args->np); + + data = devm_kzalloc(&iommu_dev->dev, sizeof(*data), GFP_KERNEL); if (!data) return -ENOMEM; - iommu_dev = of_find_device_by_node(args->np); - data->iommu = platform_get_drvdata(iommu_dev); data->iommu->domain = &rk_identity_domain; dev_iommu_priv_set(dev, data);
On 13.03.2025 12:01, Robin Murphy wrote: > On 2025-03-13 9:56 am, Marek Szyprowski wrote: > [...] >> This patch landed in yesterday's linux-next as commit bcb81ac6ae3c >> ("iommu: Get DT/ACPI parsing into the proper probe path"). In my tests I >> found it breaks booting of ARM64 RK3568-based Odroid-M1 board >> (arch/arm64/boot/dts/rockchip/rk3568-odroid-m1.dts). Here is the >> relevant kernel log: > > ...and the bug-flushing-out begins! > >> Unable to handle kernel NULL pointer dereference at virtual address >> 00000000000003e8 >> Mem abort info: >> ESR = 0x0000000096000004 >> EC = 0x25: DABT (current EL), IL = 32 bits >> SET = 0, FnV = 0 >> EA = 0, S1PTW = 0 >> FSC = 0x04: level 0 translation fault >> Data abort info: >> ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000 >> CM = 0, WnR = 0, TnD = 0, TagAccess = 0 >> GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 >> [00000000000003e8] user address but active_mm is swapper >> Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP >> Modules linked in: >> CPU: 3 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.14.0-rc3+ #15533 >> Hardware name: Hardkernel ODROID-M1 (DT) >> pstate: 00400009 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) >> pc : devm_kmalloc+0x2c/0x114 >> lr : rk_iommu_of_xlate+0x30/0x90 >> ... >> Call trace: >> devm_kmalloc+0x2c/0x114 (P) >> rk_iommu_of_xlate+0x30/0x90 > > Yeah, looks like this is doing something a bit questionable which can't > work properly. TBH the whole dma_dev thing could probably be cleaned up > now that we have proper instances, but for now does this work? Yes, this patch fixes the problem I've observed. Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> BTW, this dma_dev idea has been borrowed from my exynos_iommu driver and I doubt it can be cleaned up. > > (annoyingly none of my Rockchip boards are set up for testing right > now, but I might have time to dig one out later) > > Thanks, > Robin. > > ----->8----- > > Subject: [PATCH] iommu/rockchip: Allocate per-device data sensibly > > Now that DT-based probing is finally happening in the right order again, > it reveals an issue in Rockchip's of_xlate, which can now be called > during registration, but is using the global dma_dev which is only > assigned later. However, this makes little sense when we're already > looking up the correct IOMMU device, who should logically be the owner > of the devm allocation anyway. > > Fixes: bcb81ac6ae3c ("iommu: Get DT/ACPI parsing into the proper probe > path") > Signed-off-by: Robin Murphy <robin.murphy@arm.com> > --- > drivers/iommu/rockchip-iommu.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/drivers/iommu/rockchip-iommu.c > b/drivers/iommu/rockchip-iommu.c > index 323cc665c357..48826d1ccfd8 100644 > --- a/drivers/iommu/rockchip-iommu.c > +++ b/drivers/iommu/rockchip-iommu.c > @@ -1148,12 +1148,12 @@ static int rk_iommu_of_xlate(struct device *dev, > struct platform_device *iommu_dev; > struct rk_iommudata *data; > > - data = devm_kzalloc(dma_dev, sizeof(*data), GFP_KERNEL); > + iommu_dev = of_find_device_by_node(args->np); > + > + data = devm_kzalloc(&iommu_dev->dev, sizeof(*data), GFP_KERNEL); > if (!data) > return -ENOMEM; > > - iommu_dev = of_find_device_by_node(args->np); > - > data->iommu = platform_get_drvdata(iommu_dev); > data->iommu->domain = &rk_identity_domain; > dev_iommu_priv_set(dev, data); Best regards
On 2025-03-13 12:23 pm, Marek Szyprowski wrote: > On 13.03.2025 12:01, Robin Murphy wrote: >> On 2025-03-13 9:56 am, Marek Szyprowski wrote: >> [...] >>> This patch landed in yesterday's linux-next as commit bcb81ac6ae3c >>> ("iommu: Get DT/ACPI parsing into the proper probe path"). In my tests I >>> found it breaks booting of ARM64 RK3568-based Odroid-M1 board >>> (arch/arm64/boot/dts/rockchip/rk3568-odroid-m1.dts). Here is the >>> relevant kernel log: >> >> ...and the bug-flushing-out begins! >> >>> Unable to handle kernel NULL pointer dereference at virtual address >>> 00000000000003e8 >>> Mem abort info: >>> ESR = 0x0000000096000004 >>> EC = 0x25: DABT (current EL), IL = 32 bits >>> SET = 0, FnV = 0 >>> EA = 0, S1PTW = 0 >>> FSC = 0x04: level 0 translation fault >>> Data abort info: >>> ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000 >>> CM = 0, WnR = 0, TnD = 0, TagAccess = 0 >>> GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 >>> [00000000000003e8] user address but active_mm is swapper >>> Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP >>> Modules linked in: >>> CPU: 3 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.14.0-rc3+ #15533 >>> Hardware name: Hardkernel ODROID-M1 (DT) >>> pstate: 00400009 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) >>> pc : devm_kmalloc+0x2c/0x114 >>> lr : rk_iommu_of_xlate+0x30/0x90 >>> ... >>> Call trace: >>> devm_kmalloc+0x2c/0x114 (P) >>> rk_iommu_of_xlate+0x30/0x90 >> >> Yeah, looks like this is doing something a bit questionable which can't >> work properly. TBH the whole dma_dev thing could probably be cleaned up >> now that we have proper instances, but for now does this work? > > Yes, this patch fixes the problem I've observed. > > Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> > Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> > > BTW, this dma_dev idea has been borrowed from my exynos_iommu driver and > I doubt it can be cleaned up. On the contrary I suspect they both can - it all dates back to when we had the single global platform bus iommu_ops and the SoC drivers were forced to bodge their own notion of multiple instances, but with the modern core code, ops are always called via a valid IOMMU instance or domain, so in principle it should always be possible to get at an appropriate IOMMU device now. IIRC it was mostly about allocating and DMA-mapping the pagetables in domain_alloc, where the private notion of instances didn't have enough information, but domain_alloc_paging solves that. Cheers, Robin.
On 2025-03-13 1:06 pm, Robin Murphy wrote: > On 2025-03-13 12:23 pm, Marek Szyprowski wrote: >> On 13.03.2025 12:01, Robin Murphy wrote: >>> On 2025-03-13 9:56 am, Marek Szyprowski wrote: >>> [...] >>>> This patch landed in yesterday's linux-next as commit bcb81ac6ae3c >>>> ("iommu: Get DT/ACPI parsing into the proper probe path"). In my >>>> tests I >>>> found it breaks booting of ARM64 RK3568-based Odroid-M1 board >>>> (arch/arm64/boot/dts/rockchip/rk3568-odroid-m1.dts). Here is the >>>> relevant kernel log: >>> >>> ...and the bug-flushing-out begins! >>> >>>> Unable to handle kernel NULL pointer dereference at virtual address >>>> 00000000000003e8 >>>> Mem abort info: >>>> ESR = 0x0000000096000004 >>>> EC = 0x25: DABT (current EL), IL = 32 bits >>>> SET = 0, FnV = 0 >>>> EA = 0, S1PTW = 0 >>>> FSC = 0x04: level 0 translation fault >>>> Data abort info: >>>> ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000 >>>> CM = 0, WnR = 0, TnD = 0, TagAccess = 0 >>>> GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 >>>> [00000000000003e8] user address but active_mm is swapper >>>> Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP >>>> Modules linked in: >>>> CPU: 3 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.14.0-rc3+ #15533 >>>> Hardware name: Hardkernel ODROID-M1 (DT) >>>> pstate: 00400009 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) >>>> pc : devm_kmalloc+0x2c/0x114 >>>> lr : rk_iommu_of_xlate+0x30/0x90 >>>> ... >>>> Call trace: >>>> devm_kmalloc+0x2c/0x114 (P) >>>> rk_iommu_of_xlate+0x30/0x90 >>> >>> Yeah, looks like this is doing something a bit questionable which can't >>> work properly. TBH the whole dma_dev thing could probably be cleaned up >>> now that we have proper instances, but for now does this work? >> >> Yes, this patch fixes the problem I've observed. >> >> Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> >> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> >> >> BTW, this dma_dev idea has been borrowed from my exynos_iommu driver and >> I doubt it can be cleaned up. > > On the contrary I suspect they both can - it all dates back to when we > had the single global platform bus iommu_ops and the SoC drivers were > forced to bodge their own notion of multiple instances, but with the > modern core code, ops are always called via a valid IOMMU instance or > domain, so in principle it should always be possible to get at an > appropriate IOMMU device now. IIRC it was mostly about allocating and > DMA-mapping the pagetables in domain_alloc, where the private notion of > instances didn't have enough information, but domain_alloc_paging solves > that. Bah, in fact I think I am going to have to do that now, since although it doesn't crash, rk_domain_alloc_paging() will also be failing for the same reason. Time to find a PSU for the RK3399 board, I guess... (Or maybe just move the dma_dev assignment earlier to match Exynos?) Thanks, Robin.
On 13.03.2025 15:12, Robin Murphy wrote: > On 2025-03-13 1:06 pm, Robin Murphy wrote: >> On 2025-03-13 12:23 pm, Marek Szyprowski wrote: >>> On 13.03.2025 12:01, Robin Murphy wrote: >>>> On 2025-03-13 9:56 am, Marek Szyprowski wrote: >>>> [...] >>>>> This patch landed in yesterday's linux-next as commit bcb81ac6ae3c >>>>> ("iommu: Get DT/ACPI parsing into the proper probe path"). In my >>>>> tests I >>>>> found it breaks booting of ARM64 RK3568-based Odroid-M1 board >>>>> (arch/arm64/boot/dts/rockchip/rk3568-odroid-m1.dts). Here is the >>>>> relevant kernel log: >>>> >>>> ...and the bug-flushing-out begins! >>>> >>>>> Unable to handle kernel NULL pointer dereference at virtual address >>>>> 00000000000003e8 >>>>> Mem abort info: >>>>> ESR = 0x0000000096000004 >>>>> EC = 0x25: DABT (current EL), IL = 32 bits >>>>> SET = 0, FnV = 0 >>>>> EA = 0, S1PTW = 0 >>>>> FSC = 0x04: level 0 translation fault >>>>> Data abort info: >>>>> ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000 >>>>> CM = 0, WnR = 0, TnD = 0, TagAccess = 0 >>>>> GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 >>>>> [00000000000003e8] user address but active_mm is swapper >>>>> Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP >>>>> Modules linked in: >>>>> CPU: 3 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.14.0-rc3+ #15533 >>>>> Hardware name: Hardkernel ODROID-M1 (DT) >>>>> pstate: 00400009 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) >>>>> pc : devm_kmalloc+0x2c/0x114 >>>>> lr : rk_iommu_of_xlate+0x30/0x90 >>>>> ... >>>>> Call trace: >>>>> devm_kmalloc+0x2c/0x114 (P) >>>>> rk_iommu_of_xlate+0x30/0x90 >>>> >>>> Yeah, looks like this is doing something a bit questionable which >>>> can't >>>> work properly. TBH the whole dma_dev thing could probably be >>>> cleaned up >>>> now that we have proper instances, but for now does this work? >>> >>> Yes, this patch fixes the problem I've observed. >>> >>> Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> >>> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> >>> >>> BTW, this dma_dev idea has been borrowed from my exynos_iommu driver >>> and >>> I doubt it can be cleaned up. >> >> On the contrary I suspect they both can - it all dates back to when >> we had the single global platform bus iommu_ops and the SoC drivers >> were forced to bodge their own notion of multiple instances, but with >> the modern core code, ops are always called via a valid IOMMU >> instance or domain, so in principle it should always be possible to >> get at an appropriate IOMMU device now. IIRC it was mostly about >> allocating and DMA-mapping the pagetables in domain_alloc, where the >> private notion of instances didn't have enough information, but >> domain_alloc_paging solves that. > > Bah, in fact I think I am going to have to do that now, since although > it doesn't crash, rk_domain_alloc_paging() will also be failing for > the same reason. Time to find a PSU for the RK3399 board, I guess... > > (Or maybe just move the dma_dev assignment earlier to match Exynos?) Well I just found that Exynos IOMMU is also broken on some on my test boards. It looks that the runtime pm links are somehow not correctly established. I will try to analyze this later in the afternoon. Best regards
On 17/03/2025 7:37 am, Marek Szyprowski wrote: > On 13.03.2025 15:12, Robin Murphy wrote: >> On 2025-03-13 1:06 pm, Robin Murphy wrote: >>> On 2025-03-13 12:23 pm, Marek Szyprowski wrote: >>>> On 13.03.2025 12:01, Robin Murphy wrote: >>>>> On 2025-03-13 9:56 am, Marek Szyprowski wrote: >>>>> [...] >>>>>> This patch landed in yesterday's linux-next as commit bcb81ac6ae3c >>>>>> ("iommu: Get DT/ACPI parsing into the proper probe path"). In my >>>>>> tests I >>>>>> found it breaks booting of ARM64 RK3568-based Odroid-M1 board >>>>>> (arch/arm64/boot/dts/rockchip/rk3568-odroid-m1.dts). Here is the >>>>>> relevant kernel log: >>>>> >>>>> ...and the bug-flushing-out begins! >>>>> >>>>>> Unable to handle kernel NULL pointer dereference at virtual address >>>>>> 00000000000003e8 >>>>>> Mem abort info: >>>>>> ESR = 0x0000000096000004 >>>>>> EC = 0x25: DABT (current EL), IL = 32 bits >>>>>> SET = 0, FnV = 0 >>>>>> EA = 0, S1PTW = 0 >>>>>> FSC = 0x04: level 0 translation fault >>>>>> Data abort info: >>>>>> ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000 >>>>>> CM = 0, WnR = 0, TnD = 0, TagAccess = 0 >>>>>> GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 >>>>>> [00000000000003e8] user address but active_mm is swapper >>>>>> Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP >>>>>> Modules linked in: >>>>>> CPU: 3 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.14.0-rc3+ #15533 >>>>>> Hardware name: Hardkernel ODROID-M1 (DT) >>>>>> pstate: 00400009 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) >>>>>> pc : devm_kmalloc+0x2c/0x114 >>>>>> lr : rk_iommu_of_xlate+0x30/0x90 >>>>>> ... >>>>>> Call trace: >>>>>> devm_kmalloc+0x2c/0x114 (P) >>>>>> rk_iommu_of_xlate+0x30/0x90 >>>>> >>>>> Yeah, looks like this is doing something a bit questionable which >>>>> can't >>>>> work properly. TBH the whole dma_dev thing could probably be >>>>> cleaned up >>>>> now that we have proper instances, but for now does this work? >>>> >>>> Yes, this patch fixes the problem I've observed. >>>> >>>> Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> >>>> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> >>>> >>>> BTW, this dma_dev idea has been borrowed from my exynos_iommu driver >>>> and >>>> I doubt it can be cleaned up. >>> >>> On the contrary I suspect they both can - it all dates back to when >>> we had the single global platform bus iommu_ops and the SoC drivers >>> were forced to bodge their own notion of multiple instances, but with >>> the modern core code, ops are always called via a valid IOMMU >>> instance or domain, so in principle it should always be possible to >>> get at an appropriate IOMMU device now. IIRC it was mostly about >>> allocating and DMA-mapping the pagetables in domain_alloc, where the >>> private notion of instances didn't have enough information, but >>> domain_alloc_paging solves that. >> >> Bah, in fact I think I am going to have to do that now, since although >> it doesn't crash, rk_domain_alloc_paging() will also be failing for >> the same reason. Time to find a PSU for the RK3399 board, I guess... >> >> (Or maybe just move the dma_dev assignment earlier to match Exynos?) > > Well I just found that Exynos IOMMU is also broken on some on my test > boards. It looks that the runtime pm links are somehow not correctly > established. I will try to analyze this later in the afternoon. Hmm, I tried to get an Odroid-XU3 up and running, but it seems unable to boot my original 6.14-rc3-based branch - even with the IOMMU driver disabled, it's consistently dying somewhere near (or just after) init with what looks like some catastrophic memory corruption issue - very occasionally it's managed to print the first line of various different panics. Before that point though, with the IOMMU driver enabled it does appear to show signs of working OK: [ 0.649703] exynos-sysmmu 14650000.sysmmu: hardware version: 3.3 [ 0.654220] platform 14450000.mixer: Adding to iommu group 1 ... [ 2.680920] exynos-mixer 14450000.mixer: exynos_iommu_attach_device: Attached IOMMU with pgtable 0x42924000 ... [ 5.196674] exynos-mixer 14450000.mixer: exynos_iommu_identity_attach: Restored IOMMU to IDENTITY from pgtable 0x42924000 [ 5.207091] exynos-mixer 14450000.mixer: exynos_iommu_attach_device: Attached IOMMU with pgtable 0x42884000 The multi-instance stuff in probe/release does look a bit suspect, however - seems like the second instance probe would overwrite the first instance's links, and then there would be a double-del() if the device were ever actually released again? I may have made that much more likely to happen, but I suspect it was already possible with async driver probe... Thanks, Robin.
Hi Robin, On Fri, 28 Feb 2025 at 16:51, Robin Murphy <robin.murphy@arm.com> wrote: > In hindsight, there were some crucial subtleties overlooked when moving > {of,acpi}_dma_configure() to driver probe time to allow waiting for > IOMMU drivers with -EPROBE_DEFER, and these have become an > ever-increasing source of problems. The IOMMU API has some fundamental > assumptions that iommu_probe_device() is called for every device added > to the system, in the order in which they are added. Calling it in a > random order or not at all dependent on driver binding leads to > malformed groups, a potential lack of isolation for devices with no > driver, and all manner of unexpected concurrency and race conditions. > We've attempted to mitigate the latter with point-fix bodges like > iommu_probe_device_lock, but it's a losing battle and the time has come > to bite the bullet and address the true source of the problem instead. > > The crux of the matter is that the firmware parsing actually serves two > distinct purposes; one is identifying the IOMMU instance associated with > a device so we can check its availability, the second is actually > telling that instance about the relevant firmware-provided data for the > device. However the latter also depends on the former, and at the time > there was no good place to defer and retry that separately from the > availability check we also wanted for client driver probe. > > Nowadays, though, we have a proper notion of multiple IOMMU instances in > the core API itself, and each one gets a chance to probe its own devices > upon registration, so we can finally make that work as intended for > DT/IORT/VIOT platforms too. All we need is for iommu_probe_device() to > be able to run the iommu_fwspec machinery currently buried deep in the > wrong end of {of,acpi}_dma_configure(). Luckily it turns out to be > surprisingly straightforward to bootstrap this transformation by pretty > much just calling the same path twice. At client driver probe time, > dev->driver is obviously set; conversely at device_add(), or a > subsequent bus_iommu_probe(), any device waiting for an IOMMU really > should *not* have a driver already, so we can use that as a condition to > disambiguate the two cases, and avoid recursing back into the IOMMU core > at the wrong times. > > Obviously this isn't the nicest thing, but for now it gives us a > functional baseline to then unpick the layers in between without many > more awkward cross-subsystem patches. There are some minor side-effects > like dma_range_map potentially being created earlier, and some debug > prints being repeated, but these aren't significantly detrimental. Let's > make things work first, then deal with making them nice. > > With the basic flow finally in the right order again, the next step is > probably turning the bus->dma_configure paths inside-out, since all we > really need from bus code is its notion of which device and input ID(s) > to parse the common firmware properties with... > > Acked-by: Bjorn Helgaas <bhelgaas@google.com> # pci-driver.c > Acked-by: Rob Herring (Arm) <robh@kernel.org> # of/device.c > Signed-off-by: Robin Murphy <robin.murphy@arm.com> Thanks for your patch, which is now commit bcb81ac6ae3c2ef9 ("iommu: Get DT/ACPI parsing into the proper probe path") in iommu/next. This patch triggers two issues on R-Car Gen3 platforms: 1. I am seeing a warning on Renesas Salvator-XS with R-Car M3N (but not on the similar board with R-Car H3), and only for SATA[1]. Unfortunately commit 73d2f10957f517e5 ("iommu: Don't warn prematurely about dodgy probes") does not help: ------------[ cut here ]------------ sata_rcar ee300000.sata: late IOMMU probe at driver bind, something fishy here! WARNING: CPU: 1 PID: 13 at drivers/iommu/iommu.c:571 __iommu_probe_device+0x208/0x38c Modules linked in: CPU: 1 UID: 0 PID: 13 Comm: kworker/u8:1 Not tainted 6.14.0-rc3-rcar3-00020-g73d2f10957f5-dirty #315 Hardware name: Renesas Salvator-X 2nd version board based on r8a77965 (DT) Workqueue: events_unbound deferred_probe_work_func pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : __iommu_probe_device+0x208/0x38c lr : __iommu_probe_device+0x208/0x38c sp : ffffffc086da39a0 x29: ffffffc086da39b0 x28: 0000000000000000 x27: 0000000000000000 x26: 0000000000000001 x25: ffffffc080e0e0ae x24: ffffffc080e0e0bb x23: 0000000000000000 x22: ffffff800bd3d090 x21: ffffffc080acf680 x20: ffffff8008e8f780 x19: ffffff800aca8810 x18: 00000000e9f55e4c x17: 6874656d6f73202c x16: 646e696220726576 x15: 0720072007200720 x14: 0720072007200720 x13: 0720072007200720 x12: 0720072007200720 x11: 00000000000001ac x10: ffffffc086da3700 x9 : ffffffc083edb140 x8 : ffffffc086da3698 x7 : ffffffc086da36a0 x6 : 00000000fff7ffff x5 : c0000000fff7ffff x4 : 0000000000000000 x3 : 0000000000000001 x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffffff80083d5380 Call trace: __iommu_probe_device+0x208/0x38c (P) iommu_probe_device+0x34/0x74 of_iommu_configure+0x128/0x200 of_dma_configure_id+0xdc/0x1d4 platform_dma_configure+0x48/0x6c really_probe+0xf0/0x260 __driver_probe_device+0xec/0x104 driver_probe_device+0x3c/0xc0 __device_attach_driver+0x58/0xcc bus_for_each_drv+0xb8/0xe0 __device_attach+0xdc/0x138 device_initial_probe+0x10/0x18 bus_probe_device+0x38/0xa0 deferred_probe_work_func+0xb4/0xcc process_scheduled_works+0x2e4/0x4a8 worker_thread+0x144/0x1cc kthread+0x188/0x198 ret_from_fork+0x10/0x20 irq event stamp: 49052 hardirqs last enabled at (49051): [<ffffffc0800fb6a8>] __up_console_sem+0x50/0x74 hardirqs last disabled at (49052): [<ffffffc0809eb65c>] el1_dbg+0x20/0x6c softirqs last enabled at (49046): [<ffffffc080096c50>] handle_softirqs+0x1b0/0x3b4 softirqs last disabled at (48839): [<ffffffc080010168>] __do_softirq+0x10/0x18 ---[ end trace 0000000000000000 ]--- I added debug prints to sata_rcar_probe(), and verified SATA is probed at about the same time on R-Car H3 and M3N, and the probe is never deferred. Do you have a clue? 2. The IOMMU driver's iommu_ops.of_xlate() callback is called about three times as much as before: +platform ec700000.dma-controller: ipmmu_of_xlate +platform ec720000.dma-controller: ipmmu_of_xlate +platform ec700000.dma-controller: ipmmu_of_xlate +platform ec720000.dma-controller: ipmmu_of_xlate +platform ec700000.dma-controller: ipmmu_of_xlate +platform ec720000.dma-controller: ipmmu_of_xlate +platform ec700000.dma-controller: ipmmu_of_xlate +platform ec720000.dma-controller: ipmmu_of_xlate +platform ec700000.dma-controller: ipmmu_of_xlate +platform ec720000.dma-controller: ipmmu_of_xlate +platform fea27000.fcp: ipmmu_of_xlate +platform fea2f000.fcp: ipmmu_of_xlate +platform ec700000.dma-controller: ipmmu_of_xlate +platform ec720000.dma-controller: ipmmu_of_xlate +platform fe950000.fcp: ipmmu_of_xlate +platform fe96f000.fcp: ipmmu_of_xlate +platform fea27000.fcp: ipmmu_of_xlate +platform fea2f000.fcp: ipmmu_of_xlate +platform fe9af000.fcp: ipmmu_of_xlate rcar-fcp fe950000.fcp: ipmmu_of_xlate rcar-fcp fe96f000.fcp: ipmmu_of_xlate rcar-fcp fea27000.fcp: ipmmu_of_xlate rcar-fcp fea2f000.fcp: ipmmu_of_xlate rcar-fcp fe9af000.fcp: ipmmu_of_xlate rcar-dmac ec700000.dma-controller: ipmmu_of_xlate rcar-dmac ec720000.dma-controller: ipmmu_of_xlate +platform e6700000.dma-controller: ipmmu_of_xlate +platform e6800000.ethernet: ipmmu_of_xlate +platform e6700000.dma-controller: ipmmu_of_xlate +platform e7300000.dma-controller: ipmmu_of_xlate +platform e7310000.dma-controller: ipmmu_of_xlate +platform e6800000.ethernet: ipmmu_of_xlate +platform ee100000.mmc: ipmmu_of_xlate +platform ee140000.mmc: ipmmu_of_xlate +platform ee160000.mmc: ipmmu_of_xlate +platform e6700000.dma-controller: ipmmu_of_xlate +platform e7300000.dma-controller: ipmmu_of_xlate +platform e7310000.dma-controller: ipmmu_of_xlate +platform e6800000.ethernet: ipmmu_of_xlate +platform ee100000.mmc: ipmmu_of_xlate +platform ee140000.mmc: ipmmu_of_xlate +platform ee160000.mmc: ipmmu_of_xlate +platform ee300000.sata: ipmmu_of_xlate sata_rcar ee300000.sata: ipmmu_of_xlate ravb e6800000.ethernet: ipmmu_of_xlate -renesas_sdhi_internal_dmac ee100000.mmc: ipmmu_of_xlate -renesas_sdhi_internal_dmac ee140000.mmc: ipmmu_of_xlate -renesas_sdhi_internal_dmac ee160000.mmc: ipmmu_of_xlate rcar-dmac e6700000.dma-controller: ipmmu_of_xlate rcar-dmac e7300000.dma-controller: ipmmu_of_xlate rcar-dmac e7310000.dma-controller: ipmmu_of_xlate For some devices, it can be called up to 6 times. All of the duplicates happen before the device is bound (cfr. "platform" instead of the actual driver name): 6 platform ec720000.dma-controller: ipmmu_of_xlate 6 platform ec700000.dma-controller: ipmmu_of_xlate 3 platform e6800000.ethernet: ipmmu_of_xlate 3 platform e6700000.dma-controller: ipmmu_of_xlate 2 platform fea2f000.fcp: ipmmu_of_xlate 2 platform fea27000.fcp: ipmmu_of_xlate 2 platform ee160000.mmc: ipmmu_of_xlate 2 platform ee140000.mmc: ipmmu_of_xlate 2 platform ee100000.mmc: ipmmu_of_xlate 2 platform e7310000.dma-controller: ipmmu_of_xlate 2 platform e7300000.dma-controller: ipmmu_of_xlate 1 platform fe9af000.fcp: ipmmu_of_xlate 1 platform fe96f000.fcp: ipmmu_of_xlate 1 platform fe950000.fcp: ipmmu_of_xlate 1 platform ee300000.sata: ipmmu_of_xlate 1 sata_rcar ee300000.sata: ipmmu_of_xlate 1 rcar-fcp fea2f000.fcp: ipmmu_of_xlate 1 rcar-fcp fea27000.fcp: ipmmu_of_xlate 1 rcar-fcp fe9af000.fcp: ipmmu_of_xlate 1 rcar-fcp fe96f000.fcp: ipmmu_of_xlate 1 rcar-fcp fe950000.fcp: ipmmu_of_xlate 1 rcar-dmac ec720000.dma-controller: ipmmu_of_xlate 1 rcar-dmac ec700000.dma-controller: ipmmu_of_xlate 1 rcar-dmac e7310000.dma-controller: ipmmu_of_xlate 1 rcar-dmac e7300000.dma-controller: ipmmu_of_xlate 1 rcar-dmac e6700000.dma-controller: ipmmu_of_xlate 1 ravb e6800000.ethernet: ipmmu_of_xlate Before, the callback was called just once for each DMA slave device. Is this intentional? Thanks! [1] SATA IOMMU on R-Car Gen3 needs an out-of-tree patch to add it to drivers/iommu/ipmmu-vmsa.c:devices_allowlist[]. Gr{oetje,eeting}s, Geert
Hi Geert, On 18/03/2025 4:37 pm, Geert Uytterhoeven wrote: [...] > Thanks for your patch, which is now commit bcb81ac6ae3c2ef9 ("iommu: > Get DT/ACPI parsing into the proper probe path") in iommu/next. > > This patch triggers two issues on R-Car Gen3 platforms: > > 1. I am seeing a warning on Renesas Salvator-XS with R-Car M3N > (but not on the similar board with R-Car H3), and only for SATA[1]. > Unfortunately commit 73d2f10957f517e5 ("iommu: Don't warn prematurely > about dodgy probes") does not help: [...] > Call trace: > __iommu_probe_device+0x208/0x38c (P) > iommu_probe_device+0x34/0x74 > of_iommu_configure+0x128/0x200 > of_dma_configure_id+0xdc/0x1d4 > platform_dma_configure+0x48/0x6c > really_probe+0xf0/0x260 > __driver_probe_device+0xec/0x104 > driver_probe_device+0x3c/0xc0 Hurrah, this is the warning doing the correct job - something *is* off if we're now getting here without the IOMMU configuration being done already (for a normal device with no other funny business going on). > 2. The IOMMU driver's iommu_ops.of_xlate() callback is called about > three times as much as before: That would suggest that the fwspec gets set up OK, then something later in the __iommu_probe_device() path fails and tears it down again, so the next attempt starts from scratch. Do you see the "Cannot attach to IPMMU" message firing? And similarly to the Rockchip case, does the below help? Thanks, Robin. ----->8----- diff --git a/drivers/iommu/ipmmu-vmsa.c b/drivers/iommu/ipmmu-vmsa.c index 074daf1aac4e..5d416262ae5f 100644 --- a/drivers/iommu/ipmmu-vmsa.c +++ b/drivers/iommu/ipmmu-vmsa.c @@ -1081,6 +1081,7 @@ static int ipmmu_probe(struct platform_device *pdev) } } + platform_set_drvdata(pdev, mmu); /* * Register the IPMMU to the IOMMU subsystem in the following cases: * - R-Car Gen2 IPMMU (all devices registered) @@ -1103,7 +1104,6 @@ static int ipmmu_probe(struct platform_device *pdev) * ipmmu_init() after the probe function returns. */ - platform_set_drvdata(pdev, mmu); return 0; }
On 17.03.2025 19:22, Robin Murphy wrote: > On 17/03/2025 7:37 am, Marek Szyprowski wrote: >> On 13.03.2025 15:12, Robin Murphy wrote: >>> On 2025-03-13 1:06 pm, Robin Murphy wrote: >>>> On 2025-03-13 12:23 pm, Marek Szyprowski wrote: >>>>> On 13.03.2025 12:01, Robin Murphy wrote: >>>>>> On 2025-03-13 9:56 am, Marek Szyprowski wrote: >>>>>> [...] >>>>>>> This patch landed in yesterday's linux-next as commit bcb81ac6ae3c >>>>>>> ("iommu: Get DT/ACPI parsing into the proper probe path"). In my >>>>>>> tests I >>>>>>> found it breaks booting of ARM64 RK3568-based Odroid-M1 board >>>>>>> (arch/arm64/boot/dts/rockchip/rk3568-odroid-m1.dts). Here is the >>>>>>> relevant kernel log: >>>>>> >>>>>> ...and the bug-flushing-out begins! >>>>>> >>>>>>> Unable to handle kernel NULL pointer dereference at virtual address >>>>>>> 00000000000003e8 >>>>>>> Mem abort info: >>>>>>> ESR = 0x0000000096000004 >>>>>>> EC = 0x25: DABT (current EL), IL = 32 bits >>>>>>> SET = 0, FnV = 0 >>>>>>> EA = 0, S1PTW = 0 >>>>>>> FSC = 0x04: level 0 translation fault >>>>>>> Data abort info: >>>>>>> ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000 >>>>>>> CM = 0, WnR = 0, TnD = 0, TagAccess = 0 >>>>>>> GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 >>>>>>> [00000000000003e8] user address but active_mm is swapper >>>>>>> Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP >>>>>>> Modules linked in: >>>>>>> CPU: 3 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.14.0-rc3+ #15533 >>>>>>> Hardware name: Hardkernel ODROID-M1 (DT) >>>>>>> pstate: 00400009 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) >>>>>>> pc : devm_kmalloc+0x2c/0x114 >>>>>>> lr : rk_iommu_of_xlate+0x30/0x90 >>>>>>> ... >>>>>>> Call trace: >>>>>>> devm_kmalloc+0x2c/0x114 (P) >>>>>>> rk_iommu_of_xlate+0x30/0x90 >>>>>> >>>>>> Yeah, looks like this is doing something a bit questionable which >>>>>> can't >>>>>> work properly. TBH the whole dma_dev thing could probably be >>>>>> cleaned up >>>>>> now that we have proper instances, but for now does this work? >>>>> >>>>> Yes, this patch fixes the problem I've observed. >>>>> >>>>> Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> >>>>> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> >>>>> >>>>> BTW, this dma_dev idea has been borrowed from my exynos_iommu driver >>>>> and >>>>> I doubt it can be cleaned up. >>>> >>>> On the contrary I suspect they both can - it all dates back to when >>>> we had the single global platform bus iommu_ops and the SoC drivers >>>> were forced to bodge their own notion of multiple instances, but with >>>> the modern core code, ops are always called via a valid IOMMU >>>> instance or domain, so in principle it should always be possible to >>>> get at an appropriate IOMMU device now. IIRC it was mostly about >>>> allocating and DMA-mapping the pagetables in domain_alloc, where the >>>> private notion of instances didn't have enough information, but >>>> domain_alloc_paging solves that. >>> >>> Bah, in fact I think I am going to have to do that now, since although >>> it doesn't crash, rk_domain_alloc_paging() will also be failing for >>> the same reason. Time to find a PSU for the RK3399 board, I guess... >>> >>> (Or maybe just move the dma_dev assignment earlier to match Exynos?) >> >> Well I just found that Exynos IOMMU is also broken on some on my test >> boards. It looks that the runtime pm links are somehow not correctly >> established. I will try to analyze this later in the afternoon. > > Hmm, I tried to get an Odroid-XU3 up and running, but it seems unable > to boot my original 6.14-rc3-based branch - even with the IOMMU driver > disabled, it's consistently dying somewhere near (or just after) init > with what looks like some catastrophic memory corruption issue - very > occasionally it's managed to print the first line of various different > panics. > > Before that point though, with the IOMMU driver enabled it does appear > to show signs of working OK: > > [ 0.649703] exynos-sysmmu 14650000.sysmmu: hardware version: 3.3 > [ 0.654220] platform 14450000.mixer: Adding to iommu group 1 > ... > [ 2.680920] exynos-mixer 14450000.mixer: > exynos_iommu_attach_device: Attached IOMMU with pgtable 0x42924000 > ... > [ 5.196674] exynos-mixer 14450000.mixer: > exynos_iommu_identity_attach: Restored IOMMU to IDENTITY from pgtable > 0x42924000 > [ 5.207091] exynos-mixer 14450000.mixer: > exynos_iommu_attach_device: Attached IOMMU with pgtable 0x42884000 > > > The multi-instance stuff in probe/release does look a bit suspect, > however - seems like the second instance probe would overwrite the > first instance's links, and then there would be a double-del() if the > device were ever actually released again? I may have made that much > more likely to happen, but I suspect it was already possible with > async driver probe... That is really strange. My Odroid XU3 boots fine from commit bcb81ac6ae3c ("iommu: Get DT/ACPI parsing into the proper probe path"), although the IOMMU seems not to be working correctly. I've tested this with 14450000.mixer device (one need to attach HDMI cable to get it activated) and it looks that the video data are not being read from memory at all (the lack of VSYNC is reported, no IOMMU fault). However, from time to time, everything initializes and works properly. It looks that this is somehow related to the different IOMMU/DMA-mapping glue code, as the other boards (ARM64 based) with exactly the same Exynos IOMMU driver always work fine. I've tried to figure out what actually happens, but so far I didn't get anything for sure. Disabling the call to dev->bus->dma_configure(dev) from iommu_init_device() seems to be fixing this, but this is almost equal to the revert of the $subject patch. I don't get why calling it in iommu_init_device() causes problems. It also doesn't look that this is anyhow related to the multi-instance stuff, as the same happens if I only leave a single exynos-sysmmu instance and its client (only 14450000.mixer device in the system). Best regards
On 21/03/2025 12:15 pm, Marek Szyprowski wrote: > On 17.03.2025 19:22, Robin Murphy wrote: >> On 17/03/2025 7:37 am, Marek Szyprowski wrote: >>> On 13.03.2025 15:12, Robin Murphy wrote: >>>> On 2025-03-13 1:06 pm, Robin Murphy wrote: >>>>> On 2025-03-13 12:23 pm, Marek Szyprowski wrote: >>>>>> On 13.03.2025 12:01, Robin Murphy wrote: >>>>>>> On 2025-03-13 9:56 am, Marek Szyprowski wrote: >>>>>>> [...] >>>>>>>> This patch landed in yesterday's linux-next as commit bcb81ac6ae3c >>>>>>>> ("iommu: Get DT/ACPI parsing into the proper probe path"). In my >>>>>>>> tests I >>>>>>>> found it breaks booting of ARM64 RK3568-based Odroid-M1 board >>>>>>>> (arch/arm64/boot/dts/rockchip/rk3568-odroid-m1.dts). Here is the >>>>>>>> relevant kernel log: >>>>>>> >>>>>>> ...and the bug-flushing-out begins! >>>>>>> >>>>>>>> Unable to handle kernel NULL pointer dereference at virtual address >>>>>>>> 00000000000003e8 >>>>>>>> Mem abort info: >>>>>>>> ESR = 0x0000000096000004 >>>>>>>> EC = 0x25: DABT (current EL), IL = 32 bits >>>>>>>> SET = 0, FnV = 0 >>>>>>>> EA = 0, S1PTW = 0 >>>>>>>> FSC = 0x04: level 0 translation fault >>>>>>>> Data abort info: >>>>>>>> ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000 >>>>>>>> CM = 0, WnR = 0, TnD = 0, TagAccess = 0 >>>>>>>> GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 >>>>>>>> [00000000000003e8] user address but active_mm is swapper >>>>>>>> Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP >>>>>>>> Modules linked in: >>>>>>>> CPU: 3 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.14.0-rc3+ #15533 >>>>>>>> Hardware name: Hardkernel ODROID-M1 (DT) >>>>>>>> pstate: 00400009 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) >>>>>>>> pc : devm_kmalloc+0x2c/0x114 >>>>>>>> lr : rk_iommu_of_xlate+0x30/0x90 >>>>>>>> ... >>>>>>>> Call trace: >>>>>>>> devm_kmalloc+0x2c/0x114 (P) >>>>>>>> rk_iommu_of_xlate+0x30/0x90 >>>>>>> >>>>>>> Yeah, looks like this is doing something a bit questionable which >>>>>>> can't >>>>>>> work properly. TBH the whole dma_dev thing could probably be >>>>>>> cleaned up >>>>>>> now that we have proper instances, but for now does this work? >>>>>> >>>>>> Yes, this patch fixes the problem I've observed. >>>>>> >>>>>> Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> >>>>>> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> >>>>>> >>>>>> BTW, this dma_dev idea has been borrowed from my exynos_iommu driver >>>>>> and >>>>>> I doubt it can be cleaned up. >>>>> >>>>> On the contrary I suspect they both can - it all dates back to when >>>>> we had the single global platform bus iommu_ops and the SoC drivers >>>>> were forced to bodge their own notion of multiple instances, but with >>>>> the modern core code, ops are always called via a valid IOMMU >>>>> instance or domain, so in principle it should always be possible to >>>>> get at an appropriate IOMMU device now. IIRC it was mostly about >>>>> allocating and DMA-mapping the pagetables in domain_alloc, where the >>>>> private notion of instances didn't have enough information, but >>>>> domain_alloc_paging solves that. >>>> >>>> Bah, in fact I think I am going to have to do that now, since although >>>> it doesn't crash, rk_domain_alloc_paging() will also be failing for >>>> the same reason. Time to find a PSU for the RK3399 board, I guess... >>>> >>>> (Or maybe just move the dma_dev assignment earlier to match Exynos?) >>> >>> Well I just found that Exynos IOMMU is also broken on some on my test >>> boards. It looks that the runtime pm links are somehow not correctly >>> established. I will try to analyze this later in the afternoon. >> >> Hmm, I tried to get an Odroid-XU3 up and running, but it seems unable >> to boot my original 6.14-rc3-based branch - even with the IOMMU driver >> disabled, it's consistently dying somewhere near (or just after) init >> with what looks like some catastrophic memory corruption issue - very >> occasionally it's managed to print the first line of various different >> panics. >> >> Before that point though, with the IOMMU driver enabled it does appear >> to show signs of working OK: >> >> [ 0.649703] exynos-sysmmu 14650000.sysmmu: hardware version: 3.3 >> [ 0.654220] platform 14450000.mixer: Adding to iommu group 1 >> ... >> [ 2.680920] exynos-mixer 14450000.mixer: >> exynos_iommu_attach_device: Attached IOMMU with pgtable 0x42924000 >> ... >> [ 5.196674] exynos-mixer 14450000.mixer: >> exynos_iommu_identity_attach: Restored IOMMU to IDENTITY from pgtable >> 0x42924000 >> [ 5.207091] exynos-mixer 14450000.mixer: >> exynos_iommu_attach_device: Attached IOMMU with pgtable 0x42884000 >> >> >> The multi-instance stuff in probe/release does look a bit suspect, >> however - seems like the second instance probe would overwrite the >> first instance's links, and then there would be a double-del() if the >> device were ever actually released again? I may have made that much >> more likely to happen, but I suspect it was already possible with >> async driver probe... > > That is really strange. My Odroid XU3 boots fine from commit > bcb81ac6ae3c ("iommu: Get DT/ACPI parsing into the proper probe path"), > although the IOMMU seems not to be working correctly. I've tested this > with 14450000.mixer device (one need to attach HDMI cable to get it > activated) and it looks that the video data are not being read from > memory at all (the lack of VSYNC is reported, no IOMMU fault). However, > from time to time, everything initializes and works properly. Urgh, seems my mistake was assuming exynos_defconfig was the right thing to begin from - bcb81ac6ae3c with that still dies in the same way (this time I saw a hint of spin_bug() being hit...), however a multi_v7_defconfig build does get to userspace OK again with no obvious signs of distress: [root@alarm ~]# grep -Hr . /sys/kernel/iommu_groups/*/type /sys/kernel/iommu_groups/0/type:identity /sys/kernel/iommu_groups/1/type:identity /sys/kernel/iommu_groups/10/type:identity /sys/kernel/iommu_groups/2/type:identity /sys/kernel/iommu_groups/3/type:identity /sys/kernel/iommu_groups/4/type:identity /sys/kernel/iommu_groups/5/type:identity /sys/kernel/iommu_groups/6/type:identity /sys/kernel/iommu_groups/7/type:identity /sys/kernel/iommu_groups/8/type:identity /sys/kernel/iommu_groups/9/type:identity Annoyingly I do have an adapter for the fiddly micro-HDMI, but it's at home :( > It looks that this is somehow related to the different IOMMU/DMA-mapping > glue code, as the other boards (ARM64 based) with exactly the same > Exynos IOMMU driver always work fine. I've tried to figure out what > actually happens, but so far I didn't get anything for sure. Disabling > the call to dev->bus->dma_configure(dev) from iommu_init_device() seems > to be fixing this, but this is almost equal to the revert of the > $subject patch. I don't get why calling it in iommu_init_device() causes > problems. It also doesn't look that this is anyhow related to the > multi-instance stuff, as the same happens if I only leave a single > exynos-sysmmu instance and its client (only 14450000.mixer device in the > system). On a hunch I stuck a print in exynos_iommu_probe_device(), and it looks like in fact device_link_add() isn't getting called at all, and indeed your symptoms do sound like they could be explained by the IOMMU not being reliably resumed... lemme stare at exynos_iommu_of_xlate() a bit longer... Thanks, Robin.
Hi Robin, On Tue, 18 Mar 2025 at 18:24, Robin Murphy <robin.murphy@arm.com> wrote: > On 18/03/2025 4:37 pm, Geert Uytterhoeven wrote: > [...] > > Thanks for your patch, which is now commit bcb81ac6ae3c2ef9 ("iommu: > > Get DT/ACPI parsing into the proper probe path") in iommu/next. > > > > This patch triggers two issues on R-Car Gen3 platforms: > > > > 1. I am seeing a warning on Renesas Salvator-XS with R-Car M3N > > (but not on the similar board with R-Car H3), and only for SATA[1]. > > Unfortunately commit 73d2f10957f517e5 ("iommu: Don't warn prematurely > > about dodgy probes") does not help: > [...] > > Call trace: > > __iommu_probe_device+0x208/0x38c (P) > > iommu_probe_device+0x34/0x74 > > of_iommu_configure+0x128/0x200 > > of_dma_configure_id+0xdc/0x1d4 > > platform_dma_configure+0x48/0x6c > > really_probe+0xf0/0x260 > > __driver_probe_device+0xec/0x104 > > driver_probe_device+0x3c/0xc0 > > Hurrah, this is the warning doing the correct job - something *is* off > if we're now getting here without the IOMMU configuration being done > already (for a normal device with no other funny business going on). > > > 2. The IOMMU driver's iommu_ops.of_xlate() callback is called about > > three times as much as before: > > That would suggest that the fwspec gets set up OK, then something later > in the __iommu_probe_device() path fails and tears it down again, so the > next attempt starts from scratch. Do you see the "Cannot attach to > IPMMU" message firing? I do not see such messages. > And similarly to the Rockchip case, does the > below help? The below is basically the same as your "[PATCH] iommu/ipmmu-vmsa: Register in a sensible order"[1]. While that fixes my first issue, it does not fix the second (harmless?) issue. Note that I only noticed the second issue because I have local debug code in soc_device_match(). Perhaps it happens, unnoticed, on other systems too? Thanks! [1] https://lore.kernel.org/53be6667544de65a15415b699e38a9a965692e45.1742481687.git.robin.murphy@arm.com/ Gr{oetje,eeting}s, Geert
Hi, On Fri, Feb 28, 2025 at 03:46:33PM +0000, Robin Murphy wrote: > In hindsight, there were some crucial subtleties overlooked when moving > {of,acpi}_dma_configure() to driver probe time to allow waiting for > IOMMU drivers with -EPROBE_DEFER, and these have become an > ever-increasing source of problems. The IOMMU API has some fundamental > assumptions that iommu_probe_device() is called for every device added > to the system, in the order in which they are added. Calling it in a > random order or not at all dependent on driver binding leads to > malformed groups, a potential lack of isolation for devices with no > driver, and all manner of unexpected concurrency and race conditions. > We've attempted to mitigate the latter with point-fix bodges like > iommu_probe_device_lock, but it's a losing battle and the time has come > to bite the bullet and address the true source of the problem instead. > > The crux of the matter is that the firmware parsing actually serves two > distinct purposes; one is identifying the IOMMU instance associated with > a device so we can check its availability, the second is actually > telling that instance about the relevant firmware-provided data for the > device. However the latter also depends on the former, and at the time > there was no good place to defer and retry that separately from the > availability check we also wanted for client driver probe. > > Nowadays, though, we have a proper notion of multiple IOMMU instances in > the core API itself, and each one gets a chance to probe its own devices > upon registration, so we can finally make that work as intended for > DT/IORT/VIOT platforms too. All we need is for iommu_probe_device() to > be able to run the iommu_fwspec machinery currently buried deep in the > wrong end of {of,acpi}_dma_configure(). Luckily it turns out to be > surprisingly straightforward to bootstrap this transformation by pretty > much just calling the same path twice. At client driver probe time, > dev->driver is obviously set; conversely at device_add(), or a > subsequent bus_iommu_probe(), any device waiting for an IOMMU really > should *not* have a driver already, so we can use that as a condition to > disambiguate the two cases, and avoid recursing back into the IOMMU core > at the wrong times. > > Obviously this isn't the nicest thing, but for now it gives us a > functional baseline to then unpick the layers in between without many > more awkward cross-subsystem patches. There are some minor side-effects > like dma_range_map potentially being created earlier, and some debug > prints being repeated, but these aren't significantly detrimental. Let's > make things work first, then deal with making them nice. > > With the basic flow finally in the right order again, the next step is > probably turning the bus->dma_configure paths inside-out, since all we > really need from bus code is its notion of which device and input ID(s) > to parse the common firmware properties with... > > Acked-by: Bjorn Helgaas <bhelgaas@google.com> # pci-driver.c > Acked-by: Rob Herring (Arm) <robh@kernel.org> # of/device.c > Signed-off-by: Robin Murphy <robin.murphy@arm.com> This change causes the MediaTek IOMMU driver to panic on probe, resulting in multiple MediaTek platforms not being able to boot. This was observed on Linus's tree at commit 1a9239bb4253 ("Merge tag 'net-next-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next") which just received the IOMMU updates a couple merge commits prior. The regression was bisected down to this patch on MT8183 Juniper Chromebook. The error is a NULL pointer dereference. Here's the decoded backtrace: Unable to handle kernel read from unreadable memory at virtual address 0000000000000000 Mem abort info: ESR = 0x0000000096000005 usb 1-1.1.2: new high-speed USB device number 6 using xhci-mtk EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 FSC = 0x05: level 1 translation fault Data abort info: ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000 CM = 0, WnR = 0, TnD = 0, TagAccess = 0 GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 [0000000000000000] user address but active_mm is swapper Internal error: Oops: 0000000096000005 [#1] SMP Modules linked in: CPU: 4 UID: 0 PID: 68 Comm: kworker/u34:1 Tainted: G B 6.14.0-05877-g1a9239bb4253 #621 PREEMPT a6631d3f04612a5c23866dc67cf38316c6b023e0 Tainted: [B]=BAD_PAGE Hardware name: Google juniper sku16 board (DT) Workqueue: events_unbound deferred_probe_work_func pstate: 40400005 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : mtk_iommu_device_group (drivers/iommu/mtk_iommu.c:368 drivers/iommu/mtk_iommu.c:940) lr : mtk_iommu_device_group (drivers/iommu/mtk_iommu.c:368 drivers/iommu/mtk_iommu.c:940) sp : ffffffc0809674a0 x29: ffffffc0809674a0 x28: ffffff80c1f0f078 x27: ffffff80c1f0f460 x26: ffffff80c1f0f458 x25: ffffffdb1b738788 x24: ffffffc0809676d0 x23: ffffff80c8fe4130 x22: ffffffffffffffed x21: 0000000000000000 x20: ffffff80c1f0f010 x19: ffffff80c8fd7d00 x18: 0000000000000000 x17: 000000040044ffff x16: 00500072b5593510 x15: 0000000000000000 x14: ffffff80c09fbb80 x13: ffffffa5bf03f000 x12: ffffffbb6399cdf1 x11: 1ffffffb6399cdf0 x10: ffffffbb6399cdf0 x9 : dfffffc000000000 x8 : 000000449c663210 x7 : ffffffdb1cce6f87 x6 : 0000000000000001 x5 : ffffffdb1cce6f80 x4 : ffffffbb6399cdf1 x3 : 0000000000000000 x2 : 0000000000000020 x1 : ffffff80c22ed940 x0 : 0000000000000001 Call trace: mtk_iommu_device_group (drivers/iommu/mtk_iommu.c:368 drivers/iommu/mtk_iommu.c:940) (P) __iommu_probe_device (drivers/iommu/iommu.c:461 drivers/iommu/iommu.c:563) probe_iommu_group (drivers/iommu/iommu.c:1722) bus_for_each_dev (drivers/base/bus.c:370) iommu_device_register (drivers/iommu/iommu.c:1875 drivers/iommu/iommu.c:276) mtk_iommu_probe (drivers/iommu/mtk_iommu.c:1380) platform_probe (drivers/base/platform.c:1404) really_probe (drivers/base/dd.c:579 drivers/base/dd.c:658) __driver_probe_device (drivers/base/dd.c:800) driver_probe_device (drivers/base/dd.c:830) __device_attach_driver (drivers/base/dd.c:959) bus_for_each_drv (drivers/base/bus.c:462) __device_attach (drivers/base/dd.c:1032) device_initial_probe (drivers/base/dd.c:1080) bus_probe_device (drivers/base/bus.c:537) deferred_probe_work_func (drivers/base/dd.c:124) process_one_work (./arch/arm64/include/asm/jump_label.h:36 ./include/trace/events/workqueue.h:110 kernel/workqueue.c:3243) worker_thread (kernel/workqueue.c:3313 (discriminator 2) kernel/workqueue.c:3400 (discriminator 2)) kthread (kernel/kthread.c:464) ret_from_fork (arch/arm64/kernel/entry.S:863) Code: 92800256 f940d2b5 aa1503e0 97e8dbd7 (f94002b5) All code ======== 0: 92800256 mov x22, #0xffffffffffffffed // #-19 4: f940d2b5 ldr x21, [x21, #416] 8: aa1503e0 mov x0, x21 c: 97e8dbd7 bl 0xffffffffffa36f68 10:* f94002b5 ldr x21, [x21] <-- trapping instruction Code starting with the faulting instruction =========================================== 0: f94002b5 ldr x21, [x21] ---[ end trace 0000000000000000 ]--- ChenYu > --- > > v2: > - Comment bus driver changes for clarity > - Use dev->iommu as the now-robust replay condition > - Drop the device_iommu_mapped() checks in the firmware paths as they > weren't doing much - we can't replace probe_device_lock just yet... > > drivers/acpi/arm64/dma.c | 5 +++++ > drivers/acpi/scan.c | 7 ------- > drivers/amba/bus.c | 3 ++- > drivers/base/platform.c | 3 ++- > drivers/bus/fsl-mc/fsl-mc-bus.c | 3 ++- > drivers/cdx/cdx.c | 3 ++- > drivers/iommu/iommu.c | 24 +++++++++++++++++++++--- > drivers/iommu/of_iommu.c | 7 ++++++- > drivers/of/device.c | 7 ++++++- > drivers/pci/pci-driver.c | 3 ++- > 10 files changed, 48 insertions(+), 17 deletions(-) > > diff --git a/drivers/acpi/arm64/dma.c b/drivers/acpi/arm64/dma.c > index 52b2abf88689..f30f138352b7 100644 > --- a/drivers/acpi/arm64/dma.c > +++ b/drivers/acpi/arm64/dma.c > @@ -26,6 +26,11 @@ void acpi_arch_dma_setup(struct device *dev) > else > end = (1ULL << 32) - 1; > > + if (dev->dma_range_map) { > + dev_dbg(dev, "dma_range_map already set\n"); > + return; > + } > + > ret = acpi_dma_get_range(dev, &map); > if (!ret && map) { > end = dma_range_map_max(map); > diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c > index 9f4efa8f75a6..fb1fe9f3b1a3 100644 > --- a/drivers/acpi/scan.c > +++ b/drivers/acpi/scan.c > @@ -1632,13 +1632,6 @@ static int acpi_iommu_configure_id(struct device *dev, const u32 *id_in) > err = viot_iommu_configure(dev); > mutex_unlock(&iommu_probe_device_lock); > > - /* > - * If we have reason to believe the IOMMU driver missed the initial > - * iommu_probe_device() call for dev, replay it to get things in order. > - */ > - if (!err && dev->bus) > - err = iommu_probe_device(dev); > - > return err; > } > > diff --git a/drivers/amba/bus.c b/drivers/amba/bus.c > index 8ef259b4d037..71482d639a6d 100644 > --- a/drivers/amba/bus.c > +++ b/drivers/amba/bus.c > @@ -364,7 +364,8 @@ static int amba_dma_configure(struct device *dev) > ret = acpi_dma_configure(dev, attr); > } > > - if (!ret && !drv->driver_managed_dma) { > + /* @drv may not be valid when we're called from the IOMMU layer */ > + if (!ret && dev->driver && !drv->driver_managed_dma) { > ret = iommu_device_use_default_domain(dev); > if (ret) > arch_teardown_dma_ops(dev); > diff --git a/drivers/base/platform.c b/drivers/base/platform.c > index 6f2a33722c52..1813cfd0c4bd 100644 > --- a/drivers/base/platform.c > +++ b/drivers/base/platform.c > @@ -1451,7 +1451,8 @@ static int platform_dma_configure(struct device *dev) > attr = acpi_get_dma_attr(to_acpi_device_node(fwnode)); > ret = acpi_dma_configure(dev, attr); > } > - if (ret || drv->driver_managed_dma) > + /* @drv may not be valid when we're called from the IOMMU layer */ > + if (ret || !dev->driver || drv->driver_managed_dma) > return ret; > > ret = iommu_device_use_default_domain(dev); > diff --git a/drivers/bus/fsl-mc/fsl-mc-bus.c b/drivers/bus/fsl-mc/fsl-mc-bus.c > index d1f3d327ddd1..a8be8cf246fb 100644 > --- a/drivers/bus/fsl-mc/fsl-mc-bus.c > +++ b/drivers/bus/fsl-mc/fsl-mc-bus.c > @@ -153,7 +153,8 @@ static int fsl_mc_dma_configure(struct device *dev) > else > ret = acpi_dma_configure_id(dev, DEV_DMA_COHERENT, &input_id); > > - if (!ret && !mc_drv->driver_managed_dma) { > + /* @mc_drv may not be valid when we're called from the IOMMU layer */ > + if (!ret && dev->driver && !mc_drv->driver_managed_dma) { > ret = iommu_device_use_default_domain(dev); > if (ret) > arch_teardown_dma_ops(dev); > diff --git a/drivers/cdx/cdx.c b/drivers/cdx/cdx.c > index c573ed2ee71a..780fb0c4adba 100644 > --- a/drivers/cdx/cdx.c > +++ b/drivers/cdx/cdx.c > @@ -360,7 +360,8 @@ static int cdx_dma_configure(struct device *dev) > return ret; > } > > - if (!ret && !cdx_drv->driver_managed_dma) { > + /* @cdx_drv may not be valid when we're called from the IOMMU layer */ > + if (!ret && dev->driver && !cdx_drv->driver_managed_dma) { > ret = iommu_device_use_default_domain(dev); > if (ret) > arch_teardown_dma_ops(dev); > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c > index a3b45b84f42b..1cec7074367a 100644 > --- a/drivers/iommu/iommu.c > +++ b/drivers/iommu/iommu.c > @@ -414,9 +414,21 @@ static int iommu_init_device(struct device *dev) > if (!dev_iommu_get(dev)) > return -ENOMEM; > /* > - * For FDT-based systems and ACPI IORT/VIOT, drivers register IOMMU > - * instances with non-NULL fwnodes, and client devices should have been > - * identified with a fwspec by this point. Otherwise, we can currently > + * For FDT-based systems and ACPI IORT/VIOT, the common firmware parsing > + * is buried in the bus dma_configure path. Properly unpicking that is > + * still a big job, so for now just invoke the whole thing. The device > + * already having a driver bound means dma_configure has already run and > + * either found no IOMMU to wait for, or we're in its replay call right > + * now, so either way there's no point calling it again. > + */ > + if (!dev->driver && dev->bus->dma_configure) { > + mutex_unlock(&iommu_probe_device_lock); > + dev->bus->dma_configure(dev); > + mutex_lock(&iommu_probe_device_lock); > + } > + /* > + * At this point, relevant devices either now have a fwspec which will > + * match ops registered with a non-NULL fwnode, or we can reasonably > * assume that only one of Intel, AMD, s390, PAMU or legacy SMMUv2 can > * be present, and that any of their registered instances has suitable > * ops for probing, and thus cheekily co-opt the same mechanism. > @@ -426,6 +438,12 @@ static int iommu_init_device(struct device *dev) > ret = -ENODEV; > goto err_free; > } > + /* > + * And if we do now see any replay calls, they would indicate someone > + * misusing the dma_configure path outside bus code. > + */ > + if (dev->driver) > + dev_WARN(dev, "late IOMMU probe at driver bind, something fishy here!\n"); > > if (!try_module_get(ops->owner)) { > ret = -EINVAL; > diff --git a/drivers/iommu/of_iommu.c b/drivers/iommu/of_iommu.c > index e10a68b5ffde..6b989a62def2 100644 > --- a/drivers/iommu/of_iommu.c > +++ b/drivers/iommu/of_iommu.c > @@ -155,7 +155,12 @@ int of_iommu_configure(struct device *dev, struct device_node *master_np, > dev_iommu_free(dev); > mutex_unlock(&iommu_probe_device_lock); > > - if (!err && dev->bus) > + /* > + * If we're not on the iommu_probe_device() path (as indicated by the > + * initial dev->iommu) then try to simulate it. This should no longer > + * happen unless of_dma_configure() is being misused outside bus code. > + */ > + if (!err && dev->bus && !dev_iommu_present) > err = iommu_probe_device(dev); > > if (err && err != -EPROBE_DEFER) > diff --git a/drivers/of/device.c b/drivers/of/device.c > index edf3be197265..5053e5d532cc 100644 > --- a/drivers/of/device.c > +++ b/drivers/of/device.c > @@ -99,6 +99,11 @@ int of_dma_configure_id(struct device *dev, struct device_node *np, > bool coherent, set_map = false; > int ret; > > + if (dev->dma_range_map) { > + dev_dbg(dev, "dma_range_map already set\n"); > + goto skip_map; > + } > + > if (np == dev->of_node) > bus_np = __of_get_dma_parent(np); > else > @@ -119,7 +124,7 @@ int of_dma_configure_id(struct device *dev, struct device_node *np, > end = dma_range_map_max(map); > set_map = true; > } > - > +skip_map: > /* > * If @dev is expected to be DMA-capable then the bus code that created > * it should have initialised its dma_mask pointer by this point. For > diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c > index f57ea36d125d..4468b7327cab 100644 > --- a/drivers/pci/pci-driver.c > +++ b/drivers/pci/pci-driver.c > @@ -1653,7 +1653,8 @@ static int pci_dma_configure(struct device *dev) > > pci_put_host_bridge_device(bridge); > > - if (!ret && !driver->driver_managed_dma) { > + /* @driver may not be valid when we're called from the IOMMU layer */ > + if (!ret && dev->driver && !driver->driver_managed_dma) { > ret = iommu_device_use_default_domain(dev); > if (ret) > arch_teardown_dma_ops(dev); > -- > 2.39.2.101.g768bb238c484.dirty >
Hi, On Thu, 2025-03-27 at 17:47 +0800, Chen-Yu Tsai wrote: > Hi, > > On Fri, Feb 28, 2025 at 03:46:33PM +0000, Robin Murphy wrote: > > In hindsight, there were some crucial subtleties overlooked when > > moving > > {of,acpi}_dma_configure() to driver probe time to allow waiting for > > IOMMU drivers with -EPROBE_DEFER, and these have become an > > ever-increasing source of problems. The IOMMU API has some > > fundamental > > assumptions that iommu_probe_device() is called for every device > > added > > to the system, in the order in which they are added. Calling it in > > a > > random order or not at all dependent on driver binding leads to > > malformed groups, a potential lack of isolation for devices with no > > driver, and all manner of unexpected concurrency and race > > conditions. > > We've attempted to mitigate the latter with point-fix bodges like > > iommu_probe_device_lock, but it's a losing battle and the time has > > come > > to bite the bullet and address the true source of the problem > > instead. > > > > The crux of the matter is that the firmware parsing actually serves > > two > > distinct purposes; one is identifying the IOMMU instance associated > > with > > a device so we can check its availability, the second is actually > > telling that instance about the relevant firmware-provided data for > > the > > device. However the latter also depends on the former, and at the > > time > > there was no good place to defer and retry that separately from the > > availability check we also wanted for client driver probe. > > > > Nowadays, though, we have a proper notion of multiple IOMMU > > instances in > > the core API itself, and each one gets a chance to probe its own > > devices > > upon registration, so we can finally make that work as intended for > > DT/IORT/VIOT platforms too. All we need is for iommu_probe_device() > > to > > be able to run the iommu_fwspec machinery currently buried deep in > > the > > wrong end of {of,acpi}_dma_configure(). Luckily it turns out to be > > surprisingly straightforward to bootstrap this transformation by > > pretty > > much just calling the same path twice. At client driver probe time, > > dev->driver is obviously set; conversely at device_add(), or a > > subsequent bus_iommu_probe(), any device waiting for an IOMMU > > really > > should *not* have a driver already, so we can use that as a > > condition to > > disambiguate the two cases, and avoid recursing back into the IOMMU > > core > > at the wrong times. > > > > Obviously this isn't the nicest thing, but for now it gives us a > > functional baseline to then unpick the layers in between without > > many > > more awkward cross-subsystem patches. There are some minor side- > > effects > > like dma_range_map potentially being created earlier, and some > > debug > > prints being repeated, but these aren't significantly detrimental. > > Let's > > make things work first, then deal with making them nice. > > > > With the basic flow finally in the right order again, the next step > > is > > probably turning the bus->dma_configure paths inside-out, since all > > we > > really need from bus code is its notion of which device and input > > ID(s) > > to parse the common firmware properties with... > > > > Acked-by: Bjorn Helgaas <bhelgaas@google.com> # pci-driver.c > > Acked-by: Rob Herring (Arm) <robh@kernel.org> # of/device.c > > Signed-off-by: Robin Murphy <robin.murphy@arm.com> > > This change causes the MediaTek IOMMU driver to panic on probe, > resulting in multiple MediaTek platforms not being able to boot. > This was observed on Linus's tree at commit 1a9239bb4253 > ("Merge tag 'net-next-6.15' of > git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next") > which just received the IOMMU updates a couple merge commits > prior. > > The regression was bisected down to this patch on MT8183 Juniper > Chromebook. The error is a NULL pointer dereference. Here's the > decoded backtrace: > > Unable to handle kernel read from unreadable memory at virtual > address 0000000000000000 > Mem abort info: > ESR = 0x0000000096000005 > usb 1-1.1.2: new high-speed USB device number 6 using xhci-mtk > EC = 0x25: DABT (current EL), IL = 32 bits > SET = 0, FnV = 0 > EA = 0, S1PTW = 0 > FSC = 0x05: level 1 translation fault > Data abort info: > ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000 > CM = 0, WnR = 0, TnD = 0, TagAccess = 0 > GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 > [0000000000000000] user address but active_mm is swapper > Internal error: Oops: 0000000096000005 [#1] SMP > Modules linked in: > CPU: 4 UID: 0 PID: 68 Comm: kworker/u34:1 Tainted: G > B 6.14.0-05877-g1a9239bb4253 #621 PREEMPT > a6631d3f04612a5c23866dc67cf38316c6b023e0 > Tainted: [B]=BAD_PAGE > Hardware name: Google juniper sku16 board (DT) > Workqueue: events_unbound deferred_probe_work_func > pstate: 40400005 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) > pc : mtk_iommu_device_group (drivers/iommu/mtk_iommu.c:368 > drivers/iommu/mtk_iommu.c:940) > lr : mtk_iommu_device_group (drivers/iommu/mtk_iommu.c:368 > drivers/iommu/mtk_iommu.c:940) > sp : ffffffc0809674a0 > x29: ffffffc0809674a0 x28: ffffff80c1f0f078 x27: ffffff80c1f0f460 > x26: ffffff80c1f0f458 x25: ffffffdb1b738788 x24: ffffffc0809676d0 > x23: ffffff80c8fe4130 x22: ffffffffffffffed x21: 0000000000000000 > x20: ffffff80c1f0f010 x19: ffffff80c8fd7d00 x18: 0000000000000000 > x17: 000000040044ffff x16: 00500072b5593510 x15: 0000000000000000 > x14: ffffff80c09fbb80 x13: ffffffa5bf03f000 x12: ffffffbb6399cdf1 > x11: 1ffffffb6399cdf0 x10: ffffffbb6399cdf0 x9 : dfffffc000000000 > x8 : 000000449c663210 x7 : ffffffdb1cce6f87 x6 : 0000000000000001 > x5 : ffffffdb1cce6f80 x4 : ffffffbb6399cdf1 x3 : 0000000000000000 > x2 : 0000000000000020 x1 : ffffff80c22ed940 x0 : 0000000000000001 > Call trace: > mtk_iommu_device_group (drivers/iommu/mtk_iommu.c:368 > drivers/iommu/mtk_iommu.c:940) (P) > __iommu_probe_device (drivers/iommu/iommu.c:461 > drivers/iommu/iommu.c:563) > probe_iommu_group (drivers/iommu/iommu.c:1722) > bus_for_each_dev (drivers/base/bus.c:370) > iommu_device_register (drivers/iommu/iommu.c:1875 > drivers/iommu/iommu.c:276) > mtk_iommu_probe (drivers/iommu/mtk_iommu.c:1380) > platform_probe (drivers/base/platform.c:1404) > really_probe (drivers/base/dd.c:579 drivers/base/dd.c:658) > __driver_probe_device (drivers/base/dd.c:800) > driver_probe_device (drivers/base/dd.c:830) > __device_attach_driver (drivers/base/dd.c:959) > bus_for_each_drv (drivers/base/bus.c:462) > __device_attach (drivers/base/dd.c:1032) > device_initial_probe (drivers/base/dd.c:1080) > bus_probe_device (drivers/base/bus.c:537) > deferred_probe_work_func (drivers/base/dd.c:124) > process_one_work (./arch/arm64/include/asm/jump_label.h:36 > ./include/trace/events/workqueue.h:110 kernel/workqueue.c:3243) > worker_thread (kernel/workqueue.c:3313 (discriminator 2) > kernel/workqueue.c:3400 (discriminator 2)) > kthread (kernel/kthread.c:464) > ret_from_fork (arch/arm64/kernel/entry.S:863) > Code: 92800256 f940d2b5 aa1503e0 97e8dbd7 (f94002b5) > All code > ======== > 0: 92800256 mov x22, > #0xffffffffffffffed // #-19 > 4: f940d2b5 ldr x21, [x21, #416] > 8: aa1503e0 mov x0, x21 > c: 97e8dbd7 bl 0xffffffffffa36f68 > 10:* f94002b5 ldr x21, [x21] <-- > trapping instruction > > Code starting with the faulting instruction > =========================================== > 0: f94002b5 ldr x21, [x21] > ---[ end trace 0000000000000000 ]--- > > > ChenYu > I've sent a patch [1] to fix this issue that I've also observed on my Mediatek Genio boards (510-EVK and 1200-EVK). [1]:https://lore.kernel.org/linux-iommu/20250327-fix-mtk-iommu-error-v1-1-df969158e752@collabora.com/ Regards, Louis-Alexis > > --- > > > > v2: > > - Comment bus driver changes for clarity > > - Use dev->iommu as the now-robust replay condition > > - Drop the device_iommu_mapped() checks in the firmware paths as > > they > > weren't doing much - we can't replace probe_device_lock just > > yet... > > > > drivers/acpi/arm64/dma.c | 5 +++++ > > drivers/acpi/scan.c | 7 ------- > > drivers/amba/bus.c | 3 ++- > > drivers/base/platform.c | 3 ++- > > drivers/bus/fsl-mc/fsl-mc-bus.c | 3 ++- > > drivers/cdx/cdx.c | 3 ++- > > drivers/iommu/iommu.c | 24 +++++++++++++++++++++--- > > drivers/iommu/of_iommu.c | 7 ++++++- > > drivers/of/device.c | 7 ++++++- > > drivers/pci/pci-driver.c | 3 ++- > > 10 files changed, 48 insertions(+), 17 deletions(-) > > > > diff --git a/drivers/acpi/arm64/dma.c b/drivers/acpi/arm64/dma.c > > index 52b2abf88689..f30f138352b7 100644 > > --- a/drivers/acpi/arm64/dma.c > > +++ b/drivers/acpi/arm64/dma.c > > @@ -26,6 +26,11 @@ void acpi_arch_dma_setup(struct device *dev) > > else > > end = (1ULL << 32) - 1; > > > > + if (dev->dma_range_map) { > > + dev_dbg(dev, "dma_range_map already set\n"); > > + return; > > + } > > + > > ret = acpi_dma_get_range(dev, &map); > > if (!ret && map) { > > end = dma_range_map_max(map); > > diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c > > index 9f4efa8f75a6..fb1fe9f3b1a3 100644 > > --- a/drivers/acpi/scan.c > > +++ b/drivers/acpi/scan.c > > @@ -1632,13 +1632,6 @@ static int acpi_iommu_configure_id(struct > > device *dev, const u32 *id_in) > > err = viot_iommu_configure(dev); > > mutex_unlock(&iommu_probe_device_lock); > > > > - /* > > - * If we have reason to believe the IOMMU driver missed > > the initial > > - * iommu_probe_device() call for dev, replay it to get > > things in order. > > - */ > > - if (!err && dev->bus) > > - err = iommu_probe_device(dev); > > - > > return err; > > } > > > > diff --git a/drivers/amba/bus.c b/drivers/amba/bus.c > > index 8ef259b4d037..71482d639a6d 100644 > > --- a/drivers/amba/bus.c > > +++ b/drivers/amba/bus.c > > @@ -364,7 +364,8 @@ static int amba_dma_configure(struct device > > *dev) > > ret = acpi_dma_configure(dev, attr); > > } > > > > - if (!ret && !drv->driver_managed_dma) { > > + /* @drv may not be valid when we're called from the IOMMU > > layer */ > > + if (!ret && dev->driver && !drv->driver_managed_dma) { > > ret = iommu_device_use_default_domain(dev); > > if (ret) > > arch_teardown_dma_ops(dev); > > diff --git a/drivers/base/platform.c b/drivers/base/platform.c > > index 6f2a33722c52..1813cfd0c4bd 100644 > > --- a/drivers/base/platform.c > > +++ b/drivers/base/platform.c > > @@ -1451,7 +1451,8 @@ static int platform_dma_configure(struct > > device *dev) > > attr = > > acpi_get_dma_attr(to_acpi_device_node(fwnode)); > > ret = acpi_dma_configure(dev, attr); > > } > > - if (ret || drv->driver_managed_dma) > > + /* @drv may not be valid when we're called from the IOMMU > > layer */ > > + if (ret || !dev->driver || drv->driver_managed_dma) > > return ret; > > > > ret = iommu_device_use_default_domain(dev); > > diff --git a/drivers/bus/fsl-mc/fsl-mc-bus.c b/drivers/bus/fsl- > > mc/fsl-mc-bus.c > > index d1f3d327ddd1..a8be8cf246fb 100644 > > --- a/drivers/bus/fsl-mc/fsl-mc-bus.c > > +++ b/drivers/bus/fsl-mc/fsl-mc-bus.c > > @@ -153,7 +153,8 @@ static int fsl_mc_dma_configure(struct device > > *dev) > > else > > ret = acpi_dma_configure_id(dev, DEV_DMA_COHERENT, > > &input_id); > > > > - if (!ret && !mc_drv->driver_managed_dma) { > > + /* @mc_drv may not be valid when we're called from the > > IOMMU layer */ > > + if (!ret && dev->driver && !mc_drv->driver_managed_dma) { > > ret = iommu_device_use_default_domain(dev); > > if (ret) > > arch_teardown_dma_ops(dev); > > diff --git a/drivers/cdx/cdx.c b/drivers/cdx/cdx.c > > index c573ed2ee71a..780fb0c4adba 100644 > > --- a/drivers/cdx/cdx.c > > +++ b/drivers/cdx/cdx.c > > @@ -360,7 +360,8 @@ static int cdx_dma_configure(struct device > > *dev) > > return ret; > > } > > > > - if (!ret && !cdx_drv->driver_managed_dma) { > > + /* @cdx_drv may not be valid when we're called from the > > IOMMU layer */ > > + if (!ret && dev->driver && !cdx_drv->driver_managed_dma) { > > ret = iommu_device_use_default_domain(dev); > > if (ret) > > arch_teardown_dma_ops(dev); > > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c > > index a3b45b84f42b..1cec7074367a 100644 > > --- a/drivers/iommu/iommu.c > > +++ b/drivers/iommu/iommu.c > > @@ -414,9 +414,21 @@ static int iommu_init_device(struct device > > *dev) > > if (!dev_iommu_get(dev)) > > return -ENOMEM; > > /* > > - * For FDT-based systems and ACPI IORT/VIOT, drivers > > register IOMMU > > - * instances with non-NULL fwnodes, and client devices > > should have been > > - * identified with a fwspec by this point. Otherwise, we > > can currently > > + * For FDT-based systems and ACPI IORT/VIOT, the common > > firmware parsing > > + * is buried in the bus dma_configure path. Properly > > unpicking that is > > + * still a big job, so for now just invoke the whole > > thing. The device > > + * already having a driver bound means dma_configure has > > already run and > > + * either found no IOMMU to wait for, or we're in its > > replay call right > > + * now, so either way there's no point calling it again. > > + */ > > + if (!dev->driver && dev->bus->dma_configure) { > > + mutex_unlock(&iommu_probe_device_lock); > > + dev->bus->dma_configure(dev); > > + mutex_lock(&iommu_probe_device_lock); > > + } > > + /* > > + * At this point, relevant devices either now have a > > fwspec which will > > + * match ops registered with a non-NULL fwnode, or we can > > reasonably > > * assume that only one of Intel, AMD, s390, PAMU or > > legacy SMMUv2 can > > * be present, and that any of their registered instances > > has suitable > > * ops for probing, and thus cheekily co-opt the same > > mechanism. > > @@ -426,6 +438,12 @@ static int iommu_init_device(struct device > > *dev) > > ret = -ENODEV; > > goto err_free; > > } > > + /* > > + * And if we do now see any replay calls, they would > > indicate someone > > + * misusing the dma_configure path outside bus code. > > + */ > > + if (dev->driver) > > + dev_WARN(dev, "late IOMMU probe at driver bind, > > something fishy here!\n"); > > > > if (!try_module_get(ops->owner)) { > > ret = -EINVAL; > > diff --git a/drivers/iommu/of_iommu.c b/drivers/iommu/of_iommu.c > > index e10a68b5ffde..6b989a62def2 100644 > > --- a/drivers/iommu/of_iommu.c > > +++ b/drivers/iommu/of_iommu.c > > @@ -155,7 +155,12 @@ int of_iommu_configure(struct device *dev, > > struct device_node *master_np, > > dev_iommu_free(dev); > > mutex_unlock(&iommu_probe_device_lock); > > > > - if (!err && dev->bus) > > + /* > > + * If we're not on the iommu_probe_device() path (as > > indicated by the > > + * initial dev->iommu) then try to simulate it. This > > should no longer > > + * happen unless of_dma_configure() is being misused > > outside bus code. > > + */ > > + if (!err && dev->bus && !dev_iommu_present) > > err = iommu_probe_device(dev); > > > > if (err && err != -EPROBE_DEFER) > > diff --git a/drivers/of/device.c b/drivers/of/device.c > > index edf3be197265..5053e5d532cc 100644 > > --- a/drivers/of/device.c > > +++ b/drivers/of/device.c > > @@ -99,6 +99,11 @@ int of_dma_configure_id(struct device *dev, > > struct device_node *np, > > bool coherent, set_map = false; > > int ret; > > > > + if (dev->dma_range_map) { > > + dev_dbg(dev, "dma_range_map already set\n"); > > + goto skip_map; > > + } > > + > > if (np == dev->of_node) > > bus_np = __of_get_dma_parent(np); > > else > > @@ -119,7 +124,7 @@ int of_dma_configure_id(struct device *dev, > > struct device_node *np, > > end = dma_range_map_max(map); > > set_map = true; > > } > > - > > +skip_map: > > /* > > * If @dev is expected to be DMA-capable then the bus code > > that created > > * it should have initialised its dma_mask pointer by this > > point. For > > diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c > > index f57ea36d125d..4468b7327cab 100644 > > --- a/drivers/pci/pci-driver.c > > +++ b/drivers/pci/pci-driver.c > > @@ -1653,7 +1653,8 @@ static int pci_dma_configure(struct device > > *dev) > > > > pci_put_host_bridge_device(bridge); > > > > - if (!ret && !driver->driver_managed_dma) { > > + /* @driver may not be valid when we're called from the > > IOMMU layer */ > > + if (!ret && dev->driver && !driver->driver_managed_dma) { > > ret = iommu_device_use_default_domain(dev); > > if (ret) > > arch_teardown_dma_ops(dev); > > -- > > 2.39.2.101.g768bb238c484.dirty > > >
diff --git a/drivers/acpi/arm64/dma.c b/drivers/acpi/arm64/dma.c index 52b2abf88689..f30f138352b7 100644 --- a/drivers/acpi/arm64/dma.c +++ b/drivers/acpi/arm64/dma.c @@ -26,6 +26,11 @@ void acpi_arch_dma_setup(struct device *dev) else end = (1ULL << 32) - 1; + if (dev->dma_range_map) { + dev_dbg(dev, "dma_range_map already set\n"); + return; + } + ret = acpi_dma_get_range(dev, &map); if (!ret && map) { end = dma_range_map_max(map); diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c index 9f4efa8f75a6..fb1fe9f3b1a3 100644 --- a/drivers/acpi/scan.c +++ b/drivers/acpi/scan.c @@ -1632,13 +1632,6 @@ static int acpi_iommu_configure_id(struct device *dev, const u32 *id_in) err = viot_iommu_configure(dev); mutex_unlock(&iommu_probe_device_lock); - /* - * If we have reason to believe the IOMMU driver missed the initial - * iommu_probe_device() call for dev, replay it to get things in order. - */ - if (!err && dev->bus) - err = iommu_probe_device(dev); - return err; } diff --git a/drivers/amba/bus.c b/drivers/amba/bus.c index 8ef259b4d037..71482d639a6d 100644 --- a/drivers/amba/bus.c +++ b/drivers/amba/bus.c @@ -364,7 +364,8 @@ static int amba_dma_configure(struct device *dev) ret = acpi_dma_configure(dev, attr); } - if (!ret && !drv->driver_managed_dma) { + /* @drv may not be valid when we're called from the IOMMU layer */ + if (!ret && dev->driver && !drv->driver_managed_dma) { ret = iommu_device_use_default_domain(dev); if (ret) arch_teardown_dma_ops(dev); diff --git a/drivers/base/platform.c b/drivers/base/platform.c index 6f2a33722c52..1813cfd0c4bd 100644 --- a/drivers/base/platform.c +++ b/drivers/base/platform.c @@ -1451,7 +1451,8 @@ static int platform_dma_configure(struct device *dev) attr = acpi_get_dma_attr(to_acpi_device_node(fwnode)); ret = acpi_dma_configure(dev, attr); } - if (ret || drv->driver_managed_dma) + /* @drv may not be valid when we're called from the IOMMU layer */ + if (ret || !dev->driver || drv->driver_managed_dma) return ret; ret = iommu_device_use_default_domain(dev); diff --git a/drivers/bus/fsl-mc/fsl-mc-bus.c b/drivers/bus/fsl-mc/fsl-mc-bus.c index d1f3d327ddd1..a8be8cf246fb 100644 --- a/drivers/bus/fsl-mc/fsl-mc-bus.c +++ b/drivers/bus/fsl-mc/fsl-mc-bus.c @@ -153,7 +153,8 @@ static int fsl_mc_dma_configure(struct device *dev) else ret = acpi_dma_configure_id(dev, DEV_DMA_COHERENT, &input_id); - if (!ret && !mc_drv->driver_managed_dma) { + /* @mc_drv may not be valid when we're called from the IOMMU layer */ + if (!ret && dev->driver && !mc_drv->driver_managed_dma) { ret = iommu_device_use_default_domain(dev); if (ret) arch_teardown_dma_ops(dev); diff --git a/drivers/cdx/cdx.c b/drivers/cdx/cdx.c index c573ed2ee71a..780fb0c4adba 100644 --- a/drivers/cdx/cdx.c +++ b/drivers/cdx/cdx.c @@ -360,7 +360,8 @@ static int cdx_dma_configure(struct device *dev) return ret; } - if (!ret && !cdx_drv->driver_managed_dma) { + /* @cdx_drv may not be valid when we're called from the IOMMU layer */ + if (!ret && dev->driver && !cdx_drv->driver_managed_dma) { ret = iommu_device_use_default_domain(dev); if (ret) arch_teardown_dma_ops(dev); diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index a3b45b84f42b..1cec7074367a 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -414,9 +414,21 @@ static int iommu_init_device(struct device *dev) if (!dev_iommu_get(dev)) return -ENOMEM; /* - * For FDT-based systems and ACPI IORT/VIOT, drivers register IOMMU - * instances with non-NULL fwnodes, and client devices should have been - * identified with a fwspec by this point. Otherwise, we can currently + * For FDT-based systems and ACPI IORT/VIOT, the common firmware parsing + * is buried in the bus dma_configure path. Properly unpicking that is + * still a big job, so for now just invoke the whole thing. The device + * already having a driver bound means dma_configure has already run and + * either found no IOMMU to wait for, or we're in its replay call right + * now, so either way there's no point calling it again. + */ + if (!dev->driver && dev->bus->dma_configure) { + mutex_unlock(&iommu_probe_device_lock); + dev->bus->dma_configure(dev); + mutex_lock(&iommu_probe_device_lock); + } + /* + * At this point, relevant devices either now have a fwspec which will + * match ops registered with a non-NULL fwnode, or we can reasonably * assume that only one of Intel, AMD, s390, PAMU or legacy SMMUv2 can * be present, and that any of their registered instances has suitable * ops for probing, and thus cheekily co-opt the same mechanism. @@ -426,6 +438,12 @@ static int iommu_init_device(struct device *dev) ret = -ENODEV; goto err_free; } + /* + * And if we do now see any replay calls, they would indicate someone + * misusing the dma_configure path outside bus code. + */ + if (dev->driver) + dev_WARN(dev, "late IOMMU probe at driver bind, something fishy here!\n"); if (!try_module_get(ops->owner)) { ret = -EINVAL; diff --git a/drivers/iommu/of_iommu.c b/drivers/iommu/of_iommu.c index e10a68b5ffde..6b989a62def2 100644 --- a/drivers/iommu/of_iommu.c +++ b/drivers/iommu/of_iommu.c @@ -155,7 +155,12 @@ int of_iommu_configure(struct device *dev, struct device_node *master_np, dev_iommu_free(dev); mutex_unlock(&iommu_probe_device_lock); - if (!err && dev->bus) + /* + * If we're not on the iommu_probe_device() path (as indicated by the + * initial dev->iommu) then try to simulate it. This should no longer + * happen unless of_dma_configure() is being misused outside bus code. + */ + if (!err && dev->bus && !dev_iommu_present) err = iommu_probe_device(dev); if (err && err != -EPROBE_DEFER) diff --git a/drivers/of/device.c b/drivers/of/device.c index edf3be197265..5053e5d532cc 100644 --- a/drivers/of/device.c +++ b/drivers/of/device.c @@ -99,6 +99,11 @@ int of_dma_configure_id(struct device *dev, struct device_node *np, bool coherent, set_map = false; int ret; + if (dev->dma_range_map) { + dev_dbg(dev, "dma_range_map already set\n"); + goto skip_map; + } + if (np == dev->of_node) bus_np = __of_get_dma_parent(np); else @@ -119,7 +124,7 @@ int of_dma_configure_id(struct device *dev, struct device_node *np, end = dma_range_map_max(map); set_map = true; } - +skip_map: /* * If @dev is expected to be DMA-capable then the bus code that created * it should have initialised its dma_mask pointer by this point. For diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c index f57ea36d125d..4468b7327cab 100644 --- a/drivers/pci/pci-driver.c +++ b/drivers/pci/pci-driver.c @@ -1653,7 +1653,8 @@ static int pci_dma_configure(struct device *dev) pci_put_host_bridge_device(bridge); - if (!ret && !driver->driver_managed_dma) { + /* @driver may not be valid when we're called from the IOMMU layer */ + if (!ret && dev->driver && !driver->driver_managed_dma) { ret = iommu_device_use_default_domain(dev); if (ret) arch_teardown_dma_ops(dev);