diff mbox series

[v2,4/4] iommu: Get DT/ACPI parsing into the proper probe path

Message ID e3b191e6fd6ca9a1e84c5e5e40044faf97abb874.1740753261.git.robin.murphy@arm.com (mailing list archive)
State New, archived
Headers show
Series iommu: Fix the longstanding probe issues | expand

Commit Message

Robin Murphy Feb. 28, 2025, 3:46 p.m. UTC
In hindsight, there were some crucial subtleties overlooked when moving
{of,acpi}_dma_configure() to driver probe time to allow waiting for
IOMMU drivers with -EPROBE_DEFER, and these have become an
ever-increasing source of problems. The IOMMU API has some fundamental
assumptions that iommu_probe_device() is called for every device added
to the system, in the order in which they are added. Calling it in a
random order or not at all dependent on driver binding leads to
malformed groups, a potential lack of isolation for devices with no
driver, and all manner of unexpected concurrency and race conditions.
We've attempted to mitigate the latter with point-fix bodges like
iommu_probe_device_lock, but it's a losing battle and the time has come
to bite the bullet and address the true source of the problem instead.

The crux of the matter is that the firmware parsing actually serves two
distinct purposes; one is identifying the IOMMU instance associated with
a device so we can check its availability, the second is actually
telling that instance about the relevant firmware-provided data for the
device. However the latter also depends on the former, and at the time
there was no good place to defer and retry that separately from the
availability check we also wanted for client driver probe.

Nowadays, though, we have a proper notion of multiple IOMMU instances in
the core API itself, and each one gets a chance to probe its own devices
upon registration, so we can finally make that work as intended for
DT/IORT/VIOT platforms too. All we need is for iommu_probe_device() to
be able to run the iommu_fwspec machinery currently buried deep in the
wrong end of {of,acpi}_dma_configure(). Luckily it turns out to be
surprisingly straightforward to bootstrap this transformation by pretty
much just calling the same path twice. At client driver probe time,
dev->driver is obviously set; conversely at device_add(), or a
subsequent bus_iommu_probe(), any device waiting for an IOMMU really
should *not* have a driver already, so we can use that as a condition to
disambiguate the two cases, and avoid recursing back into the IOMMU core
at the wrong times.

Obviously this isn't the nicest thing, but for now it gives us a
functional baseline to then unpick the layers in between without many
more awkward cross-subsystem patches. There are some minor side-effects
like dma_range_map potentially being created earlier, and some debug
prints being repeated, but these aren't significantly detrimental. Let's
make things work first, then deal with making them nice.

With the basic flow finally in the right order again, the next step is
probably turning the bus->dma_configure paths inside-out, since all we
really need from bus code is its notion of which device and input ID(s)
to parse the common firmware properties with...

Acked-by: Bjorn Helgaas <bhelgaas@google.com> # pci-driver.c
Acked-by: Rob Herring (Arm) <robh@kernel.org> # of/device.c
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---

v2:
 - Comment bus driver changes for clarity
 - Use dev->iommu as the now-robust replay condition
 - Drop the device_iommu_mapped() checks in the firmware paths as they
   weren't doing much - we can't replace probe_device_lock just yet...
 
 drivers/acpi/arm64/dma.c        |  5 +++++
 drivers/acpi/scan.c             |  7 -------
 drivers/amba/bus.c              |  3 ++-
 drivers/base/platform.c         |  3 ++-
 drivers/bus/fsl-mc/fsl-mc-bus.c |  3 ++-
 drivers/cdx/cdx.c               |  3 ++-
 drivers/iommu/iommu.c           | 24 +++++++++++++++++++++---
 drivers/iommu/of_iommu.c        |  7 ++++++-
 drivers/of/device.c             |  7 ++++++-
 drivers/pci/pci-driver.c        |  3 ++-
 10 files changed, 48 insertions(+), 17 deletions(-)

Comments

Jason Gunthorpe March 5, 2025, 6:28 p.m. UTC | #1
On Fri, Feb 28, 2025 at 03:46:33PM +0000, Robin Murphy wrote:
> +	if (!dev->driver && dev->bus->dma_configure) {
> +		mutex_unlock(&iommu_probe_device_lock);
> +		dev->bus->dma_configure(dev);
> +		mutex_lock(&iommu_probe_device_lock);
> +	}

I think it would be very nice to get rid of the lock/unlock.. It makes
me nervous that we continue on assuming dev->iommu was freshly
allocated..

 setup the dev->iommu partially, then drop the lock.

There is only one other caller in:

static int really_probe(struct device *dev, const struct device_driver *drv)
{
	if (dev->bus->dma_configure) {
		ret = dev->bus->dma_configure(dev);
		if (ret)
			goto pinctrl_bind_failed;
	}

Is it feasible to make it so the caller has to hold the
iommu_probe_device_lock prior to calling the op?

That would require moving the locking inside of_dma_configure to less
inside, and using a new iommu_probe_device() wrapper.

However, if you plan to turn this inside out soonish then it would not
be worth the bother.

Anyhow:

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason
Lorenzo Pieralisi March 7, 2025, 2:24 p.m. UTC | #2
On Fri, Feb 28, 2025 at 03:46:33PM +0000, Robin Murphy wrote:
> In hindsight, there were some crucial subtleties overlooked when moving
> {of,acpi}_dma_configure() to driver probe time to allow waiting for
> IOMMU drivers with -EPROBE_DEFER, and these have become an
> ever-increasing source of problems. The IOMMU API has some fundamental
> assumptions that iommu_probe_device() is called for every device added
> to the system, in the order in which they are added. Calling it in a
> random order or not at all dependent on driver binding leads to
> malformed groups, a potential lack of isolation for devices with no
> driver, and all manner of unexpected concurrency and race conditions.
> We've attempted to mitigate the latter with point-fix bodges like
> iommu_probe_device_lock, but it's a losing battle and the time has come
> to bite the bullet and address the true source of the problem instead.
> 
> The crux of the matter is that the firmware parsing actually serves two
> distinct purposes; one is identifying the IOMMU instance associated with
> a device so we can check its availability, the second is actually
> telling that instance about the relevant firmware-provided data for the
> device. However the latter also depends on the former, and at the time
> there was no good place to defer and retry that separately from the
> availability check we also wanted for client driver probe.
> 
> Nowadays, though, we have a proper notion of multiple IOMMU instances in
> the core API itself, and each one gets a chance to probe its own devices
> upon registration, so we can finally make that work as intended for
> DT/IORT/VIOT platforms too. All we need is for iommu_probe_device() to
> be able to run the iommu_fwspec machinery currently buried deep in the
> wrong end of {of,acpi}_dma_configure(). Luckily it turns out to be
> surprisingly straightforward to bootstrap this transformation by pretty
> much just calling the same path twice. At client driver probe time,
> dev->driver is obviously set; conversely at device_add(), or a
> subsequent bus_iommu_probe(), any device waiting for an IOMMU really
> should *not* have a driver already, so we can use that as a condition to
> disambiguate the two cases, and avoid recursing back into the IOMMU core
> at the wrong times.
> 
> Obviously this isn't the nicest thing, but for now it gives us a
> functional baseline to then unpick the layers in between without many
> more awkward cross-subsystem patches. There are some minor side-effects
> like dma_range_map potentially being created earlier, and some debug
> prints being repeated, but these aren't significantly detrimental. Let's
> make things work first, then deal with making them nice.
> 
> With the basic flow finally in the right order again, the next step is
> probably turning the bus->dma_configure paths inside-out, since all we
> really need from bus code is its notion of which device and input ID(s)
> to parse the common firmware properties with...
> 
> Acked-by: Bjorn Helgaas <bhelgaas@google.com> # pci-driver.c
> Acked-by: Rob Herring (Arm) <robh@kernel.org> # of/device.c
> Signed-off-by: Robin Murphy <robin.murphy@arm.com>
> ---
> 
> v2:
>  - Comment bus driver changes for clarity
>  - Use dev->iommu as the now-robust replay condition
>  - Drop the device_iommu_mapped() checks in the firmware paths as they
>    weren't doing much - we can't replace probe_device_lock just yet...
>  
>  drivers/acpi/arm64/dma.c        |  5 +++++
>  drivers/acpi/scan.c             |  7 -------
>  drivers/amba/bus.c              |  3 ++-
>  drivers/base/platform.c         |  3 ++-
>  drivers/bus/fsl-mc/fsl-mc-bus.c |  3 ++-
>  drivers/cdx/cdx.c               |  3 ++-
>  drivers/iommu/iommu.c           | 24 +++++++++++++++++++++---
>  drivers/iommu/of_iommu.c        |  7 ++++++-
>  drivers/of/device.c             |  7 ++++++-
>  drivers/pci/pci-driver.c        |  3 ++-
>  10 files changed, 48 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/acpi/arm64/dma.c b/drivers/acpi/arm64/dma.c
> index 52b2abf88689..f30f138352b7 100644
> --- a/drivers/acpi/arm64/dma.c
> +++ b/drivers/acpi/arm64/dma.c
> @@ -26,6 +26,11 @@ void acpi_arch_dma_setup(struct device *dev)
>  	else
>  		end = (1ULL << 32) - 1;
>  
> +	if (dev->dma_range_map) {
> +		dev_dbg(dev, "dma_range_map already set\n");
> +		return;
> +	}
> +

Why are we checking that dev->dma_range_map exists here rather than
at function entry ? Is that because we want to run the previous code
for some reasons even if dev->dma_range_map is already set ?

Just noticed, the OF counterpart does not seem to take the same
approach, so I am just flagging this up if it matters at all.

Other than that, for acpi/arm64:

Reviewed-by: Lorenzo Pieralisi <lpieralisi@kernel.org>

>  	ret = acpi_dma_get_range(dev, &map);
>  	if (!ret && map) {
>  		end = dma_range_map_max(map);
> diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c
> index 9f4efa8f75a6..fb1fe9f3b1a3 100644
> --- a/drivers/acpi/scan.c
> +++ b/drivers/acpi/scan.c
> @@ -1632,13 +1632,6 @@ static int acpi_iommu_configure_id(struct device *dev, const u32 *id_in)
>  		err = viot_iommu_configure(dev);
>  	mutex_unlock(&iommu_probe_device_lock);
>  
> -	/*
> -	 * If we have reason to believe the IOMMU driver missed the initial
> -	 * iommu_probe_device() call for dev, replay it to get things in order.
> -	 */
> -	if (!err && dev->bus)
> -		err = iommu_probe_device(dev);
> -
>  	return err;
>  }
>  
> diff --git a/drivers/amba/bus.c b/drivers/amba/bus.c
> index 8ef259b4d037..71482d639a6d 100644
> --- a/drivers/amba/bus.c
> +++ b/drivers/amba/bus.c
> @@ -364,7 +364,8 @@ static int amba_dma_configure(struct device *dev)
>  		ret = acpi_dma_configure(dev, attr);
>  	}
>  
> -	if (!ret && !drv->driver_managed_dma) {
> +	/* @drv may not be valid when we're called from the IOMMU layer */
> +	if (!ret && dev->driver && !drv->driver_managed_dma) {
>  		ret = iommu_device_use_default_domain(dev);
>  		if (ret)
>  			arch_teardown_dma_ops(dev);
> diff --git a/drivers/base/platform.c b/drivers/base/platform.c
> index 6f2a33722c52..1813cfd0c4bd 100644
> --- a/drivers/base/platform.c
> +++ b/drivers/base/platform.c
> @@ -1451,7 +1451,8 @@ static int platform_dma_configure(struct device *dev)
>  		attr = acpi_get_dma_attr(to_acpi_device_node(fwnode));
>  		ret = acpi_dma_configure(dev, attr);
>  	}
> -	if (ret || drv->driver_managed_dma)
> +	/* @drv may not be valid when we're called from the IOMMU layer */
> +	if (ret || !dev->driver || drv->driver_managed_dma)
>  		return ret;
>  
>  	ret = iommu_device_use_default_domain(dev);
> diff --git a/drivers/bus/fsl-mc/fsl-mc-bus.c b/drivers/bus/fsl-mc/fsl-mc-bus.c
> index d1f3d327ddd1..a8be8cf246fb 100644
> --- a/drivers/bus/fsl-mc/fsl-mc-bus.c
> +++ b/drivers/bus/fsl-mc/fsl-mc-bus.c
> @@ -153,7 +153,8 @@ static int fsl_mc_dma_configure(struct device *dev)
>  	else
>  		ret = acpi_dma_configure_id(dev, DEV_DMA_COHERENT, &input_id);
>  
> -	if (!ret && !mc_drv->driver_managed_dma) {
> +	/* @mc_drv may not be valid when we're called from the IOMMU layer */
> +	if (!ret && dev->driver && !mc_drv->driver_managed_dma) {
>  		ret = iommu_device_use_default_domain(dev);
>  		if (ret)
>  			arch_teardown_dma_ops(dev);
> diff --git a/drivers/cdx/cdx.c b/drivers/cdx/cdx.c
> index c573ed2ee71a..780fb0c4adba 100644
> --- a/drivers/cdx/cdx.c
> +++ b/drivers/cdx/cdx.c
> @@ -360,7 +360,8 @@ static int cdx_dma_configure(struct device *dev)
>  		return ret;
>  	}
>  
> -	if (!ret && !cdx_drv->driver_managed_dma) {
> +	/* @cdx_drv may not be valid when we're called from the IOMMU layer */
> +	if (!ret && dev->driver && !cdx_drv->driver_managed_dma) {
>  		ret = iommu_device_use_default_domain(dev);
>  		if (ret)
>  			arch_teardown_dma_ops(dev);
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index a3b45b84f42b..1cec7074367a 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -414,9 +414,21 @@ static int iommu_init_device(struct device *dev)
>  	if (!dev_iommu_get(dev))
>  		return -ENOMEM;
>  	/*
> -	 * For FDT-based systems and ACPI IORT/VIOT, drivers register IOMMU
> -	 * instances with non-NULL fwnodes, and client devices should have been
> -	 * identified with a fwspec by this point. Otherwise, we can currently
> +	 * For FDT-based systems and ACPI IORT/VIOT, the common firmware parsing
> +	 * is buried in the bus dma_configure path. Properly unpicking that is
> +	 * still a big job, so for now just invoke the whole thing. The device
> +	 * already having a driver bound means dma_configure has already run and
> +	 * either found no IOMMU to wait for, or we're in its replay call right
> +	 * now, so either way there's no point calling it again.
> +	 */
> +	if (!dev->driver && dev->bus->dma_configure) {
> +		mutex_unlock(&iommu_probe_device_lock);
> +		dev->bus->dma_configure(dev);
> +		mutex_lock(&iommu_probe_device_lock);
> +	}
> +	/*
> +	 * At this point, relevant devices either now have a fwspec which will
> +	 * match ops registered with a non-NULL fwnode, or we can reasonably
>  	 * assume that only one of Intel, AMD, s390, PAMU or legacy SMMUv2 can
>  	 * be present, and that any of their registered instances has suitable
>  	 * ops for probing, and thus cheekily co-opt the same mechanism.
> @@ -426,6 +438,12 @@ static int iommu_init_device(struct device *dev)
>  		ret = -ENODEV;
>  		goto err_free;
>  	}
> +	/*
> +	 * And if we do now see any replay calls, they would indicate someone
> +	 * misusing the dma_configure path outside bus code.
> +	 */
> +	if (dev->driver)
> +		dev_WARN(dev, "late IOMMU probe at driver bind, something fishy here!\n");
>  
>  	if (!try_module_get(ops->owner)) {
>  		ret = -EINVAL;
> diff --git a/drivers/iommu/of_iommu.c b/drivers/iommu/of_iommu.c
> index e10a68b5ffde..6b989a62def2 100644
> --- a/drivers/iommu/of_iommu.c
> +++ b/drivers/iommu/of_iommu.c
> @@ -155,7 +155,12 @@ int of_iommu_configure(struct device *dev, struct device_node *master_np,
>  		dev_iommu_free(dev);
>  	mutex_unlock(&iommu_probe_device_lock);
>  
> -	if (!err && dev->bus)
> +	/*
> +	 * If we're not on the iommu_probe_device() path (as indicated by the
> +	 * initial dev->iommu) then try to simulate it. This should no longer
> +	 * happen unless of_dma_configure() is being misused outside bus code.
> +	 */
> +	if (!err && dev->bus && !dev_iommu_present)
>  		err = iommu_probe_device(dev);
>  
>  	if (err && err != -EPROBE_DEFER)
> diff --git a/drivers/of/device.c b/drivers/of/device.c
> index edf3be197265..5053e5d532cc 100644
> --- a/drivers/of/device.c
> +++ b/drivers/of/device.c
> @@ -99,6 +99,11 @@ int of_dma_configure_id(struct device *dev, struct device_node *np,
>  	bool coherent, set_map = false;
>  	int ret;
>  
> +	if (dev->dma_range_map) {
> +		dev_dbg(dev, "dma_range_map already set\n");
> +		goto skip_map;
> +	}
> +
>  	if (np == dev->of_node)
>  		bus_np = __of_get_dma_parent(np);
>  	else
> @@ -119,7 +124,7 @@ int of_dma_configure_id(struct device *dev, struct device_node *np,
>  		end = dma_range_map_max(map);
>  		set_map = true;
>  	}
> -
> +skip_map:
>  	/*
>  	 * If @dev is expected to be DMA-capable then the bus code that created
>  	 * it should have initialised its dma_mask pointer by this point. For
> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> index f57ea36d125d..4468b7327cab 100644
> --- a/drivers/pci/pci-driver.c
> +++ b/drivers/pci/pci-driver.c
> @@ -1653,7 +1653,8 @@ static int pci_dma_configure(struct device *dev)
>  
>  	pci_put_host_bridge_device(bridge);
>  
> -	if (!ret && !driver->driver_managed_dma) {
> +	/* @driver may not be valid when we're called from the IOMMU layer */
> +	if (!ret && dev->driver && !driver->driver_managed_dma) {
>  		ret = iommu_device_use_default_domain(dev);
>  		if (ret)
>  			arch_teardown_dma_ops(dev);
> -- 
> 2.39.2.101.g768bb238c484.dirty
>
Robin Murphy March 7, 2025, 8:20 p.m. UTC | #3
On 2025-03-07 2:24 pm, Lorenzo Pieralisi wrote:
[...]
>> diff --git a/drivers/acpi/arm64/dma.c b/drivers/acpi/arm64/dma.c
>> index 52b2abf88689..f30f138352b7 100644
>> --- a/drivers/acpi/arm64/dma.c
>> +++ b/drivers/acpi/arm64/dma.c
>> @@ -26,6 +26,11 @@ void acpi_arch_dma_setup(struct device *dev)
>>   	else
>>   		end = (1ULL << 32) - 1;
>>   
>> +	if (dev->dma_range_map) {
>> +		dev_dbg(dev, "dma_range_map already set\n");
>> +		return;
>> +	}
>> +
> 
> Why are we checking that dev->dma_range_map exists here rather than
> at function entry ? Is that because we want to run the previous code
> for some reasons even if dev->dma_range_map is already set ?
> 
> Just noticed, the OF counterpart does not seem to take the same
> approach, so I am just flagging this up if it matters at all.

The intent is just to ensure it's safe to come through this path more 
than once for the time being - it doesn't really matter whether or not 
we repeat anything apart from allocating and assigning dma_range_map, 
since that leaks the previous allocation. Thus this check is logically 
guarding the acpi_dma_get_range() call in this path, and the 
of_dma_get_range() call in the other.

> Other than that, for acpi/arm64:
> 
> Reviewed-by: Lorenzo Pieralisi <lpieralisi@kernel.org>

Thanks!

Robin.
Joerg Roedel March 11, 2025, 6:42 p.m. UTC | #4
Hi Robin,

On Fri, Feb 28, 2025 at 03:46:33PM +0000, Robin Murphy wrote:
> +	/*
> +	 * And if we do now see any replay calls, they would indicate someone
> +	 * misusing the dma_configure path outside bus code.
> +	 */
> +	if (dev->driver)
> +		dev_WARN(dev, "late IOMMU probe at driver bind, something fishy here!\n");

This warning triggers on my workstation (with an AMD IOMMU), any ideas?

 ------------[ cut here ]------------
 reg-dummy reg-dummy: late IOMMU probe at driver bind, something fishy here!
 WARNING: CPU: 0 PID: 1 at drivers/iommu/iommu.c:449 __iommu_probe_device+0x10b/0x510
 Modules linked in:

 CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.14.0-rc6-iommu-next+ #1 1d691d7aebf343bde741cf4c8610d78a2ea2d2d9
 Hardware name: System manufacturer System Product Name/PRIME X470-PRO, BIOS 5406 11/13/2019
 RIP: 0010:__iommu_probe_device+0x10b/0x510
 Code: 68 00 74 28 48 8b 6b 50 48 85 ed 75 03 48 8b 2b 48 89 df e8 87 71 06 00 48 89 ea 48 c7 c7 90 dd 2c 8b 48 89 c6 e8 35 83 77 ff <0f> 0b 49 8b bd a8 00 00 00 e8 77 ab 85 ff 84 c0 0f 84 ad 01 00 00
 RSP: 0018:ffffafba00047c58 EFLAGS: 00010282
 RAX: 0000000000000000 RBX: ffffa00481301c10 RCX: 0000000000000003
 RDX: 0000000000000000 RSI: 0000000000000003 RDI: 0000000000000001
 RBP: ffffa00480ffaee0 R08: 0000000000000000 R09: ffffafba00047ae8
 R10: ffffa0135e93ffa8 R11: 0000000000000003 R12: ffffafba00047d18
 R13: ffffffff8adac1a0 R14: 0000000000000000 R15: ffffa004802c5800
 FS:  0000000000000000(0000) GS:ffffa0135ea00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: ffffa00baca01000 CR3: 000000082b838000 CR4: 00000000003506f0
 Call Trace:
  <TASK>
  ? __iommu_probe_device+0x10b/0x510
  ? __warn.cold+0x93/0xf7
  ? __iommu_probe_device+0x10b/0x510
  ? report_bug+0xff/0x140
  ? prb_read_valid+0x1b/0x30
  ? handle_bug+0x58/0x90
  ? exc_invalid_op+0x17/0x70
  ? asm_exc_invalid_op+0x1a/0x20
  ? __iommu_probe_device+0x10b/0x510
  ? __iommu_probe_device+0x10b/0x510
  ? __pfx_probe_iommu_group+0x10/0x10
  probe_iommu_group+0x28/0x50
  bus_for_each_dev+0x7e/0xd0
  iommu_device_register+0xee/0x260
  iommu_go_to_state+0xa2a/0x1970
  ? srso_return_thunk+0x5/0x5f
  ? blake2s_update+0x68/0x160
  ? __pfx_pci_iommu_init+0x10/0x10
  amd_iommu_init+0x14/0x50
  ? __pfx_pci_iommu_init+0x10/0x10
  pci_iommu_init+0x12/0x40
  do_one_initcall+0x46/0x300
  kernel_init_freeable+0x23d/0x2a0
  ? __pfx_kernel_init+0x10/0x10
  kernel_init+0x1a/0x140
  ret_from_fork+0x34/0x50
  ? __pfx_kernel_init+0x10/0x10
  ret_from_fork_asm+0x1a/0x30
  </TASK>
 ---[ end trace 0000000000000000 ]---
Baolu Lu March 12, 2025, 7:07 a.m. UTC | #5
On 2025/3/12 2:42, Joerg Roedel wrote:
> On Fri, Feb 28, 2025 at 03:46:33PM +0000, Robin Murphy wrote:
>> +	/*
>> +	 * And if we do now see any replay calls, they would indicate someone
>> +	 * misusing the dma_configure path outside bus code.
>> +	 */
>> +	if (dev->driver)
>> +		dev_WARN(dev, "late IOMMU probe at driver bind, something fishy here!\n");
> This warning triggers on my workstation (with an AMD IOMMU), any ideas?
> 
>   ------------[ cut here ]------------
>   reg-dummy reg-dummy: late IOMMU probe at driver bind, something fishy here!
>   WARNING: CPU: 0 PID: 1 at drivers/iommu/iommu.c:449 __iommu_probe_device+0x10b/0x510
>   Modules linked in:
> 
>   CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.14.0-rc6-iommu-next+ #1 1d691d7aebf343bde741cf4c8610d78a2ea2d2d9
>   Hardware name: System manufacturer System Product Name/PRIME X470-PRO, BIOS 5406 11/13/2019
>   RIP: 0010:__iommu_probe_device+0x10b/0x510
>   Code: 68 00 74 28 48 8b 6b 50 48 85 ed 75 03 48 8b 2b 48 89 df e8 87 71 06 00 48 89 ea 48 c7 c7 90 dd 2c 8b 48 89 c6 e8 35 83 77 ff <0f> 0b 49 8b bd a8 00 00 00 e8 77 ab 85 ff 84 c0 0f 84 ad 01 00 00
>   RSP: 0018:ffffafba00047c58 EFLAGS: 00010282
>   RAX: 0000000000000000 RBX: ffffa00481301c10 RCX: 0000000000000003
>   RDX: 0000000000000000 RSI: 0000000000000003 RDI: 0000000000000001
>   RBP: ffffa00480ffaee0 R08: 0000000000000000 R09: ffffafba00047ae8
>   R10: ffffa0135e93ffa8 R11: 0000000000000003 R12: ffffafba00047d18
>   R13: ffffffff8adac1a0 R14: 0000000000000000 R15: ffffa004802c5800
>   FS:  0000000000000000(0000) GS:ffffa0135ea00000(0000) knlGS:0000000000000000
>   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>   CR2: ffffa00baca01000 CR3: 000000082b838000 CR4: 00000000003506f0
>   Call Trace:
>    <TASK>
>    ? __iommu_probe_device+0x10b/0x510
>    ? __warn.cold+0x93/0xf7
>    ? __iommu_probe_device+0x10b/0x510
>    ? report_bug+0xff/0x140
>    ? prb_read_valid+0x1b/0x30
>    ? handle_bug+0x58/0x90
>    ? exc_invalid_op+0x17/0x70
>    ? asm_exc_invalid_op+0x1a/0x20
>    ? __iommu_probe_device+0x10b/0x510
>    ? __iommu_probe_device+0x10b/0x510
>    ? __pfx_probe_iommu_group+0x10/0x10
>    probe_iommu_group+0x28/0x50
>    bus_for_each_dev+0x7e/0xd0
>    iommu_device_register+0xee/0x260
>    iommu_go_to_state+0xa2a/0x1970
>    ? srso_return_thunk+0x5/0x5f
>    ? blake2s_update+0x68/0x160
>    ? __pfx_pci_iommu_init+0x10/0x10
>    amd_iommu_init+0x14/0x50
>    ? __pfx_pci_iommu_init+0x10/0x10
>    pci_iommu_init+0x12/0x40
>    do_one_initcall+0x46/0x300
>    kernel_init_freeable+0x23d/0x2a0
>    ? __pfx_kernel_init+0x10/0x10
>    kernel_init+0x1a/0x140
>    ret_from_fork+0x34/0x50
>    ? __pfx_kernel_init+0x10/0x10
>    ret_from_fork_asm+0x1a/0x30
>    </TASK>
>   ---[ end trace 0000000000000000 ]---

I saw the same issue on Intel platforms.

Thanks,
baolu
Robin Murphy March 12, 2025, 10:10 a.m. UTC | #6
On 2025-03-11 6:42 pm, Joerg Roedel wrote:
> Hi Robin,
> 
> On Fri, Feb 28, 2025 at 03:46:33PM +0000, Robin Murphy wrote:
>> +	/*
>> +	 * And if we do now see any replay calls, they would indicate someone
>> +	 * misusing the dma_configure path outside bus code.
>> +	 */
>> +	if (dev->driver)
>> +		dev_WARN(dev, "late IOMMU probe at driver bind, something fishy here!\n");
> 
> This warning triggers on my workstation (with an AMD IOMMU), any ideas?

Argh! When I moved the dma_configure call into iommu_init_device() for
v2 I moved the warning with it, but of course that needs to stay where
it was, *after* the point that ops->probe_device has had a chance to
filter out irrelevant devices. Does this make it behave?

Thanks,
Robin.

----->8-----
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 09798ddbce9d..1da6c55a0d02 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -437,12 +437,6 @@ static int iommu_init_device(struct device *dev)
  		ret = -ENODEV;
  		goto err_free;
  	}
-	/*
-	 * And if we do now see any replay calls, they would indicate someone
-	 * misusing the dma_configure path outside bus code.
-	 */
-	if (dev->driver)
-		dev_WARN(dev, "late IOMMU probe at driver bind, something fishy here!\n");
  
  	if (!try_module_get(ops->owner)) {
  		ret = -EINVAL;
@@ -565,6 +559,12 @@ static int __iommu_probe_device(struct device *dev, struct list_head *group_list
  	ret = iommu_init_device(dev);
  	if (ret)
  		return ret;
+	/*
+	 * And if we do now see any replay calls, they would indicate someone
+	 * misusing the dma_configure path outside bus code.
+	 */
+	if (dev->driver)
+		dev_WARN(dev, "late IOMMU probe at driver bind, something fishy here!\n");
  
  	group = dev->iommu_group;
  	gdev = iommu_group_alloc_device(group, dev);
Baolu Lu March 12, 2025, 2:34 p.m. UTC | #7
On 2025/3/12 18:10, Robin Murphy wrote:
> On 2025-03-11 6:42 pm, Joerg Roedel wrote:
>> Hi Robin,
>>
>> On Fri, Feb 28, 2025 at 03:46:33PM +0000, Robin Murphy wrote:
>>> +    /*
>>> +     * And if we do now see any replay calls, they would indicate 
>>> someone
>>> +     * misusing the dma_configure path outside bus code.
>>> +     */
>>> +    if (dev->driver)
>>> +        dev_WARN(dev, "late IOMMU probe at driver bind, something 
>>> fishy here!\n");
>>
>> This warning triggers on my workstation (with an AMD IOMMU), any ideas?
> 
> Argh! When I moved the dma_configure call into iommu_init_device() for
> v2 I moved the warning with it, but of course that needs to stay where
> it was, *after* the point that ops->probe_device has had a chance to
> filter out irrelevant devices. Does this make it behave?

Yes. It works on my end.

Thanks,
baolu
Joerg Roedel March 12, 2025, 3:21 p.m. UTC | #8
Hi Robin,

On Wed, Mar 12, 2025 at 10:10:04AM +0000, Robin Murphy wrote:
> Argh! When I moved the dma_configure call into iommu_init_device() for
> v2 I moved the warning with it, but of course that needs to stay where
> it was, *after* the point that ops->probe_device has had a chance to
> filter out irrelevant devices. Does this make it behave?

Okay, thanks for the patch. I am currently building a kernel to test it
and will report back.

Regards,

	Joerg
Marek Szyprowski March 13, 2025, 9:56 a.m. UTC | #9
Hi Robin,

On 28.02.2025 16:46, Robin Murphy wrote:
> In hindsight, there were some crucial subtleties overlooked when moving
> {of,acpi}_dma_configure() to driver probe time to allow waiting for
> IOMMU drivers with -EPROBE_DEFER, and these have become an
> ever-increasing source of problems. The IOMMU API has some fundamental
> assumptions that iommu_probe_device() is called for every device added
> to the system, in the order in which they are added. Calling it in a
> random order or not at all dependent on driver binding leads to
> malformed groups, a potential lack of isolation for devices with no
> driver, and all manner of unexpected concurrency and race conditions.
> We've attempted to mitigate the latter with point-fix bodges like
> iommu_probe_device_lock, but it's a losing battle and the time has come
> to bite the bullet and address the true source of the problem instead.
>
> The crux of the matter is that the firmware parsing actually serves two
> distinct purposes; one is identifying the IOMMU instance associated with
> a device so we can check its availability, the second is actually
> telling that instance about the relevant firmware-provided data for the
> device. However the latter also depends on the former, and at the time
> there was no good place to defer and retry that separately from the
> availability check we also wanted for client driver probe.
>
> Nowadays, though, we have a proper notion of multiple IOMMU instances in
> the core API itself, and each one gets a chance to probe its own devices
> upon registration, so we can finally make that work as intended for
> DT/IORT/VIOT platforms too. All we need is for iommu_probe_device() to
> be able to run the iommu_fwspec machinery currently buried deep in the
> wrong end of {of,acpi}_dma_configure(). Luckily it turns out to be
> surprisingly straightforward to bootstrap this transformation by pretty
> much just calling the same path twice. At client driver probe time,
> dev->driver is obviously set; conversely at device_add(), or a
> subsequent bus_iommu_probe(), any device waiting for an IOMMU really
> should *not* have a driver already, so we can use that as a condition to
> disambiguate the two cases, and avoid recursing back into the IOMMU core
> at the wrong times.
>
> Obviously this isn't the nicest thing, but for now it gives us a
> functional baseline to then unpick the layers in between without many
> more awkward cross-subsystem patches. There are some minor side-effects
> like dma_range_map potentially being created earlier, and some debug
> prints being repeated, but these aren't significantly detrimental. Let's
> make things work first, then deal with making them nice.
>
> With the basic flow finally in the right order again, the next step is
> probably turning the bus->dma_configure paths inside-out, since all we
> really need from bus code is its notion of which device and input ID(s)
> to parse the common firmware properties with...
>
> Acked-by: Bjorn Helgaas <bhelgaas@google.com> # pci-driver.c
> Acked-by: Rob Herring (Arm) <robh@kernel.org> # of/device.c
> Signed-off-by: Robin Murphy <robin.murphy@arm.com>

This patch landed in yesterday's linux-next as commit bcb81ac6ae3c 
("iommu: Get DT/ACPI parsing into the proper probe path"). In my tests I 
found it breaks booting of ARM64 RK3568-based Odroid-M1 board 
(arch/arm64/boot/dts/rockchip/rk3568-odroid-m1.dts). Here is the 
relevant kernel log:

Unable to handle kernel NULL pointer dereference at virtual address 
00000000000003e8
Mem abort info:
   ESR = 0x0000000096000004
   EC = 0x25: DABT (current EL), IL = 32 bits
   SET = 0, FnV = 0
   EA = 0, S1PTW = 0
   FSC = 0x04: level 0 translation fault
Data abort info:
   ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[00000000000003e8] user address but active_mm is swapper
Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
Modules linked in:
CPU: 3 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.14.0-rc3+ #15533
Hardware name: Hardkernel ODROID-M1 (DT)
pstate: 00400009 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : devm_kmalloc+0x2c/0x114
lr : rk_iommu_of_xlate+0x30/0x90
...
Call trace:
  devm_kmalloc+0x2c/0x114 (P)
  rk_iommu_of_xlate+0x30/0x90
  of_iommu_xlate+0x9c/0xc8
  of_iommu_configure+0x1c4/0x23c
  of_dma_configure_id+0x244/0x3b8
  platform_dma_configure+0x74/0xac
  __iommu_probe_device+0x258/0x49c
  probe_iommu_group+0x3c/0x64
  bus_for_each_dev+0x74/0xd4
  iommu_device_register+0xd4/0x230
  rk_iommu_probe+0x1f8/0x350
  platform_probe+0x68/0xdc
  really_probe+0xbc/0x298
  __driver_probe_device+0x78/0x12c
  driver_probe_device+0x40/0x164
  __driver_attach+0x9c/0x1ac
  bus_for_each_dev+0x74/0xd4
  driver_attach+0x24/0x30
  bus_add_driver+0xe4/0x208
  driver_register+0x60/0x128
  __platform_driver_register+0x24/0x30
  rk_iommu_driver_init+0x1c/0x28
  do_one_initcall+0x64/0x308
  kernel_init_freeable+0x288/0x4f0
  kernel_init+0x20/0x1d8
  ret_from_fork+0x10/0x20
Code: d2801017 a90153f3 ab0102e0 aa0103f4 (b943eab8)
---[ end trace 0000000000000000 ]---
Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
SMP: stopping secondary CPUs
Kernel Offset: disabled
CPU features: 0x100,00000040,00901250,8200720b
Memory Limit: none
---[ end Kernel panic - not syncing: Attempted to kill init! 
exitcode=0x0000000b ]---

Let me know if You want me to test any experimental patch or provide 
more information.

> ---
>
> v2:
>   - Comment bus driver changes for clarity
>   - Use dev->iommu as the now-robust replay condition
>   - Drop the device_iommu_mapped() checks in the firmware paths as they
>     weren't doing much - we can't replace probe_device_lock just yet...
>   
>   drivers/acpi/arm64/dma.c        |  5 +++++
>   drivers/acpi/scan.c             |  7 -------
>   drivers/amba/bus.c              |  3 ++-
>   drivers/base/platform.c         |  3 ++-
>   drivers/bus/fsl-mc/fsl-mc-bus.c |  3 ++-
>   drivers/cdx/cdx.c               |  3 ++-
>   drivers/iommu/iommu.c           | 24 +++++++++++++++++++++---
>   drivers/iommu/of_iommu.c        |  7 ++++++-
>   drivers/of/device.c             |  7 ++++++-
>   drivers/pci/pci-driver.c        |  3 ++-
>   10 files changed, 48 insertions(+), 17 deletions(-)
>
> diff --git a/drivers/acpi/arm64/dma.c b/drivers/acpi/arm64/dma.c
> index 52b2abf88689..f30f138352b7 100644
> --- a/drivers/acpi/arm64/dma.c
> +++ b/drivers/acpi/arm64/dma.c
> @@ -26,6 +26,11 @@ void acpi_arch_dma_setup(struct device *dev)
>   	else
>   		end = (1ULL << 32) - 1;
>   
> +	if (dev->dma_range_map) {
> +		dev_dbg(dev, "dma_range_map already set\n");
> +		return;
> +	}
> +
>   	ret = acpi_dma_get_range(dev, &map);
>   	if (!ret && map) {
>   		end = dma_range_map_max(map);
> diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c
> index 9f4efa8f75a6..fb1fe9f3b1a3 100644
> --- a/drivers/acpi/scan.c
> +++ b/drivers/acpi/scan.c
> @@ -1632,13 +1632,6 @@ static int acpi_iommu_configure_id(struct device *dev, const u32 *id_in)
>   		err = viot_iommu_configure(dev);
>   	mutex_unlock(&iommu_probe_device_lock);
>   
> -	/*
> -	 * If we have reason to believe the IOMMU driver missed the initial
> -	 * iommu_probe_device() call for dev, replay it to get things in order.
> -	 */
> -	if (!err && dev->bus)
> -		err = iommu_probe_device(dev);
> -
>   	return err;
>   }
>   
> diff --git a/drivers/amba/bus.c b/drivers/amba/bus.c
> index 8ef259b4d037..71482d639a6d 100644
> --- a/drivers/amba/bus.c
> +++ b/drivers/amba/bus.c
> @@ -364,7 +364,8 @@ static int amba_dma_configure(struct device *dev)
>   		ret = acpi_dma_configure(dev, attr);
>   	}
>   
> -	if (!ret && !drv->driver_managed_dma) {
> +	/* @drv may not be valid when we're called from the IOMMU layer */
> +	if (!ret && dev->driver && !drv->driver_managed_dma) {
>   		ret = iommu_device_use_default_domain(dev);
>   		if (ret)
>   			arch_teardown_dma_ops(dev);
> diff --git a/drivers/base/platform.c b/drivers/base/platform.c
> index 6f2a33722c52..1813cfd0c4bd 100644
> --- a/drivers/base/platform.c
> +++ b/drivers/base/platform.c
> @@ -1451,7 +1451,8 @@ static int platform_dma_configure(struct device *dev)
>   		attr = acpi_get_dma_attr(to_acpi_device_node(fwnode));
>   		ret = acpi_dma_configure(dev, attr);
>   	}
> -	if (ret || drv->driver_managed_dma)
> +	/* @drv may not be valid when we're called from the IOMMU layer */
> +	if (ret || !dev->driver || drv->driver_managed_dma)
>   		return ret;
>   
>   	ret = iommu_device_use_default_domain(dev);
> diff --git a/drivers/bus/fsl-mc/fsl-mc-bus.c b/drivers/bus/fsl-mc/fsl-mc-bus.c
> index d1f3d327ddd1..a8be8cf246fb 100644
> --- a/drivers/bus/fsl-mc/fsl-mc-bus.c
> +++ b/drivers/bus/fsl-mc/fsl-mc-bus.c
> @@ -153,7 +153,8 @@ static int fsl_mc_dma_configure(struct device *dev)
>   	else
>   		ret = acpi_dma_configure_id(dev, DEV_DMA_COHERENT, &input_id);
>   
> -	if (!ret && !mc_drv->driver_managed_dma) {
> +	/* @mc_drv may not be valid when we're called from the IOMMU layer */
> +	if (!ret && dev->driver && !mc_drv->driver_managed_dma) {
>   		ret = iommu_device_use_default_domain(dev);
>   		if (ret)
>   			arch_teardown_dma_ops(dev);
> diff --git a/drivers/cdx/cdx.c b/drivers/cdx/cdx.c
> index c573ed2ee71a..780fb0c4adba 100644
> --- a/drivers/cdx/cdx.c
> +++ b/drivers/cdx/cdx.c
> @@ -360,7 +360,8 @@ static int cdx_dma_configure(struct device *dev)
>   		return ret;
>   	}
>   
> -	if (!ret && !cdx_drv->driver_managed_dma) {
> +	/* @cdx_drv may not be valid when we're called from the IOMMU layer */
> +	if (!ret && dev->driver && !cdx_drv->driver_managed_dma) {
>   		ret = iommu_device_use_default_domain(dev);
>   		if (ret)
>   			arch_teardown_dma_ops(dev);
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index a3b45b84f42b..1cec7074367a 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -414,9 +414,21 @@ static int iommu_init_device(struct device *dev)
>   	if (!dev_iommu_get(dev))
>   		return -ENOMEM;
>   	/*
> -	 * For FDT-based systems and ACPI IORT/VIOT, drivers register IOMMU
> -	 * instances with non-NULL fwnodes, and client devices should have been
> -	 * identified with a fwspec by this point. Otherwise, we can currently
> +	 * For FDT-based systems and ACPI IORT/VIOT, the common firmware parsing
> +	 * is buried in the bus dma_configure path. Properly unpicking that is
> +	 * still a big job, so for now just invoke the whole thing. The device
> +	 * already having a driver bound means dma_configure has already run and
> +	 * either found no IOMMU to wait for, or we're in its replay call right
> +	 * now, so either way there's no point calling it again.
> +	 */
> +	if (!dev->driver && dev->bus->dma_configure) {
> +		mutex_unlock(&iommu_probe_device_lock);
> +		dev->bus->dma_configure(dev);
> +		mutex_lock(&iommu_probe_device_lock);
> +	}
> +	/*
> +	 * At this point, relevant devices either now have a fwspec which will
> +	 * match ops registered with a non-NULL fwnode, or we can reasonably
>   	 * assume that only one of Intel, AMD, s390, PAMU or legacy SMMUv2 can
>   	 * be present, and that any of their registered instances has suitable
>   	 * ops for probing, and thus cheekily co-opt the same mechanism.
> @@ -426,6 +438,12 @@ static int iommu_init_device(struct device *dev)
>   		ret = -ENODEV;
>   		goto err_free;
>   	}
> +	/*
> +	 * And if we do now see any replay calls, they would indicate someone
> +	 * misusing the dma_configure path outside bus code.
> +	 */
> +	if (dev->driver)
> +		dev_WARN(dev, "late IOMMU probe at driver bind, something fishy here!\n");
>   
>   	if (!try_module_get(ops->owner)) {
>   		ret = -EINVAL;
> diff --git a/drivers/iommu/of_iommu.c b/drivers/iommu/of_iommu.c
> index e10a68b5ffde..6b989a62def2 100644
> --- a/drivers/iommu/of_iommu.c
> +++ b/drivers/iommu/of_iommu.c
> @@ -155,7 +155,12 @@ int of_iommu_configure(struct device *dev, struct device_node *master_np,
>   		dev_iommu_free(dev);
>   	mutex_unlock(&iommu_probe_device_lock);
>   
> -	if (!err && dev->bus)
> +	/*
> +	 * If we're not on the iommu_probe_device() path (as indicated by the
> +	 * initial dev->iommu) then try to simulate it. This should no longer
> +	 * happen unless of_dma_configure() is being misused outside bus code.
> +	 */
> +	if (!err && dev->bus && !dev_iommu_present)
>   		err = iommu_probe_device(dev);
>   
>   	if (err && err != -EPROBE_DEFER)
> diff --git a/drivers/of/device.c b/drivers/of/device.c
> index edf3be197265..5053e5d532cc 100644
> --- a/drivers/of/device.c
> +++ b/drivers/of/device.c
> @@ -99,6 +99,11 @@ int of_dma_configure_id(struct device *dev, struct device_node *np,
>   	bool coherent, set_map = false;
>   	int ret;
>   
> +	if (dev->dma_range_map) {
> +		dev_dbg(dev, "dma_range_map already set\n");
> +		goto skip_map;
> +	}
> +
>   	if (np == dev->of_node)
>   		bus_np = __of_get_dma_parent(np);
>   	else
> @@ -119,7 +124,7 @@ int of_dma_configure_id(struct device *dev, struct device_node *np,
>   		end = dma_range_map_max(map);
>   		set_map = true;
>   	}
> -
> +skip_map:
>   	/*
>   	 * If @dev is expected to be DMA-capable then the bus code that created
>   	 * it should have initialised its dma_mask pointer by this point. For
> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> index f57ea36d125d..4468b7327cab 100644
> --- a/drivers/pci/pci-driver.c
> +++ b/drivers/pci/pci-driver.c
> @@ -1653,7 +1653,8 @@ static int pci_dma_configure(struct device *dev)
>   
>   	pci_put_host_bridge_device(bridge);
>   
> -	if (!ret && !driver->driver_managed_dma) {
> +	/* @driver may not be valid when we're called from the IOMMU layer */
> +	if (!ret && dev->driver && !driver->driver_managed_dma) {
>   		ret = iommu_device_use_default_domain(dev);
>   		if (ret)
>   			arch_teardown_dma_ops(dev);

Best regards
Robin Murphy March 13, 2025, 11:01 a.m. UTC | #10
On 2025-03-13 9:56 am, Marek Szyprowski wrote:
[...]
> This patch landed in yesterday's linux-next as commit bcb81ac6ae3c
> ("iommu: Get DT/ACPI parsing into the proper probe path"). In my tests I
> found it breaks booting of ARM64 RK3568-based Odroid-M1 board
> (arch/arm64/boot/dts/rockchip/rk3568-odroid-m1.dts). Here is the
> relevant kernel log:

...and the bug-flushing-out begins!

> Unable to handle kernel NULL pointer dereference at virtual address
> 00000000000003e8
> Mem abort info:
>     ESR = 0x0000000096000004
>     EC = 0x25: DABT (current EL), IL = 32 bits
>     SET = 0, FnV = 0
>     EA = 0, S1PTW = 0
>     FSC = 0x04: level 0 translation fault
> Data abort info:
>     ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
>     CM = 0, WnR = 0, TnD = 0, TagAccess = 0
>     GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
> [00000000000003e8] user address but active_mm is swapper
> Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
> Modules linked in:
> CPU: 3 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.14.0-rc3+ #15533
> Hardware name: Hardkernel ODROID-M1 (DT)
> pstate: 00400009 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> pc : devm_kmalloc+0x2c/0x114
> lr : rk_iommu_of_xlate+0x30/0x90
> ...
> Call trace:
>    devm_kmalloc+0x2c/0x114 (P)
>    rk_iommu_of_xlate+0x30/0x90

Yeah, looks like this is doing something a bit questionable which can't
work properly. TBH the whole dma_dev thing could probably be cleaned up
now that we have proper instances, but for now does this work?

(annoyingly none of my Rockchip boards are set up for testing right now, 
but I might have time to dig one out later)

Thanks,
Robin.

----->8-----

Subject: [PATCH] iommu/rockchip: Allocate per-device data sensibly

Now that DT-based probing is finally happening in the right order again,
it reveals an issue in Rockchip's of_xlate, which can now be called
during registration, but is using the global dma_dev which is only
assigned later. However, this makes little sense when we're already
looking up the correct IOMMU device, who should logically be the owner
of the devm allocation anyway.

Fixes: bcb81ac6ae3c ("iommu: Get DT/ACPI parsing into the proper probe 
path")
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---
  drivers/iommu/rockchip-iommu.c | 6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/rockchip-iommu.c b/drivers/iommu/rockchip-iommu.c
index 323cc665c357..48826d1ccfd8 100644
--- a/drivers/iommu/rockchip-iommu.c
+++ b/drivers/iommu/rockchip-iommu.c
@@ -1148,12 +1148,12 @@ static int rk_iommu_of_xlate(struct device *dev,
  	struct platform_device *iommu_dev;
  	struct rk_iommudata *data;

-	data = devm_kzalloc(dma_dev, sizeof(*data), GFP_KERNEL);
+	iommu_dev = of_find_device_by_node(args->np);
+
+	data = devm_kzalloc(&iommu_dev->dev, sizeof(*data), GFP_KERNEL);
  	if (!data)
  		return -ENOMEM;

-	iommu_dev = of_find_device_by_node(args->np);
-
  	data->iommu = platform_get_drvdata(iommu_dev);
  	data->iommu->domain = &rk_identity_domain;
  	dev_iommu_priv_set(dev, data);
Marek Szyprowski March 13, 2025, 12:23 p.m. UTC | #11
On 13.03.2025 12:01, Robin Murphy wrote:
> On 2025-03-13 9:56 am, Marek Szyprowski wrote:
> [...]
>> This patch landed in yesterday's linux-next as commit bcb81ac6ae3c
>> ("iommu: Get DT/ACPI parsing into the proper probe path"). In my tests I
>> found it breaks booting of ARM64 RK3568-based Odroid-M1 board
>> (arch/arm64/boot/dts/rockchip/rk3568-odroid-m1.dts). Here is the
>> relevant kernel log:
>
> ...and the bug-flushing-out begins!
>
>> Unable to handle kernel NULL pointer dereference at virtual address
>> 00000000000003e8
>> Mem abort info:
>>     ESR = 0x0000000096000004
>>     EC = 0x25: DABT (current EL), IL = 32 bits
>>     SET = 0, FnV = 0
>>     EA = 0, S1PTW = 0
>>     FSC = 0x04: level 0 translation fault
>> Data abort info:
>>     ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
>>     CM = 0, WnR = 0, TnD = 0, TagAccess = 0
>>     GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
>> [00000000000003e8] user address but active_mm is swapper
>> Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
>> Modules linked in:
>> CPU: 3 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.14.0-rc3+ #15533
>> Hardware name: Hardkernel ODROID-M1 (DT)
>> pstate: 00400009 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>> pc : devm_kmalloc+0x2c/0x114
>> lr : rk_iommu_of_xlate+0x30/0x90
>> ...
>> Call trace:
>>    devm_kmalloc+0x2c/0x114 (P)
>>    rk_iommu_of_xlate+0x30/0x90
>
> Yeah, looks like this is doing something a bit questionable which can't
> work properly. TBH the whole dma_dev thing could probably be cleaned up
> now that we have proper instances, but for now does this work?

Yes, this patch fixes the problem I've observed.

Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>

BTW, this dma_dev idea has been borrowed from my exynos_iommu driver and 
I doubt it can be cleaned up.


>
> (annoyingly none of my Rockchip boards are set up for testing right 
> now, but I might have time to dig one out later)
>
> Thanks,
> Robin.
>
> ----->8-----
>
> Subject: [PATCH] iommu/rockchip: Allocate per-device data sensibly
>
> Now that DT-based probing is finally happening in the right order again,
> it reveals an issue in Rockchip's of_xlate, which can now be called
> during registration, but is using the global dma_dev which is only
> assigned later. However, this makes little sense when we're already
> looking up the correct IOMMU device, who should logically be the owner
> of the devm allocation anyway.
>
> Fixes: bcb81ac6ae3c ("iommu: Get DT/ACPI parsing into the proper probe 
> path")
> Signed-off-by: Robin Murphy <robin.murphy@arm.com>
> ---
>  drivers/iommu/rockchip-iommu.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/iommu/rockchip-iommu.c 
> b/drivers/iommu/rockchip-iommu.c
> index 323cc665c357..48826d1ccfd8 100644
> --- a/drivers/iommu/rockchip-iommu.c
> +++ b/drivers/iommu/rockchip-iommu.c
> @@ -1148,12 +1148,12 @@ static int rk_iommu_of_xlate(struct device *dev,
>      struct platform_device *iommu_dev;
>      struct rk_iommudata *data;
>
> -    data = devm_kzalloc(dma_dev, sizeof(*data), GFP_KERNEL);
> +    iommu_dev = of_find_device_by_node(args->np);
> +
> +    data = devm_kzalloc(&iommu_dev->dev, sizeof(*data), GFP_KERNEL);
>      if (!data)
>          return -ENOMEM;
>
> -    iommu_dev = of_find_device_by_node(args->np);
> -
>      data->iommu = platform_get_drvdata(iommu_dev);
>      data->iommu->domain = &rk_identity_domain;
>      dev_iommu_priv_set(dev, data);

Best regards
Robin Murphy March 13, 2025, 1:06 p.m. UTC | #12
On 2025-03-13 12:23 pm, Marek Szyprowski wrote:
> On 13.03.2025 12:01, Robin Murphy wrote:
>> On 2025-03-13 9:56 am, Marek Szyprowski wrote:
>> [...]
>>> This patch landed in yesterday's linux-next as commit bcb81ac6ae3c
>>> ("iommu: Get DT/ACPI parsing into the proper probe path"). In my tests I
>>> found it breaks booting of ARM64 RK3568-based Odroid-M1 board
>>> (arch/arm64/boot/dts/rockchip/rk3568-odroid-m1.dts). Here is the
>>> relevant kernel log:
>>
>> ...and the bug-flushing-out begins!
>>
>>> Unable to handle kernel NULL pointer dereference at virtual address
>>> 00000000000003e8
>>> Mem abort info:
>>>      ESR = 0x0000000096000004
>>>      EC = 0x25: DABT (current EL), IL = 32 bits
>>>      SET = 0, FnV = 0
>>>      EA = 0, S1PTW = 0
>>>      FSC = 0x04: level 0 translation fault
>>> Data abort info:
>>>      ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
>>>      CM = 0, WnR = 0, TnD = 0, TagAccess = 0
>>>      GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
>>> [00000000000003e8] user address but active_mm is swapper
>>> Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
>>> Modules linked in:
>>> CPU: 3 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.14.0-rc3+ #15533
>>> Hardware name: Hardkernel ODROID-M1 (DT)
>>> pstate: 00400009 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>> pc : devm_kmalloc+0x2c/0x114
>>> lr : rk_iommu_of_xlate+0x30/0x90
>>> ...
>>> Call trace:
>>>     devm_kmalloc+0x2c/0x114 (P)
>>>     rk_iommu_of_xlate+0x30/0x90
>>
>> Yeah, looks like this is doing something a bit questionable which can't
>> work properly. TBH the whole dma_dev thing could probably be cleaned up
>> now that we have proper instances, but for now does this work?
> 
> Yes, this patch fixes the problem I've observed.
> 
> Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
> 
> BTW, this dma_dev idea has been borrowed from my exynos_iommu driver and
> I doubt it can be cleaned up.

On the contrary I suspect they both can - it all dates back to when we 
had the single global platform bus iommu_ops and the SoC drivers were 
forced to bodge their own notion of multiple instances, but with the 
modern core code, ops are always called via a valid IOMMU instance or 
domain, so in principle it should always be possible to get at an 
appropriate IOMMU device now. IIRC it was mostly about allocating and 
DMA-mapping the pagetables in domain_alloc, where the private notion of 
instances didn't have enough information, but domain_alloc_paging solves 
that.

Cheers,
Robin.
Robin Murphy March 13, 2025, 2:12 p.m. UTC | #13
On 2025-03-13 1:06 pm, Robin Murphy wrote:
> On 2025-03-13 12:23 pm, Marek Szyprowski wrote:
>> On 13.03.2025 12:01, Robin Murphy wrote:
>>> On 2025-03-13 9:56 am, Marek Szyprowski wrote:
>>> [...]
>>>> This patch landed in yesterday's linux-next as commit bcb81ac6ae3c
>>>> ("iommu: Get DT/ACPI parsing into the proper probe path"). In my 
>>>> tests I
>>>> found it breaks booting of ARM64 RK3568-based Odroid-M1 board
>>>> (arch/arm64/boot/dts/rockchip/rk3568-odroid-m1.dts). Here is the
>>>> relevant kernel log:
>>>
>>> ...and the bug-flushing-out begins!
>>>
>>>> Unable to handle kernel NULL pointer dereference at virtual address
>>>> 00000000000003e8
>>>> Mem abort info:
>>>>      ESR = 0x0000000096000004
>>>>      EC = 0x25: DABT (current EL), IL = 32 bits
>>>>      SET = 0, FnV = 0
>>>>      EA = 0, S1PTW = 0
>>>>      FSC = 0x04: level 0 translation fault
>>>> Data abort info:
>>>>      ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
>>>>      CM = 0, WnR = 0, TnD = 0, TagAccess = 0
>>>>      GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
>>>> [00000000000003e8] user address but active_mm is swapper
>>>> Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
>>>> Modules linked in:
>>>> CPU: 3 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.14.0-rc3+ #15533
>>>> Hardware name: Hardkernel ODROID-M1 (DT)
>>>> pstate: 00400009 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>>> pc : devm_kmalloc+0x2c/0x114
>>>> lr : rk_iommu_of_xlate+0x30/0x90
>>>> ...
>>>> Call trace:
>>>>     devm_kmalloc+0x2c/0x114 (P)
>>>>     rk_iommu_of_xlate+0x30/0x90
>>>
>>> Yeah, looks like this is doing something a bit questionable which can't
>>> work properly. TBH the whole dma_dev thing could probably be cleaned up
>>> now that we have proper instances, but for now does this work?
>>
>> Yes, this patch fixes the problem I've observed.
>>
>> Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
>> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
>>
>> BTW, this dma_dev idea has been borrowed from my exynos_iommu driver and
>> I doubt it can be cleaned up.
> 
> On the contrary I suspect they both can - it all dates back to when we 
> had the single global platform bus iommu_ops and the SoC drivers were 
> forced to bodge their own notion of multiple instances, but with the 
> modern core code, ops are always called via a valid IOMMU instance or 
> domain, so in principle it should always be possible to get at an 
> appropriate IOMMU device now. IIRC it was mostly about allocating and 
> DMA-mapping the pagetables in domain_alloc, where the private notion of 
> instances didn't have enough information, but domain_alloc_paging solves 
> that.

Bah, in fact I think I am going to have to do that now, since although 
it doesn't crash, rk_domain_alloc_paging() will also be failing for the 
same reason. Time to find a PSU for the RK3399 board, I guess...

(Or maybe just move the dma_dev assignment earlier to match Exynos?)

Thanks,
Robin.
Marek Szyprowski March 17, 2025, 7:37 a.m. UTC | #14
On 13.03.2025 15:12, Robin Murphy wrote:
> On 2025-03-13 1:06 pm, Robin Murphy wrote:
>> On 2025-03-13 12:23 pm, Marek Szyprowski wrote:
>>> On 13.03.2025 12:01, Robin Murphy wrote:
>>>> On 2025-03-13 9:56 am, Marek Szyprowski wrote:
>>>> [...]
>>>>> This patch landed in yesterday's linux-next as commit bcb81ac6ae3c
>>>>> ("iommu: Get DT/ACPI parsing into the proper probe path"). In my 
>>>>> tests I
>>>>> found it breaks booting of ARM64 RK3568-based Odroid-M1 board
>>>>> (arch/arm64/boot/dts/rockchip/rk3568-odroid-m1.dts). Here is the
>>>>> relevant kernel log:
>>>>
>>>> ...and the bug-flushing-out begins!
>>>>
>>>>> Unable to handle kernel NULL pointer dereference at virtual address
>>>>> 00000000000003e8
>>>>> Mem abort info:
>>>>>      ESR = 0x0000000096000004
>>>>>      EC = 0x25: DABT (current EL), IL = 32 bits
>>>>>      SET = 0, FnV = 0
>>>>>      EA = 0, S1PTW = 0
>>>>>      FSC = 0x04: level 0 translation fault
>>>>> Data abort info:
>>>>>      ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
>>>>>      CM = 0, WnR = 0, TnD = 0, TagAccess = 0
>>>>>      GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
>>>>> [00000000000003e8] user address but active_mm is swapper
>>>>> Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
>>>>> Modules linked in:
>>>>> CPU: 3 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.14.0-rc3+ #15533
>>>>> Hardware name: Hardkernel ODROID-M1 (DT)
>>>>> pstate: 00400009 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>>>> pc : devm_kmalloc+0x2c/0x114
>>>>> lr : rk_iommu_of_xlate+0x30/0x90
>>>>> ...
>>>>> Call trace:
>>>>>     devm_kmalloc+0x2c/0x114 (P)
>>>>>     rk_iommu_of_xlate+0x30/0x90
>>>>
>>>> Yeah, looks like this is doing something a bit questionable which 
>>>> can't
>>>> work properly. TBH the whole dma_dev thing could probably be 
>>>> cleaned up
>>>> now that we have proper instances, but for now does this work?
>>>
>>> Yes, this patch fixes the problem I've observed.
>>>
>>> Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
>>> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
>>>
>>> BTW, this dma_dev idea has been borrowed from my exynos_iommu driver 
>>> and
>>> I doubt it can be cleaned up.
>>
>> On the contrary I suspect they both can - it all dates back to when 
>> we had the single global platform bus iommu_ops and the SoC drivers 
>> were forced to bodge their own notion of multiple instances, but with 
>> the modern core code, ops are always called via a valid IOMMU 
>> instance or domain, so in principle it should always be possible to 
>> get at an appropriate IOMMU device now. IIRC it was mostly about 
>> allocating and DMA-mapping the pagetables in domain_alloc, where the 
>> private notion of instances didn't have enough information, but 
>> domain_alloc_paging solves that.
>
> Bah, in fact I think I am going to have to do that now, since although 
> it doesn't crash, rk_domain_alloc_paging() will also be failing for 
> the same reason. Time to find a PSU for the RK3399 board, I guess...
>
> (Or maybe just move the dma_dev assignment earlier to match Exynos?)

Well I just found that Exynos IOMMU is also broken on some on my test 
boards. It looks that the runtime pm links are somehow not correctly 
established. I will try to analyze this later in the afternoon.


Best regards
Robin Murphy March 17, 2025, 6:22 p.m. UTC | #15
On 17/03/2025 7:37 am, Marek Szyprowski wrote:
> On 13.03.2025 15:12, Robin Murphy wrote:
>> On 2025-03-13 1:06 pm, Robin Murphy wrote:
>>> On 2025-03-13 12:23 pm, Marek Szyprowski wrote:
>>>> On 13.03.2025 12:01, Robin Murphy wrote:
>>>>> On 2025-03-13 9:56 am, Marek Szyprowski wrote:
>>>>> [...]
>>>>>> This patch landed in yesterday's linux-next as commit bcb81ac6ae3c
>>>>>> ("iommu: Get DT/ACPI parsing into the proper probe path"). In my
>>>>>> tests I
>>>>>> found it breaks booting of ARM64 RK3568-based Odroid-M1 board
>>>>>> (arch/arm64/boot/dts/rockchip/rk3568-odroid-m1.dts). Here is the
>>>>>> relevant kernel log:
>>>>>
>>>>> ...and the bug-flushing-out begins!
>>>>>
>>>>>> Unable to handle kernel NULL pointer dereference at virtual address
>>>>>> 00000000000003e8
>>>>>> Mem abort info:
>>>>>>       ESR = 0x0000000096000004
>>>>>>       EC = 0x25: DABT (current EL), IL = 32 bits
>>>>>>       SET = 0, FnV = 0
>>>>>>       EA = 0, S1PTW = 0
>>>>>>       FSC = 0x04: level 0 translation fault
>>>>>> Data abort info:
>>>>>>       ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
>>>>>>       CM = 0, WnR = 0, TnD = 0, TagAccess = 0
>>>>>>       GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
>>>>>> [00000000000003e8] user address but active_mm is swapper
>>>>>> Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
>>>>>> Modules linked in:
>>>>>> CPU: 3 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.14.0-rc3+ #15533
>>>>>> Hardware name: Hardkernel ODROID-M1 (DT)
>>>>>> pstate: 00400009 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>>>>> pc : devm_kmalloc+0x2c/0x114
>>>>>> lr : rk_iommu_of_xlate+0x30/0x90
>>>>>> ...
>>>>>> Call trace:
>>>>>>      devm_kmalloc+0x2c/0x114 (P)
>>>>>>      rk_iommu_of_xlate+0x30/0x90
>>>>>
>>>>> Yeah, looks like this is doing something a bit questionable which
>>>>> can't
>>>>> work properly. TBH the whole dma_dev thing could probably be
>>>>> cleaned up
>>>>> now that we have proper instances, but for now does this work?
>>>>
>>>> Yes, this patch fixes the problem I've observed.
>>>>
>>>> Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
>>>> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
>>>>
>>>> BTW, this dma_dev idea has been borrowed from my exynos_iommu driver
>>>> and
>>>> I doubt it can be cleaned up.
>>>
>>> On the contrary I suspect they both can - it all dates back to when
>>> we had the single global platform bus iommu_ops and the SoC drivers
>>> were forced to bodge their own notion of multiple instances, but with
>>> the modern core code, ops are always called via a valid IOMMU
>>> instance or domain, so in principle it should always be possible to
>>> get at an appropriate IOMMU device now. IIRC it was mostly about
>>> allocating and DMA-mapping the pagetables in domain_alloc, where the
>>> private notion of instances didn't have enough information, but
>>> domain_alloc_paging solves that.
>>
>> Bah, in fact I think I am going to have to do that now, since although
>> it doesn't crash, rk_domain_alloc_paging() will also be failing for
>> the same reason. Time to find a PSU for the RK3399 board, I guess...
>>
>> (Or maybe just move the dma_dev assignment earlier to match Exynos?)
> 
> Well I just found that Exynos IOMMU is also broken on some on my test
> boards. It looks that the runtime pm links are somehow not correctly
> established. I will try to analyze this later in the afternoon.

Hmm, I tried to get an Odroid-XU3 up and running, but it seems unable to 
boot my original 6.14-rc3-based branch - even with the IOMMU driver 
disabled, it's consistently dying somewhere near (or just after) init 
with what looks like some catastrophic memory corruption issue - very 
occasionally it's managed to print the first line of various different 
panics.

Before that point though, with the IOMMU driver enabled it does appear 
to show signs of working OK:

[    0.649703] exynos-sysmmu 14650000.sysmmu: hardware version: 3.3
[    0.654220] platform 14450000.mixer: Adding to iommu group 1
...
[    2.680920] exynos-mixer 14450000.mixer: exynos_iommu_attach_device: 
Attached IOMMU with pgtable 0x42924000
...
[    5.196674] exynos-mixer 14450000.mixer: 
exynos_iommu_identity_attach: Restored IOMMU to IDENTITY from pgtable 
0x42924000
[    5.207091] exynos-mixer 14450000.mixer: exynos_iommu_attach_device: 
Attached IOMMU with pgtable 0x42884000


The multi-instance stuff in probe/release does look a bit suspect, 
however - seems like the second instance probe would overwrite the first 
instance's links, and then there would be a double-del() if the device 
were ever actually released again? I may have made that much more likely 
to happen, but I suspect it was already possible with async driver probe...

Thanks,
Robin.
Geert Uytterhoeven March 18, 2025, 4:37 p.m. UTC | #16
Hi Robin,

On Fri, 28 Feb 2025 at 16:51, Robin Murphy <robin.murphy@arm.com> wrote:
> In hindsight, there were some crucial subtleties overlooked when moving
> {of,acpi}_dma_configure() to driver probe time to allow waiting for
> IOMMU drivers with -EPROBE_DEFER, and these have become an
> ever-increasing source of problems. The IOMMU API has some fundamental
> assumptions that iommu_probe_device() is called for every device added
> to the system, in the order in which they are added. Calling it in a
> random order or not at all dependent on driver binding leads to
> malformed groups, a potential lack of isolation for devices with no
> driver, and all manner of unexpected concurrency and race conditions.
> We've attempted to mitigate the latter with point-fix bodges like
> iommu_probe_device_lock, but it's a losing battle and the time has come
> to bite the bullet and address the true source of the problem instead.
>
> The crux of the matter is that the firmware parsing actually serves two
> distinct purposes; one is identifying the IOMMU instance associated with
> a device so we can check its availability, the second is actually
> telling that instance about the relevant firmware-provided data for the
> device. However the latter also depends on the former, and at the time
> there was no good place to defer and retry that separately from the
> availability check we also wanted for client driver probe.
>
> Nowadays, though, we have a proper notion of multiple IOMMU instances in
> the core API itself, and each one gets a chance to probe its own devices
> upon registration, so we can finally make that work as intended for
> DT/IORT/VIOT platforms too. All we need is for iommu_probe_device() to
> be able to run the iommu_fwspec machinery currently buried deep in the
> wrong end of {of,acpi}_dma_configure(). Luckily it turns out to be
> surprisingly straightforward to bootstrap this transformation by pretty
> much just calling the same path twice. At client driver probe time,
> dev->driver is obviously set; conversely at device_add(), or a
> subsequent bus_iommu_probe(), any device waiting for an IOMMU really
> should *not* have a driver already, so we can use that as a condition to
> disambiguate the two cases, and avoid recursing back into the IOMMU core
> at the wrong times.
>
> Obviously this isn't the nicest thing, but for now it gives us a
> functional baseline to then unpick the layers in between without many
> more awkward cross-subsystem patches. There are some minor side-effects
> like dma_range_map potentially being created earlier, and some debug
> prints being repeated, but these aren't significantly detrimental. Let's
> make things work first, then deal with making them nice.
>
> With the basic flow finally in the right order again, the next step is
> probably turning the bus->dma_configure paths inside-out, since all we
> really need from bus code is its notion of which device and input ID(s)
> to parse the common firmware properties with...
>
> Acked-by: Bjorn Helgaas <bhelgaas@google.com> # pci-driver.c
> Acked-by: Rob Herring (Arm) <robh@kernel.org> # of/device.c
> Signed-off-by: Robin Murphy <robin.murphy@arm.com>

Thanks for your patch, which is now commit bcb81ac6ae3c2ef9 ("iommu:
Get DT/ACPI parsing into the proper probe path") in iommu/next.

This patch triggers two issues on R-Car Gen3 platforms:

1. I am seeing a warning on Renesas Salvator-XS with R-Car M3N
(but not on the similar board with R-Car H3), and only for SATA[1].
Unfortunately commit 73d2f10957f517e5 ("iommu: Don't warn prematurely
about dodgy probes") does not help:

    ------------[ cut here ]------------
    sata_rcar ee300000.sata: late IOMMU probe at driver bind,
something fishy here!
    WARNING: CPU: 1 PID: 13 at drivers/iommu/iommu.c:571
__iommu_probe_device+0x208/0x38c
    Modules linked in:
    CPU: 1 UID: 0 PID: 13 Comm: kworker/u8:1 Not tainted
6.14.0-rc3-rcar3-00020-g73d2f10957f5-dirty #315
    Hardware name: Renesas Salvator-X 2nd version board based on r8a77965 (DT)
    Workqueue: events_unbound deferred_probe_work_func
    pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
    pc : __iommu_probe_device+0x208/0x38c
    lr : __iommu_probe_device+0x208/0x38c
    sp : ffffffc086da39a0
    x29: ffffffc086da39b0 x28: 0000000000000000 x27: 0000000000000000
    x26: 0000000000000001 x25: ffffffc080e0e0ae x24: ffffffc080e0e0bb
    x23: 0000000000000000 x22: ffffff800bd3d090 x21: ffffffc080acf680
    x20: ffffff8008e8f780 x19: ffffff800aca8810 x18: 00000000e9f55e4c
    x17: 6874656d6f73202c x16: 646e696220726576 x15: 0720072007200720
    x14: 0720072007200720 x13: 0720072007200720 x12: 0720072007200720
    x11: 00000000000001ac x10: ffffffc086da3700 x9 : ffffffc083edb140
    x8 : ffffffc086da3698 x7 : ffffffc086da36a0 x6 : 00000000fff7ffff
    x5 : c0000000fff7ffff x4 : 0000000000000000 x3 : 0000000000000001
    x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffffff80083d5380
    Call trace:
     __iommu_probe_device+0x208/0x38c (P)
     iommu_probe_device+0x34/0x74
     of_iommu_configure+0x128/0x200
     of_dma_configure_id+0xdc/0x1d4
     platform_dma_configure+0x48/0x6c
     really_probe+0xf0/0x260
     __driver_probe_device+0xec/0x104
     driver_probe_device+0x3c/0xc0
     __device_attach_driver+0x58/0xcc
     bus_for_each_drv+0xb8/0xe0
     __device_attach+0xdc/0x138
     device_initial_probe+0x10/0x18
     bus_probe_device+0x38/0xa0
     deferred_probe_work_func+0xb4/0xcc
     process_scheduled_works+0x2e4/0x4a8
     worker_thread+0x144/0x1cc
     kthread+0x188/0x198
     ret_from_fork+0x10/0x20
    irq event stamp: 49052
    hardirqs last  enabled at (49051): [<ffffffc0800fb6a8>]
__up_console_sem+0x50/0x74
    hardirqs last disabled at (49052): [<ffffffc0809eb65c>] el1_dbg+0x20/0x6c
    softirqs last  enabled at (49046): [<ffffffc080096c50>]
handle_softirqs+0x1b0/0x3b4
    softirqs last disabled at (48839): [<ffffffc080010168>]
__do_softirq+0x10/0x18
    ---[ end trace 0000000000000000 ]---

I added debug prints to sata_rcar_probe(), and verified SATA is
probed at about the same time on R-Car H3 and M3N, and the probe is
never deferred.

Do you have a clue?


2. The IOMMU driver's iommu_ops.of_xlate() callback is called about
three times as much as before:

    +platform ec700000.dma-controller: ipmmu_of_xlate
    +platform ec720000.dma-controller: ipmmu_of_xlate
    +platform ec700000.dma-controller: ipmmu_of_xlate
    +platform ec720000.dma-controller: ipmmu_of_xlate
    +platform ec700000.dma-controller: ipmmu_of_xlate
    +platform ec720000.dma-controller: ipmmu_of_xlate
    +platform ec700000.dma-controller: ipmmu_of_xlate
    +platform ec720000.dma-controller: ipmmu_of_xlate
    +platform ec700000.dma-controller: ipmmu_of_xlate
    +platform ec720000.dma-controller: ipmmu_of_xlate
    +platform fea27000.fcp: ipmmu_of_xlate
    +platform fea2f000.fcp: ipmmu_of_xlate
    +platform ec700000.dma-controller: ipmmu_of_xlate
    +platform ec720000.dma-controller: ipmmu_of_xlate
    +platform fe950000.fcp: ipmmu_of_xlate
    +platform fe96f000.fcp: ipmmu_of_xlate
    +platform fea27000.fcp: ipmmu_of_xlate
    +platform fea2f000.fcp: ipmmu_of_xlate
    +platform fe9af000.fcp: ipmmu_of_xlate
     rcar-fcp fe950000.fcp: ipmmu_of_xlate
     rcar-fcp fe96f000.fcp: ipmmu_of_xlate
     rcar-fcp fea27000.fcp: ipmmu_of_xlate
     rcar-fcp fea2f000.fcp: ipmmu_of_xlate
     rcar-fcp fe9af000.fcp: ipmmu_of_xlate
     rcar-dmac ec700000.dma-controller: ipmmu_of_xlate
     rcar-dmac ec720000.dma-controller: ipmmu_of_xlate
    +platform e6700000.dma-controller: ipmmu_of_xlate
    +platform e6800000.ethernet: ipmmu_of_xlate
    +platform e6700000.dma-controller: ipmmu_of_xlate
    +platform e7300000.dma-controller: ipmmu_of_xlate
    +platform e7310000.dma-controller: ipmmu_of_xlate
    +platform e6800000.ethernet: ipmmu_of_xlate
    +platform ee100000.mmc: ipmmu_of_xlate
    +platform ee140000.mmc: ipmmu_of_xlate
    +platform ee160000.mmc: ipmmu_of_xlate
    +platform e6700000.dma-controller: ipmmu_of_xlate
    +platform e7300000.dma-controller: ipmmu_of_xlate
    +platform e7310000.dma-controller: ipmmu_of_xlate
    +platform e6800000.ethernet: ipmmu_of_xlate
    +platform ee100000.mmc: ipmmu_of_xlate
    +platform ee140000.mmc: ipmmu_of_xlate
    +platform ee160000.mmc: ipmmu_of_xlate
    +platform ee300000.sata: ipmmu_of_xlate
     sata_rcar ee300000.sata: ipmmu_of_xlate
     ravb e6800000.ethernet: ipmmu_of_xlate
    -renesas_sdhi_internal_dmac ee100000.mmc: ipmmu_of_xlate
    -renesas_sdhi_internal_dmac ee140000.mmc: ipmmu_of_xlate
    -renesas_sdhi_internal_dmac ee160000.mmc: ipmmu_of_xlate
     rcar-dmac e6700000.dma-controller: ipmmu_of_xlate
     rcar-dmac e7300000.dma-controller: ipmmu_of_xlate
     rcar-dmac e7310000.dma-controller: ipmmu_of_xlate

For some devices, it can be called up to 6 times. All of the duplicates
happen before the device is bound (cfr. "platform" instead of the
actual driver name):

      6 platform ec720000.dma-controller: ipmmu_of_xlate
      6 platform ec700000.dma-controller: ipmmu_of_xlate
      3 platform e6800000.ethernet: ipmmu_of_xlate
      3 platform e6700000.dma-controller: ipmmu_of_xlate
      2 platform fea2f000.fcp: ipmmu_of_xlate
      2 platform fea27000.fcp: ipmmu_of_xlate
      2 platform ee160000.mmc: ipmmu_of_xlate
      2 platform ee140000.mmc: ipmmu_of_xlate
      2 platform ee100000.mmc: ipmmu_of_xlate
      2 platform e7310000.dma-controller: ipmmu_of_xlate
      2 platform e7300000.dma-controller: ipmmu_of_xlate
      1 platform fe9af000.fcp: ipmmu_of_xlate
      1 platform fe96f000.fcp: ipmmu_of_xlate
      1 platform fe950000.fcp: ipmmu_of_xlate
      1 platform ee300000.sata: ipmmu_of_xlate
      1 sata_rcar ee300000.sata: ipmmu_of_xlate
      1 rcar-fcp fea2f000.fcp: ipmmu_of_xlate
      1 rcar-fcp fea27000.fcp: ipmmu_of_xlate
      1 rcar-fcp fe9af000.fcp: ipmmu_of_xlate
      1 rcar-fcp fe96f000.fcp: ipmmu_of_xlate
      1 rcar-fcp fe950000.fcp: ipmmu_of_xlate
      1 rcar-dmac ec720000.dma-controller: ipmmu_of_xlate
      1 rcar-dmac ec700000.dma-controller: ipmmu_of_xlate
      1 rcar-dmac e7310000.dma-controller: ipmmu_of_xlate
      1 rcar-dmac e7300000.dma-controller: ipmmu_of_xlate
      1 rcar-dmac e6700000.dma-controller: ipmmu_of_xlate
      1 ravb e6800000.ethernet: ipmmu_of_xlate

Before, the callback was called just once for each DMA slave device.

Is this intentional?

Thanks!

[1] SATA IOMMU on R-Car Gen3 needs an out-of-tree patch to add it to
    drivers/iommu/ipmmu-vmsa.c:devices_allowlist[].

Gr{oetje,eeting}s,

                        Geert
Robin Murphy March 18, 2025, 5:24 p.m. UTC | #17
Hi Geert,

On 18/03/2025 4:37 pm, Geert Uytterhoeven wrote:
[...]
> Thanks for your patch, which is now commit bcb81ac6ae3c2ef9 ("iommu:
> Get DT/ACPI parsing into the proper probe path") in iommu/next.
> 
> This patch triggers two issues on R-Car Gen3 platforms:
> 
> 1. I am seeing a warning on Renesas Salvator-XS with R-Car M3N
> (but not on the similar board with R-Car H3), and only for SATA[1].
> Unfortunately commit 73d2f10957f517e5 ("iommu: Don't warn prematurely
> about dodgy probes") does not help:
[...]
>      Call trace:
>       __iommu_probe_device+0x208/0x38c (P)
>       iommu_probe_device+0x34/0x74
>       of_iommu_configure+0x128/0x200
>       of_dma_configure_id+0xdc/0x1d4
>       platform_dma_configure+0x48/0x6c
>       really_probe+0xf0/0x260
>       __driver_probe_device+0xec/0x104
>       driver_probe_device+0x3c/0xc0

Hurrah, this is the warning doing the correct job - something *is* off
if we're now getting here without the IOMMU configuration being done
already (for a normal device with no other funny business going on).

> 2. The IOMMU driver's iommu_ops.of_xlate() callback is called about
> three times as much as before:

That would suggest that the fwspec gets set up OK, then something later
in the __iommu_probe_device() path fails and tears it down again, so the
next attempt starts from scratch. Do you see the "Cannot attach to
IPMMU" message firing? And similarly to the Rockchip case, does the
below help?

Thanks,
Robin.

----->8-----
diff --git a/drivers/iommu/ipmmu-vmsa.c b/drivers/iommu/ipmmu-vmsa.c
index 074daf1aac4e..5d416262ae5f 100644
--- a/drivers/iommu/ipmmu-vmsa.c
+++ b/drivers/iommu/ipmmu-vmsa.c
@@ -1081,6 +1081,7 @@ static int ipmmu_probe(struct platform_device *pdev)
  		}
  	}
  
+	platform_set_drvdata(pdev, mmu);
  	/*
  	 * Register the IPMMU to the IOMMU subsystem in the following cases:
  	 * - R-Car Gen2 IPMMU (all devices registered)
@@ -1103,7 +1104,6 @@ static int ipmmu_probe(struct platform_device *pdev)
  	 * ipmmu_init() after the probe function returns.
  	 */
  
-	platform_set_drvdata(pdev, mmu);
  
  	return 0;
  }
Marek Szyprowski March 21, 2025, 12:15 p.m. UTC | #18
On 17.03.2025 19:22, Robin Murphy wrote:
> On 17/03/2025 7:37 am, Marek Szyprowski wrote:
>> On 13.03.2025 15:12, Robin Murphy wrote:
>>> On 2025-03-13 1:06 pm, Robin Murphy wrote:
>>>> On 2025-03-13 12:23 pm, Marek Szyprowski wrote:
>>>>> On 13.03.2025 12:01, Robin Murphy wrote:
>>>>>> On 2025-03-13 9:56 am, Marek Szyprowski wrote:
>>>>>> [...]
>>>>>>> This patch landed in yesterday's linux-next as commit bcb81ac6ae3c
>>>>>>> ("iommu: Get DT/ACPI parsing into the proper probe path"). In my
>>>>>>> tests I
>>>>>>> found it breaks booting of ARM64 RK3568-based Odroid-M1 board
>>>>>>> (arch/arm64/boot/dts/rockchip/rk3568-odroid-m1.dts). Here is the
>>>>>>> relevant kernel log:
>>>>>>
>>>>>> ...and the bug-flushing-out begins!
>>>>>>
>>>>>>> Unable to handle kernel NULL pointer dereference at virtual address
>>>>>>> 00000000000003e8
>>>>>>> Mem abort info:
>>>>>>>       ESR = 0x0000000096000004
>>>>>>>       EC = 0x25: DABT (current EL), IL = 32 bits
>>>>>>>       SET = 0, FnV = 0
>>>>>>>       EA = 0, S1PTW = 0
>>>>>>>       FSC = 0x04: level 0 translation fault
>>>>>>> Data abort info:
>>>>>>>       ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
>>>>>>>       CM = 0, WnR = 0, TnD = 0, TagAccess = 0
>>>>>>>       GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
>>>>>>> [00000000000003e8] user address but active_mm is swapper
>>>>>>> Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
>>>>>>> Modules linked in:
>>>>>>> CPU: 3 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.14.0-rc3+ #15533
>>>>>>> Hardware name: Hardkernel ODROID-M1 (DT)
>>>>>>> pstate: 00400009 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>>>>>> pc : devm_kmalloc+0x2c/0x114
>>>>>>> lr : rk_iommu_of_xlate+0x30/0x90
>>>>>>> ...
>>>>>>> Call trace:
>>>>>>>      devm_kmalloc+0x2c/0x114 (P)
>>>>>>>      rk_iommu_of_xlate+0x30/0x90
>>>>>>
>>>>>> Yeah, looks like this is doing something a bit questionable which
>>>>>> can't
>>>>>> work properly. TBH the whole dma_dev thing could probably be
>>>>>> cleaned up
>>>>>> now that we have proper instances, but for now does this work?
>>>>>
>>>>> Yes, this patch fixes the problem I've observed.
>>>>>
>>>>> Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
>>>>> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
>>>>>
>>>>> BTW, this dma_dev idea has been borrowed from my exynos_iommu driver
>>>>> and
>>>>> I doubt it can be cleaned up.
>>>>
>>>> On the contrary I suspect they both can - it all dates back to when
>>>> we had the single global platform bus iommu_ops and the SoC drivers
>>>> were forced to bodge their own notion of multiple instances, but with
>>>> the modern core code, ops are always called via a valid IOMMU
>>>> instance or domain, so in principle it should always be possible to
>>>> get at an appropriate IOMMU device now. IIRC it was mostly about
>>>> allocating and DMA-mapping the pagetables in domain_alloc, where the
>>>> private notion of instances didn't have enough information, but
>>>> domain_alloc_paging solves that.
>>>
>>> Bah, in fact I think I am going to have to do that now, since although
>>> it doesn't crash, rk_domain_alloc_paging() will also be failing for
>>> the same reason. Time to find a PSU for the RK3399 board, I guess...
>>>
>>> (Or maybe just move the dma_dev assignment earlier to match Exynos?)
>>
>> Well I just found that Exynos IOMMU is also broken on some on my test
>> boards. It looks that the runtime pm links are somehow not correctly
>> established. I will try to analyze this later in the afternoon.
>
> Hmm, I tried to get an Odroid-XU3 up and running, but it seems unable 
> to boot my original 6.14-rc3-based branch - even with the IOMMU driver 
> disabled, it's consistently dying somewhere near (or just after) init 
> with what looks like some catastrophic memory corruption issue - very 
> occasionally it's managed to print the first line of various different 
> panics.
>
> Before that point though, with the IOMMU driver enabled it does appear 
> to show signs of working OK:
>
> [    0.649703] exynos-sysmmu 14650000.sysmmu: hardware version: 3.3
> [    0.654220] platform 14450000.mixer: Adding to iommu group 1
> ...
> [    2.680920] exynos-mixer 14450000.mixer: 
> exynos_iommu_attach_device: Attached IOMMU with pgtable 0x42924000
> ...
> [    5.196674] exynos-mixer 14450000.mixer: 
> exynos_iommu_identity_attach: Restored IOMMU to IDENTITY from pgtable 
> 0x42924000
> [    5.207091] exynos-mixer 14450000.mixer: 
> exynos_iommu_attach_device: Attached IOMMU with pgtable 0x42884000
>
>
> The multi-instance stuff in probe/release does look a bit suspect, 
> however - seems like the second instance probe would overwrite the 
> first instance's links, and then there would be a double-del() if the 
> device were ever actually released again? I may have made that much 
> more likely to happen, but I suspect it was already possible with 
> async driver probe...

That is really strange. My Odroid XU3 boots fine from commit 
bcb81ac6ae3c ("iommu: Get DT/ACPI parsing into the proper probe path"), 
although the IOMMU seems not to be working correctly. I've tested this 
with 14450000.mixer device (one need to attach HDMI cable to get it 
activated) and it looks that the video data are not being read from 
memory at all (the lack of VSYNC is reported, no IOMMU fault). However, 
from time to time, everything initializes and works properly.

It looks that this is somehow related to the different IOMMU/DMA-mapping 
glue code, as the other boards (ARM64 based) with exactly the same 
Exynos IOMMU driver always work fine. I've tried to figure out what 
actually happens, but so far I didn't get anything for sure. Disabling 
the call to dev->bus->dma_configure(dev) from iommu_init_device() seems 
to be fixing this, but this is almost equal to the revert of the 
$subject patch. I don't get why calling it in iommu_init_device() causes 
problems. It also doesn't look that this is anyhow related to the 
multi-instance stuff, as the same happens if I only leave a single 
exynos-sysmmu instance and its client (only 14450000.mixer device in the 
system).

Best regards
Robin Murphy March 21, 2025, 4:48 p.m. UTC | #19
On 21/03/2025 12:15 pm, Marek Szyprowski wrote:
> On 17.03.2025 19:22, Robin Murphy wrote:
>> On 17/03/2025 7:37 am, Marek Szyprowski wrote:
>>> On 13.03.2025 15:12, Robin Murphy wrote:
>>>> On 2025-03-13 1:06 pm, Robin Murphy wrote:
>>>>> On 2025-03-13 12:23 pm, Marek Szyprowski wrote:
>>>>>> On 13.03.2025 12:01, Robin Murphy wrote:
>>>>>>> On 2025-03-13 9:56 am, Marek Szyprowski wrote:
>>>>>>> [...]
>>>>>>>> This patch landed in yesterday's linux-next as commit bcb81ac6ae3c
>>>>>>>> ("iommu: Get DT/ACPI parsing into the proper probe path"). In my
>>>>>>>> tests I
>>>>>>>> found it breaks booting of ARM64 RK3568-based Odroid-M1 board
>>>>>>>> (arch/arm64/boot/dts/rockchip/rk3568-odroid-m1.dts). Here is the
>>>>>>>> relevant kernel log:
>>>>>>>
>>>>>>> ...and the bug-flushing-out begins!
>>>>>>>
>>>>>>>> Unable to handle kernel NULL pointer dereference at virtual address
>>>>>>>> 00000000000003e8
>>>>>>>> Mem abort info:
>>>>>>>>        ESR = 0x0000000096000004
>>>>>>>>        EC = 0x25: DABT (current EL), IL = 32 bits
>>>>>>>>        SET = 0, FnV = 0
>>>>>>>>        EA = 0, S1PTW = 0
>>>>>>>>        FSC = 0x04: level 0 translation fault
>>>>>>>> Data abort info:
>>>>>>>>        ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
>>>>>>>>        CM = 0, WnR = 0, TnD = 0, TagAccess = 0
>>>>>>>>        GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
>>>>>>>> [00000000000003e8] user address but active_mm is swapper
>>>>>>>> Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
>>>>>>>> Modules linked in:
>>>>>>>> CPU: 3 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.14.0-rc3+ #15533
>>>>>>>> Hardware name: Hardkernel ODROID-M1 (DT)
>>>>>>>> pstate: 00400009 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>>>>>>> pc : devm_kmalloc+0x2c/0x114
>>>>>>>> lr : rk_iommu_of_xlate+0x30/0x90
>>>>>>>> ...
>>>>>>>> Call trace:
>>>>>>>>       devm_kmalloc+0x2c/0x114 (P)
>>>>>>>>       rk_iommu_of_xlate+0x30/0x90
>>>>>>>
>>>>>>> Yeah, looks like this is doing something a bit questionable which
>>>>>>> can't
>>>>>>> work properly. TBH the whole dma_dev thing could probably be
>>>>>>> cleaned up
>>>>>>> now that we have proper instances, but for now does this work?
>>>>>>
>>>>>> Yes, this patch fixes the problem I've observed.
>>>>>>
>>>>>> Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
>>>>>> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
>>>>>>
>>>>>> BTW, this dma_dev idea has been borrowed from my exynos_iommu driver
>>>>>> and
>>>>>> I doubt it can be cleaned up.
>>>>>
>>>>> On the contrary I suspect they both can - it all dates back to when
>>>>> we had the single global platform bus iommu_ops and the SoC drivers
>>>>> were forced to bodge their own notion of multiple instances, but with
>>>>> the modern core code, ops are always called via a valid IOMMU
>>>>> instance or domain, so in principle it should always be possible to
>>>>> get at an appropriate IOMMU device now. IIRC it was mostly about
>>>>> allocating and DMA-mapping the pagetables in domain_alloc, where the
>>>>> private notion of instances didn't have enough information, but
>>>>> domain_alloc_paging solves that.
>>>>
>>>> Bah, in fact I think I am going to have to do that now, since although
>>>> it doesn't crash, rk_domain_alloc_paging() will also be failing for
>>>> the same reason. Time to find a PSU for the RK3399 board, I guess...
>>>>
>>>> (Or maybe just move the dma_dev assignment earlier to match Exynos?)
>>>
>>> Well I just found that Exynos IOMMU is also broken on some on my test
>>> boards. It looks that the runtime pm links are somehow not correctly
>>> established. I will try to analyze this later in the afternoon.
>>
>> Hmm, I tried to get an Odroid-XU3 up and running, but it seems unable
>> to boot my original 6.14-rc3-based branch - even with the IOMMU driver
>> disabled, it's consistently dying somewhere near (or just after) init
>> with what looks like some catastrophic memory corruption issue - very
>> occasionally it's managed to print the first line of various different
>> panics.
>>
>> Before that point though, with the IOMMU driver enabled it does appear
>> to show signs of working OK:
>>
>> [    0.649703] exynos-sysmmu 14650000.sysmmu: hardware version: 3.3
>> [    0.654220] platform 14450000.mixer: Adding to iommu group 1
>> ...
>> [    2.680920] exynos-mixer 14450000.mixer:
>> exynos_iommu_attach_device: Attached IOMMU with pgtable 0x42924000
>> ...
>> [    5.196674] exynos-mixer 14450000.mixer:
>> exynos_iommu_identity_attach: Restored IOMMU to IDENTITY from pgtable
>> 0x42924000
>> [    5.207091] exynos-mixer 14450000.mixer:
>> exynos_iommu_attach_device: Attached IOMMU with pgtable 0x42884000
>>
>>
>> The multi-instance stuff in probe/release does look a bit suspect,
>> however - seems like the second instance probe would overwrite the
>> first instance's links, and then there would be a double-del() if the
>> device were ever actually released again? I may have made that much
>> more likely to happen, but I suspect it was already possible with
>> async driver probe...
> 
> That is really strange. My Odroid XU3 boots fine from commit
> bcb81ac6ae3c ("iommu: Get DT/ACPI parsing into the proper probe path"),
> although the IOMMU seems not to be working correctly. I've tested this
> with 14450000.mixer device (one need to attach HDMI cable to get it
> activated) and it looks that the video data are not being read from
> memory at all (the lack of VSYNC is reported, no IOMMU fault). However,
> from time to time, everything initializes and works properly.

Urgh, seems my mistake was assuming exynos_defconfig was the right thing 
to begin from - bcb81ac6ae3c with that still dies in the same way (this 
time I saw a hint of spin_bug() being hit...), however a 
multi_v7_defconfig build does get to userspace OK again with no obvious 
signs of distress:

[root@alarm ~]# grep -Hr . /sys/kernel/iommu_groups/*/type
/sys/kernel/iommu_groups/0/type:identity
/sys/kernel/iommu_groups/1/type:identity
/sys/kernel/iommu_groups/10/type:identity
/sys/kernel/iommu_groups/2/type:identity
/sys/kernel/iommu_groups/3/type:identity
/sys/kernel/iommu_groups/4/type:identity
/sys/kernel/iommu_groups/5/type:identity
/sys/kernel/iommu_groups/6/type:identity
/sys/kernel/iommu_groups/7/type:identity
/sys/kernel/iommu_groups/8/type:identity
/sys/kernel/iommu_groups/9/type:identity

Annoyingly I do have an adapter for the fiddly micro-HDMI, but it's at 
home :(

> It looks that this is somehow related to the different IOMMU/DMA-mapping
> glue code, as the other boards (ARM64 based) with exactly the same
> Exynos IOMMU driver always work fine. I've tried to figure out what
> actually happens, but so far I didn't get anything for sure. Disabling
> the call to dev->bus->dma_configure(dev) from iommu_init_device() seems
> to be fixing this, but this is almost equal to the revert of the
> $subject patch. I don't get why calling it in iommu_init_device() causes
> problems. It also doesn't look that this is anyhow related to the
> multi-instance stuff, as the same happens if I only leave a single
> exynos-sysmmu instance and its client (only 14450000.mixer device in the
> system).

On a hunch I stuck a print in exynos_iommu_probe_device(), and it looks 
like in fact device_link_add() isn't getting called at all, and indeed 
your symptoms do sound like they could be explained by the IOMMU not 
being reliably resumed... lemme stare at exynos_iommu_of_xlate() a bit 
longer...

Thanks,
Robin.
Geert Uytterhoeven March 25, 2025, 3:32 p.m. UTC | #20
Hi Robin,

On Tue, 18 Mar 2025 at 18:24, Robin Murphy <robin.murphy@arm.com> wrote:
> On 18/03/2025 4:37 pm, Geert Uytterhoeven wrote:
> [...]
> > Thanks for your patch, which is now commit bcb81ac6ae3c2ef9 ("iommu:
> > Get DT/ACPI parsing into the proper probe path") in iommu/next.
> >
> > This patch triggers two issues on R-Car Gen3 platforms:
> >
> > 1. I am seeing a warning on Renesas Salvator-XS with R-Car M3N
> > (but not on the similar board with R-Car H3), and only for SATA[1].
> > Unfortunately commit 73d2f10957f517e5 ("iommu: Don't warn prematurely
> > about dodgy probes") does not help:
> [...]
> >      Call trace:
> >       __iommu_probe_device+0x208/0x38c (P)
> >       iommu_probe_device+0x34/0x74
> >       of_iommu_configure+0x128/0x200
> >       of_dma_configure_id+0xdc/0x1d4
> >       platform_dma_configure+0x48/0x6c
> >       really_probe+0xf0/0x260
> >       __driver_probe_device+0xec/0x104
> >       driver_probe_device+0x3c/0xc0
>
> Hurrah, this is the warning doing the correct job - something *is* off
> if we're now getting here without the IOMMU configuration being done
> already (for a normal device with no other funny business going on).
>
> > 2. The IOMMU driver's iommu_ops.of_xlate() callback is called about
> > three times as much as before:
>
> That would suggest that the fwspec gets set up OK, then something later
> in the __iommu_probe_device() path fails and tears it down again, so the
> next attempt starts from scratch. Do you see the "Cannot attach to
> IPMMU" message firing?

I do not see such messages.

> And similarly to the Rockchip case, does the
> below help?

The below is basically the same as your "[PATCH] iommu/ipmmu-vmsa:
Register in a sensible order"[1].  While that fixes my first issue,
it does not fix the second (harmless?) issue.

Note that I only noticed the second issue because I have local debug
code in soc_device_match().  Perhaps it happens, unnoticed, on other
systems too?

Thanks!

[1] https://lore.kernel.org/53be6667544de65a15415b699e38a9a965692e45.1742481687.git.robin.murphy@arm.com/

Gr{oetje,eeting}s,

                        Geert
Chen-Yu Tsai March 27, 2025, 9:47 a.m. UTC | #21
Hi,

On Fri, Feb 28, 2025 at 03:46:33PM +0000, Robin Murphy wrote:
> In hindsight, there were some crucial subtleties overlooked when moving
> {of,acpi}_dma_configure() to driver probe time to allow waiting for
> IOMMU drivers with -EPROBE_DEFER, and these have become an
> ever-increasing source of problems. The IOMMU API has some fundamental
> assumptions that iommu_probe_device() is called for every device added
> to the system, in the order in which they are added. Calling it in a
> random order or not at all dependent on driver binding leads to
> malformed groups, a potential lack of isolation for devices with no
> driver, and all manner of unexpected concurrency and race conditions.
> We've attempted to mitigate the latter with point-fix bodges like
> iommu_probe_device_lock, but it's a losing battle and the time has come
> to bite the bullet and address the true source of the problem instead.
> 
> The crux of the matter is that the firmware parsing actually serves two
> distinct purposes; one is identifying the IOMMU instance associated with
> a device so we can check its availability, the second is actually
> telling that instance about the relevant firmware-provided data for the
> device. However the latter also depends on the former, and at the time
> there was no good place to defer and retry that separately from the
> availability check we also wanted for client driver probe.
> 
> Nowadays, though, we have a proper notion of multiple IOMMU instances in
> the core API itself, and each one gets a chance to probe its own devices
> upon registration, so we can finally make that work as intended for
> DT/IORT/VIOT platforms too. All we need is for iommu_probe_device() to
> be able to run the iommu_fwspec machinery currently buried deep in the
> wrong end of {of,acpi}_dma_configure(). Luckily it turns out to be
> surprisingly straightforward to bootstrap this transformation by pretty
> much just calling the same path twice. At client driver probe time,
> dev->driver is obviously set; conversely at device_add(), or a
> subsequent bus_iommu_probe(), any device waiting for an IOMMU really
> should *not* have a driver already, so we can use that as a condition to
> disambiguate the two cases, and avoid recursing back into the IOMMU core
> at the wrong times.
> 
> Obviously this isn't the nicest thing, but for now it gives us a
> functional baseline to then unpick the layers in between without many
> more awkward cross-subsystem patches. There are some minor side-effects
> like dma_range_map potentially being created earlier, and some debug
> prints being repeated, but these aren't significantly detrimental. Let's
> make things work first, then deal with making them nice.
> 
> With the basic flow finally in the right order again, the next step is
> probably turning the bus->dma_configure paths inside-out, since all we
> really need from bus code is its notion of which device and input ID(s)
> to parse the common firmware properties with...
> 
> Acked-by: Bjorn Helgaas <bhelgaas@google.com> # pci-driver.c
> Acked-by: Rob Herring (Arm) <robh@kernel.org> # of/device.c
> Signed-off-by: Robin Murphy <robin.murphy@arm.com>

This change causes the MediaTek IOMMU driver to panic on probe,
resulting in multiple MediaTek platforms not being able to boot.
This was observed on Linus's tree at commit 1a9239bb4253
("Merge tag 'net-next-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next")
which just received the IOMMU updates a couple merge commits
prior.

The regression was bisected down to this patch on MT8183 Juniper
Chromebook. The error is a NULL pointer dereference. Here's the
decoded backtrace:

    Unable to handle kernel read from unreadable memory at virtual address 0000000000000000
    Mem abort info:
      ESR = 0x0000000096000005
    usb 1-1.1.2: new high-speed USB device number 6 using xhci-mtk
      EC = 0x25: DABT (current EL), IL = 32 bits
      SET = 0, FnV = 0
      EA = 0, S1PTW = 0
      FSC = 0x05: level 1 translation fault
    Data abort info:
      ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000
      CM = 0, WnR = 0, TnD = 0, TagAccess = 0
      GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
    [0000000000000000] user address but active_mm is swapper
    Internal error: Oops: 0000000096000005 [#1]  SMP
    Modules linked in:
    CPU: 4 UID: 0 PID: 68 Comm: kworker/u34:1 Tainted: G    B              6.14.0-05877-g1a9239bb4253 #621 PREEMPT  a6631d3f04612a5c23866dc67cf38316c6b023e0
    Tainted: [B]=BAD_PAGE
    Hardware name: Google juniper sku16 board (DT)
    Workqueue: events_unbound deferred_probe_work_func
    pstate: 40400005 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
    pc : mtk_iommu_device_group (drivers/iommu/mtk_iommu.c:368 drivers/iommu/mtk_iommu.c:940)
    lr : mtk_iommu_device_group (drivers/iommu/mtk_iommu.c:368 drivers/iommu/mtk_iommu.c:940)
    sp : ffffffc0809674a0
    x29: ffffffc0809674a0 x28: ffffff80c1f0f078 x27: ffffff80c1f0f460
    x26: ffffff80c1f0f458 x25: ffffffdb1b738788 x24: ffffffc0809676d0
    x23: ffffff80c8fe4130 x22: ffffffffffffffed x21: 0000000000000000
    x20: ffffff80c1f0f010 x19: ffffff80c8fd7d00 x18: 0000000000000000
    x17: 000000040044ffff x16: 00500072b5593510 x15: 0000000000000000
    x14: ffffff80c09fbb80 x13: ffffffa5bf03f000 x12: ffffffbb6399cdf1
    x11: 1ffffffb6399cdf0 x10: ffffffbb6399cdf0 x9 : dfffffc000000000
    x8 : 000000449c663210 x7 : ffffffdb1cce6f87 x6 : 0000000000000001
    x5 : ffffffdb1cce6f80 x4 : ffffffbb6399cdf1 x3 : 0000000000000000
    x2 : 0000000000000020 x1 : ffffff80c22ed940 x0 : 0000000000000001
    Call trace:
    mtk_iommu_device_group (drivers/iommu/mtk_iommu.c:368 drivers/iommu/mtk_iommu.c:940) (P)
    __iommu_probe_device (drivers/iommu/iommu.c:461 drivers/iommu/iommu.c:563)
    probe_iommu_group (drivers/iommu/iommu.c:1722)
    bus_for_each_dev (drivers/base/bus.c:370)
    iommu_device_register (drivers/iommu/iommu.c:1875 drivers/iommu/iommu.c:276)
    mtk_iommu_probe (drivers/iommu/mtk_iommu.c:1380)
    platform_probe (drivers/base/platform.c:1404)
    really_probe (drivers/base/dd.c:579 drivers/base/dd.c:658)
    __driver_probe_device (drivers/base/dd.c:800)
    driver_probe_device (drivers/base/dd.c:830)
    __device_attach_driver (drivers/base/dd.c:959)
    bus_for_each_drv (drivers/base/bus.c:462)
    __device_attach (drivers/base/dd.c:1032)
    device_initial_probe (drivers/base/dd.c:1080)
    bus_probe_device (drivers/base/bus.c:537)
    deferred_probe_work_func (drivers/base/dd.c:124)
    process_one_work (./arch/arm64/include/asm/jump_label.h:36 ./include/trace/events/workqueue.h:110 kernel/workqueue.c:3243)
    worker_thread (kernel/workqueue.c:3313 (discriminator 2) kernel/workqueue.c:3400 (discriminator 2))
    kthread (kernel/kthread.c:464)
    ret_from_fork (arch/arm64/kernel/entry.S:863)
    Code: 92800256 f940d2b5 aa1503e0 97e8dbd7 (f94002b5)
    All code
    ========
       0:	92800256 	mov	x22, #0xffffffffffffffed    	// #-19
       4:	f940d2b5 	ldr	x21, [x21, #416]
       8:	aa1503e0 	mov	x0, x21
       c:	97e8dbd7 	bl	0xffffffffffa36f68
      10:*	f94002b5 	ldr	x21, [x21]		<-- trapping instruction

    Code starting with the faulting instruction
    ===========================================
       0:	f94002b5 	ldr	x21, [x21]
    ---[ end trace 0000000000000000 ]---


ChenYu

> ---
> 
> v2:
>  - Comment bus driver changes for clarity
>  - Use dev->iommu as the now-robust replay condition
>  - Drop the device_iommu_mapped() checks in the firmware paths as they
>    weren't doing much - we can't replace probe_device_lock just yet...
>  
>  drivers/acpi/arm64/dma.c        |  5 +++++
>  drivers/acpi/scan.c             |  7 -------
>  drivers/amba/bus.c              |  3 ++-
>  drivers/base/platform.c         |  3 ++-
>  drivers/bus/fsl-mc/fsl-mc-bus.c |  3 ++-
>  drivers/cdx/cdx.c               |  3 ++-
>  drivers/iommu/iommu.c           | 24 +++++++++++++++++++++---
>  drivers/iommu/of_iommu.c        |  7 ++++++-
>  drivers/of/device.c             |  7 ++++++-
>  drivers/pci/pci-driver.c        |  3 ++-
>  10 files changed, 48 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/acpi/arm64/dma.c b/drivers/acpi/arm64/dma.c
> index 52b2abf88689..f30f138352b7 100644
> --- a/drivers/acpi/arm64/dma.c
> +++ b/drivers/acpi/arm64/dma.c
> @@ -26,6 +26,11 @@ void acpi_arch_dma_setup(struct device *dev)
>  	else
>  		end = (1ULL << 32) - 1;
>  
> +	if (dev->dma_range_map) {
> +		dev_dbg(dev, "dma_range_map already set\n");
> +		return;
> +	}
> +
>  	ret = acpi_dma_get_range(dev, &map);
>  	if (!ret && map) {
>  		end = dma_range_map_max(map);
> diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c
> index 9f4efa8f75a6..fb1fe9f3b1a3 100644
> --- a/drivers/acpi/scan.c
> +++ b/drivers/acpi/scan.c
> @@ -1632,13 +1632,6 @@ static int acpi_iommu_configure_id(struct device *dev, const u32 *id_in)
>  		err = viot_iommu_configure(dev);
>  	mutex_unlock(&iommu_probe_device_lock);
>  
> -	/*
> -	 * If we have reason to believe the IOMMU driver missed the initial
> -	 * iommu_probe_device() call for dev, replay it to get things in order.
> -	 */
> -	if (!err && dev->bus)
> -		err = iommu_probe_device(dev);
> -
>  	return err;
>  }
>  
> diff --git a/drivers/amba/bus.c b/drivers/amba/bus.c
> index 8ef259b4d037..71482d639a6d 100644
> --- a/drivers/amba/bus.c
> +++ b/drivers/amba/bus.c
> @@ -364,7 +364,8 @@ static int amba_dma_configure(struct device *dev)
>  		ret = acpi_dma_configure(dev, attr);
>  	}
>  
> -	if (!ret && !drv->driver_managed_dma) {
> +	/* @drv may not be valid when we're called from the IOMMU layer */
> +	if (!ret && dev->driver && !drv->driver_managed_dma) {
>  		ret = iommu_device_use_default_domain(dev);
>  		if (ret)
>  			arch_teardown_dma_ops(dev);
> diff --git a/drivers/base/platform.c b/drivers/base/platform.c
> index 6f2a33722c52..1813cfd0c4bd 100644
> --- a/drivers/base/platform.c
> +++ b/drivers/base/platform.c
> @@ -1451,7 +1451,8 @@ static int platform_dma_configure(struct device *dev)
>  		attr = acpi_get_dma_attr(to_acpi_device_node(fwnode));
>  		ret = acpi_dma_configure(dev, attr);
>  	}
> -	if (ret || drv->driver_managed_dma)
> +	/* @drv may not be valid when we're called from the IOMMU layer */
> +	if (ret || !dev->driver || drv->driver_managed_dma)
>  		return ret;
>  
>  	ret = iommu_device_use_default_domain(dev);
> diff --git a/drivers/bus/fsl-mc/fsl-mc-bus.c b/drivers/bus/fsl-mc/fsl-mc-bus.c
> index d1f3d327ddd1..a8be8cf246fb 100644
> --- a/drivers/bus/fsl-mc/fsl-mc-bus.c
> +++ b/drivers/bus/fsl-mc/fsl-mc-bus.c
> @@ -153,7 +153,8 @@ static int fsl_mc_dma_configure(struct device *dev)
>  	else
>  		ret = acpi_dma_configure_id(dev, DEV_DMA_COHERENT, &input_id);
>  
> -	if (!ret && !mc_drv->driver_managed_dma) {
> +	/* @mc_drv may not be valid when we're called from the IOMMU layer */
> +	if (!ret && dev->driver && !mc_drv->driver_managed_dma) {
>  		ret = iommu_device_use_default_domain(dev);
>  		if (ret)
>  			arch_teardown_dma_ops(dev);
> diff --git a/drivers/cdx/cdx.c b/drivers/cdx/cdx.c
> index c573ed2ee71a..780fb0c4adba 100644
> --- a/drivers/cdx/cdx.c
> +++ b/drivers/cdx/cdx.c
> @@ -360,7 +360,8 @@ static int cdx_dma_configure(struct device *dev)
>  		return ret;
>  	}
>  
> -	if (!ret && !cdx_drv->driver_managed_dma) {
> +	/* @cdx_drv may not be valid when we're called from the IOMMU layer */
> +	if (!ret && dev->driver && !cdx_drv->driver_managed_dma) {
>  		ret = iommu_device_use_default_domain(dev);
>  		if (ret)
>  			arch_teardown_dma_ops(dev);
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index a3b45b84f42b..1cec7074367a 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -414,9 +414,21 @@ static int iommu_init_device(struct device *dev)
>  	if (!dev_iommu_get(dev))
>  		return -ENOMEM;
>  	/*
> -	 * For FDT-based systems and ACPI IORT/VIOT, drivers register IOMMU
> -	 * instances with non-NULL fwnodes, and client devices should have been
> -	 * identified with a fwspec by this point. Otherwise, we can currently
> +	 * For FDT-based systems and ACPI IORT/VIOT, the common firmware parsing
> +	 * is buried in the bus dma_configure path. Properly unpicking that is
> +	 * still a big job, so for now just invoke the whole thing. The device
> +	 * already having a driver bound means dma_configure has already run and
> +	 * either found no IOMMU to wait for, or we're in its replay call right
> +	 * now, so either way there's no point calling it again.
> +	 */
> +	if (!dev->driver && dev->bus->dma_configure) {
> +		mutex_unlock(&iommu_probe_device_lock);
> +		dev->bus->dma_configure(dev);
> +		mutex_lock(&iommu_probe_device_lock);
> +	}
> +	/*
> +	 * At this point, relevant devices either now have a fwspec which will
> +	 * match ops registered with a non-NULL fwnode, or we can reasonably
>  	 * assume that only one of Intel, AMD, s390, PAMU or legacy SMMUv2 can
>  	 * be present, and that any of their registered instances has suitable
>  	 * ops for probing, and thus cheekily co-opt the same mechanism.
> @@ -426,6 +438,12 @@ static int iommu_init_device(struct device *dev)
>  		ret = -ENODEV;
>  		goto err_free;
>  	}
> +	/*
> +	 * And if we do now see any replay calls, they would indicate someone
> +	 * misusing the dma_configure path outside bus code.
> +	 */
> +	if (dev->driver)
> +		dev_WARN(dev, "late IOMMU probe at driver bind, something fishy here!\n");
>  
>  	if (!try_module_get(ops->owner)) {
>  		ret = -EINVAL;
> diff --git a/drivers/iommu/of_iommu.c b/drivers/iommu/of_iommu.c
> index e10a68b5ffde..6b989a62def2 100644
> --- a/drivers/iommu/of_iommu.c
> +++ b/drivers/iommu/of_iommu.c
> @@ -155,7 +155,12 @@ int of_iommu_configure(struct device *dev, struct device_node *master_np,
>  		dev_iommu_free(dev);
>  	mutex_unlock(&iommu_probe_device_lock);
>  
> -	if (!err && dev->bus)
> +	/*
> +	 * If we're not on the iommu_probe_device() path (as indicated by the
> +	 * initial dev->iommu) then try to simulate it. This should no longer
> +	 * happen unless of_dma_configure() is being misused outside bus code.
> +	 */
> +	if (!err && dev->bus && !dev_iommu_present)
>  		err = iommu_probe_device(dev);
>  
>  	if (err && err != -EPROBE_DEFER)
> diff --git a/drivers/of/device.c b/drivers/of/device.c
> index edf3be197265..5053e5d532cc 100644
> --- a/drivers/of/device.c
> +++ b/drivers/of/device.c
> @@ -99,6 +99,11 @@ int of_dma_configure_id(struct device *dev, struct device_node *np,
>  	bool coherent, set_map = false;
>  	int ret;
>  
> +	if (dev->dma_range_map) {
> +		dev_dbg(dev, "dma_range_map already set\n");
> +		goto skip_map;
> +	}
> +
>  	if (np == dev->of_node)
>  		bus_np = __of_get_dma_parent(np);
>  	else
> @@ -119,7 +124,7 @@ int of_dma_configure_id(struct device *dev, struct device_node *np,
>  		end = dma_range_map_max(map);
>  		set_map = true;
>  	}
> -
> +skip_map:
>  	/*
>  	 * If @dev is expected to be DMA-capable then the bus code that created
>  	 * it should have initialised its dma_mask pointer by this point. For
> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> index f57ea36d125d..4468b7327cab 100644
> --- a/drivers/pci/pci-driver.c
> +++ b/drivers/pci/pci-driver.c
> @@ -1653,7 +1653,8 @@ static int pci_dma_configure(struct device *dev)
>  
>  	pci_put_host_bridge_device(bridge);
>  
> -	if (!ret && !driver->driver_managed_dma) {
> +	/* @driver may not be valid when we're called from the IOMMU layer */
> +	if (!ret && dev->driver && !driver->driver_managed_dma) {
>  		ret = iommu_device_use_default_domain(dev);
>  		if (ret)
>  			arch_teardown_dma_ops(dev);
> -- 
> 2.39.2.101.g768bb238c484.dirty
>
Louis-Alexis Eyraud March 27, 2025, 11 a.m. UTC | #22
Hi,

On Thu, 2025-03-27 at 17:47 +0800, Chen-Yu Tsai wrote:
> Hi,
> 
> On Fri, Feb 28, 2025 at 03:46:33PM +0000, Robin Murphy wrote:
> > In hindsight, there were some crucial subtleties overlooked when
> > moving
> > {of,acpi}_dma_configure() to driver probe time to allow waiting for
> > IOMMU drivers with -EPROBE_DEFER, and these have become an
> > ever-increasing source of problems. The IOMMU API has some
> > fundamental
> > assumptions that iommu_probe_device() is called for every device
> > added
> > to the system, in the order in which they are added. Calling it in
> > a
> > random order or not at all dependent on driver binding leads to
> > malformed groups, a potential lack of isolation for devices with no
> > driver, and all manner of unexpected concurrency and race
> > conditions.
> > We've attempted to mitigate the latter with point-fix bodges like
> > iommu_probe_device_lock, but it's a losing battle and the time has
> > come
> > to bite the bullet and address the true source of the problem
> > instead.
> > 
> > The crux of the matter is that the firmware parsing actually serves
> > two
> > distinct purposes; one is identifying the IOMMU instance associated
> > with
> > a device so we can check its availability, the second is actually
> > telling that instance about the relevant firmware-provided data for
> > the
> > device. However the latter also depends on the former, and at the
> > time
> > there was no good place to defer and retry that separately from the
> > availability check we also wanted for client driver probe.
> > 
> > Nowadays, though, we have a proper notion of multiple IOMMU
> > instances in
> > the core API itself, and each one gets a chance to probe its own
> > devices
> > upon registration, so we can finally make that work as intended for
> > DT/IORT/VIOT platforms too. All we need is for iommu_probe_device()
> > to
> > be able to run the iommu_fwspec machinery currently buried deep in
> > the
> > wrong end of {of,acpi}_dma_configure(). Luckily it turns out to be
> > surprisingly straightforward to bootstrap this transformation by
> > pretty
> > much just calling the same path twice. At client driver probe time,
> > dev->driver is obviously set; conversely at device_add(), or a
> > subsequent bus_iommu_probe(), any device waiting for an IOMMU
> > really
> > should *not* have a driver already, so we can use that as a
> > condition to
> > disambiguate the two cases, and avoid recursing back into the IOMMU
> > core
> > at the wrong times.
> > 
> > Obviously this isn't the nicest thing, but for now it gives us a
> > functional baseline to then unpick the layers in between without
> > many
> > more awkward cross-subsystem patches. There are some minor side-
> > effects
> > like dma_range_map potentially being created earlier, and some
> > debug
> > prints being repeated, but these aren't significantly detrimental.
> > Let's
> > make things work first, then deal with making them nice.
> > 
> > With the basic flow finally in the right order again, the next step
> > is
> > probably turning the bus->dma_configure paths inside-out, since all
> > we
> > really need from bus code is its notion of which device and input
> > ID(s)
> > to parse the common firmware properties with...
> > 
> > Acked-by: Bjorn Helgaas <bhelgaas@google.com> # pci-driver.c
> > Acked-by: Rob Herring (Arm) <robh@kernel.org> # of/device.c
> > Signed-off-by: Robin Murphy <robin.murphy@arm.com>
> 
> This change causes the MediaTek IOMMU driver to panic on probe,
> resulting in multiple MediaTek platforms not being able to boot.
> This was observed on Linus's tree at commit 1a9239bb4253
> ("Merge tag 'net-next-6.15' of
> git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next")
> which just received the IOMMU updates a couple merge commits
> prior.
> 
> The regression was bisected down to this patch on MT8183 Juniper
> Chromebook. The error is a NULL pointer dereference. Here's the
> decoded backtrace:
> 
>     Unable to handle kernel read from unreadable memory at virtual
> address 0000000000000000
>     Mem abort info:
>       ESR = 0x0000000096000005
>     usb 1-1.1.2: new high-speed USB device number 6 using xhci-mtk
>       EC = 0x25: DABT (current EL), IL = 32 bits
>       SET = 0, FnV = 0
>       EA = 0, S1PTW = 0
>       FSC = 0x05: level 1 translation fault
>     Data abort info:
>       ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000
>       CM = 0, WnR = 0, TnD = 0, TagAccess = 0
>       GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
>     [0000000000000000] user address but active_mm is swapper
>     Internal error: Oops: 0000000096000005 [#1]  SMP
>     Modules linked in:
>     CPU: 4 UID: 0 PID: 68 Comm: kworker/u34:1 Tainted: G   
> B              6.14.0-05877-g1a9239bb4253 #621 PREEMPT 
> a6631d3f04612a5c23866dc67cf38316c6b023e0
>     Tainted: [B]=BAD_PAGE
>     Hardware name: Google juniper sku16 board (DT)
>     Workqueue: events_unbound deferred_probe_work_func
>     pstate: 40400005 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>     pc : mtk_iommu_device_group (drivers/iommu/mtk_iommu.c:368
> drivers/iommu/mtk_iommu.c:940)
>     lr : mtk_iommu_device_group (drivers/iommu/mtk_iommu.c:368
> drivers/iommu/mtk_iommu.c:940)
>     sp : ffffffc0809674a0
>     x29: ffffffc0809674a0 x28: ffffff80c1f0f078 x27: ffffff80c1f0f460
>     x26: ffffff80c1f0f458 x25: ffffffdb1b738788 x24: ffffffc0809676d0
>     x23: ffffff80c8fe4130 x22: ffffffffffffffed x21: 0000000000000000
>     x20: ffffff80c1f0f010 x19: ffffff80c8fd7d00 x18: 0000000000000000
>     x17: 000000040044ffff x16: 00500072b5593510 x15: 0000000000000000
>     x14: ffffff80c09fbb80 x13: ffffffa5bf03f000 x12: ffffffbb6399cdf1
>     x11: 1ffffffb6399cdf0 x10: ffffffbb6399cdf0 x9 : dfffffc000000000
>     x8 : 000000449c663210 x7 : ffffffdb1cce6f87 x6 : 0000000000000001
>     x5 : ffffffdb1cce6f80 x4 : ffffffbb6399cdf1 x3 : 0000000000000000
>     x2 : 0000000000000020 x1 : ffffff80c22ed940 x0 : 0000000000000001
>     Call trace:
>     mtk_iommu_device_group (drivers/iommu/mtk_iommu.c:368
> drivers/iommu/mtk_iommu.c:940) (P)
>     __iommu_probe_device (drivers/iommu/iommu.c:461
> drivers/iommu/iommu.c:563)
>     probe_iommu_group (drivers/iommu/iommu.c:1722)
>     bus_for_each_dev (drivers/base/bus.c:370)
>     iommu_device_register (drivers/iommu/iommu.c:1875
> drivers/iommu/iommu.c:276)
>     mtk_iommu_probe (drivers/iommu/mtk_iommu.c:1380)
>     platform_probe (drivers/base/platform.c:1404)
>     really_probe (drivers/base/dd.c:579 drivers/base/dd.c:658)
>     __driver_probe_device (drivers/base/dd.c:800)
>     driver_probe_device (drivers/base/dd.c:830)
>     __device_attach_driver (drivers/base/dd.c:959)
>     bus_for_each_drv (drivers/base/bus.c:462)
>     __device_attach (drivers/base/dd.c:1032)
>     device_initial_probe (drivers/base/dd.c:1080)
>     bus_probe_device (drivers/base/bus.c:537)
>     deferred_probe_work_func (drivers/base/dd.c:124)
>     process_one_work (./arch/arm64/include/asm/jump_label.h:36
> ./include/trace/events/workqueue.h:110 kernel/workqueue.c:3243)
>     worker_thread (kernel/workqueue.c:3313 (discriminator 2)
> kernel/workqueue.c:3400 (discriminator 2))
>     kthread (kernel/kthread.c:464)
>     ret_from_fork (arch/arm64/kernel/entry.S:863)
>     Code: 92800256 f940d2b5 aa1503e0 97e8dbd7 (f94002b5)
>     All code
>     ========
>        0:	92800256 	mov	x22,
> #0xffffffffffffffed    	// #-19
>        4:	f940d2b5 	ldr	x21, [x21, #416]
>        8:	aa1503e0 	mov	x0, x21
>        c:	97e8dbd7 	bl	0xffffffffffa36f68
>       10:*	f94002b5 	ldr	x21, [x21]		<--
> trapping instruction
> 
>     Code starting with the faulting instruction
>     ===========================================
>        0:	f94002b5 	ldr	x21, [x21]
>     ---[ end trace 0000000000000000 ]---
> 
> 
> ChenYu
> 
I've sent a patch [1] to fix this issue that I've also observed on
my Mediatek Genio boards (510-EVK and 1200-EVK).

[1]:https://lore.kernel.org/linux-iommu/20250327-fix-mtk-iommu-error-v1-1-df969158e752@collabora.com/

Regards,
Louis-Alexis

> > ---
> > 
> > v2:
> >  - Comment bus driver changes for clarity
> >  - Use dev->iommu as the now-robust replay condition
> >  - Drop the device_iommu_mapped() checks in the firmware paths as
> > they
> >    weren't doing much - we can't replace probe_device_lock just
> > yet...
> >  
> >  drivers/acpi/arm64/dma.c        |  5 +++++
> >  drivers/acpi/scan.c             |  7 -------
> >  drivers/amba/bus.c              |  3 ++-
> >  drivers/base/platform.c         |  3 ++-
> >  drivers/bus/fsl-mc/fsl-mc-bus.c |  3 ++-
> >  drivers/cdx/cdx.c               |  3 ++-
> >  drivers/iommu/iommu.c           | 24 +++++++++++++++++++++---
> >  drivers/iommu/of_iommu.c        |  7 ++++++-
> >  drivers/of/device.c             |  7 ++++++-
> >  drivers/pci/pci-driver.c        |  3 ++-
> >  10 files changed, 48 insertions(+), 17 deletions(-)
> > 
> > diff --git a/drivers/acpi/arm64/dma.c b/drivers/acpi/arm64/dma.c
> > index 52b2abf88689..f30f138352b7 100644
> > --- a/drivers/acpi/arm64/dma.c
> > +++ b/drivers/acpi/arm64/dma.c
> > @@ -26,6 +26,11 @@ void acpi_arch_dma_setup(struct device *dev)
> >  	else
> >  		end = (1ULL << 32) - 1;
> >  
> > +	if (dev->dma_range_map) {
> > +		dev_dbg(dev, "dma_range_map already set\n");
> > +		return;
> > +	}
> > +
> >  	ret = acpi_dma_get_range(dev, &map);
> >  	if (!ret && map) {
> >  		end = dma_range_map_max(map);
> > diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c
> > index 9f4efa8f75a6..fb1fe9f3b1a3 100644
> > --- a/drivers/acpi/scan.c
> > +++ b/drivers/acpi/scan.c
> > @@ -1632,13 +1632,6 @@ static int acpi_iommu_configure_id(struct
> > device *dev, const u32 *id_in)
> >  		err = viot_iommu_configure(dev);
> >  	mutex_unlock(&iommu_probe_device_lock);
> >  
> > -	/*
> > -	 * If we have reason to believe the IOMMU driver missed
> > the initial
> > -	 * iommu_probe_device() call for dev, replay it to get
> > things in order.
> > -	 */
> > -	if (!err && dev->bus)
> > -		err = iommu_probe_device(dev);
> > -
> >  	return err;
> >  }
> >  
> > diff --git a/drivers/amba/bus.c b/drivers/amba/bus.c
> > index 8ef259b4d037..71482d639a6d 100644
> > --- a/drivers/amba/bus.c
> > +++ b/drivers/amba/bus.c
> > @@ -364,7 +364,8 @@ static int amba_dma_configure(struct device
> > *dev)
> >  		ret = acpi_dma_configure(dev, attr);
> >  	}
> >  
> > -	if (!ret && !drv->driver_managed_dma) {
> > +	/* @drv may not be valid when we're called from the IOMMU
> > layer */
> > +	if (!ret && dev->driver && !drv->driver_managed_dma) {
> >  		ret = iommu_device_use_default_domain(dev);
> >  		if (ret)
> >  			arch_teardown_dma_ops(dev);
> > diff --git a/drivers/base/platform.c b/drivers/base/platform.c
> > index 6f2a33722c52..1813cfd0c4bd 100644
> > --- a/drivers/base/platform.c
> > +++ b/drivers/base/platform.c
> > @@ -1451,7 +1451,8 @@ static int platform_dma_configure(struct
> > device *dev)
> >  		attr =
> > acpi_get_dma_attr(to_acpi_device_node(fwnode));
> >  		ret = acpi_dma_configure(dev, attr);
> >  	}
> > -	if (ret || drv->driver_managed_dma)
> > +	/* @drv may not be valid when we're called from the IOMMU
> > layer */
> > +	if (ret || !dev->driver || drv->driver_managed_dma)
> >  		return ret;
> >  
> >  	ret = iommu_device_use_default_domain(dev);
> > diff --git a/drivers/bus/fsl-mc/fsl-mc-bus.c b/drivers/bus/fsl-
> > mc/fsl-mc-bus.c
> > index d1f3d327ddd1..a8be8cf246fb 100644
> > --- a/drivers/bus/fsl-mc/fsl-mc-bus.c
> > +++ b/drivers/bus/fsl-mc/fsl-mc-bus.c
> > @@ -153,7 +153,8 @@ static int fsl_mc_dma_configure(struct device
> > *dev)
> >  	else
> >  		ret = acpi_dma_configure_id(dev, DEV_DMA_COHERENT,
> > &input_id);
> >  
> > -	if (!ret && !mc_drv->driver_managed_dma) {
> > +	/* @mc_drv may not be valid when we're called from the
> > IOMMU layer */
> > +	if (!ret && dev->driver && !mc_drv->driver_managed_dma) {
> >  		ret = iommu_device_use_default_domain(dev);
> >  		if (ret)
> >  			arch_teardown_dma_ops(dev);
> > diff --git a/drivers/cdx/cdx.c b/drivers/cdx/cdx.c
> > index c573ed2ee71a..780fb0c4adba 100644
> > --- a/drivers/cdx/cdx.c
> > +++ b/drivers/cdx/cdx.c
> > @@ -360,7 +360,8 @@ static int cdx_dma_configure(struct device
> > *dev)
> >  		return ret;
> >  	}
> >  
> > -	if (!ret && !cdx_drv->driver_managed_dma) {
> > +	/* @cdx_drv may not be valid when we're called from the
> > IOMMU layer */
> > +	if (!ret && dev->driver && !cdx_drv->driver_managed_dma) {
> >  		ret = iommu_device_use_default_domain(dev);
> >  		if (ret)
> >  			arch_teardown_dma_ops(dev);
> > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > index a3b45b84f42b..1cec7074367a 100644
> > --- a/drivers/iommu/iommu.c
> > +++ b/drivers/iommu/iommu.c
> > @@ -414,9 +414,21 @@ static int iommu_init_device(struct device
> > *dev)
> >  	if (!dev_iommu_get(dev))
> >  		return -ENOMEM;
> >  	/*
> > -	 * For FDT-based systems and ACPI IORT/VIOT, drivers
> > register IOMMU
> > -	 * instances with non-NULL fwnodes, and client devices
> > should have been
> > -	 * identified with a fwspec by this point. Otherwise, we
> > can currently
> > +	 * For FDT-based systems and ACPI IORT/VIOT, the common
> > firmware parsing
> > +	 * is buried in the bus dma_configure path. Properly
> > unpicking that is
> > +	 * still a big job, so for now just invoke the whole
> > thing. The device
> > +	 * already having a driver bound means dma_configure has
> > already run and
> > +	 * either found no IOMMU to wait for, or we're in its
> > replay call right
> > +	 * now, so either way there's no point calling it again.
> > +	 */
> > +	if (!dev->driver && dev->bus->dma_configure) {
> > +		mutex_unlock(&iommu_probe_device_lock);
> > +		dev->bus->dma_configure(dev);
> > +		mutex_lock(&iommu_probe_device_lock);
> > +	}
> > +	/*
> > +	 * At this point, relevant devices either now have a
> > fwspec which will
> > +	 * match ops registered with a non-NULL fwnode, or we can
> > reasonably
> >  	 * assume that only one of Intel, AMD, s390, PAMU or
> > legacy SMMUv2 can
> >  	 * be present, and that any of their registered instances
> > has suitable
> >  	 * ops for probing, and thus cheekily co-opt the same
> > mechanism.
> > @@ -426,6 +438,12 @@ static int iommu_init_device(struct device
> > *dev)
> >  		ret = -ENODEV;
> >  		goto err_free;
> >  	}
> > +	/*
> > +	 * And if we do now see any replay calls, they would
> > indicate someone
> > +	 * misusing the dma_configure path outside bus code.
> > +	 */
> > +	if (dev->driver)
> > +		dev_WARN(dev, "late IOMMU probe at driver bind,
> > something fishy here!\n");
> >  
> >  	if (!try_module_get(ops->owner)) {
> >  		ret = -EINVAL;
> > diff --git a/drivers/iommu/of_iommu.c b/drivers/iommu/of_iommu.c
> > index e10a68b5ffde..6b989a62def2 100644
> > --- a/drivers/iommu/of_iommu.c
> > +++ b/drivers/iommu/of_iommu.c
> > @@ -155,7 +155,12 @@ int of_iommu_configure(struct device *dev,
> > struct device_node *master_np,
> >  		dev_iommu_free(dev);
> >  	mutex_unlock(&iommu_probe_device_lock);
> >  
> > -	if (!err && dev->bus)
> > +	/*
> > +	 * If we're not on the iommu_probe_device() path (as
> > indicated by the
> > +	 * initial dev->iommu) then try to simulate it. This
> > should no longer
> > +	 * happen unless of_dma_configure() is being misused
> > outside bus code.
> > +	 */
> > +	if (!err && dev->bus && !dev_iommu_present)
> >  		err = iommu_probe_device(dev);
> >  
> >  	if (err && err != -EPROBE_DEFER)
> > diff --git a/drivers/of/device.c b/drivers/of/device.c
> > index edf3be197265..5053e5d532cc 100644
> > --- a/drivers/of/device.c
> > +++ b/drivers/of/device.c
> > @@ -99,6 +99,11 @@ int of_dma_configure_id(struct device *dev,
> > struct device_node *np,
> >  	bool coherent, set_map = false;
> >  	int ret;
> >  
> > +	if (dev->dma_range_map) {
> > +		dev_dbg(dev, "dma_range_map already set\n");
> > +		goto skip_map;
> > +	}
> > +
> >  	if (np == dev->of_node)
> >  		bus_np = __of_get_dma_parent(np);
> >  	else
> > @@ -119,7 +124,7 @@ int of_dma_configure_id(struct device *dev,
> > struct device_node *np,
> >  		end = dma_range_map_max(map);
> >  		set_map = true;
> >  	}
> > -
> > +skip_map:
> >  	/*
> >  	 * If @dev is expected to be DMA-capable then the bus code
> > that created
> >  	 * it should have initialised its dma_mask pointer by this
> > point. For
> > diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> > index f57ea36d125d..4468b7327cab 100644
> > --- a/drivers/pci/pci-driver.c
> > +++ b/drivers/pci/pci-driver.c
> > @@ -1653,7 +1653,8 @@ static int pci_dma_configure(struct device
> > *dev)
> >  
> >  	pci_put_host_bridge_device(bridge);
> >  
> > -	if (!ret && !driver->driver_managed_dma) {
> > +	/* @driver may not be valid when we're called from the
> > IOMMU layer */
> > +	if (!ret && dev->driver && !driver->driver_managed_dma) {
> >  		ret = iommu_device_use_default_domain(dev);
> >  		if (ret)
> >  			arch_teardown_dma_ops(dev);
> > -- 
> > 2.39.2.101.g768bb238c484.dirty
> > 
>
diff mbox series

Patch

diff --git a/drivers/acpi/arm64/dma.c b/drivers/acpi/arm64/dma.c
index 52b2abf88689..f30f138352b7 100644
--- a/drivers/acpi/arm64/dma.c
+++ b/drivers/acpi/arm64/dma.c
@@ -26,6 +26,11 @@  void acpi_arch_dma_setup(struct device *dev)
 	else
 		end = (1ULL << 32) - 1;
 
+	if (dev->dma_range_map) {
+		dev_dbg(dev, "dma_range_map already set\n");
+		return;
+	}
+
 	ret = acpi_dma_get_range(dev, &map);
 	if (!ret && map) {
 		end = dma_range_map_max(map);
diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c
index 9f4efa8f75a6..fb1fe9f3b1a3 100644
--- a/drivers/acpi/scan.c
+++ b/drivers/acpi/scan.c
@@ -1632,13 +1632,6 @@  static int acpi_iommu_configure_id(struct device *dev, const u32 *id_in)
 		err = viot_iommu_configure(dev);
 	mutex_unlock(&iommu_probe_device_lock);
 
-	/*
-	 * If we have reason to believe the IOMMU driver missed the initial
-	 * iommu_probe_device() call for dev, replay it to get things in order.
-	 */
-	if (!err && dev->bus)
-		err = iommu_probe_device(dev);
-
 	return err;
 }
 
diff --git a/drivers/amba/bus.c b/drivers/amba/bus.c
index 8ef259b4d037..71482d639a6d 100644
--- a/drivers/amba/bus.c
+++ b/drivers/amba/bus.c
@@ -364,7 +364,8 @@  static int amba_dma_configure(struct device *dev)
 		ret = acpi_dma_configure(dev, attr);
 	}
 
-	if (!ret && !drv->driver_managed_dma) {
+	/* @drv may not be valid when we're called from the IOMMU layer */
+	if (!ret && dev->driver && !drv->driver_managed_dma) {
 		ret = iommu_device_use_default_domain(dev);
 		if (ret)
 			arch_teardown_dma_ops(dev);
diff --git a/drivers/base/platform.c b/drivers/base/platform.c
index 6f2a33722c52..1813cfd0c4bd 100644
--- a/drivers/base/platform.c
+++ b/drivers/base/platform.c
@@ -1451,7 +1451,8 @@  static int platform_dma_configure(struct device *dev)
 		attr = acpi_get_dma_attr(to_acpi_device_node(fwnode));
 		ret = acpi_dma_configure(dev, attr);
 	}
-	if (ret || drv->driver_managed_dma)
+	/* @drv may not be valid when we're called from the IOMMU layer */
+	if (ret || !dev->driver || drv->driver_managed_dma)
 		return ret;
 
 	ret = iommu_device_use_default_domain(dev);
diff --git a/drivers/bus/fsl-mc/fsl-mc-bus.c b/drivers/bus/fsl-mc/fsl-mc-bus.c
index d1f3d327ddd1..a8be8cf246fb 100644
--- a/drivers/bus/fsl-mc/fsl-mc-bus.c
+++ b/drivers/bus/fsl-mc/fsl-mc-bus.c
@@ -153,7 +153,8 @@  static int fsl_mc_dma_configure(struct device *dev)
 	else
 		ret = acpi_dma_configure_id(dev, DEV_DMA_COHERENT, &input_id);
 
-	if (!ret && !mc_drv->driver_managed_dma) {
+	/* @mc_drv may not be valid when we're called from the IOMMU layer */
+	if (!ret && dev->driver && !mc_drv->driver_managed_dma) {
 		ret = iommu_device_use_default_domain(dev);
 		if (ret)
 			arch_teardown_dma_ops(dev);
diff --git a/drivers/cdx/cdx.c b/drivers/cdx/cdx.c
index c573ed2ee71a..780fb0c4adba 100644
--- a/drivers/cdx/cdx.c
+++ b/drivers/cdx/cdx.c
@@ -360,7 +360,8 @@  static int cdx_dma_configure(struct device *dev)
 		return ret;
 	}
 
-	if (!ret && !cdx_drv->driver_managed_dma) {
+	/* @cdx_drv may not be valid when we're called from the IOMMU layer */
+	if (!ret && dev->driver && !cdx_drv->driver_managed_dma) {
 		ret = iommu_device_use_default_domain(dev);
 		if (ret)
 			arch_teardown_dma_ops(dev);
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index a3b45b84f42b..1cec7074367a 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -414,9 +414,21 @@  static int iommu_init_device(struct device *dev)
 	if (!dev_iommu_get(dev))
 		return -ENOMEM;
 	/*
-	 * For FDT-based systems and ACPI IORT/VIOT, drivers register IOMMU
-	 * instances with non-NULL fwnodes, and client devices should have been
-	 * identified with a fwspec by this point. Otherwise, we can currently
+	 * For FDT-based systems and ACPI IORT/VIOT, the common firmware parsing
+	 * is buried in the bus dma_configure path. Properly unpicking that is
+	 * still a big job, so for now just invoke the whole thing. The device
+	 * already having a driver bound means dma_configure has already run and
+	 * either found no IOMMU to wait for, or we're in its replay call right
+	 * now, so either way there's no point calling it again.
+	 */
+	if (!dev->driver && dev->bus->dma_configure) {
+		mutex_unlock(&iommu_probe_device_lock);
+		dev->bus->dma_configure(dev);
+		mutex_lock(&iommu_probe_device_lock);
+	}
+	/*
+	 * At this point, relevant devices either now have a fwspec which will
+	 * match ops registered with a non-NULL fwnode, or we can reasonably
 	 * assume that only one of Intel, AMD, s390, PAMU or legacy SMMUv2 can
 	 * be present, and that any of their registered instances has suitable
 	 * ops for probing, and thus cheekily co-opt the same mechanism.
@@ -426,6 +438,12 @@  static int iommu_init_device(struct device *dev)
 		ret = -ENODEV;
 		goto err_free;
 	}
+	/*
+	 * And if we do now see any replay calls, they would indicate someone
+	 * misusing the dma_configure path outside bus code.
+	 */
+	if (dev->driver)
+		dev_WARN(dev, "late IOMMU probe at driver bind, something fishy here!\n");
 
 	if (!try_module_get(ops->owner)) {
 		ret = -EINVAL;
diff --git a/drivers/iommu/of_iommu.c b/drivers/iommu/of_iommu.c
index e10a68b5ffde..6b989a62def2 100644
--- a/drivers/iommu/of_iommu.c
+++ b/drivers/iommu/of_iommu.c
@@ -155,7 +155,12 @@  int of_iommu_configure(struct device *dev, struct device_node *master_np,
 		dev_iommu_free(dev);
 	mutex_unlock(&iommu_probe_device_lock);
 
-	if (!err && dev->bus)
+	/*
+	 * If we're not on the iommu_probe_device() path (as indicated by the
+	 * initial dev->iommu) then try to simulate it. This should no longer
+	 * happen unless of_dma_configure() is being misused outside bus code.
+	 */
+	if (!err && dev->bus && !dev_iommu_present)
 		err = iommu_probe_device(dev);
 
 	if (err && err != -EPROBE_DEFER)
diff --git a/drivers/of/device.c b/drivers/of/device.c
index edf3be197265..5053e5d532cc 100644
--- a/drivers/of/device.c
+++ b/drivers/of/device.c
@@ -99,6 +99,11 @@  int of_dma_configure_id(struct device *dev, struct device_node *np,
 	bool coherent, set_map = false;
 	int ret;
 
+	if (dev->dma_range_map) {
+		dev_dbg(dev, "dma_range_map already set\n");
+		goto skip_map;
+	}
+
 	if (np == dev->of_node)
 		bus_np = __of_get_dma_parent(np);
 	else
@@ -119,7 +124,7 @@  int of_dma_configure_id(struct device *dev, struct device_node *np,
 		end = dma_range_map_max(map);
 		set_map = true;
 	}
-
+skip_map:
 	/*
 	 * If @dev is expected to be DMA-capable then the bus code that created
 	 * it should have initialised its dma_mask pointer by this point. For
diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index f57ea36d125d..4468b7327cab 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -1653,7 +1653,8 @@  static int pci_dma_configure(struct device *dev)
 
 	pci_put_host_bridge_device(bridge);
 
-	if (!ret && !driver->driver_managed_dma) {
+	/* @driver may not be valid when we're called from the IOMMU layer */
+	if (!ret && dev->driver && !driver->driver_managed_dma) {
 		ret = iommu_device_use_default_domain(dev);
 		if (ret)
 			arch_teardown_dma_ops(dev);