mbox series

[v2,00/10] Refine the locking for dev->iommu_group

Message ID 0-v2-b0417f84403e+11f-iommu_group_locking_jgg@nvidia.com (mailing list archive)
Headers show
Series Refine the locking for dev->iommu_group | expand

Message

Jason Gunthorpe July 31, 2023, 5:50 p.m. UTC
This series puts the core code usage of dev->iommu_group under a
consistent set of locking rules. Currently a lot of places in the code
access this value without any locking or READ_ONCE/etc techniques and it
is not clear how or why this is safe.

Make it so the following locking rules are used:

 - It is readable by a probe'd device driver. So long as a device driver
   is probed the dev->iommu_group will be guaranteed stable without further
   locking. The reader should provide some kind of locking that ensure's
   its remove() struct device_driver op cannot progress.

 - Read/Write under the device_lock(), this primarily protects against
   parallel probe of the same device, and parallel probe/remove. This is
   useful for places that don't naturally have a device driver probed, and
   should be rare.

 - Read/Write under the global dev_iommu_group_lock. This is used during
   probe time discovery of groups. Device drivers will scan unlocked
   portions of the device tree to locate an already existing group. These
   scans can access the dev->iommu_group under the global lock to single
   thread determining and installing the group. This ensures that groups
   are reliably formed.

Add locking assertions to enforce this.

Along the way remove a bunch of opencoded group touching in drivers trying
to implement a 'single group per driver' approach by providing a core helper
that does this common algorithm.

Overall this significantly reduces the number of places within
drivers/iommu calling iommu_group_get or touching groups at all.

This goes on top of Joerg's next tree, it needs the prior series
"Consolidate the probe_device path".

This is on github: https://github.com/jgunthorpe/linux/commits/iommu_group_locking

v2:
 - Revise comments
 - NULL singleton_group during iommu_device_unregister() for clarity
v1: https://lore.kernel.org/r/0-v1-3c8177327a47+256-iommu_group_locking_jgg@nvidia.com

Jason Gunthorpe (10):
  iommu: Remove useless group refcounting
  iommu: Add a lockdep assertion for remaining dev->iommu_group reads
  iommu: Add generic_single_device_group()
  iommu/sun50i: Convert to generic_single_device_group()
  iommu/sprd: Convert to generic_single_device_group()
  iommu/rockchip: Convert to generic_single_device_group()
  iommu/ipmmu-vmsa: Convert to generic_single_device_group()
  iommu/omap: Convert to generic_single_device_group()
  iommu: Complete the locking for dev->iommu_group
  iommu/intel: Fix missing locking for show_device_domain_translation()

 drivers/iommu/intel/debugfs.c  |  34 ++++----
 drivers/iommu/iommu.c          | 155 +++++++++++++++++++++------------
 drivers/iommu/ipmmu-vmsa.c     |  22 ++---
 drivers/iommu/omap-iommu.c     |  30 +------
 drivers/iommu/omap-iommu.h     |   2 +-
 drivers/iommu/rockchip-iommu.c |  22 +----
 drivers/iommu/sprd-iommu.c     |  24 +----
 drivers/iommu/sun50i-iommu.c   |  29 ++----
 include/linux/iommu.h          |   3 +
 9 files changed, 138 insertions(+), 183 deletions(-)


base-commit: a5003e75a1714857c01317d04982eef81331fe2f

Comments

Joerg Roedel Aug. 7, 2023, 12:54 p.m. UTC | #1
On Mon, Jul 31, 2023 at 02:50:23PM -0300, Jason Gunthorpe wrote:
> Jason Gunthorpe (10):
>   iommu: Remove useless group refcounting
>   iommu: Add a lockdep assertion for remaining dev->iommu_group reads
>   iommu: Add generic_single_device_group()
>   iommu/sun50i: Convert to generic_single_device_group()
>   iommu/sprd: Convert to generic_single_device_group()
>   iommu/rockchip: Convert to generic_single_device_group()
>   iommu/ipmmu-vmsa: Convert to generic_single_device_group()
>   iommu/omap: Convert to generic_single_device_group()
>   iommu: Complete the locking for dev->iommu_group
>   iommu/intel: Fix missing locking for show_device_domain_translation()
> 
>  drivers/iommu/intel/debugfs.c  |  34 ++++----
>  drivers/iommu/iommu.c          | 155 +++++++++++++++++++++------------
>  drivers/iommu/ipmmu-vmsa.c     |  22 ++---
>  drivers/iommu/omap-iommu.c     |  30 +------
>  drivers/iommu/omap-iommu.h     |   2 +-
>  drivers/iommu/rockchip-iommu.c |  22 +----
>  drivers/iommu/sprd-iommu.c     |  24 +----
>  drivers/iommu/sun50i-iommu.c   |  29 ++----
>  include/linux/iommu.h          |   3 +
>  9 files changed, 138 insertions(+), 183 deletions(-)

Applied, thanks for the nice cleanup!
Chen-Yu Tsai Aug. 8, 2023, 10:31 a.m. UTC | #2
Hi,

On Mon, Aug 7, 2023 at 8:54 PM Joerg Roedel <joro@8bytes.org> wrote:
>
> On Mon, Jul 31, 2023 at 02:50:23PM -0300, Jason Gunthorpe wrote:
> > Jason Gunthorpe (10):
> >   iommu: Remove useless group refcounting
> >   iommu: Add a lockdep assertion for remaining dev->iommu_group reads
> >   iommu: Add generic_single_device_group()
> >   iommu/sun50i: Convert to generic_single_device_group()
> >   iommu/sprd: Convert to generic_single_device_group()
> >   iommu/rockchip: Convert to generic_single_device_group()
> >   iommu/ipmmu-vmsa: Convert to generic_single_device_group()
> >   iommu/omap: Convert to generic_single_device_group()
> >   iommu: Complete the locking for dev->iommu_group
> >   iommu/intel: Fix missing locking for show_device_domain_translation()
> >
> >  drivers/iommu/intel/debugfs.c  |  34 ++++----
> >  drivers/iommu/iommu.c          | 155 +++++++++++++++++++++------------
> >  drivers/iommu/ipmmu-vmsa.c     |  22 ++---
> >  drivers/iommu/omap-iommu.c     |  30 +------
> >  drivers/iommu/omap-iommu.h     |   2 +-
> >  drivers/iommu/rockchip-iommu.c |  22 +----
> >  drivers/iommu/sprd-iommu.c     |  24 +----
> >  drivers/iommu/sun50i-iommu.c   |  29 ++----
> >  include/linux/iommu.h          |   3 +
> >  9 files changed, 138 insertions(+), 183 deletions(-)
>
> Applied, thanks for the nice cleanup!

This series seems to cause a hung task during boot on MediaTek platforms.
It hangs with next-20230808. Reverting the 10 commits from this series
makes the system boot up again.

ChenYu

Logs follow.

INFO: task swapper/0:1 blocked for more than 122 seconds.
      Not tainted 6.5.0-rc5-next-20230808-08004-g396bbe23dbf4 #859
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:swapper/0       state:D stack:0     pid:1     ppid:0      flags:0x00000008
Call trace:
 __switch_to+0x138/0x1e8
 __schedule+0x728/0x1388
 schedule+0xa8/0x170
 schedule_timeout+0x19c/0x1b8
 __wait_for_common+0x250/0x2c0
 wait_for_completion+0x28/0x40
 __flush_work+0x37c/0x6c0
 flush_work+0x1c/0x30
 deferred_probe_initcall+0x60/0xd0
 do_one_initcall+0xe0/0x4a0
 kernel_init_freeable+0x3a4/0x730
 kernel_init+0x2c/0x1f8
 ret_from_fork+0x10/0x20
INFO: task kworker/u18:1:67 blocked for more than 122 seconds.
      Not tainted 6.5.0-rc5-next-20230808-08004-g396bbe23dbf4 #859
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u18:1   state:D stack:0     pid:67    ppid:2      flags:0x00000008
Workqueue: events_unbound deferred_probe_work_func
Call trace:
 __switch_to+0x138/0x1e8
 __schedule+0x728/0x1388
 schedule+0xa8/0x170
 schedule_preempt_disabled+0x44/0x80
 __mutex_lock+0x3fc/0x598
 mutex_lock_nested+0x2c/0x40
 __iommu_probe_device+0xb8/0x6e0
 probe_iommu_group+0x18/0x38
 bus_for_each_dev+0xe4/0x168
 bus_iommu_probe+0x8c/0x240
 iommu_device_register+0x120/0x1b0
 mtk_iommu_probe+0x494/0x7a0
 platform_probe+0x94/0x100
 really_probe+0x1e4/0x3e8
 __driver_probe_device+0xc0/0x1a0
 driver_probe_device+0x110/0x1f0
 __device_attach_driver+0xf0/0x1b0
 bus_for_each_drv+0xf0/0x170
 __device_attach+0x120/0x240
 device_initial_probe+0x1c/0x30
 bus_probe_device+0xdc/0xe8
 deferred_probe_work_func+0xf0/0x140
 process_one_work+0x3b0/0x910
 worker_thread+0x33c/0x610
 kthread+0x1dc/0x1f0
 ret_from_fork+0x10/0x20

Showing all locks held in the system:
4 locks held by kworker/u18:1/67:
 #0: ffffff80c001b538 ((wq_completion)events_unbound){+.+.}-{0:0}, at:
process_one_work+0x2c4/0x910
 #1: ffffffc080777d40 (deferred_probe_work){+.+.}-{0:0}, at:
process_one_work+0x2c4/0x910
 #2: ffffff80c14090f8 (&dev->mutex){....}-{3:3}, at: __device_attach+0x8c/0x240
 #3: ffffff80c14090f8 (&dev->mutex){....}-{3:3}, at:
__iommu_probe_device+0xb8/0x6e0
1 lock held by khungtaskd/70:
 #0: ffffffe379f945a0 (rcu_read_lock){....}-{1:2}, at:
debug_show_all_locks+0x24/0x220

=============================================

Kernel panic - not syncing: hung_task: blocked tasks
CPU: 4 PID: 70 Comm: khungtaskd Not tainted
6.5.0-rc5-next-20230808-08004-g396bbe23dbf4 #859
55426859c267064a381312eb869e94c28566a87f
Hardware name: Google juniper sku16 board (DT)
Call trace:
 dump_backtrace+0xa0/0x100
 show_stack+0x20/0x38
 dump_stack_lvl+0xdc/0x148
 dump_stack+0x1c/0x28
 panic+0x460/0x4d8
 watchdog+0x4a4/0x9d0
 kthread+0x1dc/0x1f0
 ret_from_fork+0x10/0x20
SMP: stopping secondary CPUs
Kernel Offset: 0x22f7000000 from 0xffffffc080000000
PHYS_OFFSET: 0x40000000
CPU features: 0x0000000c,92010000,0800421b
Memory Limit: none
Rebooting in 30 seconds..
Jason Gunthorpe Aug. 8, 2023, 12:24 p.m. UTC | #3
On Tue, Aug 08, 2023 at 06:31:47PM +0800, Chen-Yu Tsai wrote:
> INFO: task kworker/u18:1:67 blocked for more than 122 seconds.
>       Not tainted 6.5.0-rc5-next-20230808-08004-g396bbe23dbf4 #859
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:kworker/u18:1   state:D stack:0     pid:67    ppid:2      flags:0x00000008
> Workqueue: events_unbound deferred_probe_work_func
> Call trace:
>  __switch_to+0x138/0x1e8
>  __schedule+0x728/0x1388
>  schedule+0xa8/0x170
>  schedule_preempt_disabled+0x44/0x80
>  __mutex_lock+0x3fc/0x598
>  mutex_lock_nested+0x2c/0x40
>  __iommu_probe_device+0xb8/0x6e0
>  probe_iommu_group+0x18/0x38
>  bus_for_each_dev+0xe4/0x168
>  bus_iommu_probe+0x8c/0x240
>  iommu_device_register+0x120/0x1b0
>  mtk_iommu_probe+0x494/0x7a0
>  platform_probe+0x94/0x100
>  really_probe+0x1e4/0x3e8
>  __driver_probe_device+0xc0/0x1a0
>  driver_probe_device+0x110/0x1f0
>  __device_attach_driver+0xf0/0x1b0
>  bus_for_each_drv+0xf0/0x170
>  __device_attach+0x120/0x240
>  device_initial_probe+0x1c/0x30
>  bus_probe_device+0xdc/0xe8
>  deferred_probe_work_func+0xf0/0x140
>  process_one_work+0x3b0/0x910
>  worker_thread+0x33c/0x610
>  kthread+0x1dc/0x1f0
>  ret_from_fork+0x10/0x20

Oh weird, I wonder why this didn't show up on my testing? It is trying
to probe the iommu device itself against the iommu and deadlocks on
the device_lock (which is held during probe):

> process_one_work+0x2c4/0x910
>  #2: ffffff80c14090f8 (&dev->mutex){....}-{3:3}, at: __device_attach+0x8c/0x240
>  #3: ffffff80c14090f8 (&dev->mutex){....}-{3:3}, at:

I suppose the easy fix is to just exclude the iommu driver from
probing, it doesn't have an iommu usually anyhow.

Does this work for you?

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 7dbbcffac21930..faa0e3520f66b0 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -273,7 +273,7 @@ int iommu_device_register(struct iommu_device *iommu,
 
 	for (int i = 0; i < ARRAY_SIZE(iommu_buses) && !err; i++) {
 		iommu_buses[i]->iommu_ops = ops;
-		err = bus_iommu_probe(iommu_buses[i]);
+		err = bus_iommu_probe(iommu_buses[i], iommu);
 	}
 	if (err)
 		iommu_device_unregister(iommu);
@@ -1784,12 +1784,21 @@ struct iommu_domain *iommu_group_default_domain(struct iommu_group *group)
 	return group->default_domain;
 }
 
+struct probe_iommu_args {
+	struct list_head *group_list;
+	struct iommu_device *iommu;
+};
+
 static int probe_iommu_group(struct device *dev, void *data)
 {
-	struct list_head *group_list = data;
+	struct probe_iommu_args *args = data;
 	int ret;
 
-	ret = __iommu_probe_device(dev, group_list);
+	/* We never probe the iommu device itself */
+	if (args->iommu && args->iommu->dev == dev)
+		return 0;
+
+	ret = __iommu_probe_device(dev, args->group_list);
 	if (ret == -ENODEV)
 		ret = 0;
 
@@ -1858,13 +1867,16 @@ static void iommu_group_do_probe_finalize(struct device *dev)
 		ops->probe_finalize(dev);
 }
 
-int bus_iommu_probe(const struct bus_type *bus)
+int bus_iommu_probe(const struct bus_type *bus, struct iommu_device *iommu)
 {
 	struct iommu_group *group, *next;
+	struct probe_iommu_args args = {};
 	LIST_HEAD(group_list);
 	int ret;
 
-	ret = bus_for_each_dev(bus, NULL, &group_list, probe_iommu_group);
+	args.group_list = &group_list;
+	args.iommu = iommu;
+	ret = bus_for_each_dev(bus, NULL, &args, probe_iommu_group);
 	if (ret)
 		return ret;
 
diff --git a/drivers/iommu/omap-iommu.c b/drivers/iommu/omap-iommu.c
index 97c45f50bf4332..e8ab7cb832ba37 100644
--- a/drivers/iommu/omap-iommu.c
+++ b/drivers/iommu/omap-iommu.c
@@ -1234,6 +1234,9 @@ static int omap_iommu_probe(struct platform_device *pdev)
 		if (err)
 			goto out_sysfs;
 		obj->has_iommu_driver = true;
+	} else {
+		/* Re-probe bus to probe device attached to this IOMMU */
+		bus_iommu_probe(&platform_bus_type, NULL);
 	}
 
 	pm_runtime_enable(obj->dev);
@@ -1242,9 +1245,6 @@ static int omap_iommu_probe(struct platform_device *pdev)
 
 	dev_info(&pdev->dev, "%s registered\n", obj->name);
 
-	/* Re-probe bus to probe device attached to this IOMMU */
-	bus_iommu_probe(&platform_bus_type);
-
 	return 0;
 
 out_sysfs:
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index f1e18e81fca78b..8e10225e0d611a 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -465,7 +465,8 @@ static inline const struct iommu_ops *dev_iommu_ops(struct device *dev)
 	return dev->iommu->iommu_dev->ops;
 }
 
-extern int bus_iommu_probe(const struct bus_type *bus);
+extern int bus_iommu_probe(const struct bus_type *bus,
+			   struct iommu_device *iommu);
 extern bool iommu_present(const struct bus_type *bus);
 extern bool device_iommu_capable(struct device *dev, enum iommu_cap cap);
 extern bool iommu_group_has_isolated_msi(struct iommu_group *group);
Marek Szyprowski Aug. 8, 2023, 12:32 p.m. UTC | #4
Hi,

On 08.08.2023 12:31, Chen-Yu Tsai wrote:
> Hi,
>
> On Mon, Aug 7, 2023 at 8:54 PM Joerg Roedel <joro@8bytes.org> wrote:
>> On Mon, Jul 31, 2023 at 02:50:23PM -0300, Jason Gunthorpe wrote:
>>> Jason Gunthorpe (10):
>>>    iommu: Remove useless group refcounting
>>>    iommu: Add a lockdep assertion for remaining dev->iommu_group reads
>>>    iommu: Add generic_single_device_group()
>>>    iommu/sun50i: Convert to generic_single_device_group()
>>>    iommu/sprd: Convert to generic_single_device_group()
>>>    iommu/rockchip: Convert to generic_single_device_group()
>>>    iommu/ipmmu-vmsa: Convert to generic_single_device_group()
>>>    iommu/omap: Convert to generic_single_device_group()
>>>    iommu: Complete the locking for dev->iommu_group
>>>    iommu/intel: Fix missing locking for show_device_domain_translation()
>>>
>>>   drivers/iommu/intel/debugfs.c  |  34 ++++----
>>>   drivers/iommu/iommu.c          | 155 +++++++++++++++++++++------------
>>>   drivers/iommu/ipmmu-vmsa.c     |  22 ++---
>>>   drivers/iommu/omap-iommu.c     |  30 +------
>>>   drivers/iommu/omap-iommu.h     |   2 +-
>>>   drivers/iommu/rockchip-iommu.c |  22 +----
>>>   drivers/iommu/sprd-iommu.c     |  24 +----
>>>   drivers/iommu/sun50i-iommu.c   |  29 ++----
>>>   include/linux/iommu.h          |   3 +
>>>   9 files changed, 138 insertions(+), 183 deletions(-)
>> Applied, thanks for the nice cleanup!
> This series seems to cause a hung task during boot on MediaTek platforms.
> It hangs with next-20230808. Reverting the 10 commits from this series
> makes the system boot up again.

I confirm that next-20230808 is broken on ARM 32bit based Exynos boards 
too. Boards lock up very early during boot. I will try to investigate 
this soon.


Best regards
Jason Gunthorpe Aug. 8, 2023, 1 p.m. UTC | #5
On Tue, Aug 08, 2023 at 02:32:48PM +0200, Marek Szyprowski wrote:
> Hi,
> 
> On 08.08.2023 12:31, Chen-Yu Tsai wrote:
> > Hi,
> >
> > On Mon, Aug 7, 2023 at 8:54 PM Joerg Roedel <joro@8bytes.org> wrote:
> >> On Mon, Jul 31, 2023 at 02:50:23PM -0300, Jason Gunthorpe wrote:
> >>> Jason Gunthorpe (10):
> >>>    iommu: Remove useless group refcounting
> >>>    iommu: Add a lockdep assertion for remaining dev->iommu_group reads
> >>>    iommu: Add generic_single_device_group()
> >>>    iommu/sun50i: Convert to generic_single_device_group()
> >>>    iommu/sprd: Convert to generic_single_device_group()
> >>>    iommu/rockchip: Convert to generic_single_device_group()
> >>>    iommu/ipmmu-vmsa: Convert to generic_single_device_group()
> >>>    iommu/omap: Convert to generic_single_device_group()
> >>>    iommu: Complete the locking for dev->iommu_group
> >>>    iommu/intel: Fix missing locking for show_device_domain_translation()
> >>>
> >>>   drivers/iommu/intel/debugfs.c  |  34 ++++----
> >>>   drivers/iommu/iommu.c          | 155 +++++++++++++++++++++------------
> >>>   drivers/iommu/ipmmu-vmsa.c     |  22 ++---
> >>>   drivers/iommu/omap-iommu.c     |  30 +------
> >>>   drivers/iommu/omap-iommu.h     |   2 +-
> >>>   drivers/iommu/rockchip-iommu.c |  22 +----
> >>>   drivers/iommu/sprd-iommu.c     |  24 +----
> >>>   drivers/iommu/sun50i-iommu.c   |  29 ++----
> >>>   include/linux/iommu.h          |   3 +
> >>>   9 files changed, 138 insertions(+), 183 deletions(-)
> >> Applied, thanks for the nice cleanup!
> > This series seems to cause a hung task during boot on MediaTek platforms.
> > It hangs with next-20230808. Reverting the 10 commits from this series
> > makes the system boot up again.
> 
> I confirm that next-20230808 is broken on ARM 32bit based Exynos boards 
> too. Boards lock up very early during boot. I will try to investigate 
> this soon.

Any of the drivers that use platform device as the iommu_device will
have a problem, please try:

https://lore.kernel.org/linux-iommu/ZNIz%2FNVLb6WqqvQx@nvidia.com/

Jason
Marek Szyprowski Aug. 8, 2023, 1 p.m. UTC | #6
Hi All,

On 08.08.2023 14:32, Marek Szyprowski wrote:
> On 08.08.2023 12:31, Chen-Yu Tsai wrote:
>> On Mon, Aug 7, 2023 at 8:54 PM Joerg Roedel <joro@8bytes.org> wrote:
>>> On Mon, Jul 31, 2023 at 02:50:23PM -0300, Jason Gunthorpe wrote:
>>>> Jason Gunthorpe (10):
>>>>    iommu: Remove useless group refcounting
>>>>    iommu: Add a lockdep assertion for remaining dev->iommu_group reads
>>>>    iommu: Add generic_single_device_group()
>>>>    iommu/sun50i: Convert to generic_single_device_group()
>>>>    iommu/sprd: Convert to generic_single_device_group()
>>>>    iommu/rockchip: Convert to generic_single_device_group()
>>>>    iommu/ipmmu-vmsa: Convert to generic_single_device_group()
>>>>    iommu/omap: Convert to generic_single_device_group()
>>>>    iommu: Complete the locking for dev->iommu_group
>>>>    iommu/intel: Fix missing locking for 
>>>> show_device_domain_translation()
>>>>
>>>>   drivers/iommu/intel/debugfs.c  |  34 ++++----
>>>>   drivers/iommu/iommu.c          | 155 
>>>> +++++++++++++++++++++------------
>>>>   drivers/iommu/ipmmu-vmsa.c     |  22 ++---
>>>>   drivers/iommu/omap-iommu.c     |  30 +------
>>>>   drivers/iommu/omap-iommu.h     |   2 +-
>>>>   drivers/iommu/rockchip-iommu.c |  22 +----
>>>>   drivers/iommu/sprd-iommu.c     |  24 +----
>>>>   drivers/iommu/sun50i-iommu.c   |  29 ++----
>>>>   include/linux/iommu.h          |   3 +
>>>>   9 files changed, 138 insertions(+), 183 deletions(-)
>>> Applied, thanks for the nice cleanup!
>> This series seems to cause a hung task during boot on MediaTek 
>> platforms.
>> It hangs with next-20230808. Reverting the 10 commits from this series
>> makes the system boot up again.
>
> I confirm that next-20230808 is broken on ARM 32bit based Exynos 
> boards too. Boards lock up very early during boot. I will try to 
> investigate this soon.

Hmm this turned to be Exynos IOMMU specific, but the issue is probably 
somehow generic.

The deadlock happens early in __iommu_probe_device() on 
device_lock(dev). Here is a stack dump of that call:

CPU: 1 PID: 1 Comm: swapper/0 Not tainted 6.5.0-rc5-next-20230808-dirty 
#7013
Hardware name: Samsung Exynos (Flattened Device Tree)
  unwind_backtrace from show_stack+0x10/0x14
  show_stack from dump_stack_lvl+0x58/0x70
  dump_stack_lvl from __iommu_probe_device+0x3d8/0x4ac
  __iommu_probe_device from probe_iommu_group+0x8/0x14
  probe_iommu_group from bus_for_each_dev+0x60/0xb4
  bus_for_each_dev from bus_iommu_probe+0x34/0x118
  bus_iommu_probe from iommu_device_register+0x98/0x100
  iommu_device_register from exynos_sysmmu_probe+0x238/0x3c0
  exynos_sysmmu_probe from platform_probe+0x80/0xc0
  platform_probe from really_probe+0x154/0x3d4
  really_probe from __driver_probe_device+0xa0/0x1e8
  __driver_probe_device from driver_probe_device+0x30/0xd0
  driver_probe_device from __device_attach_driver+0xbc/0x11c
  __device_attach_driver from bus_for_each_drv+0x74/0xc0
  bus_for_each_drv from __device_attach+0xec/0x1b4
  __device_attach from bus_probe_device+0x8c/0x90
  bus_probe_device from device_add+0x5b8/0x78c
  device_add from of_platform_device_create_pdata+0x94/0xcc
  of_platform_device_create_pdata from of_platform_bus_create+0x1ac/0x4d8
  of_platform_bus_create from of_platform_bus_create+0x214/0x4d8
  of_platform_bus_create from of_platform_populate+0x80/0x114
  of_platform_populate from of_platform_default_populate_init+0xcc/0xe4
  of_platform_default_populate_init from do_one_initcall+0x6c/0x318
  do_one_initcall from kernel_init_freeable+0x1c4/0x214
  kernel_init_freeable from kernel_init+0x18/0x12c
  kernel_init from ret_from_fork+0x14/0x2c

The problem here is that exynos_sysmmu_probe() is by design called under 
device_lock, then it calls iommu_device_register(), which in turn 
triggers calling __iommu_probe_device() on all platform devices in the 
system, while the still probed sysmmu device is one of them.

Frankly speaking I have no idea how to defer calling 
iommu_device_register() to avoid this deadlock. Any ideas?

Best regards
Marek Szyprowski Aug. 8, 2023, 1:08 p.m. UTC | #7
Hi Jason,

On 08.08.2023 15:00, Jason Gunthorpe wrote:
> On Tue, Aug 08, 2023 at 02:32:48PM +0200, Marek Szyprowski wrote:
>> On 08.08.2023 12:31, Chen-Yu Tsai wrote:
>>> On Mon, Aug 7, 2023 at 8:54 PM Joerg Roedel <joro@8bytes.org> wrote:
>>>> On Mon, Jul 31, 2023 at 02:50:23PM -0300, Jason Gunthorpe wrote:
>>>>> Jason Gunthorpe (10):
>>>>>     iommu: Remove useless group refcounting
>>>>>     iommu: Add a lockdep assertion for remaining dev->iommu_group reads
>>>>>     iommu: Add generic_single_device_group()
>>>>>     iommu/sun50i: Convert to generic_single_device_group()
>>>>>     iommu/sprd: Convert to generic_single_device_group()
>>>>>     iommu/rockchip: Convert to generic_single_device_group()
>>>>>     iommu/ipmmu-vmsa: Convert to generic_single_device_group()
>>>>>     iommu/omap: Convert to generic_single_device_group()
>>>>>     iommu: Complete the locking for dev->iommu_group
>>>>>     iommu/intel: Fix missing locking for show_device_domain_translation()
>>>>>
>>>>>    drivers/iommu/intel/debugfs.c  |  34 ++++----
>>>>>    drivers/iommu/iommu.c          | 155 +++++++++++++++++++++------------
>>>>>    drivers/iommu/ipmmu-vmsa.c     |  22 ++---
>>>>>    drivers/iommu/omap-iommu.c     |  30 +------
>>>>>    drivers/iommu/omap-iommu.h     |   2 +-
>>>>>    drivers/iommu/rockchip-iommu.c |  22 +----
>>>>>    drivers/iommu/sprd-iommu.c     |  24 +----
>>>>>    drivers/iommu/sun50i-iommu.c   |  29 ++----
>>>>>    include/linux/iommu.h          |   3 +
>>>>>    9 files changed, 138 insertions(+), 183 deletions(-)
>>>> Applied, thanks for the nice cleanup!
>>> This series seems to cause a hung task during boot on MediaTek platforms.
>>> It hangs with next-20230808. Reverting the 10 commits from this series
>>> makes the system boot up again.
>> I confirm that next-20230808 is broken on ARM 32bit based Exynos boards
>> too. Boards lock up very early during boot. I will try to investigate
>> this soon.
> Any of the drivers that use platform device as the iommu_device will
> have a problem, please try:
>
> https://lore.kernel.org/linux-iommu/ZNIz%2FNVLb6WqqvQx@nvidia.com/

I've checked and it doesn't help in my case. I will soon check why.

Best regards
Jason Gunthorpe Aug. 8, 2023, 1:25 p.m. UTC | #8
On Tue, Aug 08, 2023 at 03:08:30PM +0200, Marek Szyprowski wrote:
> > Any of the drivers that use platform device as the iommu_device will
> > have a problem, please try:
> >
> > https://lore.kernel.org/linux-iommu/ZNIz%2FNVLb6WqqvQx@nvidia.com/
> 
> I've checked and it doesn't help in my case. I will soon check why.

Oh, I botched it. Forgot that the iommu_device->dev is the sysfs
handle not the HW device. Maybe this:

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 7dbbcffac21930..dbaace482861da 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -264,6 +264,7 @@ int iommu_device_register(struct iommu_device *iommu,
 		return -EBUSY;
 
 	iommu->ops = ops;
+	iommu->hwdev = hwdev;
 	if (hwdev)
 		iommu->fwnode = dev_fwnode(hwdev);
 
@@ -273,7 +274,7 @@ int iommu_device_register(struct iommu_device *iommu,
 
 	for (int i = 0; i < ARRAY_SIZE(iommu_buses) && !err; i++) {
 		iommu_buses[i]->iommu_ops = ops;
-		err = bus_iommu_probe(iommu_buses[i]);
+		err = bus_iommu_probe(iommu_buses[i], iommu);
 	}
 	if (err)
 		iommu_device_unregister(iommu);
@@ -1784,12 +1785,21 @@ struct iommu_domain *iommu_group_default_domain(struct iommu_group *group)
 	return group->default_domain;
 }
 
+struct probe_iommu_args {
+	struct list_head *group_list;
+	struct iommu_device *iommu;
+};
+
 static int probe_iommu_group(struct device *dev, void *data)
 {
-	struct list_head *group_list = data;
+	struct probe_iommu_args *args = data;
 	int ret;
 
-	ret = __iommu_probe_device(dev, group_list);
+	/* We never probe the iommu device itself */
+	if (args->iommu && args->iommu->hwdev == dev)
+		return 0;
+
+	ret = __iommu_probe_device(dev, args->group_list);
 	if (ret == -ENODEV)
 		ret = 0;
 
@@ -1858,13 +1868,16 @@ static void iommu_group_do_probe_finalize(struct device *dev)
 		ops->probe_finalize(dev);
 }
 
-int bus_iommu_probe(const struct bus_type *bus)
+int bus_iommu_probe(const struct bus_type *bus, struct iommu_device *iommu)
 {
 	struct iommu_group *group, *next;
+	struct probe_iommu_args args = {};
 	LIST_HEAD(group_list);
 	int ret;
 
-	ret = bus_for_each_dev(bus, NULL, &group_list, probe_iommu_group);
+	args.group_list = &group_list;
+	args.iommu = iommu;
+	ret = bus_for_each_dev(bus, NULL, &args, probe_iommu_group);
 	if (ret)
 		return ret;
 
diff --git a/drivers/iommu/omap-iommu.c b/drivers/iommu/omap-iommu.c
index 97c45f50bf4332..e8ab7cb832ba37 100644
--- a/drivers/iommu/omap-iommu.c
+++ b/drivers/iommu/omap-iommu.c
@@ -1234,6 +1234,9 @@ static int omap_iommu_probe(struct platform_device *pdev)
 		if (err)
 			goto out_sysfs;
 		obj->has_iommu_driver = true;
+	} else {
+		/* Re-probe bus to probe device attached to this IOMMU */
+		bus_iommu_probe(&platform_bus_type, NULL);
 	}
 
 	pm_runtime_enable(obj->dev);
@@ -1242,9 +1245,6 @@ static int omap_iommu_probe(struct platform_device *pdev)
 
 	dev_info(&pdev->dev, "%s registered\n", obj->name);
 
-	/* Re-probe bus to probe device attached to this IOMMU */
-	bus_iommu_probe(&platform_bus_type);
-
 	return 0;
 
 out_sysfs:
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index f1e18e81fca78b..784e0fd91fd17e 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -361,6 +361,7 @@ struct iommu_domain_ops {
  * @list: Used by the iommu-core to keep a list of registered iommus
  * @ops: iommu-ops for talking to this iommu
  * @dev: struct device for sysfs handling
+ * @hwdev: The device HW that controls the iommu
  * @singleton_group: Used internally for drivers that have only one group
  * @max_pasids: number of supported PASIDs
  */
@@ -369,6 +370,7 @@ struct iommu_device {
 	const struct iommu_ops *ops;
 	struct fwnode_handle *fwnode;
 	struct device *dev;
+	struct device *hwdev;
 	struct iommu_group *singleton_group;
 	u32 max_pasids;
 };
@@ -465,7 +467,8 @@ static inline const struct iommu_ops *dev_iommu_ops(struct device *dev)
 	return dev->iommu->iommu_dev->ops;
 }
 
-extern int bus_iommu_probe(const struct bus_type *bus);
+extern int bus_iommu_probe(const struct bus_type *bus,
+			   struct iommu_device *iommu);
 extern bool iommu_present(const struct bus_type *bus);
 extern bool device_iommu_capable(struct device *dev, enum iommu_cap cap);
 extern bool iommu_group_has_isolated_msi(struct iommu_group *group);
Marek Szyprowski Aug. 8, 2023, 2:02 p.m. UTC | #9
Hi Jason,

On 08.08.2023 15:25, Jason Gunthorpe wrote:
> On Tue, Aug 08, 2023 at 03:08:30PM +0200, Marek Szyprowski wrote:
>>> Any of the drivers that use platform device as the iommu_device will
>>> have a problem, please try:
>>>
>>> https://lore.kernel.org/linux-iommu/ZNIz%2FNVLb6WqqvQx@nvidia.com/
>> I've checked and it doesn't help in my case. I will soon check why.
> Oh, I botched it. Forgot that the iommu_device->dev is the sysfs
> handle not the HW device. Maybe this:

This fixed the early lockup, but then system hangs again a bit later. It 
looks that this device lock in __iommu_probe_device() is really 
problematic, because __iommu_probe_device() is then called during the 
iommu 'client device' probe on the probed device. Here is a complete 
call stack:

  unwind_backtrace from show_stack+0x10/0x14
  show_stack from dump_stack_lvl+0x58/0x70
  dump_stack_lvl from __iommu_probe_device+0x3d8/0x4b8
  __iommu_probe_device from iommu_probe_device+0x10/0x40
  iommu_probe_device from of_iommu_configure+0xf8/0x1c8
  of_iommu_configure from of_dma_configure_id+0x188/0x450
  of_dma_configure_id from platform_dma_configure+0x24/0x60
  platform_dma_configure from really_probe+0xac/0x3d4
  really_probe from __driver_probe_device+0xa0/0x1e8
  __driver_probe_device from driver_probe_device+0x30/0xd0
  driver_probe_device from __driver_attach+0x10c/0x190
  __driver_attach from bus_for_each_dev+0x60/0xb4
  bus_for_each_dev from bus_add_driver+0xe0/0x208
  bus_add_driver from driver_register+0x7c/0x118
  driver_register from exynos_drm_init+0xe0/0x14c
  exynos_drm_init from do_one_initcall+0x6c/0x318
  do_one_initcall from kernel_init_freeable+0x1c4/0x214
  kernel_init_freeable from kernel_init+0x18/0x12c
  kernel_init from ret_from_fork+0x14/0x2c

Any ideas?

Best regards
Jason Gunthorpe Aug. 8, 2023, 2:30 p.m. UTC | #10
On Tue, Aug 08, 2023 at 04:02:40PM +0200, Marek Szyprowski wrote:
> Hi Jason,
> 
> On 08.08.2023 15:25, Jason Gunthorpe wrote:
> > On Tue, Aug 08, 2023 at 03:08:30PM +0200, Marek Szyprowski wrote:
> >>> Any of the drivers that use platform device as the iommu_device will
> >>> have a problem, please try:
> >>>
> >>> https://lore.kernel.org/linux-iommu/ZNIz%2FNVLb6WqqvQx@nvidia.com/
> >> I've checked and it doesn't help in my case. I will soon check why.
> > Oh, I botched it. Forgot that the iommu_device->dev is the sysfs
> > handle not the HW device. Maybe this:
> 
> This fixed the early lockup, but then system hangs again a bit later. It 
> looks that this device lock in __iommu_probe_device() is really
> problematic, 

Yes, I expected we'd hit something like this - I checked alot of call
paths but missed these two. The self-probe is sneaky, but here the
device_lock is held way up the call chain, I just missed it.

The fix is to just annotate that we already hold the lock when calling
iommu_probe_device(), since we know in those cases that we must be
holding it:

diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c
index daa64dd687524b..3fc5e12f2f1c09 100644
--- a/drivers/acpi/scan.c
+++ b/drivers/acpi/scan.c
@@ -1582,7 +1582,7 @@ static const struct iommu_ops *acpi_iommu_configure_id(struct device *dev,
 	 * iommu_probe_device() call for dev, replay it to get things in order.
 	 */
 	if (!err && dev->bus)
-		err = iommu_probe_device(dev);
+		err = iommu_probe_device_locked(dev);
 
 	/* Ignore all other errors apart from EPROBE_DEFER */
 	if (err == -EPROBE_DEFER) {
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 7dbbcffac21930..b867d7f22954e9 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -264,6 +264,7 @@ int iommu_device_register(struct iommu_device *iommu,
 		return -EBUSY;
 
 	iommu->ops = ops;
+	iommu->hwdev = hwdev;
 	if (hwdev)
 		iommu->fwnode = dev_fwnode(hwdev);
 
@@ -273,7 +274,7 @@ int iommu_device_register(struct iommu_device *iommu,
 
 	for (int i = 0; i < ARRAY_SIZE(iommu_buses) && !err; i++) {
 		iommu_buses[i]->iommu_ops = ops;
-		err = bus_iommu_probe(iommu_buses[i]);
+		err = bus_iommu_probe(iommu_buses[i], iommu);
 	}
 	if (err)
 		iommu_device_unregister(iommu);
@@ -452,24 +453,23 @@ static int __iommu_probe_device(struct device *dev, struct list_head *group_list
 	struct group_device *gdev;
 	int ret;
 
-	if (!ops)
-		return -ENODEV;
 	/*
 	 * Allow __iommu_probe_device() to be safely called in parallel,
 	 * both dev->iommu_group and the initial setup of dev->iommu are
 	 * protected this way.
 	 */
-	device_lock(dev);
+	device_lock_assert(dev);
+
+	if (!ops)
+		return -ENODEV;
 
 	/* Device is probed already if in a group */
-	if (dev->iommu_group) {
-		ret = 0;
-		goto out_unlock;
-	}
+	if (dev->iommu_group)
+		return 0;
 
 	ret = iommu_init_device(dev, ops);
 	if (ret)
-		goto out_unlock;
+		return ret;
 
 	group = dev->iommu_group;
 	gdev = iommu_group_alloc_device(group, dev);
@@ -505,7 +505,6 @@ static int __iommu_probe_device(struct device *dev, struct list_head *group_list
 			list_add_tail(&group->entry, group_list);
 	}
 	mutex_unlock(&group->mutex);
-	device_unlock(dev);
 
 	if (dev_is_pci(dev))
 		iommu_dma_set_pci_32bit_workaround(dev);
@@ -519,12 +518,10 @@ static int __iommu_probe_device(struct device *dev, struct list_head *group_list
 	iommu_deinit_device(dev);
 	mutex_unlock(&group->mutex);
 	iommu_group_put(group);
-out_unlock:
-	device_unlock(dev);
 	return ret;
 }
 
-int iommu_probe_device(struct device *dev)
+int iommu_probe_device_locked(struct device *dev)
 {
 	const struct iommu_ops *ops;
 	int ret;
@@ -540,6 +537,16 @@ int iommu_probe_device(struct device *dev)
 	return 0;
 }
 
+int iommu_probe_device(struct device *dev)
+{
+	int ret;
+
+	device_lock(dev);
+	ret = iommu_probe_device_locked(dev);
+	device_unlock(dev);
+	return ret;
+}
+
 static void __iommu_group_free_device(struct iommu_group *group,
 				      struct group_device *grp_dev)
 {
@@ -1784,12 +1791,26 @@ struct iommu_domain *iommu_group_default_domain(struct iommu_group *group)
 	return group->default_domain;
 }
 
+struct probe_iommu_args {
+	struct list_head *group_list;
+	struct iommu_device *iommu;
+};
+
 static int probe_iommu_group(struct device *dev, void *data)
 {
-	struct list_head *group_list = data;
+	struct probe_iommu_args *args = data;
+	bool need_lock;
 	int ret;
 
-	ret = __iommu_probe_device(dev, group_list);
+	/* Probing the iommu itself is always done under the device_lock */
+	need_lock = !args->iommu || args->iommu->hwdev != dev;
+
+	if (need_lock)
+		device_lock(dev);
+	ret = __iommu_probe_device(dev, args->group_list);
+	if (need_lock)
+		device_unlock(dev);
+
 	if (ret == -ENODEV)
 		ret = 0;
 
@@ -1858,13 +1879,16 @@ static void iommu_group_do_probe_finalize(struct device *dev)
 		ops->probe_finalize(dev);
 }
 
-int bus_iommu_probe(const struct bus_type *bus)
+int bus_iommu_probe(const struct bus_type *bus, struct iommu_device *iommu)
 {
 	struct iommu_group *group, *next;
+	struct probe_iommu_args args = {};
 	LIST_HEAD(group_list);
 	int ret;
 
-	ret = bus_for_each_dev(bus, NULL, &group_list, probe_iommu_group);
+	args.group_list = &group_list;
+	args.iommu = iommu;
+	ret = bus_for_each_dev(bus, NULL, &args, probe_iommu_group);
 	if (ret)
 		return ret;
 
diff --git a/drivers/iommu/of_iommu.c b/drivers/iommu/of_iommu.c
index 157b286e36bf3a..b5b7d4bd2cefb9 100644
--- a/drivers/iommu/of_iommu.c
+++ b/drivers/iommu/of_iommu.c
@@ -160,7 +160,7 @@ const struct iommu_ops *of_iommu_configure(struct device *dev,
 	 * probe for dev, replay it to get things in order.
 	 */
 	if (!err && dev->bus)
-		err = iommu_probe_device(dev);
+		err = iommu_probe_device_locked(dev);
 
 	/* Ignore all other errors apart from EPROBE_DEFER */
 	if (err == -EPROBE_DEFER) {
diff --git a/drivers/iommu/omap-iommu.c b/drivers/iommu/omap-iommu.c
index 97c45f50bf4332..828679abef7503 100644
--- a/drivers/iommu/omap-iommu.c
+++ b/drivers/iommu/omap-iommu.c
@@ -1234,6 +1234,10 @@ static int omap_iommu_probe(struct platform_device *pdev)
 		if (err)
 			goto out_sysfs;
 		obj->has_iommu_driver = true;
+	} else {
+		/* Re-probe bus to probe device attached to this IOMMU */
+		obj->iommu.hwdev = &pdev->dev;
+		bus_iommu_probe(&platform_bus_type, &obj->iommu);
 	}
 
 	pm_runtime_enable(obj->dev);
@@ -1242,9 +1246,6 @@ static int omap_iommu_probe(struct platform_device *pdev)
 
 	dev_info(&pdev->dev, "%s registered\n", obj->name);
 
-	/* Re-probe bus to probe device attached to this IOMMU */
-	bus_iommu_probe(&platform_bus_type);
-
 	return 0;
 
 out_sysfs:
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index f1e18e81fca78b..96782bfb384462 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -361,6 +361,7 @@ struct iommu_domain_ops {
  * @list: Used by the iommu-core to keep a list of registered iommus
  * @ops: iommu-ops for talking to this iommu
  * @dev: struct device for sysfs handling
+ * @hwdev: The device HW that controls the iommu
  * @singleton_group: Used internally for drivers that have only one group
  * @max_pasids: number of supported PASIDs
  */
@@ -369,6 +370,7 @@ struct iommu_device {
 	const struct iommu_ops *ops;
 	struct fwnode_handle *fwnode;
 	struct device *dev;
+	struct device *hwdev;
 	struct iommu_group *singleton_group;
 	u32 max_pasids;
 };
@@ -465,7 +467,8 @@ static inline const struct iommu_ops *dev_iommu_ops(struct device *dev)
 	return dev->iommu->iommu_dev->ops;
 }
 
-extern int bus_iommu_probe(const struct bus_type *bus);
+extern int bus_iommu_probe(const struct bus_type *bus,
+			   struct iommu_device *iommu);
 extern bool iommu_present(const struct bus_type *bus);
 extern bool device_iommu_capable(struct device *dev, enum iommu_cap cap);
 extern bool iommu_group_has_isolated_msi(struct iommu_group *group);
@@ -709,6 +712,7 @@ static inline void dev_iommu_priv_set(struct device *dev, void *priv)
 }
 
 int iommu_probe_device(struct device *dev);
+int iommu_probe_device_locked(struct device *dev);
 
 int iommu_dev_enable_feature(struct device *dev, enum iommu_dev_features f);
 int iommu_dev_disable_feature(struct device *dev, enum iommu_dev_features f);
Marek Szyprowski Aug. 8, 2023, 2:51 p.m. UTC | #11
Hi Jason,

On 08.08.2023 16:30, Jason Gunthorpe wrote:
> On Tue, Aug 08, 2023 at 04:02:40PM +0200, Marek Szyprowski wrote:
>> On 08.08.2023 15:25, Jason Gunthorpe wrote:
>>> On Tue, Aug 08, 2023 at 03:08:30PM +0200, Marek Szyprowski wrote:
>>>>> Any of the drivers that use platform device as the iommu_device will
>>>>> have a problem, please try:
>>>>>
>>>>> https://lore.kernel.org/linux-iommu/ZNIz%2FNVLb6WqqvQx@nvidia.com/
>>>> I've checked and it doesn't help in my case. I will soon check why.
>>> Oh, I botched it. Forgot that the iommu_device->dev is the sysfs
>>> handle not the HW device. Maybe this:
>> This fixed the early lockup, but then system hangs again a bit later. It
>> looks that this device lock in __iommu_probe_device() is really
>> problematic,
> Yes, I expected we'd hit something like this - I checked alot of call
> paths but missed these two. The self-probe is sneaky, but here the
> device_lock is held way up the call chain, I just missed it.
>
> The fix is to just annotate that we already hold the lock when calling
> iommu_probe_device(), since we know in those cases that we must be
> holding it:

Great, this fixed the issue. Feel free to add:

Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>

 > ...

Best regards
Chen-Yu Tsai Aug. 9, 2023, 6:23 a.m. UTC | #12
On Tue, Aug 8, 2023 at 10:30 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Tue, Aug 08, 2023 at 04:02:40PM +0200, Marek Szyprowski wrote:
> > Hi Jason,
> >
> > On 08.08.2023 15:25, Jason Gunthorpe wrote:
> > > On Tue, Aug 08, 2023 at 03:08:30PM +0200, Marek Szyprowski wrote:
> > >>> Any of the drivers that use platform device as the iommu_device will
> > >>> have a problem, please try:
> > >>>
> > >>> https://lore.kernel.org/linux-iommu/ZNIz%2FNVLb6WqqvQx@nvidia.com/
> > >> I've checked and it doesn't help in my case. I will soon check why.
> > > Oh, I botched it. Forgot that the iommu_device->dev is the sysfs
> > > handle not the HW device. Maybe this:
> >
> > This fixed the early lockup, but then system hangs again a bit later. It
> > looks that this device lock in __iommu_probe_device() is really
> > problematic,
>
> Yes, I expected we'd hit something like this - I checked alot of call
> paths but missed these two. The self-probe is sneaky, but here the
> device_lock is held way up the call chain, I just missed it.
>
> The fix is to just annotate that we already hold the lock when calling
> iommu_probe_device(), since we know in those cases that we must be
> holding it:

This fixed things for me, so

Tested-by: Chen-Yu Tsai <wenst@chromium.org>

> diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c
> index daa64dd687524b..3fc5e12f2f1c09 100644
> --- a/drivers/acpi/scan.c
> +++ b/drivers/acpi/scan.c
> @@ -1582,7 +1582,7 @@ static const struct iommu_ops *acpi_iommu_configure_id(struct device *dev,
>          * iommu_probe_device() call for dev, replay it to get things in order.
>          */
>         if (!err && dev->bus)
> -               err = iommu_probe_device(dev);
> +               err = iommu_probe_device_locked(dev);
>
>         /* Ignore all other errors apart from EPROBE_DEFER */
>         if (err == -EPROBE_DEFER) {
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 7dbbcffac21930..b867d7f22954e9 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -264,6 +264,7 @@ int iommu_device_register(struct iommu_device *iommu,
>                 return -EBUSY;
>
>         iommu->ops = ops;
> +       iommu->hwdev = hwdev;
>         if (hwdev)
>                 iommu->fwnode = dev_fwnode(hwdev);
>
> @@ -273,7 +274,7 @@ int iommu_device_register(struct iommu_device *iommu,
>
>         for (int i = 0; i < ARRAY_SIZE(iommu_buses) && !err; i++) {
>                 iommu_buses[i]->iommu_ops = ops;
> -               err = bus_iommu_probe(iommu_buses[i]);
> +               err = bus_iommu_probe(iommu_buses[i], iommu);
>         }
>         if (err)
>                 iommu_device_unregister(iommu);
> @@ -452,24 +453,23 @@ static int __iommu_probe_device(struct device *dev, struct list_head *group_list
>         struct group_device *gdev;
>         int ret;
>
> -       if (!ops)
> -               return -ENODEV;
>         /*
>          * Allow __iommu_probe_device() to be safely called in parallel,
>          * both dev->iommu_group and the initial setup of dev->iommu are
>          * protected this way.
>          */
> -       device_lock(dev);
> +       device_lock_assert(dev);
> +
> +       if (!ops)
> +               return -ENODEV;
>
>         /* Device is probed already if in a group */
> -       if (dev->iommu_group) {
> -               ret = 0;
> -               goto out_unlock;
> -       }
> +       if (dev->iommu_group)
> +               return 0;
>
>         ret = iommu_init_device(dev, ops);
>         if (ret)
> -               goto out_unlock;
> +               return ret;
>
>         group = dev->iommu_group;
>         gdev = iommu_group_alloc_device(group, dev);
> @@ -505,7 +505,6 @@ static int __iommu_probe_device(struct device *dev, struct list_head *group_list
>                         list_add_tail(&group->entry, group_list);
>         }
>         mutex_unlock(&group->mutex);
> -       device_unlock(dev);
>
>         if (dev_is_pci(dev))
>                 iommu_dma_set_pci_32bit_workaround(dev);
> @@ -519,12 +518,10 @@ static int __iommu_probe_device(struct device *dev, struct list_head *group_list
>         iommu_deinit_device(dev);
>         mutex_unlock(&group->mutex);
>         iommu_group_put(group);
> -out_unlock:
> -       device_unlock(dev);
>         return ret;
>  }
>
> -int iommu_probe_device(struct device *dev)
> +int iommu_probe_device_locked(struct device *dev)
>  {
>         const struct iommu_ops *ops;
>         int ret;
> @@ -540,6 +537,16 @@ int iommu_probe_device(struct device *dev)
>         return 0;
>  }
>
> +int iommu_probe_device(struct device *dev)
> +{
> +       int ret;
> +
> +       device_lock(dev);
> +       ret = iommu_probe_device_locked(dev);
> +       device_unlock(dev);
> +       return ret;
> +}
> +
>  static void __iommu_group_free_device(struct iommu_group *group,
>                                       struct group_device *grp_dev)
>  {
> @@ -1784,12 +1791,26 @@ struct iommu_domain *iommu_group_default_domain(struct iommu_group *group)
>         return group->default_domain;
>  }
>
> +struct probe_iommu_args {
> +       struct list_head *group_list;
> +       struct iommu_device *iommu;
> +};
> +
>  static int probe_iommu_group(struct device *dev, void *data)
>  {
> -       struct list_head *group_list = data;
> +       struct probe_iommu_args *args = data;
> +       bool need_lock;
>         int ret;
>
> -       ret = __iommu_probe_device(dev, group_list);
> +       /* Probing the iommu itself is always done under the device_lock */
> +       need_lock = !args->iommu || args->iommu->hwdev != dev;
> +
> +       if (need_lock)
> +               device_lock(dev);
> +       ret = __iommu_probe_device(dev, args->group_list);
> +       if (need_lock)
> +               device_unlock(dev);
> +
>         if (ret == -ENODEV)
>                 ret = 0;
>
> @@ -1858,13 +1879,16 @@ static void iommu_group_do_probe_finalize(struct device *dev)
>                 ops->probe_finalize(dev);
>  }
>
> -int bus_iommu_probe(const struct bus_type *bus)
> +int bus_iommu_probe(const struct bus_type *bus, struct iommu_device *iommu)
>  {
>         struct iommu_group *group, *next;
> +       struct probe_iommu_args args = {};
>         LIST_HEAD(group_list);
>         int ret;
>
> -       ret = bus_for_each_dev(bus, NULL, &group_list, probe_iommu_group);
> +       args.group_list = &group_list;
> +       args.iommu = iommu;
> +       ret = bus_for_each_dev(bus, NULL, &args, probe_iommu_group);
>         if (ret)
>                 return ret;
>
> diff --git a/drivers/iommu/of_iommu.c b/drivers/iommu/of_iommu.c
> index 157b286e36bf3a..b5b7d4bd2cefb9 100644
> --- a/drivers/iommu/of_iommu.c
> +++ b/drivers/iommu/of_iommu.c
> @@ -160,7 +160,7 @@ const struct iommu_ops *of_iommu_configure(struct device *dev,
>          * probe for dev, replay it to get things in order.
>          */
>         if (!err && dev->bus)
> -               err = iommu_probe_device(dev);
> +               err = iommu_probe_device_locked(dev);
>
>         /* Ignore all other errors apart from EPROBE_DEFER */
>         if (err == -EPROBE_DEFER) {
> diff --git a/drivers/iommu/omap-iommu.c b/drivers/iommu/omap-iommu.c
> index 97c45f50bf4332..828679abef7503 100644
> --- a/drivers/iommu/omap-iommu.c
> +++ b/drivers/iommu/omap-iommu.c
> @@ -1234,6 +1234,10 @@ static int omap_iommu_probe(struct platform_device *pdev)
>                 if (err)
>                         goto out_sysfs;
>                 obj->has_iommu_driver = true;
> +       } else {
> +               /* Re-probe bus to probe device attached to this IOMMU */
> +               obj->iommu.hwdev = &pdev->dev;
> +               bus_iommu_probe(&platform_bus_type, &obj->iommu);
>         }
>
>         pm_runtime_enable(obj->dev);
> @@ -1242,9 +1246,6 @@ static int omap_iommu_probe(struct platform_device *pdev)
>
>         dev_info(&pdev->dev, "%s registered\n", obj->name);
>
> -       /* Re-probe bus to probe device attached to this IOMMU */
> -       bus_iommu_probe(&platform_bus_type);
> -
>         return 0;
>
>  out_sysfs:
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index f1e18e81fca78b..96782bfb384462 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -361,6 +361,7 @@ struct iommu_domain_ops {
>   * @list: Used by the iommu-core to keep a list of registered iommus
>   * @ops: iommu-ops for talking to this iommu
>   * @dev: struct device for sysfs handling
> + * @hwdev: The device HW that controls the iommu
>   * @singleton_group: Used internally for drivers that have only one group
>   * @max_pasids: number of supported PASIDs
>   */
> @@ -369,6 +370,7 @@ struct iommu_device {
>         const struct iommu_ops *ops;
>         struct fwnode_handle *fwnode;
>         struct device *dev;
> +       struct device *hwdev;
>         struct iommu_group *singleton_group;
>         u32 max_pasids;
>  };
> @@ -465,7 +467,8 @@ static inline const struct iommu_ops *dev_iommu_ops(struct device *dev)
>         return dev->iommu->iommu_dev->ops;
>  }
>
> -extern int bus_iommu_probe(const struct bus_type *bus);
> +extern int bus_iommu_probe(const struct bus_type *bus,
> +                          struct iommu_device *iommu);
>  extern bool iommu_present(const struct bus_type *bus);
>  extern bool device_iommu_capable(struct device *dev, enum iommu_cap cap);
>  extern bool iommu_group_has_isolated_msi(struct iommu_group *group);
> @@ -709,6 +712,7 @@ static inline void dev_iommu_priv_set(struct device *dev, void *priv)
>  }
>
>  int iommu_probe_device(struct device *dev);
> +int iommu_probe_device_locked(struct device *dev);
>
>  int iommu_dev_enable_feature(struct device *dev, enum iommu_dev_features f);
>  int iommu_dev_disable_feature(struct device *dev, enum iommu_dev_features f);
>
>