Message ID | 6xb24fjmptxxn5js2fjrrddjae6twex5bjaftwqsuawuqqqydx@7cl3uik5ef6j (mailing list archive) |
---|---|
State | Not Applicable |
Delegated to: | Bjorn Helgaas |
Headers | show |
Series | [bug,report] lockdep WARN at PCI device rescan | expand |
On Tue, Nov 14, 2023 at 06:54:29AM +0000, Shinichiro Kawasaki wrote: > Hello there. > > Recently I tried a couple of commands below on the kernel v6.6 and v6.7-rc1, > then observed a lockdep WARN at the second command [1]. The first command > removes a PCI device, and the second command rescans whole PCI devices to > regain the removed device. > > # echo 1 > /sys/bus/pci/devices/0000:51:00.0/remove > # echo 1 > /sys/bus/pci/rescan > > I tried this rescan for SAS-HBA or AHCI controller with HDDs. When those devices > are left in weird status after some kernel tests, I want to remove the SAS-HBA > and AHCI controller and rescan to get back the devices in good status. This > rescan looks working good except the WARN. > > The lockdep splat indicates possible deadlock between pci_rescan_remove_lock > and work_completion lock have deadlock possibility. Is the lockdep WARN a known > issue? I found a similar discussion in the past [2], but it did not discuss the > work_completion lock, so my observation looks a new, different issue. > > In the call stack, I found that the workqueue thread for i801_probe() calls > p2sb_bar(), which locks pci_rescan_remove_lock. IMHO, the issue cause looks that > pci_rescan_remove_lock is locked in both workqueue context and non-workqueue > context. As a fix trial, I created a quick patch [3]. It calls i801_probe() in > non-workqueue context only by adding a new flag to struct pci_driver. With this, > I observed the lockdep WARN disappears. Is this a good solution approach? If > not, is there any other better solution? Thanks for the report and the proposed solution. I'll add the i801 experts, Jean and Heiner, to CC. > > [1] kernel message log at the second command > > [ 242.922091] ====================================================== > [ 242.931663] WARNING: possible circular locking dependency detected > [ 242.938292] mpt3sas_cm1: 63 BIT PCI BUS DMA ADDRESSING SUPPORTED, total mem (56799464 kB) > [ 242.939415] 6.7.0-rc1-kts #1 Not tainted > [ 242.939419] ------------------------------------------------------ > [ 242.939421] bash/1615 is trying to acquire lock: > [ 242.939424] ff1100017bf87910 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: __flush_work+0xc5/0x980 > [ 242.989069] > but task is already holding lock: > [ 243.000283] ffffffff870bf4a8 (pci_rescan_remove_lock){+.+.}-{3:3}, at: rescan_store+0x96/0xd0 > [ 243.012269] > which lock already depends on the new lock. > > [ 243.028569] > the existing dependency chain (in reverse order) is: > [ 243.041611] > -> #1 (pci_rescan_remove_lock){+.+.}-{3:3}: > [ 243.053709] __mutex_lock+0x16a/0x1880 > [ 243.060767] p2sb_bar+0xa7/0x250 > [ 243.067213] i801_add_tco_spt.constprop.0+0x88/0x1f0 [i2c_i801] > [ 243.076707] i801_add_tco+0x18a/0x210 [i2c_i801] > [ 243.084727] i801_probe+0x99c/0x1500 [i2c_i801] > [ 243.092618] local_pci_probe+0xd6/0x190 > [ 243.099708] work_for_cpu_fn+0x4e/0xa0 > [ 243.106673] process_one_work+0x736/0x1230 > [ 243.114012] worker_thread+0x723/0x1300 > [ 243.121039] kthread+0x2ee/0x3d0 > [ 243.127372] ret_from_fork+0x2d/0x70 > [ 243.134073] ret_from_fork_asm+0x1b/0x30 > [ 243.141140] > -> #0 ((work_completion)(&wfc.work)){+.+.}-{0:0}: > [ 243.153341] __lock_acquire+0x2e74/0x5ea0 > [ 243.160490] lock_acquire+0x196/0x4b0 > [ 243.167236] __flush_work+0xe2/0x980 > [ 243.173882] work_on_cpu_key+0xcc/0xf0 > [ 243.180709] pci_device_probe+0x548/0x740 > [ 243.187813] really_probe+0x3df/0xb80 > [ 243.194525] __driver_probe_device+0x18c/0x450 > [ 243.202128] driver_probe_device+0x4a/0x120 > [ 243.209437] __device_attach_driver+0x15e/0x270 > [ 243.217149] bus_for_each_drv+0x101/0x170 > [ 243.224260] __device_attach+0x189/0x380 > [ 243.231254] pci_bus_add_device+0x9f/0xf0 > [ 243.238360] pci_bus_add_devices+0x7f/0x190 > [ 243.245639] pci_bus_add_devices+0x114/0x190 > [ 243.253017] pci_rescan_bus+0x23/0x30 > [ 243.259711] rescan_store+0xa2/0xd0 > [ 243.266187] kernfs_fop_write_iter+0x356/0x530 > [ 243.273735] vfs_write+0x513/0xd60 > [ 243.280090] ksys_write+0xe7/0x1b0 > [ 243.286412] do_syscall_64+0x5d/0xe0 > [ 243.292908] entry_SYSCALL_64_after_hwframe+0x6e/0x76 > [ 243.301053] > other info that might help us debug this: > > [ 243.315550] Possible unsafe locking scenario: > > [ 243.325803] CPU0 CPU1 > [ 243.332654] ---- ---- > [ 243.339492] lock(pci_rescan_remove_lock); > [ 243.345937] lock((work_completion)(&wfc.work)); > [ 243.355852] lock(pci_rescan_remove_lock); > [ 243.365170] lock((work_completion)(&wfc.work)); > [ 243.372235] > *** DEADLOCK *** > > [ 243.384100] 5 locks held by bash/1615: > [ 243.390048] #0: ff1100013f4b0418 (sb_writers#4){.+.+}-{0:0}, at: ksys_write+0xe7/0x1b0 > [ 243.400833] #1: ff11000128429888 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x21d/0x530 > [ 243.412623] #2: ff11000103849968 (kn->active#136){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x241/0x530 > [ 243.424832] #3: ffffffff870bf4a8 (pci_rescan_remove_lock){+.+.}-{3:3}, at: rescan_store+0x96/0xd0 > [ 243.436773] #4: ff1100019cc7e1a8 (&dev->mutex){....}-{3:3}, at: __device_attach+0x67/0x380 > [ 243.448048] > stack backtrace: > [ 243.456654] CPU: 16 PID: 1615 Comm: bash Not tainted 6.7.0-rc1-kts #1 > [ 243.465797] Hardware name: Supermicro SYS-520P-WTR/X12SPW-TF, BIOS 1.2 02/14/2022 > [ 243.476145] Call Trace: > [ 243.480820] <TASK> > [ 243.485084] dump_stack_lvl+0x57/0x90 > [ 243.491112] check_noncircular+0x2e1/0x3c0 > [ 243.497630] ? __pfx_check_noncircular+0x10/0x10 > [ 243.504747] ? __pfx___bfs+0x10/0x10 > [ 243.510680] ? lockdep_lock+0xbc/0x1a0 > [ 243.516811] ? __pfx_lockdep_lock+0x10/0x10 > [ 243.523436] __lock_acquire+0x2e74/0x5ea0 > [ 243.529866] ? __pfx___lock_acquire+0x10/0x10 > [ 243.536682] lock_acquire+0x196/0x4b0 > [ 243.542710] ? __flush_work+0xc5/0x980 > [ 243.548829] ? __pfx_lock_acquire+0x10/0x10 > [ 243.555442] ? __pfx___lock_acquire+0x10/0x10 > [ 243.562252] ? driver_probe_device+0x4a/0x120 > [ 243.569061] ? __device_attach_driver+0x15e/0x270 > [ 243.576282] ? mark_lock+0xee/0x16c0 > [ 243.582222] ? __flush_work+0xc5/0x980 > [ 243.588364] __flush_work+0xe2/0x980 > [ 243.594300] ? __flush_work+0xc5/0x980 > [ 243.600425] ? __queue_work+0x4e4/0xe30 > [ 243.606658] ? __pfx___flush_work+0x10/0x10 > [ 243.613287] ? lock_is_held_type+0xce/0x120 > [ 243.619917] ? queue_work_on+0x69/0xa0 > [ 243.626032] ? lockdep_hardirqs_on+0x7d/0x100 > [ 243.632834] work_on_cpu_key+0xcc/0xf0 > [ 243.638950] ? __pfx_work_on_cpu_key+0x10/0x10 > [ 243.645849] ? __pfx_work_for_cpu_fn+0x10/0x10 > [ 243.652738] ? __pfx_local_pci_probe+0x10/0x10 > [ 243.659638] pci_device_probe+0x548/0x740 > [ 243.666057] ? __pfx_pci_device_probe+0x10/0x10 > [ 243.673057] ? kernfs_create_link+0x167/0x230 > [ 243.679855] really_probe+0x3df/0xb80 > [ 243.685860] __driver_probe_device+0x18c/0x450 > [ 243.692737] driver_probe_device+0x4a/0x120 > [ 243.699314] __device_attach_driver+0x15e/0x270 > [ 243.706297] ? __pfx___device_attach_driver+0x10/0x10 > [ 243.713890] bus_for_each_drv+0x101/0x170 > [ 243.720312] ? __pfx_bus_for_each_drv+0x10/0x10 > [ 243.727294] ? lockdep_hardirqs_on+0x7d/0x100 > [ 243.734063] ? _raw_spin_unlock_irqrestore+0x35/0x60 > [ 243.741505] __device_attach+0x189/0x380 > [ 243.747747] ? __pfx___device_attach+0x10/0x10 > [ 243.754554] pci_bus_add_device+0x9f/0xf0 > [ 243.760836] pci_bus_add_devices+0x7f/0x190 > [ 243.767328] pci_bus_add_devices+0x114/0x190 > [ 243.773890] pci_rescan_bus+0x23/0x30 > [ 243.779741] rescan_store+0xa2/0xd0 > [ 243.785362] ? __pfx_rescan_store+0x10/0x10 > [ 243.791785] kernfs_fop_write_iter+0x356/0x530 > [ 243.798516] vfs_write+0x513/0xd60 > [ 243.804054] ? __pfx_vfs_write+0x10/0x10 > [ 243.810193] ? __fget_light+0x51/0x220 > [ 243.816125] ? __pfx_lock_release+0x10/0x10 > [ 243.822555] ksys_write+0xe7/0x1b0 > [ 243.828097] ? __pfx_ksys_write+0x10/0x10 > [ 243.834327] ? syscall_enter_from_user_mode+0x22/0x90 > [ 243.841736] ? lockdep_hardirqs_on+0x7d/0x100 > [ 243.848366] do_syscall_64+0x5d/0xe0 > [ 243.854114] ? do_syscall_64+0x6c/0xe0 > [ 243.860053] ? do_syscall_64+0x6c/0xe0 > [ 243.865989] ? lockdep_hardirqs_on+0x7d/0x100 > [ 243.872608] ? do_syscall_64+0x6c/0xe0 > [ 243.878537] ? lockdep_hardirqs_on+0x7d/0x100 > [ 243.885147] entry_SYSCALL_64_after_hwframe+0x6e/0x76 > [ 243.892555] RIP: 0033:0x7fee10d53c34 > [ 243.898305] Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 80 3d 35 77 0d 00 00 74 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 55 48 89 e5 48 83 ec 20 48 89 > [ 243.922266] RSP: 002b:00007ffd173e68e8 EFLAGS: 00000202 ORIG_RAX: 0000000000000001 > [ 243.932655] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fee10d53c34 > [ 243.942564] RDX: 0000000000000002 RSI: 000055f17c9c4bc0 RDI: 0000000000000001 > [ 243.952485] RBP: 00007ffd173e6910 R08: 0000000000000073 R09: 0000000000000001 > [ 243.962408] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000002 > [ 243.972328] R13: 000055f17c9c4bc0 R14: 00007fee10e245c0 R15: 00007fee10e21f20 > [ 243.982259] </TASK> > > [2] https://patchwork.kernel.org/project/linux-pci/patch/20180921205752.3191-1-keith.busch@intel.com/ > > [3] fix trial patch > > diff --git a/drivers/i2c/busses/i2c-i801.c b/drivers/i2c/busses/i2c-i801.c > index 070999139c6..00d57d4e006 100644 > --- a/drivers/i2c/busses/i2c-i801.c > +++ b/drivers/i2c/busses/i2c-i801.c > @@ -1820,6 +1820,7 @@ static struct pci_driver i801_driver = { > .pm = pm_sleep_ptr(&i801_pm_ops), > .probe_type = PROBE_PREFER_ASYNCHRONOUS, > }, > + .local_probe = true, > }; > > static int __init i2c_i801_init(struct pci_driver *drv) > diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c > index 51ec9e7e784..161ff37143a 100644 > --- a/drivers/pci/pci-driver.c > +++ b/drivers/pci/pci-driver.c > @@ -368,7 +368,7 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev, > * device is probed from work_on_cpu() of the Physical device. > */ > if (node < 0 || node >= MAX_NUMNODES || !node_online(node) || > - pci_physfn_is_probed(dev)) { > + pci_physfn_is_probed(dev) || drv->local_probe) { > cpu = nr_cpu_ids; > } else { > cpumask_var_t wq_domain_mask; > diff --git a/include/linux/pci.h b/include/linux/pci.h > index 60ca768bc86..6fd086eb26c 100644 > --- a/include/linux/pci.h > +++ b/include/linux/pci.h > @@ -957,6 +957,7 @@ struct pci_driver { > struct device_driver driver; > struct pci_dynids dynids; > bool driver_managed_dma; > + bool local_probe; > }; > > static inline struct pci_driver *to_pci_driver(struct device_driver *drv) >
On 14.11.2023 11:16, Wolfram Sang wrote: > On Tue, Nov 14, 2023 at 06:54:29AM +0000, Shinichiro Kawasaki wrote: >> Hello there. >> >> Recently I tried a couple of commands below on the kernel v6.6 and v6.7-rc1, >> then observed a lockdep WARN at the second command [1]. The first command >> removes a PCI device, and the second command rescans whole PCI devices to >> regain the removed device. >> >> # echo 1 > /sys/bus/pci/devices/0000:51:00.0/remove >> # echo 1 > /sys/bus/pci/rescan >> >> I tried this rescan for SAS-HBA or AHCI controller with HDDs. When those devices >> are left in weird status after some kernel tests, I want to remove the SAS-HBA >> and AHCI controller and rescan to get back the devices in good status. This >> rescan looks working good except the WARN. >> >> The lockdep splat indicates possible deadlock between pci_rescan_remove_lock >> and work_completion lock have deadlock possibility. Is the lockdep WARN a known >> issue? I found a similar discussion in the past [2], but it did not discuss the >> work_completion lock, so my observation looks a new, different issue. >> >> In the call stack, I found that the workqueue thread for i801_probe() calls >> p2sb_bar(), which locks pci_rescan_remove_lock. IMHO, the issue cause looks that >> pci_rescan_remove_lock is locked in both workqueue context and non-workqueue >> context. As a fix trial, I created a quick patch [3]. It calls i801_probe() in >> non-workqueue context only by adding a new flag to struct pci_driver. With this, >> I observed the lockdep WARN disappears. Is this a good solution approach? If >> not, is there any other better solution? > > Thanks for the report and the proposed solution. I'll add the i801 > experts, Jean and Heiner, to CC. > + Bjorn, Andy i801 just uses p2sb_bar(), I don't see any issue in i801. Root cause seems to be in the PCI subsystem. Calling p2sb_bar() from a PCI driver probe callback seems to be problematic, nevertheless it's a valid API usage. The proposed fix helps to get an idea of how to work around the issue. But IMO it more cures a symptom than fixes the root cause. >> >> [1] kernel message log at the second command >> >> [ 242.922091] ====================================================== >> [ 242.931663] WARNING: possible circular locking dependency detected >> [ 242.938292] mpt3sas_cm1: 63 BIT PCI BUS DMA ADDRESSING SUPPORTED, total mem (56799464 kB) >> [ 242.939415] 6.7.0-rc1-kts #1 Not tainted >> [ 242.939419] ------------------------------------------------------ >> [ 242.939421] bash/1615 is trying to acquire lock: >> [ 242.939424] ff1100017bf87910 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: __flush_work+0xc5/0x980 >> [ 242.989069] >> but task is already holding lock: >> [ 243.000283] ffffffff870bf4a8 (pci_rescan_remove_lock){+.+.}-{3:3}, at: rescan_store+0x96/0xd0 >> [ 243.012269] >> which lock already depends on the new lock. >> >> [ 243.028569] >> the existing dependency chain (in reverse order) is: >> [ 243.041611] >> -> #1 (pci_rescan_remove_lock){+.+.}-{3:3}: >> [ 243.053709] __mutex_lock+0x16a/0x1880 >> [ 243.060767] p2sb_bar+0xa7/0x250 >> [ 243.067213] i801_add_tco_spt.constprop.0+0x88/0x1f0 [i2c_i801] >> [ 243.076707] i801_add_tco+0x18a/0x210 [i2c_i801] >> [ 243.084727] i801_probe+0x99c/0x1500 [i2c_i801] >> [ 243.092618] local_pci_probe+0xd6/0x190 >> [ 243.099708] work_for_cpu_fn+0x4e/0xa0 >> [ 243.106673] process_one_work+0x736/0x1230 >> [ 243.114012] worker_thread+0x723/0x1300 >> [ 243.121039] kthread+0x2ee/0x3d0 >> [ 243.127372] ret_from_fork+0x2d/0x70 >> [ 243.134073] ret_from_fork_asm+0x1b/0x30 >> [ 243.141140] >> -> #0 ((work_completion)(&wfc.work)){+.+.}-{0:0}: >> [ 243.153341] __lock_acquire+0x2e74/0x5ea0 >> [ 243.160490] lock_acquire+0x196/0x4b0 >> [ 243.167236] __flush_work+0xe2/0x980 >> [ 243.173882] work_on_cpu_key+0xcc/0xf0 >> [ 243.180709] pci_device_probe+0x548/0x740 >> [ 243.187813] really_probe+0x3df/0xb80 >> [ 243.194525] __driver_probe_device+0x18c/0x450 >> [ 243.202128] driver_probe_device+0x4a/0x120 >> [ 243.209437] __device_attach_driver+0x15e/0x270 >> [ 243.217149] bus_for_each_drv+0x101/0x170 >> [ 243.224260] __device_attach+0x189/0x380 >> [ 243.231254] pci_bus_add_device+0x9f/0xf0 >> [ 243.238360] pci_bus_add_devices+0x7f/0x190 >> [ 243.245639] pci_bus_add_devices+0x114/0x190 >> [ 243.253017] pci_rescan_bus+0x23/0x30 >> [ 243.259711] rescan_store+0xa2/0xd0 >> [ 243.266187] kernfs_fop_write_iter+0x356/0x530 >> [ 243.273735] vfs_write+0x513/0xd60 >> [ 243.280090] ksys_write+0xe7/0x1b0 >> [ 243.286412] do_syscall_64+0x5d/0xe0 >> [ 243.292908] entry_SYSCALL_64_after_hwframe+0x6e/0x76 >> [ 243.301053] >> other info that might help us debug this: >> >> [ 243.315550] Possible unsafe locking scenario: >> >> [ 243.325803] CPU0 CPU1 >> [ 243.332654] ---- ---- >> [ 243.339492] lock(pci_rescan_remove_lock); >> [ 243.345937] lock((work_completion)(&wfc.work)); >> [ 243.355852] lock(pci_rescan_remove_lock); >> [ 243.365170] lock((work_completion)(&wfc.work)); >> [ 243.372235] >> *** DEADLOCK *** >> >> [ 243.384100] 5 locks held by bash/1615: >> [ 243.390048] #0: ff1100013f4b0418 (sb_writers#4){.+.+}-{0:0}, at: ksys_write+0xe7/0x1b0 >> [ 243.400833] #1: ff11000128429888 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x21d/0x530 >> [ 243.412623] #2: ff11000103849968 (kn->active#136){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x241/0x530 >> [ 243.424832] #3: ffffffff870bf4a8 (pci_rescan_remove_lock){+.+.}-{3:3}, at: rescan_store+0x96/0xd0 >> [ 243.436773] #4: ff1100019cc7e1a8 (&dev->mutex){....}-{3:3}, at: __device_attach+0x67/0x380 >> [ 243.448048] >> stack backtrace: >> [ 243.456654] CPU: 16 PID: 1615 Comm: bash Not tainted 6.7.0-rc1-kts #1 >> [ 243.465797] Hardware name: Supermicro SYS-520P-WTR/X12SPW-TF, BIOS 1.2 02/14/2022 >> [ 243.476145] Call Trace: >> [ 243.480820] <TASK> >> [ 243.485084] dump_stack_lvl+0x57/0x90 >> [ 243.491112] check_noncircular+0x2e1/0x3c0 >> [ 243.497630] ? __pfx_check_noncircular+0x10/0x10 >> [ 243.504747] ? __pfx___bfs+0x10/0x10 >> [ 243.510680] ? lockdep_lock+0xbc/0x1a0 >> [ 243.516811] ? __pfx_lockdep_lock+0x10/0x10 >> [ 243.523436] __lock_acquire+0x2e74/0x5ea0 >> [ 243.529866] ? __pfx___lock_acquire+0x10/0x10 >> [ 243.536682] lock_acquire+0x196/0x4b0 >> [ 243.542710] ? __flush_work+0xc5/0x980 >> [ 243.548829] ? __pfx_lock_acquire+0x10/0x10 >> [ 243.555442] ? __pfx___lock_acquire+0x10/0x10 >> [ 243.562252] ? driver_probe_device+0x4a/0x120 >> [ 243.569061] ? __device_attach_driver+0x15e/0x270 >> [ 243.576282] ? mark_lock+0xee/0x16c0 >> [ 243.582222] ? __flush_work+0xc5/0x980 >> [ 243.588364] __flush_work+0xe2/0x980 >> [ 243.594300] ? __flush_work+0xc5/0x980 >> [ 243.600425] ? __queue_work+0x4e4/0xe30 >> [ 243.606658] ? __pfx___flush_work+0x10/0x10 >> [ 243.613287] ? lock_is_held_type+0xce/0x120 >> [ 243.619917] ? queue_work_on+0x69/0xa0 >> [ 243.626032] ? lockdep_hardirqs_on+0x7d/0x100 >> [ 243.632834] work_on_cpu_key+0xcc/0xf0 >> [ 243.638950] ? __pfx_work_on_cpu_key+0x10/0x10 >> [ 243.645849] ? __pfx_work_for_cpu_fn+0x10/0x10 >> [ 243.652738] ? __pfx_local_pci_probe+0x10/0x10 >> [ 243.659638] pci_device_probe+0x548/0x740 >> [ 243.666057] ? __pfx_pci_device_probe+0x10/0x10 >> [ 243.673057] ? kernfs_create_link+0x167/0x230 >> [ 243.679855] really_probe+0x3df/0xb80 >> [ 243.685860] __driver_probe_device+0x18c/0x450 >> [ 243.692737] driver_probe_device+0x4a/0x120 >> [ 243.699314] __device_attach_driver+0x15e/0x270 >> [ 243.706297] ? __pfx___device_attach_driver+0x10/0x10 >> [ 243.713890] bus_for_each_drv+0x101/0x170 >> [ 243.720312] ? __pfx_bus_for_each_drv+0x10/0x10 >> [ 243.727294] ? lockdep_hardirqs_on+0x7d/0x100 >> [ 243.734063] ? _raw_spin_unlock_irqrestore+0x35/0x60 >> [ 243.741505] __device_attach+0x189/0x380 >> [ 243.747747] ? __pfx___device_attach+0x10/0x10 >> [ 243.754554] pci_bus_add_device+0x9f/0xf0 >> [ 243.760836] pci_bus_add_devices+0x7f/0x190 >> [ 243.767328] pci_bus_add_devices+0x114/0x190 >> [ 243.773890] pci_rescan_bus+0x23/0x30 >> [ 243.779741] rescan_store+0xa2/0xd0 >> [ 243.785362] ? __pfx_rescan_store+0x10/0x10 >> [ 243.791785] kernfs_fop_write_iter+0x356/0x530 >> [ 243.798516] vfs_write+0x513/0xd60 >> [ 243.804054] ? __pfx_vfs_write+0x10/0x10 >> [ 243.810193] ? __fget_light+0x51/0x220 >> [ 243.816125] ? __pfx_lock_release+0x10/0x10 >> [ 243.822555] ksys_write+0xe7/0x1b0 >> [ 243.828097] ? __pfx_ksys_write+0x10/0x10 >> [ 243.834327] ? syscall_enter_from_user_mode+0x22/0x90 >> [ 243.841736] ? lockdep_hardirqs_on+0x7d/0x100 >> [ 243.848366] do_syscall_64+0x5d/0xe0 >> [ 243.854114] ? do_syscall_64+0x6c/0xe0 >> [ 243.860053] ? do_syscall_64+0x6c/0xe0 >> [ 243.865989] ? lockdep_hardirqs_on+0x7d/0x100 >> [ 243.872608] ? do_syscall_64+0x6c/0xe0 >> [ 243.878537] ? lockdep_hardirqs_on+0x7d/0x100 >> [ 243.885147] entry_SYSCALL_64_after_hwframe+0x6e/0x76 >> [ 243.892555] RIP: 0033:0x7fee10d53c34 >> [ 243.898305] Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 80 3d 35 77 0d 00 00 74 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 55 48 89 e5 48 83 ec 20 48 89 >> [ 243.922266] RSP: 002b:00007ffd173e68e8 EFLAGS: 00000202 ORIG_RAX: 0000000000000001 >> [ 243.932655] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fee10d53c34 >> [ 243.942564] RDX: 0000000000000002 RSI: 000055f17c9c4bc0 RDI: 0000000000000001 >> [ 243.952485] RBP: 00007ffd173e6910 R08: 0000000000000073 R09: 0000000000000001 >> [ 243.962408] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000002 >> [ 243.972328] R13: 000055f17c9c4bc0 R14: 00007fee10e245c0 R15: 00007fee10e21f20 >> [ 243.982259] </TASK> >> >> [2] https://patchwork.kernel.org/project/linux-pci/patch/20180921205752.3191-1-keith.busch@intel.com/ >> >> [3] fix trial patch >> >> diff --git a/drivers/i2c/busses/i2c-i801.c b/drivers/i2c/busses/i2c-i801.c >> index 070999139c6..00d57d4e006 100644 >> --- a/drivers/i2c/busses/i2c-i801.c >> +++ b/drivers/i2c/busses/i2c-i801.c >> @@ -1820,6 +1820,7 @@ static struct pci_driver i801_driver = { >> .pm = pm_sleep_ptr(&i801_pm_ops), >> .probe_type = PROBE_PREFER_ASYNCHRONOUS, >> }, >> + .local_probe = true, >> }; >> >> static int __init i2c_i801_init(struct pci_driver *drv) >> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c >> index 51ec9e7e784..161ff37143a 100644 >> --- a/drivers/pci/pci-driver.c >> +++ b/drivers/pci/pci-driver.c >> @@ -368,7 +368,7 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev, >> * device is probed from work_on_cpu() of the Physical device. >> */ >> if (node < 0 || node >= MAX_NUMNODES || !node_online(node) || >> - pci_physfn_is_probed(dev)) { >> + pci_physfn_is_probed(dev) || drv->local_probe) { >> cpu = nr_cpu_ids; >> } else { >> cpumask_var_t wq_domain_mask; >> diff --git a/include/linux/pci.h b/include/linux/pci.h >> index 60ca768bc86..6fd086eb26c 100644 >> --- a/include/linux/pci.h >> +++ b/include/linux/pci.h >> @@ -957,6 +957,7 @@ struct pci_driver { >> struct device_driver driver; >> struct pci_dynids dynids; >> bool driver_managed_dma; >> + bool local_probe; >> }; >> >> static inline struct pci_driver *to_pci_driver(struct device_driver *drv) >>
On Tue, Nov 14, 2023 at 11:47:15AM +0100, Heiner Kallweit wrote: > On 14.11.2023 11:16, Wolfram Sang wrote: > > On Tue, Nov 14, 2023 at 06:54:29AM +0000, Shinichiro Kawasaki wrote: > >> Hello there. > >> > >> Recently I tried a couple of commands below on the kernel v6.6 and v6.7-rc1, > >> then observed a lockdep WARN at the second command [1]. The first command > >> removes a PCI device, and the second command rescans whole PCI devices to > >> regain the removed device. > >> > >> # echo 1 > /sys/bus/pci/devices/0000:51:00.0/remove > >> # echo 1 > /sys/bus/pci/rescan > >> > >> I tried this rescan for SAS-HBA or AHCI controller with HDDs. When those devices > >> are left in weird status after some kernel tests, I want to remove the SAS-HBA > >> and AHCI controller and rescan to get back the devices in good status. This > >> rescan looks working good except the WARN. > >> > >> The lockdep splat indicates possible deadlock between pci_rescan_remove_lock > >> and work_completion lock have deadlock possibility. Is the lockdep WARN a known > >> issue? I found a similar discussion in the past [2], but it did not discuss the > >> work_completion lock, so my observation looks a new, different issue. > >> > >> In the call stack, I found that the workqueue thread for i801_probe() calls > >> p2sb_bar(), which locks pci_rescan_remove_lock. IMHO, the issue cause looks that > >> pci_rescan_remove_lock is locked in both workqueue context and non-workqueue > >> context. As a fix trial, I created a quick patch [3]. It calls i801_probe() in > >> non-workqueue context only by adding a new flag to struct pci_driver. With this, > >> I observed the lockdep WARN disappears. Is this a good solution approach? If > >> not, is there any other better solution? Thanks for the report! > > Thanks for the report and the proposed solution. I'll add the i801 > > experts, Jean and Heiner, to CC. > > + Bjorn, Andy + Keith and Lukas as per [2]. > i801 just uses p2sb_bar(), I don't see any issue in i801. Root cause seems to > be in the PCI subsystem. Calling p2sb_bar() from a PCI driver probe callback > seems to be problematic, nevertheless it's a valid API usage. Yeah, the problem is that we have a PCI (kind of) device that needs to be communicated with in order to instantiate a driver from whatever bus we are trying to do. > The proposed fix helps to get an idea of how to work around the issue. > But IMO it more cures a symptom than fixes the root cause. I agree, and reading [2] makes me think the same, i.e. that currently this is a Big PCI lock for everything. I wanted, before reading [2], to propose to make it nested, but I think it's not gonna fly, So, currently I'm lack of (good) ideas and would like to hear other (more experienced) PCI developers on how is to address this... > >> [1] kernel message log at the second command > >> > >> [ 242.922091] ====================================================== > >> [ 242.931663] WARNING: possible circular locking dependency detected > >> [ 242.938292] mpt3sas_cm1: 63 BIT PCI BUS DMA ADDRESSING SUPPORTED, total mem (56799464 kB) > >> [ 242.939415] 6.7.0-rc1-kts #1 Not tainted > >> [ 242.939419] ------------------------------------------------------ > >> [ 242.939421] bash/1615 is trying to acquire lock: > >> [ 242.939424] ff1100017bf87910 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: __flush_work+0xc5/0x980 > >> [ 242.989069] > >> but task is already holding lock: > >> [ 243.000283] ffffffff870bf4a8 (pci_rescan_remove_lock){+.+.}-{3:3}, at: rescan_store+0x96/0xd0 > >> [ 243.012269] > >> which lock already depends on the new lock. > >> > >> [ 243.028569] > >> the existing dependency chain (in reverse order) is: > >> [ 243.041611] > >> -> #1 (pci_rescan_remove_lock){+.+.}-{3:3}: > >> [ 243.053709] __mutex_lock+0x16a/0x1880 > >> [ 243.060767] p2sb_bar+0xa7/0x250 > >> [ 243.067213] i801_add_tco_spt.constprop.0+0x88/0x1f0 [i2c_i801] > >> [ 243.076707] i801_add_tco+0x18a/0x210 [i2c_i801] > >> [ 243.084727] i801_probe+0x99c/0x1500 [i2c_i801] > >> [ 243.092618] local_pci_probe+0xd6/0x190 > >> [ 243.099708] work_for_cpu_fn+0x4e/0xa0 > >> [ 243.106673] process_one_work+0x736/0x1230 > >> [ 243.114012] worker_thread+0x723/0x1300 > >> [ 243.121039] kthread+0x2ee/0x3d0 > >> [ 243.127372] ret_from_fork+0x2d/0x70 > >> [ 243.134073] ret_from_fork_asm+0x1b/0x30 > >> [ 243.141140] > >> -> #0 ((work_completion)(&wfc.work)){+.+.}-{0:0}: > >> [ 243.153341] __lock_acquire+0x2e74/0x5ea0 > >> [ 243.160490] lock_acquire+0x196/0x4b0 > >> [ 243.167236] __flush_work+0xe2/0x980 > >> [ 243.173882] work_on_cpu_key+0xcc/0xf0 > >> [ 243.180709] pci_device_probe+0x548/0x740 > >> [ 243.187813] really_probe+0x3df/0xb80 > >> [ 243.194525] __driver_probe_device+0x18c/0x450 > >> [ 243.202128] driver_probe_device+0x4a/0x120 > >> [ 243.209437] __device_attach_driver+0x15e/0x270 > >> [ 243.217149] bus_for_each_drv+0x101/0x170 > >> [ 243.224260] __device_attach+0x189/0x380 > >> [ 243.231254] pci_bus_add_device+0x9f/0xf0 > >> [ 243.238360] pci_bus_add_devices+0x7f/0x190 > >> [ 243.245639] pci_bus_add_devices+0x114/0x190 > >> [ 243.253017] pci_rescan_bus+0x23/0x30 > >> [ 243.259711] rescan_store+0xa2/0xd0 > >> [ 243.266187] kernfs_fop_write_iter+0x356/0x530 > >> [ 243.273735] vfs_write+0x513/0xd60 > >> [ 243.280090] ksys_write+0xe7/0x1b0 > >> [ 243.286412] do_syscall_64+0x5d/0xe0 > >> [ 243.292908] entry_SYSCALL_64_after_hwframe+0x6e/0x76 > >> [ 243.301053] > >> other info that might help us debug this: > >> > >> [ 243.315550] Possible unsafe locking scenario: > >> > >> [ 243.325803] CPU0 CPU1 > >> [ 243.332654] ---- ---- > >> [ 243.339492] lock(pci_rescan_remove_lock); > >> [ 243.345937] lock((work_completion)(&wfc.work)); > >> [ 243.355852] lock(pci_rescan_remove_lock); > >> [ 243.365170] lock((work_completion)(&wfc.work)); > >> [ 243.372235] > >> *** DEADLOCK *** > >> > >> [ 243.384100] 5 locks held by bash/1615: > >> [ 243.390048] #0: ff1100013f4b0418 (sb_writers#4){.+.+}-{0:0}, at: ksys_write+0xe7/0x1b0 > >> [ 243.400833] #1: ff11000128429888 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x21d/0x530 > >> [ 243.412623] #2: ff11000103849968 (kn->active#136){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x241/0x530 > >> [ 243.424832] #3: ffffffff870bf4a8 (pci_rescan_remove_lock){+.+.}-{3:3}, at: rescan_store+0x96/0xd0 > >> [ 243.436773] #4: ff1100019cc7e1a8 (&dev->mutex){....}-{3:3}, at: __device_attach+0x67/0x380 > >> [ 243.448048] > >> stack backtrace: > >> [ 243.456654] CPU: 16 PID: 1615 Comm: bash Not tainted 6.7.0-rc1-kts #1 > >> [ 243.465797] Hardware name: Supermicro SYS-520P-WTR/X12SPW-TF, BIOS 1.2 02/14/2022 > >> [ 243.476145] Call Trace: > >> [ 243.480820] <TASK> > >> [ 243.485084] dump_stack_lvl+0x57/0x90 > >> [ 243.491112] check_noncircular+0x2e1/0x3c0 > >> [ 243.497630] ? __pfx_check_noncircular+0x10/0x10 > >> [ 243.504747] ? __pfx___bfs+0x10/0x10 > >> [ 243.510680] ? lockdep_lock+0xbc/0x1a0 > >> [ 243.516811] ? __pfx_lockdep_lock+0x10/0x10 > >> [ 243.523436] __lock_acquire+0x2e74/0x5ea0 > >> [ 243.529866] ? __pfx___lock_acquire+0x10/0x10 > >> [ 243.536682] lock_acquire+0x196/0x4b0 > >> [ 243.542710] ? __flush_work+0xc5/0x980 > >> [ 243.548829] ? __pfx_lock_acquire+0x10/0x10 > >> [ 243.555442] ? __pfx___lock_acquire+0x10/0x10 > >> [ 243.562252] ? driver_probe_device+0x4a/0x120 > >> [ 243.569061] ? __device_attach_driver+0x15e/0x270 > >> [ 243.576282] ? mark_lock+0xee/0x16c0 > >> [ 243.582222] ? __flush_work+0xc5/0x980 > >> [ 243.588364] __flush_work+0xe2/0x980 > >> [ 243.594300] ? __flush_work+0xc5/0x980 > >> [ 243.600425] ? __queue_work+0x4e4/0xe30 > >> [ 243.606658] ? __pfx___flush_work+0x10/0x10 > >> [ 243.613287] ? lock_is_held_type+0xce/0x120 > >> [ 243.619917] ? queue_work_on+0x69/0xa0 > >> [ 243.626032] ? lockdep_hardirqs_on+0x7d/0x100 > >> [ 243.632834] work_on_cpu_key+0xcc/0xf0 > >> [ 243.638950] ? __pfx_work_on_cpu_key+0x10/0x10 > >> [ 243.645849] ? __pfx_work_for_cpu_fn+0x10/0x10 > >> [ 243.652738] ? __pfx_local_pci_probe+0x10/0x10 > >> [ 243.659638] pci_device_probe+0x548/0x740 > >> [ 243.666057] ? __pfx_pci_device_probe+0x10/0x10 > >> [ 243.673057] ? kernfs_create_link+0x167/0x230 > >> [ 243.679855] really_probe+0x3df/0xb80 > >> [ 243.685860] __driver_probe_device+0x18c/0x450 > >> [ 243.692737] driver_probe_device+0x4a/0x120 > >> [ 243.699314] __device_attach_driver+0x15e/0x270 > >> [ 243.706297] ? __pfx___device_attach_driver+0x10/0x10 > >> [ 243.713890] bus_for_each_drv+0x101/0x170 > >> [ 243.720312] ? __pfx_bus_for_each_drv+0x10/0x10 > >> [ 243.727294] ? lockdep_hardirqs_on+0x7d/0x100 > >> [ 243.734063] ? _raw_spin_unlock_irqrestore+0x35/0x60 > >> [ 243.741505] __device_attach+0x189/0x380 > >> [ 243.747747] ? __pfx___device_attach+0x10/0x10 > >> [ 243.754554] pci_bus_add_device+0x9f/0xf0 > >> [ 243.760836] pci_bus_add_devices+0x7f/0x190 > >> [ 243.767328] pci_bus_add_devices+0x114/0x190 > >> [ 243.773890] pci_rescan_bus+0x23/0x30 > >> [ 243.779741] rescan_store+0xa2/0xd0 > >> [ 243.785362] ? __pfx_rescan_store+0x10/0x10 > >> [ 243.791785] kernfs_fop_write_iter+0x356/0x530 > >> [ 243.798516] vfs_write+0x513/0xd60 > >> [ 243.804054] ? __pfx_vfs_write+0x10/0x10 > >> [ 243.810193] ? __fget_light+0x51/0x220 > >> [ 243.816125] ? __pfx_lock_release+0x10/0x10 > >> [ 243.822555] ksys_write+0xe7/0x1b0 > >> [ 243.828097] ? __pfx_ksys_write+0x10/0x10 > >> [ 243.834327] ? syscall_enter_from_user_mode+0x22/0x90 > >> [ 243.841736] ? lockdep_hardirqs_on+0x7d/0x100 > >> [ 243.848366] do_syscall_64+0x5d/0xe0 > >> [ 243.854114] ? do_syscall_64+0x6c/0xe0 > >> [ 243.860053] ? do_syscall_64+0x6c/0xe0 > >> [ 243.865989] ? lockdep_hardirqs_on+0x7d/0x100 > >> [ 243.872608] ? do_syscall_64+0x6c/0xe0 > >> [ 243.878537] ? lockdep_hardirqs_on+0x7d/0x100 > >> [ 243.885147] entry_SYSCALL_64_after_hwframe+0x6e/0x76 > >> [ 243.892555] RIP: 0033:0x7fee10d53c34 > >> [ 243.898305] Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 80 3d 35 77 0d 00 00 74 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 55 48 89 e5 48 83 ec 20 48 89 > >> [ 243.922266] RSP: 002b:00007ffd173e68e8 EFLAGS: 00000202 ORIG_RAX: 0000000000000001 > >> [ 243.932655] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fee10d53c34 > >> [ 243.942564] RDX: 0000000000000002 RSI: 000055f17c9c4bc0 RDI: 0000000000000001 > >> [ 243.952485] RBP: 00007ffd173e6910 R08: 0000000000000073 R09: 0000000000000001 > >> [ 243.962408] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000002 > >> [ 243.972328] R13: 000055f17c9c4bc0 R14: 00007fee10e245c0 R15: 00007fee10e21f20 > >> [ 243.982259] </TASK> > >> > >> [2] https://patchwork.kernel.org/project/linux-pci/patch/20180921205752.3191-1-keith.busch@intel.com/
On Tue, Nov 14, 2023 at 02:04:34PM +0200, Andy Shevchenko wrote: > On Tue, Nov 14, 2023 at 11:47:15AM +0100, Heiner Kallweit wrote: > > On 14.11.2023 11:16, Wolfram Sang wrote: > > > On Tue, Nov 14, 2023 at 06:54:29AM +0000, Shinichiro Kawasaki wrote: > > > > The lockdep splat indicates possible deadlock between > > > > pci_rescan_remove_lock and work_completion lock have deadlock > > > > possibility. > > > > In the call stack, I found that the workqueue thread for > > > > i801_probe() calls p2sb_bar(), which locks pci_rescan_remove_lock. > > > > i801 just uses p2sb_bar(), I don't see any issue in i801. Root cause > > seems to be in the PCI subsystem. Calling p2sb_bar() from a PCI driver > > probe callback seems to be problematic, nevertheless it's a valid API > > usage. > > So, currently I'm lack of (good) ideas and would like to hear other (more > experienced) PCI developers on how is to address this... Can you add a p2sb_bar_locked() library call which is used by the i801 probe path? Basically rename p2sb_bar() to __p2sb_bar() and add a third parameter of type boolean which signifies whether it's invoked in locked context or not, then call that from p2sb_bar() and p2sb_bar_locked() wrappers. Thanks, Lukas
On Tue, Nov 14, 2023 at 04:57:01PM +0100, Lukas Wunner wrote: > On Tue, Nov 14, 2023 at 02:04:34PM +0200, Andy Shevchenko wrote: > > On Tue, Nov 14, 2023 at 11:47:15AM +0100, Heiner Kallweit wrote: > > > On 14.11.2023 11:16, Wolfram Sang wrote: > > > > On Tue, Nov 14, 2023 at 06:54:29AM +0000, Shinichiro Kawasaki wrote: ... > > > > > The lockdep splat indicates possible deadlock between > > > > > pci_rescan_remove_lock and work_completion lock have deadlock > > > > > possibility. > > > > > In the call stack, I found that the workqueue thread for > > > > > i801_probe() calls p2sb_bar(), which locks pci_rescan_remove_lock. > > > > > > i801 just uses p2sb_bar(), I don't see any issue in i801. Root cause > > > seems to be in the PCI subsystem. Calling p2sb_bar() from a PCI driver > > > probe callback seems to be problematic, nevertheless it's a valid API > > > usage. > > > > So, currently I'm lack of (good) ideas and would like to hear other (more > > experienced) PCI developers on how is to address this... > > Can you add a p2sb_bar_locked() library call which is used by the > i801 probe path? > > Basically rename p2sb_bar() to __p2sb_bar() and add a third parameter > of type boolean which signifies whether it's invoked in locked context > or not, then call that from p2sb_bar() and p2sb_bar_locked() wrappers. It may work, I assume. Let me cook the patch.
On Tue, Nov 14, 2023 at 06:11:40PM +0200, Andy Shevchenko wrote: > On Tue, Nov 14, 2023 at 04:57:01PM +0100, Lukas Wunner wrote: > > On Tue, Nov 14, 2023 at 02:04:34PM +0200, Andy Shevchenko wrote: > > > On Tue, Nov 14, 2023 at 11:47:15AM +0100, Heiner Kallweit wrote: > > > > On 14.11.2023 11:16, Wolfram Sang wrote: > > > > > On Tue, Nov 14, 2023 at 06:54:29AM +0000, Shinichiro Kawasaki wrote: ... > > > > > > The lockdep splat indicates possible deadlock between > > > > > > pci_rescan_remove_lock and work_completion lock have deadlock > > > > > > possibility. > > > > > > In the call stack, I found that the workqueue thread for > > > > > > i801_probe() calls p2sb_bar(), which locks pci_rescan_remove_lock. > > > > > > > > i801 just uses p2sb_bar(), I don't see any issue in i801. Root cause > > > > seems to be in the PCI subsystem. Calling p2sb_bar() from a PCI driver > > > > probe callback seems to be problematic, nevertheless it's a valid API > > > > usage. > > > > > > So, currently I'm lack of (good) ideas and would like to hear other (more > > > experienced) PCI developers on how is to address this... > > > > Can you add a p2sb_bar_locked() library call which is used by the > > i801 probe path? > > > > Basically rename p2sb_bar() to __p2sb_bar() and add a third parameter > > of type boolean which signifies whether it's invoked in locked context > > or not, then call that from p2sb_bar() and p2sb_bar_locked() wrappers. > > It may work, I assume. Let me cook the patch. Hmm... But this will open a window when probing phase happens, e.g. during boot time, no? If somebody somehow calls for full rescan, we may end up in the situation when P2SB is gone before accessing it in p2sb_bar(). Now I'm wondering why simple pci_dev_get() can't be sufficient in the p2sb_bar().
On Nov 14, 2023 / 19:58, Andy Shevchenko wrote: > On Tue, Nov 14, 2023 at 06:11:40PM +0200, Andy Shevchenko wrote: > > On Tue, Nov 14, 2023 at 04:57:01PM +0100, Lukas Wunner wrote: > > > On Tue, Nov 14, 2023 at 02:04:34PM +0200, Andy Shevchenko wrote: > > > > On Tue, Nov 14, 2023 at 11:47:15AM +0100, Heiner Kallweit wrote: > > > > > On 14.11.2023 11:16, Wolfram Sang wrote: > > > > > > On Tue, Nov 14, 2023 at 06:54:29AM +0000, Shinichiro Kawasaki wrote: > > ... > > > > > > > > The lockdep splat indicates possible deadlock between > > > > > > > pci_rescan_remove_lock and work_completion lock have deadlock > > > > > > > possibility. > > > > > > > In the call stack, I found that the workqueue thread for > > > > > > > i801_probe() calls p2sb_bar(), which locks pci_rescan_remove_lock. > > > > > > > > > > i801 just uses p2sb_bar(), I don't see any issue in i801. Root cause > > > > > seems to be in the PCI subsystem. Calling p2sb_bar() from a PCI driver > > > > > probe callback seems to be problematic, nevertheless it's a valid API > > > > > usage. > > > > > > > > So, currently I'm lack of (good) ideas and would like to hear other (more > > > > experienced) PCI developers on how is to address this... > > > > > > Can you add a p2sb_bar_locked() library call which is used by the > > > i801 probe path? > > > > > > Basically rename p2sb_bar() to __p2sb_bar() and add a third parameter > > > of type boolean which signifies whether it's invoked in locked context > > > or not, then call that from p2sb_bar() and p2sb_bar_locked() wrappers. > > > > It may work, I assume. Let me cook the patch. > > Hmm... But this will open a window when probing phase happens, e.g. during > boot time, no? If somebody somehow calls for full rescan, we may end up in > the situation when P2SB is gone before accessing it in p2sb_bar(). > > Now I'm wondering why simple pci_dev_get() can't be sufficient in the > p2sb_bar(). All, thanks for the discussion. It looks rather difficult to avoid the WARN. To confirm that the deadlock is for real, I tried to remove i2c-i801 device and did /sys/bus/pci/rescan with two commands below: # echo 1 > /sys/bus/pci/devices/0000\:00\:1f.4/remove # echo 1 > /sys/bus/pci/rescan Then I observed the second command hangs. I came across another fix idea: assuming the guard by pci_rescan_remove_lock is required in p2sb_bar(), how about to do trylock? If the mutex can not be locked, make the p2sb_bar() call fail. This way, we can avoid the deadlock between pci_rescan_remove_lock and workqueue completion. I created a patch below and confirmed it avoided the lockdep WARN. The i2c-i801 probe was ok at system boot. When I did the two commands above, I observed the i2c-i801 device probe failed due to trylock failure. But I think it's far better than hang. diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c index ed6b7f48736..3e784fb6cd9 100644 --- a/drivers/pci/probe.c +++ b/drivers/pci/probe.c @@ -3312,6 +3312,18 @@ void pci_lock_rescan_remove(void) } EXPORT_SYMBOL_GPL(pci_lock_rescan_remove); +/* + * Try to acquire pci_rescan_remove_lock. Returns 1 if the mutex + * has been acquired successfully, and 0 on contention. Use this + * to acquire the lock in workqueue context to avoid potential deadlock + * together with work_completion. + */ +int pci_trylock_rescan_remove(void) +{ + return mutex_trylock(&pci_rescan_remove_lock); +} +EXPORT_SYMBOL_GPL(pci_trylock_rescan_remove); + void pci_unlock_rescan_remove(void) { mutex_unlock(&pci_rescan_remove_lock); diff --git a/drivers/platform/x86/p2sb.c b/drivers/platform/x86/p2sb.c index 1cf2471d54d..7a6bee8abf9 100644 --- a/drivers/platform/x86/p2sb.c +++ b/drivers/platform/x86/p2sb.c @@ -113,7 +113,10 @@ int p2sb_bar(struct pci_bus *bus, unsigned int devfn, struct resource *mem) * Prevent concurrent PCI bus scan from seeing the P2SB device and * removing via sysfs while it is temporarily exposed. */ - pci_lock_rescan_remove(); + if (!pci_trylock_rescan_remove()) { + pr_err("P2SB device accessed during PCI rescan"); + return -EBUSY; + } /* Unhide the P2SB device, if needed */ pci_bus_read_config_dword(bus, devfn_p2sb, P2SBC, &value); diff --git a/include/linux/pci.h b/include/linux/pci.h index 60ca768bc86..e6db5096217 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -1439,6 +1439,7 @@ void set_pcie_hotplug_bridge(struct pci_dev *pdev); unsigned int pci_rescan_bus_bridge_resize(struct pci_dev *bridge); unsigned int pci_rescan_bus(struct pci_bus *bus); void pci_lock_rescan_remove(void); +int pci_trylock_rescan_remove(void); void pci_unlock_rescan_remove(void); /* Vital Product Data routines */
On Fri, Nov 24, 2023 at 10:49:45AM +0000, Shinichiro Kawasaki wrote: > On Nov 14, 2023 / 19:58, Andy Shevchenko wrote: ... > I created a patch below and confirmed it avoided the lockdep WARN. The > i2c-i801 probe was ok at system boot. Another possible solution I was thinking about is to have a local cache, so, make p2sb.c to be called just after PCI enumeration at boot time to cache the P2SB device's bar, and then cache the bar based on the device in question at the first call. Yet it may not remove all theoretical / possible scenarios with dead lock (taking into account hotpluggable devices), but won't fail the i801 driver in the above use case IIUC.
On 24.11.2023 11:49, Shinichiro Kawasaki wrote: > On Nov 14, 2023 / 19:58, Andy Shevchenko wrote: >> On Tue, Nov 14, 2023 at 06:11:40PM +0200, Andy Shevchenko wrote: >>> On Tue, Nov 14, 2023 at 04:57:01PM +0100, Lukas Wunner wrote: >>>> On Tue, Nov 14, 2023 at 02:04:34PM +0200, Andy Shevchenko wrote: >>>>> On Tue, Nov 14, 2023 at 11:47:15AM +0100, Heiner Kallweit wrote: >>>>>> On 14.11.2023 11:16, Wolfram Sang wrote: >>>>>>> On Tue, Nov 14, 2023 at 06:54:29AM +0000, Shinichiro Kawasaki wrote: >> >> ... >> >>>>>>>> The lockdep splat indicates possible deadlock between >>>>>>>> pci_rescan_remove_lock and work_completion lock have deadlock >>>>>>>> possibility. >>>>>>>> In the call stack, I found that the workqueue thread for >>>>>>>> i801_probe() calls p2sb_bar(), which locks pci_rescan_remove_lock. >>>>>> >>>>>> i801 just uses p2sb_bar(), I don't see any issue in i801. Root cause >>>>>> seems to be in the PCI subsystem. Calling p2sb_bar() from a PCI driver >>>>>> probe callback seems to be problematic, nevertheless it's a valid API >>>>>> usage. >>>>> >>>>> So, currently I'm lack of (good) ideas and would like to hear other (more >>>>> experienced) PCI developers on how is to address this... >>>> >>>> Can you add a p2sb_bar_locked() library call which is used by the >>>> i801 probe path? >>>> >>>> Basically rename p2sb_bar() to __p2sb_bar() and add a third parameter >>>> of type boolean which signifies whether it's invoked in locked context >>>> or not, then call that from p2sb_bar() and p2sb_bar_locked() wrappers. >>> >>> It may work, I assume. Let me cook the patch. >> >> Hmm... But this will open a window when probing phase happens, e.g. during >> boot time, no? If somebody somehow calls for full rescan, we may end up in >> the situation when P2SB is gone before accessing it in p2sb_bar(). >> >> Now I'm wondering why simple pci_dev_get() can't be sufficient in the >> p2sb_bar(). > > All, thanks for the discussion. It looks rather difficult to avoid the WARN. > > To confirm that the deadlock is for real, I tried to remove i2c-i801 device and > did /sys/bus/pci/rescan with two commands below: > > # echo 1 > /sys/bus/pci/devices/0000\:00\:1f.4/remove > # echo 1 > /sys/bus/pci/rescan > > Then I observed the second command hangs. > > I came across another fix idea: assuming the guard by pci_rescan_remove_lock is > required in p2sb_bar(), how about to do trylock? If the mutex can not be locked, > make the p2sb_bar() call fail. This way, we can avoid the deadlock between > pci_rescan_remove_lock and workqueue completion. > > I created a patch below and confirmed it avoided the lockdep WARN. The i2c-i801 > probe was ok at system boot. When I did the two commands above, I observed the > i2c-i801 device probe failed due to trylock failure. But I think it's far better > than hang. > I wouldn't call this a solution. A solution has to support pci drivers using p2sb_bar() in probe(). You can't simply make them fail. > > diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c > index ed6b7f48736..3e784fb6cd9 100644 > --- a/drivers/pci/probe.c > +++ b/drivers/pci/probe.c > @@ -3312,6 +3312,18 @@ void pci_lock_rescan_remove(void) > } > EXPORT_SYMBOL_GPL(pci_lock_rescan_remove); > > +/* > + * Try to acquire pci_rescan_remove_lock. Returns 1 if the mutex > + * has been acquired successfully, and 0 on contention. Use this > + * to acquire the lock in workqueue context to avoid potential deadlock > + * together with work_completion. > + */ > +int pci_trylock_rescan_remove(void) > +{ > + return mutex_trylock(&pci_rescan_remove_lock); > +} > +EXPORT_SYMBOL_GPL(pci_trylock_rescan_remove); > + > void pci_unlock_rescan_remove(void) > { > mutex_unlock(&pci_rescan_remove_lock); > diff --git a/drivers/platform/x86/p2sb.c b/drivers/platform/x86/p2sb.c > index 1cf2471d54d..7a6bee8abf9 100644 > --- a/drivers/platform/x86/p2sb.c > +++ b/drivers/platform/x86/p2sb.c > @@ -113,7 +113,10 @@ int p2sb_bar(struct pci_bus *bus, unsigned int devfn, struct resource *mem) > * Prevent concurrent PCI bus scan from seeing the P2SB device and > * removing via sysfs while it is temporarily exposed. > */ > - pci_lock_rescan_remove(); > + if (!pci_trylock_rescan_remove()) { > + pr_err("P2SB device accessed during PCI rescan"); > + return -EBUSY; > + } > > /* Unhide the P2SB device, if needed */ > pci_bus_read_config_dword(bus, devfn_p2sb, P2SBC, &value); > diff --git a/include/linux/pci.h b/include/linux/pci.h > index 60ca768bc86..e6db5096217 100644 > --- a/include/linux/pci.h > +++ b/include/linux/pci.h > @@ -1439,6 +1439,7 @@ void set_pcie_hotplug_bridge(struct pci_dev *pdev); > unsigned int pci_rescan_bus_bridge_resize(struct pci_dev *bridge); > unsigned int pci_rescan_bus(struct pci_bus *bus); > void pci_lock_rescan_remove(void); > +int pci_trylock_rescan_remove(void); > void pci_unlock_rescan_remove(void); > > /* Vital Product Data routines */
On Nov 24, 2023 / 17:22, Andy Shevchenko wrote: > On Fri, Nov 24, 2023 at 10:49:45AM +0000, Shinichiro Kawasaki wrote: > > On Nov 14, 2023 / 19:58, Andy Shevchenko wrote: > > ... > > > I created a patch below and confirmed it avoided the lockdep WARN. The > > i2c-i801 probe was ok at system boot. > > Another possible solution I was thinking about is to have a local cache, > so, make p2sb.c to be called just after PCI enumeration at boot time > to cache the P2SB device's bar, and then cache the bar based on the device > in question at the first call. Yet it may not remove all theoretical / > possible scenarios with dead lock (taking into account hotpluggable > devices), but won't fail the i801 driver in the above use case IIUC. Thanks for the idea. I created an experimental patch below (it does not guard list nor free the list elements, so it is incomplete). I confirmed that this patch avoids the deadlock. So your idea looks working. I still observe the deadlock WARN, but it looks better than the hang by the deadlock. Having said that, Heiner says in another mail that "A solution has to support pci drivers using p2sb_bar() in probe()". This idea does not fulfill it. Hmm. diff --git a/drivers/platform/x86/p2sb.c b/drivers/platform/x86/p2sb.c index 1cf2471d54d..0e7057979a2 100644 --- a/drivers/platform/x86/p2sb.c +++ b/drivers/platform/x86/p2sb.c @@ -26,6 +26,15 @@ static const struct x86_cpu_id p2sb_cpu_ids[] = { {} }; +struct p2sb_res_cache { + struct list_head entry; + u32 bus_dev_id; + unsigned int devfn; + struct resource res; +}; + +static LIST_HEAD(p2sb_res_list); + static int p2sb_get_devfn(unsigned int *devfn) { unsigned int fn = P2SB_DEVFN_DEFAULT; @@ -75,26 +84,8 @@ static int p2sb_scan_and_read(struct pci_bus *bus, unsigned int devfn, struct re return ret; } -/** - * p2sb_bar - Get Primary to Sideband (P2SB) bridge device BAR - * @bus: PCI bus to communicate with - * @devfn: PCI slot and function to communicate with - * @mem: memory resource to be filled in - * - * The BIOS prevents the P2SB device from being enumerated by the PCI - * subsystem, so we need to unhide and hide it back to lookup the BAR. - * - * if @bus is NULL, the bus 0 in domain 0 will be used. - * If @devfn is 0, it will be replaced by devfn of the P2SB device. - * - * Caller must provide a valid pointer to @mem. - * - * Locking is handled by pci_rescan_remove_lock mutex. - * - * Return: - * 0 on success or appropriate errno value on error. - */ -int p2sb_bar(struct pci_bus *bus, unsigned int devfn, struct resource *mem) +static int __p2sb_bar(struct pci_bus *bus, unsigned int devfn, + struct resource *mem) { struct pci_dev *pdev_p2sb; unsigned int devfn_p2sb; @@ -141,4 +132,54 @@ int p2sb_bar(struct pci_bus *bus, unsigned int devfn, struct resource *mem) return 0; } + +/** + * p2sb_bar - Get Primary to Sideband (P2SB) bridge device BAR + * @bus: PCI bus to communicate with + * @devfn: PCI slot and function to communicate with + * @mem: memory resource to be filled in + * + * The BIOS prevents the P2SB device from being enumerated by the PCI + * subsystem, so we need to unhide and hide it back to lookup the BAR. + * + * if @bus is NULL, the bus 0 in domain 0 will be used. + * If @devfn is 0, it will be replaced by devfn of the P2SB device. + * + * Caller must provide a valid pointer to @mem. + * + * Locking is handled by pci_rescan_remove_lock mutex. + * + * Return: + * 0 on success or appropriate errno value on error. + */ +int p2sb_bar(struct pci_bus *bus, unsigned int devfn, struct resource *mem) +{ + int ret; + struct p2sb_res_cache *cache; + struct resource res; + + list_for_each_entry(cache, &p2sb_res_list, entry) { + if (cache->bus_dev_id == bus->dev.id && + cache->devfn == devfn) { + memcpy(mem, &cache->res, sizeof(*mem)); + return 0; + } + } + + ret = __p2sb_bar(bus, devfn, &res); + if (ret) + return ret; + + memcpy(mem, &res, sizeof(res)); + + cache = kmalloc(sizeof(*cache), GFP_KERNEL); + if (cache) { + cache->bus_dev_id = bus->dev.id; + cache->devfn = devfn; + memcpy(&cache->res, &res, sizeof(res)); + list_add(&cache->entry, &p2sb_res_list); + } + + return 0; +} EXPORT_SYMBOL_GPL(p2sb_bar);
On Nov 24, 2023 / 18:30, Heiner Kallweit wrote: > On 24.11.2023 11:49, Shinichiro Kawasaki wrote: [...] > > I came across another fix idea: assuming the guard by pci_rescan_remove_lock is > > required in p2sb_bar(), how about to do trylock? If the mutex can not be locked, > > make the p2sb_bar() call fail. This way, we can avoid the deadlock between > > pci_rescan_remove_lock and workqueue completion. > > > > I created a patch below and confirmed it avoided the lockdep WARN. The i2c-i801 > > probe was ok at system boot. When I did the two commands above, I observed the > > i2c-i801 device probe failed due to trylock failure. But I think it's far better > > than hang. > > > > I wouldn't call this a solution. A solution has to support pci drivers using > p2sb_bar() in probe(). You can't simply make them fail. I see... it looks a bit tough to find out the good solution, but let me try some more. Here are my three observations: A) pci drivers should be able to call p2sb_bar() in probe() without failure. B) when /sys/bus/pci/rescan is written, pci_rescan_remove_lock is locked then probe() is called. C) p2sb_bar() locks pci_rescan_remove_lock. These results in the deadlock. To avoid the deadlock, one of the three needs to change. A) is not to change. I guess changing B) will be too much. So, I would like to question if we can change C). Can we remove pci_rescan_remove_lock in p2sb_bar()? Maybe no. As its inline comment says, p2sb_bar() locks pci_rescan_remove_lock to guard the section which unhides and hides the P2SB device from parallel scan. We can't simply remove it. Or, can we replace pci_rescan_remove_lock with other lock? For example, what if we have locks for each pci_bus which serialize scans for each pci_bus? Is it enough to guard the P2SB device? (Or is it just a stupid idea?) As a trial, I created a patch below. I confirmed it avoids the deadlock and the lockdep WARN. But it has impacts on PCI sub-system, and I'm not sure if it really hides the P2SB device. May ask comments on it? diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c index ed6b7f48736..6995d66fb36 100644 --- a/drivers/pci/probe.c +++ b/drivers/pci/probe.c @@ -562,6 +562,7 @@ static struct pci_bus *pci_alloc_bus(struct pci_bus *parent) if (parent) b->domain_nr = parent->domain_nr; #endif + mutex_init(&b->scan_lock); return b; } @@ -2417,6 +2418,18 @@ bool pci_bus_read_dev_vendor_id(struct pci_bus *bus, int devfn, u32 *l, } EXPORT_SYMBOL(pci_bus_read_dev_vendor_id); +void pci_lock_bus_scan(struct pci_bus *bus) +{ + mutex_lock(&bus->scan_lock); +} +EXPORT_SYMBOL_GPL(pci_lock_bus_scan); + +void pci_unlock_bus_scan(struct pci_bus *bus) +{ + mutex_unlock(&bus->scan_lock); +} +EXPORT_SYMBOL_GPL(pci_unlock_bus_scan); + /* * Read the config data for a PCI device, sanity-check it, * and fill in the dev structure. @@ -2584,26 +2597,45 @@ void pci_device_add(struct pci_dev *dev, struct pci_bus *bus) WARN_ON(ret < 0); } -struct pci_dev *pci_scan_single_device(struct pci_bus *bus, int devfn) +struct pci_dev *do_pci_scan_single_device(struct pci_bus *bus, int devfn, + bool to_lock) { struct pci_dev *dev; + if (to_lock) + pci_lock_bus_scan(bus); + dev = pci_get_slot(bus, devfn); if (dev) { pci_dev_put(dev); - return dev; + goto out; } dev = pci_scan_device(bus, devfn); if (!dev) - return NULL; + goto out; pci_device_add(dev, bus); +out: + if (to_lock) + pci_unlock_bus_scan(bus); return dev; } + +struct pci_dev *pci_scan_single_device(struct pci_bus *bus, int devfn) +{ + return do_pci_scan_single_device(bus, devfn, true); +} EXPORT_SYMBOL(pci_scan_single_device); +struct pci_dev *pci_scan_single_device_under_lock(struct pci_bus *bus, + int devfn) +{ + return do_pci_scan_single_device(bus, devfn, false); +} +EXPORT_SYMBOL(pci_scan_single_device_under_lock); + static int next_ari_fn(struct pci_bus *bus, struct pci_dev *dev, int fn) { int pos; diff --git a/drivers/platform/x86/p2sb.c b/drivers/platform/x86/p2sb.c index 1cf2471d54d..0073ef1a9a5 100644 --- a/drivers/platform/x86/p2sb.c +++ b/drivers/platform/x86/p2sb.c @@ -65,7 +65,7 @@ static int p2sb_scan_and_read(struct pci_bus *bus, unsigned int devfn, struct re struct pci_dev *pdev; int ret; - pdev = pci_scan_single_device(bus, devfn); + pdev = pci_scan_single_device_under_lock(bus, devfn); if (!pdev) return -ENODEV; @@ -89,7 +89,7 @@ static int p2sb_scan_and_read(struct pci_bus *bus, unsigned int devfn, struct re * * Caller must provide a valid pointer to @mem. * - * Locking is handled by pci_rescan_remove_lock mutex. + * Locking is handled by bus scan lock. * * Return: * 0 on success or appropriate errno value on error. @@ -113,14 +113,14 @@ int p2sb_bar(struct pci_bus *bus, unsigned int devfn, struct resource *mem) * Prevent concurrent PCI bus scan from seeing the P2SB device and * removing via sysfs while it is temporarily exposed. */ - pci_lock_rescan_remove(); + pci_lock_bus_scan(bus); /* Unhide the P2SB device, if needed */ pci_bus_read_config_dword(bus, devfn_p2sb, P2SBC, &value); if (value & P2SBC_HIDE) pci_bus_write_config_dword(bus, devfn_p2sb, P2SBC, 0); - pdev_p2sb = pci_scan_single_device(bus, devfn_p2sb); + pdev_p2sb = pci_scan_single_device_under_lock(bus, devfn_p2sb); if (devfn) ret = p2sb_scan_and_read(bus, devfn, mem); else @@ -131,7 +131,7 @@ int p2sb_bar(struct pci_bus *bus, unsigned int devfn, struct resource *mem) if (value & P2SBC_HIDE) pci_bus_write_config_dword(bus, devfn_p2sb, P2SBC, P2SBC_HIDE); - pci_unlock_rescan_remove(); + pci_unlock_bus_scan(bus); if (ret) return ret; diff --git a/include/linux/pci.h b/include/linux/pci.h index 60ca768bc86..096f018038f 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -678,6 +678,7 @@ struct pci_bus { struct bin_attribute *legacy_mem; /* Legacy mem */ unsigned int is_added:1; unsigned int unsafe_warn:1; /* warned about RW1C config write */ + struct mutex scan_lock; /* Serialize bus scan */ }; #define to_pci_bus(n) container_of(n, struct pci_bus, dev) @@ -1159,6 +1160,10 @@ static inline void pci_dev_assign_slot(struct pci_dev *dev) { } #endif int pci_scan_slot(struct pci_bus *bus, int devfn); struct pci_dev *pci_scan_single_device(struct pci_bus *bus, int devfn); +void pci_lock_bus_scan(struct pci_bus *bus); +void pci_unlock_bus_scan(struct pci_bus *bus); +struct pci_dev *pci_scan_single_device_under_lock(struct pci_bus *bus, + int devfn); void pci_device_add(struct pci_dev *dev, struct pci_bus *bus); unsigned int pci_scan_child_bus(struct pci_bus *bus); void pci_bus_add_device(struct pci_dev *dev);
On Tue, Nov 28, 2023 at 07:45:06AM +0000, Shinichiro Kawasaki wrote: > On Nov 24, 2023 / 17:22, Andy Shevchenko wrote: > > Another possible solution I was thinking about is to have a local cache, > > so, make p2sb.c to be called just after PCI enumeration at boot time > > to cache the P2SB device's bar, and then cache the bar based on the device > > in question at the first call. Yet it may not remove all theoretical / > > possible scenarios with dead lock (taking into account hotpluggable > > devices), but won't fail the i801 driver in the above use case IIUC. > > Thanks for the idea. I created an experimental patch below (it does not guard > list nor free the list elements, so it is incomplete). I confirmed that this > patch avoids the deadlock. So your idea looks working. I still observe the > deadlock WARN, but it looks better than the hang by the deadlock. Your patch uses a list to store a multitude of struct resource. Is that actually necessary? I thought there can only be a single P2SB device in the system? > Having said that, Heiner says in another mail that "A solution has to support > pci drivers using p2sb_bar() in probe()". This idea does not fulfill it. Hmm. Basically what you need to do is create two initcalls: Add one arch_initcall to unhide the P2SB device. The P2SB subsequently gets enumerated by the PCI core in a subsys_initcall. Then add an fs_initcall which extracts and stashes the struct resource, hides the P2SB device and destroys the corresponding pci_dev. Then you don't need to acquire any locks at runtime, just retrieve the stashed struct resource. This approach will result in the P2SB device briefly being enumerated and a driver could in theory bind to it. Andy, is this a problem? I'm not seeing any drivers in the tree which bind to 8086/c5c5. Thanks, Lukas
On Tue, Nov 28, 2023 at 10:16:28AM +0000, Shinichiro Kawasaki wrote: > Here are my three observations: > > A) pci drivers should be able to call p2sb_bar() in probe() without failure. > B) when /sys/bus/pci/rescan is written, pci_rescan_remove_lock is locked then > probe() is called. > C) p2sb_bar() locks pci_rescan_remove_lock. > > These results in the deadlock. To avoid the deadlock, one of the three needs > to change. A) is not to change. I guess changing B) will be too much. So, I > would like to question if we can change C). It is possible to allow recursive acquisition of a mutex by doing two things: * You need to store a pointer to the task_struct which is holding the lock. This allows you to identify upon a recursive acquisition that you're already holding the lock. The acquire operation becomes a no-op in this case. * You need a counter of how many times you've acquired the lock recursively. This allows you to determine upon lock release whether the release operation should be a no-op (due to previous recursive acquisition) or whether it should result in actual lock release (no previous recursive locking, or recursive locking has ended). Actually struct mutex already stores the owner of the lock, but that's only available internally. While it would be possible to allow recursive acquisition of pci_rescan_remove_lock in this way, doing so merely because of a vendor-specific platform quirk will likely be considered dodgy by the upstream community. So Andy's proposal to stash the struct resource on affected platforms seems more viable from an upstream acceptability point of view. Thanks, Lukas
On Wed, Nov 29, 2023 at 03:50:21PM +0200, Andy Shevchenko wrote: > On Wed, Nov 29, 2023 at 12:17:39PM +0100, Lukas Wunner wrote: > > On Tue, Nov 28, 2023 at 07:45:06AM +0000, Shinichiro Kawasaki wrote: > > > On Nov 24, 2023 / 17:22, Andy Shevchenko wrote: ... > > > > Another possible solution I was thinking about is to have a local cache, > > > > so, make p2sb.c to be called just after PCI enumeration at boot time > > > > to cache the P2SB device's bar, and then cache the bar based on the device > > > > in question at the first call. Yet it may not remove all theoretical / > > > > possible scenarios with dead lock (taking into account hotpluggable > > > > devices), but won't fail the i801 driver in the above use case IIUC. > > > > > > Thanks for the idea. I created an experimental patch below (it does not guard > > > list nor free the list elements, so it is incomplete). I confirmed that this > > > patch avoids the deadlock. So your idea looks working. I still observe the > > > deadlock WARN, but it looks better than the hang by the deadlock. > > > > Your patch uses a list to store a multitude of struct resource. > > Is that actually necessary? I thought there can only be a single > > P2SB device in the system? > > > > > Having said that, Heiner says in another mail that "A solution has to support > > > pci drivers using p2sb_bar() in probe()". This idea does not fulfill it. Hmm. > > > > Basically what you need to do is create two initcalls: > > > > Add one arch_initcall to unhide the P2SB device. > > > > The P2SB subsequently gets enumerated by the PCI core in a subsys_initcall. > > > > Then add an fs_initcall which extracts and stashes the struct resource, > > hides the P2SB device and destroys the corresponding pci_dev. > > > > Then you don't need to acquire any locks at runtime, just retrieve the > > stashed struct resource. > > > > This approach will result in the P2SB device briefly being enumerated > > and a driver could in theory bind to it. Andy, is this a problem? > > I'm not seeing any drivers in the tree which bind to 8086/c5c5. > > At least one problem just out of my head. The P2SB on many system is PCI > function 0. Unhiding the P2SB unhides all functions on that device, and > we have use cases for those (that's why we have two first parameters to > p2sb_bar() in case we want non-default device to be looked at). For the clarity this is true for ATOM_GOLDMONT (see p2sb_cpu_ids array).
On Nov 29, 2023 / 15:53, Andy Shevchenko wrote: > On Wed, Nov 29, 2023 at 03:50:21PM +0200, Andy Shevchenko wrote: > > On Wed, Nov 29, 2023 at 12:17:39PM +0100, Lukas Wunner wrote: > > > On Tue, Nov 28, 2023 at 07:45:06AM +0000, Shinichiro Kawasaki wrote: > > > > On Nov 24, 2023 / 17:22, Andy Shevchenko wrote: > > ... > > > > > > Another possible solution I was thinking about is to have a local cache, > > > > > so, make p2sb.c to be called just after PCI enumeration at boot time > > > > > to cache the P2SB device's bar, and then cache the bar based on the device > > > > > in question at the first call. Yet it may not remove all theoretical / > > > > > possible scenarios with dead lock (taking into account hotpluggable > > > > > devices), but won't fail the i801 driver in the above use case IIUC. > > > > > > > > Thanks for the idea. I created an experimental patch below (it does not guard > > > > list nor free the list elements, so it is incomplete). I confirmed that this > > > > patch avoids the deadlock. So your idea looks working. I still observe the > > > > deadlock WARN, but it looks better than the hang by the deadlock. > > > > > > Your patch uses a list to store a multitude of struct resource. > > > Is that actually necessary? I thought there can only be a single > > > P2SB device in the system? Yes, the list might be too much. I was not sure what is the expected number of P2SB resources to be cached. I found drivers/mfd/lpc_ich.c calls p2sb_bar() at two places for devfn=0 and devfn=(13,2), so at least two resources look required. Not sure about the future. If two static resources are sufficient, the code will be simpler. > > > > > > > Having said that, Heiner says in another mail that "A solution has to support > > > > pci drivers using p2sb_bar() in probe()". This idea does not fulfill it. Hmm. > > > > > > Basically what you need to do is create two initcalls: > > > > > > Add one arch_initcall to unhide the P2SB device. > > > > > > The P2SB subsequently gets enumerated by the PCI core in a subsys_initcall. > > > > > > Then add an fs_initcall which extracts and stashes the struct resource, > > > hides the P2SB device and destroys the corresponding pci_dev. > > > > > > Then you don't need to acquire any locks at runtime, just retrieve the > > > stashed struct resource. > > > > > > This approach will result in the P2SB device briefly being enumerated > > > and a driver could in theory bind to it. Andy, is this a problem? > > > I'm not seeing any drivers in the tree which bind to 8086/c5c5. > > > > At least one problem just out of my head. The P2SB on many system is PCI > > function 0. Unhiding the P2SB unhides all functions on that device, and > > we have use cases for those (that's why we have two first parameters to > > p2sb_bar() in case we want non-default device to be looked at). > > For the clarity this is true for ATOM_GOLDMONT (see p2sb_cpu_ids array). Lukas, thank you for the idea. If I understand the comment by Andy correctly, P2SB should not be unhidden between arch_initcall and fs_initcall. Hmm. This made me think: how about to unhide and hide P2SB just during fs_initcall to cache the P2SB resources? To try it, I added a function below on top of the previous trial patch. The added function calls p2sb_bar() for devfn=0 at fs_initcall so that the resource is cached before probe of i2c-i801. This worked good on my system. It avoided the deadlock as well as the lockdep WARN :) static int __init p2sb_fs_init(void) { struct pci_bus *bus; struct resource mem; int ret = 0; bus = pci_find_bus(0, 0); if (bus) { ret = p2sb_bar(bus, 0, &mem); if (ret) pr_err("p2sb_bar failed: %d", ret); } return 0; } fs_initcall(p2sb_fs_init); The result of the trial is encouraging, but I'm not yet sure if this idea is really feasible. I have three questions in my mind: - The trial function above assumed the P2SB device is at the PCI bus number=0 and domain=0. It is ok on my system, but is it valid always? I see this is valid at least for drivers/edac/pdn2_edac.c and drivers/watchdog/simatic-ipc-wdt.c, but not sure for drivers/mfd/lpc_ich.c and drivers/i2c/busses/i2c-i801. - The trial function above only caches the resource for devfn=0. This is not enough for drivers/mfd/lpc_ich.c. Another resource for devfn=(13,2) should be cached. It does not look good to hardcode these devfns and cache them always. It looks required to communicate devfn to cache from p2sb_bar() caller drivers to p2sb. How can we do it? - Does this work when suspend-resume happens? Comments on the questions will be appreciated.
On Thu, Nov 30, 2023 at 07:30:56AM +0000, Shinichiro Kawasaki wrote: > On Nov 29, 2023 / 15:53, Andy Shevchenko wrote: > > On Wed, Nov 29, 2023 at 03:50:21PM +0200, Andy Shevchenko wrote: > > > On Wed, Nov 29, 2023 at 12:17:39PM +0100, Lukas Wunner wrote: > > > > Your patch uses a list to store a multitude of struct resource. > > > > Is that actually necessary? I thought there can only be a single > > > > P2SB device in the system? > > Yes, the list might be too much. I was not sure what is the expected > number of P2SB resources to be cached. I found drivers/mfd/lpc_ich.c > calls p2sb_bar() at two places for devfn=0 and devfn=(13,2), so at > least two resources look required. Not sure about the future. > If two static resources are sufficient, the code will be simpler. About that p2sb_bar() call in lpc_ich.c for PCI_DEVFN(13, 2): It's in a switch-case statement for INTEL_SPI_BXT. BXT means Broxton, which is an Atom Goldmont based architecture. If you look at p2sb_cpu_ids[], you'll notice the P2SB is located at PCI_DEVFN(13, 0) on Goldmont. PCI functions with function number > 0 are not enumerable unless there is a PCI function with function number 0. So p2sb_bar() first unhides the P2SB at PCI_DEVFN(13, 0), then the SPI function at PCI_DEVFN(13, 2) becomes enumerable and p2sb_bar() retrieves the BAR address of that function. Unfortunately this is a little byzantine. For the caching approach I guess it means you can assume there is only a single P2SB device in the system but you need to cache not just the P2SB BAR, but also the BARs of functions 1 .. 7 of the P2SB device if the P2SB's function number is 0. I don't know if each of those upper functions only ever has a single BAR, but probably that's the case. > Lukas, thank you for the idea. If I understand the comment by Andy > correctly, P2SB should not be unhidden between arch_initcall and > fs_initcall. Hmm. > > This made me think: how about to unhide and hide P2SB just during > fs_initcall to cache the P2SB resources? To try it, I added a function > below on top of the previous trial patch. The added function calls > p2sb_bar() for devfn=0 at fs_initcall so that the resource is cached > before probe of i2c-i801. This worked > good on my system. It avoided the deadlock as well as the lockdep WARN :) This may work if i2c-i801 is compiled as a module, but may not work if it's builtin: It would try to access the cached P2SB BAR when it's not yet been cached. So you'd have to return -EPROBE_DEFER from p2sb_bar() if it hasn't been cached yet. And you'd have to make sure that all the callers can cope with that return value. Another approach would be to cache the BARs very early at boot in arch/x86/kernel/early-quirks.c. That would obviate the need to defer probing if the BAR hasn't been cached yet. Looking through past discussions archived in lore, I've found an important issue raised by Bjorn: "Apparently this normally hidden device is consuming PCI address space. The PCI core needs to know about this. If it doesn't, the PCI core may assign this space to another device." https://lore.kernel.org/all/20210308185212.GA1790506@bjorn-Precision-5520/ arch/x86/kernel/early-quirks.c already reserves "stolen" memory used by Intel GPUs with unified-memory architecture. It adjusts the e820 map to achieve that. I guess the same method could be used to reserve the memory used by P2SB (as well as "upper" functions if P2SB has function number 0). An early version of p2sb_bar() (which wasn't mainlined) duplicated __pci_read_base(). I suggested to instead unhide and temporarily enumerate the device, retrieve the BAR, then destroy the pci_dev and hide the P2SB again: https://lore.kernel.org/all/20220505145503.GA25423@wunner.de/ That resulted in a significant reduction in LoC. But if the BAR caching happens in arch/x86/kernel/early-quirks.c, it may be necessary to duplicate at least portions of __pci_read_base() again. Or maybe the code can be simplified for this specific use case, I don't know. Thanks, Lukas
On Thu, Nov 30, 2023 at 07:30:56AM +0000, Shinichiro Kawasaki wrote: > On Nov 29, 2023 / 15:53, Andy Shevchenko wrote: > > On Wed, Nov 29, 2023 at 03:50:21PM +0200, Andy Shevchenko wrote: > > > On Wed, Nov 29, 2023 at 12:17:39PM +0100, Lukas Wunner wrote: > > > > On Tue, Nov 28, 2023 at 07:45:06AM +0000, Shinichiro Kawasaki wrote: > > > > > On Nov 24, 2023 / 17:22, Andy Shevchenko wrote: ... > > > > > > Another possible solution I was thinking about is to have a local cache, > > > > > > so, make p2sb.c to be called just after PCI enumeration at boot time > > > > > > to cache the P2SB device's bar, and then cache the bar based on the device > > > > > > in question at the first call. Yet it may not remove all theoretical / > > > > > > possible scenarios with dead lock (taking into account hotpluggable > > > > > > devices), but won't fail the i801 driver in the above use case IIUC. > > > > > > > > > > Thanks for the idea. I created an experimental patch below (it does not guard > > > > > list nor free the list elements, so it is incomplete). I confirmed that this > > > > > patch avoids the deadlock. So your idea looks working. I still observe the > > > > > deadlock WARN, but it looks better than the hang by the deadlock. > > > > > > > > Your patch uses a list to store a multitude of struct resource. > > > > Is that actually necessary? I thought there can only be a single > > > > P2SB device in the system? > > Yes, the list might be too much. I was not sure what is the expected number of > P2SB resources to be cached. I found drivers/mfd/lpc_ich.c calls p2sb_bar() at > two places for devfn=0 and devfn=(13,2), so at least two resources look > required. Not sure about the future. If two static resources are sufficient, the > code will be simpler. PCI specification defines up to 8 functions. So, basically you need to cache 8. But note, each function may have up to 6 or more resources, we only now rely on bar 0 as it's hard coded IIRC. Theoretically we might require any bar from any function, but practically we have at most two right now. So, to follow KISS 8 should be enough. > > > > > Having said that, Heiner says in another mail that "A solution has to support > > > > > pci drivers using p2sb_bar() in probe()". This idea does not fulfill it. Hmm. > > > > > > > > Basically what you need to do is create two initcalls: > > > > > > > > Add one arch_initcall to unhide the P2SB device. > > > > > > > > The P2SB subsequently gets enumerated by the PCI core in a subsys_initcall. > > > > > > > > Then add an fs_initcall which extracts and stashes the struct resource, > > > > hides the P2SB device and destroys the corresponding pci_dev. > > > > > > > > Then you don't need to acquire any locks at runtime, just retrieve the > > > > stashed struct resource. > > > > > > > > This approach will result in the P2SB device briefly being enumerated > > > > and a driver could in theory bind to it. Andy, is this a problem? > > > > I'm not seeing any drivers in the tree which bind to 8086/c5c5. > > > > > > At least one problem just out of my head. The P2SB on many system is PCI > > > function 0. Unhiding the P2SB unhides all functions on that device, and > > > we have use cases for those (that's why we have two first parameters to > > > p2sb_bar() in case we want non-default device to be looked at). > > > > For the clarity this is true for ATOM_GOLDMONT (see p2sb_cpu_ids array). > > Lukas, thank you for the idea. If I understand the comment by Andy correctly, > P2SB should not be unhidden between arch_initcall and fs_initcall. Hmm. > > This made me think: how about to unhide and hide P2SB just during fs_initcall > to cache the P2SB resources? To try it, I added a function below on top of the > previous trial patch. The added function calls p2sb_bar() for devfn=0 at > fs_initcall so that the resource is cached before probe of i2c-i801. This worked > good on my system. It avoided the deadlock as well as the lockdep WARN :) > > static int __init p2sb_fs_init(void) > { > struct pci_bus *bus; > struct resource mem; > int ret = 0; > bus = pci_find_bus(0, 0); > if (bus) { This is inside p2sb_bar(), no need to repeat it. > ret = p2sb_bar(bus, 0, &mem); > if (ret) > pr_err("p2sb_bar failed: %d", ret); > } > return 0; > } > fs_initcall(p2sb_fs_init); > > The result of the trial is encouraging, but I'm not yet sure if this idea is > really feasible. I have three questions in my mind: > > - The trial function above assumed the P2SB device is at the PCI bus number=0 > and domain=0. It is ok on my system, but is it valid always? I see this is > valid at least for drivers/edac/pdn2_edac.c and > drivers/watchdog/simatic-ipc-wdt.c, but not sure for drivers/mfd/lpc_ich.c > and drivers/i2c/busses/i2c-i801. > > - The trial function above only caches the resource for devfn=0. This is not > enough for drivers/mfd/lpc_ich.c. Another resource for devfn=(13,2) should be > cached. It does not look good to hardcode these devfns and cache them always. > It looks required to communicate devfn to cache from p2sb_bar() caller drivers > to p2sb. How can we do it? > > - Does this work when suspend-resume happens? > > Comments on the questions will be appreciated. I would give a try with a cache for full hierarchy, basically it's either 1 or 8 resources. It would be quite weird to have devfn to be different 'in device" to P2SB itself. So, something like this. - unhide p2sb device - cache its bar 0 (with BDF, etc) - if F == 0, iterate over F == 1..7 and if there is a device, cache its bar 0 as well (as in previous entry) - make p2sb_bar() to be just a cache lookup call (mutex protected?) Note, P2SB is inside PCI South Bridge, it's unlikely we will see it in external / Thunderbolt / etc devices.
On Thu, Nov 30, 2023 at 10:36:01AM +0100, Lukas Wunner wrote: > On Thu, Nov 30, 2023 at 07:30:56AM +0000, Shinichiro Kawasaki wrote: > ... > > Lukas, thank you for the idea. If I understand the comment by Andy > > correctly, P2SB should not be unhidden between arch_initcall and > > fs_initcall. Hmm. > > > > This made me think: how about to unhide and hide P2SB just during > > fs_initcall to cache the P2SB resources? To try it, I added a function > > below on top of the previous trial patch. The added function calls > > p2sb_bar() for devfn=0 at fs_initcall so that the resource is cached > > before probe of i2c-i801. This worked > > good on my system. It avoided the deadlock as well as the lockdep WARN :) > > This may work if i2c-i801 is compiled as a module, but may not work > if it's builtin: It would try to access the cached P2SB BAR when > it's not yet been cached. So you'd have to return -EPROBE_DEFER > from p2sb_bar() if it hasn't been cached yet. And you'd have to > make sure that all the callers can cope with that return value. > > Another approach would be to cache the BARs very early at boot in > arch/x86/kernel/early-quirks.c. That would obviate the need to > defer probing if the BAR hasn't been cached yet. I hinted at something like this too: https://lore.kernel.org/all/20220107010305.GA334966@bhelgaas/ I think these hidden devices are an architectural nightmare. I doubt firmware folks have signed up to support the OS unhiding and using them. I think we're going out on a limb by assuming they're fair game for the OS to use. So I'm ... a bit hesitant to complicate the PCI core locking, enumeration, and BAR reading code to accommodate them. > Looking through past discussions archived in lore, I've found an > important issue raised by Bjorn: > > "Apparently this normally hidden device is consuming > PCI address space. The PCI core needs to know about this. If it > doesn't, the PCI core may assign this space to another device." > > https://lore.kernel.org/all/20210308185212.GA1790506@bjorn-Precision-5520/ I think that was covered because the address space is not included in the PNP0A03 _CRS, as mentioned in the v2 commit log, right? https://lore.kernel.org/all/20211221173945.53674-3-andriy.shevchenko@linux.intel.com/ Bjorn
On Nov 30, 2023 / 17:19, Andy Shevchenko wrote: > On Thu, Nov 30, 2023 at 07:30:56AM +0000, Shinichiro Kawasaki wrote: > > On Nov 29, 2023 / 15:53, Andy Shevchenko wrote: > > > On Wed, Nov 29, 2023 at 03:50:21PM +0200, Andy Shevchenko wrote: > > > > On Wed, Nov 29, 2023 at 12:17:39PM +0100, Lukas Wunner wrote: > > > > > On Tue, Nov 28, 2023 at 07:45:06AM +0000, Shinichiro Kawasaki wrote: > > > > > > On Nov 24, 2023 / 17:22, Andy Shevchenko wrote: > > ... > > > > > > > > Another possible solution I was thinking about is to have a local cache, > > > > > > > so, make p2sb.c to be called just after PCI enumeration at boot time > > > > > > > to cache the P2SB device's bar, and then cache the bar based on the device > > > > > > > in question at the first call. Yet it may not remove all theoretical / > > > > > > > possible scenarios with dead lock (taking into account hotpluggable > > > > > > > devices), but won't fail the i801 driver in the above use case IIUC. > > > > > > > > > > > > Thanks for the idea. I created an experimental patch below (it does not guard > > > > > > list nor free the list elements, so it is incomplete). I confirmed that this > > > > > > patch avoids the deadlock. So your idea looks working. I still observe the > > > > > > deadlock WARN, but it looks better than the hang by the deadlock. > > > > > > > > > > Your patch uses a list to store a multitude of struct resource. > > > > > Is that actually necessary? I thought there can only be a single > > > > > P2SB device in the system? > > > > Yes, the list might be too much. I was not sure what is the expected number of > > P2SB resources to be cached. I found drivers/mfd/lpc_ich.c calls p2sb_bar() at > > two places for devfn=0 and devfn=(13,2), so at least two resources look > > required. Not sure about the future. If two static resources are sufficient, the > > code will be simpler. > > PCI specification defines up to 8 functions. So, basically you need to cache 8. > But note, each function may have up to 6 or more resources, we only now rely on > bar 0 as it's hard coded IIRC. > > Theoretically we might require any bar from any function, but practically we have > at most two right now. So, to follow KISS 8 should be enough. Thanks. This sounds reasonable. I'll create a patch based on it. > > > > > > > Having said that, Heiner says in another mail that "A solution has to support > > > > > > pci drivers using p2sb_bar() in probe()". This idea does not fulfill it. Hmm. > > > > > > > > > > Basically what you need to do is create two initcalls: > > > > > > > > > > Add one arch_initcall to unhide the P2SB device. > > > > > > > > > > The P2SB subsequently gets enumerated by the PCI core in a subsys_initcall. > > > > > > > > > > Then add an fs_initcall which extracts and stashes the struct resource, > > > > > hides the P2SB device and destroys the corresponding pci_dev. > > > > > > > > > > Then you don't need to acquire any locks at runtime, just retrieve the > > > > > stashed struct resource. > > > > > > > > > > This approach will result in the P2SB device briefly being enumerated > > > > > and a driver could in theory bind to it. Andy, is this a problem? > > > > > I'm not seeing any drivers in the tree which bind to 8086/c5c5. > > > > > > > > At least one problem just out of my head. The P2SB on many system is PCI > > > > function 0. Unhiding the P2SB unhides all functions on that device, and > > > > we have use cases for those (that's why we have two first parameters to > > > > p2sb_bar() in case we want non-default device to be looked at). > > > > > > For the clarity this is true for ATOM_GOLDMONT (see p2sb_cpu_ids array). > > > > Lukas, thank you for the idea. If I understand the comment by Andy correctly, > > P2SB should not be unhidden between arch_initcall and fs_initcall. Hmm. > > > > This made me think: how about to unhide and hide P2SB just during fs_initcall > > to cache the P2SB resources? To try it, I added a function below on top of the > > previous trial patch. The added function calls p2sb_bar() for devfn=0 at > > fs_initcall so that the resource is cached before probe of i2c-i801. This worked > > good on my system. It avoided the deadlock as well as the lockdep WARN :) > > > > static int __init p2sb_fs_init(void) > > { > > struct pci_bus *bus; > > struct resource mem; > > int ret = 0; > > > bus = pci_find_bus(0, 0); > > if (bus) { > > This is inside p2sb_bar(), no need to repeat it. > > > ret = p2sb_bar(bus, 0, &mem); > > if (ret) > > pr_err("p2sb_bar failed: %d", ret); > > } > > return 0; > > } > > fs_initcall(p2sb_fs_init); > > > > The result of the trial is encouraging, but I'm not yet sure if this idea is > > really feasible. I have three questions in my mind: > > > > - The trial function above assumed the P2SB device is at the PCI bus number=0 > > and domain=0. It is ok on my system, but is it valid always? I see this is > > valid at least for drivers/edac/pdn2_edac.c and > > drivers/watchdog/simatic-ipc-wdt.c, but not sure for drivers/mfd/lpc_ich.c > > and drivers/i2c/busses/i2c-i801. > > > > - The trial function above only caches the resource for devfn=0. This is not > > enough for drivers/mfd/lpc_ich.c. Another resource for devfn=(13,2) should be > > cached. It does not look good to hardcode these devfns and cache them always. > > It looks required to communicate devfn to cache from p2sb_bar() caller drivers > > to p2sb. How can we do it? > > > > - Does this work when suspend-resume happens? > > > > Comments on the questions will be appreciated. > > I would give a try with a cache for full hierarchy, basically it's either 1 or > 8 resources. It would be quite weird to have devfn to be different 'in device" > to P2SB itself. > > So, something like this. > > - unhide p2sb device > - cache its bar 0 (with BDF, etc) > - if F == 0, iterate over F == 1..7 and if there is a device, cache its > bar 0 as well (as in previous entry) > - make p2sb_bar() to be just a cache lookup call (mutex protected?) > > Note, P2SB is inside PCI South Bridge, it's unlikely we will see it > in external / Thunderbolt / etc devices. Thanks again. I think I understand what you mean. I will post a patch soon. Please take a look and see if I misunderstand anything.
On Nov 30, 2023 / 10:36, Lukas Wunner wrote: > On Thu, Nov 30, 2023 at 07:30:56AM +0000, Shinichiro Kawasaki wrote: > > On Nov 29, 2023 / 15:53, Andy Shevchenko wrote: > > > On Wed, Nov 29, 2023 at 03:50:21PM +0200, Andy Shevchenko wrote: > > > > On Wed, Nov 29, 2023 at 12:17:39PM +0100, Lukas Wunner wrote: > > > > > Your patch uses a list to store a multitude of struct resource. > > > > > Is that actually necessary? I thought there can only be a single > > > > > P2SB device in the system? > > > > Yes, the list might be too much. I was not sure what is the expected > > number of P2SB resources to be cached. I found drivers/mfd/lpc_ich.c > > calls p2sb_bar() at two places for devfn=0 and devfn=(13,2), so at > > least two resources look required. Not sure about the future. > > If two static resources are sufficient, the code will be simpler. > > About that p2sb_bar() call in lpc_ich.c for PCI_DEVFN(13, 2): > > It's in a switch-case statement for INTEL_SPI_BXT. BXT means Broxton, > which is an Atom Goldmont based architecture. > > If you look at p2sb_cpu_ids[], you'll notice the P2SB is located at > PCI_DEVFN(13, 0) on Goldmont. > > PCI functions with function number > 0 are not enumerable unless there is > a PCI function with function number 0. > > So p2sb_bar() first unhides the P2SB at PCI_DEVFN(13, 0), then the > SPI function at PCI_DEVFN(13, 2) becomes enumerable and p2sb_bar() > retrieves the BAR address of that function. > > Unfortunately this is a little byzantine. Thanks for the explanation. This helped me to understand what Andy wrote in antoher e-mail. > > For the caching approach I guess it means you can assume there is only > a single P2SB device in the system but you need to cache not just the > P2SB BAR, but also the BARs of functions 1 .. 7 of the P2SB device > if the P2SB's function number is 0. I don't know if each of those > upper functions only ever has a single BAR, but probably that's the case. I see. I think this is consistent with Andy's explanation. > > > > Lukas, thank you for the idea. If I understand the comment by Andy > > correctly, P2SB should not be unhidden between arch_initcall and > > fs_initcall. Hmm. > > > > This made me think: how about to unhide and hide P2SB just during > > fs_initcall to cache the P2SB resources? To try it, I added a function > > below on top of the previous trial patch. The added function calls > > p2sb_bar() for devfn=0 at fs_initcall so that the resource is cached > > before probe of i2c-i801. This worked > > good on my system. It avoided the deadlock as well as the lockdep WARN :) > > This may work if i2c-i801 is compiled as a module, but may not work > if it's builtin: It would try to access the cached P2SB BAR when > it's not yet been cached. So you'd have to return -EPROBE_DEFER > from p2sb_bar() if it hasn't been cached yet. And you'd have to > make sure that all the callers can cope with that return value. I built i2c-i801 as a built-in module, and observed that my trial patch worked good. IIUC, i2c-i801 probe happens after fs_initcall for p2sb resource caching, even when i2c-i801 is built-in. > > Another approach would be to cache the BARs very early at boot in > arch/x86/kernel/early-quirks.c. That would obviate the need to > defer probing if the BAR hasn't been cached yet. > > Looking through past discussions archived in lore, I've found an > important issue raised by Bjorn: > > "Apparently this normally hidden device is consuming > PCI address space. The PCI core needs to know about this. If it > doesn't, the PCI core may assign this space to another device." > > https://lore.kernel.org/all/20210308185212.GA1790506@bjorn-Precision-5520/ > > arch/x86/kernel/early-quirks.c already reserves "stolen" memory used > by Intel GPUs with unified-memory architecture. It adjusts the e820 > map to achieve that. I guess the same method could be used to reserve > the memory used by P2SB (as well as "upper" functions if P2SB has > function number 0). > > An early version of p2sb_bar() (which wasn't mainlined) duplicated > __pci_read_base(). I suggested to instead unhide and temporarily > enumerate the device, retrieve the BAR, then destroy the pci_dev > and hide the P2SB again: > > https://lore.kernel.org/all/20220505145503.GA25423@wunner.de/ > > That resulted in a significant reduction in LoC. But if the BAR > caching happens in arch/x86/kernel/early-quirks.c, it may be > necessary to duplicate at least portions of __pci_read_base() again. > Or maybe the code can be simplified for this specific use case, > I don't know. Thanks. It's good to understand the past discussion (Wow, so big discussion was held for p2sb.c introduction...). As I noted above, p2sb resource cache at fs_initcall looks working good. I hope this fs_initcall approach is good enough since it will be smaller patch than the resource cache at x86 early-quirks.
diff --git a/drivers/i2c/busses/i2c-i801.c b/drivers/i2c/busses/i2c-i801.c index 070999139c6..00d57d4e006 100644 --- a/drivers/i2c/busses/i2c-i801.c +++ b/drivers/i2c/busses/i2c-i801.c @@ -1820,6 +1820,7 @@ static struct pci_driver i801_driver = { .pm = pm_sleep_ptr(&i801_pm_ops), .probe_type = PROBE_PREFER_ASYNCHRONOUS, }, + .local_probe = true, }; static int __init i2c_i801_init(struct pci_driver *drv) diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c index 51ec9e7e784..161ff37143a 100644 --- a/drivers/pci/pci-driver.c +++ b/drivers/pci/pci-driver.c @@ -368,7 +368,7 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev, * device is probed from work_on_cpu() of the Physical device. */ if (node < 0 || node >= MAX_NUMNODES || !node_online(node) || - pci_physfn_is_probed(dev)) { + pci_physfn_is_probed(dev) || drv->local_probe) { cpu = nr_cpu_ids; } else { cpumask_var_t wq_domain_mask; diff --git a/include/linux/pci.h b/include/linux/pci.h index 60ca768bc86..6fd086eb26c 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -957,6 +957,7 @@ struct pci_driver { struct device_driver driver; struct pci_dynids dynids; bool driver_managed_dma; + bool local_probe; }; static inline struct pci_driver *to_pci_driver(struct device_driver *drv)