mbox series

[0/2] Hotplug interrupt and AER recovery deadlocks

Message ID 20200615143250.438252-1-ian.may@canonical.com
Headers show
Series Hotplug interrupt and AER recovery deadlocks | expand

Message

Ian May June 15, 2020, 2:32 p.m. UTC
We've encountered two deadlocks when the hotplug interrupt and AER
recovery trigger simultaneously on the same device.  The first
deadlock is with:

       Hotplug                                AER
----------------------------       ---------------------------
device_lock(&dev->dev)             down_write(&ctrl->reset_lock)
down_read(&ctrl->reset_lock)       device_lock(&dev->dev)

//Hotplug
[  637.084657] Call Trace:
[  637.084669]  __schedule+0x2a8/0x670
[  637.084670]  schedule+0x33/0xa0
[  637.084672]  schedule_preempt_disabled+0xe/0x10
[  637.084673]  __mutex_lock.isra.9+0x26d/0x4e0
[  637.084674]  __mutex_lock_slowpath+0x13/0x20
[  637.084675]  ? __mutex_lock_slowpath+0x13/0x20
[  637.084675]  mutex_lock+0x2f/0x40                 <---mutex_lock(&dev->mutex);
[  637.084678]  __device_driver_lock+0x2b/0x50
[  637.084679]  device_release_driver_internal+0x1f/0x1b0
[  637.084680]  device_release_driver+0x12/0x20
[  637.084680]  bus_remove_device+0xe1/0x150
[  637.084682]  device_del+0x16a/0x3a0
[  637.084687]  pci_remove_bus_device+0x7e/0x100
[  637.084690]  pci_stop_and_remove_bus_device+0x1a/0x20
[  637.084691]  pciehp_unconfigure_device+0x80/0x130
[  637.084692]  pciehp_disable_slot+0x6e/0x120
[  637.084694]  pciehp_handle_presence_or_link_change+0x76/0x470
[  637.084696]  pciehp_ist+0x14a/0x1e0               <---down_read(&ctrl->reset_lock);

//AER
[  637.110208] Call Trace:
[  637.110209]  __schedule+0x2a8/0x670
[  637.110213]  ? sched_clock+0x9/0x10
[  637.110222]  schedule+0x33/0xa0
[  637.110230]  rwsem_down_write_slowpath+0x233/0x4a0
[  637.110237]  ? __schedule+0x2b0/0x670
[  637.110239]  down_write+0x3d/0x40
[  637.110241]  ? down_write+0x3d/0x40               <---down_write(&ctrl->reset_lock)
[  637.110245]  pciehp_reset_slot+0x58/0x150
[  637.110250]  pci_reset_hotplug_slot+0x43/0x70
[  637.110256]  pci_slot_reset+0x12f/0x170           <---mutex_lock(&dev->mutex);
[  637.110264]  pci_bus_error_reset+0xac/0xd0
[  637.110266]  pcie_do_recovery+0x255/0x2c0
[  637.110267]  aer_recover_work_func+0xd2/0x100
[  637.110270]  process_one_work+0x1fd/0x3f0
[  637.110279]  worker_thread+0x34/0x410
[  637.110287]  kthread+0x121/0x140
[  637.110288]  ? process_one_work+0x3f0/0x3f0
[  637.110289]  ? kthread_park+0xb0/0xb0
[  637.110290]  ret_from_fork+0x22/0x40

Applying the first patch then exposes the following 2nd deadlock:

           Hotplug                                AER
----------------------------       ---------------------------
down_read(&ctrl->reset_lock)       mutex_lock(&pci_slot_mutex)
mutex_lock(&pci_slot_mutex)        down_write(&ctrl->reset_lock)

//Hotplug
[  391.250990] INFO: task irq/42-pciehp:1960 blocked for more than 120 seconds.
[  391.259086]       Tainted: P           OE     5.3.0-51-generic #44~18.04.2
[  391.266982] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  391.275977] irq/42-pciehp   D    0  1960      2 0x80004000
[  391.275979] Call Trace:
[  391.275986]  __schedule+0x2b9/0x6c0
[  391.275987]  ? __switch_to_asm+0x34/0x70
[  391.275989]  schedule+0x42/0xb0
[  391.275992]  schedule_preempt_disabled+0xe/0x10
[  391.275997]  __mutex_lock.isra.0+0x182/0x4f0
[  391.276003]  ? raw_pci_read+0x29/0x40
[  391.276005]  ? pci_read+0x2c/0x30
[  391.276010]  ? irq_finalize_oneshot.part.0+0xf0/0xf0
[  391.276015]  __mutex_lock_slowpath+0x13/0x20
[  391.276017]  mutex_lock+0x2e/0x40                    <---pci_slot_mutex
[  391.276021]  pci_dev_assign_slot+0x19/0x5f
[  391.276025]  pci_setup_device+0xc1/0x150
[  391.276029]  ? online_show+0x20/0x70
[  391.276034]  ? irq_finalize_oneshot.part.0+0xf0/0xf0
[  391.276035]  pci_scan_single_device+0x95/0xd0
[  391.276041]  pci_scan_slot+0x4a/0x110
[  391.276042]  pciehp_configure_device+0x40/0x120
[  391.276044]  pciehp_handle_presence_or_link_change+0x184/0x290
[  391.276049]  pciehp_ist+0x141/0x150                  <---ctrl->reset_lock
[  391.276055]  irq_thread_fn+0x28/0x60
[  391.276056]  irq_thread+0xda/0x170
[  391.276057]  ? irq_forced_thread_fn+0x80/0x80
[  391.276059]  kthread+0x104/0x140
[  391.276060]  ? irq_thread_check_affinity+0xf0/0xf0
[  391.276061]  ? kthread_park+0x80/0x80
[  391.276062]  ret_from_fork+0x22/0x40

//AER
[  391.225984] INFO: task kworker/39:1:1593 blocked for more than 120 seconds.
[  391.233989]       Tainted: P           OE     5.3.0-51-generic #44~18.04.2
[  391.241887] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  391.250882] kworker/39:1    D    0  1593      2 0x80004000
[  391.250890] Workqueue: events aer_recover_work_func
[  391.250893] Call Trace:
[  391.250899]  __schedule+0x2b9/0x6c0
[  391.250902]  ? sched_clock+0x9/0x10
[  391.250903]  schedule+0x42/0xb0
[  391.250906]  rwsem_down_write_slowpath+0x23a/0x4c0
[  391.250908]  down_write+0x41/0x50
[  391.250909]  pciehp_reset_lock+0x12/0x20
[  391.250911]  pci_slot_reset+0xa1/0x170              <---ctrl->reset_lock
[  391.250912]  pci_bus_error_reset+0xfb/0x130         <---pci_slot_mutex
[  391.250914]  pcie_do_recovery+0x238/0x264
[  391.250917]  aer_recover_work_func.cold+0x9a/0xa0
[  391.250919]  process_one_work+0x1da/0x390
[  391.250920]  worker_thread+0x4d/0x400
[  391.250923]  kthread+0x104/0x140
[  391.250924]  ? process_one_work+0x390/0x390
[  391.250925]  ? kthread_park+0x80/0x80
[  391.250927]  ret_from_fork+0x22/0x40

The 2nd patch is unlocking and locking the pci_slot mutex in the AER path around
the controller reset call.  This allows for both the Hotplug and AER execution 
to complete.

Ian May (2):
  PCIe hotplug interrupt and AER deadlock with reset_lock and
    device_lock
  PCIe hotplug interrupt and AER pci_slot_mutex deadlock

 drivers/pci/hotplug/pciehp.h      |  2 ++
 drivers/pci/hotplug/pciehp_core.c |  2 ++
 drivers/pci/hotplug/pciehp_hpc.c  | 21 ++++++++++++++++++---
 drivers/pci/pci.c                 |  8 ++++++++
 include/linux/pci_hotplug.h       |  2 ++
 5 files changed, 32 insertions(+), 3 deletions(-)