Message ID | 20240321063954.18711-1-brett.creeley@amd.com (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Netdev Maintainers |
Series | [net] pds_core: Fix pdsc_check_pci_health function to print warning |
On Wed, Mar 20, 2024 at 11:39:54PM -0700, Brett Creeley wrote:
> When the driver notices fw_status == 0xff it tries to perform a PCI
> reset on itself via pci_reset_function() in the context of the driver's
> health thread. However, pdsc_reset_prepare calls
> pdsc_stop_health_thread(), which attempts to stop/flush the health
> thread. This results in a deadlock because the stop/flush will never
> complete since the driver called pci_reset_function() from the health
> thread context. Fix this by changing the pdsc_check_pci_health_function()
> to print a dev_warn() once every fw_down/fw_up cycle and requiring the
> user to perform a reset on the device via sysfs's reset interface,
> reloading the driver, rebinding the device, etc.
>
> Unloading the driver in the fw_down/dead state uncovered another issue,
> which can be seen in the following trace:
>
> WARNING: CPU: 51 PID: 6914 at kernel/workqueue.c:1450 __queue_work+0x358/0x440
> [...]
> RIP: 0010:__queue_work+0x358/0x440
> [...]
> Call Trace:
>  <TASK>
>  ? __warn+0x85/0x140
>  ? __queue_work+0x358/0x440
>  ? report_bug+0xfc/0x1e0
>  ? handle_bug+0x3f/0x70
>  ? exc_invalid_op+0x17/0x70
>  ? asm_exc_invalid_op+0x1a/0x20
>  ? __queue_work+0x358/0x440
>  queue_work_on+0x28/0x30
>  pdsc_devcmd_locked+0x96/0xe0 [pds_core]
>  pdsc_devcmd_reset+0x71/0xb0 [pds_core]
>  pdsc_teardown+0x51/0xe0 [pds_core]
>  pdsc_remove+0x106/0x200 [pds_core]
>  pci_device_remove+0x37/0xc0
>  device_release_driver_internal+0xae/0x140
>  driver_detach+0x48/0x90
>  bus_remove_driver+0x6d/0xf0
>  pci_unregister_driver+0x2e/0xa0
>  pdsc_cleanup_module+0x10/0x780 [pds_core]
>  __x64_sys_delete_module+0x142/0x2b0
>  ? syscall_trace_enter.isra.18+0x126/0x1a0
>  do_syscall_64+0x3b/0x90
>  entry_SYSCALL_64_after_hwframe+0x72/0xdc
> RIP: 0033:0x7fbd9d03a14b
> [...]
>
> Fix this by preventing the devcmd reset if the FW is not running.
>
> Fixes: d9407ff11809 ("pds_core: Prevent health thread from running during reset/remove")
> Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
> Signed-off-by: Brett Creeley <brett.creeley@amd.com>

Reviewed-by: Simon Horman <horms@kernel.org>
On Wed, 20 Mar 2024 23:39:54 -0700 Brett Creeley wrote:
> When the driver notices fw_status == 0xff it tries to perform a PCI
> reset on itself via pci_reset_function() in the context of the driver's
> health thread. However, pdsc_reset_prepare calls
> pdsc_stop_health_thread(), which attempts to stop/flush the health
> thread. This results in a deadlock because the stop/flush will never
> complete since the driver called pci_reset_function() from the health
> thread context. Fix this by changing the pdsc_check_pci_health_function()
> to print a dev_warn() once every fw_down/fw_up cycle and requiring the
> user to perform a reset on the device via sysfs's reset interface,
> reloading the driver, rebinding the device, etc.

Dunno, to call PCI reset you don't need much device context.
Perhaps you could allocate a work entry, throw it onto a per-driver
workqueue, and return. Basically some minimal viable way to
"asynchronously" call pci_reset_function()?
You can take a ref on the device so it doesn't disappear.
And flush that queue on module unload.
On 3/22/2024 6:02 PM, Jakub Kicinski wrote:
> On Wed, 20 Mar 2024 23:39:54 -0700 Brett Creeley wrote:
>> When the driver notices fw_status == 0xff it tries to perform a PCI
>> reset on itself via pci_reset_function() in the context of the driver's
>> health thread. However, pdsc_reset_prepare calls
>> pdsc_stop_health_thread(), which attempts to stop/flush the health
>> thread. This results in a deadlock because the stop/flush will never
>> complete since the driver called pci_reset_function() from the health
>> thread context. Fix this by changing the pdsc_check_pci_health_function()
>> to print a dev_warn() once every fw_down/fw_up cycle and requiring the
>> user to perform a reset on the device via sysfs's reset interface,
>> reloading the driver, rebinding the device, etc.
>
> Dunno, to call PCI reset you don't need much device context.
> Perhaps you could allocate a work entry, throw it onto a per-driver
> workqueue, and return. Basically some minimal viable way to
> "asynchronously" call pci_reset_function()?
> You can take a ref on the device so it doesn't disappear.
> And flush that queue on module unload.

Hi Jakub,

Yeah, this is better than my proposed solution. Now that I'm back from
vacation I will work on a v2.

Thanks for the review,

Brett
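The asynchronous reset discussed above can be modelled in userspace. The following is only a sketch with POSIX threads, not kernel code and not the eventual v2: `struct fake_dev`, `run_async_reset_demo()` and the status constants are hypothetical names. The worker thread stands in for a separate workqueue, the `refcount` field for `pci_dev_get()`/`pci_dev_put()`, and `worker_exit` for flushing the queue on module unload. The point it illustrates is that the reset path may wait for the health thread only because it no longer runs on the health thread.

```c
#include <pthread.h>
#include <stdbool.h>
#include <unistd.h>

enum { FW_RUNNING = 0x01, FW_BAD_PCI = 0xff };	/* hypothetical values */

struct fake_dev {				/* hypothetical stand-in for struct pdsc */
	pthread_mutex_t lock;
	pthread_cond_t cond;
	pthread_t health, worker;
	bool reset_pending, worker_exit, stop_health;
	int fw_status, refcount, resets_done;
};

/* Health check: on a bad status, queue the reset and return; never reset here. */
static void *health_thread(void *arg)
{
	struct fake_dev *d = arg;

	pthread_mutex_lock(&d->lock);
	while (!d->stop_health) {
		if (d->fw_status == FW_BAD_PCI && !d->reset_pending) {
			d->refcount++;			/* keep the device alive for the worker */
			d->reset_pending = true;	/* models queueing the work item */
			pthread_cond_broadcast(&d->cond);
		}
		pthread_mutex_unlock(&d->lock);
		usleep(1000);
		pthread_mutex_lock(&d->lock);
	}
	pthread_mutex_unlock(&d->lock);
	return NULL;
}

/* Separate context: it can stop and wait for the health thread safely. */
static void *reset_worker(void *arg)
{
	struct fake_dev *d = arg;

	pthread_mutex_lock(&d->lock);
	for (;;) {
		while (!d->reset_pending && !d->worker_exit)
			pthread_cond_wait(&d->cond, &d->lock);
		if (!d->reset_pending)
			break;			/* exit requested, nothing left queued */
		d->stop_health = true;		/* models the reset-prepare step */
		pthread_mutex_unlock(&d->lock);
		pthread_join(d->health, NULL);	/* would never finish on the health thread */
		pthread_mutex_lock(&d->lock);
		d->fw_status = FW_RUNNING;	/* pretend the reset revived the device */
		d->resets_done++;
		d->refcount--;			/* drop the reference taken at queue time */
		d->reset_pending = false;
		pthread_cond_broadcast(&d->cond);
	}
	pthread_mutex_unlock(&d->lock);
	return NULL;
}

/* Returns 0 when exactly one reset ran and the reference count is balanced. */
int run_async_reset_demo(void)
{
	struct fake_dev d = { .fw_status = FW_BAD_PCI, .refcount = 1 };

	pthread_mutex_init(&d.lock, NULL);
	pthread_cond_init(&d.cond, NULL);
	pthread_create(&d.worker, NULL, reset_worker, &d);

	pthread_mutex_lock(&d.lock);	/* health thread cannot queue until we wait */
	pthread_create(&d.health, NULL, health_thread, &d);
	while (d.resets_done == 0)
		pthread_cond_wait(&d.cond, &d.lock);
	d.worker_exit = true;		/* models flushing the queue on unload */
	pthread_cond_broadcast(&d.cond);
	pthread_mutex_unlock(&d.lock);
	pthread_join(d.worker, NULL);

	return (d.resets_done == 1 && d.refcount == 1 &&
		d.fw_status == FW_RUNNING) ? 0 : 1;
}
```

Calling `run_async_reset_demo()` from a `main()` and checking for a zero return exercises the whole sequence. A real driver would also restart the health thread once the reset completes, which this model leaves out.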
diff --git a/drivers/net/ethernet/amd/pds_core/core.c b/drivers/net/ethernet/amd/pds_core/core.c
index 9662ee72814c..8e5e3797cf0c 100644
--- a/drivers/net/ethernet/amd/pds_core/core.c
+++ b/drivers/net/ethernet/amd/pds_core/core.c
@@ -587,6 +587,9 @@ void pdsc_fw_up(struct pdsc *pdsc)
 			      DEVLINK_HEALTH_REPORTER_STATE_HEALTHY);
 	pdsc_notify(PDS_EVENT_RESET, &reset_event);
 
+	/* Allow for fw_status == 0xff to print another warning */
+	pdsc->bad_pci_warned = false;
+
 	return;
 
 err_out:
@@ -607,7 +610,11 @@ static void pdsc_check_pci_health(struct pdsc *pdsc)
 	if (fw_status != PDS_RC_BAD_PCI)
 		return;
 
-	pci_reset_function(pdsc->pdev);
+	if (!pdsc->bad_pci_warned) {
+		dev_warn(pdsc->dev, "fw not reachable due to failed PCI connection, fw_status = 0x%x\n",
+			 fw_status);
+		pdsc->bad_pci_warned = true;
+	}
 }
 
 void pdsc_health_thread(struct work_struct *work)
diff --git a/drivers/net/ethernet/amd/pds_core/core.h b/drivers/net/ethernet/amd/pds_core/core.h
index 92d7657dd614..10979118be00 100644
--- a/drivers/net/ethernet/amd/pds_core/core.h
+++ b/drivers/net/ethernet/amd/pds_core/core.h
@@ -165,6 +165,7 @@ struct pdsc {
 	unsigned long state;
 	u8 fw_status;
 	u8 fw_generation;
+	bool bad_pci_warned;
 	unsigned long last_fw_time;
 	u32 last_hb;
 	struct timer_list wdtimer;
diff --git a/drivers/net/ethernet/amd/pds_core/dev.c b/drivers/net/ethernet/amd/pds_core/dev.c
index e494e1298dc9..495ef4ef8c10 100644
--- a/drivers/net/ethernet/amd/pds_core/dev.c
+++ b/drivers/net/ethernet/amd/pds_core/dev.c
@@ -229,6 +229,9 @@ int pdsc_devcmd_reset(struct pdsc *pdsc)
 		.reset.opcode = PDS_CORE_CMD_RESET,
 	};
 
+	if (!pdsc_is_fw_running(pdsc))
+		return 0;
+
 	return pdsc_devcmd(pdsc, &cmd, &comp, pdsc->devcmd_timeout);
 }