Message ID | 20231201001911.656-1-thinhtr@linux.vnet.ibm.com (mailing list archive) |
---|---|
State | Accepted |
Commit | 16b55b1f2269962fb6b5154b8bf43f37c9a96637 |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [v4] net/tg3: fix race condition in tg3_reset_task() | expand |
On Thu, Nov 30, 2023 at 4:19 PM Thinh Tran <thinhtr@linux.vnet.ibm.com> wrote: > > When an EEH error is encountered by a PCI adapter, the EEH driver > modifies the PCI channel's state as shown below: > > enum { > /* I/O channel is in normal state */ > pci_channel_io_normal = (__force pci_channel_state_t) 1, > > /* I/O to channel is blocked */ > pci_channel_io_frozen = (__force pci_channel_state_t) 2, > > /* PCI card is dead */ > pci_channel_io_perm_failure = (__force pci_channel_state_t) 3, > }; > > If the same EEH error then causes the tg3 driver's transmit timeout > logic to execute, the tg3_tx_timeout() function schedules a reset > task via tg3_reset_task_schedule(), which may cause a race condition > between the tg3 and EEH driver as both attempt to recover the HW via > a reset action. > > EEH driver gets error event > --> eeh_set_channel_state() > and set device to one of > error state above scheduler: tg3_reset_task() get > returned error from tg3_init_hw() > --> dev_close() shuts down the interface > tg3_io_slot_reset() and > tg3_io_resume() fail to > reset/resume the device > > To resolve this issue, we avoid the race condition by checking the PCI > channel state in the tg3_reset_task() function and skip the tg3 driver > initiated reset when the PCI channel is not in the normal state. (The > driver has no access to tg3 device registers at this point and cannot > even complete the reset task successfully without external assistance.) > We'll leave the reset procedure to be managed by the EEH driver which > calls the tg3_io_error_detected(), tg3_io_slot_reset() and > tg3_io_resume() functions as appropriate. > > Adding the same checking in tg3_dump_state() to avoid dumping all > device registers when the PCI channel is not in the normal state. > > > Signed-off-by: Thinh Tran <thinhtr@linux.vnet.ibm.com> > Tested-by: Venkata Sai Duggi <venkata.sai.duggi@ibm.com> > Reviewed-by: David Christensen <drc@linux.vnet.ibm.com> > > v4: moving the PCI error checking to tg3_reset_task() and > tg3_dump_state() > v3: re-post the patch. > v2: checking PCI errors in tg3_tx_timeout() Thanks. Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Hello: This patch was applied to netdev/net.git (main) by Jakub Kicinski <kuba@kernel.org>: On Thu, 30 Nov 2023 18:19:11 -0600 you wrote: > When an EEH error is encountered by a PCI adapter, the EEH driver > modifies the PCI channel's state as shown below: > > enum { > /* I/O channel is in normal state */ > pci_channel_io_normal = (__force pci_channel_state_t) 1, > > [...] Here is the summary with links: - [v4] net/tg3: fix race condition in tg3_reset_task() https://git.kernel.org/netdev/net/c/16b55b1f2269 You are awesome, thank you!
diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c index 48b6191efa56..49f299c868a1 100644 --- a/drivers/net/ethernet/broadcom/tg3.c +++ b/drivers/net/ethernet/broadcom/tg3.c @@ -6474,6 +6474,14 @@ static void tg3_dump_state(struct tg3 *tp) int i; u32 *regs; + /* If it is a PCI error, all registers will be 0xffff, + * we don't dump them out, just report the error and return + */ + if (tp->pdev->error_state != pci_channel_io_normal) { + netdev_err(tp->dev, "PCI channel ERROR!\n"); + return; + } + regs = kzalloc(TG3_REG_BLK_SIZE, GFP_ATOMIC); if (!regs) return; @@ -11259,7 +11267,8 @@ static void tg3_reset_task(struct work_struct *work) rtnl_lock(); tg3_full_lock(tp, 0); - if (tp->pcierr_recovery || !netif_running(tp->dev)) { + if (tp->pcierr_recovery || !netif_running(tp->dev) || + tp->pdev->error_state != pci_channel_io_normal) { tg3_flag_clear(tp, RESET_TASK_PENDING); tg3_full_unlock(tp); rtnl_unlock();