Message ID | 20240327110142.159851-1-cathy.cai@unisoc.com (mailing list archive) |
---|---|
State | RFC |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [RFC] net: stmmac: Fix the problem about interrupt storm | expand |
Hello Cathy, On Wed, 27 Mar 2024, Cathy Cai wrote: > Tx queue time out then reset adapter. When reset the adapter, stmmac driver > sets the state to STMMAC_DOWN and calls dev_close() function. If an interrupt > is triggered at this instant after setting state to STMMAC_DOWN, before the > dev_close() call. > ... > - set_bit(STMMAC_DOWN, &priv->state); > dev_close(priv->dev); > + set_bit(STMMAC_DOWN, &priv->state); > dev_open(priv->dev, NULL); > clear_bit(STMMAC_DOWN, &priv->state); > clear_bit(STMMAC_RESETING, &priv->state); If this IRQ issue can happen whenever STMMAC_DOWN is set while the net device is open, then it could also happen between the dev_open() and clear_bit(STMMAC_DOWN) calls right? So you'd have to clear STMMAC_DOWN before calling dev_open() but then I don't see the usefulness of setting STMMAC_DOWN and clearing it immediately. Maybe closing and opening the net device should be enough? Moreover, it seems strange to me that stmmac_interrupt() unconditionnally ignores interrupts when the driver is in STMMAC_DOWN state. This seems like dangerous behaviour, since it could cause IRQ storm issues whenever something in the driver sets this state. I'm not too familiar with the interrupt handling in this driver, but maybe stmmac_interrupt() could clear interrupts unconditionnally in the STMMAC_DOWN state? Best Regards,
Hi Romain, On Sun, Mar 31, 2024 at 4:35 PM Romain Gantois <romain.gantois@bootlin.com> wrote: > > Hello Cathy, > > On Wed, 27 Mar 2024, Cathy Cai wrote: > > > Tx queue time out then reset adapter. When reset the adapter, stmmac driver > > sets the state to STMMAC_DOWN and calls dev_close() function. If an interrupt > > is triggered at this instant after setting state to STMMAC_DOWN, before the > > dev_close() call. > > > ... > > - set_bit(STMMAC_DOWN, &priv->state); > > dev_close(priv->dev); > > + set_bit(STMMAC_DOWN, &priv->state); > > dev_open(priv->dev, NULL); > > clear_bit(STMMAC_DOWN, &priv->state); > > clear_bit(STMMAC_RESETING, &priv->state); > > If this IRQ issue can happen whenever STMMAC_DOWN is set while the net device is > open, then it could also happen between the dev_open() and > clear_bit(STMMAC_DOWN) calls right? So you'd have to clear STMMAC_DOWN before > calling dev_open() but then I don't see the usefulness of setting STMMAC_DOWN > and clearing it immediately. Maybe closing and opening the net device should be > enough? > Yes. It could also happen between the dev_open() and clear_bit(STMMAC_DOWN) calls. Although we did not reproduce this scenario, it should have happened if we had increased the number of test samples. In addition, I found that other people had similar problems before. The link is: https://lore.kernel.org/all/20210208140820.10410-11-Sergey.Semin@baikalelectronics.ru/ > > Moreover, it seems strange to me that stmmac_interrupt() unconditionnally > ignores interrupts when the driver is in STMMAC_DOWN state. This seems like > dangerous behaviour, since it could cause IRQ storm issues whenever something > in the driver sets this state. I'm not too familiar with the interrupt handling > in this driver, but maybe stmmac_interrupt() could clear interrupts > unconditionnally in the STMMAC_DOWN state? > Clear interrupts unconditionally in the STMMAC_DOWN state directly certainly won't cause this problem. This may be too rough, maybe this design has other considerations. > > Best Regards, > > -- > Romain Gantois, Bootlin > Embedded Linux and Kernel engineering > https://bootlin.com Best Regards, Cathy
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c index 24cd80490d19..61690b68b6ad 100644 --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c @@ -7167,8 +7167,8 @@ static void stmmac_reset_subtask(struct stmmac_priv *priv) while (test_and_set_bit(STMMAC_RESETING, &priv->state)) usleep_range(1000, 2000); - set_bit(STMMAC_DOWN, &priv->state); dev_close(priv->dev); + set_bit(STMMAC_DOWN, &priv->state); dev_open(priv->dev, NULL); clear_bit(STMMAC_DOWN, &priv->state); clear_bit(STMMAC_RESETING, &priv->state);
After I do seven days of MSR test (monkey sleep reboot test) in Android, I can encounter below netdev watchdog timeout issue. Tx queue timed out then reset adapter. There is a probability that an interruption storm will occur and the system will crash. When we do MSR test, there is a NETDEV WATCHDOG WARNING: [ 117.885804] ------------[ cut here ]------------ [ 117.885818] NETDEV WATCHDOG: eth0 (stmmaceth): transmit queue 0 timed out [ 117.885873] WARNING: CPU: 1 PID: 4169 at net/sched/sch_generic.c:473 dev_watchdog+0x2fc/0x41c [ 117.886070] sprd_systimer sprd_sip_svc sprd_wdt_fiq sprd_wdt_pon [ 117.886082] CPU: 1 PID: 4169 Comm: RenderThread Tainted: G S C O 5.4.147-ab41313 #1 [ 117.886085] Hardware name: Spreadtrum UIS6780 SoC (DT) [ 117.886090] pstate: 60400005 (nZCv daif +PAN -UAO) [ 117.886094] pc : dev_watchdog+0x2fc/0x41c [ 117.886098] lr : dev_watchdog+0x2fc/0x41c [ 117.886100] sp : ffffffc01000bcf0 [ 117.886103] x29: ffffffc01000bcf0 x28: ffffffc011eafe28 [ 117.886107] x27: ffffff80f97a5c40 x26: 00000000ffffffff [ 117.886111] x25: 0000000000000001 x24: 0000000000000008 [ 117.886114] x23: ffffffc011ea6000 x22: ffffffc011e73020 [ 117.886118] x21: 0000000000000000 x20: ffffff80f434841c [ 117.886122] x19: ffffff80f4348000 x18: ffffffc01000d048 [ 117.886127] x17: ffffffc012050044 x16: 00000000000508d0 [ 117.886130] x15: 0000000000000006 x14: 0000000000000058 [ 117.886134] x13: 0000000000000008 x12: 0000000042d7d11b [ 117.886138] x11: 0000000000000015 x10: 0000000000000001 [ 117.886141] x9 : a6fe08b7d867fd00 x8 : a6fe08b7d867fd00 [ 117.886145] x7 : 0000000000000000 x6 : ffffffc0120a0899 [ 117.886149] x5 : 0000000000000058 x4 : 0000000000000002 [ 117.886152] x3 : ffffffc01000b980 x2 : 0000000000000007 [ 117.886156] x1 : 0000000000000006 x0 : 000000000000003d [ 117.886164] [ 117.887028] [ 117.887030] Call trace: [ 117.887035] dev_watchdog+0x2fc/0x41c [ 117.887043] call_timer_fn+0x5c/0x274 [ 117.887046] expire_timers+0x74/0x1b4 [ 117.887050] __run_timers+0x250/0x2b0 [ 117.887054] run_timer_softirq+0x28/0x4c [ 117.887061] __do_softirq+0x128/0x4dc [ 117.887067] irq_exit+0xf8/0xfc [ 117.887072] __handle_domain_irq+0xb0/0x108 [ 117.887076] gic_handle_irq+0x6c/0x124 [ 117.887081] el0_irq_naked+0x64/0x74 [ 117.887084] ---[ end trace 1308772835db89f6 ]--- [ 117.887188] stmmaceth 32600000.ethernet eth0: Reset adapter. Tx queue time out then reset adapter. When reset the adapter, stmmac driver sets the state to STMMAC_DOWN and calls dev_close() function. If an interrupt is triggered at this instant after setting state to STMMAC_DOWN, before the dev_close() call. The scene is as follows: stmmac_reset_subtask() set_bit(STMMAC_DOWN, &priv->state); --->interrupt stmmac_interrupt() return IRQ_HANDLED dev_close(priv->dev); The interrupt handler stmmac_interrupt is executed, judging that the state is STMMAC_DOWN and returning IRQ_HANDLED. Then the processing will not continue, and it will not be able to clear the interrupt status. Therefore, to avoid this, set STMMAC_DOWN after dev_close(). Signed-off-by: Cathy Cai <cathy.cai@unisoc.com> --- drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)