diff mbox series

[RFC] net: stmmac: Fix the problem about interrupt storm

Message ID 20240327110142.159851-1-cathy.cai@unisoc.com (mailing list archive)
State RFC
Delegated to: Netdev Maintainers
Headers show
Series [RFC] net: stmmac: Fix the problem about interrupt storm | expand

Checks

Context Check Description
netdev/series_format warning Single patches do not need cover letters; Target tree name not specified in the subject
netdev/tree_selection success Guessed tree name to be net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 944 this patch: 944
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers success CCed 9 of 9 maintainers
netdev/build_clang success Errors and warnings before: 955 this patch: 955
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 955 this patch: 955
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 9 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Cathy Cai March 27, 2024, 11:01 a.m. UTC
After I do seven days of MSR test (monkey sleep reboot test) in Android,
I can encounter below netdev watchdog timeout issue. Tx queue timed out
then reset adapter. There is a probability that an interruption storm will
occur and the system will crash.

When we do MSR test, there is a NETDEV WATCHDOG WARNING:
[  117.885804] ------------[ cut here ]------------
[  117.885818] NETDEV WATCHDOG: eth0 (stmmaceth): transmit queue 0 timed
               out
[  117.885873] WARNING: CPU: 1 PID: 4169 at net/sched/sch_generic.c:473
               dev_watchdog+0x2fc/0x41c
[  117.886070]  sprd_systimer sprd_sip_svc sprd_wdt_fiq sprd_wdt_pon
[  117.886082] CPU: 1 PID: 4169 Comm: RenderThread Tainted: G S       C O
               5.4.147-ab41313 #1
[  117.886085] Hardware name: Spreadtrum UIS6780 SoC (DT)
[  117.886090] pstate: 60400005 (nZCv daif +PAN -UAO)
[  117.886094] pc : dev_watchdog+0x2fc/0x41c
[  117.886098] lr : dev_watchdog+0x2fc/0x41c
[  117.886100] sp : ffffffc01000bcf0
[  117.886103] x29: ffffffc01000bcf0 x28: ffffffc011eafe28
[  117.886107] x27: ffffff80f97a5c40 x26: 00000000ffffffff
[  117.886111] x25: 0000000000000001 x24: 0000000000000008
[  117.886114] x23: ffffffc011ea6000 x22: ffffffc011e73020
[  117.886118] x21: 0000000000000000 x20: ffffff80f434841c
[  117.886122] x19: ffffff80f4348000 x18: ffffffc01000d048
[  117.886127] x17: ffffffc012050044 x16: 00000000000508d0
[  117.886130] x15: 0000000000000006 x14: 0000000000000058
[  117.886134] x13: 0000000000000008 x12: 0000000042d7d11b
[  117.886138] x11: 0000000000000015 x10: 0000000000000001
[  117.886141] x9 : a6fe08b7d867fd00 x8 : a6fe08b7d867fd00
[  117.886145] x7 : 0000000000000000 x6 : ffffffc0120a0899
[  117.886149] x5 : 0000000000000058 x4 : 0000000000000002
[  117.886152] x3 : ffffffc01000b980 x2 : 0000000000000007
[  117.886156] x1 : 0000000000000006 x0 : 000000000000003d
[  117.886164]

[  117.887028]
[  117.887030] Call trace:
[  117.887035]  dev_watchdog+0x2fc/0x41c
[  117.887043]  call_timer_fn+0x5c/0x274
[  117.887046]  expire_timers+0x74/0x1b4
[  117.887050]  __run_timers+0x250/0x2b0
[  117.887054]  run_timer_softirq+0x28/0x4c
[  117.887061]  __do_softirq+0x128/0x4dc
[  117.887067]  irq_exit+0xf8/0xfc
[  117.887072]  __handle_domain_irq+0xb0/0x108
[  117.887076]  gic_handle_irq+0x6c/0x124
[  117.887081]  el0_irq_naked+0x64/0x74
[  117.887084] ---[ end trace 1308772835db89f6 ]---
[  117.887188] stmmaceth 32600000.ethernet eth0: Reset adapter.

Tx queue time out then reset adapter. When reset the adapter, stmmac driver
sets the state to STMMAC_DOWN and calls dev_close() function. If an interrupt
is triggered at this instant after setting state to STMMAC_DOWN, before the
dev_close() call.

The scene is as follows:
stmmac_reset_subtask()
	set_bit(STMMAC_DOWN, &priv->state);
					--->interrupt
					  stmmac_interrupt()
					    return IRQ_HANDLED
	dev_close(priv->dev);

The interrupt handler stmmac_interrupt is executed, judging that the state is
STMMAC_DOWN and returning IRQ_HANDLED. Then the processing will not continue,
and it will not be able to clear the interrupt status.

Therefore, to avoid this, set STMMAC_DOWN after dev_close().

Signed-off-by: Cathy Cai <cathy.cai@unisoc.com>
---
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Romain Gantois March 31, 2024, 8:35 a.m. UTC | #1
Hello Cathy,

On Wed, 27 Mar 2024, Cathy Cai wrote:

> Tx queue time out then reset adapter. When reset the adapter, stmmac driver
> sets the state to STMMAC_DOWN and calls dev_close() function. If an interrupt
> is triggered at this instant after setting state to STMMAC_DOWN, before the
> dev_close() call.
> 
...
> -	set_bit(STMMAC_DOWN, &priv->state);
>  	dev_close(priv->dev);
> +	set_bit(STMMAC_DOWN, &priv->state);
>  	dev_open(priv->dev, NULL);
>  	clear_bit(STMMAC_DOWN, &priv->state);
>  	clear_bit(STMMAC_RESETING, &priv->state);

If this IRQ issue can happen whenever STMMAC_DOWN is set while the net device is 
open, then it could also happen between the dev_open() and 
clear_bit(STMMAC_DOWN) calls right? So you'd have to clear STMMAC_DOWN before 
calling dev_open() but then I don't see the usefulness of setting STMMAC_DOWN 
and clearing it immediately. Maybe closing and opening the net device should be 
enough?

Moreover, it seems strange to me that stmmac_interrupt() unconditionnally 
ignores interrupts when the driver is in STMMAC_DOWN state. This seems like 
dangerous behaviour, since it could cause IRQ storm issues whenever something 
in the driver sets this state. I'm not too familiar with the interrupt handling 
in this driver, but maybe stmmac_interrupt() could clear interrupts 
unconditionnally in the STMMAC_DOWN state?

Best Regards,
cathy cai April 3, 2024, 2:01 a.m. UTC | #2
Hi Romain,

On Sun, Mar 31, 2024 at 4:35 PM Romain Gantois
<romain.gantois@bootlin.com> wrote:
>
> Hello Cathy,
>
> On Wed, 27 Mar 2024, Cathy Cai wrote:
>
> > Tx queue time out then reset adapter. When reset the adapter, stmmac driver
> > sets the state to STMMAC_DOWN and calls dev_close() function. If an interrupt
> > is triggered at this instant after setting state to STMMAC_DOWN, before the
> > dev_close() call.
> >
> ...
> > -     set_bit(STMMAC_DOWN, &priv->state);
> >       dev_close(priv->dev);
> > +     set_bit(STMMAC_DOWN, &priv->state);
> >       dev_open(priv->dev, NULL);
> >       clear_bit(STMMAC_DOWN, &priv->state);
> >       clear_bit(STMMAC_RESETING, &priv->state);
>
> If this IRQ issue can happen whenever STMMAC_DOWN is set while the net device is
> open, then it could also happen between the dev_open() and
> clear_bit(STMMAC_DOWN) calls right? So you'd have to clear STMMAC_DOWN before
> calling dev_open() but then I don't see the usefulness of setting STMMAC_DOWN
> and clearing it immediately. Maybe closing and opening the net device should be
> enough?
>
 Yes. It could also happen between the dev_open() and
clear_bit(STMMAC_DOWN) calls.
Although we did not reproduce this scenario, it should have happened
if we had increased
the number of test samples. In addition, I found that other people had
similar problems before.
The link is:
https://lore.kernel.org/all/20210208140820.10410-11-Sergey.Semin@baikalelectronics.ru/

>
> Moreover, it seems strange to me that stmmac_interrupt() unconditionnally
> ignores interrupts when the driver is in STMMAC_DOWN state. This seems like
> dangerous behaviour, since it could cause IRQ storm issues whenever something
> in the driver sets this state. I'm not too familiar with the interrupt handling
> in this driver, but maybe stmmac_interrupt() could clear interrupts
> unconditionnally in the STMMAC_DOWN state?
>
Clear interrupts unconditionally in the STMMAC_DOWN state directly
certainly won't cause this problem.
This may be too rough, maybe this design has other considerations.

>
> Best Regards,
>
> --
> Romain Gantois, Bootlin
> Embedded Linux and Kernel engineering
> https://bootlin.com

 Best Regards,
Cathy
diff mbox series

Patch

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 24cd80490d19..61690b68b6ad 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -7167,8 +7167,8 @@  static void stmmac_reset_subtask(struct stmmac_priv *priv)
 	while (test_and_set_bit(STMMAC_RESETING, &priv->state))
 		usleep_range(1000, 2000);
 
-	set_bit(STMMAC_DOWN, &priv->state);
 	dev_close(priv->dev);
+	set_bit(STMMAC_DOWN, &priv->state);
 	dev_open(priv->dev, NULL);
 	clear_bit(STMMAC_DOWN, &priv->state);
 	clear_bit(STMMAC_RESETING, &priv->state);