diff mbox series

[RFC] ath10k: change to do napi_enable and napi_disable when insmod and rmmod for sdio

Message ID 20200214035555.24762-1-wgong@codeaurora.org (mailing list archive)
State New, archived
Headers show
Series [RFC] ath10k: change to do napi_enable and napi_disable when insmod and rmmod for sdio | expand

Commit Message

Wen Gong Feb. 14, 2020, 3:55 a.m. UTC
It happened "Kernel panic - not syncing: hung_task: blocked tasks" when
test simulate crash and ifconfig down/rmmod meanwhile.

Test steps:

1.Test commands
echo soft > /sys/kernel/debug/ieee80211/phy0/ath10k/simulate_fw_crash;sleep 0.05;ifconfig wlan0 down
echo soft > /sys/kernel/debug/ieee80211/phy0/ath10k/simulate_fw_crash;rmmod ath10k_sdio

2. dmesg:
[ 5622.548630] ath10k_sdio mmc1:0001:1: simulating soft firmware crash
[ 5622.655995] ieee80211 phy0: Hardware restart was requested
[ 5776.355164] INFO: task shill:1572 blocked for more than 122 seconds.
[ 5776.355687] INFO: task kworker/1:2:24437 blocked for more than 122 seconds.
[ 5776.359812] Kernel panic - not syncing: hung_task: blocked tasks
[ 5776.359836] CPU: 1 PID: 55 Comm: khungtaskd Tainted: G        W         4.19.86 #137
[ 5776.359846] Hardware name: MediaTek krane sku176 board (DT)
[ 5776.359855] Call trace:
[ 5776.359868]  dump_backtrace+0x0/0x170
[ 5776.359881]  show_stack+0x20/0x2c
[ 5776.359896]  dump_stack+0xd4/0x10c
[ 5776.359916]  panic+0x12c/0x29c
[ 5776.359937]  hung_task_panic+0x0/0x50
[ 5776.359953]  kthread+0x120/0x130
[ 5776.359965]  ret_from_fork+0x10/0x18
[ 5776.359986] SMP: stopping secondary CPUs
[ 5776.360012] Kernel Offset: 0x141ea00000 from 0xffffff8008000000
[ 5776.360026] CPU features: 0x0,2188200c
[ 5776.360035] Memory Limit: none

command "ifconfig wlan0 down" or "rmmod ath10k_sdio" will be blocked
callstack of ifconfig:
[<0>] __switch_to+0x120/0x13c
[<0>] msleep+0x28/0x38
[<0>] ath10k_sdio_hif_stop+0x24c/0x294 [ath10k_sdio]
[<0>] ath10k_core_stop+0x50/0x78 [ath10k_core]
[<0>] ath10k_halt+0x120/0x178 [ath10k_core]
[<0>] ath10k_stop+0x4c/0x8c [ath10k_core]
[<0>] drv_stop+0xe0/0x1e4 [mac80211]
[<0>] ieee80211_stop_device+0x48/0x54 [mac80211]
[<0>] ieee80211_do_stop+0x678/0x6f8 [mac80211]
[<0>] ieee80211_stop+0x20/0x30 [mac80211]
[<0>] __dev_close_many+0xb8/0x11c
[<0>] __dev_change_flags+0xe0/0x1d0
[<0>] dev_change_flags+0x30/0x6c
[<0>] devinet_ioctl+0x370/0x564
[<0>] inet_ioctl+0xdc/0x304
[<0>] sock_do_ioctl+0x50/0x288
[<0>] compat_sock_ioctl+0x1b4/0x1aac
[<0>] __se_compat_sys_ioctl+0x100/0x26fc
[<0>] __arm64_compat_sys_ioctl+0x20/0x2c
[<0>] el0_svc_common+0xa4/0x154
[<0>] el0_svc_compat_handler+0x2c/0x38
[<0>] el0_svc_compat+0x8/0x18
[<0>] 0xffffffffffffffff

callstack of rmmod:
[<0>] __switch_to+0x120/0x13c
[<0>] msleep+0x28/0x38
[<0>] ath10k_sdio_hif_stop+0x294/0x31c [ath10k_sdio]
[<0>] ath10k_core_stop+0x50/0x78 [ath10k_core]
[<0>] ath10k_halt+0x120/0x178 [ath10k_core]
[<0>] ath10k_stop+0x4c/0x8c [ath10k_core]
[<0>] drv_stop+0xe0/0x1e4 [mac80211]
[<0>] ieee80211_stop_device+0x48/0x54 [mac80211]
[<0>] ieee80211_do_stop+0x678/0x6f8 [mac80211]
[<0>] ieee80211_stop+0x20/0x30 [mac80211]
[<0>] __dev_close_many+0xb8/0x11c
[<0>] dev_close_many+0x70/0x100
[<0>] dev_close+0x4c/0x80
[<0>] cfg80211_shutdown_all_interfaces+0x50/0xcc [cfg80211]
[<0>] ieee80211_remove_interfaces+0x58/0x1a0 [mac80211]
[<0>] ieee80211_unregister_hw+0x40/0x100 [mac80211]
[<0>] ath10k_mac_unregister+0x1c/0x44 [ath10k_core]
[<0>] ath10k_core_unregister+0x38/0x7c [ath10k_core]
[<0>] ath10k_sdio_remove+0x8c/0xd0 [ath10k_sdio]
[<0>] sdio_bus_remove+0x48/0x108
[<0>] device_release_driver_internal+0x138/0x1ec
[<0>] driver_detach+0x6c/0xa8
[<0>] bus_remove_driver+0x78/0xa8
[<0>] driver_unregister+0x30/0x50
[<0>] sdio_unregister_driver+0x28/0x34
[<0>] cleanup_module+0x14/0x6bc [ath10k_sdio]
[<0>] __arm64_sys_delete_module+0x1e0/0x22c
[<0>] el0_svc_common+0xa4/0x154
[<0>] el0_svc_compat_handler+0x2c/0x38
[<0>] el0_svc_compat+0x8/0x18
[<0>] 0xffffffffffffffff

The test command run simulate_fw_crash firstly and it call into
ath10k_sdio_hif_stop from ath10k_core_restart, then napi_disable
is called and bit NAPI_STATE_SCHED is set. After that, function
ath10k_sdio_hif_stop is called again from ath10k_stop by command
"ifconfig wlan0 down" or "rmmod ath10k_sdio", then command blocked.

It is blocked by napi_synchronize, napi_disable will set bit with
NAPI_STATE_SCHED, and then napi_synchronize will enter dead loop
becuase bit NAPI_STATE_SCHED is set by napi_disable.

function of napi_synchronize
static inline void napi_synchronize(const struct napi_struct *n)
{
	if (IS_ENABLED(CONFIG_SMP))
		while (test_bit(NAPI_STATE_SCHED, &n->state))
			msleep(1);
	else
		barrier();
}

function of napi_disable
void napi_disable(struct napi_struct *n)
{
	might_sleep();
	set_bit(NAPI_STATE_DISABLE, &n->state);

	while (test_and_set_bit(NAPI_STATE_SCHED, &n->state))
		msleep(1);
	while (test_and_set_bit(NAPI_STATE_NPSVC, &n->state))
		msleep(1);

	hrtimer_cancel(&n->timer);

	clear_bit(NAPI_STATE_DISABLE, &n->state);
}

Tested with QCA6174 SDIO with firmware WLAN.RMH.4.4.1-00042.

Signed-off-by: Wen Gong <wgong@codeaurora.org>
---
 drivers/net/wireless/ath/ath10k/sdio.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

Comments

Kalle Valo Aug. 20, 2020, 8:32 a.m. UTC | #1
Wen Gong <wgong@codeaurora.org> writes:

> It happened "Kernel panic - not syncing: hung_task: blocked tasks" when
> test simulate crash and ifconfig down/rmmod meanwhile.
>
> Test steps:
>
> 1.Test commands
> echo soft > /sys/kernel/debug/ieee80211/phy0/ath10k/simulate_fw_crash;sleep 0.05;ifconfig wlan0 down
> echo soft > /sys/kernel/debug/ieee80211/phy0/ath10k/simulate_fw_crash;rmmod ath10k_sdio
>
> 2. dmesg:
> [ 5622.548630] ath10k_sdio mmc1:0001:1: simulating soft firmware crash
> [ 5622.655995] ieee80211 phy0: Hardware restart was requested
> [ 5776.355164] INFO: task shill:1572 blocked for more than 122 seconds.
> [ 5776.355687] INFO: task kworker/1:2:24437 blocked for more than 122 seconds.
> [ 5776.359812] Kernel panic - not syncing: hung_task: blocked tasks
> [ 5776.359836] CPU: 1 PID: 55 Comm: khungtaskd Tainted: G        W         4.19.86 #137
> [ 5776.359846] Hardware name: MediaTek krane sku176 board (DT)
> [ 5776.359855] Call trace:
> [ 5776.359868]  dump_backtrace+0x0/0x170
> [ 5776.359881]  show_stack+0x20/0x2c
> [ 5776.359896]  dump_stack+0xd4/0x10c
> [ 5776.359916]  panic+0x12c/0x29c
> [ 5776.359937]  hung_task_panic+0x0/0x50
> [ 5776.359953]  kthread+0x120/0x130
> [ 5776.359965]  ret_from_fork+0x10/0x18
> [ 5776.359986] SMP: stopping secondary CPUs
> [ 5776.360012] Kernel Offset: 0x141ea00000 from 0xffffff8008000000
> [ 5776.360026] CPU features: 0x0,2188200c
> [ 5776.360035] Memory Limit: none
>
> command "ifconfig wlan0 down" or "rmmod ath10k_sdio" will be blocked
> callstack of ifconfig:
> [<0>] __switch_to+0x120/0x13c
> [<0>] msleep+0x28/0x38
> [<0>] ath10k_sdio_hif_stop+0x24c/0x294 [ath10k_sdio]
> [<0>] ath10k_core_stop+0x50/0x78 [ath10k_core]
> [<0>] ath10k_halt+0x120/0x178 [ath10k_core]
> [<0>] ath10k_stop+0x4c/0x8c [ath10k_core]
> [<0>] drv_stop+0xe0/0x1e4 [mac80211]
> [<0>] ieee80211_stop_device+0x48/0x54 [mac80211]
> [<0>] ieee80211_do_stop+0x678/0x6f8 [mac80211]
> [<0>] ieee80211_stop+0x20/0x30 [mac80211]
> [<0>] __dev_close_many+0xb8/0x11c
> [<0>] __dev_change_flags+0xe0/0x1d0
> [<0>] dev_change_flags+0x30/0x6c
> [<0>] devinet_ioctl+0x370/0x564
> [<0>] inet_ioctl+0xdc/0x304
> [<0>] sock_do_ioctl+0x50/0x288
> [<0>] compat_sock_ioctl+0x1b4/0x1aac
> [<0>] __se_compat_sys_ioctl+0x100/0x26fc
> [<0>] __arm64_compat_sys_ioctl+0x20/0x2c
> [<0>] el0_svc_common+0xa4/0x154
> [<0>] el0_svc_compat_handler+0x2c/0x38
> [<0>] el0_svc_compat+0x8/0x18
> [<0>] 0xffffffffffffffff
>
> callstack of rmmod:
> [<0>] __switch_to+0x120/0x13c
> [<0>] msleep+0x28/0x38
> [<0>] ath10k_sdio_hif_stop+0x294/0x31c [ath10k_sdio]
> [<0>] ath10k_core_stop+0x50/0x78 [ath10k_core]
> [<0>] ath10k_halt+0x120/0x178 [ath10k_core]
> [<0>] ath10k_stop+0x4c/0x8c [ath10k_core]
> [<0>] drv_stop+0xe0/0x1e4 [mac80211]
> [<0>] ieee80211_stop_device+0x48/0x54 [mac80211]
> [<0>] ieee80211_do_stop+0x678/0x6f8 [mac80211]
> [<0>] ieee80211_stop+0x20/0x30 [mac80211]
> [<0>] __dev_close_many+0xb8/0x11c
> [<0>] dev_close_many+0x70/0x100
> [<0>] dev_close+0x4c/0x80
> [<0>] cfg80211_shutdown_all_interfaces+0x50/0xcc [cfg80211]
> [<0>] ieee80211_remove_interfaces+0x58/0x1a0 [mac80211]
> [<0>] ieee80211_unregister_hw+0x40/0x100 [mac80211]
> [<0>] ath10k_mac_unregister+0x1c/0x44 [ath10k_core]
> [<0>] ath10k_core_unregister+0x38/0x7c [ath10k_core]
> [<0>] ath10k_sdio_remove+0x8c/0xd0 [ath10k_sdio]
> [<0>] sdio_bus_remove+0x48/0x108
> [<0>] device_release_driver_internal+0x138/0x1ec
> [<0>] driver_detach+0x6c/0xa8
> [<0>] bus_remove_driver+0x78/0xa8
> [<0>] driver_unregister+0x30/0x50
> [<0>] sdio_unregister_driver+0x28/0x34
> [<0>] cleanup_module+0x14/0x6bc [ath10k_sdio]
> [<0>] __arm64_sys_delete_module+0x1e0/0x22c
> [<0>] el0_svc_common+0xa4/0x154
> [<0>] el0_svc_compat_handler+0x2c/0x38
> [<0>] el0_svc_compat+0x8/0x18
> [<0>] 0xffffffffffffffff
>
> The test command run simulate_fw_crash firstly and it call into
> ath10k_sdio_hif_stop from ath10k_core_restart, then napi_disable
> is called and bit NAPI_STATE_SCHED is set. After that, function
> ath10k_sdio_hif_stop is called again from ath10k_stop by command
> "ifconfig wlan0 down" or "rmmod ath10k_sdio", then command blocked.
>
> It is blocked by napi_synchronize, napi_disable will set bit with
> NAPI_STATE_SCHED, and then napi_synchronize will enter dead loop
> becuase bit NAPI_STATE_SCHED is set by napi_disable.
>
> function of napi_synchronize
> static inline void napi_synchronize(const struct napi_struct *n)
> {
> 	if (IS_ENABLED(CONFIG_SMP))
> 		while (test_bit(NAPI_STATE_SCHED, &n->state))
> 			msleep(1);
> 	else
> 		barrier();
> }
>
> function of napi_disable
> void napi_disable(struct napi_struct *n)
> {
> 	might_sleep();
> 	set_bit(NAPI_STATE_DISABLE, &n->state);
>
> 	while (test_and_set_bit(NAPI_STATE_SCHED, &n->state))
> 		msleep(1);
> 	while (test_and_set_bit(NAPI_STATE_NPSVC, &n->state))
> 		msleep(1);
>
> 	hrtimer_cancel(&n->timer);
>
> 	clear_bit(NAPI_STATE_DISABLE, &n->state);
> }
>
> Tested with QCA6174 SDIO with firmware WLAN.RMH.4.4.1-00042.
>
> Signed-off-by: Wen Gong <wgong@codeaurora.org>
> ---
>  drivers/net/wireless/ath/ath10k/sdio.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/wireless/ath/ath10k/sdio.c b/drivers/net/wireless/ath/ath10k/sdio.c
> index 7b894dcaad2e..b71499b171c6 100644
> --- a/drivers/net/wireless/ath/ath10k/sdio.c
> +++ b/drivers/net/wireless/ath/ath10k/sdio.c
> @@ -1756,8 +1756,6 @@ static int ath10k_sdio_hif_start(struct ath10k *ar)
>  	struct ath10k_sdio *ar_sdio = ath10k_sdio_priv(ar);
>  	int ret;
>  
> -	napi_enable(&ar->napi);
> -
>  	/* Sleep 20 ms before HIF interrupts are disabled.
>  	 * This will give target plenty of time to process the BMI done
>  	 * request before interrupts are disabled.
> @@ -1884,7 +1882,6 @@ static void ath10k_sdio_hif_stop(struct ath10k *ar)
>  	spin_unlock_bh(&ar_sdio->wr_async_lock);
>  
>  	napi_synchronize(&ar->napi);
> -	napi_disable(&ar->napi);
>  }
>  
>  #ifdef CONFIG_PM
> @@ -2121,6 +2118,7 @@ static int ath10k_sdio_probe(struct sdio_func *func,
>  
>  	netif_napi_add(&ar->napi_dev, &ar->napi, ath10k_sdio_napi_poll,
>  		       ATH10K_NAPI_BUDGET);
> +	napi_enable(&ar->napi);
>  
>  	ath10k_dbg(ar, ATH10K_DBG_BOOT,
>  		   "sdio new func %d vendor 0x%x device 0x%x block 0x%x/0x%x\n",
> @@ -2235,6 +2233,7 @@ static void ath10k_sdio_remove(struct sdio_func *func)
>  
>  	ath10k_core_unregister(ar);
>  
> +	napi_disable(&ar->napi);
>  	netif_napi_del(&ar->napi);
>  
>  	ath10k_core_destroy(ar);

I'm not really convinced that this is the right fix, but I'm no NAPI
expert. Can anyone else help?

And even if we did this fix/hack in ath10k we should change all bus
types to do the same. SDIO should not behave differently from PCI, AHB
and SNOC.
Krishna Chaitanya Aug. 20, 2020, 9:19 a.m. UTC | #2
On Thu, Aug 20, 2020 at 2:03 PM Kalle Valo <kvalo@codeaurora.org> wrote:
>
> Wen Gong <wgong@codeaurora.org> writes:
>
> > It happened "Kernel panic - not syncing: hung_task: blocked tasks" when
> > test simulate crash and ifconfig down/rmmod meanwhile.
> >
> > Test steps:
> >
> > 1.Test commands
> > echo soft > /sys/kernel/debug/ieee80211/phy0/ath10k/simulate_fw_crash;sleep 0.05;ifconfig wlan0 down
> > echo soft > /sys/kernel/debug/ieee80211/phy0/ath10k/simulate_fw_crash;rmmod ath10k_sdio
> >
> > 2. dmesg:
> > [ 5622.548630] ath10k_sdio mmc1:0001:1: simulating soft firmware crash
> > [ 5622.655995] ieee80211 phy0: Hardware restart was requested
> > [ 5776.355164] INFO: task shill:1572 blocked for more than 122 seconds.
> > [ 5776.355687] INFO: task kworker/1:2:24437 blocked for more than 122 seconds.
> > [ 5776.359812] Kernel panic - not syncing: hung_task: blocked tasks
> > [ 5776.359836] CPU: 1 PID: 55 Comm: khungtaskd Tainted: G        W         4.19.86 #137
> > [ 5776.359846] Hardware name: MediaTek krane sku176 board (DT)
> > [ 5776.359855] Call trace:
> > [ 5776.359868]  dump_backtrace+0x0/0x170
> > [ 5776.359881]  show_stack+0x20/0x2c
> > [ 5776.359896]  dump_stack+0xd4/0x10c
> > [ 5776.359916]  panic+0x12c/0x29c
> > [ 5776.359937]  hung_task_panic+0x0/0x50
> > [ 5776.359953]  kthread+0x120/0x130
> > [ 5776.359965]  ret_from_fork+0x10/0x18
> > [ 5776.359986] SMP: stopping secondary CPUs
> > [ 5776.360012] Kernel Offset: 0x141ea00000 from 0xffffff8008000000
> > [ 5776.360026] CPU features: 0x0,2188200c
> > [ 5776.360035] Memory Limit: none
> >
> > command "ifconfig wlan0 down" or "rmmod ath10k_sdio" will be blocked
> > callstack of ifconfig:
> > [<0>] __switch_to+0x120/0x13c
> > [<0>] msleep+0x28/0x38
> > [<0>] ath10k_sdio_hif_stop+0x24c/0x294 [ath10k_sdio]
> > [<0>] ath10k_core_stop+0x50/0x78 [ath10k_core]
> > [<0>] ath10k_halt+0x120/0x178 [ath10k_core]
> > [<0>] ath10k_stop+0x4c/0x8c [ath10k_core]
> > [<0>] drv_stop+0xe0/0x1e4 [mac80211]
> > [<0>] ieee80211_stop_device+0x48/0x54 [mac80211]
> > [<0>] ieee80211_do_stop+0x678/0x6f8 [mac80211]
> > [<0>] ieee80211_stop+0x20/0x30 [mac80211]
> > [<0>] __dev_close_many+0xb8/0x11c
> > [<0>] __dev_change_flags+0xe0/0x1d0
> > [<0>] dev_change_flags+0x30/0x6c
> > [<0>] devinet_ioctl+0x370/0x564
> > [<0>] inet_ioctl+0xdc/0x304
> > [<0>] sock_do_ioctl+0x50/0x288
> > [<0>] compat_sock_ioctl+0x1b4/0x1aac
> > [<0>] __se_compat_sys_ioctl+0x100/0x26fc
> > [<0>] __arm64_compat_sys_ioctl+0x20/0x2c
> > [<0>] el0_svc_common+0xa4/0x154
> > [<0>] el0_svc_compat_handler+0x2c/0x38
> > [<0>] el0_svc_compat+0x8/0x18
> > [<0>] 0xffffffffffffffff
> >
> > callstack of rmmod:
> > [<0>] __switch_to+0x120/0x13c
> > [<0>] msleep+0x28/0x38
> > [<0>] ath10k_sdio_hif_stop+0x294/0x31c [ath10k_sdio]
> > [<0>] ath10k_core_stop+0x50/0x78 [ath10k_core]
> > [<0>] ath10k_halt+0x120/0x178 [ath10k_core]
> > [<0>] ath10k_stop+0x4c/0x8c [ath10k_core]
> > [<0>] drv_stop+0xe0/0x1e4 [mac80211]
> > [<0>] ieee80211_stop_device+0x48/0x54 [mac80211]
> > [<0>] ieee80211_do_stop+0x678/0x6f8 [mac80211]
> > [<0>] ieee80211_stop+0x20/0x30 [mac80211]
> > [<0>] __dev_close_many+0xb8/0x11c
> > [<0>] dev_close_many+0x70/0x100
> > [<0>] dev_close+0x4c/0x80
> > [<0>] cfg80211_shutdown_all_interfaces+0x50/0xcc [cfg80211]
> > [<0>] ieee80211_remove_interfaces+0x58/0x1a0 [mac80211]
> > [<0>] ieee80211_unregister_hw+0x40/0x100 [mac80211]
> > [<0>] ath10k_mac_unregister+0x1c/0x44 [ath10k_core]
> > [<0>] ath10k_core_unregister+0x38/0x7c [ath10k_core]
> > [<0>] ath10k_sdio_remove+0x8c/0xd0 [ath10k_sdio]
> > [<0>] sdio_bus_remove+0x48/0x108
> > [<0>] device_release_driver_internal+0x138/0x1ec
> > [<0>] driver_detach+0x6c/0xa8
> > [<0>] bus_remove_driver+0x78/0xa8
> > [<0>] driver_unregister+0x30/0x50
> > [<0>] sdio_unregister_driver+0x28/0x34
> > [<0>] cleanup_module+0x14/0x6bc [ath10k_sdio]
> > [<0>] __arm64_sys_delete_module+0x1e0/0x22c
> > [<0>] el0_svc_common+0xa4/0x154
> > [<0>] el0_svc_compat_handler+0x2c/0x38
> > [<0>] el0_svc_compat+0x8/0x18
> > [<0>] 0xffffffffffffffff
> >
> > The test command run simulate_fw_crash firstly and it call into
> > ath10k_sdio_hif_stop from ath10k_core_restart, then napi_disable
> > is called and bit NAPI_STATE_SCHED is set. After that, function
> > ath10k_sdio_hif_stop is called again from ath10k_stop by command
> > "ifconfig wlan0 down" or "rmmod ath10k_sdio", then command blocked.
> >
> > It is blocked by napi_synchronize, napi_disable will set bit with
> > NAPI_STATE_SCHED, and then napi_synchronize will enter dead loop
> > becuase bit NAPI_STATE_SCHED is set by napi_disable.
> >
> > function of napi_synchronize
> > static inline void napi_synchronize(const struct napi_struct *n)
> > {
> >       if (IS_ENABLED(CONFIG_SMP))
> >               while (test_bit(NAPI_STATE_SCHED, &n->state))
> >                       msleep(1);
> >       else
> >               barrier();
> > }
> >
> > function of napi_disable
> > void napi_disable(struct napi_struct *n)
> > {
> >       might_sleep();
> >       set_bit(NAPI_STATE_DISABLE, &n->state);
> >
> >       while (test_and_set_bit(NAPI_STATE_SCHED, &n->state))
> >               msleep(1);
> >       while (test_and_set_bit(NAPI_STATE_NPSVC, &n->state))
> >               msleep(1);
> >
> >       hrtimer_cancel(&n->timer);
> >
> >       clear_bit(NAPI_STATE_DISABLE, &n->state);
> > }
> >
> > Tested with QCA6174 SDIO with firmware WLAN.RMH.4.4.1-00042.
> >
> > Signed-off-by: Wen Gong <wgong@codeaurora.org>
> > ---
> >  drivers/net/wireless/ath/ath10k/sdio.c | 5 ++---
> >  1 file changed, 2 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/net/wireless/ath/ath10k/sdio.c b/drivers/net/wireless/ath/ath10k/sdio.c
> > index 7b894dcaad2e..b71499b171c6 100644
> > --- a/drivers/net/wireless/ath/ath10k/sdio.c
> > +++ b/drivers/net/wireless/ath/ath10k/sdio.c
> > @@ -1756,8 +1756,6 @@ static int ath10k_sdio_hif_start(struct ath10k *ar)
> >       struct ath10k_sdio *ar_sdio = ath10k_sdio_priv(ar);
> >       int ret;
> >
> > -     napi_enable(&ar->napi);
> > -
> >       /* Sleep 20 ms before HIF interrupts are disabled.
> >        * This will give target plenty of time to process the BMI done
> >        * request before interrupts are disabled.
> > @@ -1884,7 +1882,6 @@ static void ath10k_sdio_hif_stop(struct ath10k *ar)
> >       spin_unlock_bh(&ar_sdio->wr_async_lock);
> >
> >       napi_synchronize(&ar->napi);
> > -     napi_disable(&ar->napi);
> >  }
> >
> >  #ifdef CONFIG_PM
> > @@ -2121,6 +2118,7 @@ static int ath10k_sdio_probe(struct sdio_func *func,
> >
> >       netif_napi_add(&ar->napi_dev, &ar->napi, ath10k_sdio_napi_poll,
> >                      ATH10K_NAPI_BUDGET);
> > +     napi_enable(&ar->napi);
> >
> >       ath10k_dbg(ar, ATH10K_DBG_BOOT,
> >                  "sdio new func %d vendor 0x%x device 0x%x block 0x%x/0x%x\n",
> > @@ -2235,6 +2233,7 @@ static void ath10k_sdio_remove(struct sdio_func *func)
> >
> >       ath10k_core_unregister(ar);
> >
> > +     napi_disable(&ar->napi);
> >       netif_napi_del(&ar->napi);
> >
> >       ath10k_core_destroy(ar);
>
> I'm not really convinced that this is the right fix, but I'm no NAPI
> expert. Can anyone else help?
Calling napi_disable() twice can lead to hangs, but moving NAPI from
start/stop to
the probe isn't the right approach as the datapath is tied to start/stop.

Maybe check the state of NAPI before disable?

 if (test_bit(NAPI_STATE_SCHED, &ar->napi.napi.state))
  napi_disable(&ar->napi)

or maintain napi_state like this https://patchwork.kernel.org/patch/10249365/

Also, the most common cause for such issues (1st
napi_synchronize/napi_disable hang)
is that napi_poll is being scheduled, so, you might want to check that
napi_schedule isn't
called after stop.

cd ath10k; git log --grep=napi shows plenty of such issues. the one
that matches closest is
c2cac2f74ab4bcf0db0dcf3a612f1e5b52d145c8, so, it could just be a regression.
Krishna Chaitanya Aug. 20, 2020, 9:26 a.m. UTC | #3
On Thu, Aug 20, 2020 at 2:49 PM Krishna Chaitanya
<chaitanya.mgit@gmail.com> wrote:
>
> On Thu, Aug 20, 2020 at 2:03 PM Kalle Valo <kvalo@codeaurora.org> wrote:
> >
> > Wen Gong <wgong@codeaurora.org> writes:
> >
> > > It happened "Kernel panic - not syncing: hung_task: blocked tasks" when
> > > test simulate crash and ifconfig down/rmmod meanwhile.
> > >
> > > Test steps:
> > >
> > > 1.Test commands
> > > echo soft > /sys/kernel/debug/ieee80211/phy0/ath10k/simulate_fw_crash;sleep 0.05;ifconfig wlan0 down
> > > echo soft > /sys/kernel/debug/ieee80211/phy0/ath10k/simulate_fw_crash;rmmod ath10k_sdio
> > >
> > > 2. dmesg:
> > > [ 5622.548630] ath10k_sdio mmc1:0001:1: simulating soft firmware crash
> > > [ 5622.655995] ieee80211 phy0: Hardware restart was requested
> > > [ 5776.355164] INFO: task shill:1572 blocked for more than 122 seconds.
> > > [ 5776.355687] INFO: task kworker/1:2:24437 blocked for more than 122 seconds.
> > > [ 5776.359812] Kernel panic - not syncing: hung_task: blocked tasks
> > > [ 5776.359836] CPU: 1 PID: 55 Comm: khungtaskd Tainted: G        W         4.19.86 #137
> > > [ 5776.359846] Hardware name: MediaTek krane sku176 board (DT)
> > > [ 5776.359855] Call trace:
> > > [ 5776.359868]  dump_backtrace+0x0/0x170
> > > [ 5776.359881]  show_stack+0x20/0x2c
> > > [ 5776.359896]  dump_stack+0xd4/0x10c
> > > [ 5776.359916]  panic+0x12c/0x29c
> > > [ 5776.359937]  hung_task_panic+0x0/0x50
> > > [ 5776.359953]  kthread+0x120/0x130
> > > [ 5776.359965]  ret_from_fork+0x10/0x18
> > > [ 5776.359986] SMP: stopping secondary CPUs
> > > [ 5776.360012] Kernel Offset: 0x141ea00000 from 0xffffff8008000000
> > > [ 5776.360026] CPU features: 0x0,2188200c
> > > [ 5776.360035] Memory Limit: none
> > >
> > > command "ifconfig wlan0 down" or "rmmod ath10k_sdio" will be blocked
> > > callstack of ifconfig:
> > > [<0>] __switch_to+0x120/0x13c
> > > [<0>] msleep+0x28/0x38
> > > [<0>] ath10k_sdio_hif_stop+0x24c/0x294 [ath10k_sdio]
> > > [<0>] ath10k_core_stop+0x50/0x78 [ath10k_core]
> > > [<0>] ath10k_halt+0x120/0x178 [ath10k_core]
> > > [<0>] ath10k_stop+0x4c/0x8c [ath10k_core]
> > > [<0>] drv_stop+0xe0/0x1e4 [mac80211]
> > > [<0>] ieee80211_stop_device+0x48/0x54 [mac80211]
> > > [<0>] ieee80211_do_stop+0x678/0x6f8 [mac80211]
> > > [<0>] ieee80211_stop+0x20/0x30 [mac80211]
> > > [<0>] __dev_close_many+0xb8/0x11c
> > > [<0>] __dev_change_flags+0xe0/0x1d0
> > > [<0>] dev_change_flags+0x30/0x6c
> > > [<0>] devinet_ioctl+0x370/0x564
> > > [<0>] inet_ioctl+0xdc/0x304
> > > [<0>] sock_do_ioctl+0x50/0x288
> > > [<0>] compat_sock_ioctl+0x1b4/0x1aac
> > > [<0>] __se_compat_sys_ioctl+0x100/0x26fc
> > > [<0>] __arm64_compat_sys_ioctl+0x20/0x2c
> > > [<0>] el0_svc_common+0xa4/0x154
> > > [<0>] el0_svc_compat_handler+0x2c/0x38
> > > [<0>] el0_svc_compat+0x8/0x18
> > > [<0>] 0xffffffffffffffff
> > >
> > > callstack of rmmod:
> > > [<0>] __switch_to+0x120/0x13c
> > > [<0>] msleep+0x28/0x38
> > > [<0>] ath10k_sdio_hif_stop+0x294/0x31c [ath10k_sdio]
> > > [<0>] ath10k_core_stop+0x50/0x78 [ath10k_core]
> > > [<0>] ath10k_halt+0x120/0x178 [ath10k_core]
> > > [<0>] ath10k_stop+0x4c/0x8c [ath10k_core]
> > > [<0>] drv_stop+0xe0/0x1e4 [mac80211]
> > > [<0>] ieee80211_stop_device+0x48/0x54 [mac80211]
> > > [<0>] ieee80211_do_stop+0x678/0x6f8 [mac80211]
> > > [<0>] ieee80211_stop+0x20/0x30 [mac80211]
> > > [<0>] __dev_close_many+0xb8/0x11c
> > > [<0>] dev_close_many+0x70/0x100
> > > [<0>] dev_close+0x4c/0x80
> > > [<0>] cfg80211_shutdown_all_interfaces+0x50/0xcc [cfg80211]
> > > [<0>] ieee80211_remove_interfaces+0x58/0x1a0 [mac80211]
> > > [<0>] ieee80211_unregister_hw+0x40/0x100 [mac80211]
> > > [<0>] ath10k_mac_unregister+0x1c/0x44 [ath10k_core]
> > > [<0>] ath10k_core_unregister+0x38/0x7c [ath10k_core]
> > > [<0>] ath10k_sdio_remove+0x8c/0xd0 [ath10k_sdio]
> > > [<0>] sdio_bus_remove+0x48/0x108
> > > [<0>] device_release_driver_internal+0x138/0x1ec
> > > [<0>] driver_detach+0x6c/0xa8
> > > [<0>] bus_remove_driver+0x78/0xa8
> > > [<0>] driver_unregister+0x30/0x50
> > > [<0>] sdio_unregister_driver+0x28/0x34
> > > [<0>] cleanup_module+0x14/0x6bc [ath10k_sdio]
> > > [<0>] __arm64_sys_delete_module+0x1e0/0x22c
> > > [<0>] el0_svc_common+0xa4/0x154
> > > [<0>] el0_svc_compat_handler+0x2c/0x38
> > > [<0>] el0_svc_compat+0x8/0x18
> > > [<0>] 0xffffffffffffffff
> > >
> > > The test command run simulate_fw_crash firstly and it call into
> > > ath10k_sdio_hif_stop from ath10k_core_restart, then napi_disable
> > > is called and bit NAPI_STATE_SCHED is set. After that, function
> > > ath10k_sdio_hif_stop is called again from ath10k_stop by command
> > > "ifconfig wlan0 down" or "rmmod ath10k_sdio", then command blocked.
> > >
> > > It is blocked by napi_synchronize, napi_disable will set bit with
> > > NAPI_STATE_SCHED, and then napi_synchronize will enter dead loop
> > > becuase bit NAPI_STATE_SCHED is set by napi_disable.
> > >
> > > function of napi_synchronize
> > > static inline void napi_synchronize(const struct napi_struct *n)
> > > {
> > >       if (IS_ENABLED(CONFIG_SMP))
> > >               while (test_bit(NAPI_STATE_SCHED, &n->state))
> > >                       msleep(1);
> > >       else
> > >               barrier();
> > > }
> > >
> > > function of napi_disable
> > > void napi_disable(struct napi_struct *n)
> > > {
> > >       might_sleep();
> > >       set_bit(NAPI_STATE_DISABLE, &n->state);
> > >
> > >       while (test_and_set_bit(NAPI_STATE_SCHED, &n->state))
> > >               msleep(1);
> > >       while (test_and_set_bit(NAPI_STATE_NPSVC, &n->state))
> > >               msleep(1);
> > >
> > >       hrtimer_cancel(&n->timer);
> > >
> > >       clear_bit(NAPI_STATE_DISABLE, &n->state);
> > > }
> > >
> > > Tested with QCA6174 SDIO with firmware WLAN.RMH.4.4.1-00042.
> > >
> > > Signed-off-by: Wen Gong <wgong@codeaurora.org>
> > > ---
> > >  drivers/net/wireless/ath/ath10k/sdio.c | 5 ++---
> > >  1 file changed, 2 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/drivers/net/wireless/ath/ath10k/sdio.c b/drivers/net/wireless/ath/ath10k/sdio.c
> > > index 7b894dcaad2e..b71499b171c6 100644
> > > --- a/drivers/net/wireless/ath/ath10k/sdio.c
> > > +++ b/drivers/net/wireless/ath/ath10k/sdio.c
> > > @@ -1756,8 +1756,6 @@ static int ath10k_sdio_hif_start(struct ath10k *ar)
> > >       struct ath10k_sdio *ar_sdio = ath10k_sdio_priv(ar);
> > >       int ret;
> > >
> > > -     napi_enable(&ar->napi);
> > > -
> > >       /* Sleep 20 ms before HIF interrupts are disabled.
> > >        * This will give target plenty of time to process the BMI done
> > >        * request before interrupts are disabled.
> > > @@ -1884,7 +1882,6 @@ static void ath10k_sdio_hif_stop(struct ath10k *ar)
> > >       spin_unlock_bh(&ar_sdio->wr_async_lock);
> > >
> > >       napi_synchronize(&ar->napi);
> > > -     napi_disable(&ar->napi);
> > >  }
> > >
> > >  #ifdef CONFIG_PM
> > > @@ -2121,6 +2118,7 @@ static int ath10k_sdio_probe(struct sdio_func *func,
> > >
> > >       netif_napi_add(&ar->napi_dev, &ar->napi, ath10k_sdio_napi_poll,
> > >                      ATH10K_NAPI_BUDGET);
> > > +     napi_enable(&ar->napi);
> > >
> > >       ath10k_dbg(ar, ATH10K_DBG_BOOT,
> > >                  "sdio new func %d vendor 0x%x device 0x%x block 0x%x/0x%x\n",
> > > @@ -2235,6 +2233,7 @@ static void ath10k_sdio_remove(struct sdio_func *func)
> > >
> > >       ath10k_core_unregister(ar);
> > >
> > > +     napi_disable(&ar->napi);
> > >       netif_napi_del(&ar->napi);
> > >
> > >       ath10k_core_destroy(ar);
> >
> > I'm not really convinced that this is the right fix, but I'm no NAPI
> > expert. Can anyone else help?
> Calling napi_disable() twice can lead to hangs, but moving NAPI from
> start/stop to
> the probe isn't the right approach as the datapath is tied to start/stop.
>
> Maybe check the state of NAPI before disable?
>
>  if (test_bit(NAPI_STATE_SCHED, &ar->napi.napi.state))
>   napi_disable(&ar->napi)
>
> or maintain napi_state like this https://patchwork.kernel.org/patch/10249365/
>
> Also, the most common cause for such issues (1st
> napi_synchronize/napi_disable hang)
> is that napi_poll is being scheduled, so, you might want to check that
> napi_schedule isn't
> called after stop.
>
> cd ath10k; git log --grep=napi shows plenty of such issues. the one
> that matches closest is
> c2cac2f74ab4bcf0db0dcf3a612f1e5b52d145c8, so, it could just be a regression.
Also, I see that napi_schedule() is being called from work_queue async_work_rx
so we should cancel that work in hif_stop before calling napi_synchronize.
Wen Gong Aug. 20, 2020, 10:14 a.m. UTC | #4
On 2020-08-20 17:19, Krishna Chaitanya wrote:
...
>> > diff --git a/drivers/net/wireless/ath/ath10k/sdio.c b/drivers/net/wireless/ath/ath10k/sdio.c
>> > index 7b894dcaad2e..b71499b171c6 100644
>> > --- a/drivers/net/wireless/ath/ath10k/sdio.c
>> > +++ b/drivers/net/wireless/ath/ath10k/sdio.c
>> > @@ -1756,8 +1756,6 @@ static int ath10k_sdio_hif_start(struct ath10k *ar)
>> >       struct ath10k_sdio *ar_sdio = ath10k_sdio_priv(ar);
>> >       int ret;
>> >
>> > -     napi_enable(&ar->napi);
>> > -
>> >       /* Sleep 20 ms before HIF interrupts are disabled.
>> >        * This will give target plenty of time to process the BMI done
>> >        * request before interrupts are disabled.
>> > @@ -1884,7 +1882,6 @@ static void ath10k_sdio_hif_stop(struct ath10k *ar)
>> >       spin_unlock_bh(&ar_sdio->wr_async_lock);
>> >
>> >       napi_synchronize(&ar->napi);
>> > -     napi_disable(&ar->napi);
>> >  }
>> >
>> >  #ifdef CONFIG_PM
>> > @@ -2121,6 +2118,7 @@ static int ath10k_sdio_probe(struct sdio_func *func,
>> >
>> >       netif_napi_add(&ar->napi_dev, &ar->napi, ath10k_sdio_napi_poll,
>> >                      ATH10K_NAPI_BUDGET);
>> > +     napi_enable(&ar->napi);
>> >
>> >       ath10k_dbg(ar, ATH10K_DBG_BOOT,
>> >                  "sdio new func %d vendor 0x%x device 0x%x block 0x%x/0x%x\n",
>> > @@ -2235,6 +2233,7 @@ static void ath10k_sdio_remove(struct sdio_func *func)
>> >
>> >       ath10k_core_unregister(ar);
>> >
>> > +     napi_disable(&ar->napi);
>> >       netif_napi_del(&ar->napi);
>> >
>> >       ath10k_core_destroy(ar);
>> 
>> I'm not really convinced that this is the right fix, but I'm no NAPI
>> expert. Can anyone else help?
> Calling napi_disable() twice can lead to hangs, but moving NAPI from
> start/stop to
> the probe isn't the right approach as the datapath is tied to 
> start/stop.
> 
> Maybe check the state of NAPI before disable?
> 
>  if (test_bit(NAPI_STATE_SCHED, &ar->napi.napi.state))
>   napi_disable(&ar->napi)
> or maintain napi_state like this 
> https://patchwork.kernel.org/patch/10249365/
it is better to use above link's patch.
napi.state is controlled by napi API, it is better ath10k not know it.
> 
> Also, the most common cause for such issues (1st
> napi_synchronize/napi_disable hang)
> is that napi_poll is being scheduled, so, you might want to check that
> napi_schedule isn't
> called after stop.
> 
> cd ath10k; git log --grep=napi shows plenty of such issues. the one
> that matches closest is
> c2cac2f74ab4bcf0db0dcf3a612f1e5b52d145c8, so, it could just be a 
> regression.
This above commit's scene is not same with this patch.
It is hang for only do 1 simulate crash of the commit, this patch is 
doing simulate crash and rmmod meanwhile.
Wen Gong Aug. 20, 2020, 10:20 a.m. UTC | #5
On 2020-08-20 17:26, Krishna Chaitanya wrote:
> On Thu, Aug 20, 2020 at 2:49 PM Krishna Chaitanya
> <chaitanya.mgit@gmail.com> wrote:
>> 
>> On Thu, Aug 20, 2020 at 2:03 PM Kalle Valo <kvalo@codeaurora.org> 
>> wrote:
>> >
>> > Wen Gong <wgong@codeaurora.org> writes:
>> >
...
>> > I'm not really convinced that this is the right fix, but I'm no NAPI
>> > expert. Can anyone else help?
>> Calling napi_disable() twice can lead to hangs, but moving NAPI from
>> start/stop to
>> the probe isn't the right approach as the datapath is tied to 
>> start/stop.
>> 
>> Maybe check the state of NAPI before disable?
>> 
>>  if (test_bit(NAPI_STATE_SCHED, &ar->napi.napi.state))
>>   napi_disable(&ar->napi)
>> 
>> or maintain napi_state like this 
>> https://patchwork.kernel.org/patch/10249365/
>> 
>> Also, the most common cause for such issues (1st
>> napi_synchronize/napi_disable hang)
>> is that napi_poll is being scheduled, so, you might want to check that
>> napi_schedule isn't
>> called after stop.
>> 
>> cd ath10k; git log --grep=napi shows plenty of such issues. the one
>> that matches closest is
>> c2cac2f74ab4bcf0db0dcf3a612f1e5b52d145c8, so, it could just be a 
>> regression.
> Also, I see that napi_schedule() is being called from work_queue 
> async_work_rx
> so we should cancel that work in hif_stop before calling 
> napi_synchronize.
Yes, I will do that in a new patch.
Krishna Chaitanya Aug. 20, 2020, 10:52 a.m. UTC | #6
On Thu, Aug 20, 2020 at 3:45 PM Wen Gong <wgong@codeaurora.org> wrote:
>
> On 2020-08-20 17:19, Krishna Chaitanya wrote:
> ...
> >> > diff --git a/drivers/net/wireless/ath/ath10k/sdio.c b/drivers/net/wireless/ath/ath10k/sdio.c
> >> > index 7b894dcaad2e..b71499b171c6 100644
> >> > --- a/drivers/net/wireless/ath/ath10k/sdio.c
> >> > +++ b/drivers/net/wireless/ath/ath10k/sdio.c
> >> > @@ -1756,8 +1756,6 @@ static int ath10k_sdio_hif_start(struct ath10k *ar)
> >> >       struct ath10k_sdio *ar_sdio = ath10k_sdio_priv(ar);
> >> >       int ret;
> >> >
> >> > -     napi_enable(&ar->napi);
> >> > -
> >> >       /* Sleep 20 ms before HIF interrupts are disabled.
> >> >        * This will give target plenty of time to process the BMI done
> >> >        * request before interrupts are disabled.
> >> > @@ -1884,7 +1882,6 @@ static void ath10k_sdio_hif_stop(struct ath10k *ar)
> >> >       spin_unlock_bh(&ar_sdio->wr_async_lock);
> >> >
> >> >       napi_synchronize(&ar->napi);
> >> > -     napi_disable(&ar->napi);
> >> >  }
> >> >
> >> >  #ifdef CONFIG_PM
> >> > @@ -2121,6 +2118,7 @@ static int ath10k_sdio_probe(struct sdio_func *func,
> >> >
> >> >       netif_napi_add(&ar->napi_dev, &ar->napi, ath10k_sdio_napi_poll,
> >> >                      ATH10K_NAPI_BUDGET);
> >> > +     napi_enable(&ar->napi);
> >> >
> >> >       ath10k_dbg(ar, ATH10K_DBG_BOOT,
> >> >                  "sdio new func %d vendor 0x%x device 0x%x block 0x%x/0x%x\n",
> >> > @@ -2235,6 +2233,7 @@ static void ath10k_sdio_remove(struct sdio_func *func)
> >> >
> >> >       ath10k_core_unregister(ar);
> >> >
> >> > +     napi_disable(&ar->napi);
> >> >       netif_napi_del(&ar->napi);
> >> >
> >> >       ath10k_core_destroy(ar);
> >>
> >> I'm not really convinced that this is the right fix, but I'm no NAPI
> >> expert. Can anyone else help?
> > Calling napi_disable() twice can lead to hangs, but moving NAPI from
> > start/stop to
> > the probe isn't the right approach as the datapath is tied to
> > start/stop.
> >
> > Maybe check the state of NAPI before disable?
> >
> >  if (test_bit(NAPI_STATE_SCHED, &ar->napi.napi.state))
> >   napi_disable(&ar->napi)
> > or maintain napi_state like this
> > https://patchwork.kernel.org/patch/10249365/
> it is better to use above link's patch.
> napi.state is controlled by napi API, it is better ath10k not know it.
Sure, but IMHO just canceling the async rx work should solve the issue.

> > Also, the most common cause for such issues (1st
> > napi_synchronize/napi_disable hang)
> > is that napi_poll is being scheduled, so, you might want to check that
> > napi_schedule isn't
> > called after stop.
> >
> > cd ath10k; git log --grep=napi shows plenty of such issues. the one
> > that matches closest is
> > c2cac2f74ab4bcf0db0dcf3a612f1e5b52d145c8, so, it could just be a
> > regression.
> This above commit's scene is not same with this patch.
> It is hang for only do 1 simulate crash of the commit, this patch is
> doing simulate crash and rmmod meanwhile.
yes, it is the closest one I could find.
Wen Gong Aug. 20, 2020, 2:37 p.m. UTC | #7
On 2020-08-20 18:52, Krishna Chaitanya wrote:
> On Thu, Aug 20, 2020 at 3:45 PM Wen Gong <wgong@codeaurora.org> wrote:
>> 
>> On 2020-08-20 17:19, Krishna Chaitanya wrote:
...
>> >> I'm not really convinced that this is the right fix, but I'm no NAPI
>> >> expert. Can anyone else help?
>> > Calling napi_disable() twice can lead to hangs, but moving NAPI from
>> > start/stop to
>> > the probe isn't the right approach as the datapath is tied to
>> > start/stop.
>> >
>> > Maybe check the state of NAPI before disable?
>> >
>> >  if (test_bit(NAPI_STATE_SCHED, &ar->napi.napi.state))
>> >   napi_disable(&ar->napi)
>> > or maintain napi_state like this
>> > https://patchwork.kernel.org/patch/10249365/
>> it is better to use above link's patch.
>> napi.state is controlled by napi API, it is better ath10k not know it.
> Sure, but IMHO just canceling the async rx work should solve the issue.
Oh no, canceling the async rx work will not solve this issue, rx worker
ath10k_rx_indication_async_work call napi_schedule, after napi_complete,
the NAPI_STATE_SCHED will clear.
The issue of this patch is because 2 thread called to hif_stop and
NAPI_STATE_SCHED not clear.
...
Krishna Chaitanya Aug. 20, 2020, 4:08 p.m. UTC | #8
On Thu, Aug 20, 2020 at 8:07 PM Wen Gong <wgong@codeaurora.org> wrote:
>
> On 2020-08-20 18:52, Krishna Chaitanya wrote:
> > On Thu, Aug 20, 2020 at 3:45 PM Wen Gong <wgong@codeaurora.org> wrote:
> >>
> >> On 2020-08-20 17:19, Krishna Chaitanya wrote:
> ...
> >> >> I'm not really convinced that this is the right fix, but I'm no NAPI
> >> >> expert. Can anyone else help?
> >> > Calling napi_disable() twice can lead to hangs, but moving NAPI from
> >> > start/stop to
> >> > the probe isn't the right approach as the datapath is tied to
> >> > start/stop.
> >> >
> >> > Maybe check the state of NAPI before disable?
> >> >
> >> >  if (test_bit(NAPI_STATE_SCHED, &ar->napi.napi.state))
> >> >   napi_disable(&ar->napi)
> >> > or maintain napi_state like this
> >> > https://patchwork.kernel.org/patch/10249365/
> >> it is better to use above link's patch.
> >> napi.state is controlled by napi API, it is better ath10k not know it.
> > Sure, but IMHO just canceling the async rx work should solve the issue.
> Oh no, canceling the async rx work will not solve this issue, rx worker
> ath10k_rx_indication_async_work call napi_schedule, after napi_complete,
> the NAPI_STATE_SCHED will clear.
> The issue of this patch is because 2 thread called to hif_stop and
> NAPI_STATE_SCHED not clear.
That fix is still valid and good to have.

ndev_stop being called twice is typical scenarios (stop vs rmmod), so
 just checking the netdev_flags for IFF_UP and returning from hif_Stop
should suffice, no?
Ben Greear Aug. 20, 2020, 4:32 p.m. UTC | #9
On 8/20/20 9:08 AM, Krishna Chaitanya wrote:
> On Thu, Aug 20, 2020 at 8:07 PM Wen Gong <wgong@codeaurora.org> wrote:
>>
>> On 2020-08-20 18:52, Krishna Chaitanya wrote:
>>> On Thu, Aug 20, 2020 at 3:45 PM Wen Gong <wgong@codeaurora.org> wrote:
>>>>
>>>> On 2020-08-20 17:19, Krishna Chaitanya wrote:
>> ...
>>>>>> I'm not really convinced that this is the right fix, but I'm no NAPI
>>>>>> expert. Can anyone else help?
>>>>> Calling napi_disable() twice can lead to hangs, but moving NAPI from
>>>>> start/stop to
>>>>> the probe isn't the right approach as the datapath is tied to
>>>>> start/stop.
>>>>>
>>>>> Maybe check the state of NAPI before disable?
>>>>>
>>>>>   if (test_bit(NAPI_STATE_SCHED, &ar->napi.napi.state))
>>>>>    napi_disable(&ar->napi)
>>>>> or maintain napi_state like this
>>>>> https://patchwork.kernel.org/patch/10249365/
>>>> it is better to use above link's patch.
>>>> napi.state is controlled by napi API, it is better ath10k not know it.
>>> Sure, but IMHO just canceling the async rx work should solve the issue.
>> Oh no, canceling the async rx work will not solve this issue, rx worker
>> ath10k_rx_indication_async_work call napi_schedule, after napi_complete,
>> the NAPI_STATE_SCHED will clear.
>> The issue of this patch is because 2 thread called to hif_stop and
>> NAPI_STATE_SCHED not clear.
> That fix is still valid and good to have.
> 
> ndev_stop being called twice is typical scenarios (stop vs rmmod), so
>   just checking the netdev_flags for IFF_UP and returning from hif_Stop
> should suffice, no?

My approach to fix this problem was to add a boolean in ath10k as to whether
it had napi enabled or not, and then check that before trying to enable/disable
it again.  Seems to work fine, and cleaner in my mind than checking internal
napi flags.

Thanks,
Ben
Krishna Chaitanya Aug. 20, 2020, 5 p.m. UTC | #10
On Thu, Aug 20, 2020 at 10:02 PM Ben Greear <greearb@candelatech.com> wrote:
>
> On 8/20/20 9:08 AM, Krishna Chaitanya wrote:
> > On Thu, Aug 20, 2020 at 8:07 PM Wen Gong <wgong@codeaurora.org> wrote:
> >>
> >> On 2020-08-20 18:52, Krishna Chaitanya wrote:
> >>> On Thu, Aug 20, 2020 at 3:45 PM Wen Gong <wgong@codeaurora.org> wrote:
> >>>>
> >>>> On 2020-08-20 17:19, Krishna Chaitanya wrote:
> >> ...
> >>>>>> I'm not really convinced that this is the right fix, but I'm no NAPI
> >>>>>> expert. Can anyone else help?
> >>>>> Calling napi_disable() twice can lead to hangs, but moving NAPI from
> >>>>> start/stop to
> >>>>> the probe isn't the right approach as the datapath is tied to
> >>>>> start/stop.
> >>>>>
> >>>>> Maybe check the state of NAPI before disable?
> >>>>>
> >>>>>   if (test_bit(NAPI_STATE_SCHED, &ar->napi.napi.state))
> >>>>>    napi_disable(&ar->napi)
> >>>>> or maintain napi_state like this
> >>>>> https://patchwork.kernel.org/patch/10249365/
> >>>> it is better to use above link's patch.
> >>>> napi.state is controlled by napi API, it is better ath10k not know it.
> >>> Sure, but IMHO just canceling the async rx work should solve the issue.
> >> Oh no, canceling the async rx work will not solve this issue, rx worker
> >> ath10k_rx_indication_async_work call napi_schedule, after napi_complete,
> >> the NAPI_STATE_SCHED will clear.
> >> The issue of this patch is because 2 thread called to hif_stop and
> >> NAPI_STATE_SCHED not clear.
> > That fix is still valid and good to have.
> >
> > ndev_stop being called twice is typical scenarios (stop vs rmmod), so
> >   just checking the netdev_flags for IFF_UP and returning from hif_Stop
> > should suffice, no?
>
> My approach to fix this problem was to add a boolean in ath10k as to whether
> it had napi enabled or not, and then check that before trying to enable/disable
> it again.  Seems to work fine, and cleaner in my mind than checking internal
> napi flags.
A much simpler approach is just to check for IFF_UP and skip NAPI (and others)
in the hif_stop no? (provided proper RTNL locking is done if hif_stop
is being called
internally as well).
Ben Greear Aug. 20, 2020, 5:07 p.m. UTC | #11
On 8/20/20 10:00 AM, Krishna Chaitanya wrote:
> On Thu, Aug 20, 2020 at 10:02 PM Ben Greear <greearb@candelatech.com> wrote:
>>
>> On 8/20/20 9:08 AM, Krishna Chaitanya wrote:
>>> On Thu, Aug 20, 2020 at 8:07 PM Wen Gong <wgong@codeaurora.org> wrote:
>>>>
>>>> On 2020-08-20 18:52, Krishna Chaitanya wrote:
>>>>> On Thu, Aug 20, 2020 at 3:45 PM Wen Gong <wgong@codeaurora.org> wrote:
>>>>>>
>>>>>> On 2020-08-20 17:19, Krishna Chaitanya wrote:
>>>> ...
>>>>>>>> I'm not really convinced that this is the right fix, but I'm no NAPI
>>>>>>>> expert. Can anyone else help?
>>>>>>> Calling napi_disable() twice can lead to hangs, but moving NAPI from
>>>>>>> start/stop to
>>>>>>> the probe isn't the right approach as the datapath is tied to
>>>>>>> start/stop.
>>>>>>>
>>>>>>> Maybe check the state of NAPI before disable?
>>>>>>>
>>>>>>>    if (test_bit(NAPI_STATE_SCHED, &ar->napi.napi.state))
>>>>>>>     napi_disable(&ar->napi)
>>>>>>> or maintain napi_state like this
>>>>>>> https://patchwork.kernel.org/patch/10249365/
>>>>>> it is better to use above link's patch.
>>>>>> napi.state is controlled by napi API, it is better ath10k not know it.
>>>>> Sure, but IMHO just canceling the async rx work should solve the issue.
>>>> Oh no, canceling the async rx work will not solve this issue, rx worker
>>>> ath10k_rx_indication_async_work call napi_schedule, after napi_complete,
>>>> the NAPI_STATE_SCHED will clear.
>>>> The issue of this patch is because 2 thread called to hif_stop and
>>>> NAPI_STATE_SCHED not clear.
>>> That fix is still valid and good to have.
>>>
>>> ndev_stop being called twice is typical scenarios (stop vs rmmod), so
>>>    just checking the netdev_flags for IFF_UP and returning from hif_Stop
>>> should suffice, no?
>>
>> My approach to fix this problem was to add a boolean in ath10k as to whether
>> it had napi enabled or not, and then check that before trying to enable/disable
>> it again.  Seems to work fine, and cleaner in my mind than checking internal
>> napi flags.
> A much simpler approach is just to check for IFF_UP and skip NAPI (and others)
> in the hif_stop no? (provided proper RTNL locking is done if hif_stop
> is being called
> internally as well).
> 

I'm not sure, but I think the driver should be internally consistent and not
spend a lot of time trying to guess about interactions with objects higher
in the stack.

Here is my original patch to fix this, it is not complex.

https://patchwork.kernel.org/patch/10249363/

Thanks,
Ben
Krishna Chaitanya Aug. 20, 2020, 5:41 p.m. UTC | #12
On Thu, Aug 20, 2020 at 10:38 PM Ben Greear <greearb@candelatech.com> wrote:
>
> On 8/20/20 10:00 AM, Krishna Chaitanya wrote:
> > On Thu, Aug 20, 2020 at 10:02 PM Ben Greear <greearb@candelatech.com> wrote:
> >>
> >> On 8/20/20 9:08 AM, Krishna Chaitanya wrote:
> >>> On Thu, Aug 20, 2020 at 8:07 PM Wen Gong <wgong@codeaurora.org> wrote:
> >>>>
> >>>> On 2020-08-20 18:52, Krishna Chaitanya wrote:
> >>>>> On Thu, Aug 20, 2020 at 3:45 PM Wen Gong <wgong@codeaurora.org> wrote:
> >>>>>>
> >>>>>> On 2020-08-20 17:19, Krishna Chaitanya wrote:
> >>>> ...
> >>>>>>>> I'm not really convinced that this is the right fix, but I'm no NAPI
> >>>>>>>> expert. Can anyone else help?
> >>>>>>> Calling napi_disable() twice can lead to hangs, but moving NAPI from
> >>>>>>> start/stop to
> >>>>>>> the probe isn't the right approach as the datapath is tied to
> >>>>>>> start/stop.
> >>>>>>>
> >>>>>>> Maybe check the state of NAPI before disable?
> >>>>>>>
> >>>>>>>    if (test_bit(NAPI_STATE_SCHED, &ar->napi.napi.state))
> >>>>>>>     napi_disable(&ar->napi)
> >>>>>>> or maintain napi_state like this
> >>>>>>> https://patchwork.kernel.org/patch/10249365/
> >>>>>> it is better to use above link's patch.
> >>>>>> napi.state is controlled by napi API, it is better ath10k not know it.
> >>>>> Sure, but IMHO just canceling the async rx work should solve the issue.
> >>>> Oh no, canceling the async rx work will not solve this issue, rx worker
> >>>> ath10k_rx_indication_async_work call napi_schedule, after napi_complete,
> >>>> the NAPI_STATE_SCHED will clear.
> >>>> The issue of this patch is because 2 thread called to hif_stop and
> >>>> NAPI_STATE_SCHED not clear.
> >>> That fix is still valid and good to have.
> >>>
> >>> ndev_stop being called twice is typical scenarios (stop vs rmmod), so
> >>>    just checking the netdev_flags for IFF_UP and returning from hif_Stop
> >>> should suffice, no?
> >>
> >> My approach to fix this problem was to add a boolean in ath10k as to whether
> >> it had napi enabled or not, and then check that before trying to enable/disable
> >> it again.  Seems to work fine, and cleaner in my mind than checking internal
> >> napi flags.
> > A much simpler approach is just to check for IFF_UP and skip NAPI (and others)
> > in the hif_stop no? (provided proper RTNL locking is done if hif_stop
> > is being called
> > internally as well).
> >
>
> I'm not sure, but I think the driver should be internally consistent and not
> spend a lot of time trying to guess about interactions with objects higher
> in the stack.
Fair enough, the network interface state is a basic thing controlled
by the driver,
so, should be okay to use. Anyways, the in-driver approach has more control.
>
> Here is my original patch to fix this, it is not complex.
>
> https://patchwork.kernel.org/patch/10249363/
Sure, I have shared your patch above :).
Krishna Chaitanya Aug. 20, 2020, 5:42 p.m. UTC | #13
On Thu, Aug 20, 2020 at 11:11 PM Krishna Chaitanya
<chaitanya.mgit@gmail.com> wrote:
>
> On Thu, Aug 20, 2020 at 10:38 PM Ben Greear <greearb@candelatech.com> wrote:
> >
> > On 8/20/20 10:00 AM, Krishna Chaitanya wrote:
> > > On Thu, Aug 20, 2020 at 10:02 PM Ben Greear <greearb@candelatech.com> wrote:
> > >>
> > >> On 8/20/20 9:08 AM, Krishna Chaitanya wrote:
> > >>> On Thu, Aug 20, 2020 at 8:07 PM Wen Gong <wgong@codeaurora.org> wrote:
> > >>>>
> > >>>> On 2020-08-20 18:52, Krishna Chaitanya wrote:
> > >>>>> On Thu, Aug 20, 2020 at 3:45 PM Wen Gong <wgong@codeaurora.org> wrote:
> > >>>>>>
> > >>>>>> On 2020-08-20 17:19, Krishna Chaitanya wrote:
> > >>>> ...
> > >>>>>>>> I'm not really convinced that this is the right fix, but I'm no NAPI
> > >>>>>>>> expert. Can anyone else help?
> > >>>>>>> Calling napi_disable() twice can lead to hangs, but moving NAPI from
> > >>>>>>> start/stop to
> > >>>>>>> the probe isn't the right approach as the datapath is tied to
> > >>>>>>> start/stop.
> > >>>>>>>
> > >>>>>>> Maybe check the state of NAPI before disable?
> > >>>>>>>
> > >>>>>>>    if (test_bit(NAPI_STATE_SCHED, &ar->napi.napi.state))
> > >>>>>>>     napi_disable(&ar->napi)
> > >>>>>>> or maintain napi_state like this
> > >>>>>>> https://patchwork.kernel.org/patch/10249365/
> > >>>>>> it is better to use above link's patch.
> > >>>>>> napi.state is controlled by napi API, it is better ath10k not know it.
> > >>>>> Sure, but IMHO just canceling the async rx work should solve the issue.
> > >>>> Oh no, canceling the async rx work will not solve this issue, rx worker
> > >>>> ath10k_rx_indication_async_work call napi_schedule, after napi_complete,
> > >>>> the NAPI_STATE_SCHED will clear.
> > >>>> The issue of this patch is because 2 thread called to hif_stop and
> > >>>> NAPI_STATE_SCHED not clear.
> > >>> That fix is still valid and good to have.
> > >>>
> > >>> ndev_stop being called twice is typical scenarios (stop vs rmmod), so
> > >>>    just checking the netdev_flags for IFF_UP and returning from hif_Stop
> > >>> should suffice, no?
> > >>
> > >> My approach to fix this problem was to add a boolean in ath10k as to whether
> > >> it had napi enabled or not, and then check that before trying to enable/disable
> > >> it again.  Seems to work fine, and cleaner in my mind than checking internal
> > >> napi flags.
> > > A much simpler approach is just to check for IFF_UP and skip NAPI (and others)
> > > in the hif_stop no? (provided proper RTNL locking is done if hif_stop
> > > is being called
> > > internally as well).
> > >
> >
> > I'm not sure, but I think the driver should be internally consistent and not
> > spend a lot of time trying to guess about interactions with objects higher
> > in the stack.
> Fair enough, the network interface state is a basic thing controlled
> by the driver,
> so, should be okay to use. Anyways, the in-driver approach has more control.
> >
> > Here is my original patch to fix this, it is not complex.
> >
> > https://patchwork.kernel.org/patch/10249363/
> Sure, I have shared your patch above :).
Sent a bit early, any idea why this wasn't upstreamed earlier?
Ben Greear Aug. 20, 2020, 5:53 p.m. UTC | #14
On 8/20/20 10:42 AM, Krishna Chaitanya wrote:
> On Thu, Aug 20, 2020 at 11:11 PM Krishna Chaitanya
> <chaitanya.mgit@gmail.com> wrote:
>>
>> On Thu, Aug 20, 2020 at 10:38 PM Ben Greear <greearb@candelatech.com> wrote:
>>>
>>> On 8/20/20 10:00 AM, Krishna Chaitanya wrote:
>>>> On Thu, Aug 20, 2020 at 10:02 PM Ben Greear <greearb@candelatech.com> wrote:
>>>>>
>>>>> On 8/20/20 9:08 AM, Krishna Chaitanya wrote:
>>>>>> On Thu, Aug 20, 2020 at 8:07 PM Wen Gong <wgong@codeaurora.org> wrote:
>>>>>>>
>>>>>>> On 2020-08-20 18:52, Krishna Chaitanya wrote:
>>>>>>>> On Thu, Aug 20, 2020 at 3:45 PM Wen Gong <wgong@codeaurora.org> wrote:
>>>>>>>>>
>>>>>>>>> On 2020-08-20 17:19, Krishna Chaitanya wrote:
>>>>>>> ...
>>>>>>>>>>> I'm not really convinced that this is the right fix, but I'm no NAPI
>>>>>>>>>>> expert. Can anyone else help?
>>>>>>>>>> Calling napi_disable() twice can lead to hangs, but moving NAPI from
>>>>>>>>>> start/stop to
>>>>>>>>>> the probe isn't the right approach as the datapath is tied to
>>>>>>>>>> start/stop.
>>>>>>>>>>
>>>>>>>>>> Maybe check the state of NAPI before disable?
>>>>>>>>>>
>>>>>>>>>>     if (test_bit(NAPI_STATE_SCHED, &ar->napi.napi.state))
>>>>>>>>>>      napi_disable(&ar->napi)
>>>>>>>>>> or maintain napi_state like this
>>>>>>>>>> https://patchwork.kernel.org/patch/10249365/
>>>>>>>>> it is better to use above link's patch.
>>>>>>>>> napi.state is controlled by napi API, it is better ath10k not know it.
>>>>>>>> Sure, but IMHO just canceling the async rx work should solve the issue.
>>>>>>> Oh no, canceling the async rx work will not solve this issue, rx worker
>>>>>>> ath10k_rx_indication_async_work call napi_schedule, after napi_complete,
>>>>>>> the NAPI_STATE_SCHED will clear.
>>>>>>> The issue of this patch is because 2 thread called to hif_stop and
>>>>>>> NAPI_STATE_SCHED not clear.
>>>>>> That fix is still valid and good to have.
>>>>>>
>>>>>> ndev_stop being called twice is typical scenarios (stop vs rmmod), so
>>>>>>     just checking the netdev_flags for IFF_UP and returning from hif_Stop
>>>>>> should suffice, no?
>>>>>
>>>>> My approach to fix this problem was to add a boolean in ath10k as to whether
>>>>> it had napi enabled or not, and then check that before trying to enable/disable
>>>>> it again.  Seems to work fine, and cleaner in my mind than checking internal
>>>>> napi flags.
>>>> A much simpler approach is just to check for IFF_UP and skip NAPI (and others)
>>>> in the hif_stop no? (provided proper RTNL locking is done if hif_stop
>>>> is being called
>>>> internally as well).
>>>>
>>>
>>> I'm not sure, but I think the driver should be internally consistent and not
>>> spend a lot of time trying to guess about interactions with objects higher
>>> in the stack.
>> Fair enough, the network interface state is a basic thing controlled
>> by the driver,
>> so, should be okay to use. Anyways, the in-driver approach has more control.
>>>
>>> Here is my original patch to fix this, it is not complex.
>>>
>>> https://patchwork.kernel.org/patch/10249363/
>> Sure, I have shared your patch above :).
> Sent a bit early, any idea why this wasn't upstreamed earlier?

No, one comment from Michal indicated maybe there were more problems lurking
in this area, but he seemed to be OK with the patch over all.  After that,
it was just ignored.

Thanks,
Ben
Krishna Chaitanya Aug. 20, 2020, 8:15 p.m. UTC | #15
On Thu, Aug 20, 2020 at 11:23 PM Ben Greear <greearb@candelatech.com> wrote:
>
> On 8/20/20 10:42 AM, Krishna Chaitanya wrote:
> > On Thu, Aug 20, 2020 at 11:11 PM Krishna Chaitanya
> > <chaitanya.mgit@gmail.com> wrote:
> >>
> >> On Thu, Aug 20, 2020 at 10:38 PM Ben Greear <greearb@candelatech.com> wrote:
> >>>
> >>> On 8/20/20 10:00 AM, Krishna Chaitanya wrote:
> >>>> On Thu, Aug 20, 2020 at 10:02 PM Ben Greear <greearb@candelatech.com> wrote:
> >>>>>
> >>>>> On 8/20/20 9:08 AM, Krishna Chaitanya wrote:
> >>>>>> On Thu, Aug 20, 2020 at 8:07 PM Wen Gong <wgong@codeaurora.org> wrote:
> >>>>>>>
> >>>>>>> On 2020-08-20 18:52, Krishna Chaitanya wrote:
> >>>>>>>> On Thu, Aug 20, 2020 at 3:45 PM Wen Gong <wgong@codeaurora.org> wrote:
> >>>>>>>>>
> >>>>>>>>> On 2020-08-20 17:19, Krishna Chaitanya wrote:
> >>>>>>> ...
> >>>>>>>>>>> I'm not really convinced that this is the right fix, but I'm no NAPI
> >>>>>>>>>>> expert. Can anyone else help?
> >>>>>>>>>> Calling napi_disable() twice can lead to hangs, but moving NAPI from
> >>>>>>>>>> start/stop to
> >>>>>>>>>> the probe isn't the right approach as the datapath is tied to
> >>>>>>>>>> start/stop.
> >>>>>>>>>>
> >>>>>>>>>> Maybe check the state of NAPI before disable?
> >>>>>>>>>>
> >>>>>>>>>>     if (test_bit(NAPI_STATE_SCHED, &ar->napi.napi.state))
> >>>>>>>>>>      napi_disable(&ar->napi)
> >>>>>>>>>> or maintain napi_state like this
> >>>>>>>>>> https://patchwork.kernel.org/patch/10249365/
> >>>>>>>>> it is better to use above link's patch.
> >>>>>>>>> napi.state is controlled by napi API, it is better ath10k not know it.
> >>>>>>>> Sure, but IMHO just canceling the async rx work should solve the issue.
> >>>>>>> Oh no, canceling the async rx work will not solve this issue, rx worker
> >>>>>>> ath10k_rx_indication_async_work call napi_schedule, after napi_complete,
> >>>>>>> the NAPI_STATE_SCHED will clear.
> >>>>>>> The issue of this patch is because 2 thread called to hif_stop and
> >>>>>>> NAPI_STATE_SCHED not clear.
> >>>>>> That fix is still valid and good to have.
> >>>>>>
> >>>>>> ndev_stop being called twice is typical scenarios (stop vs rmmod), so
> >>>>>>     just checking the netdev_flags for IFF_UP and returning from hif_Stop
> >>>>>> should suffice, no?
> >>>>>
> >>>>> My approach to fix this problem was to add a boolean in ath10k as to whether
> >>>>> it had napi enabled or not, and then check that before trying to enable/disable
> >>>>> it again.  Seems to work fine, and cleaner in my mind than checking internal
> >>>>> napi flags.
> >>>> A much simpler approach is just to check for IFF_UP and skip NAPI (and others)
> >>>> in the hif_stop no? (provided proper RTNL locking is done if hif_stop
> >>>> is being called
> >>>> internally as well).
> >>>>
> >>>
> >>> I'm not sure, but I think the driver should be internally consistent and not
> >>> spend a lot of time trying to guess about interactions with objects higher
> >>> in the stack.
> >> Fair enough, the network interface state is a basic thing controlled
> >> by the driver,
> >> so, should be okay to use. Anyways, the in-driver approach has more control.
> >>>
> >>> Here is my original patch to fix this, it is not complex.
> >>>
> >>> https://patchwork.kernel.org/patch/10249363/
> >> Sure, I have shared your patch above :).
> > Sent a bit early, any idea why this wasn't upstreamed earlier?
>
> No, one comment from Michal indicated maybe there were more problems lurking
> in this area, but he seemed to be OK with the patch over all.  After that,
> it was just ignored.
>
Now might be a good time to push for it :)
Ben Greear Aug. 20, 2020, 8:59 p.m. UTC | #16
On 8/20/20 1:15 PM, Krishna Chaitanya wrote:
> On Thu, Aug 20, 2020 at 11:23 PM Ben Greear <greearb@candelatech.com> wrote:
>>
>> On 8/20/20 10:42 AM, Krishna Chaitanya wrote:
>>> On Thu, Aug 20, 2020 at 11:11 PM Krishna Chaitanya
>>> <chaitanya.mgit@gmail.com> wrote:
>>>>
>>>> On Thu, Aug 20, 2020 at 10:38 PM Ben Greear <greearb@candelatech.com> wrote:
>>>>>
>>>>> On 8/20/20 10:00 AM, Krishna Chaitanya wrote:
>>>>>> On Thu, Aug 20, 2020 at 10:02 PM Ben Greear <greearb@candelatech.com> wrote:
>>>>>>>
>>>>>>> On 8/20/20 9:08 AM, Krishna Chaitanya wrote:
>>>>>>>> On Thu, Aug 20, 2020 at 8:07 PM Wen Gong <wgong@codeaurora.org> wrote:
>>>>>>>>>
>>>>>>>>> On 2020-08-20 18:52, Krishna Chaitanya wrote:
>>>>>>>>>> On Thu, Aug 20, 2020 at 3:45 PM Wen Gong <wgong@codeaurora.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 2020-08-20 17:19, Krishna Chaitanya wrote:
>>>>>>>>> ...
>>>>>>>>>>>>> I'm not really convinced that this is the right fix, but I'm no NAPI
>>>>>>>>>>>>> expert. Can anyone else help?
>>>>>>>>>>>> Calling napi_disable() twice can lead to hangs, but moving NAPI from
>>>>>>>>>>>> start/stop to
>>>>>>>>>>>> the probe isn't the right approach as the datapath is tied to
>>>>>>>>>>>> start/stop.
>>>>>>>>>>>>
>>>>>>>>>>>> Maybe check the state of NAPI before disable?
>>>>>>>>>>>>
>>>>>>>>>>>>      if (test_bit(NAPI_STATE_SCHED, &ar->napi.napi.state))
>>>>>>>>>>>>       napi_disable(&ar->napi)
>>>>>>>>>>>> or maintain napi_state like this
>>>>>>>>>>>> https://patchwork.kernel.org/patch/10249365/
>>>>>>>>>>> it is better to use above link's patch.
>>>>>>>>>>> napi.state is controlled by napi API, it is better ath10k not know it.
>>>>>>>>>> Sure, but IMHO just canceling the async rx work should solve the issue.
>>>>>>>>> Oh no, canceling the async rx work will not solve this issue, rx worker
>>>>>>>>> ath10k_rx_indication_async_work call napi_schedule, after napi_complete,
>>>>>>>>> the NAPI_STATE_SCHED will clear.
>>>>>>>>> The issue of this patch is because 2 thread called to hif_stop and
>>>>>>>>> NAPI_STATE_SCHED not clear.
>>>>>>>> That fix is still valid and good to have.
>>>>>>>>
>>>>>>>> ndev_stop being called twice is typical scenarios (stop vs rmmod), so
>>>>>>>>      just checking the netdev_flags for IFF_UP and returning from hif_Stop
>>>>>>>> should suffice, no?
>>>>>>>
>>>>>>> My approach to fix this problem was to add a boolean in ath10k as to whether
>>>>>>> it had napi enabled or not, and then check that before trying to enable/disable
>>>>>>> it again.  Seems to work fine, and cleaner in my mind than checking internal
>>>>>>> napi flags.
>>>>>> A much simpler approach is just to check for IFF_UP and skip NAPI (and others)
>>>>>> in the hif_stop no? (provided proper RTNL locking is done if hif_stop
>>>>>> is being called
>>>>>> internally as well).
>>>>>>
>>>>>
>>>>> I'm not sure, but I think the driver should be internally consistent and not
>>>>> spend a lot of time trying to guess about interactions with objects higher
>>>>> in the stack.
>>>> Fair enough, the network interface state is a basic thing controlled
>>>> by the driver,
>>>> so, should be okay to use. Anyways, the in-driver approach has more control.
>>>>>
>>>>> Here is my original patch to fix this, it is not complex.
>>>>>
>>>>> https://patchwork.kernel.org/patch/10249363/
>>>> Sure, I have shared your patch above :).
>>> Sent a bit early, any idea why this wasn't upstreamed earlier?
>>
>> No, one comment from Michal indicated maybe there were more problems lurking
>> in this area, but he seemed to be OK with the patch over all.  After that,
>> it was just ignored.
>>
> Now might be a good time to push for it :)
> 

It is generally a waste of time in my experience.  Kalle is the maintainer and should
be seeing any of this he cares to see.  If he likes the patch, he can apply it or
something similar.  If you have a reproducible test case, see if the patch fixes
things, that might help it be accepted.

Thanks,
Ben
Wen Gong Aug. 21, 2020, 2:45 a.m. UTC | #17
On 2020-08-21 04:59, Ben Greear wrote:
> On 8/20/20 1:15 PM, Krishna Chaitanya wrote:
>> On Thu, Aug 20, 2020 at 11:23 PM Ben Greear <greearb@candelatech.com> 
>> wrote:
>>> 
>>> On 8/20/20 10:42 AM, Krishna Chaitanya wrote:
>>>> On Thu, Aug 20, 2020 at 11:11 PM Krishna Chaitanya
>>>> <chaitanya.mgit@gmail.com> wrote:
>>>>> 
>>>>> On Thu, Aug 20, 2020 at 10:38 PM Ben Greear 
>>>>> <greearb@candelatech.com> wrote:
>>>>>> 
>>>>>> On 8/20/20 10:00 AM, Krishna Chaitanya wrote:
>>>>>>> On Thu, Aug 20, 2020 at 10:02 PM Ben Greear 
>>>>>>> <greearb@candelatech.com> wrote:
>>>>>>>> 
>>>>>>>> On 8/20/20 9:08 AM, Krishna Chaitanya wrote:
>>>>>>>>> On Thu, Aug 20, 2020 at 8:07 PM Wen Gong <wgong@codeaurora.org> 
>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> On 2020-08-20 18:52, Krishna Chaitanya wrote:
>>>>>>>>>>> On Thu, Aug 20, 2020 at 3:45 PM Wen Gong 
>>>>>>>>>>> <wgong@codeaurora.org> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> On 2020-08-20 17:19, Krishna Chaitanya wrote:
>>>>>>>>>> ...
>>>>>>>>>>>>>> I'm not really convinced that this is the right fix, but 
>>>>>>>>>>>>>> I'm no NAPI
>>>>>>>>>>>>>> expert. Can anyone else help?
>>>>>>>>>>>>> Calling napi_disable() twice can lead to hangs, but moving 
>>>>>>>>>>>>> NAPI from
>>>>>>>>>>>>> start/stop to
>>>>>>>>>>>>> the probe isn't the right approach as the datapath is tied 
>>>>>>>>>>>>> to
>>>>>>>>>>>>> start/stop.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Maybe check the state of NAPI before disable?
>>>>>>>>>>>>> 
>>>>>>>>>>>>>      if (test_bit(NAPI_STATE_SCHED, &ar->napi.napi.state))
>>>>>>>>>>>>>       napi_disable(&ar->napi)
>>>>>>>>>>>>> or maintain napi_state like this
>>>>>>>>>>>>> https://patchwork.kernel.org/patch/10249365/
>>>>>>>>>>>> it is better to use above link's patch.
>>>>>>>>>>>> napi.state is controlled by napi API, it is better ath10k 
>>>>>>>>>>>> not know it.
>>>>>>>>>>> Sure, but IMHO just canceling the async rx work should solve 
>>>>>>>>>>> the issue.
>>>>>>>>>> Oh no, canceling the async rx work will not solve this issue, 
>>>>>>>>>> rx worker
>>>>>>>>>> ath10k_rx_indication_async_work call napi_schedule, after 
>>>>>>>>>> napi_complete,
>>>>>>>>>> the NAPI_STATE_SCHED will clear.
>>>>>>>>>> The issue of this patch is because 2 thread called to hif_stop 
>>>>>>>>>> and
>>>>>>>>>> NAPI_STATE_SCHED not clear.
>>>>>>>>> That fix is still valid and good to have.
>>>>>>>>> 
>>>>>>>>> ndev_stop being called twice is typical scenarios (stop vs 
>>>>>>>>> rmmod), so
>>>>>>>>>      just checking the netdev_flags for IFF_UP and returning 
>>>>>>>>> from hif_Stop
>>>>>>>>> should suffice, no?
>>>>>>>> 
>>>>>>>> My approach to fix this problem was to add a boolean in ath10k 
>>>>>>>> as to whether
>>>>>>>> it had napi enabled or not, and then check that before trying to 
>>>>>>>> enable/disable
>>>>>>>> it again.  Seems to work fine, and cleaner in my mind than 
>>>>>>>> checking internal
>>>>>>>> napi flags.
>>>>>>> A much simpler approach is just to check for IFF_UP and skip NAPI 
>>>>>>> (and others)
>>>>>>> in the hif_stop no? (provided proper RTNL locking is done if 
>>>>>>> hif_stop
>>>>>>> is being called
>>>>>>> internally as well).
>>>>>>> 
>>>>>> 
>>>>>> I'm not sure, but I think the driver should be internally 
>>>>>> consistent and not
>>>>>> spend a lot of time trying to guess about interactions with 
>>>>>> objects higher
>>>>>> in the stack.
>>>>> Fair enough, the network interface state is a basic thing 
>>>>> controlled
>>>>> by the driver,
>>>>> so, should be okay to use. Anyways, the in-driver approach has more 
>>>>> control.
>>>>>> 
>>>>>> Here is my original patch to fix this, it is not complex.
>>>>>> 
>>>>>> https://patchwork.kernel.org/patch/10249363/
>>>>> Sure, I have shared your patch above :).
>>>> Sent a bit early, any idea why this wasn't upstreamed earlier?
>>> 
>>> No, one comment from Michal indicated maybe there were more problems 
>>> lurking
>>> in this area, but he seemed to be OK with the patch over all.  After 
>>> that,
>>> it was just ignored.
>>> 
>> Now might be a good time to push for it :)
>> 
> 
> It is generally a waste of time in my experience.  Kalle is the
> maintainer and should
> be seeing any of this he cares to see.  If he likes the patch, he can
> apply it or
> something similar.  If you have a reproducible test case, see if the 
> patch fixes
> things, that might help it be accepted.
I have 2 cmd, each one can reproduce the hang.
echo soft > 
/sys/kernel/debug/ieee80211/phy0/ath10k/simulate_fw_crash;sleep 
0.05;ifconfig wlan0 down
echo soft > 
/sys/kernel/debug/ieee80211/phy0/ath10k/simulate_fw_crash;rmmod 
ath10k_sdio
and with the my patch, it fix the hang. Change of my patch is similar 
with your
patch(https://patchwork.kernel.org/patch/10249365/), so it should also 
fix the hang with your patch.
> 
> Thanks,
> Ben
Wen Gong Aug. 24, 2020, 4:35 a.m. UTC | #18
On 2020-08-21 10:45, Wen Gong wrote:
...
> I have 2 cmd, each one can reproduce the hang.
> echo soft >
> /sys/kernel/debug/ieee80211/phy0/ath10k/simulate_fw_crash;sleep
> 0.05;ifconfig wlan0 down
> echo soft >
> /sys/kernel/debug/ieee80211/phy0/ath10k/simulate_fw_crash;rmmod
> ath10k_sdio
> and with the my patch, it fix the hang. Change of my patch is similar 
> with your
> patch(https://patchwork.kernel.org/patch/10249365/), so it should also
> fix the hang with your patch.
I have sent v2: https://patchwork.kernel.org/patch/11732165/
Kalle Valo Sept. 7, 2020, 4:07 p.m. UTC | #19
Ben Greear <greearb@candelatech.com> writes:

>>>>>> Here is my original patch to fix this, it is not complex.
>>>>>>
>>>>>> https://patchwork.kernel.org/patch/10249363/
>>>>> Sure, I have shared your patch above :).
>>>> Sent a bit early, any idea why this wasn't upstreamed earlier?
>>>
>>> No, one comment from Michal indicated maybe there were more problems lurking
>>> in this area, but he seemed to be OK with the patch over all.  After that,
>>> it was just ignored.
>>>
>> Now might be a good time to push for it :)
>>
>
> It is generally a waste of time in my experience.  Kalle is the maintainer and should
> be seeing any of this he cares to see.  If he likes the patch, he can apply it or
> something similar.  If you have a reproducible test case, see if the patch fixes
> things, that might help it be accepted.

The problem with yours (Ben's) patches is that you have your own set of
patches for ath10k and your own firmware. So I cannot know at all if
your patches work with upstream ath10k and upstream firmware, and would
need to test the patches myself. But nowadays I just can't find the time
for testing. So if someone else can do the testing and provide a
Tested-on tag it would it increase my confidence level for the patches.
Ben Greear Sept. 7, 2020, 5:18 p.m. UTC | #20
On 9/7/20 9:07 AM, Kalle Valo wrote:
> Ben Greear <greearb@candelatech.com> writes:
> 
>>>>>>> Here is my original patch to fix this, it is not complex.
>>>>>>>
>>>>>>> https://patchwork.kernel.org/patch/10249363/
>>>>>> Sure, I have shared your patch above :).
>>>>> Sent a bit early, any idea why this wasn't upstreamed earlier?
>>>>
>>>> No, one comment from Michal indicated maybe there were more problems lurking
>>>> in this area, but he seemed to be OK with the patch over all.  After that,
>>>> it was just ignored.
>>>>
>>> Now might be a good time to push for it :)
>>>
>>
>> It is generally a waste of time in my experience.  Kalle is the maintainer and should
>> be seeing any of this he cares to see.  If he likes the patch, he can apply it or
>> something similar.  If you have a reproducible test case, see if the patch fixes
>> things, that might help it be accepted.
> 
> The problem with yours (Ben's) patches is that you have your own set of
> patches for ath10k and your own firmware. So I cannot know at all if
> your patches work with upstream ath10k and upstream firmware, and would
> need to test the patches myself. But nowadays I just can't find the time
> for testing. So if someone else can do the testing and provide a
> Tested-on tag it would it increase my confidence level for the patches.

Surely codeaura could get a few entry level engineers to run basic testing against
your target platforms on a regular basis?  The several years of time this bug was
known (to me at least, and to whoever saw my original patch) and the time wasted
by codeaura to rediscover and re-fix the bug would have much better been spent just
testing and review my patch to begin with.  And not just my patches either, this
pattern is far and wide in ath10k.

Also, my driver is often tested against various upstream QCA firmware and chipsets in openwrt,
so while bugs are always possible, there is some test coverage.

Thanks,
Ben
diff mbox series

Patch

diff --git a/drivers/net/wireless/ath/ath10k/sdio.c b/drivers/net/wireless/ath/ath10k/sdio.c
index 7b894dcaad2e..b71499b171c6 100644
--- a/drivers/net/wireless/ath/ath10k/sdio.c
+++ b/drivers/net/wireless/ath/ath10k/sdio.c
@@ -1756,8 +1756,6 @@  static int ath10k_sdio_hif_start(struct ath10k *ar)
 	struct ath10k_sdio *ar_sdio = ath10k_sdio_priv(ar);
 	int ret;
 
-	napi_enable(&ar->napi);
-
 	/* Sleep 20 ms before HIF interrupts are disabled.
 	 * This will give target plenty of time to process the BMI done
 	 * request before interrupts are disabled.
@@ -1884,7 +1882,6 @@  static void ath10k_sdio_hif_stop(struct ath10k *ar)
 	spin_unlock_bh(&ar_sdio->wr_async_lock);
 
 	napi_synchronize(&ar->napi);
-	napi_disable(&ar->napi);
 }
 
 #ifdef CONFIG_PM
@@ -2121,6 +2118,7 @@  static int ath10k_sdio_probe(struct sdio_func *func,
 
 	netif_napi_add(&ar->napi_dev, &ar->napi, ath10k_sdio_napi_poll,
 		       ATH10K_NAPI_BUDGET);
+	napi_enable(&ar->napi);
 
 	ath10k_dbg(ar, ATH10K_DBG_BOOT,
 		   "sdio new func %d vendor 0x%x device 0x%x block 0x%x/0x%x\n",
@@ -2235,6 +2233,7 @@  static void ath10k_sdio_remove(struct sdio_func *func)
 
 	ath10k_core_unregister(ar);
 
+	napi_disable(&ar->napi);
 	netif_napi_del(&ar->napi);
 
 	ath10k_core_destroy(ar);