diff mbox series

[v2] wifi: mac80211: Fix performance issue with mutex_lock

Message ID 20220915043527.737133-1-venkatch@gmail.com (mailing list archive)
State Rejected
Delegated to: Johannes Berg
Headers show
Series [v2] wifi: mac80211: Fix performance issue with mutex_lock | expand

Commit Message

Venkat Ch Sept. 15, 2022, 4:35 a.m. UTC
From: Venkat Chimata <venkata@shasta.cloud>

OpenWiFi's ucentral agent periodically (typically 120 seconds)
issues nl80211 call to get associated client list from the
WLAN driver. Somehow this operation was causing tx/rx delays
sometimes and the video calls on connected clients are experiencing
jitter. The associated client list was protected by
a mutex lock. I saw that ieee80211_check_fast_xmit_all uses
rcu_read_lock and rcu_read_unlock to iterate through sta_list.
I took it as a refernce and changed the lock to rcu_read lock
from mutex.
Also saw this this comment just above sta_mutex declaration.
        /* Station data */
        /*
         * The mutex only protects the list, hash table and
         * counter, reads are done with RCU.
         */
        struct mutex sta_mtx;

Hence tried changing mutex_lock(/unlock) in ieee80211_dump_station
to rcu_read_lock(/unlock) and it resolved the jitter issue in the
video calls.

Tests:
We had this issue show up consistently and the patch fixed the issue.
We spent a good part of 2 weeks following up on this and with this
fix, the video calls are smooth.

Also tested if this could cause any crashes with below mentioned
process.

1. Connect 3 clients
2. A script running dumping clients in an infinite loop
3. Continuously disconnect / connect one client to see if
   there is any crash. No crash was observed.

---
v2:
* Resolve warning
 net/mac80211/cfg.c:858:33: warning:
 unused variable 'local' [-Wunused-variable]
 ---

Signed-off-by: Venkat Chimata <venkata@shasta.cloud>
---
 net/mac80211/cfg.c      | 7 ++-----
 net/mac80211/sta_info.c | 3 +--
 2 files changed, 3 insertions(+), 7 deletions(-)

Comments

kernel test robot Sept. 20, 2022, 7:24 a.m. UTC | #1
Greeting,

FYI, we noticed the following commit (built with gcc-11):

commit: 662cceb50ea735492ff67b5ecdbc8d9de15212bd ("[PATCH v2] wifi: mac80211: Fix performance issue with mutex_lock")
url: https://github.com/intel-lab-lkp/linux/commits/venkatch-gmail-com/wifi-mac80211-Fix-performance-issue-with-mutex_lock/20220915-123721
base: https://git.kernel.org/cgit/linux/kernel/git/wireless/wireless-next.git main
patch link: https://lore.kernel.org/linux-wireless/20220915043527.737133-1-venkatch@gmail.com

in testcase: hwsim
version: hwsim-x86_64-717e5d7-1_20220525
with following parameters:

	test: group-03



on test machine: 8 threads 1 sockets Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz (Ivy Bridge) with 16G memory

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):


If you fix the issue, kindly add following tag
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Link: https://lore.kernel.org/r/202209201552.e4f42742-oliver.sang@intel.com


[  103.595219][ T5052] BUG: sleeping function called from invalid context at include/linux/sched/mm.h:274
[  103.595404][ T5052] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 5052, name: iw
[  103.595601][ T5052] preempt_count: 1, expected: 0
[  103.595740][ T5052] CPU: 3 PID: 5052 Comm: iw Not tainted 6.0.0-rc3-00691-g662cceb50ea7 #1
[  103.595921][ T5052] Hardware name:  /DZ77BH-55K, BIOS BHZ7710H.86A.0097.2012.1228.1346 12/28/2012
[  103.596088][ T5052] Call Trace:
[  103.596151][ T5052]  <TASK>
[ 103.596209][ T5052] dump_stack_lvl (lib/dump_stack.c:107 (discriminator 1)) 
[ 103.596297][ T5052] __might_resched.cold (kernel/sched/core.c:9893) 
[ 103.596391][ T5052] ? cfg80211_sinfo_alloc_tid_stats (net/wireless/util.c:2255) cfg80211
[ 103.596557][ T5052] kmem_cache_alloc_trace (include/linux/sched/mm.h:274 mm/slab.h:700 mm/slub.c:3157 mm/slub.c:3251 mm/slub.c:3282) 
[ 103.596655][ T5052] cfg80211_sinfo_alloc_tid_stats (net/wireless/util.c:2255) cfg80211
[ 103.596807][ T5052] sta_set_sinfo (net/mac80211/sta_info.c:2651 (discriminator 1)) mac80211
[ 103.596956][ T5052] ? kfree (mm/slub.c:1780 mm/slub.c:3534 mm/slub.c:4562) 
[ 103.597034][ T5052] ieee80211_dump_station (net/mac80211/cfg.c:864) mac80211
[ 103.597183][ T5052] nl80211_dump_station (net/wireless/rdev-ops.h:245 net/wireless/nl80211.c:6565) cfg80211
[ 103.597330][ T5052] ? nl80211_send_station (net/wireless/nl80211.c:6539) cfg80211
[ 103.597482][ T5052] ? __alloc_skb (net/core/skbuff.c:448) 
[ 103.597568][ T5052] ? kmalloc_reserve (net/core/skbuff.c:400) 
[ 103.597656][ T5052] ? __mutex_lock_slowpath (kernel/locking/mutex.c:282) 
[ 103.597750][ T5052] netlink_dump (net/netlink/af_netlink.c:2275) 
[ 103.597832][ T5052] ? netlink_getsockopt (net/netlink/af_netlink.c:2213) 
[ 103.597925][ T5052] ? mutex_unlock (arch/x86/include/asm/atomic64_64.h:190 include/linux/atomic/atomic-long.h:449 include/linux/atomic/atomic-instrumented.h:1790 kernel/locking/mutex.c:181 kernel/locking/mutex.c:540) 
[ 103.598009][ T5052] ? __mutex_unlock_slowpath+0x2c0/0x2c0 
[ 103.598123][ T5052] __netlink_dump_start (include/linux/instrumented.h:101 include/linux/atomic/atomic-instrumented.h:176 include/linux/refcount.h:272 include/linux/refcount.h:315 include/linux/refcount.h:333 include/net/sock.h:1975 net/netlink/af_netlink.c:2382) 
[ 103.598217][ T5052] genl_family_rcv_msg_dumpit (net/netlink/genetlink.c:689) 
[ 103.598318][ T5052] ? genl_rcv (net/netlink/genetlink.c:655) 
[ 103.598395][ T5052] ? __stack_depot_save (lib/stackdepot.c:424) 
[ 103.598490][ T5052] ? genl_family_rcv_msg_doit (net/netlink/genetlink.c:563) 
[ 103.598590][ T5052] ? nl80211_send_station (net/wireless/nl80211.c:6539) cfg80211
[ 103.598739][ T5052] ? genl_family_rcv_msg_dumpit (net/netlink/genetlink.c:637) 
[ 103.598842][ T5052] ? genl_op_from_small (net/netlink/genetlink.c:136) 
[ 103.598933][ T5052] genl_rcv_msg (net/netlink/genetlink.c:775 net/netlink/genetlink.c:795) 
[ 103.599016][ T5052] ? genl_get_cmd (net/netlink/genetlink.c:784) 
[ 103.599100][ T5052] ? nl80211_dump_station (net/wireless/nl80211.c:6592) cfg80211
[ 103.599245][ T5052] ? nl80211_send_station (net/wireless/nl80211.c:6539) cfg80211
[ 103.599394][ T5052] ? kasan_save_stack (mm/kasan/common.c:40) 
[ 103.599511][ T5052] ? kasan_save_stack (mm/kasan/common.c:39) 
[ 103.599613][ T5052] ? __kasan_slab_alloc (mm/kasan/common.c:45 mm/kasan/common.c:437 mm/kasan/common.c:470) 
[ 103.599704][ T5052] ? kmem_cache_alloc_lru (mm/slab.h:727 mm/slub.c:3243 mm/slub.c:3251 mm/slub.c:3258 mm/slub.c:3275) 
[ 103.599798][ T5052] netlink_rcv_skb (net/netlink/af_netlink.c:2538) 
[ 103.599882][ T5052] ? genl_get_cmd (net/netlink/genetlink.c:784) 
[ 103.599967][ T5052] ? netlink_ack (net/netlink/af_netlink.c:2515) 
[ 103.600049][ T5052] ? netlink_recvmsg (net/netlink/af_netlink.c:514) 
[ 103.600136][ T5052] ? _copy_from_iter (lib/iov_iter.c:628 (discriminator 11)) 
[ 103.600226][ T5052] genl_rcv (net/netlink/genetlink.c:807) 
[ 103.600299][ T5052] netlink_unicast (net/netlink/af_netlink.c:1320 net/netlink/af_netlink.c:1345) 
[ 103.600386][ T5052] ? netlink_attachskb (net/netlink/af_netlink.c:1330) 
[ 103.600488][ T5052] ? check_heap_object (arch/x86/include/asm/bitops.h:207 arch/x86/include/asm/bitops.h:239 include/asm-generic/bitops/instrumented-non-atomic.h:142 include/linux/page-flags.h:487 mm/usercopy.c:193) 
[ 103.600588][ T5052] netlink_sendmsg (net/netlink/af_netlink.c:1921) 
[ 103.600675][ T5052] ? netlink_unicast (net/netlink/af_netlink.c:1841) 
[ 103.600764][ T5052] ? __import_iovec (lib/iov_iter.c:1771) 
[ 103.600849][ T5052] ? netlink_unicast (net/netlink/af_netlink.c:1841) 
[ 103.600936][ T5052] sock_sendmsg (net/socket.c:714 net/socket.c:734) 
[ 103.601016][ T5052] ____sys_sendmsg (net/socket.c:2482) 
[ 103.601101][ T5052] ? kernel_sendmsg (net/socket.c:2429) 
[ 103.601184][ T5052] ? __copy_msghdr (net/socket.c:2409) 
[ 103.601269][ T5052] ? __kasan_record_aux_stack (mm/kasan/generic.c:348) 
[ 103.601369][ T5052] ___sys_sendmsg (net/socket.c:2538) 
[ 103.601461][ T5052] ? __ia32_sys_recvmmsg (net/socket.c:2525) 
[ 103.601566][ T5052] ? rcu_gp_kthread (kernel/rcu/tree_nocb.h:371) 
[ 103.601653][ T5052] ? fsnotify_grab_connector (fs/notify/mark.c:601) 
[ 103.601751][ T5052] ? filemap_map_pmd (mm/filemap.c:3324) 
[ 103.601839][ T5052] ? do_read_fault (mm/memory.c:4487 mm/memory.c:4513) 
[ 103.601925][ T5052] ? do_fault (mm/memory.c:4647) 
[ 103.602003][ T5052] ? __handle_mm_fault (mm/memory.c:5053) 
[ 103.602093][ T5052] ? rcu_gp_kthread (kernel/rcu/tree_nocb.h:371) 
[ 103.602180][ T5052] ? copy_page_range (mm/memory.c:4963) 
[ 103.602267][ T5052] ? __fget_light (arch/x86/include/asm/atomic.h:29 include/linux/atomic/atomic-instrumented.h:28 fs/file.c:1005) 
[ 103.602352][ T5052] __sys_sendmsg (include/linux/file.h:31 net/socket.c:2567) 
[ 103.602435][ T5052] ? __sys_sendmsg_sock (net/socket.c:2553) 
[ 103.602545][ T5052] ? do_user_addr_fault (arch/x86/mm/fault.c:1426) 
[ 103.602636][ T5052] ? exit_to_user_mode_loop (include/linux/sched.h:2305 include/linux/resume_user_mode.h:61 kernel/entry/common.c:169) 
[ 103.602731][ T5052] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) 
[ 103.602812][ T5052] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120) 
[  103.602915][ T5052] RIP: 0033:0x7fcd499bf2c3
[ 103.602996][ T5052] Code: 64 89 02 48 c7 c0 ff ff ff ff eb b7 66 2e 0f 1f 84 00 00 00 00 00 90 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 55 c3 0f 1f 40 00 48 83 ec 28 89 54 24 1c 48
All code
========
   0:	64 89 02             	mov    %eax,%fs:(%rdx)
   3:	48 c7 c0 ff ff ff ff 	mov    $0xffffffffffffffff,%rax
   a:	eb b7                	jmp    0xffffffffffffffc3
   c:	66 2e 0f 1f 84 00 00 	nopw   %cs:0x0(%rax,%rax,1)
  13:	00 00 00 
  16:	90                   	nop
  17:	64 8b 04 25 18 00 00 	mov    %fs:0x18,%eax
  1e:	00 
  1f:	85 c0                	test   %eax,%eax
  21:	75 14                	jne    0x37
  23:	b8 2e 00 00 00       	mov    $0x2e,%eax
  28:	0f 05                	syscall 
  2a:*	48 3d 00 f0 ff ff    	cmp    $0xfffffffffffff000,%rax		<-- trapping instruction
  30:	77 55                	ja     0x87
  32:	c3                   	retq   
  33:	0f 1f 40 00          	nopl   0x0(%rax)
  37:	48 83 ec 28          	sub    $0x28,%rsp
  3b:	89 54 24 1c          	mov    %edx,0x1c(%rsp)
  3f:	48                   	rex.W

Code starting with the faulting instruction
===========================================
   0:	48 3d 00 f0 ff ff    	cmp    $0xfffffffffffff000,%rax
   6:	77 55                	ja     0x5d
   8:	c3                   	retq   
   9:	0f 1f 40 00          	nopl   0x0(%rax)
   d:	48 83 ec 28          	sub    $0x28,%rsp
  11:	89 54 24 1c          	mov    %edx,0x1c(%rsp)
  15:	48                   	rex.W
[  103.603312][ T5052] RSP: 002b:00007ffd473a4358 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[  103.603472][ T5052] RAX: ffffffffffffffda RBX: 000055d33f5b4390 RCX: 00007fcd499bf2c3
[  103.603625][ T5052] RDX: 0000000000000000 RSI: 00007ffd473a4390 RDI: 0000000000000003
[  103.603758][ T5052] RBP: 000055d33f5b9780 R08: 000055d33f5b54c0 R09: 000055d33f5b44b0
[  103.603894][ T5052] R10: 000055d33f290940 R11: 0000000000000246 R12: 000055d33f5b98c0
[  103.604030][ T5052] R13: 00007ffd473a4390 R14: 000055d33f5b97d0 R15: 000055d33f5b98c0
[  103.604166][ T5052]  </TASK>
[  103.605988][  T405] Station dump on dev[0]:
[  103.605997][  T405]
[  103.606940][  T405] Station 02:00:00:00:03:00 (on wlan0)
[  103.606947][  T405]
[  103.607595][  T405] 	inactive time:	61 ms
[  103.607602][  T405]
[  103.608090][  T405] 	rx bytes:	150775
[  103.608096][  T405]
[  103.608636][  T405] 	rx packets:	172
[  103.608643][  T405]
[  103.609122][  T405] 	tx bytes:	64715
[  103.609128][  T405]
[  103.609621][  T405] 	tx packets:	54
[  103.609627][  T405]
[  103.610061][  T405] 	tx retries:	0
[  103.610067][  T405]
[  103.610590][  T405] 	tx failed:	0
[  103.610596][  T405]
[  103.611031][  T405] 	beacon loss:	0
[  103.611037][  T405]
[  103.611473][  T405] 	beacon rx:	25
[  103.611480][  T405]
[  103.611979][  T405] 	rx drop misc:	30
[  103.611985][  T405]
[  103.612510][  T405] 	signal:  	-27 dBm
[  103.612521][  T405]
[  103.613094][  T405] 	signal avg:	-27 dBm
[  103.613100][  T405]
[  103.613841][  T405] 	beacon signal avg:	-27 dBm
[  103.613848][  T405]
[  103.614976][  T405] 	tx bitrate:	13.0 MBit/s VHT-MCS 1 VHT-NSS 1
[  103.614983][  T405]
[  103.615774][    T8] wlan0: deauthenticated from 02:00:00:00:03:00 (Reason: 3=DEAUTH_LEAVING)
[  103.615801][   T60] wlan1: deauthenticated from 02:00:00:00:03:00 (Reason: 3=DEAUTH_LEAVING)
[  103.615855][  T405] 	tx duration:	0 us
[  103.615864][  T405]


To reproduce:

        git clone https://github.com/intel/lkp-tests.git
        cd lkp-tests
        sudo bin/lkp install job.yaml           # job file is attached in this email
        bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
        sudo bin/lkp run generated-yaml-file

        # if come across any failure that blocks the test,
        # please remove ~/.lkp and /lkp dir to run from a clean state.
Felix Fietkau Sept. 20, 2022, 8:33 a.m. UTC | #2
On 15.09.22 06:35, venkatch@gmail.com wrote:
> From: Venkat Chimata <venkata@shasta.cloud>
> 
> OpenWiFi's ucentral agent periodically (typically 120 seconds)
> issues nl80211 call to get associated client list from the
> WLAN driver. Somehow this operation was causing tx/rx delays
> sometimes and the video calls on connected clients are experiencing
> jitter. The associated client list was protected by
> a mutex lock. I saw that ieee80211_check_fast_xmit_all uses
> rcu_read_lock and rcu_read_unlock to iterate through sta_list.
> I took it as a refernce and changed the lock to rcu_read lock
> from mutex.
> Also saw this this comment just above sta_mutex declaration.
>          /* Station data */
>          /*
>           * The mutex only protects the list, hash table and
>           * counter, reads are done with RCU.
>           */
>          struct mutex sta_mtx;
> 
> Hence tried changing mutex_lock(/unlock) in ieee80211_dump_station
> to rcu_read_lock(/unlock) and it resolved the jitter issue in the
> video calls.
> 
> Tests:
> We had this issue show up consistently and the patch fixed the issue.
> We spent a good part of 2 weeks following up on this and with this
> fix, the video calls are smooth.
> 
> Also tested if this could cause any crashes with below mentioned
> process.
> 
> 1. Connect 3 clients
> 2. A script running dumping clients in an infinite loop
> 3. Continuously disconnect / connect one client to see if
>     there is any crash. No crash was observed.
The drv_sta_statistics call (called from sta_set_sinfo) can sleep, so 
you're not allowed to use RCU for iterating over the station list.
Also, please analyze and explain the actual root cause before proposing 
another fix.

- Felix
Felix Fietkau Sept. 20, 2022, 9:10 a.m. UTC | #3
On 20.09.22 11:05, Venkat Ch wrote:
> Hi Felix,
> Thanks for the comments. What part of  drv_sta_statistics could make it 
> sleep? I might be missing something. Please explain.
It's just a wrapper around the driver .sta_statistics op.
The documentation in include/net/mac80211.h states that this callback is 
allowed to sleep, and it does so in a few drivers.

- Felix
Venkat Ch Sept. 20, 2022, 6:27 p.m. UTC | #4
Hi Felix,

 Thank you. I browsed through the code. There seems to be sleep
operations wcn36xx platform.ath11k  does n't seem to have the sleep
operations in sta_statistics. We are using ath11k based chipset. Will
it impact things if we apply this patch for ath11k boards only as a
platform specific patch?

Thanks & Regards
Venkat

Venkat

On Tue, 20 Sept 2022 at 14:40, Felix Fietkau <nbd@nbd.name> wrote:
>
>
> On 20.09.22 11:05, Venkat Ch wrote:
> > Hi Felix,
> > Thanks for the comments. What part of  drv_sta_statistics could make it
> > sleep? I might be missing something. Please explain.
> It's just a wrapper around the driver .sta_statistics op.
> The documentation in include/net/mac80211.h states that this callback is
> allowed to sleep, and it does so in a few drivers.
>
> - Felix



--
If you rest, you rust
Felix Fietkau Sept. 20, 2022, 6:51 p.m. UTC | #5
On 20.09.22 20:27, Venkat Ch wrote:
> Hi Felix,
> 
>   Thank you. I browsed through the code. There seems to be sleep
> operations wcn36xx platform.ath11k  does n't seem to have the sleep
> operations in sta_statistics. We are using ath11k based chipset. Will
> it impact things if we apply this patch for ath11k boards only as a
> platform specific patch?
I think it's a bad idea to keep this hack without understanding why it 
helps at all.

- Felix
Venkat Ch Sept. 20, 2022, 7:23 p.m. UTC | #6
Hi Felix,

 Following is the background of the problem, how I traced to
mutex_lock and why I propose rcu locks.

Issue:
 On a 10Mbps upload / 50 Mbps download connection, the following issue reported.

Video periodically freezes and/or appears delayed when on Zoom or Teams.
1. Video will freeze for 10 or 15 seconds periodically when on a call,
but audio continues and the session doesn't drop.
2. The video  still works but it appears delay (I move, but the video
of my movement is noticeably delay by a second or so)

Tracing to mutex_lock(sta_mutex):

 When I investigated, I found that the ucentral agent in openwifi
fetches the station list periodically.  Without the station list
fetch, the video quality is just fine. I investigated the station list
path and found this mutex_lock. I also see that earlier it was
rcu_lock which protected the station list in this path. In this
commit, https://github.com/torvalds/linux/commit/66572cfc30a4b764150c83ee5d842a3ce17991c9,
rcu lock was changed to mutex lock without providing any reason.

I also saw this comment just above the sta_mutex declaration.
        /* Station data */
        /*
         * The mutex only protects the list, hash table and
         * counter, reads are done with RCU.
         */
        struct mutex sta_mtx;

So I reverted back the mutex_lock with rcu_lock and it just worked
fine. We tested for more than 2 weeks before concluding this analysis.

I think the usage of mutex_lock is impacting the tx / rx path
somewhere and hence the issue. It is a challenge to trace the exact
function though.

Thanks & Regards
Venkat

On Wed, 21 Sept 2022 at 00:21, Felix Fietkau <nbd@nbd.name> wrote:
>
>
> On 20.09.22 20:27, Venkat Ch wrote:
> > Hi Felix,
> >
> >   Thank you. I browsed through the code. There seems to be sleep
> > operations wcn36xx platform.ath11k  does n't seem to have the sleep
> > operations in sta_statistics. We are using ath11k based chipset. Will
> > it impact things if we apply this patch for ath11k boards only as a
> > platform specific patch?
> I think it's a bad idea to keep this hack without understanding why it
> helps at all.
>
> - Felix
Felix Fietkau Sept. 21, 2022, 8:05 a.m. UTC | #7
On 20.09.22 21:23, Venkat Ch wrote:
> Hi Felix,
> 
>   Following is the background of the problem, how I traced to
> mutex_lock and why I propose rcu locks.
> 
> Issue:
>   On a 10Mbps upload / 50 Mbps download connection, the following issue reported.
> 
> Video periodically freezes and/or appears delayed when on Zoom or Teams.
> 1. Video will freeze for 10 or 15 seconds periodically when on a call,
> but audio continues and the session doesn't drop.
> 2. The video  still works but it appears delay (I move, but the video
> of my movement is noticeably delay by a second or so)
> 
> Tracing to mutex_lock(sta_mutex):
> 
>   When I investigated, I found that the ucentral agent in openwifi
> fetches the station list periodically.  Without the station list
> fetch, the video quality is just fine. I investigated the station list
> path and found this mutex_lock. I also see that earlier it was
> rcu_lock which protected the station list in this path. In this
> commit, https://github.com/torvalds/linux/commit/66572cfc30a4b764150c83ee5d842a3ce17991c9,
> rcu lock was changed to mutex lock without providing any reason.
The reason seems clear to me, even though it was not explicitly stated 
in the commit message: in sta_set_sinfo it introduces a call to a driver 
op that is allowed to sleep.

> I also saw this comment just above the sta_mutex declaration.
>          /* Station data */
>          /*
>           * The mutex only protects the list, hash table and
>           * counter, reads are done with RCU.
>           */
>          struct mutex sta_mtx;
> 
> So I reverted back the mutex_lock with rcu_lock and it just worked
> fine. We tested for more than 2 weeks before concluding this analysis.
> 
> I think the usage of mutex_lock is impacting the tx / rx path
> somewhere and hence the issue. It is a challenge to trace the exact
> function though.

I don't see any critical part in the tx/rx path which depends on the 
sta_mtx lock. My guess is that for some reason your change is simply 
accidentally papering over the real bug, which may be somewhere else 
entirely, maybe even in the driver. A freeze for 10-15 seconds 
definitely does not sound like simple lock contention on the mutex, 
since a single station dump will be much faster than that.

- Felix
Venkat Ch Sept. 21, 2022, 8:31 a.m. UTC | #8
Hi Felix,

 Thanks again for patiently discussing things. I clearly understand
what you said about locks. Now that we know the background of the
problem, do you suggest any potential solution or any potential
direction that I should start investigating? Please let me know.

Thanks & Regards
Venkat

On Wed, 21 Sept 2022 at 13:35, Felix Fietkau <nbd@nbd.name> wrote:
>
>
> On 20.09.22 21:23, Venkat Ch wrote:
> > Hi Felix,
> >
> >   Following is the background of the problem, how I traced to
> > mutex_lock and why I propose rcu locks.
> >
> > Issue:
> >   On a 10Mbps upload / 50 Mbps download connection, the following issue reported.
> >
> > Video periodically freezes and/or appears delayed when on Zoom or Teams.
> > 1. Video will freeze for 10 or 15 seconds periodically when on a call,
> > but audio continues and the session doesn't drop.
> > 2. The video  still works but it appears delay (I move, but the video
> > of my movement is noticeably delay by a second or so)
> >
> > Tracing to mutex_lock(sta_mutex):
> >
> >   When I investigated, I found that the ucentral agent in openwifi
> > fetches the station list periodically.  Without the station list
> > fetch, the video quality is just fine. I investigated the station list
> > path and found this mutex_lock. I also see that earlier it was
> > rcu_lock which protected the station list in this path. In this
> > commit, https://github.com/torvalds/linux/commit/66572cfc30a4b764150c83ee5d842a3ce17991c9,
> > rcu lock was changed to mutex lock without providing any reason.
> The reason seems clear to me, even though it was not explicitly stated
> in the commit message: in sta_set_sinfo it introduces a call to a driver
> op that is allowed to sleep.
>
> > I also saw this comment just above the sta_mutex declaration.
> >          /* Station data */
> >          /*
> >           * The mutex only protects the list, hash table and
> >           * counter, reads are done with RCU.
> >           */
> >          struct mutex sta_mtx;
> >
> > So I reverted back the mutex_lock with rcu_lock and it just worked
> > fine. We tested for more than 2 weeks before concluding this analysis.
> >
> > I think the usage of mutex_lock is impacting the tx / rx path
> > somewhere and hence the issue. It is a challenge to trace the exact
> > function though.
>
> I don't see any critical part in the tx/rx path which depends on the
> sta_mtx lock. My guess is that for some reason your change is simply
> accidentally papering over the real bug, which may be somewhere else
> entirely, maybe even in the driver. A freeze for 10-15 seconds
> definitely does not sound like simple lock contention on the mutex,
> since a single station dump will be much faster than that.
>
> - Felix
Felix Fietkau Sept. 21, 2022, 10:38 a.m. UTC | #9
On 21.09.22 10:31, Venkat Ch wrote:
> Hi Felix,
> 
>   Thanks again for patiently discussing things. I clearly understand
> what you said about locks. Now that we know the background of the
> problem, do you suggest any potential solution or any potential
> direction that I should start investigating? Please let me know.

Maybe try enabling lockdep and see if anything shows up.

- Felix
diff mbox series

Patch

diff --git a/net/mac80211/cfg.c b/net/mac80211/cfg.c
index 6a8350d..a005364 100644
--- a/net/mac80211/cfg.c
+++ b/net/mac80211/cfg.c
@@ -790,20 +790,17 @@  static int ieee80211_dump_station(struct wiphy *wiphy, struct net_device *dev,
 				  int idx, u8 *mac, struct station_info *sinfo)
 {
 	struct ieee80211_sub_if_data *sdata = IEEE80211_DEV_TO_SUB_IF(dev);
-	struct ieee80211_local *local = sdata->local;
 	struct sta_info *sta;
 	int ret = -ENOENT;
 
-	mutex_lock(&local->sta_mtx);
-
+	rcu_read_lock();
 	sta = sta_info_get_by_idx(sdata, idx);
 	if (sta) {
 		ret = 0;
 		memcpy(mac, sta->sta.addr, ETH_ALEN);
 		sta_set_sinfo(sta, sinfo, true);
 	}
-
-	mutex_unlock(&local->sta_mtx);
+	rcu_read_unlock();
 
 	return ret;
 }
diff --git a/net/mac80211/sta_info.c b/net/mac80211/sta_info.c
index 550a610..af6fa75 100644
--- a/net/mac80211/sta_info.c
+++ b/net/mac80211/sta_info.c
@@ -231,8 +231,7 @@  struct sta_info *sta_info_get_by_idx(struct ieee80211_sub_if_data *sdata,
 	struct sta_info *sta;
 	int i = 0;
 
-	list_for_each_entry_rcu(sta, &local->sta_list, list,
-				lockdep_is_held(&local->sta_mtx)) {
+	list_for_each_entry_rcu(sta, &local->sta_list, list) {
 		if (sdata != sta->sdata)
 			continue;
 		if (i < idx) {