wifi: cfg80211: fix bss rbn double erase issue

Message ID	20231025103304.22082-1-jiazi.li@transsion.com (mailing list archive)
State	Changes Requested
Delegated to:	Johannes Berg
Headers	show Return-Path: <linux-wireless-owner@vger.kernel.org> From: Jiazi Li <jqqlijiazi@gmail.com> To: Johannes Berg <johannes@sipsolutions.net>, "David S. Miller" <davem@davemloft.net>, Eric Dumazet <edumazet@google.com>, Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com> Cc: Jiazi Li <jiazi.li@transsion.com>, linux-wireless@vger.kernel.org Subject: [PATCH] wifi: cfg80211: fix bss rbn double erase issue Date: Wed, 25 Oct 2023 18:33:04 +0800 Message-Id: <20231025103304.22082-1-jiazi.li@transsion.com> Precedence: bulk
Series	wifi: cfg80211: fix bss rbn double erase issue \| expand wifi: cfg80211: fix bss rbn double erase issue

Jiazi Li Oct. 25, 2023, 10:33 a.m. UTC

If cfg80211_update_assoc_bss_entry call rb_insert_bss re-insert bss
failed because cmp_bss return 0, this bss->rbn will continue to hold
expired data, such as __rd_parent_color.
And this bss still in rdev->bss_list, maybe double erase in
__cfg80211_bss_expire later.
Double erase a rbtree node(with expired parent and color data) maybe
corrupt rbtree, so add a in_rbtree flag to fix this issue.

Signed-off-by: Jiazi Li <jiazi.li@transsion.com>
---
 net/wireless/core.h |  1 +
 net/wireless/scan.c | 11 ++++++++---
 2 files changed, 9 insertions(+), 3 deletions(-)

Johannes Berg Oct. 25, 2023, 10:36 a.m. UTC | #1

Huh.

On Wed, 2023-10-25 at 18:33 +0800, Jiazi Li wrote:
> If cfg80211_update_assoc_bss_entry call rb_insert_bss re-insert bss
> failed because cmp_bss return 0,
> 

Ok that's bad - so you hit the WARN_ON there? How that? We should fix
that too?

> this bss->rbn will continue to hold
> expired data, such as __rd_parent_color.

Does that matter in any way?

> And this bss still in rdev->bss_list, maybe double erase in
> __cfg80211_bss_expire later.
> Double erase a rbtree node(with expired parent and color data) maybe
> corrupt rbtree, so add a in_rbtree flag to fix this issue.

This seems overly complex - couldn't we just remove it from the list too
or something? It's already a case that "should never happen" so ... not
sure we need to do something "good"?

johannes

Jeff Johnson Oct. 25, 2023, 4:04 p.m. UTC | #2

On 10/25/2023 3:33 AM, Jiazi Li wrote:
> If cfg80211_update_assoc_bss_entry call rb_insert_bss re-insert bss
> failed because cmp_bss return 0, this bss->rbn will continue to hold
> expired data, such as __rd_parent_color.
> And this bss still in rdev->bss_list, maybe double erase in
> __cfg80211_bss_expire later.
> Double erase a rbtree node(with expired parent and color data) maybe
> corrupt rbtree, so add a in_rbtree flag to fix this issue.
> 
> Signed-off-by: Jiazi Li <jiazi.li@transsion.com>

Note your Signed-off-by doesn't match your e-mail address in your e-mail 
header. From the actual e-mail source it seems Google is trashing your 
headers:
From: Jiazi Li <jqqlijiazi@gmail.com>
X-Google-Original-From: Jiazi Li <jiazi.li@transsion.com>

That needs to be resolved

Jiazi Li Oct. 26, 2023, 1:35 a.m. UTC | #3

On Wed, Oct 25, 2023 at 12:36:24PM +0200, Johannes Berg wrote:
> Huh.
> 
> On Wed, 2023-10-25 at 18:33 +0800, Jiazi Li wrote:
> > If cfg80211_update_assoc_bss_entry call rb_insert_bss re-insert bss
> > failed because cmp_bss return 0,
> > 
> 
> Ok that's bad - so you hit the WARN_ON there? How that? We should fix
> that too?
> 
Yes, hit this WARN_ON in the test of direct connection between mobile
phones and PC. Here is the log:
[ 2741.982362] -----------[ cut here ]-----------
[ 2741.982446] WARNING: CPU: 6 PID: 2175 at net/wireless/scan.c:1496 cfg80211_update_assoc_bss_entry+0x350/0x378 [cfg80211]
[ 2741.983274] Call trace:
[ 2741.983343] cfg80211_update_assoc_bss_entry+0x350/0x378 [cfg80211]
[ 2741.983413] cfg80211_ch_switch_notify+0x7c/0x1b4 [cfg80211]
[ 2741.983487] kalP2pIndicateChnlSwitch+0x520/0xa34 [wlan_drv_gen4m_6895]
[ 2741.983558] p2pRoleFsmRunEventChnlGrant+0x358/0x5c0 [wlan_drv_gen4m_6895]
[ 2741.983629] p2pFsmRunEventChGrant+0x12c/0x16c [wlan_drv_gen4m_6895]
[ 2741.983699] mboxRcvAllMsg+0x198/0x23c [wlan_drv_gen4m_6895]
[ 2741.983770] kalProcessTxReq+0x50/0x340 [wlan_drv_gen4m_6895]
[ 2741.983840] main_thread+0x410/0x1730 [wlan_drv_gen4m_6895]
[ 2741.983846] kthread+0x150/0x200
[ 2741.983849] ret_from_fork+0x10/0x30
I don't know why this is happening yet.
> > this bss->rbn will continue to hold
> > expired data, such as __rd_parent_color.
> 
> Does that matter in any way?
> 
It caused a null pointer issue in rb_erase:
[ 2781.560091][T502248] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[ 2781.560115][T502248] Mem abort info:
[ 2781.560128][T502248]   ESR = 0x96000005
[ 2781.560143][T502248]   EC = 0x25: DABT (current EL), IL = 32 bits
[ 2781.560156][T502248]   SET = 0, FnV = 0
[ 2781.560168][T502248]   EA = 0, S1PTW = 0
[ 2781.560178][T502248] Data abort info:
[ 2781.560187][T502248]   ISV = 0, ISS = 0x00000005
[ 2781.560196][T502248]   CM = 0, WnR = 0
[ 2781.560207][T502248] user pgtable: 4k pages, 39-bit VAs, pgdp=00000001cb653000
[ 2781.560216][T502248] [0000000000000000] pgd=0000000000000000
[ 2781.560222][T502248] , p4d=0000000000000000
[ 2781.560230][T502248] , pud=0000000000000000
[ 2781.560238][T502248]
[ 2781.560254][T502248] Internal error: Oops: 96000005 [#1] PREEMPT SMP
[ 2781.568590][ T2176] [wlan][2176]halSetFWOwn:(INIT INFO) FW OWN:1, IntSta:0x08000000
[ 2781.591991][T502248] Kernel Offset: 0x22e2600000 from 0xffffffc008000000
[ 2781.591998][T502248] PHYS_OFFSET: 0x40000000
[ 2781.592002][T502248] pstate: 60400005 (nZCv daif +PAN -UAO)
[ 2781.592010][T502248] pc : [0xffffffe2eb0698e4] rb_erase+0x1c8/0x354
[ 2781.592123][T502248] lr : [0xffffffe2edef93b4] __cfg80211_unlink_bss+0xac/0x1a0 [cfg80211]
[ 2781.592127][T502248] sp : ffffffc02ce43720
[ 2781.592130][T502248] x29: ffffffc02ce43720 x28: 0000000000000014
[ 2781.592133][T502248] x27: ffffff812e343e00 x26: 0000000000000000
[ 2781.592135][T502248] x25: ffffff81cb89c100 x24: fffffffefff6a5f0
[ 2781.592138][T502248] x23: ffffff80f58b0000 x22: 0000000000000001
[ 2781.592140][T502248] x21: ffffff809fbe8da0 x20: ffffff80f58b0000
[ 2781.592143][T502248] x19: ffffff809fbe8d00 x18: ffffffc02c4a7080
[ 2781.592146][T502248] x17: 000000000000012b x16: 0000000000000000
[ 2781.592148][T502248] x15: 000000000000001f x14: ffffff81b372c628
[ 2781.592151][T502248] x13: ffffff80c3298b28 x12: 0000000000000000
[ 2781.592154][T502248] x11: 0000000000000000 x10: ffffff80c3298b20
[ 2781.592156][T502248] x9 : 0000000000000000 x8 : ffffff80c3298b20
[ 2781.592159][T502248] x7 : 000000000000012a x6 : ffffff818755a000
[ 2781.592161][T502248] x5 : 000000008020001c x4 : ffffffff02541020
[ 2781.592164][T502248] x3 : 000000008020001c x2 : ffffff809fbe8da0
[ 2781.592166][T502248] x1 : ffffff80f58b00e8 x0 : ffffff809fbe8d20
[ 2781.592649][T502248] CPU: 5 PID: 2248 Comm: wpa_supplicant Tainted: P        W  O      5.10.177-android12-9-00002-g593c61caffd9-ab10731447 #1
[ 2781.592652][T502248] Hardware name: MT6896 (DT)
[ 2781.592656][T502248] Call trace:
[ 2781.592665][T502248]  dump_backtrace.cfi_jt+0x0/0x8
[ 2781.592669][T502248]  dump_stack_lvl+0xdc/0x138
[ 2781.592672][T502248]  dump_stack+0x1c/0x2c
[ 2781.592698][T502248]  mrdump_common_die+0x3a8/0x544 [mrdump]
[ 2781.592710][T502248]  ipanic_die+0x24/0x38 [mrdump]
[ 2781.592717][T502248]  die+0x358/0x680
[ 2781.592723][T502248]  die_kernel_fault+0x84/0x94
[ 2781.592726][T502248]  __do_kernel_fault+0x240/0x28c
[ 2781.592735][T502248]  do_page_fault+0xb4/0x770
[ 2781.592738][T502248]  do_translation_fault+0x48/0x64
[ 2781.592741][T502248]  do_mem_abort+0x6c/0x164
[ 2781.592745][T502248]  el1_abort+0x44/0x68
[ 2781.592748][T502248]  el1_sync_handler+0x58/0x88
[ 2781.592751][T502248]  el1_sync+0x8c/0x140
[ 2781.592754][T502248]  rb_erase+0x1c8/0x354
[ 2781.592838][T502248]  nl80211_dump_scan+0x564/0x634 [cfg80211]
[ 2781.592845][T502248]  netlink_dump+0x1b0/0x414
[ 2781.592848][T502248]  __netlink_dump_start+0x224/0x478
[ 2781.592852][T502248]  genl_rcv_msg+0x434/0x5c0
[ 2781.592855][T502248]  netlink_rcv_skb+0xf0/0x170
[ 2781.592858][T502248]  genl_rcv+0x3c/0x58
[ 2781.592861][T502248]  netlink_unicast_kernel+0xb8/0x248
[ 2781.592864][T502248]  netlink_unicast+0x18c/0x2fc
[ 2781.592867][T502248]  netlink_sendmsg+0x4c8/0x6f8
[ 2781.592873][T502248]  ____sys_sendmsg+0x240/0x3cc
[ 2781.592876][T502248]  __sys_sendmsg+0x21c/0x320
[ 2781.592880][T502248]  __arm64_sys_sendmsg+0x28/0x38
[ 2781.592886][T502248]  el0_svc_common+0xd4/0x270
[ 2781.592890][T502248]  el0_svc+0x28/0x98
[ 2781.592893][T502248]  el0_sync_handler+0x8c/0xf0
[ 2781.592896][T502248]  el0_sync+0x1b4/0x1c0
The one currently eraseing is the bss re-insert failed earlier.
The status of this bss is as follows after re-insert fail:
1. color is black
2. no child
After 40s, nl80211_dump_scan want to erase this bss, because it's
color is black, need rebalance, but it's parent(from expired
__rb_parent_color) now doesn't have any children, so we encountered a
null pointer issue in ____rb_erase_color.
> > And this bss still in rdev->bss_list, maybe double erase in
> > __cfg80211_bss_expire later.
> > Double erase a rbtree node(with expired parent and color data) maybe
> > corrupt rbtree, so add a in_rbtree flag to fix this issue.
> 
> This seems overly complex - couldn't we just remove it from the list too
> or something? It's already a case that "should never happen" so ... not
> sure we need to do something "good"?
> 
Will remove it from list when re-insert fail cause confusion in it's
refcount? Which could lead to leakage or use-after-free?
> johannes
>

Jiazi Li Oct. 26, 2023, 1:43 a.m. UTC | #4

On Wed, Oct 25, 2023 at 09:04:48AM -0700, Jeff Johnson wrote:
> On 10/25/2023 3:33 AM, Jiazi Li wrote:
> > If cfg80211_update_assoc_bss_entry call rb_insert_bss re-insert bss
> > failed because cmp_bss return 0, this bss->rbn will continue to hold
> > expired data, such as __rd_parent_color.
> > And this bss still in rdev->bss_list, maybe double erase in
> > __cfg80211_bss_expire later.
> > Double erase a rbtree node(with expired parent and color data) maybe
> > corrupt rbtree, so add a in_rbtree flag to fix this issue.
> > 
> > Signed-off-by: Jiazi Li <jiazi.li@transsion.com>
> 
> Note your Signed-off-by doesn't match your e-mail address in your e-mail
> header. From the actual e-mail source it seems Google is trashing your
> headers:
> From: Jiazi Li <jqqlijiazi@gmail.com>
> X-Google-Original-From: Jiazi Li <jiazi.li@transsion.com>
> 
> That needs to be resolved
Thanks for your reminder, I will use the same e-mail address in the
feature.
> 
>

Johannes Berg Dec. 14, 2023, 1:13 p.m. UTC | #5

> > 
> > Ok that's bad - so you hit the WARN_ON there? How that? We should fix
> > that too?
> > 
> Yes, hit this WARN_ON in the test of direct connection between mobile
> phones and PC. Here is the log:
> [ 2741.982362] -----------[ cut here ]-----------
> [ 2741.982446] WARNING: CPU: 6 PID: 2175 at net/wireless/scan.c:1496 cfg80211_update_assoc_bss_entry+0x350/0x378 [cfg80211]

Right, so you can reproduce that - can you find a fix for it?

> I don't know why this is happening yet.

OK ...

We have some basic kunit infrastructure, maybe you can work out
something there.

> > > this bss->rbn will continue to hold
> > > expired data, such as __rd_parent_color.
> > 
> > Does that matter in any way?
> > 
> It caused a null pointer issue in rb_erase:

Well, OK, so the thing isn't about it holding a color or 'expired' data
or whatever, it's about it being still on the rbtree, no?

> > > And this bss still in rdev->bss_list, maybe double erase in
> > > __cfg80211_bss_expire later.
> > > Double erase a rbtree node(with expired parent and color data) maybe
> > > corrupt rbtree, so add a in_rbtree flag to fix this issue.
> > 
> > This seems overly complex - couldn't we just remove it from the list too
> > or something? It's already a case that "should never happen" so ... not
> > sure we need to do something "good"?
> > 
> Will remove it from list when re-insert fail cause confusion in it's
> refcount? Which could lead to leakage or use-after-free?
> > 

It's a warn-on anyway, better we leak it than crash?

johannes

Jiazi Li Dec. 15, 2023, 10:21 a.m. UTC | #6

On Thu, Dec 14, 2023 at 02:13:38PM +0100, Johannes Berg wrote:
> > > 
> > > Ok that's bad - so you hit the WARN_ON there? How that? We should fix
> > > that too?
> > > 
> > Yes, hit this WARN_ON in the test of direct connection between mobile
> > phones and PC. Here is the log:
> > [ 2741.982362] -----------[ cut here ]-----------
> > [ 2741.982446] WARNING: CPU: 6 PID: 2175 at net/wireless/scan.c:1496 cfg80211_update_assoc_bss_entry+0x350/0x378 [cfg80211]
> 
> Right, so you can reproduce that - can you find a fix for it?
> 

I am responsible for kernel stability and I am not very familiar with wireless code.
The colleague in charge of the WiFi module also couldn't find the root cause, so we
used the workaround solution I mentioned earlier to address this issue.

> > I don't know why this is happening yet.
> 
> OK ...
> 
> We have some basic kunit infrastructure, maybe you can work out
> something there.
>

I would suggest that my WiFi colleagues use basic kunit infrastructure to identify
the root cause of the problem.

> > > > this bss->rbn will continue to hold
> > > > expired data, such as __rd_parent_color.
> > > 
> > > Does that matter in any way?
> > > 
> > It caused a null pointer issue in rb_erase:
> 
> Well, OK, so the thing isn't about it holding a color or 'expired' data
> or whatever, it's about it being still on the rbtree, no?
> 

I think the thing is rbn has been erased in cfg80211_update_assoc_bss_entry, but
it couldn't be reinserted into rbtree:
  	rb_erase(&cbss->rbn, &rdev->bss_tree);
  	rb_insert_bss(rdev, cbss);//reinsert fail
  	rdev->bss_generation++;

So bss->rbn is not in rbtree now, but bss still in rdev->bss_list.
This leads to erasing this rbn that is not in the rbtree in __cfg80211_bss_expire.
Expired __rb_parent_color, rb_right, and rb_left in this rbn may cause various crash issues:
1.
[56994.336470][T312578] WARNING: CPU: 3 PID: 12578 at net/wireless/scan.c:1495 cfg80211_update_assoc_bss_entry+0x350/0x378 [cfg80211]
......
[57049.728279][T712578] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000004
......
[57049.765847][T712578] pc : [0xffffffe95cc51388] cmp_bss+0x30/0x3a4 [cfg80211]
[57049.765973][T712578] lr : [0xffffffe95cc50c68] cfg80211_bss_update+0x7c/0x76c [cfg80211]
2.
[12114.124799][T412320] WARNING: CPU: 4 PID: 12320 at net/wireless/scan.c:1496 cfg80211_update_assoc_bss_entry+0x350/0x378 [cfg80211]
......
[12418.513153][T212320] Unable to handle kernel paging request at virtual address ffff81b1d69f1928
......
[12418.548440][T212320] pc : [0xffffffec9e669630] rb_insert_color+0xe4/0x164
[12418.548762][T212320] lr : [0xffffffec99766fb0] cfg80211_bss_update+0x3d4/0x76c [cfg80211]

I have encountered crashes in the code of other modules, and it is speculated that the use after free of rbn has damaged the memory used by other modules.
3.
[ 3981.870858][T510216] WARNING: CPU: 5 PID: 10216 at net/wireless/scan.c:1496 cfg80211_update_assoc_bss_entry+0x350/0x378 [cfg80211]
......
[ 4020.227747][ T4070] list_del corruption. prev->next should be ffffff80ebeace00, but was ffffff81950e8b30
[ 4020.227892][ T4070] ------------[ cut here ]------------
[ 4020.227913][ T4070] kernel BUG at lib/list_debug.c:61!
[ 4020.359413][ T4070] pc : [0xffffffd0985eecf8] __list_del_entry_valid+0xc0/0xd4
[ 4020.359438][ T4070] lr : [0xffffffd0985eecf8] __list_del_entry_valid+0xc0/0xd4
4.
[ 4858.776299][T102099] WARNING: CPU: 1 PID: 2099 at net/wireless/scan.c:1496 cfg80211_update_assoc_bss_entry+0x350/0x378 [cfg80211]
......
[ 5557.453407][T732106] Unable to handle kernel paging request at virtual address 20ffffff813691f8
......
[ 5557.466271][T732106] pc : [0xffffffd0ce5bda08] binder_open+0x208/0x608
[ 5557.466273][T732106] lr : [0xffffffd0ce5bd9f4] binder_open+0x1f4/0x608

> > > > And this bss still in rdev->bss_list, maybe double erase in
> > > > __cfg80211_bss_expire later.
> > > > Double erase a rbtree node(with expired parent and color data) maybe
> > > > corrupt rbtree, so add a in_rbtree flag to fix this issue.
> > > 
> > > This seems overly complex - couldn't we just remove it from the list too
> > > or something? It's already a case that "should never happen" so ... not
> > > sure we need to do something "good"?
> > > 
> > Will remove it from list when re-insert fail cause confusion in it's
> > refcount? Which could lead to leakage or use-after-free?
> > > 
> 
> It's a warn-on anyway, better we leak it than crash?
> 

Inaccurate refcount may cause use after free, which can also lead to crash.

> johannes

wifi: cfg80211: fix bss rbn double erase issue

Commit Message

Comments

Patch