Message ID | 20240619070824.537856-1-s.hauer@pengutronix.de (mailing list archive) |
---|---|
State | RFC |
Delegated to: | Kalle Valo |
Headers | show |
Series | [RFC] mwifiex: Fix NULL pointer deref | expand |
Sascha Hauer <s.hauer@pengutronix.de> writes: > When an Access Point is repeatedly started it happens that the > interrupts handler is called with priv->wdev.wiphy being NULL, but > dereferenced in mwifiex_parse_single_response_buf() resulting in: > > | Unable to handle kernel NULL pointer dereference at virtual address 0000000000000140 > | Mem abort info: > | ESR = 0x0000000096000004 > | EC = 0x25: DABT (current EL), IL = 32 bits > | SET = 0, FnV = 0 > | EA = 0, S1PTW = 0 > | FSC = 0x04: level 0 translation fault > | Data abort info: > | ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000 > | CM = 0, WnR = 0, TnD = 0, TagAccess = 0 > | GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 > | user pgtable: 4k pages, 48-bit VAs, pgdp=0000000046d96000 > | [0000000000000140] pgd=0000000000000000, p4d=0000000000000000 > | Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP > | Modules linked in: caam_jr caamhash_desc spidev caamalg_desc crypto_engine authenc libdes mwifiex_sdio mwifiex crct10dif_ce cdc_acm onboard_usb_hub fsl_imx8_ddr_perf imx8m_ddrc rtc_ds1307 lm75 rtc_snvs imx_sdma caam imx8mm_thermal spi_imx error imx_cpufreq_dt fuse ip_tables x_tables ipv6 > | CPU: 0 PID: 8 Comm: kworker/0:1 Not tainted 6.9.0-00007-g937242013fce-dirty #18 > | Hardware name: somemachine (DT) > | Workqueue: events sdio_irq_work > | pstate: 00000005 (nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) > | pc : mwifiex_get_cfp+0xd8/0x15c [mwifiex] > | lr : mwifiex_get_cfp+0x34/0x15c [mwifiex] > | sp : ffff8000818b3a70 > | x29: ffff8000818b3a70 x28: ffff000006bfd8a5 x27: 0000000000000004 > | x26: 000000000000002c x25: 0000000000001511 x24: 0000000002e86bc9 > | x23: ffff000006bfd996 x22: 0000000000000004 x21: ffff000007bec000 > | x20: 000000000000002c x19: 0000000000000000 x18: 0000000000000000 > | x17: 000000040044ffff x16: 00500072b5503510 x15: ccc283740681e517 > | x14: 0201000101006d15 x13: 0000000002e8ff43 x12: 002c01000000ffb1 > | x11: 0100000000000000 x10: 02e8ff43002c0100 x9 : 0000ffb100100157 > | x8 : ffff000003d20000 x7 : 00000000000002f1 x6 : 00000000ffffe124 > | x5 : 0000000000000001 x4 : 0000000000000003 x3 : 0000000000000000 > | x2 : 0000000000000000 x1 : 0001000000011001 x0 : 0000000000000000 > | Call trace: > | mwifiex_get_cfp+0xd8/0x15c [mwifiex] > | mwifiex_parse_single_response_buf+0x1d0/0x504 [mwifiex] > | mwifiex_handle_event_ext_scan_report+0x19c/0x2f8 [mwifiex] > | mwifiex_process_sta_event+0x298/0xf0c [mwifiex] > | mwifiex_process_event+0x110/0x238 [mwifiex] > | mwifiex_main_process+0x428/0xa44 [mwifiex] > | mwifiex_sdio_interrupt+0x64/0x12c [mwifiex_sdio] > | process_sdio_pending_irqs+0x64/0x1b8 > | sdio_irq_work+0x4c/0x7c > | process_one_work+0x148/0x2a0 > | worker_thread+0x2fc/0x40c > | kthread+0x110/0x114 > | ret_from_fork+0x10/0x20 > | Code: a94153f3 a8c37bfd d50323bf d65f03c0 (f940a000) > | ---[ end trace 0000000000000000 ]--- > > Fix this by adding a NULL check before dereferencing this pointer. > > Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> > > --- > > This is the most obvious fix for this problem, but I am not sure if we > might want to catch priv->wdev.wiphy being NULL earlier in the call > chain. I haven't looked at the call but the symptoms sound like that either we are enabling the interrupts too early or there's some kind of locking problem so that an other cpu doesn't see the change.
Hi Sascha, On Wed, Jun 19, 2024 at 11:05:28AM +0300, Kalle Valo wrote: > Sascha Hauer <s.hauer@pengutronix.de> writes: > > > When an Access Point is repeatedly started it happens that the > > interrupts handler is called with priv->wdev.wiphy being NULL, but > > dereferenced in mwifiex_parse_single_response_buf() resulting in: > > > > | Unable to handle kernel NULL pointer dereference at virtual address 0000000000000140 ... > > | pc : mwifiex_get_cfp+0xd8/0x15c [mwifiex] > > | lr : mwifiex_get_cfp+0x34/0x15c [mwifiex] > > | sp : ffff8000818b3a70 > > | x29: ffff8000818b3a70 x28: ffff000006bfd8a5 x27: 0000000000000004 > > | x26: 000000000000002c x25: 0000000000001511 x24: 0000000002e86bc9 > > | x23: ffff000006bfd996 x22: 0000000000000004 x21: ffff000007bec000 > > | x20: 000000000000002c x19: 0000000000000000 x18: 0000000000000000 > > | x17: 000000040044ffff x16: 00500072b5503510 x15: ccc283740681e517 > > | x14: 0201000101006d15 x13: 0000000002e8ff43 x12: 002c01000000ffb1 > > | x11: 0100000000000000 x10: 02e8ff43002c0100 x9 : 0000ffb100100157 > > | x8 : ffff000003d20000 x7 : 00000000000002f1 x6 : 00000000ffffe124 > > | x5 : 0000000000000001 x4 : 0000000000000003 x3 : 0000000000000000 > > | x2 : 0000000000000000 x1 : 0001000000011001 x0 : 0000000000000000 > > | Call trace: > > | mwifiex_get_cfp+0xd8/0x15c [mwifiex] > > | mwifiex_parse_single_response_buf+0x1d0/0x504 [mwifiex] > > | mwifiex_handle_event_ext_scan_report+0x19c/0x2f8 [mwifiex] > > | mwifiex_process_sta_event+0x298/0xf0c [mwifiex] > > | mwifiex_process_event+0x110/0x238 [mwifiex] > > | mwifiex_main_process+0x428/0xa44 [mwifiex] > > | mwifiex_sdio_interrupt+0x64/0x12c [mwifiex_sdio] > > | process_sdio_pending_irqs+0x64/0x1b8 > > | sdio_irq_work+0x4c/0x7c > > | process_one_work+0x148/0x2a0 > > | worker_thread+0x2fc/0x40c > > | kthread+0x110/0x114 > > | ret_from_fork+0x10/0x20 > > | Code: a94153f3 a8c37bfd d50323bf d65f03c0 (f940a000) > > | ---[ end trace 0000000000000000 ]--- > > > > Fix this by adding a NULL check before dereferencing this pointer. > > > > Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> > > > > --- > > > > This is the most obvious fix for this problem, but I am not sure if we > > might want to catch priv->wdev.wiphy being NULL earlier in the call > > chain. > > I haven't looked at the call but the symptoms sound like that either we > are enabling the interrupts too early or there's some kind of locking > problem so that an other cpu doesn't see the change. I agree with Kalle that there's a different underlying bug involved, and (my conclusion:) we shouldn't whack-a-mole the NULL pointer without addressing the underlying problem. Looking a bit closer (and without much other context to go on): I believe that one potential underlying problem is the complete lack of locking between cfg80211 entry points (such as mwifiex_add_virtual_intf() or mwifiex_cfg80211_change_virtual_intf()) and most stuff in the main loop (mwifiex_main_process()). The former call sites only hold the wiphy lock, and the latter tends to ... mostly not hold any locks, but rely on sequentialization with itself, and using its |main_proc_lock| for setup and teardown. It's all really bad and ready to fall down like a house of cards at any moment. Unfortunately, no one has spent time on rearchitecting this driver. So it's possible that mwifiex_process_event() (mwifiex_get_priv_by_id() / mwifiex_get_priv()) is getting a hold of a not-fully-initialized 'priv' structure. BTW, in case I can reproduce and poke at your scenario, what exactly is your test case? Are you just starting / killing / restarting hostapd in a loop? Are you running a full network manager stack that's doing something more complex (e.g., initiating scans)? Can you reproduce with some more targeted set of `iw` commands? (`iw phy ... interface add ...; iw dev ... del`) Is there anything else interesting in the dmesg logs? (Some of the worst behaviors in this driver come when we see command timeouts and mwifiex_reinit_sw(), for example.) Or barring that, can you get some kind of trace of the nl80211 command sequence, so it's clearer which command(s) are involved leading up to the problem? Brian
On Thu, Jun 20, 2024 at 12:48:01PM -0700, Brian Norris wrote: > Hi Sascha, > > On Wed, Jun 19, 2024 at 11:05:28AM +0300, Kalle Valo wrote: > > Sascha Hauer <s.hauer@pengutronix.de> writes: > > > > > When an Access Point is repeatedly started it happens that the > > > interrupts handler is called with priv->wdev.wiphy being NULL, but > > > dereferenced in mwifiex_parse_single_response_buf() resulting in: > > > > > > | Unable to handle kernel NULL pointer dereference at virtual address 0000000000000140 > ... > > > | pc : mwifiex_get_cfp+0xd8/0x15c [mwifiex] > > > | lr : mwifiex_get_cfp+0x34/0x15c [mwifiex] > > > | sp : ffff8000818b3a70 > > > | x29: ffff8000818b3a70 x28: ffff000006bfd8a5 x27: 0000000000000004 > > > | x26: 000000000000002c x25: 0000000000001511 x24: 0000000002e86bc9 > > > | x23: ffff000006bfd996 x22: 0000000000000004 x21: ffff000007bec000 > > > | x20: 000000000000002c x19: 0000000000000000 x18: 0000000000000000 > > > | x17: 000000040044ffff x16: 00500072b5503510 x15: ccc283740681e517 > > > | x14: 0201000101006d15 x13: 0000000002e8ff43 x12: 002c01000000ffb1 > > > | x11: 0100000000000000 x10: 02e8ff43002c0100 x9 : 0000ffb100100157 > > > | x8 : ffff000003d20000 x7 : 00000000000002f1 x6 : 00000000ffffe124 > > > | x5 : 0000000000000001 x4 : 0000000000000003 x3 : 0000000000000000 > > > | x2 : 0000000000000000 x1 : 0001000000011001 x0 : 0000000000000000 > > > | Call trace: > > > | mwifiex_get_cfp+0xd8/0x15c [mwifiex] > > > | mwifiex_parse_single_response_buf+0x1d0/0x504 [mwifiex] > > > | mwifiex_handle_event_ext_scan_report+0x19c/0x2f8 [mwifiex] > > > | mwifiex_process_sta_event+0x298/0xf0c [mwifiex] > > > | mwifiex_process_event+0x110/0x238 [mwifiex] > > > | mwifiex_main_process+0x428/0xa44 [mwifiex] > > > | mwifiex_sdio_interrupt+0x64/0x12c [mwifiex_sdio] > > > | process_sdio_pending_irqs+0x64/0x1b8 > > > | sdio_irq_work+0x4c/0x7c > > > | process_one_work+0x148/0x2a0 > > > | worker_thread+0x2fc/0x40c > > > | kthread+0x110/0x114 > > > | ret_from_fork+0x10/0x20 > > > | Code: a94153f3 a8c37bfd d50323bf d65f03c0 (f940a000) > > > | ---[ end trace 0000000000000000 ]--- > > > > > > Fix this by adding a NULL check before dereferencing this pointer. > > > > > > Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> > > > > > > --- > > > > > > This is the most obvious fix for this problem, but I am not sure if we > > > might want to catch priv->wdev.wiphy being NULL earlier in the call > > > chain. > > > > I haven't looked at the call but the symptoms sound like that either we > > are enabling the interrupts too early or there's some kind of locking > > problem so that an other cpu doesn't see the change. > > I agree with Kalle that there's a different underlying bug involved, and > (my conclusion:) we shouldn't whack-a-mole the NULL pointer without > addressing the underlying problem. > > Looking a bit closer (and without much other context to go on): I believe > that one potential underlying problem is the complete lack of locking > between cfg80211 entry points (such as mwifiex_add_virtual_intf() or > mwifiex_cfg80211_change_virtual_intf()) and most stuff in the main loop > (mwifiex_main_process()). The former call sites only hold the wiphy > lock, and the latter tends to ... mostly not hold any locks, but rely on > sequentialization with itself, and using its |main_proc_lock| for setup > and teardown. It's all really bad and ready to fall down like a house of > cards at any moment. Unfortunately, no one has spent time on > rearchitecting this driver. > > So it's possible that mwifiex_process_event() (mwifiex_get_priv_by_id() > / mwifiex_get_priv()) is getting a hold of a not-fully-initialized > 'priv' structure. > > BTW, in case I can reproduce and poke at your scenario, what exactly > is your test case? Are you just starting / killing / restarting hostapd > in a loop? I am running plain wpa_supplicant -i mlan0 with this config: network={ ssid="somessid" mode=2 frequency=2412 key_mgmt=WPA-PSK WPA-PSK-SHA256 proto=RSN group=CCMP pairwise=CCMP psk="12345678" } wait for the AP to be established, <ctrl-c> wpa_supplicant and start it again. It doesn't seem to be a locking problem, see the patch below which fixes my problem. At some point during incoming events the correct adapter->priv[] is selected based on bss_num and bss_type. when adapter->priv[0] is used for AP mode then an incoming event with type MWIFIEX_BSS_TYPE_STA leads to adapter->priv[1] being picked which is unused and doesn't have a wiphy attached to it. Sascha -------------------------8<---------------------------- From 3357963821294ff7de26259a154a1cb5bab760cb Mon Sep 17 00:00:00 2001 From: Sascha Hauer <s.hauer@pengutronix.de> Date: Tue, 18 Jun 2024 12:20:20 +0200 Subject: [PATCH] mwifiex: Do not return unused priv in mwifiex_get_priv_by_id() mwifiex_get_priv_by_id() returns the priv pointer corresponding to the bss_num and bss_type, but without checking if the priv is actually currently in use. Unused priv pointers do not have a wiphy attached to them which can lead to NULL pointer dereferences further down the callstack. Fix this by returning only used priv pointers which have priv->bss_mode set to something else than NL80211_IFTYPE_UNSPECIFIED. Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> --- drivers/net/wireless/marvell/mwifiex/main.h | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/net/wireless/marvell/mwifiex/main.h b/drivers/net/wireless/marvell/mwifiex/main.h index 175882485a195..c5164ae41b547 100644 --- a/drivers/net/wireless/marvell/mwifiex/main.h +++ b/drivers/net/wireless/marvell/mwifiex/main.h @@ -1287,6 +1287,9 @@ mwifiex_get_priv_by_id(struct mwifiex_adapter *adapter, for (i = 0; i < adapter->priv_num; i++) { if (adapter->priv[i]) { + if (adapter->priv[i]->bss_mode == NL80211_IFTYPE_UNSPECIFIED) + continue; + if ((adapter->priv[i]->bss_num == bss_num) && (adapter->priv[i]->bss_type == bss_type)) break;
On Fri, Jun 21, 2024 at 11:07:27AM +0200, Sascha Hauer wrote: > I am running plain wpa_supplicant -i mlan0 with this config: > > network={ > ssid="somessid" > mode=2 > frequency=2412 > key_mgmt=WPA-PSK WPA-PSK-SHA256 > proto=RSN > group=CCMP > pairwise=CCMP > psk="12345678" > } > > wait for the AP to be established, <ctrl-c> wpa_supplicant and start it > again. Thanks. I still can't reproduce, but that's OK. From your fuller description now, it seems it depends on the particulars of the bss indices in use, and maybe my device/firmware is behaving differently. Anyway, your current description and patch below make a lot more sense. > It doesn't seem to be a locking problem, see the patch below which fixes > my problem. Sure. It's probably harder to trigger real issues out of some of these kinds of race conditions, and the logical problem below sounds a lot more likely. > At some point during incoming events the correct adapter->priv[] > is selected based on bss_num and bss_type. when adapter->priv[0] is used > for AP mode then an incoming event with type MWIFIEX_BSS_TYPE_STA leads > to adapter->priv[1] being picked which is unused and doesn't have a > wiphy attached to it. Ack, that makes a lot of sense -- although I think it also implies either you're getting some kind of corrupt event buffer from your device too, or there's something else logically weird with your firmware when you're receiving MWIFIEX_BSS_TYPE_STA events for a *_AP interface. (Or maybe that's also racy, as you're changing the virtual interface from STA to AP? Not sure.) It also highlights that I find this fallback code from mwifiex_process_event() weird: /* Get BSS number and corresponding priv */ priv = mwifiex_get_priv_by_id(adapter, EVENT_GET_BSS_NUM(eventcause), EVENT_GET_BSS_TYPE(eventcause)); if (!priv) priv = mwifiex_get_priv(adapter, MWIFIEX_BSS_ROLE_ANY); ^^ // if we didn't match the right BSS/event // metadata, we'll deliver the event to an // arbitrary interface? But I don't think that's your problem. And at least so far, I don't think the AP and STA event IDs collide in any way, so I don't think we're likely to end up misbehaving even if something is misdelievered. > > -------------------------8<---------------------------- > > From 3357963821294ff7de26259a154a1cb5bab760cb Mon Sep 17 00:00:00 2001 > From: Sascha Hauer <s.hauer@pengutronix.de> > Date: Tue, 18 Jun 2024 12:20:20 +0200 > Subject: [PATCH] mwifiex: Do not return unused priv in > mwifiex_get_priv_by_id() > > mwifiex_get_priv_by_id() returns the priv pointer corresponding to the > bss_num and bss_type, but without checking if the priv is actually > currently in use. > Unused priv pointers do not have a wiphy attached to them which can lead > to NULL pointer dereferences further down the callstack. > Fix this by returning only used priv pointers which have priv->bss_mode > set to something else than NL80211_IFTYPE_UNSPECIFIED. > > Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> > --- > drivers/net/wireless/marvell/mwifiex/main.h | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/drivers/net/wireless/marvell/mwifiex/main.h b/drivers/net/wireless/marvell/mwifiex/main.h > index 175882485a195..c5164ae41b547 100644 > --- a/drivers/net/wireless/marvell/mwifiex/main.h > +++ b/drivers/net/wireless/marvell/mwifiex/main.h > @@ -1287,6 +1287,9 @@ mwifiex_get_priv_by_id(struct mwifiex_adapter *adapter, > > for (i = 0; i < adapter->priv_num; i++) { > if (adapter->priv[i]) { > + if (adapter->priv[i]->bss_mode == NL80211_IFTYPE_UNSPECIFIED) > + continue; > + > if ((adapter->priv[i]->bss_num == bss_num) && > (adapter->priv[i]->bss_type == bss_type)) > break; Acked-by: Brian Norris <briannorris@chromium.org> Something about this formatting didn't get properly picked up by patchwork though, so you'll need to resubmit. Feel free to carry the above tag with it. Brian
Brian Norris <briannorris@chromium.org> writes: > On Fri, Jun 21, 2024 at 11:07:27AM +0200, Sascha Hauer wrote: >> I am running plain wpa_supplicant -i mlan0 with this config: >> >> network={ >> ssid="somessid" >> mode=2 >> frequency=2412 >> key_mgmt=WPA-PSK WPA-PSK-SHA256 >> proto=RSN >> group=CCMP >> pairwise=CCMP >> psk="12345678" >> } >> >> wait for the AP to be established, <ctrl-c> wpa_supplicant and start it >> again. > > Thanks. I still can't reproduce, but that's OK. From your fuller > description now, it seems it depends on the particulars of the bss > indices in use, and maybe my device/firmware is behaving differently. > Anyway, your current description and patch below make a lot more sense. Indeed, this makes a lot more sense. Thank you both for getting to the bottom of it!
Hello Sascha, On Fri, Jun 21, 2024 at 11:07:27AM +0200, Sascha Hauer wrote: > On Thu, Jun 20, 2024 at 12:48:01PM -0700, Brian Norris wrote: > > Hi Sascha, > > > > On Wed, Jun 19, 2024 at 11:05:28AM +0300, Kalle Valo wrote: > > > Sascha Hauer <s.hauer@pengutronix.de> writes: > > > > > > > When an Access Point is repeatedly started it happens that the > > > > interrupts handler is called with priv->wdev.wiphy being NULL, but > > > > dereferenced in mwifiex_parse_single_response_buf() resulting in: > > > > > > > > | Unable to handle kernel NULL pointer dereference at virtual address 0000000000000140 > > ... > > > > | pc : mwifiex_get_cfp+0xd8/0x15c [mwifiex] > > > > | lr : mwifiex_get_cfp+0x34/0x15c [mwifiex] > > > > | sp : ffff8000818b3a70 > > > > | x29: ffff8000818b3a70 x28: ffff000006bfd8a5 x27: 0000000000000004 > > > > | x26: 000000000000002c x25: 0000000000001511 x24: 0000000002e86bc9 > > > > | x23: ffff000006bfd996 x22: 0000000000000004 x21: ffff000007bec000 > > > > | x20: 000000000000002c x19: 0000000000000000 x18: 0000000000000000 > > > > | x17: 000000040044ffff x16: 00500072b5503510 x15: ccc283740681e517 > > > > | x14: 0201000101006d15 x13: 0000000002e8ff43 x12: 002c01000000ffb1 > > > > | x11: 0100000000000000 x10: 02e8ff43002c0100 x9 : 0000ffb100100157 > > > > | x8 : ffff000003d20000 x7 : 00000000000002f1 x6 : 00000000ffffe124 > > > > | x5 : 0000000000000001 x4 : 0000000000000003 x3 : 0000000000000000 > > > > | x2 : 0000000000000000 x1 : 0001000000011001 x0 : 0000000000000000 > > > > | Call trace: > > > > | mwifiex_get_cfp+0xd8/0x15c [mwifiex] > > > > | mwifiex_parse_single_response_buf+0x1d0/0x504 [mwifiex] > > > > | mwifiex_handle_event_ext_scan_report+0x19c/0x2f8 [mwifiex] > > > > | mwifiex_process_sta_event+0x298/0xf0c [mwifiex] > > > > | mwifiex_process_event+0x110/0x238 [mwifiex] > > > > | mwifiex_main_process+0x428/0xa44 [mwifiex] > > > > | mwifiex_sdio_interrupt+0x64/0x12c [mwifiex_sdio] > > > > | process_sdio_pending_irqs+0x64/0x1b8 > > > > | sdio_irq_work+0x4c/0x7c > > > > | process_one_work+0x148/0x2a0 > > > > | worker_thread+0x2fc/0x40c > > > > | kthread+0x110/0x114 > > > > | ret_from_fork+0x10/0x20 > > > > | Code: a94153f3 a8c37bfd d50323bf d65f03c0 (f940a000) > > > > | ---[ end trace 0000000000000000 ]--- > > > > > > > > Fix this by adding a NULL check before dereferencing this pointer. > > > > > > > > Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> > > > > > > > > --- > > > > > > > > This is the most obvious fix for this problem, but I am not sure if we > > > > might want to catch priv->wdev.wiphy being NULL earlier in the call > > > > chain. > > > > > > I haven't looked at the call but the symptoms sound like that either we > > > are enabling the interrupts too early or there's some kind of locking > > > problem so that an other cpu doesn't see the change. > > > > I agree with Kalle that there's a different underlying bug involved, and > > (my conclusion:) we shouldn't whack-a-mole the NULL pointer without > > addressing the underlying problem. > > > > Looking a bit closer (and without much other context to go on): I believe > > that one potential underlying problem is the complete lack of locking > > between cfg80211 entry points (such as mwifiex_add_virtual_intf() or > > mwifiex_cfg80211_change_virtual_intf()) and most stuff in the main loop > > (mwifiex_main_process()). The former call sites only hold the wiphy > > lock, and the latter tends to ... mostly not hold any locks, but rely on > > sequentialization with itself, and using its |main_proc_lock| for setup > > and teardown. It's all really bad and ready to fall down like a house of > > cards at any moment. Unfortunately, no one has spent time on > > rearchitecting this driver. > > > > So it's possible that mwifiex_process_event() (mwifiex_get_priv_by_id() > > / mwifiex_get_priv()) is getting a hold of a not-fully-initialized > > 'priv' structure. > > > > BTW, in case I can reproduce and poke at your scenario, what exactly > > is your test case? Are you just starting / killing / restarting hostapd > > in a loop? > > I am running plain wpa_supplicant -i mlan0 with this config: > > network={ > ssid="somessid" > mode=2 > frequency=2412 > key_mgmt=WPA-PSK WPA-PSK-SHA256 > proto=RSN > group=CCMP > pairwise=CCMP > psk="12345678" > } > > wait for the AP to be established, <ctrl-c> wpa_supplicant and start it > again. > > It doesn't seem to be a locking problem, see the patch below which fixes > my problem. At some point during incoming events the correct adapter->priv[] > is selected based on bss_num and bss_type. when adapter->priv[0] is used > for AP mode then an incoming event with type MWIFIEX_BSS_TYPE_STA leads > to adapter->priv[1] being picked which is unused and doesn't have a > wiphy attached to it. > > Sascha > > -------------------------8<---------------------------- > > From 3357963821294ff7de26259a154a1cb5bab760cb Mon Sep 17 00:00:00 2001 > From: Sascha Hauer <s.hauer@pengutronix.de> > Date: Tue, 18 Jun 2024 12:20:20 +0200 > Subject: [PATCH] mwifiex: Do not return unused priv in > mwifiex_get_priv_by_id() > > mwifiex_get_priv_by_id() returns the priv pointer corresponding to the > bss_num and bss_type, but without checking if the priv is actually > currently in use. > Unused priv pointers do not have a wiphy attached to them which can lead > to NULL pointer dereferences further down the callstack. > Fix this by returning only used priv pointers which have priv->bss_mode > set to something else than NL80211_IFTYPE_UNSPECIFIED. > > Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> > --- > drivers/net/wireless/marvell/mwifiex/main.h | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/drivers/net/wireless/marvell/mwifiex/main.h b/drivers/net/wireless/marvell/mwifiex/main.h > index 175882485a195..c5164ae41b547 100644 > --- a/drivers/net/wireless/marvell/mwifiex/main.h > +++ b/drivers/net/wireless/marvell/mwifiex/main.h > @@ -1287,6 +1287,9 @@ mwifiex_get_priv_by_id(struct mwifiex_adapter *adapter, > > for (i = 0; i < adapter->priv_num; i++) { > if (adapter->priv[i]) { > + if (adapter->priv[i]->bss_mode == NL80211_IFTYPE_UNSPECIFIED) > + continue; > + > if ((adapter->priv[i]->bss_num == bss_num) && > (adapter->priv[i]->bss_type == bss_type)) > break; The change looks fine to me. I am just wondering if this might have anything to do with commit a17b9f590f6e ("wifi: mwifiex: Fix interface type change"), maybe you have already looked into it? Before that commit a wrong priv pointer was picked (different scenario from what you describe however). Francesco
On Mon, Jun 24, 2024 at 06:20:14PM +0200, Francesco Dolcini wrote: > Hello Sascha, > > On Fri, Jun 21, 2024 at 11:07:27AM +0200, Sascha Hauer wrote: > > On Thu, Jun 20, 2024 at 12:48:01PM -0700, Brian Norris wrote: > > > Hi Sascha, > > > > > > On Wed, Jun 19, 2024 at 11:05:28AM +0300, Kalle Valo wrote: > > > > Sascha Hauer <s.hauer@pengutronix.de> writes: > > > > > > > > > When an Access Point is repeatedly started it happens that the > > > > > interrupts handler is called with priv->wdev.wiphy being NULL, but > > > > > dereferenced in mwifiex_parse_single_response_buf() resulting in: > > > > > > > > > > | Unable to handle kernel NULL pointer dereference at virtual address 0000000000000140 > > > ... > > > > > | pc : mwifiex_get_cfp+0xd8/0x15c [mwifiex] > > > > > | lr : mwifiex_get_cfp+0x34/0x15c [mwifiex] > > > > > | sp : ffff8000818b3a70 > > > > > | x29: ffff8000818b3a70 x28: ffff000006bfd8a5 x27: 0000000000000004 > > > > > | x26: 000000000000002c x25: 0000000000001511 x24: 0000000002e86bc9 > > > > > | x23: ffff000006bfd996 x22: 0000000000000004 x21: ffff000007bec000 > > > > > | x20: 000000000000002c x19: 0000000000000000 x18: 0000000000000000 > > > > > | x17: 000000040044ffff x16: 00500072b5503510 x15: ccc283740681e517 > > > > > | x14: 0201000101006d15 x13: 0000000002e8ff43 x12: 002c01000000ffb1 > > > > > | x11: 0100000000000000 x10: 02e8ff43002c0100 x9 : 0000ffb100100157 > > > > > | x8 : ffff000003d20000 x7 : 00000000000002f1 x6 : 00000000ffffe124 > > > > > | x5 : 0000000000000001 x4 : 0000000000000003 x3 : 0000000000000000 > > > > > | x2 : 0000000000000000 x1 : 0001000000011001 x0 : 0000000000000000 > > > > > | Call trace: > > > > > | mwifiex_get_cfp+0xd8/0x15c [mwifiex] > > > > > | mwifiex_parse_single_response_buf+0x1d0/0x504 [mwifiex] > > > > > | mwifiex_handle_event_ext_scan_report+0x19c/0x2f8 [mwifiex] > > > > > | mwifiex_process_sta_event+0x298/0xf0c [mwifiex] > > > > > | mwifiex_process_event+0x110/0x238 [mwifiex] > > > > > | mwifiex_main_process+0x428/0xa44 [mwifiex] > > > > > | mwifiex_sdio_interrupt+0x64/0x12c [mwifiex_sdio] > > > > > | process_sdio_pending_irqs+0x64/0x1b8 > > > > > | sdio_irq_work+0x4c/0x7c > > > > > | process_one_work+0x148/0x2a0 > > > > > | worker_thread+0x2fc/0x40c > > > > > | kthread+0x110/0x114 > > > > > | ret_from_fork+0x10/0x20 > > > > > | Code: a94153f3 a8c37bfd d50323bf d65f03c0 (f940a000) > > > > > | ---[ end trace 0000000000000000 ]--- > > > > > > > > > > Fix this by adding a NULL check before dereferencing this pointer. > > > > > > > > > > Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> > > > > > > > > > > --- > > > > > > > > > > This is the most obvious fix for this problem, but I am not sure if we > > > > > might want to catch priv->wdev.wiphy being NULL earlier in the call > > > > > chain. > > > > > > > > I haven't looked at the call but the symptoms sound like that either we > > > > are enabling the interrupts too early or there's some kind of locking > > > > problem so that an other cpu doesn't see the change. > > > > > > I agree with Kalle that there's a different underlying bug involved, and > > > (my conclusion:) we shouldn't whack-a-mole the NULL pointer without > > > addressing the underlying problem. > > > > > > Looking a bit closer (and without much other context to go on): I believe > > > that one potential underlying problem is the complete lack of locking > > > between cfg80211 entry points (such as mwifiex_add_virtual_intf() or > > > mwifiex_cfg80211_change_virtual_intf()) and most stuff in the main loop > > > (mwifiex_main_process()). The former call sites only hold the wiphy > > > lock, and the latter tends to ... mostly not hold any locks, but rely on > > > sequentialization with itself, and using its |main_proc_lock| for setup > > > and teardown. It's all really bad and ready to fall down like a house of > > > cards at any moment. Unfortunately, no one has spent time on > > > rearchitecting this driver. > > > > > > So it's possible that mwifiex_process_event() (mwifiex_get_priv_by_id() > > > / mwifiex_get_priv()) is getting a hold of a not-fully-initialized > > > 'priv' structure. > > > > > > BTW, in case I can reproduce and poke at your scenario, what exactly > > > is your test case? Are you just starting / killing / restarting hostapd > > > in a loop? > > > > I am running plain wpa_supplicant -i mlan0 with this config: > > > > network={ > > ssid="somessid" > > mode=2 > > frequency=2412 > > key_mgmt=WPA-PSK WPA-PSK-SHA256 > > proto=RSN > > group=CCMP > > pairwise=CCMP > > psk="12345678" > > } > > > > wait for the AP to be established, <ctrl-c> wpa_supplicant and start it > > again. > > > > It doesn't seem to be a locking problem, see the patch below which fixes > > my problem. At some point during incoming events the correct adapter->priv[] > > is selected based on bss_num and bss_type. when adapter->priv[0] is used > > for AP mode then an incoming event with type MWIFIEX_BSS_TYPE_STA leads > > to adapter->priv[1] being picked which is unused and doesn't have a > > wiphy attached to it. > > > > Sascha > > > > -------------------------8<---------------------------- > > > > From 3357963821294ff7de26259a154a1cb5bab760cb Mon Sep 17 00:00:00 2001 > > From: Sascha Hauer <s.hauer@pengutronix.de> > > Date: Tue, 18 Jun 2024 12:20:20 +0200 > > Subject: [PATCH] mwifiex: Do not return unused priv in > > mwifiex_get_priv_by_id() > > > > mwifiex_get_priv_by_id() returns the priv pointer corresponding to the > > bss_num and bss_type, but without checking if the priv is actually > > currently in use. > > Unused priv pointers do not have a wiphy attached to them which can lead > > to NULL pointer dereferences further down the callstack. > > Fix this by returning only used priv pointers which have priv->bss_mode > > set to something else than NL80211_IFTYPE_UNSPECIFIED. > > > > Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> > > --- > > drivers/net/wireless/marvell/mwifiex/main.h | 3 +++ > > 1 file changed, 3 insertions(+) > > > > diff --git a/drivers/net/wireless/marvell/mwifiex/main.h b/drivers/net/wireless/marvell/mwifiex/main.h > > index 175882485a195..c5164ae41b547 100644 > > --- a/drivers/net/wireless/marvell/mwifiex/main.h > > +++ b/drivers/net/wireless/marvell/mwifiex/main.h > > @@ -1287,6 +1287,9 @@ mwifiex_get_priv_by_id(struct mwifiex_adapter *adapter, > > > > for (i = 0; i < adapter->priv_num; i++) { > > if (adapter->priv[i]) { > > + if (adapter->priv[i]->bss_mode == NL80211_IFTYPE_UNSPECIFIED) > > + continue; > > + > > if ((adapter->priv[i]->bss_num == bss_num) && > > (adapter->priv[i]->bss_type == bss_type)) > > break; > > The change looks fine to me. > > I am just wondering if this might have anything to do with > commit a17b9f590f6e ("wifi: mwifiex: Fix interface type change"), maybe you have already looked into it? It looks somehow related. I just gave it a try and it at least doesn't fix my issue. Sascha
On Tue, Jul 02, 2024 at 03:32:16PM +0200, Sascha Hauer wrote: > On Mon, Jun 24, 2024 at 06:20:14PM +0200, Francesco Dolcini wrote: > > On Fri, Jun 21, 2024 at 11:07:27AM +0200, Sascha Hauer wrote: ... > > > > > > From 3357963821294ff7de26259a154a1cb5bab760cb Mon Sep 17 00:00:00 2001 > > > From: Sascha Hauer <s.hauer@pengutronix.de> > > > Date: Tue, 18 Jun 2024 12:20:20 +0200 > > > Subject: [PATCH] mwifiex: Do not return unused priv in > > > mwifiex_get_priv_by_id() > > > > > > mwifiex_get_priv_by_id() returns the priv pointer corresponding to the > > > bss_num and bss_type, but without checking if the priv is actually > > > currently in use. > > > Unused priv pointers do not have a wiphy attached to them which can lead > > > to NULL pointer dereferences further down the callstack. > > > Fix this by returning only used priv pointers which have priv->bss_mode > > > set to something else than NL80211_IFTYPE_UNSPECIFIED. > > > > > > Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> Reviewed-by: Francesco Dolcini <francesco.dolcini@toradex.com> I guess you should send it as a proper patch? ... > > I am just wondering if this might have anything to do with > > commit a17b9f590f6e ("wifi: mwifiex: Fix interface type change"), maybe you have already looked into it? > > It looks somehow related. I just gave it a try and it at least doesn't > fix my issue. thanks for trying it out. Francesco
On Tue, Jul 02, 2024 at 10:36:11PM +0200, Francesco Dolcini wrote: > On Tue, Jul 02, 2024 at 03:32:16PM +0200, Sascha Hauer wrote: > > On Mon, Jun 24, 2024 at 06:20:14PM +0200, Francesco Dolcini wrote: > > > On Fri, Jun 21, 2024 at 11:07:27AM +0200, Sascha Hauer wrote: > > ... > > > > > > > > > From 3357963821294ff7de26259a154a1cb5bab760cb Mon Sep 17 00:00:00 2001 > > > > From: Sascha Hauer <s.hauer@pengutronix.de> > > > > Date: Tue, 18 Jun 2024 12:20:20 +0200 > > > > Subject: [PATCH] mwifiex: Do not return unused priv in > > > > mwifiex_get_priv_by_id() > > > > > > > > mwifiex_get_priv_by_id() returns the priv pointer corresponding to the > > > > bss_num and bss_type, but without checking if the priv is actually > > > > currently in use. > > > > Unused priv pointers do not have a wiphy attached to them which can lead > > > > to NULL pointer dereferences further down the callstack. > > > > Fix this by returning only used priv pointers which have priv->bss_mode > > > > set to something else than NL80211_IFTYPE_UNSPECIFIED. > > > > > > > > Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> > > Reviewed-by: Francesco Dolcini <francesco.dolcini@toradex.com> > > I guess you should send it as a proper patch? Yes, I just did this. Sascha
diff --git a/drivers/net/wireless/marvell/mwifiex/scan.c b/drivers/net/wireless/marvell/mwifiex/scan.c index 0326b121747cb..6211ae97da4b2 100644 --- a/drivers/net/wireless/marvell/mwifiex/scan.c +++ b/drivers/net/wireless/marvell/mwifiex/scan.c @@ -1843,6 +1843,9 @@ mwifiex_parse_single_response_buf(struct mwifiex_private *priv, u8 **bss_info, return 0; } + if (!priv->wdev.wiphy) + return 0; + band = BAND_G; if (radio_type) band = mwifiex_radio_type_to_band(*radio_type &
When an Access Point is repeatedly started it happens that the interrupts handler is called with priv->wdev.wiphy being NULL, but dereferenced in mwifiex_parse_single_response_buf() resulting in: | Unable to handle kernel NULL pointer dereference at virtual address 0000000000000140 | Mem abort info: | ESR = 0x0000000096000004 | EC = 0x25: DABT (current EL), IL = 32 bits | SET = 0, FnV = 0 | EA = 0, S1PTW = 0 | FSC = 0x04: level 0 translation fault | Data abort info: | ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000 | CM = 0, WnR = 0, TnD = 0, TagAccess = 0 | GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 | user pgtable: 4k pages, 48-bit VAs, pgdp=0000000046d96000 | [0000000000000140] pgd=0000000000000000, p4d=0000000000000000 | Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP | Modules linked in: caam_jr caamhash_desc spidev caamalg_desc crypto_engine authenc libdes mwifiex_sdio mwifiex crct10dif_ce cdc_acm onboard_usb_hub fsl_imx8_ddr_perf imx8m_ddrc rtc_ds1307 lm75 rtc_snvs imx_sdma caam imx8mm_thermal spi_imx error imx_cpufreq_dt fuse ip_tables x_tables ipv6 | CPU: 0 PID: 8 Comm: kworker/0:1 Not tainted 6.9.0-00007-g937242013fce-dirty #18 | Hardware name: somemachine (DT) | Workqueue: events sdio_irq_work | pstate: 00000005 (nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) | pc : mwifiex_get_cfp+0xd8/0x15c [mwifiex] | lr : mwifiex_get_cfp+0x34/0x15c [mwifiex] | sp : ffff8000818b3a70 | x29: ffff8000818b3a70 x28: ffff000006bfd8a5 x27: 0000000000000004 | x26: 000000000000002c x25: 0000000000001511 x24: 0000000002e86bc9 | x23: ffff000006bfd996 x22: 0000000000000004 x21: ffff000007bec000 | x20: 000000000000002c x19: 0000000000000000 x18: 0000000000000000 | x17: 000000040044ffff x16: 00500072b5503510 x15: ccc283740681e517 | x14: 0201000101006d15 x13: 0000000002e8ff43 x12: 002c01000000ffb1 | x11: 0100000000000000 x10: 02e8ff43002c0100 x9 : 0000ffb100100157 | x8 : ffff000003d20000 x7 : 00000000000002f1 x6 : 00000000ffffe124 | x5 : 0000000000000001 x4 : 0000000000000003 x3 : 0000000000000000 | x2 : 0000000000000000 x1 : 0001000000011001 x0 : 0000000000000000 | Call trace: | mwifiex_get_cfp+0xd8/0x15c [mwifiex] | mwifiex_parse_single_response_buf+0x1d0/0x504 [mwifiex] | mwifiex_handle_event_ext_scan_report+0x19c/0x2f8 [mwifiex] | mwifiex_process_sta_event+0x298/0xf0c [mwifiex] | mwifiex_process_event+0x110/0x238 [mwifiex] | mwifiex_main_process+0x428/0xa44 [mwifiex] | mwifiex_sdio_interrupt+0x64/0x12c [mwifiex_sdio] | process_sdio_pending_irqs+0x64/0x1b8 | sdio_irq_work+0x4c/0x7c | process_one_work+0x148/0x2a0 | worker_thread+0x2fc/0x40c | kthread+0x110/0x114 | ret_from_fork+0x10/0x20 | Code: a94153f3 a8c37bfd d50323bf d65f03c0 (f940a000) | ---[ end trace 0000000000000000 ]--- Fix this by adding a NULL check before dereferencing this pointer. Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> --- This is the most obvious fix for this problem, but I am not sure if we might want to catch priv->wdev.wiphy being NULL earlier in the call chain. --- drivers/net/wireless/marvell/mwifiex/scan.c | 3 +++ 1 file changed, 3 insertions(+)