diff mbox series

[RFC] mwifiex: Fix NULL pointer deref

Message ID 20240619070824.537856-1-s.hauer@pengutronix.de (mailing list archive)
State RFC
Delegated to: Kalle Valo
Headers show
Series [RFC] mwifiex: Fix NULL pointer deref | expand

Commit Message

Sascha Hauer June 19, 2024, 7:08 a.m. UTC
When an Access Point is repeatedly started it happens that the
interrupts handler is called with priv->wdev.wiphy being NULL, but
dereferenced in mwifiex_parse_single_response_buf() resulting in:

| Unable to handle kernel NULL pointer dereference at virtual address 0000000000000140
| Mem abort info:
|   ESR = 0x0000000096000004
|   EC = 0x25: DABT (current EL), IL = 32 bits
|   SET = 0, FnV = 0
|   EA = 0, S1PTW = 0
|   FSC = 0x04: level 0 translation fault
| Data abort info:
|   ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
|   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
|   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
| user pgtable: 4k pages, 48-bit VAs, pgdp=0000000046d96000
| [0000000000000140] pgd=0000000000000000, p4d=0000000000000000
| Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
| Modules linked in: caam_jr caamhash_desc spidev caamalg_desc crypto_engine authenc libdes mwifiex_sdio mwifiex crct10dif_ce cdc_acm onboard_usb_hub fsl_imx8_ddr_perf imx8m_ddrc rtc_ds1307 lm75 rtc_snvs imx_sdma caam imx8mm_thermal spi_imx error imx_cpufreq_dt fuse ip_tables x_tables ipv6
| CPU: 0 PID: 8 Comm: kworker/0:1 Not tainted 6.9.0-00007-g937242013fce-dirty #18
| Hardware name: somemachine (DT)
| Workqueue: events sdio_irq_work
| pstate: 00000005 (nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
| pc : mwifiex_get_cfp+0xd8/0x15c [mwifiex]
| lr : mwifiex_get_cfp+0x34/0x15c [mwifiex]
| sp : ffff8000818b3a70
| x29: ffff8000818b3a70 x28: ffff000006bfd8a5 x27: 0000000000000004
| x26: 000000000000002c x25: 0000000000001511 x24: 0000000002e86bc9
| x23: ffff000006bfd996 x22: 0000000000000004 x21: ffff000007bec000
| x20: 000000000000002c x19: 0000000000000000 x18: 0000000000000000
| x17: 000000040044ffff x16: 00500072b5503510 x15: ccc283740681e517
| x14: 0201000101006d15 x13: 0000000002e8ff43 x12: 002c01000000ffb1
| x11: 0100000000000000 x10: 02e8ff43002c0100 x9 : 0000ffb100100157
| x8 : ffff000003d20000 x7 : 00000000000002f1 x6 : 00000000ffffe124
| x5 : 0000000000000001 x4 : 0000000000000003 x3 : 0000000000000000
| x2 : 0000000000000000 x1 : 0001000000011001 x0 : 0000000000000000
| Call trace:
|  mwifiex_get_cfp+0xd8/0x15c [mwifiex]
|  mwifiex_parse_single_response_buf+0x1d0/0x504 [mwifiex]
|  mwifiex_handle_event_ext_scan_report+0x19c/0x2f8 [mwifiex]
|  mwifiex_process_sta_event+0x298/0xf0c [mwifiex]
|  mwifiex_process_event+0x110/0x238 [mwifiex]
|  mwifiex_main_process+0x428/0xa44 [mwifiex]
|  mwifiex_sdio_interrupt+0x64/0x12c [mwifiex_sdio]
|  process_sdio_pending_irqs+0x64/0x1b8
|  sdio_irq_work+0x4c/0x7c
|  process_one_work+0x148/0x2a0
|  worker_thread+0x2fc/0x40c
|  kthread+0x110/0x114
|  ret_from_fork+0x10/0x20
| Code: a94153f3 a8c37bfd d50323bf d65f03c0 (f940a000)
| ---[ end trace 0000000000000000 ]---

Fix this by adding a NULL check before dereferencing this pointer.

Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>

---

This is the most obvious fix for this problem, but I am not sure if we
might want to catch priv->wdev.wiphy being NULL earlier in the call
chain.
---
 drivers/net/wireless/marvell/mwifiex/scan.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Kalle Valo June 19, 2024, 8:05 a.m. UTC | #1
Sascha Hauer <s.hauer@pengutronix.de> writes:

> When an Access Point is repeatedly started it happens that the
> interrupts handler is called with priv->wdev.wiphy being NULL, but
> dereferenced in mwifiex_parse_single_response_buf() resulting in:
>
> | Unable to handle kernel NULL pointer dereference at virtual address 0000000000000140
> | Mem abort info:
> |   ESR = 0x0000000096000004
> |   EC = 0x25: DABT (current EL), IL = 32 bits
> |   SET = 0, FnV = 0
> |   EA = 0, S1PTW = 0
> |   FSC = 0x04: level 0 translation fault
> | Data abort info:
> |   ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
> |   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
> |   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
> | user pgtable: 4k pages, 48-bit VAs, pgdp=0000000046d96000
> | [0000000000000140] pgd=0000000000000000, p4d=0000000000000000
> | Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
> | Modules linked in: caam_jr caamhash_desc spidev caamalg_desc crypto_engine authenc libdes mwifiex_sdio mwifiex crct10dif_ce cdc_acm onboard_usb_hub fsl_imx8_ddr_perf imx8m_ddrc rtc_ds1307 lm75 rtc_snvs imx_sdma caam imx8mm_thermal spi_imx error imx_cpufreq_dt fuse ip_tables x_tables ipv6
> | CPU: 0 PID: 8 Comm: kworker/0:1 Not tainted 6.9.0-00007-g937242013fce-dirty #18
> | Hardware name: somemachine (DT)
> | Workqueue: events sdio_irq_work
> | pstate: 00000005 (nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> | pc : mwifiex_get_cfp+0xd8/0x15c [mwifiex]
> | lr : mwifiex_get_cfp+0x34/0x15c [mwifiex]
> | sp : ffff8000818b3a70
> | x29: ffff8000818b3a70 x28: ffff000006bfd8a5 x27: 0000000000000004
> | x26: 000000000000002c x25: 0000000000001511 x24: 0000000002e86bc9
> | x23: ffff000006bfd996 x22: 0000000000000004 x21: ffff000007bec000
> | x20: 000000000000002c x19: 0000000000000000 x18: 0000000000000000
> | x17: 000000040044ffff x16: 00500072b5503510 x15: ccc283740681e517
> | x14: 0201000101006d15 x13: 0000000002e8ff43 x12: 002c01000000ffb1
> | x11: 0100000000000000 x10: 02e8ff43002c0100 x9 : 0000ffb100100157
> | x8 : ffff000003d20000 x7 : 00000000000002f1 x6 : 00000000ffffe124
> | x5 : 0000000000000001 x4 : 0000000000000003 x3 : 0000000000000000
> | x2 : 0000000000000000 x1 : 0001000000011001 x0 : 0000000000000000
> | Call trace:
> |  mwifiex_get_cfp+0xd8/0x15c [mwifiex]
> |  mwifiex_parse_single_response_buf+0x1d0/0x504 [mwifiex]
> |  mwifiex_handle_event_ext_scan_report+0x19c/0x2f8 [mwifiex]
> |  mwifiex_process_sta_event+0x298/0xf0c [mwifiex]
> |  mwifiex_process_event+0x110/0x238 [mwifiex]
> |  mwifiex_main_process+0x428/0xa44 [mwifiex]
> |  mwifiex_sdio_interrupt+0x64/0x12c [mwifiex_sdio]
> |  process_sdio_pending_irqs+0x64/0x1b8
> |  sdio_irq_work+0x4c/0x7c
> |  process_one_work+0x148/0x2a0
> |  worker_thread+0x2fc/0x40c
> |  kthread+0x110/0x114
> |  ret_from_fork+0x10/0x20
> | Code: a94153f3 a8c37bfd d50323bf d65f03c0 (f940a000)
> | ---[ end trace 0000000000000000 ]---
>
> Fix this by adding a NULL check before dereferencing this pointer.
>
> Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
>
> ---
>
> This is the most obvious fix for this problem, but I am not sure if we
> might want to catch priv->wdev.wiphy being NULL earlier in the call
> chain.

I haven't looked at the call but the symptoms sound like that either we
are enabling the interrupts too early or there's some kind of locking
problem so that an other cpu doesn't see the change.
Brian Norris June 20, 2024, 7:48 p.m. UTC | #2
Hi Sascha,

On Wed, Jun 19, 2024 at 11:05:28AM +0300, Kalle Valo wrote:
> Sascha Hauer <s.hauer@pengutronix.de> writes:
> 
> > When an Access Point is repeatedly started it happens that the
> > interrupts handler is called with priv->wdev.wiphy being NULL, but
> > dereferenced in mwifiex_parse_single_response_buf() resulting in:
> >
> > | Unable to handle kernel NULL pointer dereference at virtual address 0000000000000140
...
> > | pc : mwifiex_get_cfp+0xd8/0x15c [mwifiex]
> > | lr : mwifiex_get_cfp+0x34/0x15c [mwifiex]
> > | sp : ffff8000818b3a70
> > | x29: ffff8000818b3a70 x28: ffff000006bfd8a5 x27: 0000000000000004
> > | x26: 000000000000002c x25: 0000000000001511 x24: 0000000002e86bc9
> > | x23: ffff000006bfd996 x22: 0000000000000004 x21: ffff000007bec000
> > | x20: 000000000000002c x19: 0000000000000000 x18: 0000000000000000
> > | x17: 000000040044ffff x16: 00500072b5503510 x15: ccc283740681e517
> > | x14: 0201000101006d15 x13: 0000000002e8ff43 x12: 002c01000000ffb1
> > | x11: 0100000000000000 x10: 02e8ff43002c0100 x9 : 0000ffb100100157
> > | x8 : ffff000003d20000 x7 : 00000000000002f1 x6 : 00000000ffffe124
> > | x5 : 0000000000000001 x4 : 0000000000000003 x3 : 0000000000000000
> > | x2 : 0000000000000000 x1 : 0001000000011001 x0 : 0000000000000000
> > | Call trace:
> > |  mwifiex_get_cfp+0xd8/0x15c [mwifiex]
> > |  mwifiex_parse_single_response_buf+0x1d0/0x504 [mwifiex]
> > |  mwifiex_handle_event_ext_scan_report+0x19c/0x2f8 [mwifiex]
> > |  mwifiex_process_sta_event+0x298/0xf0c [mwifiex]
> > |  mwifiex_process_event+0x110/0x238 [mwifiex]
> > |  mwifiex_main_process+0x428/0xa44 [mwifiex]
> > |  mwifiex_sdio_interrupt+0x64/0x12c [mwifiex_sdio]
> > |  process_sdio_pending_irqs+0x64/0x1b8
> > |  sdio_irq_work+0x4c/0x7c
> > |  process_one_work+0x148/0x2a0
> > |  worker_thread+0x2fc/0x40c
> > |  kthread+0x110/0x114
> > |  ret_from_fork+0x10/0x20
> > | Code: a94153f3 a8c37bfd d50323bf d65f03c0 (f940a000)
> > | ---[ end trace 0000000000000000 ]---
> >
> > Fix this by adding a NULL check before dereferencing this pointer.
> >
> > Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
> >
> > ---
> >
> > This is the most obvious fix for this problem, but I am not sure if we
> > might want to catch priv->wdev.wiphy being NULL earlier in the call
> > chain.
> 
> I haven't looked at the call but the symptoms sound like that either we
> are enabling the interrupts too early or there's some kind of locking
> problem so that an other cpu doesn't see the change.

I agree with Kalle that there's a different underlying bug involved, and
(my conclusion:) we shouldn't whack-a-mole the NULL pointer without
addressing the underlying problem.

Looking a bit closer (and without much other context to go on): I believe 
that one potential underlying problem is the complete lack of locking
between cfg80211 entry points (such as mwifiex_add_virtual_intf() or
mwifiex_cfg80211_change_virtual_intf()) and most stuff in the main loop
(mwifiex_main_process()). The former call sites only hold the wiphy
lock, and the latter tends to ... mostly not hold any locks, but rely on
sequentialization with itself, and using its |main_proc_lock| for setup
and teardown. It's all really bad and ready to fall down like a house of
cards at any moment. Unfortunately, no one has spent time on
rearchitecting this driver.

So it's possible that mwifiex_process_event() (mwifiex_get_priv_by_id()
/ mwifiex_get_priv()) is getting a hold of a not-fully-initialized
'priv' structure.

BTW, in case I can reproduce and poke at your scenario, what exactly
is your test case? Are you just starting / killing / restarting hostapd
in a loop? Are you running a full network manager stack that's doing
something more complex (e.g., initiating scans)? Can you reproduce with
some more targeted set of `iw` commands? (`iw phy ... interface add ...;
iw dev ... del`) Is there anything else interesting in the dmesg logs?
(Some of the worst behaviors in this driver come when we see command
timeouts and mwifiex_reinit_sw(), for example.)

Or barring that, can you get some kind of trace of the nl80211 command
sequence, so it's clearer which command(s) are involved leading up to
the problem?

Brian
Sascha Hauer June 21, 2024, 9:07 a.m. UTC | #3
On Thu, Jun 20, 2024 at 12:48:01PM -0700, Brian Norris wrote:
> Hi Sascha,
> 
> On Wed, Jun 19, 2024 at 11:05:28AM +0300, Kalle Valo wrote:
> > Sascha Hauer <s.hauer@pengutronix.de> writes:
> > 
> > > When an Access Point is repeatedly started it happens that the
> > > interrupts handler is called with priv->wdev.wiphy being NULL, but
> > > dereferenced in mwifiex_parse_single_response_buf() resulting in:
> > >
> > > | Unable to handle kernel NULL pointer dereference at virtual address 0000000000000140
> ...
> > > | pc : mwifiex_get_cfp+0xd8/0x15c [mwifiex]
> > > | lr : mwifiex_get_cfp+0x34/0x15c [mwifiex]
> > > | sp : ffff8000818b3a70
> > > | x29: ffff8000818b3a70 x28: ffff000006bfd8a5 x27: 0000000000000004
> > > | x26: 000000000000002c x25: 0000000000001511 x24: 0000000002e86bc9
> > > | x23: ffff000006bfd996 x22: 0000000000000004 x21: ffff000007bec000
> > > | x20: 000000000000002c x19: 0000000000000000 x18: 0000000000000000
> > > | x17: 000000040044ffff x16: 00500072b5503510 x15: ccc283740681e517
> > > | x14: 0201000101006d15 x13: 0000000002e8ff43 x12: 002c01000000ffb1
> > > | x11: 0100000000000000 x10: 02e8ff43002c0100 x9 : 0000ffb100100157
> > > | x8 : ffff000003d20000 x7 : 00000000000002f1 x6 : 00000000ffffe124
> > > | x5 : 0000000000000001 x4 : 0000000000000003 x3 : 0000000000000000
> > > | x2 : 0000000000000000 x1 : 0001000000011001 x0 : 0000000000000000
> > > | Call trace:
> > > |  mwifiex_get_cfp+0xd8/0x15c [mwifiex]
> > > |  mwifiex_parse_single_response_buf+0x1d0/0x504 [mwifiex]
> > > |  mwifiex_handle_event_ext_scan_report+0x19c/0x2f8 [mwifiex]
> > > |  mwifiex_process_sta_event+0x298/0xf0c [mwifiex]
> > > |  mwifiex_process_event+0x110/0x238 [mwifiex]
> > > |  mwifiex_main_process+0x428/0xa44 [mwifiex]
> > > |  mwifiex_sdio_interrupt+0x64/0x12c [mwifiex_sdio]
> > > |  process_sdio_pending_irqs+0x64/0x1b8
> > > |  sdio_irq_work+0x4c/0x7c
> > > |  process_one_work+0x148/0x2a0
> > > |  worker_thread+0x2fc/0x40c
> > > |  kthread+0x110/0x114
> > > |  ret_from_fork+0x10/0x20
> > > | Code: a94153f3 a8c37bfd d50323bf d65f03c0 (f940a000)
> > > | ---[ end trace 0000000000000000 ]---
> > >
> > > Fix this by adding a NULL check before dereferencing this pointer.
> > >
> > > Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
> > >
> > > ---
> > >
> > > This is the most obvious fix for this problem, but I am not sure if we
> > > might want to catch priv->wdev.wiphy being NULL earlier in the call
> > > chain.
> > 
> > I haven't looked at the call but the symptoms sound like that either we
> > are enabling the interrupts too early or there's some kind of locking
> > problem so that an other cpu doesn't see the change.
> 
> I agree with Kalle that there's a different underlying bug involved, and
> (my conclusion:) we shouldn't whack-a-mole the NULL pointer without
> addressing the underlying problem.
> 
> Looking a bit closer (and without much other context to go on): I believe 
> that one potential underlying problem is the complete lack of locking
> between cfg80211 entry points (such as mwifiex_add_virtual_intf() or
> mwifiex_cfg80211_change_virtual_intf()) and most stuff in the main loop
> (mwifiex_main_process()). The former call sites only hold the wiphy
> lock, and the latter tends to ... mostly not hold any locks, but rely on
> sequentialization with itself, and using its |main_proc_lock| for setup
> and teardown. It's all really bad and ready to fall down like a house of
> cards at any moment. Unfortunately, no one has spent time on
> rearchitecting this driver.
> 
> So it's possible that mwifiex_process_event() (mwifiex_get_priv_by_id()
> / mwifiex_get_priv()) is getting a hold of a not-fully-initialized
> 'priv' structure.
> 
> BTW, in case I can reproduce and poke at your scenario, what exactly
> is your test case? Are you just starting / killing / restarting hostapd
> in a loop?

I am running plain wpa_supplicant -i mlan0 with this config:

network={
        ssid="somessid"
        mode=2
        frequency=2412
        key_mgmt=WPA-PSK WPA-PSK-SHA256
        proto=RSN
        group=CCMP
        pairwise=CCMP
        psk="12345678"
}

wait for the AP to be established, <ctrl-c> wpa_supplicant and start it
again.

It doesn't seem to be a locking problem, see the patch below which fixes
my problem. At some point during incoming events the correct adapter->priv[]
is selected based on bss_num and bss_type. when adapter->priv[0] is used
for AP mode then an incoming event with type MWIFIEX_BSS_TYPE_STA leads
to adapter->priv[1] being picked which is unused and doesn't have a
wiphy attached to it.

Sascha

-------------------------8<----------------------------

From 3357963821294ff7de26259a154a1cb5bab760cb Mon Sep 17 00:00:00 2001
From: Sascha Hauer <s.hauer@pengutronix.de>
Date: Tue, 18 Jun 2024 12:20:20 +0200
Subject: [PATCH] mwifiex: Do not return unused priv in
 mwifiex_get_priv_by_id()

mwifiex_get_priv_by_id() returns the priv pointer corresponding to the
bss_num and bss_type, but without checking if the priv is actually
currently in use.
Unused priv pointers do not have a wiphy attached to them which can lead
to NULL pointer dereferences further down the callstack.
Fix this by returning only used priv pointers which have priv->bss_mode
set to something else than NL80211_IFTYPE_UNSPECIFIED.

Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
---
 drivers/net/wireless/marvell/mwifiex/main.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/wireless/marvell/mwifiex/main.h b/drivers/net/wireless/marvell/mwifiex/main.h
index 175882485a195..c5164ae41b547 100644
--- a/drivers/net/wireless/marvell/mwifiex/main.h
+++ b/drivers/net/wireless/marvell/mwifiex/main.h
@@ -1287,6 +1287,9 @@ mwifiex_get_priv_by_id(struct mwifiex_adapter *adapter,
 
 	for (i = 0; i < adapter->priv_num; i++) {
 		if (adapter->priv[i]) {
+			if (adapter->priv[i]->bss_mode == NL80211_IFTYPE_UNSPECIFIED)
+				continue;
+
 			if ((adapter->priv[i]->bss_num == bss_num) &&
 			    (adapter->priv[i]->bss_type == bss_type))
 				break;
Brian Norris June 21, 2024, 10:50 p.m. UTC | #4
On Fri, Jun 21, 2024 at 11:07:27AM +0200, Sascha Hauer wrote:
> I am running plain wpa_supplicant -i mlan0 with this config:
> 
> network={
>         ssid="somessid"
>         mode=2
>         frequency=2412
>         key_mgmt=WPA-PSK WPA-PSK-SHA256
>         proto=RSN
>         group=CCMP
>         pairwise=CCMP
>         psk="12345678"
> }
> 
> wait for the AP to be established, <ctrl-c> wpa_supplicant and start it
> again.

Thanks. I still can't reproduce, but that's OK. From your fuller
description now, it seems it depends on the particulars of the bss
indices in use, and maybe my device/firmware is behaving differently.
Anyway, your current description and patch below make a lot more sense.

> It doesn't seem to be a locking problem, see the patch below which fixes
> my problem.

Sure. It's probably harder to trigger real issues out of some of these
kinds of race conditions, and the logical problem below sounds a lot
more likely.

> At some point during incoming events the correct adapter->priv[]
> is selected based on bss_num and bss_type. when adapter->priv[0] is used
> for AP mode then an incoming event with type MWIFIEX_BSS_TYPE_STA leads
> to adapter->priv[1] being picked which is unused and doesn't have a
> wiphy attached to it.

Ack, that makes a lot of sense -- although I think it also implies
either you're getting some kind of corrupt event buffer from your device
too, or there's something else logically weird with your firmware when
you're receiving MWIFIEX_BSS_TYPE_STA events for a *_AP interface. (Or
maybe that's also racy, as you're changing the virtual interface from
STA to AP? Not sure.)

It also highlights that I find this fallback code from
mwifiex_process_event() weird:

        /* Get BSS number and corresponding priv */
        priv = mwifiex_get_priv_by_id(adapter, EVENT_GET_BSS_NUM(eventcause),
                                      EVENT_GET_BSS_TYPE(eventcause));
        if (!priv)
                priv = mwifiex_get_priv(adapter, MWIFIEX_BSS_ROLE_ANY);
                       ^^ // if we didn't match the right BSS/event
			  // metadata, we'll deliver the event to an
			  // arbitrary interface?

But I don't think that's your problem. And at least so far, I don't
think the AP and STA event IDs collide in any way, so I don't think
we're likely to end up misbehaving even if something is misdelievered.

> 
> -------------------------8<----------------------------
> 
> From 3357963821294ff7de26259a154a1cb5bab760cb Mon Sep 17 00:00:00 2001
> From: Sascha Hauer <s.hauer@pengutronix.de>
> Date: Tue, 18 Jun 2024 12:20:20 +0200
> Subject: [PATCH] mwifiex: Do not return unused priv in
>  mwifiex_get_priv_by_id()
> 
> mwifiex_get_priv_by_id() returns the priv pointer corresponding to the
> bss_num and bss_type, but without checking if the priv is actually
> currently in use.
> Unused priv pointers do not have a wiphy attached to them which can lead
> to NULL pointer dereferences further down the callstack.
> Fix this by returning only used priv pointers which have priv->bss_mode
> set to something else than NL80211_IFTYPE_UNSPECIFIED.
> 
> Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
> ---
>  drivers/net/wireless/marvell/mwifiex/main.h | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/net/wireless/marvell/mwifiex/main.h b/drivers/net/wireless/marvell/mwifiex/main.h
> index 175882485a195..c5164ae41b547 100644
> --- a/drivers/net/wireless/marvell/mwifiex/main.h
> +++ b/drivers/net/wireless/marvell/mwifiex/main.h
> @@ -1287,6 +1287,9 @@ mwifiex_get_priv_by_id(struct mwifiex_adapter *adapter,
>  
>  	for (i = 0; i < adapter->priv_num; i++) {
>  		if (adapter->priv[i]) {
> +			if (adapter->priv[i]->bss_mode == NL80211_IFTYPE_UNSPECIFIED)
> +				continue;
> +
>  			if ((adapter->priv[i]->bss_num == bss_num) &&
>  			    (adapter->priv[i]->bss_type == bss_type))
>  				break;

Acked-by: Brian Norris <briannorris@chromium.org>

Something about this formatting didn't get properly picked up by
patchwork though, so you'll need to resubmit. Feel free to carry the
above tag with it.

Brian
Kalle Valo June 24, 2024, 11:07 a.m. UTC | #5
Brian Norris <briannorris@chromium.org> writes:

> On Fri, Jun 21, 2024 at 11:07:27AM +0200, Sascha Hauer wrote:
>> I am running plain wpa_supplicant -i mlan0 with this config:
>> 
>> network={
>>         ssid="somessid"
>>         mode=2
>>         frequency=2412
>>         key_mgmt=WPA-PSK WPA-PSK-SHA256
>>         proto=RSN
>>         group=CCMP
>>         pairwise=CCMP
>>         psk="12345678"
>> }
>> 
>> wait for the AP to be established, <ctrl-c> wpa_supplicant and start it
>> again.
>
> Thanks. I still can't reproduce, but that's OK. From your fuller
> description now, it seems it depends on the particulars of the bss
> indices in use, and maybe my device/firmware is behaving differently.
> Anyway, your current description and patch below make a lot more sense.

Indeed, this makes a lot more sense. Thank you both for getting to the
bottom of it!
Francesco Dolcini June 24, 2024, 4:20 p.m. UTC | #6
Hello Sascha,

On Fri, Jun 21, 2024 at 11:07:27AM +0200, Sascha Hauer wrote:
> On Thu, Jun 20, 2024 at 12:48:01PM -0700, Brian Norris wrote:
> > Hi Sascha,
> > 
> > On Wed, Jun 19, 2024 at 11:05:28AM +0300, Kalle Valo wrote:
> > > Sascha Hauer <s.hauer@pengutronix.de> writes:
> > > 
> > > > When an Access Point is repeatedly started it happens that the
> > > > interrupts handler is called with priv->wdev.wiphy being NULL, but
> > > > dereferenced in mwifiex_parse_single_response_buf() resulting in:
> > > >
> > > > | Unable to handle kernel NULL pointer dereference at virtual address 0000000000000140
> > ...
> > > > | pc : mwifiex_get_cfp+0xd8/0x15c [mwifiex]
> > > > | lr : mwifiex_get_cfp+0x34/0x15c [mwifiex]
> > > > | sp : ffff8000818b3a70
> > > > | x29: ffff8000818b3a70 x28: ffff000006bfd8a5 x27: 0000000000000004
> > > > | x26: 000000000000002c x25: 0000000000001511 x24: 0000000002e86bc9
> > > > | x23: ffff000006bfd996 x22: 0000000000000004 x21: ffff000007bec000
> > > > | x20: 000000000000002c x19: 0000000000000000 x18: 0000000000000000
> > > > | x17: 000000040044ffff x16: 00500072b5503510 x15: ccc283740681e517
> > > > | x14: 0201000101006d15 x13: 0000000002e8ff43 x12: 002c01000000ffb1
> > > > | x11: 0100000000000000 x10: 02e8ff43002c0100 x9 : 0000ffb100100157
> > > > | x8 : ffff000003d20000 x7 : 00000000000002f1 x6 : 00000000ffffe124
> > > > | x5 : 0000000000000001 x4 : 0000000000000003 x3 : 0000000000000000
> > > > | x2 : 0000000000000000 x1 : 0001000000011001 x0 : 0000000000000000
> > > > | Call trace:
> > > > |  mwifiex_get_cfp+0xd8/0x15c [mwifiex]
> > > > |  mwifiex_parse_single_response_buf+0x1d0/0x504 [mwifiex]
> > > > |  mwifiex_handle_event_ext_scan_report+0x19c/0x2f8 [mwifiex]
> > > > |  mwifiex_process_sta_event+0x298/0xf0c [mwifiex]
> > > > |  mwifiex_process_event+0x110/0x238 [mwifiex]
> > > > |  mwifiex_main_process+0x428/0xa44 [mwifiex]
> > > > |  mwifiex_sdio_interrupt+0x64/0x12c [mwifiex_sdio]
> > > > |  process_sdio_pending_irqs+0x64/0x1b8
> > > > |  sdio_irq_work+0x4c/0x7c
> > > > |  process_one_work+0x148/0x2a0
> > > > |  worker_thread+0x2fc/0x40c
> > > > |  kthread+0x110/0x114
> > > > |  ret_from_fork+0x10/0x20
> > > > | Code: a94153f3 a8c37bfd d50323bf d65f03c0 (f940a000)
> > > > | ---[ end trace 0000000000000000 ]---
> > > >
> > > > Fix this by adding a NULL check before dereferencing this pointer.
> > > >
> > > > Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
> > > >
> > > > ---
> > > >
> > > > This is the most obvious fix for this problem, but I am not sure if we
> > > > might want to catch priv->wdev.wiphy being NULL earlier in the call
> > > > chain.
> > > 
> > > I haven't looked at the call but the symptoms sound like that either we
> > > are enabling the interrupts too early or there's some kind of locking
> > > problem so that an other cpu doesn't see the change.
> > 
> > I agree with Kalle that there's a different underlying bug involved, and
> > (my conclusion:) we shouldn't whack-a-mole the NULL pointer without
> > addressing the underlying problem.
> > 
> > Looking a bit closer (and without much other context to go on): I believe 
> > that one potential underlying problem is the complete lack of locking
> > between cfg80211 entry points (such as mwifiex_add_virtual_intf() or
> > mwifiex_cfg80211_change_virtual_intf()) and most stuff in the main loop
> > (mwifiex_main_process()). The former call sites only hold the wiphy
> > lock, and the latter tends to ... mostly not hold any locks, but rely on
> > sequentialization with itself, and using its |main_proc_lock| for setup
> > and teardown. It's all really bad and ready to fall down like a house of
> > cards at any moment. Unfortunately, no one has spent time on
> > rearchitecting this driver.
> > 
> > So it's possible that mwifiex_process_event() (mwifiex_get_priv_by_id()
> > / mwifiex_get_priv()) is getting a hold of a not-fully-initialized
> > 'priv' structure.
> > 
> > BTW, in case I can reproduce and poke at your scenario, what exactly
> > is your test case? Are you just starting / killing / restarting hostapd
> > in a loop?
> 
> I am running plain wpa_supplicant -i mlan0 with this config:
> 
> network={
>         ssid="somessid"
>         mode=2
>         frequency=2412
>         key_mgmt=WPA-PSK WPA-PSK-SHA256
>         proto=RSN
>         group=CCMP
>         pairwise=CCMP
>         psk="12345678"
> }
> 
> wait for the AP to be established, <ctrl-c> wpa_supplicant and start it
> again.
> 
> It doesn't seem to be a locking problem, see the patch below which fixes
> my problem. At some point during incoming events the correct adapter->priv[]
> is selected based on bss_num and bss_type. when adapter->priv[0] is used
> for AP mode then an incoming event with type MWIFIEX_BSS_TYPE_STA leads
> to adapter->priv[1] being picked which is unused and doesn't have a
> wiphy attached to it.
> 
> Sascha
> 
> -------------------------8<----------------------------
> 
> From 3357963821294ff7de26259a154a1cb5bab760cb Mon Sep 17 00:00:00 2001
> From: Sascha Hauer <s.hauer@pengutronix.de>
> Date: Tue, 18 Jun 2024 12:20:20 +0200
> Subject: [PATCH] mwifiex: Do not return unused priv in
>  mwifiex_get_priv_by_id()
> 
> mwifiex_get_priv_by_id() returns the priv pointer corresponding to the
> bss_num and bss_type, but without checking if the priv is actually
> currently in use.
> Unused priv pointers do not have a wiphy attached to them which can lead
> to NULL pointer dereferences further down the callstack.
> Fix this by returning only used priv pointers which have priv->bss_mode
> set to something else than NL80211_IFTYPE_UNSPECIFIED.
> 
> Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
> ---
>  drivers/net/wireless/marvell/mwifiex/main.h | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/net/wireless/marvell/mwifiex/main.h b/drivers/net/wireless/marvell/mwifiex/main.h
> index 175882485a195..c5164ae41b547 100644
> --- a/drivers/net/wireless/marvell/mwifiex/main.h
> +++ b/drivers/net/wireless/marvell/mwifiex/main.h
> @@ -1287,6 +1287,9 @@ mwifiex_get_priv_by_id(struct mwifiex_adapter *adapter,
>  
>  	for (i = 0; i < adapter->priv_num; i++) {
>  		if (adapter->priv[i]) {
> +			if (adapter->priv[i]->bss_mode == NL80211_IFTYPE_UNSPECIFIED)
> +				continue;
> +
>  			if ((adapter->priv[i]->bss_num == bss_num) &&
>  			    (adapter->priv[i]->bss_type == bss_type))
>  				break;

The change looks fine to me.

I am just wondering if this might have anything to do with
commit a17b9f590f6e ("wifi: mwifiex: Fix interface type change"), maybe you have already looked into it?
Before that commit a wrong priv pointer was picked (different scenario from what you describe however).

Francesco
Sascha Hauer July 2, 2024, 1:32 p.m. UTC | #7
On Mon, Jun 24, 2024 at 06:20:14PM +0200, Francesco Dolcini wrote:
> Hello Sascha,
> 
> On Fri, Jun 21, 2024 at 11:07:27AM +0200, Sascha Hauer wrote:
> > On Thu, Jun 20, 2024 at 12:48:01PM -0700, Brian Norris wrote:
> > > Hi Sascha,
> > > 
> > > On Wed, Jun 19, 2024 at 11:05:28AM +0300, Kalle Valo wrote:
> > > > Sascha Hauer <s.hauer@pengutronix.de> writes:
> > > > 
> > > > > When an Access Point is repeatedly started it happens that the
> > > > > interrupts handler is called with priv->wdev.wiphy being NULL, but
> > > > > dereferenced in mwifiex_parse_single_response_buf() resulting in:
> > > > >
> > > > > | Unable to handle kernel NULL pointer dereference at virtual address 0000000000000140
> > > ...
> > > > > | pc : mwifiex_get_cfp+0xd8/0x15c [mwifiex]
> > > > > | lr : mwifiex_get_cfp+0x34/0x15c [mwifiex]
> > > > > | sp : ffff8000818b3a70
> > > > > | x29: ffff8000818b3a70 x28: ffff000006bfd8a5 x27: 0000000000000004
> > > > > | x26: 000000000000002c x25: 0000000000001511 x24: 0000000002e86bc9
> > > > > | x23: ffff000006bfd996 x22: 0000000000000004 x21: ffff000007bec000
> > > > > | x20: 000000000000002c x19: 0000000000000000 x18: 0000000000000000
> > > > > | x17: 000000040044ffff x16: 00500072b5503510 x15: ccc283740681e517
> > > > > | x14: 0201000101006d15 x13: 0000000002e8ff43 x12: 002c01000000ffb1
> > > > > | x11: 0100000000000000 x10: 02e8ff43002c0100 x9 : 0000ffb100100157
> > > > > | x8 : ffff000003d20000 x7 : 00000000000002f1 x6 : 00000000ffffe124
> > > > > | x5 : 0000000000000001 x4 : 0000000000000003 x3 : 0000000000000000
> > > > > | x2 : 0000000000000000 x1 : 0001000000011001 x0 : 0000000000000000
> > > > > | Call trace:
> > > > > |  mwifiex_get_cfp+0xd8/0x15c [mwifiex]
> > > > > |  mwifiex_parse_single_response_buf+0x1d0/0x504 [mwifiex]
> > > > > |  mwifiex_handle_event_ext_scan_report+0x19c/0x2f8 [mwifiex]
> > > > > |  mwifiex_process_sta_event+0x298/0xf0c [mwifiex]
> > > > > |  mwifiex_process_event+0x110/0x238 [mwifiex]
> > > > > |  mwifiex_main_process+0x428/0xa44 [mwifiex]
> > > > > |  mwifiex_sdio_interrupt+0x64/0x12c [mwifiex_sdio]
> > > > > |  process_sdio_pending_irqs+0x64/0x1b8
> > > > > |  sdio_irq_work+0x4c/0x7c
> > > > > |  process_one_work+0x148/0x2a0
> > > > > |  worker_thread+0x2fc/0x40c
> > > > > |  kthread+0x110/0x114
> > > > > |  ret_from_fork+0x10/0x20
> > > > > | Code: a94153f3 a8c37bfd d50323bf d65f03c0 (f940a000)
> > > > > | ---[ end trace 0000000000000000 ]---
> > > > >
> > > > > Fix this by adding a NULL check before dereferencing this pointer.
> > > > >
> > > > > Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
> > > > >
> > > > > ---
> > > > >
> > > > > This is the most obvious fix for this problem, but I am not sure if we
> > > > > might want to catch priv->wdev.wiphy being NULL earlier in the call
> > > > > chain.
> > > > 
> > > > I haven't looked at the call but the symptoms sound like that either we
> > > > are enabling the interrupts too early or there's some kind of locking
> > > > problem so that an other cpu doesn't see the change.
> > > 
> > > I agree with Kalle that there's a different underlying bug involved, and
> > > (my conclusion:) we shouldn't whack-a-mole the NULL pointer without
> > > addressing the underlying problem.
> > > 
> > > Looking a bit closer (and without much other context to go on): I believe 
> > > that one potential underlying problem is the complete lack of locking
> > > between cfg80211 entry points (such as mwifiex_add_virtual_intf() or
> > > mwifiex_cfg80211_change_virtual_intf()) and most stuff in the main loop
> > > (mwifiex_main_process()). The former call sites only hold the wiphy
> > > lock, and the latter tends to ... mostly not hold any locks, but rely on
> > > sequentialization with itself, and using its |main_proc_lock| for setup
> > > and teardown. It's all really bad and ready to fall down like a house of
> > > cards at any moment. Unfortunately, no one has spent time on
> > > rearchitecting this driver.
> > > 
> > > So it's possible that mwifiex_process_event() (mwifiex_get_priv_by_id()
> > > / mwifiex_get_priv()) is getting a hold of a not-fully-initialized
> > > 'priv' structure.
> > > 
> > > BTW, in case I can reproduce and poke at your scenario, what exactly
> > > is your test case? Are you just starting / killing / restarting hostapd
> > > in a loop?
> > 
> > I am running plain wpa_supplicant -i mlan0 with this config:
> > 
> > network={
> >         ssid="somessid"
> >         mode=2
> >         frequency=2412
> >         key_mgmt=WPA-PSK WPA-PSK-SHA256
> >         proto=RSN
> >         group=CCMP
> >         pairwise=CCMP
> >         psk="12345678"
> > }
> > 
> > wait for the AP to be established, <ctrl-c> wpa_supplicant and start it
> > again.
> > 
> > It doesn't seem to be a locking problem, see the patch below which fixes
> > my problem. At some point during incoming events the correct adapter->priv[]
> > is selected based on bss_num and bss_type. when adapter->priv[0] is used
> > for AP mode then an incoming event with type MWIFIEX_BSS_TYPE_STA leads
> > to adapter->priv[1] being picked which is unused and doesn't have a
> > wiphy attached to it.
> > 
> > Sascha
> > 
> > -------------------------8<----------------------------
> > 
> > From 3357963821294ff7de26259a154a1cb5bab760cb Mon Sep 17 00:00:00 2001
> > From: Sascha Hauer <s.hauer@pengutronix.de>
> > Date: Tue, 18 Jun 2024 12:20:20 +0200
> > Subject: [PATCH] mwifiex: Do not return unused priv in
> >  mwifiex_get_priv_by_id()
> > 
> > mwifiex_get_priv_by_id() returns the priv pointer corresponding to the
> > bss_num and bss_type, but without checking if the priv is actually
> > currently in use.
> > Unused priv pointers do not have a wiphy attached to them which can lead
> > to NULL pointer dereferences further down the callstack.
> > Fix this by returning only used priv pointers which have priv->bss_mode
> > set to something else than NL80211_IFTYPE_UNSPECIFIED.
> > 
> > Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
> > ---
> >  drivers/net/wireless/marvell/mwifiex/main.h | 3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > diff --git a/drivers/net/wireless/marvell/mwifiex/main.h b/drivers/net/wireless/marvell/mwifiex/main.h
> > index 175882485a195..c5164ae41b547 100644
> > --- a/drivers/net/wireless/marvell/mwifiex/main.h
> > +++ b/drivers/net/wireless/marvell/mwifiex/main.h
> > @@ -1287,6 +1287,9 @@ mwifiex_get_priv_by_id(struct mwifiex_adapter *adapter,
> >  
> >  	for (i = 0; i < adapter->priv_num; i++) {
> >  		if (adapter->priv[i]) {
> > +			if (adapter->priv[i]->bss_mode == NL80211_IFTYPE_UNSPECIFIED)
> > +				continue;
> > +
> >  			if ((adapter->priv[i]->bss_num == bss_num) &&
> >  			    (adapter->priv[i]->bss_type == bss_type))
> >  				break;
> 
> The change looks fine to me.
> 
> I am just wondering if this might have anything to do with
> commit a17b9f590f6e ("wifi: mwifiex: Fix interface type change"), maybe you have already looked into it?

It looks somehow related. I just gave it a try and it at least doesn't
fix my issue.

Sascha
Francesco Dolcini July 2, 2024, 8:36 p.m. UTC | #8
On Tue, Jul 02, 2024 at 03:32:16PM +0200, Sascha Hauer wrote:
> On Mon, Jun 24, 2024 at 06:20:14PM +0200, Francesco Dolcini wrote:
> > On Fri, Jun 21, 2024 at 11:07:27AM +0200, Sascha Hauer wrote:

...

> > > 
> > > From 3357963821294ff7de26259a154a1cb5bab760cb Mon Sep 17 00:00:00 2001
> > > From: Sascha Hauer <s.hauer@pengutronix.de>
> > > Date: Tue, 18 Jun 2024 12:20:20 +0200
> > > Subject: [PATCH] mwifiex: Do not return unused priv in
> > >  mwifiex_get_priv_by_id()
> > > 
> > > mwifiex_get_priv_by_id() returns the priv pointer corresponding to the
> > > bss_num and bss_type, but without checking if the priv is actually
> > > currently in use.
> > > Unused priv pointers do not have a wiphy attached to them which can lead
> > > to NULL pointer dereferences further down the callstack.
> > > Fix this by returning only used priv pointers which have priv->bss_mode
> > > set to something else than NL80211_IFTYPE_UNSPECIFIED.
> > > 
> > > Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>

Reviewed-by: Francesco Dolcini <francesco.dolcini@toradex.com>

I guess you should send it as a proper patch?


...

> > I am just wondering if this might have anything to do with
> > commit a17b9f590f6e ("wifi: mwifiex: Fix interface type change"), maybe you have already looked into it?
> 
> It looks somehow related. I just gave it a try and it at least doesn't
> fix my issue.

thanks for trying it out.


Francesco
Sascha Hauer July 3, 2024, 7:24 a.m. UTC | #9
On Tue, Jul 02, 2024 at 10:36:11PM +0200, Francesco Dolcini wrote:
> On Tue, Jul 02, 2024 at 03:32:16PM +0200, Sascha Hauer wrote:
> > On Mon, Jun 24, 2024 at 06:20:14PM +0200, Francesco Dolcini wrote:
> > > On Fri, Jun 21, 2024 at 11:07:27AM +0200, Sascha Hauer wrote:
> 
> ...
> 
> > > > 
> > > > From 3357963821294ff7de26259a154a1cb5bab760cb Mon Sep 17 00:00:00 2001
> > > > From: Sascha Hauer <s.hauer@pengutronix.de>
> > > > Date: Tue, 18 Jun 2024 12:20:20 +0200
> > > > Subject: [PATCH] mwifiex: Do not return unused priv in
> > > >  mwifiex_get_priv_by_id()
> > > > 
> > > > mwifiex_get_priv_by_id() returns the priv pointer corresponding to the
> > > > bss_num and bss_type, but without checking if the priv is actually
> > > > currently in use.
> > > > Unused priv pointers do not have a wiphy attached to them which can lead
> > > > to NULL pointer dereferences further down the callstack.
> > > > Fix this by returning only used priv pointers which have priv->bss_mode
> > > > set to something else than NL80211_IFTYPE_UNSPECIFIED.
> > > > 
> > > > Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
> 
> Reviewed-by: Francesco Dolcini <francesco.dolcini@toradex.com>
> 
> I guess you should send it as a proper patch?

Yes, I just did this.

Sascha
diff mbox series

Patch

diff --git a/drivers/net/wireless/marvell/mwifiex/scan.c b/drivers/net/wireless/marvell/mwifiex/scan.c
index 0326b121747cb..6211ae97da4b2 100644
--- a/drivers/net/wireless/marvell/mwifiex/scan.c
+++ b/drivers/net/wireless/marvell/mwifiex/scan.c
@@ -1843,6 +1843,9 @@  mwifiex_parse_single_response_buf(struct mwifiex_private *priv, u8 **bss_info,
 			return 0;
 		}
 
+		if (!priv->wdev.wiphy)
+			return 0;
+
 		band = BAND_G;
 		if (radio_type)
 			band = mwifiex_radio_type_to_band(*radio_type &