Message ID | 20170525001119.64791-5-briannorris@chromium.org (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Kalle Valo |
Headers | show |
Brian Norris <briannorris@chromium.org> writes: > In general, it's helpful to use the same code for device removal as for > device reset, as this tends to have fewer bugs. Let's move the wiphy > unregistration code into the common reset and removal code. > > In particular, it's very hard to properly handle the reset sequence when > something fails. Currently, if mwifiex_reinit_sw() fails, we've failed > to unregister the associated wiphy, and so running something as simple > as "iw phy" can trigger an OOPS, as the wiphy still has hooks back into > freed mwifiex data structures. For example, KASAN complained: > > [... see reset fail for other reasons ...] > [ 1184.821158] mwifiex_pcie 0000:01:00.0: info: dnld wifi firmware from 174948 bytes > [ 1186.870914] mwifiex_pcie 0000:01:00.0: info: FW download over, size 608396 bytes > [ 1187.685990] mwifiex_pcie 0000:01:00.0: WLAN FW is active > [ 1187.692673] mwifiex_pcie 0000:01:00.0: cmd_wait_q terminated: -512 > [ 1187.699075] mwifiex_pcie 0000:01:00.0: info: _mwifiex_fw_dpc: unregister device > [ 1187.713476] mwifiex: Failed to bring up adapter: -5 > [ 1187.718644] mwifiex_pcie 0000:01:00.0: reinit failed: -5 > > [... run `iw phy` ...] > [ 1212.902419] ================================================================== > [ 1212.909806] BUG: KASAN: use-after-free in mwifiex_cfg80211_get_antenna+0x54/0xfc [mwifiex] at addr ffffffc0ad1a8028 > [ 1212.920246] Read of size 1 by task iw/3127 > [...] > [ 1212.934946] page dumped because: kasan: bad access detected > [...] > [ 1212.950665] Call trace: > [ 1212.953148] [<ffffffc00020a69c>] dump_backtrace+0x0/0x190 > [ 1212.958572] [<ffffffc00020a96c>] show_stack+0x20/0x28 > [ 1212.963648] [<ffffffc0005ce18c>] dump_stack+0xa4/0xcc > [ 1212.968723] [<ffffffc0003c4430>] kasan_report+0x378/0x500 > [ 1212.974140] [<ffffffc0003c3358>] __asan_load1+0x44/0x4c > [ 1212.979462] [<ffffffbffc2e8360>] mwifiex_cfg80211_get_antenna+0x54/0xfc [mwifiex] > [ 1212.987131] [<ffffffbffc084fc4>] nl80211_send_wiphy+0x75c/0x2de0 [cfg80211] > [ 1212.994246] [<ffffffbffc094f60>] nl80211_dump_wiphy+0x32c/0x438 [cfg80211] > [ 1213.001149] [<ffffffc000ab6404>] genl_lock_dumpit+0x48/0x64 > [ 1213.006746] [<ffffffc000ab3474>] netlink_dump+0x178/0x398 > [ 1213.012171] [<ffffffc000ab3d18>] __netlink_dump_start+0x1bc/0x260 > [...] > > This all goes away if we just tear down the wiphy on the way down, and > set it back up if/when we bring the device back up. I don't know exactly what kind of reset this is about, but isn't this a quite intrusive solution? For example, phy name changes because of this?
On Thu, Jun 01, 2017 at 12:15:45PM +0300, Kalle Valo wrote: > Brian Norris <briannorris@chromium.org> writes: > > > In general, it's helpful to use the same code for device removal as for > > device reset, as this tends to have fewer bugs. Let's move the wiphy > > unregistration code into the common reset and removal code. > > > > In particular, it's very hard to properly handle the reset sequence when > > something fails. Currently, if mwifiex_reinit_sw() fails, we've failed > > to unregister the associated wiphy, and so running something as simple > > as "iw phy" can trigger an OOPS, as the wiphy still has hooks back into > > freed mwifiex data structures. For example, KASAN complained: > > > > [... see reset fail for other reasons ...] > > [ 1184.821158] mwifiex_pcie 0000:01:00.0: info: dnld wifi firmware from 174948 bytes > > [ 1186.870914] mwifiex_pcie 0000:01:00.0: info: FW download over, size 608396 bytes > > [ 1187.685990] mwifiex_pcie 0000:01:00.0: WLAN FW is active > > [ 1187.692673] mwifiex_pcie 0000:01:00.0: cmd_wait_q terminated: -512 > > [ 1187.699075] mwifiex_pcie 0000:01:00.0: info: _mwifiex_fw_dpc: unregister device > > [ 1187.713476] mwifiex: Failed to bring up adapter: -5 > > [ 1187.718644] mwifiex_pcie 0000:01:00.0: reinit failed: -5 > > > > [... run `iw phy` ...] > > [ 1212.902419] ================================================================== > > [ 1212.909806] BUG: KASAN: use-after-free in mwifiex_cfg80211_get_antenna+0x54/0xfc [mwifiex] at addr ffffffc0ad1a8028 > > [ 1212.920246] Read of size 1 by task iw/3127 > > [...] > > [ 1212.934946] page dumped because: kasan: bad access detected > > [...] > > [ 1212.950665] Call trace: > > [ 1212.953148] [<ffffffc00020a69c>] dump_backtrace+0x0/0x190 > > [ 1212.958572] [<ffffffc00020a96c>] show_stack+0x20/0x28 > > [ 1212.963648] [<ffffffc0005ce18c>] dump_stack+0xa4/0xcc > > [ 1212.968723] [<ffffffc0003c4430>] kasan_report+0x378/0x500 > > [ 1212.974140] [<ffffffc0003c3358>] __asan_load1+0x44/0x4c > > [ 1212.979462] [<ffffffbffc2e8360>] mwifiex_cfg80211_get_antenna+0x54/0xfc [mwifiex] > > [ 1212.987131] [<ffffffbffc084fc4>] nl80211_send_wiphy+0x75c/0x2de0 [cfg80211] > > [ 1212.994246] [<ffffffbffc094f60>] nl80211_dump_wiphy+0x32c/0x438 [cfg80211] > > [ 1213.001149] [<ffffffc000ab6404>] genl_lock_dumpit+0x48/0x64 > > [ 1213.006746] [<ffffffc000ab3474>] netlink_dump+0x178/0x398 > > [ 1213.012171] [<ffffffc000ab3d18>] __netlink_dump_start+0x1bc/0x260 > > [...] > > > > This all goes away if we just tear down the wiphy on the way down, and > > set it back up if/when we bring the device back up. > > I don't know exactly what kind of reset this is about, Marvell firmwares are known to be quite buggy, and there are plenty of situations in which they crash (often resulting in a command timeout). The current best workaround for these is to essentially unwind the whole driver, reset the card, and reprobe the whole thing. See anywhere that the ->card_reset() callback is called. This has been around for a long time on SDIO, and you recently merged my changes to enable this for PCIe: 6d7d579a8243 mwifiex: pcie: add card_reset() support > but isn't this a > quite intrusive solution? For example, phy name changes because of this? Yes, it is a bit intrusive. But the whole process is intrusive, as it deletes all the virtual interfaces and loses your settings. This all relies on user space being prepared to clean up and reinitialize everything afterward. And yes, this causes a phy name change. In favor: this is what the SDIO reset code *used* to do, before this commit: c742e623e941 mwifiex: sdio card reset enhancement where the SDIO driver started using the half-baked reset solution written for PCIe. Lastly, I still need to analyze a few more things in this driver, but AFAICT, if we *don't* unregister the wiphy, we are exposed to quite a few more race conditions -- not just the easy-to-notice condition described above. What happens if the wiphy still processes cfg80211 operations while we're still resetting the firmware? Much of the driver may not be prepared for this. At the moment, I can't find anything terribly wrong; if I slow down the reset (e.g., with msleep()s) I can just trigger complaints about "cmd node not available" or "card is removed", but I haven't yet found a true bug. That's not to say that there aren't such bugs out there. I'd still be willing to bet there are. And IMO, it seems wise to just do the same teardown/setup as one would do for (e.g.) 'rmmod', to prevent exposing *too* many new permutations of "wiphy is available but rest of the driver is torn down". But if none of this is convincing to you, I can take a stab at a different solution. BTW, since you're taking an interest in this code now, can I trouble you with a question? Looking at mwifiex_uninit_sw() (after this patchset), you can see a loop like this: /* Stop data */ for (i = 0; i < adapter->priv_num; i++) { priv = adapter->priv[i]; if (priv && priv->netdev) { mwifiex_stop_net_dev_queue(priv->netdev, adapter); if (netif_carrier_ok(priv->netdev)) netif_carrier_off(priv->netdev); netif_device_detach(priv->netdev); } } That seems to be the only attempt to prevent user space from talking to the device while we proceed to shut down (mwifiex_shutdown_drv()). AIUI, that's wholly insufficient, and we need to actually stop all the virtual interfaces (and possibly the wiphy as well) first. I'm looking at trying to move the mwifiex_del_virtual_intf() loop up much further in this function (but there are other bugs preventing me from doing that yet). Does that sound like the right approach to you? I'm kinda figuring this should better mimic the mac80211 ieee80211_remove_interfaces() structure. Brian
Brian Norris <briannorris@chromium.org> writes: > On Thu, Jun 01, 2017 at 12:15:45PM +0300, Kalle Valo wrote: >> Brian Norris <briannorris@chromium.org> writes: >> >> > In general, it's helpful to use the same code for device removal as for >> > device reset, as this tends to have fewer bugs. Let's move the wiphy >> > unregistration code into the common reset and removal code. >> > >> > In particular, it's very hard to properly handle the reset sequence when >> > something fails. Currently, if mwifiex_reinit_sw() fails, we've failed >> > to unregister the associated wiphy, and so running something as simple >> > as "iw phy" can trigger an OOPS, as the wiphy still has hooks back into >> > freed mwifiex data structures. For example, KASAN complained: >> > >> > [... see reset fail for other reasons ...] >> > [ 1184.821158] mwifiex_pcie 0000:01:00.0: info: dnld wifi firmware from 174948 bytes >> > [ 1186.870914] mwifiex_pcie 0000:01:00.0: info: FW download over, size 608396 bytes >> > [ 1187.685990] mwifiex_pcie 0000:01:00.0: WLAN FW is active >> > [ 1187.692673] mwifiex_pcie 0000:01:00.0: cmd_wait_q terminated: -512 >> > [ 1187.699075] mwifiex_pcie 0000:01:00.0: info: _mwifiex_fw_dpc: unregister device >> > [ 1187.713476] mwifiex: Failed to bring up adapter: -5 >> > [ 1187.718644] mwifiex_pcie 0000:01:00.0: reinit failed: -5 >> > >> > [... run `iw phy` ...] >> > [ 1212.902419] ================================================================== >> > [ 1212.909806] BUG: KASAN: use-after-free in mwifiex_cfg80211_get_antenna+0x54/0xfc [mwifiex] at addr ffffffc0ad1a8028 >> > [ 1212.920246] Read of size 1 by task iw/3127 >> > [...] >> > [ 1212.934946] page dumped because: kasan: bad access detected >> > [...] >> > [ 1212.950665] Call trace: >> > [ 1212.953148] [<ffffffc00020a69c>] dump_backtrace+0x0/0x190 >> > [ 1212.958572] [<ffffffc00020a96c>] show_stack+0x20/0x28 >> > [ 1212.963648] [<ffffffc0005ce18c>] dump_stack+0xa4/0xcc >> > [ 1212.968723] [<ffffffc0003c4430>] kasan_report+0x378/0x500 >> > [ 1212.974140] [<ffffffc0003c3358>] __asan_load1+0x44/0x4c >> > [ 1212.979462] [<ffffffbffc2e8360>] mwifiex_cfg80211_get_antenna+0x54/0xfc [mwifiex] >> > [ 1212.987131] [<ffffffbffc084fc4>] nl80211_send_wiphy+0x75c/0x2de0 [cfg80211] >> > [ 1212.994246] [<ffffffbffc094f60>] nl80211_dump_wiphy+0x32c/0x438 [cfg80211] >> > [ 1213.001149] [<ffffffc000ab6404>] genl_lock_dumpit+0x48/0x64 >> > [ 1213.006746] [<ffffffc000ab3474>] netlink_dump+0x178/0x398 >> > [ 1213.012171] [<ffffffc000ab3d18>] __netlink_dump_start+0x1bc/0x260 >> > [...] >> > >> > This all goes away if we just tear down the wiphy on the way down, and >> > set it back up if/when we bring the device back up. >> >> I don't know exactly what kind of reset this is about, > > Marvell firmwares are known to be quite buggy, and there are plenty of > situations in which they crash (often resulting in a command timeout). That is a common problem with firmwares :/ > The current best workaround for these is to essentially unwind the > whole driver, reset the card, and reprobe the whole thing. See > anywhere that the ->card_reset() callback is called. > > This has been around for a long time on SDIO, and you recently merged my > changes to enable this for PCIe: > > 6d7d579a8243 mwifiex: pcie: add card_reset() support > >> but isn't this a >> quite intrusive solution? For example, phy name changes because of this? > > Yes, it is a bit intrusive. But the whole process is intrusive, as it > deletes all the virtual interfaces and loses your settings. This all > relies on user space being prepared to clean up and reinitialize > everything afterward. > > And yes, this causes a phy name change. > > In favor: this is what the SDIO reset code *used* to do, before this > commit: > > c742e623e941 mwifiex: sdio card reset enhancement > > where the SDIO driver started using the half-baked reset solution > written for PCIe. > > Lastly, I still need to analyze a few more things in this driver, but > AFAICT, if we *don't* unregister the wiphy, we are exposed to quite a > few more race conditions -- not just the easy-to-notice condition > described above. What happens if the wiphy still processes cfg80211 > operations while we're still resetting the firmware? Much of the driver > may not be prepared for this. At the moment, I can't find anything > terribly wrong; if I slow down the reset (e.g., with msleep()s) I can > just trigger complaints about "cmd node not available" or "card is > removed", but I haven't yet found a true bug. > > That's not to say that there aren't such bugs out there. I'd still be > willing to bet there are. And IMO, it seems wise to just do the same > teardown/setup as one would do for (e.g.) 'rmmod', to prevent exposing > *too* many new permutations of "wiphy is available but rest of the > driver is torn down". This feels like a sledge hammer approach causing all sort of problems for user space and I really like the mac80211 approach more. For example, if an ath10k firmware crash happens user only sees a few second pause in data traffic and a warning in kernel log, otherwise everything happens behind the scenes. Of course there are very likely races somewhere but at least I haven't seen that many reports related to firmware restart functionality. > But if none of this is convincing to you, I can take a stab at a > different solution. I don't have any problem applying this patch but more about being curious why doing it like this. And hopefully finding a less intrusive solution in the future. > BTW, since you're taking an interest in this code now, can I trouble you > with a question? Looking at mwifiex_uninit_sw() (after this patchset), > you can see a loop like this: > > /* Stop data */ > for (i = 0; i < adapter->priv_num; i++) { > priv = adapter->priv[i]; > if (priv && priv->netdev) { > mwifiex_stop_net_dev_queue(priv->netdev, adapter); > if (netif_carrier_ok(priv->netdev)) > netif_carrier_off(priv->netdev); > netif_device_detach(priv->netdev); > } > } > > That seems to be the only attempt to prevent user space from talking to > the device while we proceed to shut down (mwifiex_shutdown_drv()). AIUI, > that's wholly insufficient, and we need to actually stop all the virtual > interfaces (and possibly the wiphy as well) first. I'm looking at trying > to move the mwifiex_del_virtual_intf() loop up much further in this > function (but there are other bugs preventing me from doing that yet). > > Does that sound like the right approach to you? I'm kinda figuring this > should better mimic the mac80211 ieee80211_remove_interfaces() > structure. Johannes is much better person to answer this (CCed).
On Mon, 2017-06-05 at 18:54 +0300, Kalle Valo wrote: > > BTW, since you're taking an interest in this code now, can I > > trouble you with a question? Looking at mwifiex_uninit_sw() (after > > this patchset), you can see a loop like this: > > > > /* Stop data */ > > for (i = 0; i < adapter->priv_num; i++) { > > priv = adapter->priv[i]; > > if (priv && priv->netdev) { > > mwifiex_stop_net_dev_queue(priv->netdev, > > adapter); > > if (netif_carrier_ok(priv->netdev)) > > netif_carrier_off(priv->netdev); > > netif_device_detach(priv->netdev); > > } > > } > > > > That seems to be the only attempt to prevent user space from > > talking to the device while we proceed to shut down > > (mwifiex_shutdown_drv()). AIUI, that's wholly insufficient, and we > > need to actually stop all the virtual interfaces (and possibly the > > wiphy as well) first. I'm looking at trying to move the > > mwifiex_del_virtual_intf() loop up much further in this function > > (but there are other bugs preventing me from doing that yet). > > > > Does that sound like the right approach to you? I'm kinda figuring > > this should better mimic the mac80211 ieee80211_remove_interfaces() > > structure. > > Johannes is much better person to answer this (CCed). Wait, what? You're throwing me into pretty deep water ;-) I'm not sure what you mean by "we need to atually stop all the virtual interfaces ([...]) first". There are essentially only two/three ways to reach this - data path, which is getting stopped here, and control path (both nl80211 and perhaps ndo ops like start/stop). Without checking the code now, it seems entirely plausible that this is holding some lock that would lock out the control path entirely, for the duration until the wiphy is actually unregistered? Actually, you can't unregister with the relevant locks held (without causing deadlocks), so perhaps it's marking the wiphy as unavailable so that all operations fail? johannes
Hi Kalle (and Johannes; I'll reply to Johannes response separately too), On Mon, Jun 05, 2017 at 06:54:18PM +0300, Kalle Valo wrote: > Brian Norris <briannorris@chromium.org> writes: > > That's not to say that there aren't such bugs out there. I'd still be > > willing to bet there are. And IMO, it seems wise to just do the same > > teardown/setup as one would do for (e.g.) 'rmmod', to prevent exposing > > *too* many new permutations of "wiphy is available but rest of the > > driver is torn down". > > This feels like a sledge hammer approach causing all sort of problems Yes, it is a sledge hammer. But I'm working with what we have here. With this approach, it's also easier to tell that things aren't out-of-sync, since I'm never quite sure how much state was held in the firmware (and now won't match what user space thinks). A full removal / re-init makes this clear -- user space should expect *everything* to be reset. I'm open to learning better approaches if possible, but this also might be difficult if I don't get any support from Marvell on this. (They seem quite happy to let sleeping dogs lie.) > for user space and I really like the mac80211 approach more. For > example, if an ath10k firmware crash happens user only sees a few second > pause in data traffic and a warning in kernel log, otherwise everything > happens behind the scenes. Of course there are very likely races > somewhere but at least I haven't seen that many reports related to > firmware restart functionality. Yes, that all sounds nice. But for my sake, can you describe better what's actually going on there (e.g., can you point me at which code does this)? I'm really not familiar with mac80211 (though I was aware of the above general behavior). But to my knowledge, mac80211 drivers keep a lot more state managed in the kernel, so it's a little easier and more natural to get the driver/FW back to "the same state" than it is with a full-MAC driver. > > But if none of this is convincing to you, I can take a stab at a > > different solution. > > I don't have any problem applying this patch but more about being > curious why doing it like this. And hopefully finding a less intrusive > solution in the future. OK, sure. I'll see what I can do, but I don't see an easy path at the moment toward fixing (i.e., completely rewriting) this long-standing driver behavior. [trim] Thanks, Brian
Hi Johannes, On Fri, Jun 09, 2017 at 11:03:38AM +0200, Johannes Berg wrote: > On Mon, 2017-06-05 at 18:54 +0300, Kalle Valo wrote: > > > > BTW, since you're taking an interest in this code now, can I > > > trouble you with a question? Looking at mwifiex_uninit_sw() (after > > > this patchset), you can see a loop like this: > > > > > > /* Stop data */ > > > for (i = 0; i < adapter->priv_num; i++) { > > > priv = adapter->priv[i]; > > > if (priv && priv->netdev) { > > > mwifiex_stop_net_dev_queue(priv->netdev, > > > adapter); > > > if (netif_carrier_ok(priv->netdev)) > > > netif_carrier_off(priv->netdev); > > > netif_device_detach(priv->netdev); > > > } > > > } > > > > > > That seems to be the only attempt to prevent user space from > > > talking to the device while we proceed to shut down > > > (mwifiex_shutdown_drv()). AIUI, that's wholly insufficient, and we > > > need to actually stop all the virtual interfaces (and possibly the > > > wiphy as well) first. I'm looking at trying to move the > > > mwifiex_del_virtual_intf() loop up much further in this function > > > (but there are other bugs preventing me from doing that yet). > > > > > > Does that sound like the right approach to you? I'm kinda figuring > > > this should better mimic the mac80211 ieee80211_remove_interfaces() > > > structure. > > > > Johannes is much better person to answer this (CCed). > > Wait, what? You're throwing me into pretty deep water ;-) Regardless, thanks for the help :) > I'm not sure what you mean by "we need to atually stop all the virtual > interfaces ([...]) first". Judging by your following comments, I may have been completely mistaken. (But that's why I asked you folks!) > There are essentially only two/three ways to reach this - data path, > which is getting stopped here, and control path (both nl80211 and > perhaps ndo ops like start/stop). I think I was conflating virtual interfaces with control path (e.g., nl80211 scans, set freq, etc.). The idea is that control operations may still get *started* after the above, and it's just plain impossible to resolve the races with driver queue teardown if we're queueing up new control ops at the same time. But even if we kill off the wireless_dev's, I suppose there are still control interfaces that can talk directly to the wiphy. > Without checking the code now, it seems entirely plausible that this is > holding some lock that would lock out the control path entirely, for > the duration until the wiphy is actually unregistered? > > Actually, you can't unregister with the relevant locks held (without > causing deadlocks), so perhaps it's marking the wiphy as unavailable so > that all operations fail? One of the above two sounds along the right line. But it's something I couldn't really figure out how to do quite right. Dumb question: how would I mark the wiphy as unavailable? Is there something I can do at the cfg80211 level? Or would I really have to guard all the cfg80211 entry points into mwifiex with a flag or lock? Also, IIUC, we need to wait for all control paths to complete (or cancel) before we can free up the associated resources; so just marking "unavailable" isn't enough. Thanks, Brian
On Wed, 2017-06-21 at 10:48 -0700, Brian Norris wrote: > > Yes, that all sounds nice. But for my sake, can you describe better > what's actually going on there (e.g., can you point me at which code > does this)? It's much easier with mac80211, it has all the state. Basically the reconfig is in ieee80211_reconfig() :) > I'm really not familiar with mac80211 (though I was aware of > the above general behavior). But to my knowledge, mac80211 drivers > keep a lot more state managed in the kernel, so it's a little easier > and more natural to get the driver/FW back to "the same state" than > it is with a full-MAC driver. Indeed. johannes
On Wed, 2017-06-21 at 11:27 -0700, Brian Norris wrote: > > > I'm not sure what you mean by "we need to atually stop all the > > virtual interfaces ([...]) first". > > Judging by your following comments, I may have been completely > mistaken. > (But that's why I asked you folks!) :) > > There are essentially only two/three ways to reach this - data > > path, > > which is getting stopped here, and control path (both nl80211 and > > perhaps ndo ops like start/stop). > > I think I was conflating virtual interfaces with control path (e.g., > nl80211 scans, set freq, etc.). The idea is that control operations > may still get *started* after the above, and it's just plain > impossible to resolve the races with driver queue teardown if we're > queueing up new control ops at the same time. Agree. > But even if we kill off the wireless_dev's, I suppose there are still > control interfaces that can talk directly to the wiphy. Yeah, only a few. > > Without checking the code now, it seems entirely plausible that > > this is > > holding some lock that would lock out the control path entirely, > > for > > the duration until the wiphy is actually unregistered? > > > > Actually, you can't unregister with the relevant locks held > > (without > > causing deadlocks), so perhaps it's marking the wiphy as > > unavailable so > > that all operations fail? > > One of the above two sounds along the right line. But it's something > I couldn't really figure out how to do quite right. > > Dumb question: how would I mark the wiphy as unavailable? Is there > something I can do at the cfg80211 level? Or would I really have to > guard all the cfg80211 entry points into mwifiex with a flag or lock? There isn't really a good way to do this. You can, of course, call wiphy_unregister(), but if you could do that you'd already have the problem solved, I think? I'm not really familiar enough with the context this happens in - can't you let all the operations that try to talk to the firmware fail (because the firmware is dead, or whatever) and then call wiphy_unregister()? > Also, IIUC, we need to wait for all control paths to complete (or > cancel) before we can free up the associated resources; so just > marking "unavailable" isn't enough. Yeah, I suppose so. Though if you just do all the freeing after wiphy_unregister() it'll do that for you? johannes
(A little slow on follow-up here) On Thu, Jun 22, 2017 at 02:59:49PM +0200, Johannes Berg wrote: > On Wed, 2017-06-21 at 10:48 -0700, Brian Norris wrote: > > > > Yes, that all sounds nice. But for my sake, can you describe better > > what's actually going on there (e.g., can you point me at which code > > does this)? > > It's much easier with mac80211, it has all the state. Basically the > reconfig is in ieee80211_reconfig() :) Wow, that's not exactly simple code; I expect it could be pretty difficult to get that right today on mwifiex. The current approach actually should be *easier* (for the kernel side) to avoid bugs, as it should be basically the same thing as 'rmmod'. Nonetheless, there are plenty of bugs. Thanks for the pointer though. > > I'm really not familiar with mac80211 (though I was aware of > > the above general behavior). But to my knowledge, mac80211 drivers > > keep a lot more state managed in the kernel, so it's a little easier > > and more natural to get the driver/FW back to "the same state" than > > it is with a full-MAC driver. > > Indeed. Brian
On Thu, Jun 22, 2017 at 03:02:34PM +0200, Johannes Berg wrote: > On Wed, 2017-06-21 at 11:27 -0700, Brian Norris wrote: > > > Without checking the code now, it seems entirely plausible that > > > this is > > > holding some lock that would lock out the control path entirely, > > > for > > > the duration until the wiphy is actually unregistered? > > > > > > Actually, you can't unregister with the relevant locks held > > > (without > > > causing deadlocks), so perhaps it's marking the wiphy as > > > unavailable so > > > that all operations fail? > > > > One of the above two sounds along the right line. But it's something > > I couldn't really figure out how to do quite right. > > > > Dumb question: how would I mark the wiphy as unavailable? Is there > > something I can do at the cfg80211 level? Or would I really have to > > guard all the cfg80211 entry points into mwifiex with a flag or lock? > > There isn't really a good way to do this. You can, of course, call > wiphy_unregister(), but if you could do that you'd already have the > problem solved, I think? That's probably along the right track. There are still some things we'd need to do properly before that though, and this is where all the problems are so far. (Also, this is what Kalle was already objecting to; he didn't think we should be unregistering/recreating the wiphy, but I think he ended up softening on that a bit.) For one, I still expect I should be removing the wireless dev's before unregistering the wihpy, no? Otherwise, there will be existing wdevs backed by an unregistered wiphy? And that gets to the heart of another bug: deleting interfaces (e.g., "iw dev foo del") races with a lot of stuff -- like see mwifiex_process_sta_event() -> EVENT_EXT_SCAN_REPORT -> netif_running(priv->netdev) Because mwifiex_del_virtual_intf() doesn't stop any outstanding commands, we can be both deleting the netdev and processing scans for it. > I'm not really familiar enough with the context this happens in - can't > you let all the operations that try to talk to the firmware fail > (because the firmware is dead, or whatever) and then call > wiphy_unregister()? Yes, something like that, barring some of the other bugs mentioned. > > Also, IIUC, we need to wait for all control paths to complete (or > > cancel) before we can free up the associated resources; so just > > marking "unavailable" isn't enough. > > Yeah, I suppose so. Though if you just do all the freeing after > wiphy_unregister() it'll do that for you? Yes, I think so. Then part of the problem is probably that some of the current "cancel command" logic is tied up with the "free command structures" logic. So we're freeing some stuff too early. Anyway, those sorts of bugs aside, IIUC the full sequence for teardown should probably be something like: 1. Stop TX queues 2. Cancel outstanding commands (let them fail or finish, etc.) -- but DON'T free their backing resources yet 3. Remove wdevs 4. wiphy_unregister() 5. Free up resources Current problems are at least: * we don't do step 4 in the right place (if at all; see this patch) * step 2 mixes in "free"ing resources too early Brian
Hi Brian, > Wow, that's not exactly simple code; I expect it could be pretty > difficult to get that right today on mwifiex. Yeah, I have no doubt. You'd probably have to track a lot of state that you just pass down to the firmware too, and possibly can't even track some state that the firmware derives itself (like for example PNs for keys) > The current approach > actually should be *easier* (for the kernel side) to avoid bugs, as > it should be basically the same thing as 'rmmod'. Nonetheless, there > are plenty of bugs. :-) johannes
On Tue, 2017-06-27 at 13:48 -0700, Brian Norris wrote: > > There isn't really a good way to do this. You can, of course, call > > wiphy_unregister(), but if you could do that you'd already have the > > problem solved, I think? > > That's probably along the right track. There are still some things > we'd need to do properly before that though, and this is where all > the problems are so far. (Also, this is what Kalle was already > objecting to; he didn't think we should be unregistering/recreating > the wiphy, but I think he ended up softening on that a bit.) > > For one, I still expect I should be removing the wireless dev's > before unregistering the wihpy, no? Otherwise, there will be existing > wdevs backed by an unregistered wiphy? Yeah, that's true - though once you get rid of those they can't be accessed any more. > And that gets to the heart of another bug: deleting interfaces (e.g., > "iw dev foo del") races with a lot of stuff -- like see > > mwifiex_process_sta_event() -> > EVENT_EXT_SCAN_REPORT -> > netif_running(priv->netdev) > > Because mwifiex_del_virtual_intf() doesn't stop any outstanding > commands, we can be both deleting the netdev and processing scans for > it. Huh, well, I guess you need some kind of locking here anyway, since the user can always do things like deleting the interface while a scan is running? > > > Also, IIUC, we need to wait for all control paths to complete (or > > > cancel) before we can free up the associated resources; so just > > > marking "unavailable" isn't enough. > > > > Yeah, I suppose so. Though if you just do all the freeing after > > wiphy_unregister() it'll do that for you? > > Yes, I think so. Then part of the problem is probably that some of > the current "cancel command" logic is tied up with the "free command > structures" logic. So we're freeing some stuff too early. > > Anyway, those sorts of bugs aside, IIUC the full sequence for > teardown should probably be something like: > > 1. Stop TX queues > 2. Cancel outstanding commands (let them fail or finish, etc.) -- but > DON'T free their backing resources yet > 3. Remove wdevs > 4. wiphy_unregister() > 5. Free up resources > > Current problems are at least: > > * we don't do step 4 in the right place (if at all; see this patch) > * step 2 mixes in "free"ing resources too early So I'm not sure what you mean by splitting in 2/5 - this seems reasonable, but I don't understand why something like a scan request wouldn't be freed while you cancel it in 2? In fact, you really have to free it before you remove the corresponding wdev, or cfg80211 will complain? johannes
Hi Johannes, On Wed, Jun 28, 2017 at 09:28:49AM +0200, Johannes Berg wrote: > On Tue, 2017-06-27 at 13:48 -0700, Brian Norris wrote: > > > > There isn't really a good way to do this. You can, of course, call > > > wiphy_unregister(), but if you could do that you'd already have the > > > problem solved, I think? > > > > That's probably along the right track. There are still some things > > we'd need to do properly before that though, and this is where all > > the problems are so far. (Also, this is what Kalle was already > > objecting to; he didn't think we should be unregistering/recreating > > the wiphy, but I think he ended up softening on that a bit.) > > > > For one, I still expect I should be removing the wireless dev's > > before unregistering the wihpy, no? Otherwise, there will be existing > > wdevs backed by an unregistered wiphy? > > Yeah, that's true - though once you get rid of those they can't be > accessed any more. Right. That's the whole idea for the current reset implementation in this driver anyway. > > And that gets to the heart of another bug: deleting interfaces (e.g., > > "iw dev foo del") races with a lot of stuff -- like see > > > > mwifiex_process_sta_event() -> > > EVENT_EXT_SCAN_REPORT -> > > netif_running(priv->netdev) > > > > Because mwifiex_del_virtual_intf() doesn't stop any outstanding > > commands, we can be both deleting the netdev and processing scans for > > it. > > Huh, well, I guess you need some kind of locking here anyway, since the > user can always do things like deleting the interface while a scan is > running? Yes, some sort of locking, and maybe ability to cancel outstanding commands on just the targeted interface. I gave the locking a try myself previously and got something sorta working, before getting distracted by other problems. I also reported this directly to Marvell to see if they could be bothered to fix it. They might be working on that. But actually I think the rmmod or reset code path has this a little easier, since we're fine just killing all outstanding commands and interfaces. So these two problems are somewhat orthogonal. > > > > Also, IIUC, we need to wait for all control paths to complete (or > > > > cancel) before we can free up the associated resources; so just > > > > marking "unavailable" isn't enough. > > > > > > Yeah, I suppose so. Though if you just do all the freeing after > > > wiphy_unregister() it'll do that for you? > > > > Yes, I think so. Then part of the problem is probably that some of > > the current "cancel command" logic is tied up with the "free command > > structures" logic. So we're freeing some stuff too early. > > > > Anyway, those sorts of bugs aside, IIUC the full sequence for > > teardown should probably be something like: > > > > 1. Stop TX queues > > 2. Cancel outstanding commands (let them fail or finish, etc.) -- but > > DON'T free their backing resources yet I also failed to mention "don't queue new FW commands". The driver does this before step 1 currently, though the code isn't beautiful: mwifiex_send_cmd() ... if (adapter->surprise_removed) { mwifiex_dbg(adapter, ERROR, "PREP_CMD: card is removed\n"); return -1; } ... // continue on to prepare and queue (or sync) the command static void mwifiex_uninit_sw(struct mwifiex_adapter *adapter) { ... adapter->surprise_removed = true; ... // continue on to step 1, 2, ... (And now that I think about it, I'm pretty sure there's a race in there somewhere... Someone could easily miss the "surprise removed" check, grab a command node, and miss out on step 2 (since the command isn't sitting on any of the queues that get "canceled" yet). I believe this can easily blow up once they try to queue the command, as we are no longer ready to handle the command queue...) > > 3. Remove wdevs > > 4. wiphy_unregister() > > 5. Free up resources > > > > Current problems are at least: > > > > * we don't do step 4 in the right place (if at all; see this patch) > > * step 2 mixes in "free"ing resources too early > > So I'm not sure what you mean by splitting in 2/5 - this seems > reasonable, but I don't understand why something like a scan request > wouldn't be freed while you cancel it in 2? In fact, you really have to > free it before you remove the corresponding wdev, or cfg80211 will > complain? I haven't validated all the related code, but I think the problem isn't that a scan is still being processed after the wdev is removed. The problem is simply that we've canceled the command (and it will "complete" before the wdev removal), but we're freeing the associated resource before the caller is actually completely done with it. Or in more detail: The driver keeps a pool of FW command structures (like 'struct cmd_ctrl_node') that are queued in various ways. We cancel everything in step 2 (mwifiex_adapter_cleanup() -> mwifiex_cancel_all_pending_cmd()), but we can't free the pool (mwifiex_free_cmd_buffer()) until step 5, since there can be wdev or wiphy control paths that are still exiting (and using one of the cmd nodes' "condition" variables) until we've finished waiting on them with step 3 or 4. But currently we free these buffers in step 2 (mwifiex_adapter_cleanup() -> mwifiex_free_cmd_buffer()). I could have missed something else, and my description above definitely isn't exhaustive. But AFAICT, this straightens out a solution for some of the problems I've noticed recently. (But then, I noticed another problem, as noted above...ugh.) Anyway, thanks for the pointers so far! Brian
(I was on vacation and inbox exploded again) Brian Norris <briannorris@chromium.org> writes: > On Thu, Jun 22, 2017 at 03:02:34PM +0200, Johannes Berg wrote: >> On Wed, 2017-06-21 at 11:27 -0700, Brian Norris wrote: > >> > > Without checking the code now, it seems entirely plausible that >> > > this is >> > > holding some lock that would lock out the control path entirely, >> > > for >> > > the duration until the wiphy is actually unregistered? >> > > >> > > Actually, you can't unregister with the relevant locks held >> > > (without >> > > causing deadlocks), so perhaps it's marking the wiphy as >> > > unavailable so >> > > that all operations fail? >> > >> > One of the above two sounds along the right line. But it's something >> > I couldn't really figure out how to do quite right. >> > >> > Dumb question: how would I mark the wiphy as unavailable? Is there >> > something I can do at the cfg80211 level? Or would I really have to >> > guard all the cfg80211 entry points into mwifiex with a flag or lock? >> >> There isn't really a good way to do this. You can, of course, call >> wiphy_unregister(), but if you could do that you'd already have the >> problem solved, I think? > > That's probably along the right track. There are still some things we'd > need to do properly before that though, and this is where all the > problems are so far. (Also, this is what Kalle was already objecting to; > he didn't think we should be unregistering/recreating the wiphy, but I > think he ended up softening on that a bit.) Just to clarify I was more like hoping there would be a better way to do it, not really objecting the current method, but I didn't realise that of course it's harder to do a transparent firmware restart with fullmac design. And certainly what are you doing now is much much better than doing nothing after a firmware crash.
diff --git a/drivers/net/wireless/marvell/mwifiex/main.c b/drivers/net/wireless/marvell/mwifiex/main.c index 3b7316b537ea..be3badba028a 100644 --- a/drivers/net/wireless/marvell/mwifiex/main.c +++ b/drivers/net/wireless/marvell/mwifiex/main.c @@ -1401,6 +1401,10 @@ static void mwifiex_uninit_sw(struct mwifiex_adapter *adapter) rtnl_unlock(); } vfree(adapter->chan_stats); + + wiphy_unregister(adapter->wiphy); + wiphy_free(adapter->wiphy); + adapter->wiphy = NULL; } /* @@ -1682,9 +1686,6 @@ int mwifiex_remove_card(struct mwifiex_adapter *adapter) mwifiex_uninit_sw(adapter); - wiphy_unregister(adapter->wiphy); - wiphy_free(adapter->wiphy); - if (adapter->irq_wakeup >= 0) device_init_wakeup(adapter->dev, false);
In general, it's helpful to use the same code for device removal as for device reset, as this tends to have fewer bugs. Let's move the wiphy unregistration code into the common reset and removal code. In particular, it's very hard to properly handle the reset sequence when something fails. Currently, if mwifiex_reinit_sw() fails, we've failed to unregister the associated wiphy, and so running something as simple as "iw phy" can trigger an OOPS, as the wiphy still has hooks back into freed mwifiex data structures. For example, KASAN complained: [... see reset fail for other reasons ...] [ 1184.821158] mwifiex_pcie 0000:01:00.0: info: dnld wifi firmware from 174948 bytes [ 1186.870914] mwifiex_pcie 0000:01:00.0: info: FW download over, size 608396 bytes [ 1187.685990] mwifiex_pcie 0000:01:00.0: WLAN FW is active [ 1187.692673] mwifiex_pcie 0000:01:00.0: cmd_wait_q terminated: -512 [ 1187.699075] mwifiex_pcie 0000:01:00.0: info: _mwifiex_fw_dpc: unregister device [ 1187.713476] mwifiex: Failed to bring up adapter: -5 [ 1187.718644] mwifiex_pcie 0000:01:00.0: reinit failed: -5 [... run `iw phy` ...] [ 1212.902419] ================================================================== [ 1212.909806] BUG: KASAN: use-after-free in mwifiex_cfg80211_get_antenna+0x54/0xfc [mwifiex] at addr ffffffc0ad1a8028 [ 1212.920246] Read of size 1 by task iw/3127 [...] [ 1212.934946] page dumped because: kasan: bad access detected [...] [ 1212.950665] Call trace: [ 1212.953148] [<ffffffc00020a69c>] dump_backtrace+0x0/0x190 [ 1212.958572] [<ffffffc00020a96c>] show_stack+0x20/0x28 [ 1212.963648] [<ffffffc0005ce18c>] dump_stack+0xa4/0xcc [ 1212.968723] [<ffffffc0003c4430>] kasan_report+0x378/0x500 [ 1212.974140] [<ffffffc0003c3358>] __asan_load1+0x44/0x4c [ 1212.979462] [<ffffffbffc2e8360>] mwifiex_cfg80211_get_antenna+0x54/0xfc [mwifiex] [ 1212.987131] [<ffffffbffc084fc4>] nl80211_send_wiphy+0x75c/0x2de0 [cfg80211] [ 1212.994246] [<ffffffbffc094f60>] nl80211_dump_wiphy+0x32c/0x438 [cfg80211] [ 1213.001149] [<ffffffc000ab6404>] genl_lock_dumpit+0x48/0x64 [ 1213.006746] [<ffffffc000ab3474>] netlink_dump+0x178/0x398 [ 1213.012171] [<ffffffc000ab3d18>] __netlink_dump_start+0x1bc/0x260 [...] This all goes away if we just tear down the wiphy on the way down, and set it back up if/when we bring the device back up. Signed-off-by: Brian Norris <briannorris@chromium.org> --- drivers/net/wireless/marvell/mwifiex/main.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-)