Message ID | 1423224354-24955-1-git-send-email-michal.kazior@tieto.com (mailing list archive) |
---|---|
State | Not Applicable |
Headers | show |
On 02/06/2015 04:05 AM, Michal Kazior wrote: > It makes little sense to continue and let > firmware-host state become inconsistent if a WMI > command can't be submitted to firmware. > > This effectively prevents after-affects of > tx-credit starvation bug which include spurious > sta kickout events and inability to associate new > stations after some time when acting as AP. > > This should also speed up recovery/teardown in > some cases when firmware stops responding for some > reason. I have not seen a WMI timeout since I added keep-alive and CE polling in my firmware, but the patch looks OK to me. You might add something about 'WMI' in that warning message to make it more clear what is not being responsive. At least in my tests, I could continue to receive network traffic while WMI was blocked. Thanks, Ben > > Signed-off-by: Michal Kazior <michal.kazior@tieto.com> > --- > drivers/net/wireless/ath/ath10k/wmi.c | 8 +++++++- > 1 file changed, 7 insertions(+), 1 deletion(-) > > diff --git a/drivers/net/wireless/ath/ath10k/wmi.c b/drivers/net/wireless/ath/ath10k/wmi.c > index aeea1c7..776b257 100644 > --- a/drivers/net/wireless/ath/ath10k/wmi.c > +++ b/drivers/net/wireless/ath/ath10k/wmi.c > @@ -1045,9 +1045,15 @@ int ath10k_wmi_cmd_send(struct ath10k *ar, struct sk_buff *skb, u32 cmd_id) > (ret != -EAGAIN); > }), 3*HZ); > > - if (ret) > + if (ret) { > dev_kfree_skb_any(skb); > > + if (ret == -EAGAIN) { > + ath10k_warn(ar, "firmware unresponsive, restarting..\n"); > + queue_work(ar->workqueue, &ar->restart_work); > + } > + } > + > return ret; > } > >
On 6 February 2015 at 17:15, Ben Greear <greearb@candelatech.com> wrote: > On 02/06/2015 04:05 AM, Michal Kazior wrote: >> >> It makes little sense to continue and let >> firmware-host state become inconsistent if a WMI >> command can't be submitted to firmware. >> >> This effectively prevents after-affects of >> tx-credit starvation bug which include spurious >> sta kickout events and inability to associate new >> stations after some time when acting as AP. >> >> This should also speed up recovery/teardown in >> some cases when firmware stops responding for some >> reason. > > > I have not seen a WMI timeout since I added keep-alive > and CE polling in my firmware, but the patch looks OK > to me. This is mainly aimed at the tx-credit starvation due to mgmt-tx being stuck on client powersave buffering. > You might add something about 'WMI' in that warning > message to make it more clear what is not being > responsive. Good point. > At least in my tests, I could continue > to receive network traffic while WMI was blocked. Yeah. Traffic works with the tx-credit starvation as well but what good is this if you have inconsistent driver-firmware state after failing to send a few commands? Micha? -- To unsubscribe from this list: send the line "unsubscribe linux-wireless" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 02/08/2015 10:24 PM, Michal Kazior wrote: > On 6 February 2015 at 17:15, Ben Greear <greearb@candelatech.com> wrote: >> On 02/06/2015 04:05 AM, Michal Kazior wrote: >>> >>> It makes little sense to continue and let >>> firmware-host state become inconsistent if a WMI >>> command can't be submitted to firmware. >>> >>> This effectively prevents after-affects of >>> tx-credit starvation bug which include spurious >>> sta kickout events and inability to associate new >>> stations after some time when acting as AP. >>> >>> This should also speed up recovery/teardown in >>> some cases when firmware stops responding for some >>> reason. >> >> >> I have not seen a WMI timeout since I added keep-alive >> and CE polling in my firmware, but the patch looks OK >> to me. > > This is mainly aimed at the tx-credit starvation due to mgmt-tx being > stuck on client powersave buffering. > > >> You might add something about 'WMI' in that warning >> message to make it more clear what is not being >> responsive. > > Good point. > > >> At least in my tests, I could continue >> to receive network traffic while WMI was blocked. > > Yeah. Traffic works with the tx-credit starvation as well but what > good is this if you have inconsistent driver-firmware state after > failing to send a few commands? I just mention it because someone debugging the system might wonder why the firmware is 'locked up' while it is still passing traffic. I agree with your patch in general. If it is just powersave issue, could you force (overriding wmi tx-credits limit) a flush at the 1 second mark and hope it recovers by 3 second timeout? Thanks, Ben
On 9 February 2015 at 17:03, Ben Greear <greearb@candelatech.com> wrote: > On 02/08/2015 10:24 PM, Michal Kazior wrote: >> On 6 February 2015 at 17:15, Ben Greear <greearb@candelatech.com> wrote: [...] >>> At least in my tests, I could continue >>> to receive network traffic while WMI was blocked. >> >> >> Yeah. Traffic works with the tx-credit starvation as well but what >> good is this if you have inconsistent driver-firmware state after >> failing to send a few commands? > > > I just mention it because someone debugging the system might > wonder why the firmware is 'locked up' while it is still passing traffic. I might as well include this info in the commit log. > If it is just powersave issue, could you force (overriding wmi tx-credits > limit) > a flush at the 1 second mark and hope it recovers by 3 second timeout? Well, there's a couple of problems with that. 1) you suggest to call wmi_flush() which calls wmi_cmd_send() *while* in wmi_cmd_send(), 2) you're out of credits, how do you intend to submit a frame, 3) yes, you can force it, but that's super ugly, 4) and this still doesn't guarantee you'll get your tx-credit due to firmware quirkiness. (but maybe the latest 10.2.4 behaves differently? who knows..) My original stab at this involved a proactive flushing but it had it's problems (delayed mgmt frames, could still get stuck in rare cases). Micha? -- To unsubscribe from this list: send the line "unsubscribe linux-wireless" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 02/09/2015 10:09 PM, Michal Kazior wrote: > On 9 February 2015 at 17:03, Ben Greear <greearb@candelatech.com> wrote: >> On 02/08/2015 10:24 PM, Michal Kazior wrote: >>> On 6 February 2015 at 17:15, Ben Greear <greearb@candelatech.com> wrote: > [...] >>>> At least in my tests, I could continue >>>> to receive network traffic while WMI was blocked. >>> >>> >>> Yeah. Traffic works with the tx-credit starvation as well but what >>> good is this if you have inconsistent driver-firmware state after >>> failing to send a few commands? >> >> >> I just mention it because someone debugging the system might >> wonder why the firmware is 'locked up' while it is still passing traffic. > > I might as well include this info in the commit log. > > >> If it is just powersave issue, could you force (overriding wmi tx-credits >> limit) >> a flush at the 1 second mark and hope it recovers by 3 second timeout? > > Well, there's a couple of problems with that. > > 1) you suggest to call wmi_flush() which calls wmi_cmd_send() *while* > in wmi_cmd_send(), > 2) you're out of credits, how do you intend to submit a frame, > 3) yes, you can force it, but that's super ugly, > 4) and this still doesn't guarantee you'll get your tx-credit due to > firmware quirkiness. If you do force the wmi send, ignoring credits, and even if you are running stock firmware with wmi credits replenishment issues, then you at least have a chance of recovering. (Upstream firmware requires lots of WMI messages to flush, I think..like you can only flush per peer or something like that, so maybe there is no realistic way to flush with small number of WMI messages.) I've hacked CT firmware to do a flush of all vdevs itself when it detects WMI hang. I don't have a good test bed to reproduce the problem reliably, but I should know after a few days if the flush works at all. If not, then it's a moot point anyway. Thanks, Ben
> -----Original Message----- > From: Michal Kazior [mailto:michal.kazior@tieto.com] > Sent: 6. helmikuuta 2015 14:06 > To: ath10k@lists.infradead.org > Cc: linux-wireless@vger.kernel.org; malaakso@elisanet.fi; > greearb@candelatech.com; Michal Kazior > Subject: [RFT] ath10k: restart fw on tx-credit timeout > > It makes little sense to continue and let firmware-host state become > inconsistent if a WMI command can't be submitted to firmware. > > This effectively prevents after-affects of tx-credit starvation bug which > include spurious sta kickout events and inability to associate new stations > after some time when acting as AP. > > This should also speed up recovery/teardown in some cases when firmware > stops responding for some reason. > > Signed-off-by: Michal Kazior <michal.kazior@tieto.com> > --- > drivers/net/wireless/ath/ath10k/wmi.c | 8 +++++++- > 1 file changed, 7 insertions(+), 1 deletion(-) > > diff --git a/drivers/net/wireless/ath/ath10k/wmi.c > b/drivers/net/wireless/ath/ath10k/wmi.c > index aeea1c7..776b257 100644 > --- a/drivers/net/wireless/ath/ath10k/wmi.c > +++ b/drivers/net/wireless/ath/ath10k/wmi.c > @@ -1045,9 +1045,15 @@ int ath10k_wmi_cmd_send(struct ath10k *ar, > struct sk_buff *skb, u32 cmd_id) > (ret != -EAGAIN); > }), 3*HZ); > > - if (ret) > + if (ret) { > dev_kfree_skb_any(skb); > > + if (ret == -EAGAIN) { > + ath10k_warn(ar, "firmware > unresponsive, restarting..\n"); > + queue_work(ar->workqueue, > &ar->restart_work); > + } > + } > + > return ret; > } > > -- > 1.8.5.3 (Re-sending due to E-mail client messing up formatting) Hi Michal, I've been running OpenWrt with this patch (applied on top of backports from 2014-11-04) since last Friday, and today after one mobile phone left the building, Wed Feb 11 14:00:11 2015 daemon.info hostapd: wlan0: STA bc:c6:db:14:83:cc IEEE 802.11: disassociated due to inactivity Wed Feb 11 14:00:12 2015 daemon.info hostapd: wlan0: STA bc:c6:db:14:83:cc IEEE 802.11: deauthenticated due to inactivity (timer DEAUTH/REMOVE) Wed Feb 11 14:00:12 2015 kern.warn kernel: [ 2800.890000] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0 I got this: [ 2800.890000] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0 [ 2800.990000] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0 [ 2801.090000] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0 [ 2801.200000] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0 [ 2801.300000] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0 [ 2801.400000] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0 [ 2801.500000] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0 [ 2801.610000] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0 [ 2801.710000] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0 [ 2801.810000] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0 [ 2803.760000] ------------[ cut here ]------------ [ 2803.760000] WARNING: CPU: 0 PID: 2531 at /home/matti/Projects/openwrt/trunk/openwrt/build_dir/target-mips_34kc_uClibc -0.9.33.2/linux-ar71xx_generic/compat-wireless-2014-11-04/net/mac80211/sta_i nfo.c:916 sta_info_move_state+0x580/0x604 [mac80211]() [ 2803.780000] Modules linked in: ifb pppoe ppp_async iptable_nat ath9k pppox ppp_generic nf_nat_ipv4 nf_conntrack_ipv6 nf_conntrack_ipv4 ipt_MASQUERADE ath9k_common xt_time xt_tcpudp xt_tcpmss xt_string xt_statistic xt_state xt_recent xt_nat xt_multiport xt_mark xt_mac xt_limit xt_length xt_id xt_hl xt_helper xt_ecn xt_dscp xt_conntrack xt_connmark xt_connlimit xt_connbytes xt_comment xt_TCPMSS xt_REDIRECT xt_LOG xt_HL xt_DSCP xt_CT xt_CLASSIFY ts_kmp ts_fsm ts_bm slhc nf_nat_irc nf_nat_ftp nf_nat nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_rtcache nf_conntrack_irc nf_conntrack_ftp iptable_raw iptable_mangle iptable_filter ipt_REJECT ipt_ECN ip_tables crc_ccitt ath9k_hw act_connmark nf_conntrack act_skbedit act_mirred em_u32 cls_u32 cls_tcindex cls_flow cls_route cls_fw sch_hfsc sch_ingress ath10k_pci ath10k_core ath mac80211 cfg80211 compat ledtrig_usbdev ip6t_REJECT ip6table_raw ip6table_mangle ip6table_filter ip6_tables x_tables ipv6 arc4 crypto_blkcipher usb_storage ehci_platform ehci_hcd sd_mod scsi_mod gpio_button_hotplug ext4 crc16 jbd2 mbcache usbcore nls_base usb_common mii crypto_hash [last unloaded: ifb] [ 2803.880000] CPU: 0 PID: 2531 Comm: hostapd Not tainted 3.14.30 #1 [ 2803.890000] Stack : 00000006 00000000 00000000 00000000 00000000 00000000 803bc98e 00000035 [ 2803.890000] 873576d8 00000000 802fcad8 80347623 000009e3 803b3b5c 873576d8 00000000 [ 2803.890000] 00000004 00000000 00000008 8029e880 00000003 8020040c 00000394 00000000 [ 2803.890000] 802ffc9c 85265abc 00000000 00000000 00000000 00000000 00000000 00000000 [ 2803.890000] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 [ 2803.890000] ... [ 2803.930000] Call Trace: [ 2803.930000] [<80242644>] show_stack+0x48/0x70 [ 2803.930000] [<802aee18>] warn_slowpath_common+0x84/0xb4 [ 2803.940000] [<802aeed0>] warn_slowpath_null+0x18/0x24 [ 2803.940000] [<872051e0>] sta_info_move_state+0x580/0x604 [mac80211] [ 2803.950000] [ 2803.950000] ---[ end trace bd86e18a7ac162bc ]--- [ 2804.260000] ieee80211 phy0: Hardware restart was requested [ 2807.260000] ath10k_warn: 22 callbacks suppressed [ 2807.260000] ath10k_pci 0000:01:00.0: firmware unresponsive, restarting.. [ 2807.270000] ath10k_pci 0000:01:00.0: failed to set beacon mode for vdev 0: -11 [ 2810.270000] ath10k_pci 0000:01:00.0: firmware unresponsive, restarting.. [ 2810.270000] ath10k_pci 0000:01:00.0: failed to put down monitor vdev 1: -11 [ 2813.280000] ath10k_pci 0000:01:00.0: firmware unresponsive, restarting.. [ 2813.280000] ath10k_pci 0000:01:00.0: failed to to request monitor vdev 1 stop: -11 [ 2818.290000] ath10k_pci 0000:01:00.0: failed to synchronise monitor vdev 1: -145 [ 2818.290000] ath10k_pci 0000:01:00.0: failed to stop monitor vdev: -145 [ 2821.300000] ath10k_pci 0000:01:00.0: firmware unresponsive, restarting.. [ 2824.410000] ath10k_pci 0000:01:00.0: device successfully recovered [ 2824.710000] ieee80211 phy0: Hardware restart was requested [ 2827.710000] ath10k_pci 0000:01:00.0: firmware unresponsive, restarting.. [ 2830.820000] ath10k_pci 0000:01:00.0: device successfully recovered [ 2831.120000] ieee80211 phy0: Hardware restart was requested [ 2834.220000] ath10k_pci 0000:01:00.0: device successfully recovered Afterwards everything seems to be working normally. There was also an unexpected router reboot yesterday, but I didn't get any logs from that and don't know if it's related. I'll keep testing. Matti -- To unsubscribe from this list: send the line "unsubscribe linux-wireless" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 02/10/2015 09:01 AM, Ben Greear wrote: > I've hacked CT firmware to do a flush of all vdevs itself when it detects WMI hang. > I don't have a good test bed to reproduce the problem reliably, but I should know > after a few days if the flush works at all. If not, then it's a moot point anyway. So, this appears to at least partially work. But, what we notice is that when using multiple station vdevs, the system pretty much becomes useless if we get any significant number of stuck or slow-to-transmit management buffers over WMI. Part of this is because WMI messages are sent when holding rtnl much of the time, I think. I would guess that an AP with lots of peers associated might have similar problems if peers are not ACKing packets reliably. Probably the only useful way to fix this is to make the firmware and driver able to send management frames over the normal transport like every other data packet? Any idea what it wasn't written like that to begin with? Thanks, Ben
On 11 February 2015 at 23:25, Ben Greear <greearb@candelatech.com> wrote: > On 02/10/2015 09:01 AM, Ben Greear wrote: > >> I've hacked CT firmware to do a flush of all vdevs itself when it detects WMI hang. >> I don't have a good test bed to reproduce the problem reliably, but I should know >> after a few days if the flush works at all. If not, then it's a moot point anyway. > > So, this appears to at least partially work. > > But, what we notice is that when using multiple station vdevs, the system pretty much > becomes useless if we get any significant number of stuck or slow-to-transmit management > buffers over WMI. Part of this is because WMI messages are sent when holding rtnl > much of the time, I think. Most, if not all, WMI commands are sent while holding conf_mutex. This lock is taken in many situations including when RTNL is held so your observation isn't entirely correct but isn't wrong either. > I would guess that an AP with lots of peers associated might have similar problems > if peers are not ACKing packets reliably. It's not the ACKing per se. It's whether stations are asleep and unresponsive or not. You could do funny DoS attacks with a single ath9k card (using virtual stations) on ath10k APs now I guess :-) > Probably the only useful way to fix this is to make the firmware and driver able to > send management frames over the normal transport like every other data packet? Agreed. HTT should've been used for entire traffic, including management frames. The workaround could've been to guarantee to have only 1 wmi-mgmt-tx in-flight but since tx-credits aren't replenished predictably you'll end up with the patch I originally did, i.e. sleep 2*bcn intval and wmi-peer-flush-tids after each unicast mgmt frame to a known station. > Any idea what it wasn't written like that to begin with? Beats me. Micha? -- To unsubscribe from this list: send the line "unsubscribe linux-wireless" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 02/11/2015 10:55 PM, Michal Kazior wrote: > On 11 February 2015 at 23:25, Ben Greear <greearb@candelatech.com> wrote: >> On 02/10/2015 09:01 AM, Ben Greear wrote: >> >>> I've hacked CT firmware to do a flush of all vdevs itself when it detects WMI hang. >>> I don't have a good test bed to reproduce the problem reliably, but I should know >>> after a few days if the flush works at all. If not, then it's a moot point anyway. >> >> So, this appears to at least partially work. >> >> But, what we notice is that when using multiple station vdevs, the system pretty much >> becomes useless if we get any significant number of stuck or slow-to-transmit management >> buffers over WMI. Part of this is because WMI messages are sent when holding rtnl >> much of the time, I think. > > Most, if not all, WMI commands are sent while holding conf_mutex. This > lock is taken in many situations including when RTNL is held so your > observation isn't entirely correct but isn't wrong either. > > >> I would guess that an AP with lots of peers associated might have similar problems >> if peers are not ACKing packets reliably. > > It's not the ACKing per se. It's whether stations are asleep and > unresponsive or not. You could do funny DoS attacks with a single > ath9k card (using virtual stations) on ath10k APs now I guess :-) In our lab we have some setups where there should be no power-save at all, but still see this issue. Unlucky (or nefarious) broken-ness in the peer can seem to mostly hang the local system due to the 'not entirely correct' assumption above :) >> Probably the only useful way to fix this is to make the firmware and driver able to >> send management frames over the normal transport like every other data packet? > > Agreed. HTT should've been used for entire traffic, including management frames. > > The workaround could've been to guarantee to have only 1 wmi-mgmt-tx > in-flight but since tx-credits aren't replenished predictably you'll > end up with the patch I originally did, i.e. sleep 2*bcn intval and > wmi-peer-flush-tids after each unicast mgmt frame to a known station. Even assuming I have the tx-credits replenishment fixed, that work-around would make sending sending mgt frames to many peers very slow when at least a few peers are not answering quickly, right? >> Any idea what it wasn't written like that to begin with? > > Beats me. This might be something I can fix in CT firmware..but trying to kick a release out the door, so I think I'll put this off for a bit. Thanks, Ben
> Afterwards everything seems to be working normally. There was also an > unexpected router reboot yesterday, but I didn't get any logs from that and > don't know if it's related. I'll keep testing. > > Matti Turns out the reboots were due to a "Data bus error", which were caused by cold restarts, as far as I can tell (this is a QCA9558-based router). I added reset_mode=1 to ath10k_pci parameters, but then this workaround didn't seem to work anymore (in the log below 20:a9:9b:9e:7c:b0 and bc:c6:db:14:83:cc are mobile phones and 00:03:7f:48:da:c9 is the ath10k access point): Fri Feb 13 14:19:56 2015 daemon.info hostapd: wlan0: STA 20:a9:9b:9e:7c:b0 IEEE 802.11: disassociated due to inactivity Fri Feb 13 14:19:57 2015 daemon.info hostapd: wlan0: STA 20:a9:9b:9e:7c:b0 IEEE 802.11: deauthenticated due to inactivity (timer DEAUTH/REMOVE) Fri Feb 13 14:19:57 2015 kern.warn kernel: [68482.510000] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0 Fri Feb 13 14:19:57 2015 kern.warn kernel: [68482.610000] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0 Fri Feb 13 14:19:57 2015 kern.warn kernel: [68482.710000] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0 Fri Feb 13 14:19:57 2015 kern.warn kernel: [68482.810000] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0 Fri Feb 13 14:19:57 2015 kern.warn kernel: [68482.920000] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0 Fri Feb 13 14:19:57 2015 kern.warn kernel: [68483.020000] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0 Fri Feb 13 14:19:57 2015 kern.warn kernel: [68483.120000] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0 Fri Feb 13 14:19:58 2015 kern.warn kernel: [68483.220000] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0 Fri Feb 13 14:19:58 2015 kern.warn kernel: [68483.330000] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0 Fri Feb 13 14:19:58 2015 kern.warn kernel: [68483.430000] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0 Fri Feb 13 14:20:00 2015 kern.warn kernel: [68485.390000] ------------[ cut here ]------------ Fri Feb 13 14:20:00 2015 kern.warn kernel: [68485.390000] WARNING: CPU: 0 PID: 2550 at /home/matti/Projects/openwrt/trunk/openwrt/build_dir/target-mips_34kc_uClibc-0.9.33.2/linux-ar71xx_generic/compat-wir eless-2014-11-04/net/mac80211/sta_info.c:916 sta_info_move_state+0x580/0x604 [mac80211]() Fri Feb 13 14:20:00 2015 kern.warn kernel: [68485.410000] Modules linked in: ifb pppoe ppp_async iptable_nat ath9k pppox ppp_generic nf_nat_ipv4 nf_conntrack_ipv6 nf_conntrack_ipv4 ipt_MASQUERADE ath9k_co mmon xt_time xt_tcpudp xt_tcpmss xt_string xt_statistic xt_state xt_recent xt_nat xt_multiport xt_mark xt_mac xt_limit xt_length xt_id xt_hl xt_helper xt_ecn xt_dscp xt_conntrack xt_connmark xt_connlimit xt_connbytes xt_comment xt_TCPMSS xt_REDIRECT xt_LOG xt_HL xt_DSCP xt_CT xt_CLASSIFY ts_kmp ts_fsm ts_bFri Feb 13 14:20:00 2015 kern.warn kernel: [68485.510000] CPU: 0 PID: 2550 Comm: hostapd Not tainted 3.14.30 #1 Fri Feb 13 14:20:00 2015 kern.warn kernel: [68485.520000] Stack : 00000006 00000000 00000000 00000000 00000000 00000000 803bc98e 00000035 Fri Feb 13 14:20:00 2015 kern.warn kernel: [68485.520000] 85fd7ba0 00000000 802fcad8 80347623 000009f6 803b3b5c 85fd7ba0 00000000 Fri Feb 13 14:20:00 2015 kern.warn kernel: [68485.520000] 00000004 00000000 00000008 8029e880 00000003 8020040c 00000394 00000000 Fri Feb 13 14:20:00 2015 kern.warn kernel: [68485.520000] 802ffc9c 85f9dabc 00000000 00000000 00000000 00000000 00000000 00000000 Fri Feb 13 14:20:00 2015 kern.warn kernel: [68485.520000] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 Fri Feb 13 14:20:00 2015 kern.warn kernel: [68485.520000] ... Fri Feb 13 14:20:00 2015 kern.warn kernel: [68485.560000] Call Trace: Fri Feb 13 14:20:00 2015 kern.warn kernel: [68485.560000] [<80242644>] show_stack+0x48/0x70 Fri Feb 13 14:20:00 2015 kern.warn kernel: [68485.560000] [<802aee18>] warn_slowpath_common+0x84/0xb4 Fri Feb 13 14:20:00 2015 kern.warn kernel: [68485.570000] [<802aeed0>] warn_slowpath_null+0x18/0x24 Fri Feb 13 14:20:00 2015 kern.warn kernel: [68485.570000] [<872851e0>] sta_info_move_state+0x580/0x604 [mac80211] Fri Feb 13 14:20:00 2015 kern.warn kernel: [68485.580000] Fri Feb 13 14:20:00 2015 kern.warn kernel: [68485.580000] ---[ end trace 830537c24c03b27c ]--- Fri Feb 13 14:20:00 2015 kern.info kernel: [68485.890000] ieee80211 phy0: Hardware restart was requested Fri Feb 13 14:20:03 2015 kern.warn kernel: [68488.890000] ath10k_warn: 23 callbacks suppressed Fri Feb 13 14:20:03 2015 kern.warn kernel: [68488.890000] ath10k_pci 0000:01:00.0: firmware unresponsive, restarting.. Fri Feb 13 14:20:03 2015 kern.warn kernel: [68488.900000] ath10k_pci 0000:01:00.0: failed to put down monitor vdev 1: -11 Fri Feb 13 14:20:06 2015 kern.warn kernel: [68491.900000] ath10k_pci 0000:01:00.0: firmware unresponsive, restarting.. Fri Feb 13 14:20:06 2015 kern.warn kernel: [68491.900000] ath10k_pci 0000:01:00.0: failed to to request monitor vdev 1 stop: -11 Fri Feb 13 14:20:11 2015 kern.warn kernel: [68496.910000] ath10k_pci 0000:01:00.0: failed to synchronise monitor vdev 1: -145 Fri Feb 13 14:20:11 2015 kern.warn kernel: [68496.910000] ath10k_pci 0000:01:00.0: failed to stop monitor vdev: -145 Fri Feb 13 14:20:14 2015 kern.warn kernel: [68499.920000] ath10k_pci 0000:01:00.0: firmware unresponsive, restarting.. Fri Feb 13 14:20:17 2015 kern.info kernel: [68503.020000] ath10k_pci 0000:01:00.0: device successfully recovered Fri Feb 13 14:20:18 2015 kern.info kernel: [68503.320000] ieee80211 phy0: Hardware restart was requested Fri Feb 13 14:20:21 2015 kern.warn kernel: [68506.320000] ath10k_pci 0000:01:00.0: firmware unresponsive, restarting.. Fri Feb 13 14:20:21 2015 kern.warn kernel: [68506.320000] ath10k_pci 0000:01:00.0: failed to install key for vdev 0 peer 00:03:7f:48:da:c9: -11 Fri Feb 13 14:20:21 2015 kern.err kernel: [68506.330000] wlan0: failed to remove key (1, ff:ff:ff:ff:ff:ff) from hardware (-11) Fri Feb 13 14:20:24 2015 kern.warn kernel: [68509.340000] ath10k_pci 0000:01:00.0: firmware unresponsive, restarting.. Fri Feb 13 14:20:24 2015 kern.warn kernel: [68509.340000] ath10k_pci 0000:01:00.0: failed to install key for vdev 0 peer 00:03:7f:48:da:c9: -11 Fri Feb 13 14:20:24 2015 kern.err kernel: [68509.350000] wlan0: failed to set key (1, ff:ff:ff:ff:ff:ff) to hardware (-11) Fri Feb 13 14:20:27 2015 kern.warn kernel: [68512.360000] ath10k_pci 0000:01:00.0: firmware unresponsive, restarting.. Fri Feb 13 14:20:30 2015 kern.info kernel: [68515.470000] ath10k_pci 0000:01:00.0: device successfully recovered Fri Feb 13 14:20:30 2015 kern.info kernel: [68515.770000] ieee80211 phy0: Hardware restart was requested Fri Feb 13 14:20:33 2015 kern.warn kernel: [68518.770000] ath10k_pci 0000:01:00.0: firmware unresponsive, restarting.. Fri Feb 13 14:20:36 2015 kern.warn kernel: [68521.880000] ath10k_pci 0000:01:00.0: device is wedged, will not restart Fri Feb 13 14:20:36 2015 kern.warn kernel: [68521.890000] ath10k_pci 0000:01:00.0: failed to install key for vdev 0 peer bc:c6:db:14:83:cc: -70 Fri Feb 13 14:20:36 2015 kern.err kernel: [68521.890000] wlan0: failed to set key (0, bc:c6:db:14:83:cc) to hardware (-70) Fri Feb 13 14:20:36 2015 kern.warn kernel: [68521.900000] ath10k_pci 0000:01:00.0: failed to install key for vdev 0 peer 00:03:7f:48:da:c9: -70 Fri Feb 13 14:20:36 2015 kern.err kernel: [68521.910000] wlan0: failed to set key (2, ff:ff:ff:ff:ff:ff) to hardware (-70) Fri Feb 13 14:20:36 2015 kern.warn kernel: [68521.920000] ath10k_pci 0000:01:00.0: failed to install key for vdev 0 peer 00:03:7f:48:da:c9: -70 Fri Feb 13 14:20:36 2015 kern.err kernel: [68521.930000] wlan0: failed to set key (1, ff:ff:ff:ff:ff:ff) to hardware (-70) Fri Feb 13 14:20:36 2015 kern.warn kernel: [68521.940000] ath10k_pci 0000:01:00.0: failed to install key for vdev 0 peer bc:c6:db:14:83:cc: -70 Fri Feb 13 14:20:36 2015 kern.err kernel: [68521.940000] wlan0: failed to remove key (0, bc:c6:db:14:83:cc) from hardware (-70) Fri Feb 13 14:20:36 2015 kern.warn kernel: [68521.950000] ath10k_pci 0000:01:00.0: failed to delete peer bc:c6:db:14:83:cc for vdev 0: -70 Fri Feb 13 14:20:36 2015 daemon.info hostapd: wlan0: STA bc:c6:db:14:83:cc IEEE 802.11: deauthenticated due to local deauth request Fri Feb 13 14:20:36 2015 kern.warn kernel: [68521.960000] ath10k_pci 0000:01:00.0: failed to set beacon mode for vdev 0: -70 Fri Feb 13 14:20:36 2015 kern.warn kernel: [68521.970000] ath10k_pci 0000:01:00.0: failed to recalculate rts/cts prot for vdev 0: -70 Fri Feb 13 14:20:38 2015 kern.warn kernel: [68523.820000] ath10k_warn: 17 callbacks suppressed Fri Feb 13 14:20:38 2015 kern.warn kernel: [68523.830000] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0 Fri Feb 13 14:20:38 2015 kern.warn kernel: [68523.920000] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0 Fri Feb 13 14:20:38 2015 kern.warn kernel: [68524.030000] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0 ... this goes on and on, and the wireless network is not even visible anymore until I tear down wlan0 and put it back up: ... Fri Feb 13 16:32:17 2015 kern.warn kernel: [76422.940000] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0 Fri Feb 13 16:32:22 2015 kern.warn kernel: [76427.240000] ath10k_warn: 45 callbacks suppressed Fri Feb 13 16:32:22 2015 kern.warn kernel: [76427.240000] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0 Fri Feb 13 16:32:22 2015 kern.warn kernel: [76427.300000] ath10k_pci 0000:01:00.0: Spurious quick kickout for STA bc:c6:db:14:83:cc Fri Feb 13 16:32:22 2015 kern.warn kernel: [76427.340000] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0 Fri Feb 13 16:32:22 2015 kern.warn kernel: [76427.440000] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0 Fri Feb 13 16:32:22 2015 kern.warn kernel: [76427.550000] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0 Fri Feb 13 16:32:22 2015 kern.warn kernel: [76427.650000] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0 Fri Feb 13 16:32:22 2015 kern.warn kernel: [76427.750000] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0 Fri Feb 13 16:32:22 2015 kern.warn kernel: [76427.850000] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0 Fri Feb 13 16:32:22 2015 kern.warn kernel: [76427.960000] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0 Fri Feb 13 16:32:22 2015 kern.warn kernel: [76428.060000] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0 Fri Feb 13 16:32:27 2015 kern.info kernel: [76432.210000] device wlan0 left promiscuous mode Fri Feb 13 16:32:27 2015 kern.info kernel: [76432.220000] br-lan: port 2(wlan0) entered disabled state Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.230000] ------------[ cut here ]------------ Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.230000] WARNING: CPU: 0 PID: 2550 at /home/matti/Projects/openwrt/trunk/openwrt/build_dir/target-mips_34kc_uClibc-0.9.33.2/linux-ar71xx_generic/compat-wir eless-2014-11-04/net/mac80211/key.c:642 ieee80211_free_keys+0x118/0x1b8 [mac80211]() Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.250000] Modules linked in: ifb pppoe ppp_async iptable_nat ath9k pppox ppp_generic nf_nat_ipv4 nf_conntrack_ipv6 nf_conntrack_ipv4 ipt_MASQUERADE ath9k_co mmon xt_time xt_tcpudp xt_tcpmss xt_string xt_statistic xt_state xt_recent xt_nat xt_multiport xt_mark xt_mac xt_limit xt_length xt_id xt_hl xt_helper xt_ecn xt_dscp xt_conntrack xt_connmark xt_connlimit xt_connbytes xt_comment xt_TCPMSS xt_REDIRECT xt_LOG xt_HL xt_DSCP xt_CT xt_CLASSIFY ts_kmp ts_fsm ts_bFri Feb 13 16:32:27 2015 kern.warn kernel: [76432.350000] CPU: 0 PID: 2550 Comm: hostapd Tainted: G W 3.14.30 #1 Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.360000] Stack : 00000006 00000000 00000000 Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.370000] ath10k_warn: 49 callbacks suppressed Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.370000] ath10k_pci 0000:01:00.0: No VIF found for vdev 0 Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.380000] ath10k_pci 0000:01:00.0: no vif for vdev_id 0 found Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.380000] 00000000 00000000 00000000 803bc98e 00000040 Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.380000] 85fd7ba0 866bb8a4 802fcad8 80347623 000009f6 803b3b5c 85fd7ba0 866bb8a4 Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.380000] 872cfe2c 00000000 00000001 8029e880 00000003 80200398 00000282 866bb8a4 Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.380000] 802ffc9c 85f9dc44 00000000 00000000 00000000 00000000 00000000 00000000 Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.380000] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.380000] ... Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.420000] Call Trace: Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.420000] [<80242644>] show_stack+0x48/0x70 Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.420000] [<802aee18>] warn_slowpath_common+0x84/0xb4 Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.430000] [<802aeed0>] warn_slowpath_null+0x18/0x24 Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.430000] [<872a69e4>] ieee80211_free_keys+0x118/0x1b8 [mac80211] Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.440000] [<872921c4>] ieee80211_del_virtual_monitor+0x67c/0x8fc [mac80211] Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.450000] [<80067a04>] _cond_resched+0x2c/0x3c Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.450000] Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.450000] ---[ end trace 830537c24c03b27d ]--- Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.460000] ath10k_pci 0000:01:00.0: failed to remove peer for AP vdev 0: -70 Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.460000] ath10k_pci 0000:01:00.0: No VIF found for vdev 0 Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.470000] ath10k_pci 0000:01:00.0: no vif for vdev_id 0 found Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.480000] ath10k_pci 0000:01:00.0: failed to delete WMI vdev 0: -70 Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.480000] ath10k_pci 0000:01:00.0: removing stale peer bc:c6:db:14:83:cc from vdev_id 0 Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.490000] ath10k_pci 0000:01:00.0: removing stale peer 00:03:7f:48:da:c9 from vdev_id 0 Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.500000] ath10k_pci 0000:01:00.0: could not suspend target (-70) Fri Feb 13 16:32:27 2015 daemon.notice netifd: Network device 'wlan0' link is down Fri Feb 13 16:32:55 2015 kern.info kernel: [76460.490000] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready Fri Feb 13 16:32:55 2015 kern.info kernel: [76460.490000] device wlan0 entered promiscuous mode Fri Feb 13 16:32:58 2015 kern.info kernel: [76463.170000] br-lan: port 2(wlan0) entered forwarding state Fri Feb 13 16:32:58 2015 kern.info kernel: [76463.180000] br-lan: port 2(wlan0) entered forwarding state Fri Feb 13 16:32:58 2015 kern.info kernel: [76463.180000] IPv6: ADDRCONF(NETDEV_CHANGE): wlan0: link becomes ready Fri Feb 13 16:32:58 2015 daemon.notice netifd: Network device 'wlan0' link is up Fri Feb 13 16:32:59 2015 kern.info kernel: [76465.180000] br-lan: port 2(wlan0) entered forwarding state Matti -- To unsubscribe from this list: send the line "unsubscribe linux-wireless" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/drivers/net/wireless/ath/ath10k/wmi.c b/drivers/net/wireless/ath/ath10k/wmi.c index aeea1c7..776b257 100644 --- a/drivers/net/wireless/ath/ath10k/wmi.c +++ b/drivers/net/wireless/ath/ath10k/wmi.c @@ -1045,9 +1045,15 @@ int ath10k_wmi_cmd_send(struct ath10k *ar, struct sk_buff *skb, u32 cmd_id) (ret != -EAGAIN); }), 3*HZ); - if (ret) + if (ret) { dev_kfree_skb_any(skb); + if (ret == -EAGAIN) { + ath10k_warn(ar, "firmware unresponsive, restarting..\n"); + queue_work(ar->workqueue, &ar->restart_work); + } + } + return ret; }
It makes little sense to continue and let firmware-host state become inconsistent if a WMI command can't be submitted to firmware. This effectively prevents after-affects of tx-credit starvation bug which include spurious sta kickout events and inability to associate new stations after some time when acting as AP. This should also speed up recovery/teardown in some cases when firmware stops responding for some reason. Signed-off-by: Michal Kazior <michal.kazior@tieto.com> --- drivers/net/wireless/ath/ath10k/wmi.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-)