diff mbox

[RFT] ath10k: restart fw on tx-credit timeout

Message ID 1423224354-24955-1-git-send-email-michal.kazior@tieto.com (mailing list archive)
State RFC
Headers show

Commit Message

Michal Kazior Feb. 6, 2015, 12:05 p.m. UTC
It makes little sense to continue and let
firmware-host state become inconsistent if a WMI
command can't be submitted to firmware.

This effectively prevents after-affects of
tx-credit starvation bug which include spurious
sta kickout events and inability to associate new
stations after some time when acting as AP.

This should also speed up recovery/teardown in
some cases when firmware stops responding for some
reason.

Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
---
 drivers/net/wireless/ath/ath10k/wmi.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Comments

Ben Greear Feb. 6, 2015, 4:15 p.m. UTC | #1
On 02/06/2015 04:05 AM, Michal Kazior wrote:
> It makes little sense to continue and let
> firmware-host state become inconsistent if a WMI
> command can't be submitted to firmware.
>
> This effectively prevents after-affects of
> tx-credit starvation bug which include spurious
> sta kickout events and inability to associate new
> stations after some time when acting as AP.
>
> This should also speed up recovery/teardown in
> some cases when firmware stops responding for some
> reason.

I have not seen a WMI timeout since I added keep-alive
and CE polling in my firmware, but the patch looks OK
to me.

You might add something about 'WMI' in that warning
message to make it more clear what is not being
responsive.  At least in my tests, I could continue
to receive network traffic while WMI was blocked.

Thanks,
Ben

>
> Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
> ---
>   drivers/net/wireless/ath/ath10k/wmi.c | 8 +++++++-
>   1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/wireless/ath/ath10k/wmi.c b/drivers/net/wireless/ath/ath10k/wmi.c
> index aeea1c7..776b257 100644
> --- a/drivers/net/wireless/ath/ath10k/wmi.c
> +++ b/drivers/net/wireless/ath/ath10k/wmi.c
> @@ -1045,9 +1045,15 @@ int ath10k_wmi_cmd_send(struct ath10k *ar, struct sk_buff *skb, u32 cmd_id)
>   		(ret != -EAGAIN);
>   	}), 3*HZ);
>
> -	if (ret)
> +	if (ret) {
>   		dev_kfree_skb_any(skb);
>
> +		if (ret == -EAGAIN) {
> +			ath10k_warn(ar, "firmware unresponsive, restarting..\n");
> +			queue_work(ar->workqueue, &ar->restart_work);
> +		}
> +	}
> +
>   	return ret;
>   }
>
>
Michal Kazior Feb. 9, 2015, 6:24 a.m. UTC | #2
On 6 February 2015 at 17:15, Ben Greear <greearb@candelatech.com> wrote:
> On 02/06/2015 04:05 AM, Michal Kazior wrote:
>>
>> It makes little sense to continue and let
>> firmware-host state become inconsistent if a WMI
>> command can't be submitted to firmware.
>>
>> This effectively prevents after-affects of
>> tx-credit starvation bug which include spurious
>> sta kickout events and inability to associate new
>> stations after some time when acting as AP.
>>
>> This should also speed up recovery/teardown in
>> some cases when firmware stops responding for some
>> reason.
>
>
> I have not seen a WMI timeout since I added keep-alive
> and CE polling in my firmware, but the patch looks OK
> to me.

This is mainly aimed at the tx-credit starvation due to mgmt-tx being
stuck on client powersave buffering.


> You might add something about 'WMI' in that warning
> message to make it more clear what is not being
> responsive.

Good point.


> At least in my tests, I could continue
> to receive network traffic while WMI was blocked.

Yeah. Traffic works with the tx-credit starvation as well but what
good is this if you have inconsistent driver-firmware state after
failing to send a few commands?


Micha?
Ben Greear Feb. 9, 2015, 4:03 p.m. UTC | #3
On 02/08/2015 10:24 PM, Michal Kazior wrote:
> On 6 February 2015 at 17:15, Ben Greear <greearb@candelatech.com> wrote:
>> On 02/06/2015 04:05 AM, Michal Kazior wrote:
>>>
>>> It makes little sense to continue and let
>>> firmware-host state become inconsistent if a WMI
>>> command can't be submitted to firmware.
>>>
>>> This effectively prevents after-affects of
>>> tx-credit starvation bug which include spurious
>>> sta kickout events and inability to associate new
>>> stations after some time when acting as AP.
>>>
>>> This should also speed up recovery/teardown in
>>> some cases when firmware stops responding for some
>>> reason.
>>
>>
>> I have not seen a WMI timeout since I added keep-alive
>> and CE polling in my firmware, but the patch looks OK
>> to me.
>
> This is mainly aimed at the tx-credit starvation due to mgmt-tx being
> stuck on client powersave buffering.
>
>
>> You might add something about 'WMI' in that warning
>> message to make it more clear what is not being
>> responsive.
>
> Good point.
>
>
>> At least in my tests, I could continue
>> to receive network traffic while WMI was blocked.
>
> Yeah. Traffic works with the tx-credit starvation as well but what
> good is this if you have inconsistent driver-firmware state after
> failing to send a few commands?

I just mention it because someone debugging the system might
wonder why the firmware is 'locked up' while it is still passing traffic.

I agree with your patch in general.

If it is just powersave issue, could you force (overriding wmi tx-credits limit)
a flush at the 1 second mark and hope it recovers by 3 second timeout?

Thanks,
Ben
Michal Kazior Feb. 10, 2015, 6:09 a.m. UTC | #4
On 9 February 2015 at 17:03, Ben Greear <greearb@candelatech.com> wrote:
> On 02/08/2015 10:24 PM, Michal Kazior wrote:
>> On 6 February 2015 at 17:15, Ben Greear <greearb@candelatech.com> wrote:
[...]
>>> At least in my tests, I could continue
>>> to receive network traffic while WMI was blocked.
>>
>>
>> Yeah. Traffic works with the tx-credit starvation as well but what
>> good is this if you have inconsistent driver-firmware state after
>> failing to send a few commands?
>
>
> I just mention it because someone debugging the system might
> wonder why the firmware is 'locked up' while it is still passing traffic.

I might as well include this info in the commit log.


> If it is just powersave issue, could you force (overriding wmi tx-credits
> limit)
> a flush at the 1 second mark and hope it recovers by 3 second timeout?

Well, there's a couple of problems with that.

1) you suggest to call wmi_flush() which calls wmi_cmd_send() *while*
in wmi_cmd_send(),
2) you're out of credits, how do you intend to submit a frame,
3) yes, you can force it, but that's super ugly,
4) and this still doesn't guarantee you'll get your tx-credit due to
firmware quirkiness.

(but maybe the latest 10.2.4 behaves differently? who knows..)

My original stab at this involved a proactive flushing but it had it's
problems (delayed mgmt frames, could still get stuck in rare cases).


Micha?
Ben Greear Feb. 10, 2015, 5:01 p.m. UTC | #5
On 02/09/2015 10:09 PM, Michal Kazior wrote:
> On 9 February 2015 at 17:03, Ben Greear <greearb@candelatech.com> wrote:
>> On 02/08/2015 10:24 PM, Michal Kazior wrote:
>>> On 6 February 2015 at 17:15, Ben Greear <greearb@candelatech.com> wrote:
> [...]
>>>> At least in my tests, I could continue
>>>> to receive network traffic while WMI was blocked.
>>>
>>>
>>> Yeah. Traffic works with the tx-credit starvation as well but what
>>> good is this if you have inconsistent driver-firmware state after
>>> failing to send a few commands?
>>
>>
>> I just mention it because someone debugging the system might
>> wonder why the firmware is 'locked up' while it is still passing traffic.
> 
> I might as well include this info in the commit log.
> 
> 
>> If it is just powersave issue, could you force (overriding wmi tx-credits
>> limit)
>> a flush at the 1 second mark and hope it recovers by 3 second timeout?
> 
> Well, there's a couple of problems with that.
> 
> 1) you suggest to call wmi_flush() which calls wmi_cmd_send() *while*
> in wmi_cmd_send(),
> 2) you're out of credits, how do you intend to submit a frame,
> 3) yes, you can force it, but that's super ugly,
> 4) and this still doesn't guarantee you'll get your tx-credit due to
> firmware quirkiness.

If you do force the wmi send, ignoring credits, and even if you are running
stock firmware with wmi credits replenishment issues, then you at least
have a chance of recovering.  (Upstream firmware requires lots of WMI messages
to flush, I think..like you can only flush per peer or something like that,
so maybe there is no realistic way to flush with small number of WMI messages.)

I've hacked CT firmware to do a flush of all vdevs itself when it detects WMI hang.
I don't have a good test bed to reproduce the problem reliably, but I should know
after a few days if the flush works at all.  If not, then it's a moot point anyway.

Thanks,
Ben
Matti Laakso Feb. 11, 2015, 1:30 p.m. UTC | #6
> -----Original Message-----
> From: Michal Kazior [mailto:michal.kazior@tieto.com]
> Sent: 6. helmikuuta 2015 14:06
> To: ath10k@lists.infradead.org
> Cc: linux-wireless@vger.kernel.org; malaakso@elisanet.fi;
> greearb@candelatech.com; Michal Kazior
> Subject: [RFT] ath10k: restart fw on tx-credit timeout
> 
> It makes little sense to continue and let firmware-host state become
> inconsistent if a WMI command can't be submitted to firmware.
> 
> This effectively prevents after-affects of tx-credit starvation bug which
> include spurious sta kickout events and inability to associate new
stations
> after some time when acting as AP.
> 
> This should also speed up recovery/teardown in some cases when firmware
> stops responding for some reason.
> 
> Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
> ---
>  drivers/net/wireless/ath/ath10k/wmi.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/wireless/ath/ath10k/wmi.c
> b/drivers/net/wireless/ath/ath10k/wmi.c
> index aeea1c7..776b257 100644
> --- a/drivers/net/wireless/ath/ath10k/wmi.c
> +++ b/drivers/net/wireless/ath/ath10k/wmi.c
> @@ -1045,9 +1045,15 @@ int ath10k_wmi_cmd_send(struct ath10k *ar,
> struct sk_buff *skb, u32 cmd_id)
>  		(ret != -EAGAIN);
>  	}), 3*HZ);
> 
> -	if (ret)
> +	if (ret) {
>  		dev_kfree_skb_any(skb);
> 
> +		if (ret == -EAGAIN) {
> +			ath10k_warn(ar, "firmware
> unresponsive, restarting..\n");
> +			queue_work(ar->workqueue,
> &ar->restart_work);
> +		}
> +	}
> +
>  	return ret;
>  }
> 
> --
> 1.8.5.3

(Re-sending due to E-mail client messing up formatting)

Hi Michal,

I've been running OpenWrt with this patch (applied on top of backports from
2014-11-04) since last Friday, and today after one mobile phone left the
building,

Wed Feb 11 14:00:11 2015 daemon.info hostapd: wlan0: STA bc:c6:db:14:83:cc
IEEE 802.11: disassociated due to inactivity
Wed Feb 11 14:00:12 2015 daemon.info hostapd: wlan0: STA bc:c6:db:14:83:cc
IEEE 802.11: deauthenticated due to inactivity (timer DEAUTH/REMOVE)
Wed Feb 11 14:00:12 2015 kern.warn kernel: [ 2800.890000] ath10k_pci
0000:01:00.0: SWBA overrun on vdev 0

I got this:

[ 2800.890000] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0
[ 2800.990000] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0
[ 2801.090000] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0
[ 2801.200000] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0
[ 2801.300000] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0
[ 2801.400000] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0
[ 2801.500000] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0
[ 2801.610000] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0
[ 2801.710000] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0
[ 2801.810000] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0
[ 2803.760000] ------------[ cut here ]------------
[ 2803.760000] WARNING: CPU: 0 PID: 2531 at
/home/matti/Projects/openwrt/trunk/openwrt/build_dir/target-mips_34kc_uClibc
-0.9.33.2/linux-ar71xx_generic/compat-wireless-2014-11-04/net/mac80211/sta_i
nfo.c:916 sta_info_move_state+0x580/0x604 [mac80211]()
[ 2803.780000] Modules linked in: ifb pppoe ppp_async iptable_nat ath9k
pppox ppp_generic nf_nat_ipv4 nf_conntrack_ipv6 nf_conntrack_ipv4
ipt_MASQUERADE ath9k_common xt_time xt_tcpudp xt_tcpmss xt_string
xt_statistic xt_state xt_recent xt_nat xt_multiport xt_mark xt_mac xt_limit
xt_length xt_id xt_hl xt_helper xt_ecn xt_dscp xt_conntrack xt_connmark
xt_connlimit xt_connbytes xt_comment xt_TCPMSS xt_REDIRECT xt_LOG xt_HL
xt_DSCP xt_CT xt_CLASSIFY ts_kmp ts_fsm ts_bm slhc nf_nat_irc nf_nat_ftp
nf_nat nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_rtcache nf_conntrack_irc
nf_conntrack_ftp iptable_raw iptable_mangle iptable_filter ipt_REJECT
ipt_ECN ip_tables crc_ccitt ath9k_hw act_connmark nf_conntrack act_skbedit
act_mirred em_u32 cls_u32 cls_tcindex cls_flow cls_route cls_fw sch_hfsc
sch_ingress ath10k_pci ath10k_core ath mac80211 cfg80211 compat
ledtrig_usbdev ip6t_REJECT ip6table_raw ip6table_mangle ip6table_filter
ip6_tables x_tables ipv6 arc4 crypto_blkcipher usb_storage ehci_platform
ehci_hcd sd_mod scsi_mod gpio_button_hotplug ext4 crc16 jbd2 mbcache usbcore
nls_base usb_common mii crypto_hash [last unloaded: ifb]
[ 2803.880000] CPU: 0 PID: 2531 Comm: hostapd Not tainted 3.14.30 #1
[ 2803.890000] Stack : 00000006 00000000 00000000 00000000 00000000 00000000
803bc98e 00000035
[ 2803.890000]    873576d8 00000000 802fcad8 80347623 000009e3 803b3b5c
873576d8 00000000
[ 2803.890000]    00000004 00000000 00000008 8029e880 00000003 8020040c
00000394 00000000
[ 2803.890000]    802ffc9c 85265abc 00000000 00000000 00000000 00000000
00000000 00000000
[ 2803.890000]    00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000
[ 2803.890000]    ...
[ 2803.930000] Call Trace:
[ 2803.930000] [<80242644>] show_stack+0x48/0x70
[ 2803.930000] [<802aee18>] warn_slowpath_common+0x84/0xb4
[ 2803.940000] [<802aeed0>] warn_slowpath_null+0x18/0x24
[ 2803.940000] [<872051e0>] sta_info_move_state+0x580/0x604 [mac80211]
[ 2803.950000]
[ 2803.950000] ---[ end trace bd86e18a7ac162bc ]---
[ 2804.260000] ieee80211 phy0: Hardware restart was requested
[ 2807.260000] ath10k_warn: 22 callbacks suppressed
[ 2807.260000] ath10k_pci 0000:01:00.0: firmware unresponsive, restarting..
[ 2807.270000] ath10k_pci 0000:01:00.0: failed to set beacon mode for vdev
0: -11
[ 2810.270000] ath10k_pci 0000:01:00.0: firmware unresponsive, restarting..
[ 2810.270000] ath10k_pci 0000:01:00.0: failed to put down monitor vdev 1:
-11
[ 2813.280000] ath10k_pci 0000:01:00.0: firmware unresponsive, restarting..
[ 2813.280000] ath10k_pci 0000:01:00.0: failed to to request monitor vdev 1
stop: -11
[ 2818.290000] ath10k_pci 0000:01:00.0: failed to synchronise monitor vdev
1: -145
[ 2818.290000] ath10k_pci 0000:01:00.0: failed to stop monitor vdev: -145
[ 2821.300000] ath10k_pci 0000:01:00.0: firmware unresponsive, restarting..
[ 2824.410000] ath10k_pci 0000:01:00.0: device successfully recovered
[ 2824.710000] ieee80211 phy0: Hardware restart was requested
[ 2827.710000] ath10k_pci 0000:01:00.0: firmware unresponsive, restarting..
[ 2830.820000] ath10k_pci 0000:01:00.0: device successfully recovered
[ 2831.120000] ieee80211 phy0: Hardware restart was requested
[ 2834.220000] ath10k_pci 0000:01:00.0: device successfully recovered

Afterwards everything seems to be working normally. There was also an
unexpected router reboot yesterday, but I didn't get any logs from that and
don't know if it's related. I'll keep testing.

Matti
Ben Greear Feb. 11, 2015, 10:25 p.m. UTC | #7
On 02/10/2015 09:01 AM, Ben Greear wrote:

> I've hacked CT firmware to do a flush of all vdevs itself when it detects WMI hang.
> I don't have a good test bed to reproduce the problem reliably, but I should know
> after a few days if the flush works at all.  If not, then it's a moot point anyway.

So, this appears to at least partially work.

But, what we notice is that when using multiple station vdevs, the system pretty much
becomes useless if we get any significant number of stuck or slow-to-transmit management
buffers over WMI.  Part of this is because WMI messages are sent when holding rtnl
much of the time, I think.

I would guess that an AP with lots of peers associated might have similar problems
if peers are not ACKing packets reliably.

Probably the only useful way to fix this is to make the firmware and driver able to
send management frames over the normal transport like every other data packet?

Any idea what it wasn't written like that to begin with?

Thanks,
Ben
Michal Kazior Feb. 12, 2015, 6:55 a.m. UTC | #8
On 11 February 2015 at 23:25, Ben Greear <greearb@candelatech.com> wrote:
> On 02/10/2015 09:01 AM, Ben Greear wrote:
>
>> I've hacked CT firmware to do a flush of all vdevs itself when it detects WMI hang.
>> I don't have a good test bed to reproduce the problem reliably, but I should know
>> after a few days if the flush works at all.  If not, then it's a moot point anyway.
>
> So, this appears to at least partially work.
>
> But, what we notice is that when using multiple station vdevs, the system pretty much
> becomes useless if we get any significant number of stuck or slow-to-transmit management
> buffers over WMI.  Part of this is because WMI messages are sent when holding rtnl
> much of the time, I think.

Most, if not all, WMI commands are sent while holding conf_mutex. This
lock is taken in many situations including when RTNL is held so your
observation isn't entirely correct but isn't wrong either.


> I would guess that an AP with lots of peers associated might have similar problems
> if peers are not ACKing packets reliably.

It's not the ACKing per se. It's whether stations are asleep and
unresponsive or not. You could do funny DoS attacks with a single
ath9k card (using virtual stations) on ath10k APs now I guess :-)


> Probably the only useful way to fix this is to make the firmware and driver able to
> send management frames over the normal transport like every other data packet?

Agreed. HTT should've been used for entire traffic, including management frames.

The workaround could've been to guarantee to have only 1 wmi-mgmt-tx
in-flight but since tx-credits aren't replenished predictably you'll
end up with the patch I originally did, i.e. sleep 2*bcn intval and
wmi-peer-flush-tids after each unicast mgmt frame to a known station.


> Any idea what it wasn't written like that to begin with?

Beats me.


Micha?
Ben Greear Feb. 12, 2015, 1:21 p.m. UTC | #9
On 02/11/2015 10:55 PM, Michal Kazior wrote:
> On 11 February 2015 at 23:25, Ben Greear <greearb@candelatech.com> wrote:
>> On 02/10/2015 09:01 AM, Ben Greear wrote:
>>
>>> I've hacked CT firmware to do a flush of all vdevs itself when it detects WMI hang.
>>> I don't have a good test bed to reproduce the problem reliably, but I should know
>>> after a few days if the flush works at all.  If not, then it's a moot point anyway.
>>
>> So, this appears to at least partially work.
>>
>> But, what we notice is that when using multiple station vdevs, the system pretty much
>> becomes useless if we get any significant number of stuck or slow-to-transmit management
>> buffers over WMI.  Part of this is because WMI messages are sent when holding rtnl
>> much of the time, I think.
>
> Most, if not all, WMI commands are sent while holding conf_mutex. This
> lock is taken in many situations including when RTNL is held so your
> observation isn't entirely correct but isn't wrong either.
>
>
>> I would guess that an AP with lots of peers associated might have similar problems
>> if peers are not ACKing packets reliably.
>
> It's not the ACKing per se. It's whether stations are asleep and
> unresponsive or not. You could do funny DoS attacks with a single
> ath9k card (using virtual stations) on ath10k APs now I guess :-)

In our lab we have some setups where there should be no power-save at all,
but still see this issue.  Unlucky (or nefarious) broken-ness in the peer can seem to
mostly hang the local system due to the 'not entirely correct' assumption above :)


>> Probably the only useful way to fix this is to make the firmware and driver able to
>> send management frames over the normal transport like every other data packet?
>
> Agreed. HTT should've been used for entire traffic, including management frames.
>
> The workaround could've been to guarantee to have only 1 wmi-mgmt-tx
> in-flight but since tx-credits aren't replenished predictably you'll
> end up with the patch I originally did, i.e. sleep 2*bcn intval and
> wmi-peer-flush-tids after each unicast mgmt frame to a known station.

Even assuming I have the tx-credits replenishment fixed,
that work-around would make sending sending mgt frames to many peers
very slow when at least a few peers are not answering quickly, right?

>> Any idea what it wasn't written like that to begin with?
>
> Beats me.

This might be something I can fix in CT firmware..but trying to kick a release out
the door, so I think I'll put this off for a bit.

Thanks,
Ben
Matti Laakso Feb. 14, 2015, 8:35 a.m. UTC | #10
> Afterwards everything seems to be working normally. There was also an
> unexpected router reboot yesterday, but I didn't get any logs from that and
> don't know if it's related. I'll keep testing.
>
> Matti

Turns out the reboots were due to a "Data bus error", which were caused
by cold restarts, as far as I can tell (this is a QCA9558-based router).
I added reset_mode=1 to ath10k_pci parameters, but then this workaround
didn't seem to work anymore (in the log below 20:a9:9b:9e:7c:b0 and
bc:c6:db:14:83:cc are mobile phones and 00:03:7f:48:da:c9 is the ath10k
access point):

Fri Feb 13 14:19:56 2015 daemon.info hostapd: wlan0: STA
20:a9:9b:9e:7c:b0 IEEE 802.11: disassociated due to inactivity
Fri Feb 13 14:19:57 2015 daemon.info hostapd: wlan0: STA
20:a9:9b:9e:7c:b0 IEEE 802.11: deauthenticated due to inactivity (timer
DEAUTH/REMOVE)
Fri Feb 13 14:19:57 2015 kern.warn kernel: [68482.510000] ath10k_pci
0000:01:00.0: SWBA overrun on vdev 0
Fri Feb 13 14:19:57 2015 kern.warn kernel: [68482.610000] ath10k_pci
0000:01:00.0: SWBA overrun on vdev 0
Fri Feb 13 14:19:57 2015 kern.warn kernel: [68482.710000] ath10k_pci
0000:01:00.0: SWBA overrun on vdev 0
Fri Feb 13 14:19:57 2015 kern.warn kernel: [68482.810000] ath10k_pci
0000:01:00.0: SWBA overrun on vdev 0
Fri Feb 13 14:19:57 2015 kern.warn kernel: [68482.920000] ath10k_pci
0000:01:00.0: SWBA overrun on vdev 0
Fri Feb 13 14:19:57 2015 kern.warn kernel: [68483.020000] ath10k_pci
0000:01:00.0: SWBA overrun on vdev 0
Fri Feb 13 14:19:57 2015 kern.warn kernel: [68483.120000] ath10k_pci
0000:01:00.0: SWBA overrun on vdev 0
Fri Feb 13 14:19:58 2015 kern.warn kernel: [68483.220000] ath10k_pci
0000:01:00.0: SWBA overrun on vdev 0
Fri Feb 13 14:19:58 2015 kern.warn kernel: [68483.330000] ath10k_pci
0000:01:00.0: SWBA overrun on vdev 0
Fri Feb 13 14:19:58 2015 kern.warn kernel: [68483.430000] ath10k_pci
0000:01:00.0: SWBA overrun on vdev 0
Fri Feb 13 14:20:00 2015 kern.warn kernel: [68485.390000] ------------[
cut here ]------------
Fri Feb 13 14:20:00 2015 kern.warn kernel: [68485.390000] WARNING: CPU:
0 PID: 2550 at
/home/matti/Projects/openwrt/trunk/openwrt/build_dir/target-mips_34kc_uClibc-0.9.33.2/linux-ar71xx_generic/compat-wir
eless-2014-11-04/net/mac80211/sta_info.c:916
sta_info_move_state+0x580/0x604 [mac80211]()
Fri Feb 13 14:20:00 2015 kern.warn kernel: [68485.410000] Modules linked
in: ifb pppoe ppp_async iptable_nat ath9k pppox ppp_generic nf_nat_ipv4
nf_conntrack_ipv6 nf_conntrack_ipv4 ipt_MASQUERADE ath9k_co
mmon xt_time xt_tcpudp xt_tcpmss xt_string xt_statistic xt_state
xt_recent xt_nat xt_multiport xt_mark xt_mac xt_limit xt_length xt_id
xt_hl xt_helper xt_ecn xt_dscp xt_conntrack xt_connmark xt_connlimit
xt_connbytes xt_comment xt_TCPMSS xt_REDIRECT xt_LOG xt_HL xt_DSCP xt_CT
xt_CLASSIFY ts_kmp ts_fsm ts_bFri Feb 13 14:20:00 2015 kern.warn kernel:
[68485.510000] CPU: 0 PID: 2550 Comm: hostapd Not tainted
3.14.30 #1
Fri Feb 13 14:20:00 2015 kern.warn kernel: [68485.520000] Stack :
00000006 00000000 00000000 00000000 00000000 00000000 803bc98e 00000035
Fri Feb 13 14:20:00 2015 kern.warn kernel: [68485.520000]       85fd7ba0
00000000 802fcad8 80347623 000009f6 803b3b5c 85fd7ba0 00000000
Fri Feb 13 14:20:00 2015 kern.warn kernel: [68485.520000]       00000004
00000000 00000008 8029e880 00000003 8020040c 00000394 00000000
Fri Feb 13 14:20:00 2015 kern.warn kernel: [68485.520000]       802ffc9c
85f9dabc 00000000 00000000 00000000 00000000 00000000 00000000
Fri Feb 13 14:20:00 2015 kern.warn kernel: [68485.520000]       00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000
Fri Feb 13 14:20:00 2015 kern.warn kernel: [68485.520000]       ...
Fri Feb 13 14:20:00 2015 kern.warn kernel: [68485.560000] Call Trace:
Fri Feb 13 14:20:00 2015 kern.warn kernel: [68485.560000] [<80242644>]
show_stack+0x48/0x70
Fri Feb 13 14:20:00 2015 kern.warn kernel: [68485.560000] [<802aee18>]
warn_slowpath_common+0x84/0xb4
Fri Feb 13 14:20:00 2015 kern.warn kernel: [68485.570000] [<802aeed0>]
warn_slowpath_null+0x18/0x24
Fri Feb 13 14:20:00 2015 kern.warn kernel: [68485.570000] [<872851e0>]
sta_info_move_state+0x580/0x604 [mac80211]
Fri Feb 13 14:20:00 2015 kern.warn kernel: [68485.580000]
Fri Feb 13 14:20:00 2015 kern.warn kernel: [68485.580000] ---[ end trace
830537c24c03b27c ]---
Fri Feb 13 14:20:00 2015 kern.info kernel: [68485.890000] ieee80211
phy0: Hardware restart was requested
Fri Feb 13 14:20:03 2015 kern.warn kernel: [68488.890000] ath10k_warn:
23 callbacks suppressed
Fri Feb 13 14:20:03 2015 kern.warn kernel: [68488.890000] ath10k_pci
0000:01:00.0: firmware unresponsive, restarting..
Fri Feb 13 14:20:03 2015 kern.warn kernel: [68488.900000] ath10k_pci
0000:01:00.0: failed to put down monitor vdev 1: -11
Fri Feb 13 14:20:06 2015 kern.warn kernel: [68491.900000] ath10k_pci
0000:01:00.0: firmware unresponsive, restarting..
Fri Feb 13 14:20:06 2015 kern.warn kernel: [68491.900000] ath10k_pci
0000:01:00.0: failed to to request monitor vdev 1 stop: -11
Fri Feb 13 14:20:11 2015 kern.warn kernel: [68496.910000] ath10k_pci
0000:01:00.0: failed to synchronise monitor vdev 1: -145
Fri Feb 13 14:20:11 2015 kern.warn kernel: [68496.910000] ath10k_pci
0000:01:00.0: failed to stop monitor vdev: -145
Fri Feb 13 14:20:14 2015 kern.warn kernel: [68499.920000] ath10k_pci
0000:01:00.0: firmware unresponsive, restarting..
Fri Feb 13 14:20:17 2015 kern.info kernel: [68503.020000] ath10k_pci
0000:01:00.0: device successfully recovered
Fri Feb 13 14:20:18 2015 kern.info kernel: [68503.320000] ieee80211
phy0: Hardware restart was requested
Fri Feb 13 14:20:21 2015 kern.warn kernel: [68506.320000] ath10k_pci
0000:01:00.0: firmware unresponsive, restarting..
Fri Feb 13 14:20:21 2015 kern.warn kernel: [68506.320000] ath10k_pci
0000:01:00.0: failed to install key for vdev 0 peer 00:03:7f:48:da:c9: -11
Fri Feb 13 14:20:21 2015 kern.err kernel: [68506.330000] wlan0: failed
to remove key (1, ff:ff:ff:ff:ff:ff) from hardware (-11)
Fri Feb 13 14:20:24 2015 kern.warn kernel: [68509.340000] ath10k_pci
0000:01:00.0: firmware unresponsive, restarting..
Fri Feb 13 14:20:24 2015 kern.warn kernel: [68509.340000] ath10k_pci
0000:01:00.0: failed to install key for vdev 0 peer 00:03:7f:48:da:c9: -11
Fri Feb 13 14:20:24 2015 kern.err kernel: [68509.350000] wlan0: failed
to set key (1, ff:ff:ff:ff:ff:ff) to hardware (-11)
Fri Feb 13 14:20:27 2015 kern.warn kernel: [68512.360000] ath10k_pci
0000:01:00.0: firmware unresponsive, restarting..
Fri Feb 13 14:20:30 2015 kern.info kernel: [68515.470000] ath10k_pci
0000:01:00.0: device successfully recovered
Fri Feb 13 14:20:30 2015 kern.info kernel: [68515.770000] ieee80211
phy0: Hardware restart was requested
Fri Feb 13 14:20:33 2015 kern.warn kernel: [68518.770000] ath10k_pci
0000:01:00.0: firmware unresponsive, restarting..
Fri Feb 13 14:20:36 2015 kern.warn kernel: [68521.880000] ath10k_pci
0000:01:00.0: device is wedged, will not restart
Fri Feb 13 14:20:36 2015 kern.warn kernel: [68521.890000] ath10k_pci
0000:01:00.0: failed to install key for vdev 0 peer bc:c6:db:14:83:cc: -70
Fri Feb 13 14:20:36 2015 kern.err kernel: [68521.890000] wlan0: failed
to set key (0, bc:c6:db:14:83:cc) to hardware (-70)
Fri Feb 13 14:20:36 2015 kern.warn kernel: [68521.900000] ath10k_pci
0000:01:00.0: failed to install key for vdev 0 peer 00:03:7f:48:da:c9: -70
Fri Feb 13 14:20:36 2015 kern.err kernel: [68521.910000] wlan0: failed
to set key (2, ff:ff:ff:ff:ff:ff) to hardware (-70)
Fri Feb 13 14:20:36 2015 kern.warn kernel: [68521.920000] ath10k_pci
0000:01:00.0: failed to install key for vdev 0 peer 00:03:7f:48:da:c9: -70
Fri Feb 13 14:20:36 2015 kern.err kernel: [68521.930000] wlan0: failed
to set key (1, ff:ff:ff:ff:ff:ff) to hardware (-70)
Fri Feb 13 14:20:36 2015 kern.warn kernel: [68521.940000] ath10k_pci
0000:01:00.0: failed to install key for vdev 0 peer bc:c6:db:14:83:cc: -70
Fri Feb 13 14:20:36 2015 kern.err kernel: [68521.940000] wlan0: failed
to remove key (0, bc:c6:db:14:83:cc) from hardware (-70)
Fri Feb 13 14:20:36 2015 kern.warn kernel: [68521.950000] ath10k_pci
0000:01:00.0: failed to delete peer bc:c6:db:14:83:cc for vdev 0: -70
Fri Feb 13 14:20:36 2015 daemon.info hostapd: wlan0: STA
bc:c6:db:14:83:cc IEEE 802.11: deauthenticated due to local deauth request
Fri Feb 13 14:20:36 2015 kern.warn kernel: [68521.960000] ath10k_pci
0000:01:00.0: failed to set beacon mode for vdev 0: -70
Fri Feb 13 14:20:36 2015 kern.warn kernel: [68521.970000] ath10k_pci
0000:01:00.0: failed to recalculate rts/cts prot for vdev 0: -70
Fri Feb 13 14:20:38 2015 kern.warn kernel: [68523.820000] ath10k_warn:
17 callbacks suppressed
Fri Feb 13 14:20:38 2015 kern.warn kernel: [68523.830000] ath10k_pci
0000:01:00.0: SWBA overrun on vdev 0
Fri Feb 13 14:20:38 2015 kern.warn kernel: [68523.920000] ath10k_pci
0000:01:00.0: SWBA overrun on vdev 0
Fri Feb 13 14:20:38 2015 kern.warn kernel: [68524.030000] ath10k_pci
0000:01:00.0: SWBA overrun on vdev 0
...

this goes on and on, and the wireless network is not even visible
anymore until I tear down wlan0 and put it back up:

...
Fri Feb 13 16:32:17 2015 kern.warn kernel: [76422.940000] ath10k_pci
0000:01:00.0: SWBA overrun on vdev 0
Fri Feb 13 16:32:22 2015 kern.warn kernel: [76427.240000] ath10k_warn:
45 callbacks suppressed
Fri Feb 13 16:32:22 2015 kern.warn kernel: [76427.240000] ath10k_pci
0000:01:00.0: SWBA overrun on vdev 0
Fri Feb 13 16:32:22 2015 kern.warn kernel: [76427.300000] ath10k_pci
0000:01:00.0: Spurious quick kickout for STA bc:c6:db:14:83:cc
Fri Feb 13 16:32:22 2015 kern.warn kernel: [76427.340000] ath10k_pci
0000:01:00.0: SWBA overrun on vdev 0
Fri Feb 13 16:32:22 2015 kern.warn kernel: [76427.440000] ath10k_pci
0000:01:00.0: SWBA overrun on vdev 0
Fri Feb 13 16:32:22 2015 kern.warn kernel: [76427.550000] ath10k_pci
0000:01:00.0: SWBA overrun on vdev 0
Fri Feb 13 16:32:22 2015 kern.warn kernel: [76427.650000] ath10k_pci
0000:01:00.0: SWBA overrun on vdev 0
Fri Feb 13 16:32:22 2015 kern.warn kernel: [76427.750000] ath10k_pci
0000:01:00.0: SWBA overrun on vdev 0
Fri Feb 13 16:32:22 2015 kern.warn kernel: [76427.850000] ath10k_pci
0000:01:00.0: SWBA overrun on vdev 0
Fri Feb 13 16:32:22 2015 kern.warn kernel: [76427.960000] ath10k_pci
0000:01:00.0: SWBA overrun on vdev 0
Fri Feb 13 16:32:22 2015 kern.warn kernel: [76428.060000] ath10k_pci
0000:01:00.0: SWBA overrun on vdev 0
Fri Feb 13 16:32:27 2015 kern.info kernel: [76432.210000] device wlan0
left promiscuous mode
Fri Feb 13 16:32:27 2015 kern.info kernel: [76432.220000] br-lan: port
2(wlan0) entered disabled state
Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.230000] ------------[
cut here ]------------
Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.230000] WARNING: CPU:
0 PID: 2550 at
/home/matti/Projects/openwrt/trunk/openwrt/build_dir/target-mips_34kc_uClibc-0.9.33.2/linux-ar71xx_generic/compat-wir
eless-2014-11-04/net/mac80211/key.c:642 ieee80211_free_keys+0x118/0x1b8
[mac80211]()
Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.250000] Modules linked
in: ifb pppoe ppp_async iptable_nat ath9k pppox ppp_generic nf_nat_ipv4
nf_conntrack_ipv6 nf_conntrack_ipv4 ipt_MASQUERADE ath9k_co
mmon xt_time xt_tcpudp xt_tcpmss xt_string xt_statistic xt_state
xt_recent xt_nat xt_multiport xt_mark xt_mac xt_limit xt_length xt_id
xt_hl xt_helper xt_ecn xt_dscp xt_conntrack xt_connmark xt_connlimit
xt_connbytes xt_comment xt_TCPMSS xt_REDIRECT xt_LOG xt_HL xt_DSCP xt_CT
xt_CLASSIFY ts_kmp ts_fsm ts_bFri Feb 13 16:32:27 2015 kern.warn kernel:
[76432.350000] CPU: 0 PID: 2550 Comm: hostapd Tainted: G 
      W    3.14.30 #1
Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.360000] Stack :
00000006 00000000 00000000
Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.370000] ath10k_warn:
49 callbacks suppressed
Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.370000] ath10k_pci
0000:01:00.0: No VIF found for vdev 0
Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.380000] ath10k_pci
0000:01:00.0: no vif for vdev_id 0 found
Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.380000]  00000000
00000000 00000000 803bc98e 00000040
Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.380000]       85fd7ba0
866bb8a4 802fcad8 80347623 000009f6 803b3b5c 85fd7ba0 866bb8a4
Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.380000]       872cfe2c
00000000 00000001 8029e880 00000003 80200398 00000282 866bb8a4
Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.380000]       802ffc9c
85f9dc44 00000000 00000000 00000000 00000000 00000000 00000000
Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.380000]       00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000
Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.380000]       ...
Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.420000] Call Trace:
Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.420000] [<80242644>]
show_stack+0x48/0x70
Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.420000] [<802aee18>]
warn_slowpath_common+0x84/0xb4
Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.430000] [<802aeed0>]
warn_slowpath_null+0x18/0x24
Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.430000] [<872a69e4>]
ieee80211_free_keys+0x118/0x1b8 [mac80211]
Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.440000] [<872921c4>]
ieee80211_del_virtual_monitor+0x67c/0x8fc [mac80211]
Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.450000] [<80067a04>]
_cond_resched+0x2c/0x3c
Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.450000]
Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.450000] ---[ end trace
830537c24c03b27d ]---
Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.460000] ath10k_pci
0000:01:00.0: failed to remove peer for AP vdev 0: -70
Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.460000] ath10k_pci
0000:01:00.0: No VIF found for vdev 0
Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.470000] ath10k_pci
0000:01:00.0: no vif for vdev_id 0 found
Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.480000] ath10k_pci
0000:01:00.0: failed to delete WMI vdev 0: -70
Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.480000] ath10k_pci
0000:01:00.0: removing stale peer bc:c6:db:14:83:cc from vdev_id 0
Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.490000] ath10k_pci
0000:01:00.0: removing stale peer 00:03:7f:48:da:c9 from vdev_id 0
Fri Feb 13 16:32:27 2015 kern.warn kernel: [76432.500000] ath10k_pci
0000:01:00.0: could not suspend target (-70)
Fri Feb 13 16:32:27 2015 daemon.notice netifd: Network device 'wlan0'
link is down
Fri Feb 13 16:32:55 2015 kern.info kernel: [76460.490000] IPv6:
ADDRCONF(NETDEV_UP): wlan0: link is not ready
Fri Feb 13 16:32:55 2015 kern.info kernel: [76460.490000] device wlan0
entered promiscuous mode
Fri Feb 13 16:32:58 2015 kern.info kernel: [76463.170000] br-lan: port
2(wlan0) entered forwarding state
Fri Feb 13 16:32:58 2015 kern.info kernel: [76463.180000] br-lan: port
2(wlan0) entered forwarding state
Fri Feb 13 16:32:58 2015 kern.info kernel: [76463.180000] IPv6:
ADDRCONF(NETDEV_CHANGE): wlan0: link becomes ready
Fri Feb 13 16:32:58 2015 daemon.notice netifd: Network device 'wlan0'
link is up
Fri Feb 13 16:32:59 2015 kern.info kernel: [76465.180000] br-lan: port
2(wlan0) entered forwarding state

Matti
diff mbox

Patch

diff --git a/drivers/net/wireless/ath/ath10k/wmi.c b/drivers/net/wireless/ath/ath10k/wmi.c
index aeea1c7..776b257 100644
--- a/drivers/net/wireless/ath/ath10k/wmi.c
+++ b/drivers/net/wireless/ath/ath10k/wmi.c
@@ -1045,9 +1045,15 @@  int ath10k_wmi_cmd_send(struct ath10k *ar, struct sk_buff *skb, u32 cmd_id)
 		(ret != -EAGAIN);
 	}), 3*HZ);
 
-	if (ret)
+	if (ret) {
 		dev_kfree_skb_any(skb);
 
+		if (ret == -EAGAIN) {
+			ath10k_warn(ar, "firmware unresponsive, restarting..\n");
+			queue_work(ar->workqueue, &ar->restart_work);
+		}
+	}
+
 	return ret;
 }