diff mbox

mac80211: debugfs var for the default aggregation timeout.

Message ID 1455658091-28262-2-git-send-email-apenwarr@gmail.com (mailing list archive)
State Changes Requested
Delegated to: Johannes Berg
Headers show

Commit Message

Avery Pennarun Feb. 16, 2016, 9:28 p.m. UTC
Since around the beginning of time, ath9k aggregates have timed out after
5000 TU (around 5000ms) of inactivity, but nobody seems to be quite sure
why, and this magic number seems to have migrated around from one place to
another.  An openbsd mailing list recently had a patch to disable the
timeout completely, which they say matches some commercial routers:
https://www.mail-archive.com/tech@openbsd.org/msg29456.html

Even in Linux, several non-ath9k drivers default to no timeout already.  I
think changing it directly to zero would be safe, but to allow a more
structured investigation, let's make it configurable for now.

Signed-off-by: Avery Pennarun <apenwarr@gmail.com>
---
 net/mac80211/debugfs_netdev.c      | 4 ++++
 net/mac80211/rc80211_minstrel_ht.c | 4 +++-
 2 files changed, 7 insertions(+), 1 deletion(-)

Comments

Johannes Berg Feb. 16, 2016, 9:44 p.m. UTC | #1
On Tue, 2016-02-16 at 16:28 -0500, Avery Pennarun wrote:
> Since around the beginning of time, ath9k aggregates have timed out
> after
> 5000 TU (around 5000ms) of inactivity, but nobody seems to be quite
> sure
> why, and this magic number seems to have migrated around from one
> place to
> another.  An openbsd mailing list recently had a patch to disable the
> timeout completely, which they say matches some commercial routers:
> https://www.mail-archive.com/tech@openbsd.org/msg29456.html
> 
> Even in Linux, several non-ath9k drivers default to no timeout
> already.  I
> think changing it directly to zero would be safe, but to allow a more
> structured investigation, let's make it configurable for now.

I believe the original timeout came from some ancient Intel code.
Nobody gave it any thought at the time, and it was merged and preserved
to this day.

The reason for Intel specifically is that every aggregation session
(both RX and TX), even idle ones, takes up hardware resources, so
there's a limited need to drop completely idle sessions.

As a result, I believe that changing the default for every driver other
than iwlegacy, iwlwifi/dvm and iwlwifi/mvm would be perfectly fine.

johannes
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sujith Manoharan Feb. 17, 2016, 2:05 a.m. UTC | #2
Avery Pennarun wrote:
> Since around the beginning of time, ath9k aggregates have timed out after
> 5000 TU (around 5000ms) of inactivity, but nobody seems to be quite sure
> why, and this magic number seems to have migrated around from one place to
> another.  An openbsd mailing list recently had a patch to disable the
> timeout completely, which they say matches some commercial routers:
> https://www.mail-archive.com/tech@openbsd.org/msg29456.html

We actually did this for ath9k:
https://git.kernel.org/cgit/linux/kernel/git/wireless/wireless-testing.git/commit/?id=bd2ce6e43f65127bc723e7fcc044758cf8113260

Almost all APs set the timeout to zero, only mac80211 differs
and uses 5000 TUs. I guess ath9k lost this fix when the old
RC was removed and switched to minstrel.

Sujith
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Johannes Berg Feb. 23, 2016, 10:14 a.m. UTC | #3
On Tue, 2016-02-16 at 16:28 -0500, Avery Pennarun wrote:
> Since around the beginning of time, ath9k aggregates have timed out
> after
> 5000 TU (around 5000ms) of inactivity, but nobody seems to be quite
> sure
> why, and this magic number seems to have migrated around from one
> place to
> another.  An openbsd mailing list recently had a patch to disable the
> timeout completely, which they say matches some commercial routers:
> https://www.mail-archive.com/tech@openbsd.org/msg29456.html
> 
> Even in Linux, several non-ath9k drivers default to no timeout
> already.  I
> think changing it directly to zero would be safe, but to allow a more
> structured investigation, let's make it configurable for now.
> 
Since we just made it zero, perhaps we don't need this?

Although perhaps we still want it to be able to debug it?

Anyway - you shouldn't create a debugfs file and play with the extern
stuff etc., let minstrel create the debugfs file in minstrel_ht_alloc()

johannes
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Avery Pennarun Feb. 23, 2016, 6:43 p.m. UTC | #4
On Tue, Feb 23, 2016 at 5:14 AM, Johannes Berg
<johannes@sipsolutions.net> wrote:
> On Tue, 2016-02-16 at 16:28 -0500, Avery Pennarun wrote:
>> Since around the beginning of time, ath9k aggregates have timed out
>> after
>> 5000 TU (around 5000ms) of inactivity, but nobody seems to be quite
>> sure
>> why, and this magic number seems to have migrated around from one
>> place to
>> another.  An openbsd mailing list recently had a patch to disable the
>> timeout completely, which they say matches some commercial routers:
>> https://www.mail-archive.com/tech@openbsd.org/msg29456.html
>>
>> Even in Linux, several non-ath9k drivers default to no timeout
>> already.  I
>> think changing it directly to zero would be safe, but to allow a more
>> structured investigation, let's make it configurable for now.
>>
> Since we just made it zero, perhaps we don't need this?
>
> Although perhaps we still want it to be able to debug it?

We're putting my version of the patch into our devices in order to be
able to try different values and see how it changes the percentage of
devices with nonzero 'pending' field in agg_status.  I'm hoping using
zero here will result in total elimination of the pending problem, but
we'll see.

It probably makes sense not to apply this upstream if the default
value is zero now anyway.

> Anyway - you shouldn't create a debugfs file and play with the extern
> stuff etc., let minstrel create the debugfs file in minstrel_ht_alloc()

Good point.  I had a feeling I was doing that in the wrong place :)

If people think this is important, I can respin the patch, otherwise
feel free to discard.

Have fun,

Avery
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Johannes Berg Feb. 23, 2016, 8:05 p.m. UTC | #5
On Tue, 2016-02-23 at 13:43 -0500, Avery Pennarun wrote:

> We're putting my version of the patch into our devices in order to be
> able to try different values and see how it changes the percentage of
> devices with nonzero 'pending' field in agg_status.  I'm hoping using
> zero here will result in total elimination of the pending problem,
> but we'll see.

:)
I for one would be interested in the result. And, if you find mac80211
is at fault, knowing what happens there.

> If people think this is important, I can respin the patch, otherwise
> feel free to discard.
> 

Up to you. I've discarded it for now with "changes requested", if you
never resend I'll not worry about it. Clearly we haven't had a need for
it so far.

johannes
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Avery Pennarun April 5, 2016, 11:46 p.m. UTC | #6
On Tue, Feb 23, 2016 at 3:05 PM, Johannes Berg
<johannes@sipsolutions.net> wrote:
> On Tue, 2016-02-23 at 13:43 -0500, Avery Pennarun wrote:
>> We're putting my version of the patch into our devices in order to be
>> able to try different values and see how it changes the percentage of
>> devices with nonzero 'pending' field in agg_status.  I'm hoping using
>> zero here will result in total elimination of the pending problem,
>> but we'll see.
>
> :)
> I for one would be interested in the result. And, if you find mac80211
> is at fault, knowing what happens there.

Here's the promised update!  The news is not as good as I had hoped.

Across the GFiber fleet, number of APs per day observing the problem
(ie. the pending field > 0 for more than a minute for any station),
with the original aggregation timeout, is about 41% (yikes).  With the
aggregation timeout set to zero, the number of APs observing the
problem in a day drops to about 10%.

Obviously this is a huge improvement, but the problem isn't completely
eliminated.  In retrospect that's not totally surprising, as there are
reasons other than an AP-side aggregation timeout that an aggregation
would need to be negotiated, and a race condition in aggregation queue
setup could happen at any of those times.  I was just hoping that
those other cases would be much less frequent than they apparently
are.

This test was with backports-20150525 on ath9k.  (We have newer
versions in the queue, but they haven't rolled out to our customers
yet.  Anyway, earlier in this thread, I was able to trigger the race
condition on much newer backports.  Unfortunately the current fix
makes my reproducible test case go away, but I don't know any reason
to assume the race condition is fixed.)

While we're here, unfortunately it turns out that just observing the
agg_status file can cause crashes (though not very often... except for
a few unlucky customers), probably due to a different race condition.
Any suggestions about this one?  Stack trace attached below.  (I think
the stack trace suggests a mac80211 problem?)

Thanks!

Avery


03/30,133400.674 Unable to handle kernel paging request at virtual
address 5b35da9e
03/30,133400.675 pgd = ac238000
03/30,133400.675 [5b35da9e] *pgd=00000000
03/30,133400.675 Internal error: Oops: 5 [#1] PREEMPT SMP
03/30,133400.680 Modules linked in: ccm nf_conntrack_netlink
auto_bridge(O) fci(O) nfnetlink pktgen ath9k_htc(O) mwifiex_usb(O)
mwifiex(O) ath10k_pci(O) ath10k_core(O) arc4 ath9k(O) mac80211(O)
ath9k_common(O) ath9k_hw(O) ath(O) cfg80211(O) compat(O) bmoca(O)
xt_connmark ip6table_mangle xt_CLASSIFY iptable_mangle xt_helper
nf_nat_sip nf_conntrack_sip ip6t_REJECT ip6t_LOG nf_conntrack_ipv6
nf_defrag_ipv6 ip6table_filter ip6_tables nf_nat_rtsp
nf_conntrack_rtsp nf_nat_h323 nf_conntrack_h323 nf_nat_irc
nf_conntrack_irc nf_nat_pptp nf_conntrack_pptp nf_conntrack_proto_gre
nf_nat_proto_gre nf_nat_tftp nf_conntrack_tftp nf_nat_ftp
nf_conntrack_ftp ipt_MASQUERADE ipt_REJECT ipt_LOG xt_limit xt_pkttype
xt_conntrack xt_tcpudp iptable_nat nf_nat nf_conntrack_ipv4
nf_conntrack nf_defrag_ipv4 iptable_filter ip_tables x_tables pfe(O)
03/30,133400.753 CPU: 0    Tainted: G           O  (3.2.26 #1)
03/30,133400.758 PC is at sta_agg_status_read+0xeb/0x170 [mac80211]
03/30,133400.764 LR is at sta_agg_status_read+0xd8/0x170 [mac80211]
03/30,133400.770 pc : [<838b4d0c>]    lr : [<838b4cf9>]    psr: 20010033
03/30,133400.770 sp : ac0c3c58  ip : 0000000f  fp : ac0c3c71
03/30,133400.782 r10: ac341800  r9 : af7f3b53  r8 : 00000001
03/30,133400.787 r7 : 00000007  r6 : 5b35da40  r5 : ac0c3f38  r4 : ac0c3d90
03/30,133400.794 r3 : ac0c3d8d  r2 : 838c6958  r1 : 000001a8  r0 : ac0c3d90
03/30,133400.800 Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA Thumb
 Segment user
03/30,133400.807 Control: 50c53c7d  Table: 2c23804a  DAC: 00000015
03/30,133400.813 Process psstat (pid: 25220, stack limit = 0xac0c22f0)
03/30,133400.819 Stack: (0xac0c3c58 to 0xac0c4000)
03/30,133400.824 3c40:
      00000209 a6199050
03/30,133400.832 3c60: ac0c3d58 7e957143 00000001 ac0c3f88 78656e00
69642074 676f6c61 6b6f745f
03/30,133400.840 3c80: 203a6e65 0a317830 09444954 09585209 4e4b5444
4e535309 58540909 4b544409
03/30,133400.848 3ca0: 6570094e 6e69646e 30300a67 09300909 30307830
30783009 09093030 78300930
03/30,133400.857 3cc0: 30093030 300a3030 30090931 30783009 78300930
09303030 30093009 09303078
03/30,133400.865 3ce0: 0a303030 09093230 78300930 30093030 30303078
09300909 30307830 30303009
03/30,133400.873 3d00: 0933300a 30093009 09303078 30307830 30090930
30783009 30300930 34300a30
03/30,133400.881 3d20: 09300909 30307830 30783009 09093030 78300930
30093030 300a3030 30090935
03/30,133400.889 3d40: 30783009 78300930 09303030 30093009 09303078
0a303030 09093630 78300931
03/30,133400.898 3d60: 30096632 32323678 31090966 38783009 32310933
30343230 35383333 0937300a
03/30,133400.906 3d80: 30093109 09303578 31307830 31090961 38300a00
09300909 30307830 30783009
03/30,133400.914 3da0: 09093030 78300930 30093030 300a3030 30090939
30783009 78300930 09303030
03/30,133400.922 3dc0: 30093009 09303078 0a303030 09093031 78300930
30093030 30303078 09300909
03/30,133400.930 3de0: 30307830 30303009 0931310a 30093009 09303078
30307830 30090930 30783009
03/30,133400.939 3e00: 30300930 32310a30 09300909 30307830 30783009
09093030 78300930 30093030
03/30,133400.947 3e20: 310a3030 30090933 30783009 78300930 09303030
30093009 09303078 0a303030
03/30,133400.955 3e40: 09093431 78300930 30093030 30303078 09300909
30307830 30303009 0935310a
03/30,133400.963 3e60: 30093009 09303078 30307830 30090930 30783009
30300930 00000a30 bfa440c0
03/30,133400.971 3e80: 842caf64 842cadb4 ac0c3e90 8401ea65 00000000
c55f8337 c55f8337 000015e4
03/30,133400.980 3ea0: bb3f54b8 ac0c3eb8 c55f8337 84021ad3 bf82f060
00000001 80000000 00000000
03/30,133400.988 3ec0: 84008b15 00000000 00000002 84040045 ffffffff
00000000 00000002 84470aac
03/30,133400.996 3ee0: ac0c3f18 bb05f780 00000000 840401b5 00000000
84431160 ac0c2000 bb3f5480
03/30,133401.004 3f00: ac0c3f18 842c792d 84cb6160 00000005 ac0c3f18
00000001 bd3e9668 00000001
03/30,133401.012 3f20: 842caf64 842cadb4 8400caf5 00000000 00000000
00000000 8443e8b8 bd3e9660
03/30,133401.021 3f40: 838b4c21 7e957143 ac0c3f88 7e957143 ac0c2000
00000000 0002802c 840993bb
03/30,133401.029 3f60: 00000030 00000001 bd3e9660 00000001 000001fa
00000000 7e957143 84099637
03/30,133401.037 3f80: ac0c2000 00000000 000001fa 00000000 00000001
0049fc94 0049fcca 00000003
03/30,133401.045 3fa0: 8400cc44 8400ca81 00000001 0049fc94 00000000
7e957143 00000001 00000030
03/30,133401.054 3fc0: 00000001 0049fc94 0049fcca 00000003 ffffffff
00028028 00000000 0002802c
03/30,133401.062 3fe0: 00000000 7e957134 00014428 2ac417cc 60010010
00000000 00000000 00000000
03/30,133401.070 [<838b4d0c>] (sta_agg_status_read+0xeb/0x170
[mac80211]) from [<840993bb>] (vfs_read+0x5f/0xcc)
03/30,133401.080 [<840993bb>] (vfs_read+0x5f/0xcc) from [<84099637>]
(sys_read+0x27/0x48)
03/30,133401.088 [<84099637>] (sys_read+0x27/0x48) from [<8400ca81>]
(ret_fast_syscall+0x1/0x46)
03/30,133401.096 Code: f1b8 0f00 d036 4620 (f896) 305e
03/30,133401.104 ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0,
skipped old beacon
03/30,133401.106 waveguide: ZONEED: APs=63
peer-APs=LOFHAL(-10),QIYQAW(-26) stations=
03/30,133401.106 waveguide: Connected station VVMKUW taxonomy:
BCM4339;iPhone 6/6+;802.11ac n:1,w:80
03/30,133401.106 waveguide: Connected station FYLWIQ taxonomy:
SHA:f0297d6b773948dcc4c86451a0207ba7d9e97e1cc864b3031001ae2105faa872;Unknown;802.11n
n:2,w:40
03/30,133401.143 ---[ end trace e62670ec7c09380f ]---
03/30,133401.148 Kernel panic - not syncing: Fatal exception
03/30,133401.153 [<840111e1>] (unwind_backtrace+0x1/0x8c) from
[<842c61fd>] (panic+0x5d/0x134)
03/30,133401.162 [<842c61fd>] (panic+0x5d/0x134) from [<8400f60b>]
(die+0x203/0x224)
03/30,133401.169 [<8400f60b>] (die+0x203/0x224) from [<842c5897>]
(__do_kernel_fault.part.5+0x4f/0x5c)
03/30,133401.178 [<842c5897>] (__do_kernel_fault.part.5+0x4f/0x5c)
from [<84013d23>] (do_page_fault+0x20b/0x268)
03/30,133401.188 [<84013d23>] (do_page_fault+0x20b/0x268) from
[<84008293>] (do_DataAbort+0x2f/0x70)
03/30,133401.197 [<84008293>] (do_DataAbort+0x2f/0x70) from
[<8400c4f5>] (__dabt_svc+0x35/0x60)
03/30,133401.206 Exception stack(0xac0c3c10 to 0xac0c3c58)
03/30,133401.211 3c00:                                     ac0c3d90
000001a8 838c6958 ac0c3d8d
03/30,133401.220 3c20: ac0c3d90 ac0c3f38 5b35da40 00000007 00000001
af7f3b53 ac341800 ac0c3c71
03/30,133401.228 3c40: 0000000f ac0c3c58 838b4cf9 838b4d0c 20010033 ffffffff
03/30,133401.235 [<8400c4f5>] (__dabt_svc+0x35/0x60) from [<838b4d0c>]
(sta_agg_status_read+0xeb/0x170 [mac80211])
03/30,133401.245 [<838b4d0c>] (sta_agg_status_read+0xeb/0x170
[mac80211]) from [<840993bb>] (vfs_read+0x5f/0xcc)
03/30,133401.255 [<840993bb>] (vfs_read+0x5f/0xcc) from [<84099637>]
(sys_read+0x27/0x48)
03/30,133401.263 [<84099637>] (sys_read+0x27/0x48) from [<8400ca81>]
(ret_fast_syscall+0x1/0x46)
03/30,133401.272 CPU1: stopping
03/30,133401.275 [<840111e1>] (unwind_backtrace+0x1/0x8c) from
[<8401088d>] (handle_IPI+0xcd/0x104)
03/30,133401.283 [<8401088d>] (handle_IPI+0xcd/0x104) from
[<8400c7c1>] (__irq_usr+0x41/0xa0)
03/30,133401.292 Rebooting in 3 seconds..
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Johannes Berg April 6, 2016, 7:40 a.m. UTC | #7
On Tue, 2016-04-05 at 19:46 -0400, Avery Pennarun wrote:

> This test was with backports-20150525 on ath9k.  (We have newer
> versions in the queue, but they haven't rolled out to our customers
> yet.  Anyway, earlier in this thread, I was able to trigger the race
> condition on much newer backports.  Unfortunately the current fix
> makes my reproducible test case go away, but I don't know any reason
> to assume the race condition is fixed.)

Well, we know that the timeout is likely unrelated to the issue (other
than not triggering the broken code path that frequently), so you can
revert the timeout change for the test case.

> While we're here, unfortunately it turns out that just observing the
> agg_status file can cause crashes (though not very often... except
> for a few unlucky customers), probably due to a different race
> condition.
> Any suggestions about this one?  Stack trace attached below.  (I
> think the stack trace suggests a mac80211 problem?)
> 

That has to be a mac80211 problem, yeah.
(Side note: I'm a bit surprised this is a 32-bit system?)

Looks like we use RCU protection to get the data. Can I get the
mac80211.ko binary (with debug data) corresponding to the crash below?

johannes
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Avery Pennarun April 8, 2016, 1:32 a.m. UTC | #8
On Wed, Apr 6, 2016 at 3:40 AM, Johannes Berg <johannes@sipsolutions.net> wrote:
> On Tue, 2016-04-05 at 19:46 -0400, Avery Pennarun wrote:
>
>> This test was with backports-20150525 on ath9k.  (We have newer
>> versions in the queue, but they haven't rolled out to our customers
>> yet.  Anyway, earlier in this thread, I was able to trigger the race
>> condition on much newer backports.  Unfortunately the current fix
>> makes my reproducible test case go away, but I don't know any reason
>> to assume the race condition is fixed.)
>
> Well, we know that the timeout is likely unrelated to the issue (other
> than not triggering the broken code path that frequently), so you can
> revert the timeout change for the test case.

Yes.  And I can make it happen more often by making it timeout the
aggregation agreement much more frequently than usual.

>> While we're here, unfortunately it turns out that just observing the
>> agg_status file can cause crashes (though not very often... except
>> for a few unlucky customers), probably due to a different race
>> condition.
>> Any suggestions about this one?  Stack trace attached below.  (I
>> think the stack trace suggests a mac80211 problem?)
>
> That has to be a mac80211 problem, yeah.
> (Side note: I'm a bit surprised this is a 32-bit system?)

We're going for all of good, fast, and cheap here.  That should end well :)

> Looks like we use RCU protection to get the data. Can I get the
> mac80211.ko binary (with debug data) corresponding to the crash below?

Yes.  Here it is:
http://apenwarr.ca/tmp/mac80211-agg-status-crash.ko

Thanks for your help!
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Johannes Berg April 8, 2016, 6:56 a.m. UTC | #9
On Thu, 2016-04-07 at 21:32 -0400, Avery Pennarun wrote:

> Yes.  Here it is:
> http://apenwarr.ca/tmp/mac80211-agg-status-crash.ko
> 

Unfortunately there are no debug symbols in this file, so it doesn't
help me much. I can't even seem to get objdump to disassemble it
correctly: looks like the file is in thumb, going from things
like R_ARM_THM_CALL relocations, but even -Mforce-thumb doesn't seem to
DRT; sta_agg_status_read+0xeb isn't even a valid instruction offset in
regular ARM mode.

johannes
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Johannes Berg April 8, 2016, 7:01 a.m. UTC | #10
On Fri, 2016-04-08 at 08:56 +0200, Johannes Berg wrote:
> On Thu, 2016-04-07 at 21:32 -0400, Avery Pennarun wrote:
> 
> > 
> > Yes.  Here it is:
> > http://apenwarr.ca/tmp/mac80211-agg-status-crash.ko
> > 
> Unfortunately there are no debug symbols in this file, so it doesn't
> help me much. I can't even seem to get objdump to disassemble it
> correctly: looks like the file is in thumb, going from things
> like R_ARM_THM_CALL relocations, but even -Mforce-thumb doesn't seem
> to DRT; sta_agg_status_read+0xeb isn't even a valid instruction
> offset in regular ARM mode.
> 

It *seems* that it most likely crashes on the first access to tid_tx,
which is consistent with the story of disabling TX aggregation timeouts
reducing the chances.

So I guess we have to look for some TX aggregation teardown RCU pointer
problem?

johannes
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Johannes Berg April 8, 2016, 7:15 a.m. UTC | #11
On Fri, 2016-04-08 at 09:01 +0200, Johannes Berg wrote:
> On Fri, 2016-04-08 at 08:56 +0200, Johannes Berg wrote:
> > 
> > On Thu, 2016-04-07 at 21:32 -0400, Avery Pennarun wrote:
> > 
> > > 
> > > 
> > > Yes.  Here it is:
> > > http://apenwarr.ca/tmp/mac80211-agg-status-crash.ko
> > > 
> > Unfortunately there are no debug symbols in this file, so it
> > doesn't
> > help me much. I can't even seem to get objdump to disassemble it
> > correctly: looks like the file is in thumb, going from things
> > like R_ARM_THM_CALL relocations, but even -Mforce-thumb doesn't
> > seem
> > to DRT; sta_agg_status_read+0xeb isn't even a valid instruction
> > offset in regular ARM mode.
> > 
> It *seems* that it most likely crashes on the first access to tid_tx,
> which is consistent with the story of disabling TX aggregation
> timeouts
> reducing the chances.
> 
> So I guess we have to look for some TX aggregation teardown RCU
> pointer problem?
> 

Can't find anything. The only other thing I saw now is that the TID
appears to be 7 (in r7), might be worth looking for whether that's a
common thing or not?

johannes
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Avery Pennarun April 8, 2016, 8:31 a.m. UTC | #12
On Fri, Apr 8, 2016 at 3:15 AM, Johannes Berg <johannes@sipsolutions.net> wrote:
> On Fri, 2016-04-08 at 09:01 +0200, Johannes Berg wrote:
>> On Fri, 2016-04-08 at 08:56 +0200, Johannes Berg wrote:
>> > On Thu, 2016-04-07 at 21:32 -0400, Avery Pennarun wrote:
>> > > Yes.  Here it is:
>> > > http://apenwarr.ca/tmp/mac80211-agg-status-crash.ko
>> > >
>> > Unfortunately there are no debug symbols in this file, so it
>> > doesn't
>> > help me much. I can't even seem to get objdump to disassemble it
>> > correctly: looks like the file is in thumb, going from things
>> > like R_ARM_THM_CALL relocations, but even -Mforce-thumb doesn't
>> > seem
>> > to DRT; sta_agg_status_read+0xeb isn't even a valid instruction
>> > offset in regular ARM mode.
>> >
>> It *seems* that it most likely crashes on the first access to tid_tx,
>> which is consistent with the story of disabling TX aggregation
>> timeouts
>> reducing the chances.
>>
>> So I guess we have to look for some TX aggregation teardown RCU
>> pointer problem?
>
> Can't find anything. The only other thing I saw now is that the TID
> appears to be 7 (in r7), might be worth looking for whether that's a
> common thing or not?

Just to be clear, this crash is only from *reading* the agg_status
files.  I don't know if the crashiness reduces when disabling the
aggregation timeouts, since that's a separate bug (in which the queue
gets stuck and the 'pending' column of this file just keeps
increasing).

I'll try twiddling some options again tomorrow and see if I can get
one with proper debug symbols.  For what it's worth, this platform is
"ARMv7 Processor rev 1 (v7l)" and the gcc build is made for Cortex A9.
You can find an x86 build of our toolchain in the git repo at
https://gfiber.googlesource.com/toolchains/mindspeed.

Thanks for looking into it :)

Avery
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Avery Pennarun April 9, 2016, 1:27 a.m. UTC | #13
On Fri, Apr 8, 2016 at 4:31 AM, Avery Pennarun <apenwarr@gmail.com> wrote:
> On Fri, Apr 8, 2016 at 3:15 AM, Johannes Berg <johannes@sipsolutions.net> wrote:
>> On Fri, 2016-04-08 at 09:01 +0200, Johannes Berg wrote:
>>> On Fri, 2016-04-08 at 08:56 +0200, Johannes Berg wrote:
>>> > On Thu, 2016-04-07 at 21:32 -0400, Avery Pennarun wrote:
>>> > > Yes.  Here it is:
>>> > > http://apenwarr.ca/tmp/mac80211-agg-status-crash.ko
>>> > >
>>> > Unfortunately there are no debug symbols in this file, so it
>>> > doesn't
>>> > help me much. I can't even seem to get objdump to disassemble it
>>> > correctly: looks like the file is in thumb, going from things
>>> > like R_ARM_THM_CALL relocations, but even -Mforce-thumb doesn't
>>> > seem
>>> > to DRT; sta_agg_status_read+0xeb isn't even a valid instruction
>>> > offset in regular ARM mode.
>>> >
>>> It *seems* that it most likely crashes on the first access to tid_tx,
>>> which is consistent with the story of disabling TX aggregation
>>> timeouts
>>> reducing the chances.
>>>
>>> So I guess we have to look for some TX aggregation teardown RCU
>>> pointer problem?
>>
>> Can't find anything. The only other thing I saw now is that the TID
>> appears to be 7 (in r7), might be worth looking for whether that's a
>> common thing or not?
>
> Just to be clear, this crash is only from *reading* the agg_status
> files.  I don't know if the crashiness reduces when disabling the
> aggregation timeouts, since that's a separate bug (in which the queue
> gets stuck and the 'pending' column of this file just keeps
> increasing).
>
> I'll try twiddling some options again tomorrow and see if I can get
> one with proper debug symbols.  For what it's worth, this platform is
> "ARMv7 Processor rev 1 (v7l)" and the gcc build is made for Cortex A9.
> You can find an x86 build of our toolchain in the git repo at
> https://gfiber.googlesource.com/toolchains/mindspeed.

Updated .ko file that definitely has debug symbols this time:
http://apenwarr.ca/tmp/mac80211-agg-status-crash-debugsyms.ko
a gdb compiled for x86-64 that can definitely read the above .ko file:
http://apenwarr.ca/tmp/arm-gdb

Thanks!
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Johannes Berg April 9, 2016, 4:56 a.m. UTC | #14
On Fri, 2016-04-08 at 21:27 -0400, Avery Pennarun wrote:

> > Just to be clear, this crash is only from *reading* the agg_status
> > files.  I don't know if the crashiness reduces when disabling the
> > aggregation timeouts, since that's a separate bug (in which the
> > queue gets stuck and the 'pending' column of this file just keeps
> > increasing).

Oh, right, I was confusing the two. The reading one is even stranger
though, in a way. I have no explanation for it (yet). We could suspect
memory corruption, but why would it specifically hit issues here? Not
very plausible.

> Updated .ko file that definitely has debug symbols this time:
> http://apenwarr.ca/tmp/mac80211-agg-status-crash-debugsyms.ko
> 

Ok, that confirms what I did manually in my previous email - that it
crashed on this:

141			p += scnprintf(p, sizeof(buf) + buf - p, "\t%#.2x",
142					tid_tx ? tid_tx->dialog_token : 0);

(and by hand I'd already checked that it crashed dereferencing the
tid_tx->dialog_token, since tid_tx was the value 0x5b35da40.

If any people more familiar with ARM are reading this - does the value
0x5b35da40 ring a bell? Is that a userspace area? Or an area where the
stack would be? All other points around here seem to look like
0xac0c3c58, or maybe 0x838c6958, but not 0x5b35...., how could we end
up with that?

johannes
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Adrian Chadd April 10, 2016, 12:31 a.m. UTC | #15
On 8 April 2016 at 21:56, Johannes Berg <johannes@sipsolutions.net> wrote:
> On Fri, 2016-04-08 at 21:27 -0400, Avery Pennarun wrote:
>
>> > Just to be clear, this crash is only from *reading* the agg_status
>> > files.  I don't know if the crashiness reduces when disabling the
>> > aggregation timeouts, since that's a separate bug (in which the
>> > queue gets stuck and the 'pending' column of this file just keeps
>> > increasing).
>
> Oh, right, I was confusing the two. The reading one is even stranger
> though, in a way. I have no explanation for it (yet). We could suspect
> memory corruption, but why would it specifically hit issues here? Not
> very plausible.
>
>> Updated .ko file that definitely has debug symbols this time:
>> http://apenwarr.ca/tmp/mac80211-agg-status-crash-debugsyms.ko
>>
>
> Ok, that confirms what I did manually in my previous email - that it
> crashed on this:
>
> 141                     p += scnprintf(p, sizeof(buf) + buf - p, "\t%#.2x",
> 142                                     tid_tx ? tid_tx->dialog_token : 0);
>
> (and by hand I'd already checked that it crashed dereferencing the
> tid_tx->dialog_token, since tid_tx was the value 0x5b35da40.
>
> If any people more familiar with ARM are reading this - does the value
> 0x5b35da40 ring a bell? Is that a userspace area? Or an area where the
> stack would be? All other points around here seem to look like
> 0xac0c3c58, or maybe 0x838c6958, but not 0x5b35...., how could we end
> up with that?

.. that looks very userland-y to me. Is it just some pointer garbage perhaps?

Do you have a full crashdump? what's sta->ampdu_mlme look like?



-a
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
bruce m beach April 10, 2016, 1:59 a.m. UTC | #16
> If any people more familiar with ARM are reading this - does the value
> 0x5b35da40 ring a bell?

It could be anything. It depends on the chip implementation. For instance on an
exynos that is an undefined region. What device are we talking about?
Bruce
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Avery Pennarun April 19, 2016, 1:29 a.m. UTC | #17
On Sat, Apr 9, 2016 at 9:59 PM, bruce m beach <brucembeach@gmail.com> wrote:
>> If any people more familiar with ARM are reading this - does the value
>> 0x5b35da40 ring a bell?
>
> It could be anything. It depends on the chip implementation. For instance on an
> exynos that is an undefined region. What device are we talking about?

This is a mindspeed c2k CPU, in case that helps at all.  But I'm
guessing it really is just some pointer garbage.  The only way to
trigger the crash is to do (something) with an attached station at the
same moment as reading the agg_status file.  Some station types
trigger it more than others, but I'm not sure which.
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/mac80211/debugfs_netdev.c b/net/mac80211/debugfs_netdev.c
index 37ea30e..5ae160b 100644
--- a/net/mac80211/debugfs_netdev.c
+++ b/net/mac80211/debugfs_netdev.c
@@ -715,6 +715,8 @@  static void add_mesh_config(struct ieee80211_sub_if_data *sdata)
 }
 #endif
 
+u32 default_agg_timeout = 5000;
+
 static void add_files(struct ieee80211_sub_if_data *sdata)
 {
 	if (!sdata->vif.debugfs_dir)
@@ -725,6 +727,8 @@  static void add_files(struct ieee80211_sub_if_data *sdata)
 	DEBUGFS_ADD(txpower);
 	DEBUGFS_ADD(user_power_level);
 	DEBUGFS_ADD(ap_power_level);
+	debugfs_create_u32("default_agg_timeout", 0600, sdata->vif.debugfs_dir,
+		&default_agg_timeout);
 
 	if (sdata->vif.type != NL80211_IFTYPE_MONITOR)
 		add_common_files(sdata);
diff --git a/net/mac80211/rc80211_minstrel_ht.c b/net/mac80211/rc80211_minstrel_ht.c
index 3928dbd..028d9d4 100644
--- a/net/mac80211/rc80211_minstrel_ht.c
+++ b/net/mac80211/rc80211_minstrel_ht.c
@@ -671,6 +671,8 @@  minstrel_downgrade_rate(struct minstrel_ht_sta *mi, u16 *idx, bool primary)
 	}
 }
 
+extern u32 default_agg_timeout;
+
 static void
 minstrel_aggr_check(struct ieee80211_sta *pubsta, struct sk_buff *skb)
 {
@@ -691,7 +693,7 @@  minstrel_aggr_check(struct ieee80211_sta *pubsta, struct sk_buff *skb)
 	if (likely(sta->ampdu_mlme.tid_tx[tid]))
 		return;
 
-	ieee80211_start_tx_ba_session(pubsta, tid, 5000);
+	ieee80211_start_tx_ba_session(pubsta, tid, default_agg_timeout);
 }
 
 static void