
rt2800 tx frame dropping issue.

Message ID bc0bc5ea-dd10-a5a5-87ff-00a93bcf814c@pobox.com (mailing list archive)
State RFC
Delegated to: Kalle Valo
Series rt2800 tx frame dropping issue.

Commit Message

Daniel Santos Nov. 20, 2018, 9:20 p.m. UTC
Hello,

I'm coming to this issue rather late and I realize there has been much
ado about it.  I want to follow up on a thread from 3 months ago
https://marc.info/?l=linux-wireless&m=153511575812945

> I get testing results from T-Bone user in bugzilla:
> https://bugzilla.kernel.org/show_bug.cgi?id=82751
>
> And get this:
>
> [  781.644185] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
> [  781.662620] 2 d1 s1 p1 ms1
>
> Looks like rt2x00 correctly stops the queue in mac80211, but still gets skbs.
>
> So we can do 3 things: increase the queue size to 128, increase the
> threshold to 16, and make the message a debug one instead of an error
> (I checked some other drivers, and it looks like most of them silently
> drop the frame when the queue is full). Removing the message could be
> especially useful, since printk can somehow make the mt7620 router
> unresponsive for some time, resulting in connection drops.
>
> Thoughts ?
>
> Regards
> Stanislaw

I believe I have the answer as to why we're getting frames after we stop
the queue.  I had a little chat about this in #kernelnewbies and some
other devs believe it is intentional.

There is a race in ieee80211_tx_frags (net/mac80211/tx.c) between
releasing local->queue_stop_reason_lock and calling the driver's tx
until we lock queue->tx_lock in rt2x00queue_write_tx_frame -- in between
these two points the thread can be preempted.  So while we stop the
queue in one thread, there can be 20 other threads in between that have
already checked that the queue was running and have a frame buffer
sitting on their stacks.  I think it could be fixed with the below
patch, but I'm being told that it doesn't need it and that the driver
should just *quietly* drop the frame:
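
To make the race window concrete, here is a rough sketch of the
interleaving I mean (abridged and commented by hand -- not the patch
itself; the names follow net/mac80211/tx.c and
drivers/net/wireless/ralink/rt2x00/rt2x00queue.c, but this is not
verbatim kernel code):

/* mac80211, ieee80211_tx_frags(), simplified: */
spin_lock_irqsave(&local->queue_stop_reason_lock, flags);
if (local->queue_stop_reasons[q] ||
    (!txpending && !skb_queue_empty(&local->pending[q]))) {
	/* queue is stopped -> defer the frame and bail out */
}
spin_unlock_irqrestore(&local->queue_stop_reason_lock, flags);

/*
 * Preemption window: the queue looked "running" above, but nothing
 * stops another thread from filling the last slot and stopping the
 * queue right here, while our skb sits on the stack.
 */

drv_tx(local, vif, sta, skb);		/* -> rt2x00mac_tx() */

/* rt2x00, rt2x00queue_write_tx_frame(), simplified: */
spin_lock(&queue->tx_lock);
if (rt2x00queue_full(queue)) {
	rt2x00_err(queue->rt2x00dev,
		   "Dropping frame due to full tx queue %d\n", queue->qid);
	/* ...this is the drop we are seeing */
}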



Could anybody illuminate for me why this isn't done?  I know that there
has to be a point where we realize there are too many items in the queue
and we can't keep up, but this would seem to be a terrible way to do
that.  Is it one of those things where it isn't worth the performance
degradation?  Any insights would be most appreciated!! :)

Like Stanislaw, I have not been able to reproduce the problem for
testing, but somebody who has the problem (in production) is going to
get a build where the only change is to remove the rt2x00_err message.
It makes sense that the message itself could be the problem, as you
mentioned.  I suppose the proper thing to do is to just quietly drop the
frame and increment the struct net_device stats.tx_dropped counter?
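
Something like this, perhaps?  (A hypothetical sketch -- I haven't
checked what the cleanest way to reach the netdev from rt2x00 is;
using ieee80211_vif_to_wdev() here is my assumption:)

/* in the queue-full path of rt2x00queue_write_tx_frame(): */
struct ieee80211_tx_info *tx_info = IEEE80211_SKB_CB(skb);
struct wireless_dev *wdev = ieee80211_vif_to_wdev(tx_info->control.vif);

if (wdev && wdev->netdev)
	wdev->netdev->stats.tx_dropped++;	/* account the quiet drop */
return -ENOBUFS;	/* caller frees the skb via ieee80211_free_txskb() */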

Thanks,
Daniel

Comments

Johannes Berg Nov. 23, 2018, 7:45 p.m. UTC | #1
On Tue, 2018-11-20 at 15:20 -0600, Daniel Santos wrote:

> I believe I have the answer as to why we're getting frames after we stop
> the queue.  I had a little chat about this in #kernelnewbies and some
> other devs believe it is intentional.
> 
> There is a race in ieee80211_tx_frags (net/mac80211/tx.c) between
> releasing local->queue_stop_reason_lock and calling the driver's tx
> until we lock queue->tx_lock in rt2x00queue_write_tx_frame -- in between
> these two points the thread can be preempted.  So while we stop the
> queue in one thread, there can be 20 other threads in between that have
> already checked that the queue was running and have a frame buffer
> sitting on their stacks.

Not 20, only 1 per netdev queue. I suspect that means just 1 per
hardware queue, but I don't know how rt2x00 maps netdev queues to
hardware queues. If you have lots of netdevs, that might actually be 20,
but I suspect that's not what's going on here.

>   I think it could be fixed with the below
> patch, but I'm being told that it doesn't need it and that the driver
> should just *quietly* drop the frame:

[snip patch]

> Could anybody illuminate for me why this isn't done?  I know that there
> has to be a point where we realize there are too many items in the queue
> and we can't keep up, but this would seem to be a terrible way to do
> that.  Is it one of those things where it isn't worth the performance
> degradation?  Any insights would be most appreciated!! :)

There's just not much point, and doing the locking here will mean it's
across _all_ queues, whereas if you have multiple hardware queues you
wouldn't really need it.

Note that with internal TXQs with fq/codel support etc. we actually have
the fq->lock and it's global, and it turns out that's still a better
deal than actually doing parallel TX. So this may not matter so much.

In any case, while this might solve the problem for the specific case
you're thinking about, as soon as you have something else starting or
stopping queues from outside the TX path it still wouldn't actually
help.

By the way, one thing that can likely exacerbate the problem is
fragmentation, once you have that you're more likely to trigger lots of
frames in close succession.

What I would suggest doing in the driver is actually stop the queues
once a *threshold* is reached, rather than being full. Say if you have
128 entries in the HW queue, you can stop once you reach 120 (you
probably don't really need the other 8 places). If you get a 121st
frame, then you can still put it on the queue (until you filled 128),
but you can also stop the queue again (stopping queues is idempotent).
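
Roughly like this (a sketch with made-up names -- struct hw_queue,
hwq_push() and hwq_len() are illustrative, not actual rt2x00 code;
only ieee80211_stop_queue() is the real mac80211 call):

static void drv_tx_enqueue(struct ieee80211_hw *hw, struct hw_queue *q,
			   struct sk_buff *skb)
{
	/* mac80211 saw the queue as running, and we keep >= 8 slots of
	 * headroom below, so this entry is guaranteed to fit. */
	hwq_push(q, skb);

	/* Stop early, leaving the headroom for frames that were already
	 * past mac80211's check.  Stopping is idempotent, so hitting
	 * this for entries 120..127 of a 128-entry queue is harmless. */
	if (hwq_len(q) >= q->limit - 8)
		ieee80211_stop_queue(hw, q->qid);
}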

johannes
Stanislaw Gruszka Nov. 26, 2018, 9:38 a.m. UTC | #2
On Fri, Nov 23, 2018 at 08:45:54PM +0100, Johannes Berg wrote:
> On Tue, 2018-11-20 at 15:20 -0600, Daniel Santos wrote:
> 
> > I believe I have the answer as to why we're getting frames after we stop
> > the queue.  I had a little chat about this in #kernelnewbies and some
> > other devs believe it is intentional.
> > 
> > There is a race in ieee80211_tx_frags (net/mac80211/tx.c) between
> > releasing local->queue_stop_reason_lock and calling the driver's tx
> > until we lock queue->tx_lock in rt2x00queue_write_tx_frame -- in between
> > these two points the thread can be preempted.  So while we stop the
> > queue in one thread, there can be 20 other threads in between that have
> > already checked that the queue was running and have a frame buffer
> > sitting on their stacks.
> 
> Not 20, only 1 per netdev queue. I suspect that means just 1 per
> hardware queue, but I don't know how rt2x00 maps netdev queues to
> hardware queues. If you have lots of netdevs, that might actually be 20,
> but I suspect that's not what's going on here.
> 
> >   I think it could be fixed with the below
> > patch, but I'm being told that it doesn't need it and that the driver
> > should just *quietly* drop the frame:
> 
> [snip patch]
> 
> > Could anybody illuminate for me why this isn't done?  I know that there
> > has to be a point where we realize there are too many items in the queue
> > and we can't keep up, but this would seem to be a terrible way to do
> > that.  Is it one of those things where it isn't worth the performance
> > degradation?  Any insights would be most appreciated!! :)
> 
> There's just not much point, and doing the locking here will mean it's
> across _all_ queues, whereas if you have multiple hardware queues you
> wouldn't really need it.
> 
> Note that with internal TXQs with fq/codel support etc. we actually have
> the fq->lock and it's global, and it turns out that's still a better
> deal than actually doing parallel TX. So this may not matter so much.
> 
> In any case, while this might solve the problem for the specific case
> you're thinking about, as soon as you have something else starting or
> stopping queues from outside the TX path it still wouldn't actually
> help.
> 
> By the way, one thing that can likely exacerbate the problem is
> fragmentation, once you have that you're more likely to trigger lots of
> frames in close succession.
> 
> What I would suggest doing in the driver is actually stop the queues
> once a *threshold* is reached, rather than being full. Say if you have
> 128 entries in the HW queue, you can stop once you reach 120 (you
> probably don't really need the other 8 places). If you get a 121st
> frame, then you can still put it on the queue (until you filled 128),
> but you can also stop the queue again (stopping queues is idempotent).

That's what rt2x00 does, but we have 64 entries on the tx queues, and
because of that the threshold is small. In:

https://bugzilla.kernel.org/show_bug.cgi?id=82751

I proposed increasing the queue size to 256 and hence making the
threshold bigger. However, I was concerned about bufferbloat and
requested testing from users on how this affects latency. I never got
testing results :-(

Thanks
Stanislaw
Daniel Santos Dec. 3, 2018, 9:28 p.m. UTC | #3
Hello!

Sorry for my delay, I got pretty sick last week.

On 11/23/18 1:45 PM, Johannes Berg wrote:
> On Tue, 2018-11-20 at 15:20 -0600, Daniel Santos wrote:
>
>> I believe I have the answer as to why we're getting frames after we stop
>> the queue.  I had a little chat about this in #kernelnewbies and some
>> other devs believe it is intentional.
>>
>> There is a race in ieee80211_tx_frags (net/mac80211/tx.c) between
>> releasing local->queue_stop_reason_lock and calling the driver's tx
>> until we lock queue->tx_lock in rt2x00queue_write_tx_frame -- in between
>> these two points the thread can be preempted.  So while we stop the
>> queue in one thread, there can be 20 other threads in between that have
>> already checked that the queue was running and have a frame buffer
>> sitting on their stacks.
> Not 20, only 1 per netdev queue. I suspect that means just 1 per
> hardware queue, but I don't know how rt2x00 maps netdev queues to
> hardware queues. If you have lots of netdevs, that might actually be 20,
> but I suspect that's not what's going on here.

First of all, thank you for the response!

This code path is reached in at least two different ways that I have
discovered (thus far): from the tasklet and from userspace calling
sendmsg()/sendto().  A popular piece of captive portal software called
coova-chilli can have 100 such processes.  While I haven't verified
beyond all doubt that they *can* be preempted here, it most certainly
seems to be what is happening.  I'm attaching the backtraces for
reference.


>>   I think it could be fixed with the below
>> patch, but I'm being told that it doesn't need it and that the driver
>> should just *quietly* drop the frame:
> [snip patch]
>
>> Could anybody illuminate for me why this isn't done?  I know that there
>> has to be a point where we realize there are too many items in the queue
>> and we can't keep up, but this would seem to be a terrible way to do
>> that.  Is it one of those things where it isn't worth the performance
>> degradation?  Any insights would be most appreciated!! :)
> There's just not much point, and doing the locking here will mean it's
> across _all_ queues, whereas if you have multiple hardware queues you
> wouldn't really need it.

So perhaps a better solution would be to either:

a.) define a mechanism for the driver to expose per-queue spinlocks to
the ieee80211 subsystem, or
b.) define a new driver tx function (or change the existing one) to
return a value and ask ieee80211 to kindly stick the frame back in its
own queue (see the sketch below).
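
For (b), I'm imagining something like the following (purely
hypothetical -- mac80211's real .tx op returns void, so both the
return code and the requeue path are invented for illustration):

/* driver side: report back-pressure instead of dropping */
if (rt2x00queue_full(queue))
	return -ENOBUFS;	/* "please give this frame back to me later" */

/* mac80211 side, in ieee80211_tx_frags(): */
if (drv_tx(local, vif, sta, skb) == -ENOBUFS) {
	/* park the frame on the pending queue; it goes out again
	 * when the driver wakes the queue */
	skb_queue_head(&local->pending[q], skb);
	ieee80211_stop_queue(&local->hw, q);
	return false;
}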

I'm still new to this arena, so please be patient if I've said something
stupid.


> Note that with internal TXQs with fq/codel support etc. we actually have
> the fq->lock and it's global, and it turns out that's still a better
> deal than actually doing parallel TX. So this may not matter so much.
>
> In any case, while this might solve the problem for the specific case
> you're thinking about, as soon as you have something else starting or
> stopping queues from outside the TX path it still wouldn't actually
> help.

I don't think that's quite true.  Even if something else is starting and
stopping the queue, the patch could still *help* if it reduces the
instances of having to drop frames, even if it doesn't eliminate them.


>
> By the way, one thing that can likely exacerbate the problem is
> fragmentation, once you have that you're more likely to trigger lots of
> frames in close succession.


Unfortunately I'm new to this driver so I can't comment on that yet, but
I'm catching up!  I presume that's what RTS and CTS were added to the
standard for -- so we can send large frames w/o starving other stations?


> What I would suggest doing in the driver is actually stop the queues
> once a *threshold* is reached, rather than being full. Say if you have
> 128 entries in the HW queue, you can stop once you reach 120 (you
> probably don't really need the other 8 places). If you get a 121st
> frame, then you can still put it on the queue (until you filled 128),
> but you can also stop the queue again (stopping queues is idempotent).
>
> johannes
>
>
Yes, hopefully we can get test data using Stanislaw's patch soon!

Thanks,
Daniel
Thread 72 hit Breakpoint 1, rt2x00queue_write_tx_frame (queue=0x87573a1c, skb=0x85f45e40, sta=<optimized out>, local=false)
    at /home/daniel/proj/embedded/openwrt/gse/build_dir/target-mipsel_24kc_musl/linux-ramips_mt7620/backports-2017-11-01/drivers/net/wireless/ralink/rt2x00/rt2x00queue.c:635
635             skbdesc->tx_rate_idx = rate_idx;
(gdb) bt
#0  rt2x00queue_write_tx_frame (queue=0x87573a1c, skb=0x85f45e40, sta=<optimized out>, local=false)
    at /home/daniel/proj/embedded/openwrt/gse/build_dir/target-mipsel_24kc_musl/linux-ramips_mt7620/backports-2017-11-01/drivers/net/wireless/ralink/rt2x00/rt2x00queue.c:635
#1  0x876224b4 in rt2x00mac_tx (hw=0x868d8d40, control=0x87c39200, skb=0x85f45e40)
    at /home/daniel/proj/embedded/openwrt/gse/build_dir/target-mipsel_24kc_musl/linux-ramips_mt7620/backports-2017-11-01/drivers/net/wireless/ralink/rt2x00/rt2x00mac.c:152
#2  0x86937f30 in drv_tx (skb=<optimized out>, control=<optimized out>, local=<optimized out>)
    at /home/daniel/proj/embedded/openwrt/gse/build_dir/target-mipsel_24kc_musl/linux-ramips_mt7620/backports-2017-11-01/net/mac80211/driver-ops.h:34
#3  ieee80211_tx_frags (local=0x868d8d40, vif=0x85d00ed8, sta=0x875f3f48, skbs=0x87c3928c, txpending=false)
    at /home/daniel/proj/embedded/openwrt/gse/build_dir/target-mipsel_24kc_musl/linux-ramips_mt7620/backports-2017-11-01/net/mac80211/tx.c:1632
#4  0x869385ac in __ieee80211_tx (local=0x868d8d40, skbs=0x87c3928c, led_len=98, sta=<optimized out>, txpending=<optimized out>)
    at /home/daniel/proj/embedded/openwrt/gse/build_dir/target-mipsel_24kc_musl/linux-ramips_mt7620/backports-2017-11-01/net/mac80211/tx.c:1694
#5  0x8693ed30 in ieee80211_tx (sdata=0x85d00520, sta=<optimized out>, skb=0x85f45e40, txpending=false)
    at /home/daniel/proj/embedded/openwrt/gse/build_dir/target-mipsel_24kc_musl/linux-ramips_mt7620/backports-2017-11-01/net/mac80211/tx.c:1878
#6  0x869404bc in __ieee80211_subif_start_xmit (skb=<optimized out>, dev=0x85d00000, info_flags=0)
    at /home/daniel/proj/embedded/openwrt/gse/build_dir/target-mipsel_24kc_musl/linux-ramips_mt7620/backports-2017-11-01/net/mac80211/tx.c:3640
#7  0x86940880 in ieee80211_subif_start_xmit (skb=0x85f45e40, dev=0x85d00000)
    at /home/daniel/proj/embedded/openwrt/gse/build_dir/target-mipsel_24kc_musl/linux-ramips_mt7620/backports-2017-11-01/net/mac80211/tx.c:3782
#8  0x80516d20 in __netdev_start_xmit (more=<optimized out>, dev=<optimized out>, skb=<optimized out>, ops=<optimized out>) at ./include/linux/netdevice.h:4060
#9  netdev_start_xmit (more=<optimized out>, txq=<optimized out>, dev=<optimized out>, skb=<optimized out>) at ./include/linux/netdevice.h:4069
#10 xmit_one (more=<optimized out>, txq=<optimized out>, dev=<optimized out>, skb=<optimized out>) at net/core/dev.c:2994
#11 dev_hard_start_xmit (first=<optimized out>, dev=0x85d00000, txq=0x85e08930, ret=0x87c394b0) at net/core/dev.c:3013
#12 0x80544410 in sch_direct_xmit (skb=0x85f45e40, q=0x8760b000, dev=0x85d00000, txq=0x85e08930, root_lock=0x8760b048, validate=true) at net/sched/sch_generic.c:186
#13 0x805175e8 in __dev_xmit_skb (txq=<optimized out>, dev=<optimized out>, q=<optimized out>, skb=<optimized out>) at net/core/dev.c:3206
#14 __dev_queue_xmit (skb=0x85f45e40, accel_priv=<optimized out>) at net/core/dev.c:3481
#15 0x80517854 in dev_queue_xmit (skb=<optimized out>) at net/core/dev.c:3546
#16 0x80694494 in br_dev_queue_push_xmit (net=<optimized out>, sk=<optimized out>, skb=0x85f45e40) at net/bridge/br_forward.c:55
#17 0x80694628 in NF_HOOK (okfn=<optimized out>, out=<optimized out>, in=<optimized out>, skb=<optimized out>, sk=<optimized out>, net=<optimized out>,
    hook=<optimized out>, pf=<optimized out>) at ./include/linux/netfilter.h:251
#18 br_forward_finish (net=0x80a1a8c8 <init_net>, sk=0x0, skb=0x85f45e40) at net/bridge/br_forward.c:67
#19 0x8069485c in NF_HOOK (okfn=<optimized out>, out=<optimized out>, in=<optimized out>, skb=<optimized out>, sk=<optimized out>, net=<optimized out>,
    hook=<optimized out>, pf=<optimized out>) at ./include/linux/netfilter.h:251
#20 __br_forward (skb=0x0, local_orig=<optimized out>, to=<optimized out>) at net/bridge/br_forward.c:112
#21 0x80691618 in br_dev_xmit (skb=0x85f45e40, dev=0x877c8000) at net/bridge/br_device.c:89
#22 0x80516d20 in __netdev_start_xmit (more=<optimized out>, dev=<optimized out>, skb=<optimized out>, ops=<optimized out>) at ./include/linux/netdevice.h:4060
#23 netdev_start_xmit (more=<optimized out>, txq=<optimized out>, dev=<optimized out>, skb=<optimized out>) at ./include/linux/netdevice.h:4069
#24 xmit_one (more=<optimized out>, txq=<optimized out>, dev=<optimized out>, skb=<optimized out>) at net/core/dev.c:2994
#25 dev_hard_start_xmit (first=<optimized out>, dev=0x877c8000, txq=0x87426c00, ret=0x87c396f8) at net/core/dev.c:3013
#26 0x80517720 in __dev_queue_xmit (skb=0x85f45e60, accel_priv=<optimized out>) at net/core/dev.c:3513
#27 0x80517854 in dev_queue_xmit (skb=<optimized out>) at net/core/dev.c:3546
#28 0x805cb994 in neigh_hh_output (skb=<optimized out>, hh=<optimized out>) at ./include/net/neighbour.h:472
#29 neigh_output (skb=<optimized out>, n=<optimized out>) at ./include/net/neighbour.h:480
#30 ip_finish_output2 (net=<optimized out>, sk=<optimized out>, skb=0x85f45e40) at net/ipv4/ip_output.c:229
#31 0x805cd010 in ip_finish_output (net=<optimized out>, sk=<optimized out>, skb=<optimized out>) at net/ipv4/ip_output.c:317
#32 0x805ce0f4 in NF_HOOK_COND (cond=<optimized out>, okfn=<optimized out>, out=<optimized out>, in=<optimized out>, skb=<optimized out>, sk=<optimized out>,
    net=<optimized out>, hook=<optimized out>, pf=<optimized out>) at ./include/linux/netfilter.h:240
#33 ip_output (net=0x80a1a8c8 <init_net>, sk=0x85c4c140, skb=0x85f45e40) at net/ipv4/ip_output.c:405
#34 0x805cdab0 in ip_queue_xmit (sk=0x85c4c140, skb=0x85f45e40, fl=0x85c4c440) at net/ipv4/ip_output.c:504
#35 0x805e80a0 in __tcp_transmit_skb (sk=0x85c4c140, skb=0x0, clone_it=<optimized out>, gfp_mask=<optimized out>, rcv_nxt=811855119) at net/ipv4/tcp_output.c:1130
#36 0x805ea40c in __tcp_send_ack (sk=0x85c4c140, rcv_nxt=811855119) at net/ipv4/tcp_output.c:3601
#37 0x805e4028 in tcp_ack_snd_check (sk=<optimized out>) at net/ipv4/tcp_input.c:5147
#38 tcp_rcv_established (sk=0x85c4c140, skb=0x85f0ca80, th=0x85c3609c) at net/ipv4/tcp_input.c:5560
#39 0x805f0338 in tcp_v4_do_rcv (sk=0x85c4c140, skb=0x85f0ca80) at net/ipv4/tcp_ipv4.c:1464
#40 0x805f1ac0 in tcp_v4_rcv (skb=0x85f0ca80) at net/ipv4/tcp_ipv4.c:1745
#41 0x805c7954 in ip_local_deliver_finish (net=<optimized out>, sk=<optimized out>, skb=0x85f0ca80) at net/ipv4/ip_input.c:216
#42 0x805c81c8 in NF_HOOK (okfn=<optimized out>, out=<optimized out>, in=<optimized out>, skb=<optimized out>, sk=<optimized out>, net=<optimized out>,
    hook=<optimized out>, pf=<optimized out>) at ./include/linux/netfilter.h:251
#43 ip_local_deliver (skb=0x85f45e78) at net/ipv4/ip_input.c:257
#44 0x805c8584 in NF_HOOK (okfn=<optimized out>, out=<optimized out>, in=<optimized out>, skb=<optimized out>, sk=<optimized out>, net=<optimized out>,
    hook=<optimized out>, pf=<optimized out>) at ./include/linux/netfilter.h:251
#45 ip_rcv (skb=0x85f0ca80, dev=0x877c8000, pt=<optimized out>, orig_dev=<optimized out>) at net/ipv4/ip_input.c:493
#46 0x80513d8c in __netif_receive_skb_core (skb=0x85f0ca80, pfmemalloc=<optimized out>) at net/core/dev.c:4465
#47 0x80518900 in netif_receive_skb_internal (skb=0x85f0ca80) at net/core/dev.c:4576
#48 0x80518b7c in netif_receive_skb (skb=<optimized out>) at net/core/dev.c:4600
#49 0x806961e8 in br_netif_receive_skb (skb=<optimized out>, sk=<optimized out>, net=<optimized out>) at net/bridge/br_input.c:34
#50 NF_HOOK (okfn=<optimized out>, out=<optimized out>, in=<optimized out>, skb=<optimized out>, sk=<optimized out>, net=<optimized out>, hook=<optimized out>,
    pf=<optimized out>) at ./include/linux/netfilter.h:251
#51 br_pass_frame_up (skb=0x85f0ca80) at net/bridge/br_input.c:69
#52 0x8069635c in br_handle_frame_finish (net=<optimized out>, sk=<optimized out>, skb=0x85f0ca80) at net/bridge/br_input.c:222
#53 0x80696e88 in NF_HOOK (okfn=<optimized out>, out=<optimized out>, in=<optimized out>, skb=<optimized out>, sk=<optimized out>, net=<optimized out>,
    hook=<optimized out>, pf=<optimized out>) at ./include/linux/netfilter.h:251
#54 br_handle_frame (pskb=<optimized out>) at net/bridge/br_input.c:359
#55 0x80513c78 in __netif_receive_skb_core (skb=0x85f0ca80, pfmemalloc=false) at net/core/dev.c:4419
#56 0x80518900 in netif_receive_skb_internal (skb=0x85f0ca80) at net/core/dev.c:4576
#57 0x80518b7c in netif_receive_skb (skb=<optimized out>) at net/core/dev.c:4600
#58 0x86931280 in ieee80211_deliver_skb (rx=0x87c39e10)
    at /home/daniel/proj/embedded/openwrt/gse/build_dir/target-mipsel_24kc_musl/linux-ramips_mt7620/backports-2017-11-01/net/mac80211/rx.c:2343
#59 0x86934714 in ieee80211_rx_h_data (rx=<optimized out>)
    at /home/daniel/proj/embedded/openwrt/gse/build_dir/target-mipsel_24kc_musl/linux-ramips_mt7620/backports-2017-11-01/net/mac80211/rx.c:2674
#60 ieee80211_rx_handlers (rx=0x87c39e10, frames=0x87c39d88)
    at /home/daniel/proj/embedded/openwrt/gse/build_dir/target-mipsel_24kc_musl/linux-ramips_mt7620/backports-2017-11-01/net/mac80211/rx.c:3418
#61 0x86936334 in ieee80211_invoke_rx_handlers (rx=<optimized out>)
    at /home/daniel/proj/embedded/openwrt/gse/build_dir/target-mipsel_24kc_musl/linux-ramips_mt7620/backports-2017-11-01/net/mac80211/rx.c:3459
#62 ieee80211_prepare_and_rx_handle (rx=0x87c39e10, skb=0x85f45e60, consume=<optimized out>)
    at /home/daniel/proj/embedded/openwrt/gse/build_dir/target-mipsel_24kc_musl/linux-ramips_mt7620/backports-2017-11-01/net/mac80211/rx.c:4152
#63 0x86936bc4 in __ieee80211_rx_handle_packet (napi=<optimized out>, skb=<optimized out>, pubsta=<optimized out>, hw=<optimized out>)
    at /home/daniel/proj/embedded/openwrt/gse/build_dir/target-mipsel_24kc_musl/linux-ramips_mt7620/backports-2017-11-01/net/mac80211/rx.c:4212
#64 ieee80211_rx_napi (hw=0x868d8d40, pubsta=0x0, skb=0x85f0ca80, napi=0x0)
    at /home/daniel/proj/embedded/openwrt/gse/build_dir/target-mipsel_24kc_musl/linux-ramips_mt7620/backports-2017-11-01/net/mac80211/rx.c:4393
#65 0x8762167c in ieee80211_rx (skb=<optimized out>, hw=<optimized out>)
    at /home/daniel/proj/embedded/openwrt/gse/build_dir/target-mipsel_24kc_musl/linux-ramips_mt7620/backports-2017-11-01/include/net/mac80211.h:4036
#66 ieee80211_rx_ni (skb=<optimized out>, hw=<optimized out>)
    at /home/daniel/proj/embedded/openwrt/gse/build_dir/target-mipsel_24kc_musl/linux-ramips_mt7620/backports-2017-11-01/include/net/mac80211.h:4071
#67 rt2x00lib_rxdone (entry=0x875d46c8, gfp=<optimized out>)
    at /home/daniel/proj/embedded/openwrt/gse/build_dir/target-mipsel_24kc_musl/linux-ramips_mt7620/backports-2017-11-01/drivers/net/wireless/ralink/rt2x00/rt2x00dev.c:873
#68 0x8687e0c4 in rt2x00mmio_rxdone (rt2x00dev=0x868d9c40)
    at /home/daniel/proj/embedded/openwrt/gse/build_dir/target-mipsel_24kc_musl/linux-ramips_mt7620/backports-2017-11-01/drivers/net/wireless/ralink/rt2x00/rt2x00mmio.c:92
#69 0x868b072c in rt2800mmio_rxdone_tasklet (data=<optimized out>)
    at /home/daniel/proj/embedded/openwrt/gse/build_dir/target-mipsel_24kc_musl/linux-ramips_mt7620/backports-2017-11-01/drivers/net/wireless/ralink/rt2x00/rt2800mmio.c:410
#70 0x8002d744 in tasklet_action (a=<optimized out>) at kernel/softirq.c:513
#71 0x806d2194 in __do_softirq () at kernel/softirq.c:288
#72 0x8002d190 in invoke_softirq () at kernel/softirq.c:368
#73 irq_exit () at kernel/softirq.c:409
#74 0x800099b4 in handle_int () at arch/mips/kernel/genex.S:225
Backtrace stopped: frame did not save the PC
(gdb) cont 100
Will ignore next 99 crossings of breakpoint 1.  Continuing.
[New Thread 2854]
[New Thread 2855]
[New Thread 2856]
[New Thread 2868]
[Switching to Thread 2752]

Thread 74 hit Breakpoint 1, rt2x00queue_write_tx_frame (queue=0x875738b4, skb=0x87e3a840, sta=<optimized out>, local=false)
    at /home/daniel/proj/embedded/openwrt/gse/build_dir/target-mipsel_24kc_musl/linux-ramips_mt7620/backports-2017-11-01/drivers/net/wireless/ralink/rt2x00/rt2x00queue.c:635
635             skbdesc->tx_rate_idx = rate_idx;
(gdb) bt
#0  rt2x00queue_write_tx_frame (queue=0x875738b4, skb=0x87e3a840, sta=<optimized out>, local=false)
    at /home/daniel/proj/embedded/openwrt/gse/build_dir/target-mipsel_24kc_musl/linux-ramips_mt7620/backports-2017-11-01/drivers/net/wireless/ralink/rt2x00/rt2x00queue.c:635
#1  0x876224b4 in rt2x00mac_tx (hw=0x868d8d40, control=0x85ee7a08, skb=0x87e3a840)
    at /home/daniel/proj/embedded/openwrt/gse/build_dir/target-mipsel_24kc_musl/linux-ramips_mt7620/backports-2017-11-01/drivers/net/wireless/ralink/rt2x00/rt2x00mac.c:152
#2  0x86937f30 in drv_tx (skb=<optimized out>, control=<optimized out>, local=<optimized out>)
    at /home/daniel/proj/embedded/openwrt/gse/build_dir/target-mipsel_24kc_musl/linux-ramips_mt7620/backports-2017-11-01/net/mac80211/driver-ops.h:34
#3  ieee80211_tx_frags (local=0x868d8d40, vif=0x85d00ed8, sta=0x0, skbs=0x85ee7a94, txpending=false)
    at /home/daniel/proj/embedded/openwrt/gse/build_dir/target-mipsel_24kc_musl/linux-ramips_mt7620/backports-2017-11-01/net/mac80211/tx.c:1632
#4  0x869385ac in __ieee80211_tx (local=0x868d8d40, skbs=0x85ee7a94, led_len=189, sta=<optimized out>, txpending=<optimized out>)
    at /home/daniel/proj/embedded/openwrt/gse/build_dir/target-mipsel_24kc_musl/linux-ramips_mt7620/backports-2017-11-01/net/mac80211/tx.c:1694
#5  0x8693ed30 in ieee80211_tx (sdata=0x85d00520, sta=<optimized out>, skb=0x87e3a840, txpending=false)
    at /home/daniel/proj/embedded/openwrt/gse/build_dir/target-mipsel_24kc_musl/linux-ramips_mt7620/backports-2017-11-01/net/mac80211/tx.c:1878
#6  0x86940de0 in __ieee80211_tx_skb_tid_band (sdata=0x85d00520, skb=0x87e3a840, tid=<optimized out>, band=NL80211_BAND_2GHZ)
    at /home/daniel/proj/embedded/openwrt/gse/build_dir/target-mipsel_24kc_musl/linux-ramips_mt7620/backports-2017-11-01/net/mac80211/tx.c:4749
#7  0x86915e0c in ieee80211_tx_skb_tid (tid=<optimized out>, skb=<optimized out>, sdata=<optimized out>)
    at /home/daniel/proj/embedded/openwrt/gse/build_dir/target-mipsel_24kc_musl/linux-ramips_mt7620/backports-2017-11-01/net/mac80211/ieee80211_i.h:1930
#8  ieee80211_tx_skb (skb=<optimized out>, sdata=<optimized out>)
    at /home/daniel/proj/embedded/openwrt/gse/build_dir/target-mipsel_24kc_musl/linux-ramips_mt7620/backports-2017-11-01/net/mac80211/ieee80211_i.h:1939
#9  ieee80211_mgmt_tx (wiphy=<optimized out>, wdev=<optimized out>, params=<optimized out>, cookie=<optimized out>)
    at /home/daniel/proj/embedded/openwrt/gse/build_dir/target-mipsel_24kc_musl/linux-ramips_mt7620/backports-2017-11-01/net/mac80211/offchannel.c:941
#10 0x877372ac in rdev_mgmt_tx (cookie=<optimized out>, params=<optimized out>, wdev=<optimized out>, rdev=<optimized out>)
    at /home/daniel/proj/embedded/openwrt/gse/build_dir/target-mipsel_24kc_musl/linux-ramips_mt7620/backports-2017-11-01/net/wireless/rdev-ops.h:711
#11 cfg80211_mlme_mgmt_tx (rdev=<optimized out>, wdev=0x85d00528, params=0x85ee7bd0, cookie=0x85ee7c08)
    at /home/daniel/proj/embedded/openwrt/gse/build_dir/target-mipsel_24kc_musl/linux-ramips_mt7620/backports-2017-11-01/net/wireless/mlme.c:691
#12 0x877284b8 in nl80211_tx_mgmt (skb=<optimized out>, info=0x85ee7c48)
    at /home/daniel/proj/embedded/openwrt/gse/build_dir/target-mipsel_24kc_musl/linux-ramips_mt7620/backports-2017-11-01/net/wireless/nl80211.c:9652
#13 0x80552444 in genl_family_rcv_msg (extack=<optimized out>, nlh=<optimized out>, skb=<optimized out>, family=<optimized out>) at net/netlink/genetlink.c:599
#14 genl_rcv_msg (skb=0x87e3a780, nlh=<optimized out>, extack=0x85ee7ca8) at net/netlink/genetlink.c:624
#15 0x80550e4c in netlink_rcv_skb (skb=0x87e3a780, cb=0x80552144 <genl_rcv_msg>) at net/netlink/af_netlink.c:2432
#16 0x80551b94 in genl_rcv (skb=0x87e3a780) at net/netlink/genetlink.c:635
#17 0x80550514 in netlink_unicast_kernel (ssk=<optimized out>, skb=<optimized out>, sk=<optimized out>) at net/netlink/af_netlink.c:1286
#18 netlink_unicast (ssk=0x85c56000, skb=0x87e3a780, portid=0, nonblock=<optimized out>) at net/netlink/af_netlink.c:1312
#19 0x80550a4c in netlink_sendmsg (sock=0x8739b760, msg=0x85ee7ee0, len=236) at net/netlink/af_netlink.c:1877
#20 0x804f50f8 in sock_sendmsg_nosec (msg=<optimized out>, sock=<optimized out>) at net/socket.c:645
#21 sock_sendmsg (sock=<optimized out>, msg=<optimized out>) at net/socket.c:655
#22 0x804f6a84 in ___sys_sendmsg (sock=0x8739b760, msg=<optimized out>, msg_sys=0x85ee7ee0, flags=<optimized out>, used_address=0x0,
    allowed_msghdr_flags=<optimized out>) at net/socket.c:2061
#23 0x804f7864 in __sys_sendmsg (fd=<optimized out>, msg=0x7fd8ab30, flags=0) at net/socket.c:2095
#24 0x800146c4 in handle_sys () at arch/mips/kernel/scall32-o32.S:102
Backtrace stopped: frame did not save the PC
Daniel Santos Dec. 3, 2018, 9:44 p.m. UTC | #4
On 11/26/18 3:38 AM, Stanislaw Gruszka wrote:
> On Fri, Nov 23, 2018 at 08:45:54PM +0100, Johannes Berg wrote:
>> On Tue, 2018-11-20 at 15:20 -0600, Daniel Santos wrote:
>>
>>> I believe I have the answer as to why we're getting frames after we stop
>>> the queue.  I had a little chat about this in #kernelnewbies and some
>>> other devs believe it is intentional.
>>>
>>> There is a race in ieee80211_tx_frags (net/mac80211/tx.c) between
>>> releasing local->queue_stop_reason_lock and calling the driver's tx
>>> until we lock queue->tx_lock in rt2x00queue_write_tx_frame -- in between
>>> these two points the thread can be preempted.  So while we stop the
>>> queue in one thread, there can be 20 other threads in between that have
>>> already checked that the queue was running and have a frame buffer
>>> sitting on their stacks.
>> Not 20, only 1 per netdev queue. I suspect that means just 1 per
>> hardware queue, but I don't know how rt2x00 maps netdev queues to
>> hardware queues. If you have lots of netdevs, that might actually be 20,
>> but I suspect that's not what's going on here.
>>
>>>   I think it could be fixed with the below
>>> patch, but I'm being told that it doesn't need it and that the driver
>>> should just *quietly* drop the frame:
>> [snip patch]
>>
>>> Could anybody illuminate for me why this isn't done?  I know that there
>>> has to be a point where we realize there are too many items in the queue
>>> and we can't keep up, but this would seem to be a terrible way to do
>>> that.  Is it one of those things where it isn't worth the performance
>>> degradation?  Any insights would be most appreciated!! :)
>> There's just not much point, and doing the locking here will mean it's
>> across _all_ queues, whereas if you have multiple hardware queues you
>> wouldn't really need it.
>>
>> Note that with internal TXQs with fq/codel support etc. we actually have
>> the fq->lock and it's global, and it turns out that's still a better
>> deal than actually doing parallel TX. So this may not matter so much.
>>
>> In any case, while this might solve the problem for the specific case
>> you're thinking about, as soon as you have something else starting or
>> stopping queues from outside the TX path it still wouldn't actually
>> help.
>>
>> By the way, one thing that can likely exacerbate the problem is
>> fragmentation, once you have that you're more likely to trigger lots of
>> frames in close succession.
>>
>> What I would suggest doing in the driver is actually stop the queues
>> once a *threshold* is reached, rather than being full. Say if you have
>> 128 entries in the HW queue, you can stop once you reach 120 (you
>> probably don't really need the other 8 places). If you get a 121st
>> frame, then you can still put it on the queue (until you filled 128),
>> but you can also stop the queue again (stopping queues is idempotent).
> That's what rt2x00 does, but we have 64 entries on the tx queues, and
> because of that the threshold is small. In:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=82751
>
> I proposed increasing the queue size to 256 and hence making the
> threshold bigger. However, I was concerned about bufferbloat and
> requested testing from users on how this affects latency. I never got
> testing results :-(
>
> Thanks
> Stanislaw
>
Hello Stanislaw,

I almost managed to get that patch into a build to send to somebody who
can reproduce the error in abundance, but they have to have 15 different
people hammer the router to do it, and we ended up sending them a few
other experimental builds instead.

I'm still learning this driver, but I don't see where it creates a
struct net_device -- was that something that came out after the driver
was originally written? (And maybe gets implicitly created somewhere
else?)  iiuc, the best way to do this is by setting tx_queue_len while
the interface is down (via "ip link") and then allocating the queue when
it's brought up.

Unless you know of a problem with this approach, I'm planning on making
a patch just for that.  It will then be easier to tune for an end user's
particular application.  For instance, if there is a small number of
users who each use a very large amount of bandwidth, then bufferbloat
could become a problem if the queue is too large.  But for a larger
number of users I don't think bufferbloat will matter as much as the
performance lost from dropping frames and needing to re-request many
lost packets at the TCP layer.  That would also result in more uplink
bandwidth being consumed.

BTW, I guess we need that struct net_device object in order to increment
tx_dropped properly as well.

Thanks,
Daniel
Stanislaw Gruszka Dec. 4, 2018, 10:16 a.m. UTC | #5
Hi Daniel

On Mon, Dec 03, 2018 at 03:44:46PM -0600, Daniel Santos wrote:
> I almost managed to get that patch in a build to send to somebody who
> can reproduce the error in abundance, but they have 15 different people
> hammer the router to do it and we ended up sending them a few other
> experimental builds instead.
> 
> I'm still learning this driver, but I don't see where it creates a
> struct net_device -- was that something that came out after the driver
> was originally written? (And maybe gets implicitly created somewhere
> else?)  

It is done in ieee80211_if_add(), one netdev per vif.

> iiuc, the best way to do this is by setting tx_queue_len while
> the interface is down (via "ip link") and then allocating the queue when
> it's brought up.

We have different queues at various levels in the network stack.
The queues whose size I plan to increase are the ones referenced as
HW queues.

> Unless you know of a problem with this approach, I'm planning on making
> a patch just for that.  It will then be easier to tune for an end user's
> particular application. 

I don't think that's the correct approach.  Maybe a module parameter
would be better, just for testing.  But since this is for routers/APs,
just making a build with a different tx queue size seems to be the
better approach.
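
For testing it could be as simple as this (a sketch; the parameter
name is made up and nothing like it exists in the driver today):

/* rt2x00queue.c */
static unsigned int modparam_txq_size = 64;
module_param_named(txq_size, modparam_txq_size, uint, 0444);
MODULE_PARM_DESC(txq_size, "Number of entries per TX data queue");

...and then using modparam_txq_size instead of the compile-time queue
size when the data queues are allocated.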

> For instance, if there is a small number of
> users who each use a very large amount of bandwidth, then bufferbloat
> could become a problem if the queue is too large.  But for a larger
> number of users I don't think bufferbloat will matter as much as the
> performance lost from dropping frames and needing to re-request many
> lost packets at the TCP layer.  That would also result in more uplink
> bandwidth being consumed.

Well, I guess that's correct, but I still want to see that increasing
the queue size does not cause a big negative effect.

Thanks
Stanislaw

Patch

diff --git a/net/mac80211/ieee80211_i.h b/net/mac80211/ieee80211_i.h
index d43dab4..29009e0 100644
--- a/net/mac80211/ieee80211_i.h
+++ b/net/mac80211/ieee80211_i.h
@@ -1117,6 +1117,7 @@  struct ieee80211_local {
 	int q_stop_reasons[IEEE80211_MAX_QUEUES][IEEE80211_QUEUE_STOP_REASONS];
 	/* also used to protect ampdu_ac_queue and amdpu_ac_stop_refcnt */
 	spinlock_t queue_stop_reason_lock;
+	spinlock_t tx_path_lock;
 
 	int open_count;
 	int monitors, cooked_mntrs;
diff --git a/net/mac80211/main.c b/net/mac80211/main.c
index 4f8eceb..2312ac9 100644
--- a/net/mac80211/main.c
+++ b/net/mac80211/main.c
@@ -609,6 +609,7 @@  struct ieee80211_hw *ieee80211_alloc_hw_nm(size_t priv_data_len,
 	spin_lock_init(&local->filter_lock);
 	spin_lock_init(&local->rx_path_lock);
 	spin_lock_init(&local->queue_stop_reason_lock);
+	spin_lock_init(&local->tx_path_lock);
 
 	INIT_LIST_HEAD(&local->chanctx_list);
 	mutex_init(&local->chanctx_mtx);
diff --git a/net/mac80211/tx.c b/net/mac80211/tx.c
index 87d807f..8d43e2a 100644
--- a/net/mac80211/tx.c
+++ b/net/mac80211/tx.c
@@ -1311,6 +1311,7 @@  static bool ieee80211_tx_frags(struct ieee80211_local *local,
 		}
 #endif
 
+		spin_lock_bh(&local->tx_path_lock);
 		spin_lock_irqsave(&local->queue_stop_reason_lock, flags);
 		if (local->queue_stop_reasons[q] ||
 		    (!txpending && !skb_queue_empty(&local->pending[q]))) {
@@ -1327,6 +1328,7 @@  static bool ieee80211_tx_frags(struct ieee80211_local *local,
 					spin_unlock_irqrestore(
 						&local->queue_stop_reason_lock,
 						flags);
+					spin_unlock_bh(&local->tx_path_lock);
 					ieee80211_purge_tx_queue(&local->hw,
 								 skbs);
 					return true;
@@ -1347,6 +1349,7 @@  static bool ieee80211_tx_frags(struct ieee80211_local *local,
 
 				spin_unlock_irqrestore(&local->queue_stop_reason_lock,
 						       flags);
+				spin_unlock_bh(&local->tx_path_lock);
 				return false;
 			}
 		}
@@ -1356,6 +1359,7 @@  static bool ieee80211_tx_frags(struct ieee80211_local *local,
 
 		__skb_unlink(skb, skbs);
 		ieee80211_drv_tx(local, vif, sta, skb);
+		spin_unlock_bh(&local->tx_path_lock);
 	}
 
 	return true;