[RFC,v2] iwlwifi: pcie: transmit queue auto-sizing
diff mbox

Message ID 1454616988-21901-1-git-send-email-emmanuel.grumbach@intel.com
State RFC
Delegated to: Johannes Berg
Headers show

Commit Message

Emmanuel Grumbach Feb. 4, 2016, 8:16 p.m. UTC
As many (all?) WiFi devices, Intel WiFi devices have
transmit queues which have 256 transmit descriptors
each and each descriptor corresponds to an MPDU.
This means that when it is full, the queue contains
256 * ~1500 bytes to be transmitted (if we don't have
A-MSDUs). The purpose of those queues is to have enough
packets to be ready for transmission so that when the device
gets an opportunity to transmit (TxOP), it can take as many
packets as the spec allows and aggregate them into one
A-MPDU or even several A-MPDUs if we are using bursts.

The problem is that the packets that are in these queues are
already out of control of the Qdisc and can stay in those
queues for a fairly long time when the link condition is
not good. This leads to the well known bufferbloat problem.

This patch adds a way to tune the size of the transmit queue
so that it won't cause excessive latency. When the link
condition is good, the packets will flow smoothly and the
transmit queue will grow quickly allowing A-MPDUs and
maximal throughput. When the link is not optimal, we will
have retransmissions, lower transmit rates or signal
detection (CCA) which will cause a delay in the packet
transmission. The driver will sense this higher latency
and will reduce the size of the transmit queue.
This means that the packets that continue to arrive
will pile up in the Qdisc rather than in the device
queues. The major advantage of this approach is that
codel can now do its work.

The algorithm is really (too?) simple:
every 5 seconds, starts from a short queue again.
If a packet has been in the queue for less than 10ms,
allow 10 more MPDUs in.
If a packet has been in the queue for more than 20ms,
reduce by 10 the size of the transmit queue.

The implementation is really naive and way too simple:
 * reading jiffies for every Tx / Tx status is not a
   good idead.
 * jiffies are not fine-grained enough on all platforms
 * the constants chosen are really arbitrary and can't be
   tuned.
 * This may be implemented in mac80211 probably and help
   other drivers.
 * etc...

But already this gives nice results. I ran a very simple
experiment: I put the device in a controlled environment
and ran traffic while running default sized ping in the
background. In this configuration, our device quickly
raises its transmission rate to the best rate.
Then, I force the device to use the lowest rate (6Mbps).
Of course, the throughput collapses, but the ping RTT
shoots up.
Using codel helps, but the latency is still high. Codel
with this patch gives much better results:

pfifo_fast:
rtt min/avg/max/mdev = 1932.616/2393.284/2833.407/315.941 ms, pipe 3, ipg/ewma 2215.707/2446.884 ms

fq_codel + Tx queue auto-sizing:
rtt min/avg/max/mdev = 13.541/32.396/54.791/9.610 ms, ipg/ewma 200.685/32.202 ms

fq_codel without Tx queue auto-sizing:
rtt min/avg/max/mdev = 140.821/257.303/331.889/31.074 ms, pipe 2, ipg/ewma 258.147/252.847 ms

Clearly, there is more work to do to be able to merge this,
but it seems that the wireless problems mentioned in
https://lwn.net/Articles/616241/ may have a solution.

Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Dave Taht <dave.taht@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
---
Fix Dave's email address
---
 drivers/net/wireless/intel/iwlwifi/pcie/internal.h |  6 ++++
 drivers/net/wireless/intel/iwlwifi/pcie/tx.c       | 32 ++++++++++++++++++++--
 2 files changed, 35 insertions(+), 3 deletions(-)

Comments

Ben Greear Feb. 4, 2016, 8:45 p.m. UTC | #1
On 02/04/2016 12:16 PM, Emmanuel Grumbach wrote:
> As many (all?) WiFi devices, Intel WiFi devices have
> transmit queues which have 256 transmit descriptors
> each and each descriptor corresponds to an MPDU.
> This means that when it is full, the queue contains
> 256 * ~1500 bytes to be transmitted (if we don't have
> A-MSDUs). The purpose of those queues is to have enough
> packets to be ready for transmission so that when the device
> gets an opportunity to transmit (TxOP), it can take as many
> packets as the spec allows and aggregate them into one
> A-MPDU or even several A-MPDUs if we are using bursts.

I guess this is only really usable if you have exactly one
peer connected (ie, in station mode)?

Otherwise, you could have one slow peer and one fast one,
and then I suspect this would not work so well?

For reference, ath10k has around 1400 tx descriptors, though
in practice not all are usable, and in stock firmware, I'm guessing
the NIC will never be able to actually fill up it's tx descriptors
and stop traffic.  Instead, it just allows the stack to try to
TX, then drops the frame...

Thanks,
Ben
Emmanuel Grumbach Feb. 4, 2016, 8:56 p.m. UTC | #2
On 02/04/2016 10:46 PM, Ben Greear wrote:
> On 02/04/2016 12:16 PM, Emmanuel Grumbach wrote:
>> As many (all?) WiFi devices, Intel WiFi devices have
>> transmit queues which have 256 transmit descriptors
>> each and each descriptor corresponds to an MPDU.
>> This means that when it is full, the queue contains
>> 256 * ~1500 bytes to be transmitted (if we don't have
>> A-MSDUs). The purpose of those queues is to have enough
>> packets to be ready for transmission so that when the device
>> gets an opportunity to transmit (TxOP), it can take as many
>> packets as the spec allows and aggregate them into one
>> A-MPDU or even several A-MPDUs if we are using bursts.
> I guess this is only really usable if you have exactly one
> peer connected (ie, in station mode)?
>
> Otherwise, you could have one slow peer and one fast one,
> and then I suspect this would not work so well?

Yes. I guess this one (big) limitation. I guess that what would happen
in this case is that the the latency would constantly jitter. But I also
noticed that I could reduce the transmit queue to 130 descriptors
(instead of 256) and still reach maximal throughput because we can
refill the queues quickly enough.
In iwlwifi, we have plans to have one queue for each peer.
This is under development. Not sure when it'll be ready. It also requires
firmware change obviously.

> For reference, ath10k has around 1400 tx descriptors, though
> in practice not all are usable, and in stock firmware, I'm guessing
> the NIC will never be able to actually fill up it's tx descriptors
> and stop traffic.  Instead, it just allows the stack to try to
> TX, then drops the frame...

1400 descriptors, ok... but they are not organised in queues?
(forgive my ignorance of athX drivers)

>
> Thanks,
> Ben
>

--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ben Greear Feb. 4, 2016, 9:14 p.m. UTC | #3
On 02/04/2016 12:56 PM, Grumbach, Emmanuel wrote:
>
>
> On 02/04/2016 10:46 PM, Ben Greear wrote:
>> On 02/04/2016 12:16 PM, Emmanuel Grumbach wrote:
>>> As many (all?) WiFi devices, Intel WiFi devices have
>>> transmit queues which have 256 transmit descriptors
>>> each and each descriptor corresponds to an MPDU.
>>> This means that when it is full, the queue contains
>>> 256 * ~1500 bytes to be transmitted (if we don't have
>>> A-MSDUs). The purpose of those queues is to have enough
>>> packets to be ready for transmission so that when the device
>>> gets an opportunity to transmit (TxOP), it can take as many
>>> packets as the spec allows and aggregate them into one
>>> A-MPDU or even several A-MPDUs if we are using bursts.
>> I guess this is only really usable if you have exactly one
>> peer connected (ie, in station mode)?
>>
>> Otherwise, you could have one slow peer and one fast one,
>> and then I suspect this would not work so well?
>
> Yes. I guess this one (big) limitation. I guess that what would happen
> in this case is that the the latency would constantly jitter. But I also
> noticed that I could reduce the transmit queue to 130 descriptors
> (instead of 256) and still reach maximal throughput because we can
> refill the queues quickly enough.
> In iwlwifi, we have plans to have one queue for each peer.
> This is under development. Not sure when it'll be ready. It also requires
> firmware change obviously.

Per-peer queues will probably be nice, especially if we can keep the
buffer bloat manageable.

>> For reference, ath10k has around 1400 tx descriptors, though
>> in practice not all are usable, and in stock firmware, I'm guessing
>> the NIC will never be able to actually fill up it's tx descriptors
>> and stop traffic.  Instead, it just allows the stack to try to
>> TX, then drops the frame...
>
> 1400 descriptors, ok... but they are not organised in queues?
> (forgive my ignorance of athX drivers)

I think all the details are in the firmware, at least for now.

The firmware details are probably not something I should go into, but suffice it to say
its complex and varies between firmware versions in non-trivial ways.

Thanks,
Ben
Michal Kazior Feb. 5, 2016, 8:44 a.m. UTC | #4
On 4 February 2016 at 22:14, Ben Greear <greearb@candelatech.com> wrote:
> On 02/04/2016 12:56 PM, Grumbach, Emmanuel wrote:
>> On 02/04/2016 10:46 PM, Ben Greear wrote:
>>> On 02/04/2016 12:16 PM, Emmanuel Grumbach wrote:
>>>>
>>>> As many (all?) WiFi devices, Intel WiFi devices have
>>>> transmit queues which have 256 transmit descriptors
>>>> each and each descriptor corresponds to an MPDU.
>>>> This means that when it is full, the queue contains
>>>> 256 * ~1500 bytes to be transmitted (if we don't have
>>>> A-MSDUs). The purpose of those queues is to have enough
>>>> packets to be ready for transmission so that when the device
>>>> gets an opportunity to transmit (TxOP), it can take as many
>>>> packets as the spec allows and aggregate them into one
>>>> A-MPDU or even several A-MPDUs if we are using bursts.
>>>
>>> I guess this is only really usable if you have exactly one
>>> peer connected (ie, in station mode)?
>>>
>>> Otherwise, you could have one slow peer and one fast one,
>>> and then I suspect this would not work so well?
>>
>>
>> Yes. I guess this one (big) limitation. I guess that what would happen
>> in this case is that the the latency would constantly jitter. But I also

Hmm.. You'd probably need to track per-station packet sojourn time as
well and make it possible to stop/wake queues per station.


>> noticed that I could reduce the transmit queue to 130 descriptors
>> (instead of 256) and still reach maximal throughput because we can
>> refill the queues quickly enough.
>> In iwlwifi, we have plans to have one queue for each peer.
>> This is under development. Not sure when it'll be ready. It also requires
>> firmware change obviously.
>
> Per-peer queues will probably be nice, especially if we can keep the
> buffer bloat manageable.

Per-station queues sound tricky if you consider bufferbloat.

To maximize use of airtime (i.e. txop) you need to send big
aggregates. Since aggregates are per station-tid to maximize
multi-station performance (in AP mode) you'll need to queue a lot of
frames, per each station, depending on the chosen tx rate.

A bursted txop can be as big as 5-10ms. If you consider you want to
queue 5-10ms worth of data for *each* station at any given time you
obviously introduce a lot of lag. If you have 10 stations you might
end up with service period at 10*10ms = 100ms. This gets even worse if
you consider MU-MIMO because you need to do an expensive sounding
procedure before transmitting. So while SU aggregation can probably
still work reasonably well with shorter bursts (1-2ms) MU needs at
least 3ms to get *any* gain when compared to SU (which obviously means
you want more to actually make MU pay off). The rule of thumb is the
longer you wait the bigger capacity you can get.

Apparently there's interest in maximizing throughput but it stands in
direct opposition of keeping the latency down so I've been thinking
how to satisfy both.

The current approach ath10k is taking (patches in review [1][2]) is to
use mac80211 software queues for per-station queuing, exposing queue
state to firmware (it decides where frames should be dequeued from)
and making it possible to stop/wake per-station tx subqueue with fake
netdev queues. I'm starting to think this is not the right way though
because it's inherently hard to control latency and there's a huge
memory overhead associated with the fake netdev queues. Also fq_codel
is a less effective with this kind of setup.

My current thinking is that the entire problem should be solved via
(per-AC) qdiscs, e.g. fq_codel. I guess one could use
limit/target/interval/quantum knobs to tune it for higher latency of
aggregation-oriented Wi-Fi links where long service time (think
100-200ms) is acceptable. However fq_codel is oblivious to how Wi-Fi
works in the first place, i.e. Wi-Fi gets better throughput if you
deliver bursts of packets destined to the same station. Moreover this
gets even more complicated with MU-MIMO where you may want to consider
spatial location (which influences signal quality when grouped) of
each station when you decide which set of stations you're going to
aggregate to in parallel. Since drivers have a finite tx ring this it
is important to deliver bursts that can actually be aggregated
efficiently. This means driver would need to be able to tell qdisc
about per-flow conditions to influence the RR scheme in some way
(assuming a qdiscs even understands flows; do we need a unified way of
talking about flows between qdiscs and drivers?).


[1]: https://www.spinics.net/lists/linux-wireless/msg146187.html
[2]: https://www.spinics.net/lists/linux-wireless/msg146512.html


>>> For reference, ath10k has around 1400 tx descriptors, though
>>> in practice not all are usable, and in stock firmware, I'm guessing
>>> the NIC will never be able to actually fill up it's tx descriptors
>>> and stop traffic.  Instead, it just allows the stack to try to
>>> TX, then drops the frame...
>>
>>
>> 1400 descriptors, ok... but they are not organised in queues?
>> (forgive my ignorance of athX drivers)
>
>
> I think all the details are in the firmware, at least for now.

Yeah. Basically ath10k has a flat set of tx descriptors which are
AC-agnostic. Firmware classifies them internally to per-AC HW queues.


Micha?
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Taht Feb. 5, 2016, 4:47 p.m. UTC | #5
> A bursted txop can be as big as 5-10ms. If you consider you want to
> queue 5-10ms worth of data for *each* station at any given time you
> obviously introduce a lot of lag. If you have 10 stations you might
> end up with service period at 10*10ms = 100ms. This gets even worse if
> you consider MU-MIMO because you need to do an expensive sounding
> procedure before transmitting. So while SU aggregation can probably
> still work reasonably well with shorter bursts (1-2ms) MU needs at
> least 3ms to get *any* gain when compared to SU (which obviously means
> you want more to actually make MU pay off).

I am not sure where you get these numbers. Got a spreadsheet?

Gradually reducing the maximum sized txop as a function of the number
of stations makes sense. If you have 10 stations pending delivery and
reduced the max txop to 1ms, you hurt bandwidth at that instant, but
by offering more service to more stations, in less time, they will
converge on a reasonable share of the bandwidth for each, faster[1].
And I'm sure that the person videoconferencing on a link like that
would appreciate getting some service inside of a 10ms interval,
rather than a 100ms.

yes, there's overhead, and that's not the right number, which would
vary as to g,n,ac and successors.

You will also get more opportunities to use mu-mimo with shorter
bursts extant and more stations being regularly serviced.

[1] https://www.youtube.com/watch?v=Rb-UnHDw02o at about 13:50

> The rule of thumb is the
> longer you wait the bigger capacity you can get.

This is not strictly true as the "fountain" of packets is regulated by
acks on the other side of the link, and ramp up or down as a function
of service time and loss.

>
> Apparently there's interest in maximizing throughput but it stands in
> direct opposition of keeping the latency down so I've been thinking
> how to satisfy both.
>
> The current approach ath10k is taking (patches in review [1][2]) is to
> use mac80211 software queues for per-station queuing, exposing queue
> state to firmware (it decides where frames should be dequeued from)
> and making it possible to stop/wake per-station tx subqueue with fake
> netdev queues. I'm starting to think this is not the right way though
> because it's inherently hard to control latency and there's a huge
> memory overhead associated with the fake netdev queues.

What is this overhead?

Applying things  like codel tend to dramatically shorten the amount of
skbs extant... modern 802.11ac capable hardware has tons more
memory...

> Also fq_codel
> is a less effective with this kind of setup.

fq_codel's principal problems with working with wifi are long and
documented in the talk above.

> My current thinking is that the entire problem should be solved via
> (per-AC) qdiscs, e.g. fq_codel. I guess one could use
> limit/target/interval/quantum knobs to tune it for higher latency of
> aggregation-oriented Wi-Fi links where long service time (think
> 100-200ms) is acceptable. However fq_codel is oblivious to how Wi-Fi
> works in the first place, i.e. Wi-Fi gets better throughput if you
> deliver bursts of packets destined to the same station. Moreover this
> gets even more complicated with MU-MIMO where you may want to consider
> spatial location (which influences signal quality when grouped) of
> each station when you decide which set of stations you're going to
> aggregate to in parallel. Since drivers have a finite tx ring this it
> is important to deliver bursts that can actually be aggregated
> efficiently. This means driver would need to be able to tell qdisc
> about per-flow conditions to influence the RR scheme in some way
> (assuming a qdiscs even understands flows; do we need a unified way of
> talking about flows between qdiscs and drivers?).

This is a very good summary of the problems in layering fq_codel as it
exists today on top of wifi as it exists today. :/ Our conclusion
several years ago was that as the information needed to do things more
right was in the mac80211 layer that we could not evolve the qdisc
layer to suit, and needed to move the core ideas into the mac80211
layer.

Things have evolved since, but I still think we can't get enough info
up to the qdisc layer (locks and so on) to use it sanely.

>
> [1]: https://www.spinics.net/lists/linux-wireless/msg146187.html
> [2]: https://www.spinics.net/lists/linux-wireless/msg146512.html

I will review!

>
>>>> For reference, ath10k has around 1400 tx descriptors, though
>>>> in practice not all are usable, and in stock firmware, I'm guessing
>>>> the NIC will never be able to actually fill up it's tx descriptors
>>>> and stop traffic.  Instead, it just allows the stack to try to
>>>> TX, then drops the frame...
>>>
>>>
>>> 1400 descriptors, ok... but they are not organised in queues?
>>> (forgive my ignorance of athX drivers)
>>
>>
>> I think all the details are in the firmware, at least for now.
>
> Yeah. Basically ath10k has a flat set of tx descriptors which are
> AC-agnostic. Firmware classifies them internally to per-AC HW queues.
>
>
> Micha?
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ben Greear Feb. 5, 2016, 4:48 p.m. UTC | #6
On 02/05/2016 12:44 AM, Michal Kazior wrote:

> Per-station queues sound tricky if you consider bufferbloat.
>
> To maximize use of airtime (i.e. txop) you need to send big
> aggregates. Since aggregates are per station-tid to maximize
> multi-station performance (in AP mode) you'll need to queue a lot of
> frames, per each station, depending on the chosen tx rate.
>
> A bursted txop can be as big as 5-10ms. If you consider you want to
> queue 5-10ms worth of data for *each* station at any given time you
> obviously introduce a lot of lag. If you have 10 stations you might
> end up with service period at 10*10ms = 100ms. This gets even worse if
> you consider MU-MIMO because you need to do an expensive sounding
> procedure before transmitting. So while SU aggregation can probably
> still work reasonably well with shorter bursts (1-2ms) MU needs at
> least 3ms to get *any* gain when compared to SU (which obviously means
> you want more to actually make MU pay off). The rule of thumb is the
> longer you wait the bigger capacity you can get.
>
> Apparently there's interest in maximizing throughput but it stands in
> direct opposition of keeping the latency down so I've been thinking
> how to satisfy both.

I really think this should be tunable.  For instance, someone making an AP
that is mostly for letting lots of users stream movies would care a lot more
about throughput than someone making an AP that is mainly for browsing the
web and doing more latency-sensitive activities.


> The current approach ath10k is taking (patches in review [1][2]) is to
> use mac80211 software queues for per-station queuing, exposing queue
> state to firmware (it decides where frames should be dequeued from)
> and making it possible to stop/wake per-station tx subqueue with fake
> netdev queues. I'm starting to think this is not the right way though
> because it's inherently hard to control latency and there's a huge
> memory overhead associated with the fake netdev queues. Also fq_codel
> is a less effective with this kind of setup.
>
> My current thinking is that the entire problem should be solved via
> (per-AC) qdiscs, e.g. fq_codel. I guess one could use
> limit/target/interval/quantum knobs to tune it for higher latency of
> aggregation-oriented Wi-Fi links where long service time (think
> 100-200ms) is acceptable. However fq_codel is oblivious to how Wi-Fi
> works in the first place, i.e. Wi-Fi gets better throughput if you
> deliver bursts of packets destined to the same station. Moreover this
> gets even more complicated with MU-MIMO where you may want to consider
> spatial location (which influences signal quality when grouped) of
> each station when you decide which set of stations you're going to
> aggregate to in parallel. Since drivers have a finite tx ring this it
> is important to deliver bursts that can actually be aggregated
> efficiently. This means driver would need to be able to tell qdisc
> about per-flow conditions to influence the RR scheme in some way
> (assuming a qdiscs even understands flows; do we need a unified way of
> talking about flows between qdiscs and drivers?).

I wonder if it would work better if we removed most of the
tid handling and aggregation logic in the firmware.  Maybe just have the mgt Q and best
effort (and skip VO/VI).  Let the OS tell (or suggest to) the firmware when aggregation
starts and stops.  That might at least cut the number of queues in half,
saving memory and latency up and down the stack.

Thanks,
Ben
Michal Kazior Feb. 8, 2016, 10 a.m. UTC | #7
On 5 February 2016 at 17:47, Dave Taht <dave.taht@gmail.com> wrote:
>> A bursted txop can be as big as 5-10ms. If you consider you want to
>> queue 5-10ms worth of data for *each* station at any given time you
>> obviously introduce a lot of lag. If you have 10 stations you might
>> end up with service period at 10*10ms = 100ms. This gets even worse if
>> you consider MU-MIMO because you need to do an expensive sounding
>> procedure before transmitting. So while SU aggregation can probably
>> still work reasonably well with shorter bursts (1-2ms) MU needs at
>> least 3ms to get *any* gain when compared to SU (which obviously means
>> you want more to actually make MU pay off).
>
> I am not sure where you get these numbers. Got a spreadsheet?

Here's a nice summary on some of it:

  http://chimera.labs.oreilly.com/books/1234000001739/ch03.html#figure-mac-ampdu

Even if your single A-MPDU can be shorter than a txop you can burst a
few of them if my understanding is correct.

The overhead associated with MU sounding is something I've been told
only. Apparently for MU to pay off you need fairly big bursts. This
implies that the more stations you have to service the less it makes
sense to attempt MU if you consider latency.


> Gradually reducing the maximum sized txop as a function of the number
> of stations makes sense. If you have 10 stations pending delivery and
> reduced the max txop to 1ms, you hurt bandwidth at that instant, but
> by offering more service to more stations, in less time, they will
> converge on a reasonable share of the bandwidth for each, faster[1].
> And I'm sure that the person videoconferencing on a link like that
> would appreciate getting some service inside of a 10ms interval,
> rather than a 100ms.
>
> yes, there's overhead, and that's not the right number, which would
> vary as to g,n,ac and successors.
>
> You will also get more opportunities to use mu-mimo with shorter
> bursts extant and more stations being regularly serviced.
>
> [1] https://www.youtube.com/watch?v=Rb-UnHDw02o at about 13:50

This is my thinking as well, at least for most common use cases.

If you try to optimize for throughput by introducing extra induced
latency you might end up not being able to use aggregation in practice
anyway because you won't be able to start up connections and ask for
enough data - or at least that's what my intuition tells me.

But, like I've mentioned, there's interest in making it possible to
maximize for throughput (regardless of latency). This surely makes
sense for synthetic UDP benchmarks. But does it make sense for any
real-world application? No idea.


>> The rule of thumb is the
>> longer you wait the bigger capacity you can get.
>
> This is not strictly true as the "fountain" of packets is regulated by
> acks on the other side of the link, and ramp up or down as a function
> of service time and loss.

Yes, if you consider real world cases, i.e. TCP, web traffic, etc.
then you're correct.


>> Apparently there's interest in maximizing throughput but it stands in
>> direct opposition of keeping the latency down so I've been thinking
>> how to satisfy both.
>>
>> The current approach ath10k is taking (patches in review [1][2]) is to
>> use mac80211 software queues for per-station queuing, exposing queue
>> state to firmware (it decides where frames should be dequeued from)
>> and making it possible to stop/wake per-station tx subqueue with fake
>> netdev queues. I'm starting to think this is not the right way though
>> because it's inherently hard to control latency and there's a huge
>> memory overhead associated with the fake netdev queues.
>
> What is this overhead?

E.g. if you want to be able to maximize throughput for 50 MU clients
you need to be able to queue, in theory, 50*200 (roughly) frames. This
translates to both huge memory usage and latency *and* renders
(fq_)codel qdisc rather.. moot.


> Applying things  like codel tend to dramatically shorten the amount of
> skbs extant...

> modern 802.11ac capable hardware has tons more
> memory...

I don't think it does. QCA988x is able to handle "only" 1424 tx
descriptors (IOW 1500-byte long MSDUs) in the driver-to-firmware tx
queue (it's a flat queue). QCA99x0 is able to handle 2500 if asked
politely.

This is still not enough to satisfy the insane "maximize the
capacity/throughput" expectations though.

You could actually argue it's too much from the bufferbloat problem
point of view anyway and Emmanuel's patch proves it is beneficial to
buffer less in driver depending on the sojourn packet time.


>> Also fq_codel
>> is a less effective with this kind of setup.
>
> fq_codel's principal problems with working with wifi are long and
> documented in the talk above.
>
>> My current thinking is that the entire problem should be solved via
>> (per-AC) qdiscs, e.g. fq_codel. I guess one could use
>> limit/target/interval/quantum knobs to tune it for higher latency of
>> aggregation-oriented Wi-Fi links where long service time (think
>> 100-200ms) is acceptable. However fq_codel is oblivious to how Wi-Fi
>> works in the first place, i.e. Wi-Fi gets better throughput if you
>> deliver bursts of packets destined to the same station. Moreover this
>> gets even more complicated with MU-MIMO where you may want to consider
>> spatial location (which influences signal quality when grouped) of
>> each station when you decide which set of stations you're going to
>> aggregate to in parallel. Since drivers have a finite tx ring this it
>> is important to deliver bursts that can actually be aggregated
>> efficiently. This means driver would need to be able to tell qdisc
>> about per-flow conditions to influence the RR scheme in some way
>> (assuming a qdiscs even understands flows; do we need a unified way of
>> talking about flows between qdiscs and drivers?).
>
> This is a very good summary of the problems in layering fq_codel as it
> exists today on top of wifi as it exists today. :/ Our conclusion
> several years ago was that as the information needed to do things more
> right was in the mac80211 layer that we could not evolve the qdisc
> layer to suit, and needed to move the core ideas into the mac80211
> layer.
>
> Things have evolved since, but I still think we can't get enough info
> up to the qdisc layer (locks and so on) to use it sanely.

The current per-station queue implementation in mac80211 doesn't seem
sufficient. Each of these queues use a simple flat fifo queue
(sk_buff_head) limited by packet count only with somewhat broken
behavior when packet limit is reached as you end up with unfairly
populated queues a lot. They need to have codel applied on them
somehow. You'll eventually end up re-inventing fq_codel in mac80211
making the qdisc redundant.

Moreover these queues aren't per-station only. They are
per-station-per-tid giving 16 queues per station. This is important
because you can't aggregate traffic going out on different tids.

Even without explicit air conditions feedback from mac80211 to
fq_codel I suspect it'd be still beneficial if fq_codel was able to
group (sub-)flows into per-station-tid and burst them out them in
subsequent dequeue() calls so the chance they get aggregated is
higher.


Micha?
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Emmanuel Grumbach Feb. 8, 2016, 10:04 a.m. UTC | #8
On Fri, Feb 5, 2016 at 10:44 AM, Michal Kazior <michal.kazior@tieto.com> wrote:
> On 4 February 2016 at 22:14, Ben Greear <greearb@candelatech.com> wrote:
>> On 02/04/2016 12:56 PM, Grumbach, Emmanuel wrote:
>>> On 02/04/2016 10:46 PM, Ben Greear wrote:
>>>> On 02/04/2016 12:16 PM, Emmanuel Grumbach wrote:
>>>>>
>>>>> As many (all?) WiFi devices, Intel WiFi devices have
>>>>> transmit queues which have 256 transmit descriptors
>>>>> each and each descriptor corresponds to an MPDU.
>>>>> This means that when it is full, the queue contains
>>>>> 256 * ~1500 bytes to be transmitted (if we don't have
>>>>> A-MSDUs). The purpose of those queues is to have enough
>>>>> packets to be ready for transmission so that when the device
>>>>> gets an opportunity to transmit (TxOP), it can take as many
>>>>> packets as the spec allows and aggregate them into one
>>>>> A-MPDU or even several A-MPDUs if we are using bursts.
>>>>
>>>> I guess this is only really usable if you have exactly one
>>>> peer connected (ie, in station mode)?
>>>>
>>>> Otherwise, you could have one slow peer and one fast one,
>>>> and then I suspect this would not work so well?
>>>
>>>
>>> Yes. I guess this one (big) limitation. I guess that what would happen
>>> in this case is that the the latency would constantly jitter. But I also
>
> Hmm.. You'd probably need to track per-station packet sojourn time as
> well and make it possible to stop/wake queues per station.

Clearly here comes the difference between the devices you work on, and
the devices I work on. Intel devices are more client oriented. Our AP
mode doesn't handle many clients etc..

>
>
>>> noticed that I could reduce the transmit queue to 130 descriptors
>>> (instead of 256) and still reach maximal throughput because we can
>>> refill the queues quickly enough.
>>> In iwlwifi, we have plans to have one queue for each peer.
>>> This is under development. Not sure when it'll be ready. It also requires
>>> firmware change obviously.
>>
>> Per-peer queues will probably be nice, especially if we can keep the
>> buffer bloat manageable.
>
> Per-station queues sound tricky if you consider bufferbloat.

iwlwifi's A-MPDU model is different from athXk's I guess. In iwlwifi
(the Intel devices really since it is mostly firmware) the firmware
will try to use a TxOP whenever there is data in the queue. Then, once
we have a chance to transmit, we will look at what we have in the
queue and send the biggest aggregates we can (and bursts if allowed).
But we will not defer packets' transmission to get bigger aggregates.
Testing shows that under ideal conditions, we can have enough packets
in the queue to build big aggregates without creating artificial
latency.

>
> To maximize use of airtime (i.e. txop) you need to send big
> aggregates. Since aggregates are per station-tid to maximize
> multi-station performance (in AP mode) you'll need to queue a lot of
> frames, per each station, depending on the chosen tx rate.

Sure.

>
> A bursted txop can be as big as 5-10ms. If you consider you want to
> queue 5-10ms worth of data for *each* station at any given time you
> obviously introduce a lot of lag. If you have 10 stations you might
> end up with service period at 10*10ms = 100ms. This gets even worse if
> you consider MU-MIMO because you need to do an expensive sounding
> procedure before transmitting. So while SU aggregation can probably
> still work reasonably well with shorter bursts (1-2ms) MU needs at
> least 3ms to get *any* gain when compared to SU (which obviously means
> you want more to actually make MU pay off). The rule of thumb is the
> longer you wait the bigger capacity you can get.

I am not an expert about MU-MIMO, but I'll believe you :)
We can chat about this in a few days :)

Queueing frames under good conditions is fine in a way that it is a
"Good queue" (hey Stephen), you need those queues to maximize the
throughput because of the bursty nature of WiFi and the queue "moves"
quickly since you have high throughput so that the sojourn time in
your queue is relatively small, but when the link conditions gets less
good you need to reduce the queue length because it doesn't really
help you anyway.  This is what my patch aims at fixing.
All this is true when you have a small number of stations...
I understand from your comment that even in ideal conditions you still
need to create a lot of latency to gain TPT. Then there isn't much we
can do without impacting either TPT or latency. Then, there is a real
tradeoff.
I guess that again you are facing a classic AP problem that a station
or an AP with a small number of concurrent associated clients will
likely not have.

All this encourages me in my belief that I should do something in
iwlwifi for iwlwifi and at mac80211's level since there seem to be
very different problems / use cases. But this code can still suit
those use cases can all fit and we'd just (...) have to give different
parameters to the "algorithm"?

>
> Apparently there's interest in maximizing throughput but it stands in
> direct opposition of keeping the latency down so I've been thinking
> how to satisfy both.

In your case, then yes. I guess I should limit my patch for queues
that serve vif of type BSS for now. It'll still help us for 99% of
iwlwifi's usages.

>
> The current approach ath10k is taking (patches in review [1][2]) is to
> use mac80211 software queues for per-station queuing, exposing queue
> state to firmware (it decides where frames should be dequeued from)
> and making it possible to stop/wake per-station tx subqueue with fake
> netdev queues. I'm starting to think this is not the right way though
> because it's inherently hard to control latency and there's a huge
> memory overhead associated with the fake netdev queues. Also fq_codel
> is a less effective with this kind of setup.

I'd love to hear more the reasons of fq_codel efficiency's problem
here (just for education).
I saw these patches. This makes the firmware much more actively
involved in the scheduling. As I said, in iwlwifi we plan to have one
hardware queue per-sta-per-tid so that the firmware would be able to
choose what station it wants to send frames to. The main purpose is to
improve uAPSD response time as an AP, but it will also make it easier
to choose the right frames to add to a MU-MIMO transmission.

>
> My current thinking is that the entire problem should be solved via
> (per-AC) qdiscs, e.g. fq_codel. I guess one could use
> limit/target/interval/quantum knobs to tune it for higher latency of
> aggregation-oriented Wi-Fi links where long service time (think
> 100-200ms) is acceptable. However fq_codel is oblivious to how Wi-Fi
> works in the first place, i.e. Wi-Fi gets better throughput if you
> deliver bursts of packets destined to the same station. Moreover this
> gets even more complicated with MU-MIMO where you may want to consider
> spatial location (which influences signal quality when grouped) of
> each station when you decide which set of stations you're going to
> aggregate to in parallel. Since drivers have a finite tx ring this it
> is important to deliver bursts that can actually be aggregated
> efficiently. This means driver would need to be able to tell qdisc
> about per-flow conditions to influence the RR scheme in some way
> (assuming a qdiscs even understands flows; do we need a unified way of
> talking about flows between qdiscs and drivers?).

Interesting. Not sure that 100-200ms is acceptable for all the use
cases though. Bursts can be achieved by TSO maybe? I did that to get
A-MSDU. But... that won't help for bridged traffic which is your
primary use case.

>
>
> [1]: https://www.spinics.net/lists/linux-wireless/msg146187.html
> [2]: https://www.spinics.net/lists/linux-wireless/msg146512.html
>
>
>>>> For reference, ath10k has around 1400 tx descriptors, though
>>>> in practice not all are usable, and in stock firmware, I'm guessing
>>>> the NIC will never be able to actually fill up it's tx descriptors
>>>> and stop traffic.  Instead, it just allows the stack to try to
>>>> TX, then drops the frame...
>>>
>>>
>>> 1400 descriptors, ok... but they are not organised in queues?
>>> (forgive my ignorance of athX drivers)
>>
>>
>> I think all the details are in the firmware, at least for now.
>
> Yeah. Basically ath10k has a flat set of tx descriptors which are
> AC-agnostic. Firmware classifies them internally to per-AC HW queues.
>
>
> Micha?
> --
> To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Emmanuel Grumbach Feb. 8, 2016, 10:17 a.m. UTC | #9
On Mon, Feb 8, 2016 at 12:00 PM, Michal Kazior <michal.kazior@tieto.com> wrote:
> On 5 February 2016 at 17:47, Dave Taht <dave.taht@gmail.com> wrote:
>>> A bursted txop can be as big as 5-10ms. If you consider you want to
>>> queue 5-10ms worth of data for *each* station at any given time you
>>> obviously introduce a lot of lag. If you have 10 stations you might
>>> end up with service period at 10*10ms = 100ms. This gets even worse if
>>> you consider MU-MIMO because you need to do an expensive sounding
>>> procedure before transmitting. So while SU aggregation can probably
>>> still work reasonably well with shorter bursts (1-2ms) MU needs at
>>> least 3ms to get *any* gain when compared to SU (which obviously means
>>> you want more to actually make MU pay off).
>>
>> I am not sure where you get these numbers. Got a spreadsheet?
>
> Here's a nice summary on some of it:
>
>   http://chimera.labs.oreilly.com/books/1234000001739/ch03.html#figure-mac-ampdu
>
> Even if your single A-MPDU can be shorter than a txop you can burst a
> few of them if my understanding is correct.
>
> The overhead associated with MU sounding is something I've been told
> only. Apparently for MU to pay off you need fairly big bursts. This
> implies that the more stations you have to service the less it makes
> sense to attempt MU if you consider latency.
>
>
>> Gradually reducing the maximum sized txop as a function of the number
>> of stations makes sense. If you have 10 stations pending delivery and
>> reduced the max txop to 1ms, you hurt bandwidth at that instant, but
>> by offering more service to more stations, in less time, they will
>> converge on a reasonable share of the bandwidth for each, faster[1].
>> And I'm sure that the person videoconferencing on a link like that
>> would appreciate getting some service inside of a 10ms interval,
>> rather than a 100ms.
>>
>> yes, there's overhead, and that's not the right number, which would
>> vary as to g,n,ac and successors.
>>
>> You will also get more opportunities to use mu-mimo with shorter
>> bursts extant and more stations being regularly serviced.
>>
>> [1] https://www.youtube.com/watch?v=Rb-UnHDw02o at about 13:50
>
> This is my thinking as well, at least for most common use cases.
>
> If you try to optimize for throughput by introducing extra induced
> latency you might end up not being able to use aggregation in practice
> anyway because you won't be able to start up connections and ask for
> enough data - or at least that's what my intuition tells me.
>
> But, like I've mentioned, there's interest in making it possible to
> maximize for throughput (regardless of latency). This surely makes
> sense for synthetic UDP benchmarks. But does it make sense for any
> real-world application? No idea.
>
>
>>> The rule of thumb is the
>>> longer you wait the bigger capacity you can get.
>>
>> This is not strictly true as the "fountain" of packets is regulated by
>> acks on the other side of the link, and ramp up or down as a function
>> of service time and loss.
>
> Yes, if you consider real world cases, i.e. TCP, web traffic, etc.
> then you're correct.
>
>
>>> Apparently there's interest in maximizing throughput but it stands in
>>> direct opposition of keeping the latency down so I've been thinking
>>> how to satisfy both.
>>>
>>> The current approach ath10k is taking (patches in review [1][2]) is to
>>> use mac80211 software queues for per-station queuing, exposing queue
>>> state to firmware (it decides where frames should be dequeued from)
>>> and making it possible to stop/wake per-station tx subqueue with fake
>>> netdev queues. I'm starting to think this is not the right way though
>>> because it's inherently hard to control latency and there's a huge
>>> memory overhead associated with the fake netdev queues.
>>
>> What is this overhead?
>
> E.g. if you want to be able to maximize throughput for 50 MU clients
> you need to be able to queue, in theory, 50*200 (roughly) frames. This
> translates to both huge memory usage and latency *and* renders
> (fq_)codel qdisc rather.. moot.

Ok - now I understand. So yes the conclusion below (make fq_codel
station aware) makes a lot sense.

>
>
>> Applying things  like codel tend to dramatically shorten the amount of
>> skbs extant...
>
>> modern 802.11ac capable hardware has tons more
>> memory...
>
> I don't think it does. QCA988x is able to handle "only" 1424 tx
> descriptors (IOW 1500-byte long MSDUs) in the driver-to-firmware tx
> queue (it's a flat queue). QCA99x0 is able to handle 2500 if asked
> politely.

As I said, our design is not flat which removes for the firmware to
classify the packets by station to be able to build aggregates, but
the downside is the number of clients you can service.

>
> This is still not enough to satisfy the insane "maximize the
> capacity/throughput" expectations though.
>
> You could actually argue it's too much from the bufferbloat problem
> point of view anyway and Emmanuel's patch proves it is beneficial to
> buffer less in driver depending on the sojourn packet time.
>

In real life scenario, yes. But the more I read your comments the more
I get to the conclusion we (Intel) simply haven't been able to reach
the throughput you have which *requires* more latency to reach :)
Note that you always talk about overall throughput (to all the
stations at the same time) which, again, makes a lot of sense when you
work primarily as an AP. We look at the world with different
viewpoint.

>
>>> Also fq_codel
>>> is a less effective with this kind of setup.
>>
>> fq_codel's principal problems with working with wifi are long and
>> documented in the talk above.
>>
>>> My current thinking is that the entire problem should be solved via
>>> (per-AC) qdiscs, e.g. fq_codel. I guess one could use
>>> limit/target/interval/quantum knobs to tune it for higher latency of
>>> aggregation-oriented Wi-Fi links where long service time (think
>>> 100-200ms) is acceptable. However fq_codel is oblivious to how Wi-Fi
>>> works in the first place, i.e. Wi-Fi gets better throughput if you
>>> deliver bursts of packets destined to the same station. Moreover this
>>> gets even more complicated with MU-MIMO where you may want to consider
>>> spatial location (which influences signal quality when grouped) of
>>> each station when you decide which set of stations you're going to
>>> aggregate to in parallel. Since drivers have a finite tx ring this it
>>> is important to deliver bursts that can actually be aggregated
>>> efficiently. This means driver would need to be able to tell qdisc
>>> about per-flow conditions to influence the RR scheme in some way
>>> (assuming a qdiscs even understands flows; do we need a unified way of
>>> talking about flows between qdiscs and drivers?).
>>
>> This is a very good summary of the problems in layering fq_codel as it
>> exists today on top of wifi as it exists today. :/ Our conclusion
>> several years ago was that as the information needed to do things more
>> right was in the mac80211 layer that we could not evolve the qdisc
>> layer to suit, and needed to move the core ideas into the mac80211
>> layer.
>>
>> Things have evolved since, but I still think we can't get enough info
>> up to the qdisc layer (locks and so on) to use it sanely.
>
> The current per-station queue implementation in mac80211 doesn't seem
> sufficient. Each of these queues use a simple flat fifo queue
> (sk_buff_head) limited by packet count only with somewhat broken
> behavior when packet limit is reached as you end up with unfairly
> populated queues a lot. They need to have codel applied on them
> somehow. You'll eventually end up re-inventing fq_codel in mac80211
> making the qdisc redundant.
>
> Moreover these queues aren't per-station only. They are
> per-station-per-tid giving 16 queues per station. This is important
> because you can't aggregate traffic going out on different tids.
>
> Even without explicit air conditions feedback from mac80211 to
> fq_codel I suspect it'd be still beneficial if fq_codel was able to
> group (sub-)flows into per-station-tid and burst them out them in
> subsequent dequeue() calls so the chance they get aggregated is
> higher.

That would allow you to avoid classifying (and buffering) in HW and
then reduce the latency. Agree.


>
>
> Micha?
> --
> To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch
diff mbox

diff --git a/drivers/net/wireless/intel/iwlwifi/pcie/internal.h b/drivers/net/wireless/intel/iwlwifi/pcie/internal.h
index 2f95916..d83eb56 100644
--- a/drivers/net/wireless/intel/iwlwifi/pcie/internal.h
+++ b/drivers/net/wireless/intel/iwlwifi/pcie/internal.h
@@ -192,6 +192,11 @@  struct iwl_cmd_meta {
 	u32 flags;
 };
 
+struct iwl_txq_auto_size {
+	int min_space;
+	unsigned long reset_ts;
+};
+
 /*
  * Generic queue structure
  *
@@ -293,6 +298,7 @@  struct iwl_txq {
 	bool block;
 	unsigned long wd_timeout;
 	struct sk_buff_head overflow_q;
+	struct iwl_txq_auto_size auto_sz;
 };
 
 static inline dma_addr_t
diff --git a/drivers/net/wireless/intel/iwlwifi/pcie/tx.c b/drivers/net/wireless/intel/iwlwifi/pcie/tx.c
index 837a7d5..4d1dee6 100644
--- a/drivers/net/wireless/intel/iwlwifi/pcie/tx.c
+++ b/drivers/net/wireless/intel/iwlwifi/pcie/tx.c
@@ -572,6 +572,8 @@  static int iwl_pcie_txq_init(struct iwl_trans *trans, struct iwl_txq *txq,
 
 	spin_lock_init(&txq->lock);
 	__skb_queue_head_init(&txq->overflow_q);
+	txq->auto_sz.min_space = 240;
+	txq->auto_sz.reset_ts = jiffies;
 
 	/*
 	 * Tell nic where to find circular buffer of Tx Frame Descriptors for
@@ -1043,10 +1045,14 @@  void iwl_trans_pcie_reclaim(struct iwl_trans *trans, int txq_id, int ssn,
 	     q->read_ptr != tfd_num;
 	     q->read_ptr = iwl_queue_inc_wrap(q->read_ptr)) {
 		struct sk_buff *skb = txq->entries[txq->q.read_ptr].skb;
+		struct ieee80211_tx_info *info;
+		unsigned long tx_time;
 
 		if (WARN_ON_ONCE(!skb))
 			continue;
 
+		info = IEEE80211_SKB_CB(skb);
+
 		iwl_pcie_free_tso_page(skb);
 
 		__skb_queue_tail(skbs, skb);
@@ -1056,6 +1062,18 @@  void iwl_trans_pcie_reclaim(struct iwl_trans *trans, int txq_id, int ssn,
 		iwl_pcie_txq_inval_byte_cnt_tbl(trans, txq);
 
 		iwl_pcie_txq_free_tfd(trans, txq);
+
+		tx_time = (uintptr_t)info->driver_data[IWL_TRANS_FIRST_DRIVER_DATA + 2];
+		if (time_before(jiffies, tx_time + msecs_to_jiffies(10))) {
+			txq->auto_sz.min_space -= 10;
+			txq->auto_sz.min_space =
+				max(txq->auto_sz.min_space, txq->q.high_mark);
+		} else if (time_after(jiffies,
+				      tx_time + msecs_to_jiffies(20))) {
+			txq->auto_sz.min_space += 10;
+			txq->auto_sz.min_space =
+				min(txq->auto_sz.min_space, 252);
+		}
 	}
 
 	iwl_pcie_txq_progress(txq);
@@ -2185,6 +2203,7 @@  int iwl_trans_pcie_tx(struct iwl_trans *trans, struct sk_buff *skb,
 		      struct iwl_device_cmd *dev_cmd, int txq_id)
 {
 	struct iwl_trans_pcie *trans_pcie = IWL_TRANS_GET_PCIE_TRANS(trans);
+	struct ieee80211_tx_info *info = IEEE80211_SKB_CB(skb);
 	struct ieee80211_hdr *hdr;
 	struct iwl_tx_cmd *tx_cmd = (struct iwl_tx_cmd *)dev_cmd->payload;
 	struct iwl_cmd_meta *out_meta;
@@ -2234,13 +2253,20 @@  int iwl_trans_pcie_tx(struct iwl_trans *trans, struct sk_buff *skb,
 
 	spin_lock(&txq->lock);
 
-	if (iwl_queue_space(q) < q->high_mark) {
+	info->driver_data[IWL_TRANS_FIRST_DRIVER_DATA + 2] =
+		(void *)(uintptr_t)jiffies;
+
+	if (time_after(jiffies,
+		       txq->auto_sz.reset_ts + msecs_to_jiffies(5000))) {
+		txq->auto_sz.min_space = 240;
+		txq->auto_sz.reset_ts = jiffies;
+	}
+
+	if (iwl_queue_space(q) < txq->auto_sz.min_space) {
 		iwl_stop_queue(trans, txq);
 
 		/* don't put the packet on the ring, if there is no room */
 		if (unlikely(iwl_queue_space(q) < 3)) {
-			struct ieee80211_tx_info *info = IEEE80211_SKB_CB(skb);
-
 			info->driver_data[IWL_TRANS_FIRST_DRIVER_DATA + 1] =
 				dev_cmd;
 			__skb_queue_tail(&txq->overflow_q, skb);