
[RFC] ath10k: implement dql for htt tx

Message ID 1458898743-21118-1-git-send-email-michal.kazior@tieto.com (mailing list archive)
State RFC

Commit Message

Michal Kazior March 25, 2016, 9:39 a.m. UTC
This implements very naive dynamic queue limits
on the flat HTT Tx. In some of my tests (using
flent) it seems to reduce induced latency by
orders of magnitude (e.g. when enforcing 6mbps
tx rates 2500ms -> 150ms). But at the same time it
introduces TCP throughput buildup over time
(instead of immediate bump to max). More
importantly I didn't observe it to make things
much worse (yet).

Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
---

I'm not sure yet if it's worth considering this
patch for merging per se. My motivation was to
have something to prove mac80211 fq works and to
see if DQL can learn the proper queue limit in
face of wireless rate control at all.

I'll do a follow up post with flent test results
and some notes.


 drivers/net/wireless/ath/ath10k/core.h   |  2 ++
 drivers/net/wireless/ath/ath10k/htt_rx.c | 12 ++++++++----
 drivers/net/wireless/ath/ath10k/htt_tx.c |  8 +++++++-
 drivers/net/wireless/ath/ath10k/mac.c    | 26 ++++++++++++++++++++------
 drivers/net/wireless/ath/ath10k/txrx.c   |  6 +++++-
 drivers/net/wireless/ath/ath10k/txrx.h   |  3 ++-
 6 files changed, 44 insertions(+), 13 deletions(-)

Comments

Dave Taht March 26, 2016, 4:44 p.m. UTC | #1
Dear Michal:

I commented on and put up your results for the baseline driver here:

http://blog.cerowrt.org/post/rtt_fair_on_wifi/

And the wonderful result you got for the first ever fq_codel-ish
implementation here:

http://blog.cerowrt.org/post/fq_codel_on_ath10k/

I am running behind on this patch set, but a couple quick comments.


On Fri, Mar 25, 2016 at 2:55 AM, Michal Kazior <michal.kazior@tieto.com> wrote:
> On 25 March 2016 at 10:39, Michal Kazior <michal.kazior@tieto.com> wrote:
>> This implements very naive dynamic queue limits
>> on the flat HTT Tx. In some of my tests (using
>> flent) it seems to reduce induced latency by
>> orders of magnitude (e.g. when enforcing 6mbps
>> tx rates 2500ms -> 150ms). But at the same time it
>> introduces TCP throughput buildup over time
>> (instead of immediate bump to max). More
>> importantly I didn't observe it to make things
>> much worse (yet).
>>
>> Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
>> ---
>>
>> I'm not sure yet if it's worth considering this
>> patch for merging per se. My motivation was to
>> have something to prove mac80211 fq works and to
>> see if DQL can learn the proper queue limit in
>> face of wireless rate control at all.
>>
>> I'll do a follow up post with flent test results
>> and some notes.
>
> Here's a short what-is-what description of the test naming:
>  - sw/fq contains only txq/flow stuff (no scheduling, no txop queue limits)
>  - sw/ath10k_dql contains only ath10k patch which applies DQL to
> driver-firmware tx queue naively
>  - sw/fq+ath10k_dql is obvious
>  - sw/base today's ath.git/master checkout used as base
>  - "veryfast" tests TCP tput to reference receiver (4 antennas)
>  - "fast" tests TCP tput to ref receiver (1 antenna)
>  - "slow" tests TCP tput to ref receiver (1 *unplugged* antenna)
>  - "fast+slow" tests sharing between "fast" and "slow"
>  - "autorate" uses default rate control
>  - "rate6m" uses fixed-tx-rate at 6mbps
>  - the test uses QCA9880 w/ 10.1.467
>  - no rrul tests, sorry Dave! :)

rrul would be a good baseline to have, but no need to waste your time
on running it every time as yet. It stresses out both sides of the
link so whenever you get two devices with these driver changes on them
it would be "interesting". It's the meanest, nastiest test we have...
if you can get past the rrul, you've truly won.

Consistently using tcp_fair_up with 1,2,4 flows and 1-4 stations as
you are now is good enough.

doing a more voip-like test with slamming d-itg into your test would be good...

>
> Observations / conclusions:
>  - DQL builds up throughput slowly on "veryfast"; in some tests it
> doesn't get to reach peak (roughly 210mbps average) because the test
> is too short

It looks like having access to the rate control info here for the
initial and ongoing estimates will react faster and better than dql
can. I loved the potential here in getting full rate for web traffic
in the usual 2-second burst you get it in (see above blog entries)

>  - DQL shows better latency results in almost all cases compared to
> the txop based scheduling from my mac80211 RFC (but i haven't
> thoroughly looked at *all* the data; I might've missed a case where it
> performs worse)

Well, if you are not saturating the link, latency will be better.
Showing how much less latency is possible, is good too, but....

>  - latency improvement seen on sw/ath10k_dql @ rate6m,fast compared to
> sw/base (1800ms -> 160ms) can be explained by the fact that txq AC
> limit is 256 and since all TCP streams run on BE (and fq_codel as the
> qdisc) the induced txq latency is 256 * (1500 / (6*1024*1024/8.)) / 4
> = ~122ms which is pretty close to the test data (the formula ignores
> MAC overhead, so the latency in practice is larger). Once you consider
> the overhead and in-flight packets on driver-firmware tx queue 160ms
> doesn't seem strange. Moreover when you compare the same case with
> sw/fq+ath10k_dql you can clearly see the advantage of having fq_codel
> in mac80211 software queuing - the latency drops by (another) order of
> magnitude because now incoming ICMPs are treated as new, bursty flows
> and get fed to the device quickly.
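The back-of-envelope estimate quoted above can be checked numerically. This sketch uses the same conventions as the formula (1 Mbit = 1024*1024 bits, MAC overhead ignored):

```c
#include <assert.h>

/* Induced FIFO latency: qlen frames of frame_len bytes at rate_mbps
 * (1 Mbit = 1024*1024 bits, as in the thread's formula), spread over
 * nqueues AC queues.  Ignores MAC overhead, as noted in the email. */
static double induced_latency_ms(unsigned int qlen, unsigned int frame_len,
                                 double rate_mbps, unsigned int nqueues)
{
    double bytes_per_sec = rate_mbps * 1024.0 * 1024.0 / 8.0;

    return (double)qlen * ((double)frame_len / bytes_per_sec)
           / (double)nqueues * 1000.0;
}
```

With the quoted numbers (256-frame txq, 1500-byte frames, 6 Mbit/s, 4 queues) this yields ~122 ms, matching the estimate in the email.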


It is always good to test codel and fq_codel separately, particularly
on a new codel implementation. There are so many ways to get codel
wrong or add an optimization that doesn't work (speaking as someone
that has got it wrong often)

If you are getting a fq result of 12 ms, that means you are getting
data into the device with a ~12ms standing queue there. On a good day
you'd see perhaps 17-22ms for "codel target 5ms" in that case, on the
rtt_fair_up series of tests.

if you are getting a pure codel result of 160ms, that means the
implementation is broken. But I think (after having read your
description twice), the baseline result today of 160ms of queuing was
with a fq_codel *qdisc* doing the work on top of huge buffers, the
results a few days ago were with a fq_codel 802.11 layer, and the
results today you are comparing, are pure fq (no codel) in the 802.11e
stack, with fixed (and dql) buffering?

if so. Yea! Science!

...

One of the flaws of the flent tests is that conceptually they were
developed before the fq stuff won so big, and looking hard at the
per-queue latency for the fat flows requires either looking hard at
the packet captures or sampling the actual queue length. There is that
sampling capability in various flent tests, but at the moment it only
samples what tc provides (Drops, marks, and length) and it does not
look like there is a snapshot queue length exported from that ath10k
driver?

...

As for a standing queue of 12ms at all in wifi... and making the fq
portion work better, it would be quite nice to get that down a bit
more. One thought (for testing purposes) would be to fix a txop at
1024,2048,3xxxus for some test runs. I really don't have a feel for
framing overhead on the latest standards. (I loathe the idea of
holding the media for more than 2-3ms when you have other stuff coming
in behind it...)

 Another is to hold off preparing and submitting a new batch of
packets; when you know the existing TID will take 4ms to transmit,
defer grabbing the next batch for 3ms. Etc.

It would be glorious to see wifi capable of decent twitch gaming again...

>  - slow+fast case still sucks but that's expected because DQL hasn't
> been applied per-station
>
>  - sw/fq has lower peak throughput ("veryfast") compared to sw/base
> (this actually proves the current - and very young, to say the least - ath10k
> wake-tx-queue implementation is deficient; ath10k_dql improves it and
> sw/fq+ath10k_dql climbs up to the max throughput over time)
>
>
> To sum things up:
>  - DQL might be able to replace the explicit txop queue limiting
> (which requires rate control info)

I am pessimistic. Perhaps as a fallback?

>  - mac80211 fair queuing works

:)

>
> A few plots for quick and easy reference:
>
>   http://imgur.com/a/TnvbQ
>
>
> Michał
>
> PS. I'm not feeling comfortable attaching 1MB attachment to a mailing
> list. Is this okay or should I use something else next time?

I/you can slam results into the github blogcerowrt repo and then pull
out stuff selectively....
Michal Kazior March 29, 2016, 7:49 a.m. UTC | #2
On 26 March 2016 at 17:44, Dave Taht <dave.taht@gmail.com> wrote:
> Dear Michal:
[...]
> I am running behind on this patch set, but a couple quick comments.
[...]
>>  - no rrul tests, sorry Dave! :)
>
> rrul would be a good baseline to have, but no need to waste your time
> on running it every time as yet. It stresses out both sides of the
> link so whenever you get two devices with these driver changes on them
> it would be "interesting". It's the meanest, nastiest test we have...
> if you can get past the rrul, you've truly won.
>
> Consistently using tcp_fair_up with 1,2,4 flows and 1-4 stations as
> you are now is good enough.
>
> doing a more voip-like test with slamming d-itg into your test would be good...
>
>>
>> Observations / conclusions:
>>  - DQL builds up throughput slowly on "veryfast"; in some tests it
>> doesn't get to reach peak (roughly 210mbps average) because the test
>> is too short
>
> It looks like having access to the rate control info here for the
> initial and ongoing estimates will react faster and better than dql
> can. I loved the potential here in getting full rate for web traffic
> in the usual 2-second burst you get it in (see above blog entries)

On one hand - yes, rate control should in theory be "faster".

On the other hand DQL will react also to host system interrupt service
time. On slow CPUs (typically found on routers and such) you might end
up grinding the CPU so much you need deeper tx queues to keep the hw
busy (and therefore keep performance maxed). DQL should automatically
adjust to that while "txop limit" might not.


>>  - DQL shows better latency results in almost all cases compared to
>> the txop based scheduling from my mac80211 RFC (but i haven't
>> thoroughly looked at *all* the data; I might've missed a case where it
>> performs worse)
>
> Well, if you are not saturating the link, latency will be better.
> Showing how much less latency is possible, is good too, but....

If you wait long enough DQL does scale the tx queue size for max
performance (you can see TCP throughput scale towards test end). It
does take a few seconds to reach the peak. Perhaps this could be
tweaked.
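The slow ramp-up Michal describes can be illustrated with a toy model of DQL's limit adjustment. The real logic lives in dql_completed() (lib/dynamic_queue_limits.c) and is more subtle; this sketch only captures the behaviour discussed here:

```c
#include <assert.h>

/* Toy model of DQL limit adaptation, NOT the kernel's algorithm: the
 * limit grows when the hardware ran dry with data still waiting
 * (under-provisioned), and shrinks when completions show persistent
 * slack (over-provisioned). */
struct toy_dql {
    unsigned int limit;     /* current limit, e.g. in bytes */
    unsigned int min_limit; /* never shrink below this      */
};

static void toy_dql_completed(struct toy_dql *dql, unsigned int completed,
                              unsigned int still_pending, int ran_dry)
{
    if (ran_dry && still_pending > 0)
        dql->limit += completed;       /* under-provisioned: grow */
    else if (completed < dql->limit / 2 && dql->limit / 2 >= dql->min_limit)
        dql->limit /= 2;               /* over-provisioned: shrink */
}
```

Because each growth step requires a full completion round, the limit climbs toward the link's bandwidth-delay product over many round trips, which is consistent with the TCP throughput building up over seconds in the "veryfast" runs.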


>>  - latency improvement seen on sw/ath10k_dql @ rate6m,fast compared to
>> sw/base (1800ms -> 160ms) can be explained by the fact that txq AC
>> limit is 256 and since all TCP streams run on BE (and fq_codel as the
>> qdisc) the induced txq latency is 256 * (1500 / (6*1024*1024/8.)) / 4
>> = ~122ms which is pretty close to the test data (the formula ignores
>> MAC overhead, so the latency in practice is larger). Once you consider
>> the overhead and in-flight packets on driver-firmware tx queue 160ms
>> doesn't seem strange. Moreover when you compare the same case with
>> sw/fq+ath10k_dql you can clearly see the advantage of having fq_codel
>> in mac80211 software queuing - the latency drops by (another) order of
>> magnitude because now incoming ICMPs are treated as new, bursty flows
>> and get fed to the device quickly.
>
>
> It is always good to test codel and fq_codel separately, particularly
> on a new codel implementation. There are so many ways to get codel
> wrong or add an optimization that doesn't work (speaking as someone
> that has got it wrong often)
>
> If you are getting a fq result of 12 ms, that means you are getting
> data into the device with a ~12ms standing queue there. On a good day
> you'd see perhaps 17-22ms for "codel target 5ms" in that case, on the
> rtt_fair_up series of tests.

This will obviously depend on the number of stations you have data
queued to. Estimating codel target time requires smarter tx
scheduling. My earlier (RFC) patch tried doing that.


> if you are getting a pure codel result of 160ms, that means the
> implementation is broken. But I think (after having read your
> description twice), the baseline result today of 160ms of queuing was
> with a fq_codel *qdisc* doing the work on top of huge buffers,

Yes. The 160ms is with fq_codel qdisc with ath10k doing DQL at 6mbps.
Without DQL ath10k would clog up all tx slots (1424 of them) with
frames. At 6mbps you typically want/need a handful (5-10) of frames to
be queued.
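The "handful of frames" figure follows from a bandwidth-delay product argument; the 10 ms host service interval below is an illustrative assumption, not a measured value:

```c
#include <assert.h>

/* Frames that must be in flight to keep the radio busy at rate_mbps,
 * given a host service interval (interrupt + queue-refill latency).
 * The interval passed in is assumed for illustration. */
static unsigned int frames_needed(double rate_mbps, double service_ms,
                                  unsigned int frame_len)
{
    double bytes_per_sec = rate_mbps * 1024.0 * 1024.0 / 8.0;
    double bytes_in_flight = bytes_per_sec * service_ms / 1000.0;

    return (unsigned int)(bytes_in_flight / frame_len) + 1;
}
```

Assuming a 10 ms service interval, 6 Mbit/s needs roughly 6 frames of 1500 bytes in flight - a handful, as stated, and orders of magnitude below the 1424 tx slots.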

> the
> results a few days ago were with a fq_codel 802.11 layer, and the
> results today you are comparing, are pure fq (no codel) in the 802.11e
> stack, with fixed (and dql) buffering?

Yes. codel target in fq_codel-in-mac80211 is hardcoded at 20ms now
because there's no scheduling and hence no data to derive the target
dynamically.


> if so. Yea! Science!
>
> ...
>
> One of the flaws of the flent tests is that conceptually they were
> developed before the fq stuff won so big, and looking hard at the
> per-queue latency for the fat flows requires either looking hard at
> the packet captures or sampling the actual queue length. There is that
> sampling capability in various flent tests, but at the moment it only
> samples what tc provides (Drops, marks, and length) and it does not
> look like there is a snapshot queue length exported from that ath10k
> driver?

Exporting tx queue length snapshot should be fairly easy. 2 debugfs
entries for ar->htt.max_num_pending_tx and ar->htt.num_pending_tx.


>
> ...
>
> As for a standing queue of 12ms at all in wifi... and making the fq
> portion work better, it would be quite nice to get that down a bit
> more. One thought (for testing purposes) would be to fix a txop at
> 1024,2048,3xxxus for some test runs. I really don't have a feel for
> framing overhead on the latest standards. (I loathe the idea of
> holding the media for more than 2-3ms when you have other stuff coming
> in behind it...)
>
>  Another is to hold off preparing and submitting a new batch of
> packets; when you know the existing TID will take 4ms to transmit,
> defer grabbing the next batch for 3ms. Etc.

I don't think hardcoding timings for tx scheduling is a good idea. I
believe we just need a deficit-based round robin with time slices. The
problem I see is time slices may change with host CPU load. That's why
I'm leaning towards more experiments with DQL approach.


> It would be glorious to see wifi capable of decent twitch gaming again...
>
>>  - slow+fast case still sucks but that's expected because DQL hasn't
>> been applied per-station
>>
>>  - sw/fq has lower peak throughput ("veryfast") compared to sw/base
>> (this actually proves the current - and very young, to say the least - ath10k
>> wake-tx-queue implementation is deficient; ath10k_dql improves it and
>> sw/fq+ath10k_dql climbs up to the max throughput over time)
>>
>>
>> To sum things up:
>>  - DQL might be able to replace the explicit txop queue limiting
>> (which requires rate control info)
>
> I am pessimistic. Perhaps as a fallback?

At first I was (too) considering DQL as a nice fallback but the more I
think about it the more it makes sense to use it as the main source of
deriving time slices for tx scheduling.


>>  - mac80211 fair queuing works
>
> :)
>
>>
>> A few plots for quick and easy reference:
>>
>>   http://imgur.com/a/TnvbQ
>>
>>
>> Michał
>>
>> PS. I'm not feeling comfortable attaching 1MB attachment to a mailing
>> list. Is this okay or should I use something else next time?
>
> I/you can slam results into the github blogcerowrt repo and then pull
> out stuff selectively....

Good idea, thanks!


Michał
Ben Greear March 29, 2016, 3:54 p.m. UTC | #3
On 03/29/2016 12:49 AM, Michal Kazior wrote:

>> if you are getting a pure codel result of 160ms, that means the
>> implementation is broken. But I think (after having read your
>> description twice), the baseline result today of 160ms of queuing was
>> with a fq_codel *qdisc* doing the work on top of huge buffers,
>
> Yes. The 160ms is with fq_codel qdisc with ath10k doing DQL at 6mbps.
> Without DQL ath10k would clog up all tx slots (1424 of them) with
> frames. At 6mbps you typically want/need a handful (5-10) of frames to
> be queued.

Have you actually verified you can use all tx slots?  The way the
firmware uses its tx buffers I think you may not be able to actually
do that...and in practice, you will get a lot fewer usable tx-buffers
than configured....

Thanks,
Ben
Dave Taht March 30, 2016, 12:57 a.m. UTC | #4
As a side note of wifi ideas complementary to codel, please see:

http://blog.cerowrt.org/post/selective_unprotect/

On Tue, Mar 29, 2016 at 12:49 AM, Michal Kazior <michal.kazior@tieto.com> wrote:
> On 26 March 2016 at 17:44, Dave Taht <dave.taht@gmail.com> wrote:
>> Dear Michal:
> [...]
>> I am running behind on this patch set, but a couple quick comments.
> [...]
>>>  - no rrul tests, sorry Dave! :)
>>
>> rrul would be a good baseline to have, but no need to waste your time
>> on running it every time as yet. It stresses out both sides of the
>> link so whenever you get two devices with these driver changes on them
>> it would be "interesting". It's the meanest, nastiest test we have...
>> if you can get past the rrul, you've truly won.
>>
>> Consistently using tcp_fair_up with 1,2,4 flows and 1-4 stations as
>> you are now is good enough.
>>
>> doing a more voip-like test with slamming d-itg into your test would be good...
>>
>>>
>>> Observations / conclusions:
>>>  - DQL builds up throughput slowly on "veryfast"; in some tests it
>>> doesn't get to reach peak (roughly 210mbps average) because the test
>>> is too short
>>
>> It looks like having access to the rate control info here for the
>> initial and ongoing estimates will react faster and better than dql
>> can. I loved the potential here in getting full rate for web traffic
>> in the usual 2-second burst you get it in (see above blog entries)
>
> On one hand - yes, rate control should in theory be "faster".
>
> On the other hand DQL will react also to host system interrupt service
> time. On slow CPUs (typically found on routers and such) you might end
> up grinding the CPU so much you need deeper tx queues to keep the hw
> busy (and therefore keep performance maxed). DQL should automatically
> adjust to that while "txop limit" might not.

Mmmm.... current multi-core generation arm routers should be fast enough.

Otherwise, point taken (possibly). Even intel i3 boxes need offloads to get to
line rate.


>>
>> It is always good to test codel and fq_codel separately, particularly
>> on a new codel implementation. There are so many ways to get codel
>> wrong or add an optimization that doesn't work (speaking as someone
>> that has got it wrong often)
>>
>> If you are getting a fq result of 12 ms, that means you are getting
>> data into the device with a ~12ms standing queue there. On a good day
>> you'd see perhaps 17-22ms for "codel target 5ms" in that case, on the
>> rtt_fair_up series of tests.
>
> This will obviously depend on the number of stations you have data
> queued to. Estimating codel target time requires smarter tx
> scheduling. My earlier (RFC) patch tried doing that.

and I loved it. ;)

>
>> if you are getting a pure codel result of 160ms, that means the
>> implementation is broken. But I think (after having read your
>> description twice), the baseline result today of 160ms of queuing was
>> with a fq_codel *qdisc* doing the work on top of huge buffers,
>
> Yes. The 160ms is with fq_codel qdisc with ath10k doing DQL at 6mbps.
> Without DQL ath10k would clog up all tx slots (1424 of them) with
> frames. At 6mbps you typically want/need a handful (5-10) of frames to
> be queued.
>
>> the
>> results a few days ago were with a fq_codel 802.11 layer, and the
>> results today you are comparing, are pure fq (no codel) in the 802.11e
>> stack, with fixed (and dql) buffering?
>
> Yes. codel target in fq_codel-in-mac80211 is hardcoded at 20ms now
> because there's no scheduling and hence no data to derive the target
> dynamically.

Well, for these simple 2 station tests, you could halve it, easily.

With ecn on on both sides, I tend to look at the groupings of the ecn
marks in wireshark.

>
>
>> if so. Yea! Science!
>>
>> ...
>>
>> One of the flaws of the flent tests is that conceptually they were
>> developed before the fq stuff won so big, and looking hard at the
>> per-queue latency for the fat flows requires either looking hard at
>> the packet captures or sampling the actual queue length. There is that
>> sampling capability in various flent tests, but at the moment it only
>> samples what tc provides (Drops, marks, and length) and it does not
>> look like there is a snapshot queue length exported from that ath10k
>> driver?
>
> Exporting tx queue length snapshot should be fairly easy. 2 debugfs
> entries for ar->htt.max_num_pending_tx and ar->htt.num_pending_tx.

K. Still running *way* behind you on getting stuff up and running. The
ath10ks I ordered were backordered, should arrive shortly.

>
>
>>
>> ...
>>
>> As for a standing queue of 12ms at all in wifi... and making the fq
>> portion work better, it would be quite nice to get that down a bit
>> more. One thought (for testing purposes) would be to fix a txop at
>> 1024,2048,3xxxus for some test runs. I really don't have a feel for
>> framing overhead on the latest standards. (I loathe the idea of
>> holding the media for more than 2-3ms when you have other stuff coming
>> in behind it...)
>>
>>  Another is to hold off preparing and submitting a new batch of
>> packets; when you know the existing TID will take 4ms to transmit,
>> defer grabbing the next batch for 3ms. Etc.
>
> I don't think hardcoding timings for tx scheduling is a good idea. I

wasn't suggesting that, was suggesting predicting a minimum time to
transmit based on the history.

> believe we just need a deficit-based round robin with time slices. The
> problem I see is time slices may change with host CPU load. That's why
> I'm leaning towards more experiments with DQL approach.

OK.

>
>> It would be glorious to see wifi capable of decent twitch gaming again...
>>
>>>  - slow+fast case still sucks but that's expected because DQL hasn't
>>> been applied per-station
>>>
>>>  - sw/fq has lower peak throughput ("veryfast") compared to sw/base
>>> (this actually proves the current - and very young, to say the least - ath10k
>>> wake-tx-queue implementation is deficient; ath10k_dql improves it and
>>> sw/fq+ath10k_dql climbs up to the max throughput over time)
>>>
>>>
>>> To sum things up:
>>>  - DQL might be able to replace the explicit txop queue limiting
>>> (which requires rate control info)
>>
>> I am pessimistic. Perhaps as a fallback?
>
> At first I was (too) considering DQL as a nice fallback but the more I
> think about it the more it makes sense to use it as the main source of
> deriving time slices for tx scheduling.

I don't really get how dql can be applied per station in its current form.

>
>
>>>  - mac80211 fair queuing works
>>
>> :)
>>
>>>
>>> A few plots for quick and easy reference:
>>>
>>>   http://imgur.com/a/TnvbQ
>>>
>>>
>>> Michał
>>>
>>> PS. I'm not feeling comfortable attaching 1MB attachment to a mailing
>>> list. Is this okay or should I use something else next time?
>>
>> I/you can slam results into the github blogcerowrt repo and then pull
>> out stuff selectively....
>
> Good idea, thanks!

You got commit privs.

>
>
> Michał
Michal Kazior March 30, 2016, 9:22 a.m. UTC | #5
On 29 March 2016 at 17:54, Ben Greear <greearb@candelatech.com> wrote:
> On 03/29/2016 12:49 AM, Michal Kazior wrote:
>
>>> if you are getting a pure codel result of 160ms, that means the
>>> implementation is broken. But I think (after having read your
>>> description twice), the baseline result today of 160ms of queuing was
>>> with a fq_codel *qdisc* doing the work on top of huge buffers,
>>
>>
>> Yes. The 160ms is with fq_codel qdisc with ath10k doing DQL at 6mbps.
>> Without DQL ath10k would clog up all tx slots (1424 of them) with
>> frames. At 6mbps you typically want/need a handful (5-10) of frames to
>> be queued.
>
>
> Have you actually verified you can use all tx slots?

It works in most cases. I guess you're suspecting some of your
tx(flushing?) problems might be induced by overcommitting?

> The way the
> firmware uses its tx buffers I think you may not be able to actually
> do that...and in practice, you will get a lot fewer usable tx-buffers
> than configured....

Could be, e.g. I'm aware management frames are kind of a special case
in recent firmware revisions.

What would/do you expect firmware would/will do when we overcommit?
The driver does advertise the number of HTT tx slots so I would expect it
to work fine if it didn't crash during boot.


Michał
Michal Kazior March 30, 2016, 10:04 a.m. UTC | #6
On 30 March 2016 at 02:57, Dave Taht <dave.taht@gmail.com> wrote:
> As a side note of wifi ideas complementary to codel, please see:
>
> http://blog.cerowrt.org/post/selective_unprotect/
>
> On Tue, Mar 29, 2016 at 12:49 AM, Michal Kazior <michal.kazior@tieto.com> wrote:
[...]
>> On the other hand DQL will react also to host system interrupt service
>> time. On slow CPUs (typically found on routers and such) you might end
>> up grinding the CPU so much you need deeper tx queues to keep the hw
>> busy (and therefore keep performance maxed). DQL should automatically
>> adjust to that while "txop limit" might not.
>
> Mmmm.... current multi-core generation arm routers should be fast enough.

While this helps it doesn't mean you can magically un-serialize
interrupt/txrx completion work.


[...]
>>>> To sum things up:
>>>>  - DQL might be able to replace the explicit txop queue limiting
>>>> (which requires rate control info)
>>>
>>> I am pessimistic. Perhaps as a fallback?
>>
>> At first I was (too) considering DQL as a nice fallback but the more I
>> think about it the more it makes sense to use it as the main source of
>> deriving time slices for tx scheduling.
>
> I don't really get how dql can be applied per station in its current form.

I was thinking of the following scheme:
 - disable excessive retries in fw (apparently doable via WMI pdev parameter)
 - deficit-based round-robin over stations
 - maintain DQL structure per station
 - when station gets its turn to xmit push frames to fw
 - keep doing that (with regard to per station DQL limits) until either:
   - associated software queue becomes empty
   - 1 timeslice for given station has elapsed
     - i was thinking of extracting timeslices from DQL or maintaining
it separately
 - try next station
 - do not submit packets to multiple stations at once (MU will need
more work..), i.e. always drain tx completely before switching to next
station
 - each station may need a different timeslice (to squeeze out max
burst performance)
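The scheme outlined above could look roughly like this; all names and fields are illustrative, not actual ath10k structures:

```c
#include <assert.h>
#include <stddef.h>

/* Deficit round-robin over stations with a per-station DQL budget. */
struct station {
    unsigned int backlog;   /* frames in this station's software queue */
    unsigned int dql_limit; /* per-station budget, in frames           */
};

/* Serve one full round: drain each station up to its budget before
 * moving on (no interleaving between stations, matching the "always
 * drain tx completely before switching" point above).  Returns the
 * number of frames pushed toward the firmware. */
static unsigned int serve_round(struct station *sta, size_t nsta)
{
    unsigned int pushed = 0;
    size_t i;

    for (i = 0; i < nsta; i++) {
        unsigned int quota = sta[i].dql_limit;

        while (sta[i].backlog > 0 && quota > 0) {
            sta[i].backlog--;
            quota--;
            pushed++;
        }
    }
    return pushed;
}
```

A timeslice check per station (elapsed time versus the DQL-derived slice) would slot naturally into the inner loop's condition.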


[...]
>>>> PS. I'm not feeling comfortable attaching 1MB attachment to a mailing
>>>> list. Is this okay or should I use something else next time?
>>>
>>> I/you can slam results into the github blogcerowrt repo and then pull
>>> out stuff selectively....
>>
>> Good idea, thanks!
>
> You got commit privs.

Yep, thanks!


Michał
Ben Greear March 30, 2016, 3:28 p.m. UTC | #7
On 03/30/2016 02:22 AM, Michal Kazior wrote:
> On 29 March 2016 at 17:54, Ben Greear <greearb@candelatech.com> wrote:
>> On 03/29/2016 12:49 AM, Michal Kazior wrote:
>>
>>>> if you are getting a pure codel result of 160ms, that means the
>>>> implementation is broken. But I think (after having read your
>>>> description twice), the baseline result today of 160ms of queuing was
>>>> with a fq_codel *qdisc* doing the work on top of huge buffers,
>>>
>>>
>>> Yes. The 160ms is with fq_codel qdisc with ath10k doing DQL at 6mbps.
>>> Without DQL ath10k would clog up all tx slots (1424 of them) with
>>> frames. At 6mbps you typically want/need a handful (5-10) of frames to
>>> be queued.
>>
>>
>> Have you actually verified you can use all tx slots?
>
> It works in most cases. I guess you're suspecting some of your
> tx(flushing?) problems might be induced by overcommitting?
>
>> The way the
>> firmware uses it's tx buffers I think you may not be able to actually
>> do that...and in practice, you will get a lot fewer usable tx-buffers
>> than configured....
>
> Could be, e.g. I'm aware management frames are kind of a special case
> in recent firmware revisions.
>
> What would/do you expect firmware would/will do when we overcommit?
> The driver does advertise the number of HTT tx slots so I would expect it
> to work fine if it didn't crash during boot.

The firmware will return something like tx-dropped immediately.  The reason
is that the firmware keeps more than one internal priority queue, and in general, reserves
some of the tx-descriptors for high priority.

So, you never see tx-queues completely full in the driver, so tx queues are
not stopped farther up the stack.

Possibly I am confused about some of this, so I'm quite curious if you ever see
tx-queues determined to be full in the ath10k driver.

Thanks,
Ben
Michal Kazior March 31, 2016, 6:39 a.m. UTC | #8
On 30 March 2016 at 17:28, Ben Greear <greearb@candelatech.com> wrote:
> On 03/30/2016 02:22 AM, Michal Kazior wrote:
>>
>> On 29 March 2016 at 17:54, Ben Greear <greearb@candelatech.com> wrote:
>>>
>>> On 03/29/2016 12:49 AM, Michal Kazior wrote:
>>>
>>>>> if you are getting a pure codel result of 160ms, that means the
>>>>> implementation is broken. But I think (after having read your
>>>>> description twice), the baseline result today of 160ms of queuing was
>>>>> with a fq_codel *qdisc* doing the work on top of huge buffers,
>>>>
>>>>
>>>>
>>>> Yes. The 160ms is with fq_codel qdisc with ath10k doing DQL at 6mbps.
>>>> Without DQL ath10k would clog up all tx slots (1424 of them) with
>>>> frames. At 6mbps you typically want/need a handful (5-10) of frames to
>>>> be queued.
>>>
>>>
>>>
>>> Have you actually verified you can use all tx slots?
>>
>>
>> It works in most cases. I guess you're suspecting some of your
>> tx(flushing?) problems might be induced by overcommitting?
>>
>>> The way the
>>> firmware uses its tx buffers I think you may not be able to actually
>>> do that...and in practice, you will get a lot fewer usable tx-buffers
>>> than configured....
>>
>>
>> Could be, e.g. I'm aware management frames are kind of a special case
>> in recent firmware revisions.
>>
>> What would/do you expect firmware would/will do when we overcommit?
>> The driver does advertise the number of HTT tx slots so I would expect it
>> to work fine if it didn't crash during boot.
>
>
> The firmware will return something like tx-dropped immediately.  The reason
> is that the firmware keeps more than one internal priority queue, and in
> general, reserves
> some of the tx-descriptors for high priority.
>
> So, you never see tx-queues completely full in the driver, so tx queues are
> not stopped farther up the stack.
>
> Possibly I am confused about some of this, so I'm quite curious if you ever
> see
> tx-queues determined to be full in the ath10k driver.

I haven't analyzed it this much. Nevertheless it's good to know we
might be overcommitting on the HTT Tx. One more reason to use DQL
and/or schedule tx in a smarter way.


Michał
Michal Kazior April 1, 2016, 8:01 a.m. UTC | #9
Re-posting text only as it was blocked by most mailing list servers:

The original attachment can be fetched at:
  http://kazikcz.github.io/dl/2016-04-01-flent-ath10k-dql.tar.gz

On 25 March 2016 at 10:55, Michal Kazior <michal.kazior@tieto.com> wrote:
> On 25 March 2016 at 10:39, Michal Kazior <michal.kazior@tieto.com> wrote:
>> This implements a very naive dynamic queue limits
>> on the flat HTT Tx. In some of my tests (using
>> flent) it seems to reduce induced latency by
>> orders of magnitude (e.g. when enforcing 6mbps
>> tx rates 2500ms -> 150ms). But at the same time it
>> introduces TCP throughput buildup over time
>> (instead of immediate bump to max). More
>> importantly I didn't observe it to make things
>> much worse (yet).
>>
>> Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
>> ---
>>
>> I'm not sure yet if it's worth to consider this
>> patch for merging per se. My motivation was to
>> have something to prove mac80211 fq works and to
>> see if DQL can learn the proper queue limit in
>> face of wireless rate control at all.
>>
>> I'll do a follow up post with flent test results
>> and some notes.
>
> Here's a short what-is-what description of the test naming:
>  - sw/fq contains only txq/flow stuff (no scheduling, no txop queue limits)
>  - sw/ath10k_dql contains only ath10k patch which applies DQL to
> driver-firmware tx queue naively
>  - sw/fq+ath10k_dql is obvious
>  - sw/base today's ath.git/master checkout used as base
>  - "veryfast" tests TCP tput to reference receiver (4 antennas)
>  - "fast" tests TCP tput to ref receiver (1 antenna)
>  - "slow" tests TCP tput to ref receiver (1 *unplugged* antenna)
>  - "fast+slow" tests sharing between "fast" and "slow"
>  - "autorate" uses default rate control
>  - "rate6m" uses fixed-tx-rate at 6mbps
>  - the test uses QCA9880 w/ 10.1.467
>  - no rrul tests, sorry Dave! :)
>
>
> Observations / conclusions:
>  - DQL builds up throughput slowly on "veryfast"; in some tests it
> doesn't get to reach peak (roughly 210mbps average) because the test
> is too short
>
>  - DQL shows better latency results in almost all cases compared to
> the txop based scheduling from my mac80211 RFC (but i haven't
> thoroughly looked at *all* the data; I might've missed a case where it
> performs worse)
>
>  - latency improvement seen on sw/ath10k_dql @ rate6m,fast compared to
> sw/base (1800ms -> 160ms) can be explained by the fact that txq AC
> limit is 256 and since all TCP streams run on BE (and fq_codel as the
> qdisc) the induced txq latency is 256 * (1500 / (6*1024*1024/8.)) / 4
> = ~122ms which is pretty close to the test data (the formula ignores
> MAC overhead, so the latency in practice is larger). Once you consider
> the overhead and in-flight packets on driver-firmware tx queue 160ms
> doesn't seem strange. Moreover when you compare the same case with
> sw/fq+ath10k_dql you can clearly see the advantage of having fq_codel
> in mac80211 software queuing - the latency drops by (another) order of
> magnitude because now incoming ICMPs are treated as new, bursty flows
> and get fed to the device quickly.
>
>  - slow+fast case still sucks but that's expected because DQL hasn't
> been applied per-station
>
>  - sw/fq has lower peak throughput ("veryfast") compared to sw/base
> (this actually proves the current - and very young, to say the least -
> ath10k
> wake-tx-queue implementation is deficient; ath10k_dql improves it and
> sw/fq+ath10k_dql climbs up to the max throughput over time)
>
>
> To sum things up:
>  - DQL might be able to replace the explicit txop queue limiting
> (which requires rate control info)
>  - mac80211 fair queuing works
>
>
> A few plots for quick and easy reference:
>
>   http://imgur.com/a/TnvbQ



Michał
diff mbox

Patch

diff --git a/drivers/net/wireless/ath/ath10k/core.h b/drivers/net/wireless/ath/ath10k/core.h
index b6c157ef705a..d8eebcd2b0b0 100644
--- a/drivers/net/wireless/ath/ath10k/core.h
+++ b/drivers/net/wireless/ath/ath10k/core.h
@@ -630,6 +630,8 @@  struct ath10k {
 	struct device *dev;
 	u8 mac_addr[ETH_ALEN];
 
+	struct dql dql;
+
 	enum ath10k_hw_rev hw_rev;
 	u16 dev_id;
 	u32 chip_id;
diff --git a/drivers/net/wireless/ath/ath10k/htt_rx.c b/drivers/net/wireless/ath/ath10k/htt_rx.c
index 2da8ccf3da05..38bc8bf46b67 100644
--- a/drivers/net/wireless/ath/ath10k/htt_rx.c
+++ b/drivers/net/wireless/ath/ath10k/htt_rx.c
@@ -1679,7 +1679,8 @@  static void ath10k_htt_rx_frag_handler(struct ath10k_htt *htt,
 }
 
 static void ath10k_htt_rx_frm_tx_compl(struct ath10k *ar,
-				       struct sk_buff *skb)
+				       struct sk_buff *skb,
+				       unsigned int *completed)
 {
 	struct ath10k_htt *htt = &ar->htt;
 	struct htt_resp *resp = (struct htt_resp *)skb->data;
@@ -1712,7 +1713,7 @@  static void ath10k_htt_rx_frm_tx_compl(struct ath10k *ar,
 	for (i = 0; i < resp->data_tx_completion.num_msdus; i++) {
 		msdu_id = resp->data_tx_completion.msdus[i];
 		tx_done.msdu_id = __le16_to_cpu(msdu_id);
-		ath10k_txrx_tx_unref(htt, &tx_done);
+		ath10k_txrx_tx_unref(htt, &tx_done, completed);
 	}
 }
 
@@ -2354,7 +2355,7 @@  void ath10k_htt_t2h_msg_handler(struct ath10k *ar, struct sk_buff *skb)
 			break;
 		}
 
-		status = ath10k_txrx_tx_unref(htt, &tx_done);
+		status = ath10k_txrx_tx_unref(htt, &tx_done, NULL);
 		if (!status) {
 			spin_lock_bh(&htt->tx_lock);
 			ath10k_htt_tx_mgmt_dec_pending(htt);
@@ -2482,6 +2483,7 @@  static void ath10k_htt_txrx_compl_task(unsigned long ptr)
 	struct htt_resp *resp;
 	struct sk_buff *skb;
 	unsigned long flags;
+	unsigned int completed = 0;
 
 	__skb_queue_head_init(&tx_q);
 	__skb_queue_head_init(&rx_q);
@@ -2505,10 +2507,12 @@  static void ath10k_htt_txrx_compl_task(unsigned long ptr)
 	spin_unlock_irqrestore(&htt->tx_fetch_ind_q.lock, flags);
 
 	while ((skb = __skb_dequeue(&tx_q))) {
-		ath10k_htt_rx_frm_tx_compl(htt->ar, skb);
+		ath10k_htt_rx_frm_tx_compl(htt->ar, skb, &completed);
 		dev_kfree_skb_any(skb);
 	}
 
+	dql_completed(&htt->ar->dql, completed);
+
 	while ((skb = __skb_dequeue(&tx_ind_q))) {
 		ath10k_htt_rx_tx_fetch_ind(ar, skb);
 		dev_kfree_skb_any(skb);
diff --git a/drivers/net/wireless/ath/ath10k/htt_tx.c b/drivers/net/wireless/ath/ath10k/htt_tx.c
index b2ae122381ca..2b7f7802f9f1 100644
--- a/drivers/net/wireless/ath/ath10k/htt_tx.c
+++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
@@ -367,7 +367,7 @@  static int ath10k_htt_tx_clean_up_pending(int msdu_id, void *skb, void *ctx)
 	tx_done.discard = 1;
 	tx_done.msdu_id = msdu_id;
 
-	ath10k_txrx_tx_unref(htt, &tx_done);
+	ath10k_txrx_tx_unref(htt, &tx_done, NULL);
 
 	return 0;
 }
@@ -378,6 +378,7 @@  void ath10k_htt_tx_free(struct ath10k_htt *htt)
 
 	idr_for_each(&htt->pending_tx, ath10k_htt_tx_clean_up_pending, htt->ar);
 	idr_destroy(&htt->pending_tx);
+	dql_reset(&htt->ar->dql);
 
 	if (htt->txbuf.vaddr) {
 		size = htt->max_num_pending_tx *
@@ -839,6 +840,7 @@  int ath10k_htt_tx(struct ath10k_htt *htt, enum ath10k_hw_txrx_mode txmode,
 	u16 freq = 0;
 	u32 frags_paddr = 0;
 	u32 txbuf_paddr;
+	size_t skb_len;
 	struct htt_msdu_ext_desc *ext_desc = NULL;
 
 	spin_lock_bh(&htt->tx_lock);
@@ -1000,12 +1002,16 @@  int ath10k_htt_tx(struct ath10k_htt *htt, enum ath10k_hw_txrx_mode txmode,
 	sg_items[1].paddr = skb_cb->paddr;
 	sg_items[1].len = prefetch_len;
 
+	skb_len = msdu->len;
+
 	res = ath10k_hif_tx_sg(htt->ar,
 			       htt->ar->htc.endpoint[htt->eid].ul_pipe_id,
 			       sg_items, ARRAY_SIZE(sg_items));
 	if (res)
 		goto err_unmap_msdu;
 
+	dql_queued(&ar->dql, skb_len);
+
 	return 0;
 
 err_unmap_msdu:
diff --git a/drivers/net/wireless/ath/ath10k/mac.c b/drivers/net/wireless/ath/ath10k/mac.c
index ed00853ea9cc..c848049ffdf3 100644
--- a/drivers/net/wireless/ath/ath10k/mac.c
+++ b/drivers/net/wireless/ath/ath10k/mac.c
@@ -3755,23 +3755,35 @@  void ath10k_mac_tx_push_pending(struct ath10k *ar)
 	struct ieee80211_hw *hw = ar->hw;
 	struct ieee80211_txq *txq;
 	struct ath10k_txq *artxq;
-	struct ath10k_txq *last;
 	int ret;
 	int max;
 
 	spin_lock_bh(&ar->txqs_lock);
 	rcu_read_lock();
 
-	last = list_last_entry(&ar->txqs, struct ath10k_txq, list);
-	while (!list_empty(&ar->txqs)) {
+	for (;;) {
+		if (list_empty(&ar->txqs))
+			break;
+
+		if (dql_avail(&ar->dql) < 0)
+			break;
+
 		artxq = list_first_entry(&ar->txqs, struct ath10k_txq, list);
 		txq = container_of((void *)artxq, struct ieee80211_txq,
 				   drv_priv);
 
-		/* Prevent aggressive sta/tid taking over tx queue */
 		max = 16;
 		ret = 0;
-		while (ath10k_mac_tx_can_push(hw, txq) && max--) {
+		for (;;) {
+			if (!max--)
+				break;
+
+			if (!ath10k_mac_tx_can_push(hw, txq))
+				break;
+
+			if (dql_avail(&ar->dql) < 0)
+				break;
+
 			ret = ath10k_mac_tx_push_txq(hw, txq);
 			if (ret < 0)
 				break;
@@ -3783,7 +3795,7 @@  void ath10k_mac_tx_push_pending(struct ath10k *ar)
 
 		ath10k_htt_tx_txq_update(hw, txq);
 
-		if (artxq == last || (ret < 0 && ret != -ENOENT))
+		if (ret < 0 && ret != -ENOENT)
 			break;
 	}
 
@@ -4331,6 +4343,8 @@  static int ath10k_start(struct ieee80211_hw *hw)
 
 	mutex_lock(&ar->conf_mutex);
 
+	dql_init(&ar->dql, HZ);
+
 	switch (ar->state) {
 	case ATH10K_STATE_OFF:
 		ar->state = ATH10K_STATE_ON;
diff --git a/drivers/net/wireless/ath/ath10k/txrx.c b/drivers/net/wireless/ath/ath10k/txrx.c
index 48e26cdfe9a5..122c8edf10a1 100644
--- a/drivers/net/wireless/ath/ath10k/txrx.c
+++ b/drivers/net/wireless/ath/ath10k/txrx.c
@@ -50,7 +50,8 @@  out:
 }
 
 int ath10k_txrx_tx_unref(struct ath10k_htt *htt,
-			 const struct htt_tx_done *tx_done)
+			 const struct htt_tx_done *tx_done,
+			 unsigned int *completed)
 {
 	struct ath10k *ar = htt->ar;
 	struct device *dev = ar->dev;
@@ -87,6 +88,9 @@  int ath10k_txrx_tx_unref(struct ath10k_htt *htt,
 	if (txq)
 		artxq->num_fw_queued--;
 
+	if (completed)
+		*completed += msdu->len;
+
 	ath10k_htt_tx_free_msdu_id(htt, tx_done->msdu_id);
 	ath10k_htt_tx_dec_pending(htt);
 	if (htt->num_pending_tx == 0)
diff --git a/drivers/net/wireless/ath/ath10k/txrx.h b/drivers/net/wireless/ath/ath10k/txrx.h
index e7ea1ae1c438..3a655270bcc5 100644
--- a/drivers/net/wireless/ath/ath10k/txrx.h
+++ b/drivers/net/wireless/ath/ath10k/txrx.h
@@ -20,7 +20,8 @@ 
 #include "htt.h"
 
 int ath10k_txrx_tx_unref(struct ath10k_htt *htt,
-			 const struct htt_tx_done *tx_done);
+			 const struct htt_tx_done *tx_done,
+			 unsigned int *completed);
 
 struct ath10k_peer *ath10k_peer_find(struct ath10k *ar, int vdev_id,
 				     const u8 *addr);