
[RFC] mac80211: Add airtime fairness accounting

Message ID 20171006115232.28688-1-toke@toke.dk (mailing list archive)
State RFC
Delegated to: Johannes Berg

Commit Message

Toke Høiland-Jørgensen Oct. 6, 2017, 11:52 a.m. UTC
This adds accounting of each station's airtime usage to mac80211. On the RX
side, airtime is calculated from the packet length and duration. On the TX side,
the driver has to fill in status.tx_time with the packet send duration.

When a station's airtime deficit goes negative, ieee80211_tx_dequeue() will no
longer return packets for it, but will instead return ERR_PTR(-EAGAIN). The
driver is expected to handle this and to keep scheduling such a station.
It is assumed that the driver schedules active TXQs in a round-robin manner.

This is still incomplete. Missing:
- The actual calculation of RX airtime
- Updates of the drivers to check for the ERR_PTR and set tx_time

Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
---
 net/mac80211/ieee80211_i.h |  3 +++
 net/mac80211/rx.c          | 30 ++++++++++++++++++++++++++++++
 net/mac80211/sta_info.c    |  1 +
 net/mac80211/sta_info.h    |  7 +++++++
 net/mac80211/status.c      |  8 ++++++++
 net/mac80211/tx.c          | 17 ++++++++++++++---
 6 files changed, 63 insertions(+), 3 deletions(-)

Comments

Johannes Berg Oct. 6, 2017, 2:07 p.m. UTC | #1
On Fri, 2017-10-06 at 13:52 +0200, Toke Høiland-Jørgensen wrote:
> This adds accounting of each station's airtime usage to mac80211. On
> the RX side, airtime is calculated from the packet length and
> duration.

I think you should probably try to get something here from the driver?
Doing the calculations is really awkward, because you have to take into
account aggregation, etc.

However, I'm not even sure that you _want_ to use RX airtime at all? I
guess this is a fairly fundamental philosophical question, but there's
no way to feed back "you're sending too much" to a station, so if it
starts consuming a lot of airtime with (what we see as) RX, and we
throttle sending to it, then I'm not sure that will have much effect?
Maybe with TCP, but not really with anything else.

You also can't really know what AC this was on, and it doesn't make all
that much sense?

Then again, I'm not even sure yet why you care about the AC at all?
Should fairness really be something that's within an AC?

johannes
Toke Høiland-Jørgensen Oct. 6, 2017, 2:29 p.m. UTC | #2
Johannes Berg <johannes@sipsolutions.net> writes:

> On Fri, 2017-10-06 at 13:52 +0200, Toke Høiland-Jørgensen wrote:
>> This adds accounting of each station's airtime usage to mac80211. On
>> the RX side, airtime is calculated from the packet length and
>> duration.
>
> I think you should probably try to get something here from the driver?
> Doing the calculations is really awkward, because you have to take
> into account aggregation, etc.

Well, calculating the duration from the rate and size is what ath9k is
currently doing, and that has all the information available to do so (I
did verify that the airtime was almost identical on the TX and RX side
for the same traffic). But yeah, I guess that is not necessarily the
case for all drivers? Maybe having a field that the driver can fill in
is better (that also has the nice property that a driver that doesn't
supply any airtime information won't get the scheduling; which is
probably desirable).

> However, I'm not even sure that you _want_ to use RX airtime at all? I
> guess this is a fairly fundamental philosophical question, but there's
> no way to feed back "you're sending too much" to a station, so if it
> starts consuming a lot of airtime with (what we see as) RX, and we
> throttle sending to it, then I'm not sure that will have much effect?
> Maybe with TCP, but not really with anything else.

You're right that it does nothing for one-way UDP, of course. But not a
lot of traffic is actually one-way (most is either TCP or something like
QUIC that also has a feedback loop). So in most of our tests we saw a
pretty nice effect from TCP back-pressure.

> You also can't really know what AC this was on, and it doesn't make all
> that much sense?
>
> Then again, I'm not even sure yet why you care about the AC at all?
> Should fairness really be something that's within an AC?

No, ideally it shouldn't. This is mostly an artifact of how ath9k
schedules TXQs independently for each AC. This means that if (say) three
stations are back-logged on the BE queue, and one of those also has the
occasional packet on the VO queue, if the deficit is shared between all
levels, every time a VO packet arrives that station effectively gets its
deficit reset to zero, meaning it will get too much airtime.

Having a separate deficit for each AC level is a crude workaround for
this, which is what we went with for the initial version in ath9k; and
this patch is just a straight-forward port of that. But I guess this is
a good opportunity to figure out how to solve this properly; I'll think
about how to do that (ideas welcome!)...

-Toke
Johannes Berg Oct. 6, 2017, 5:12 p.m. UTC | #3
[please don't use html format, the linux lists drop it]

> As I understand it, this code is mainly intended for stations that
> serve as an AP, rather than as a client.

Sure, it's pointless in all other cases since there's only one (L2)
station to talk to.

> Since the MAC layer is the one with easiest access to the real
> airtime statistics (much easier than, say, a qdisc), there is no
> better place to implement it.

I'm not arguing anything else, what are you trying to say?

> If RTS/CTS probes are in use, then the AP does in fact have control
> over receive bandwidth on a per station basis, at least in theory.

That's super theoretical, there's no chipset I know of that'd easily
let you use this to control it. You'd also use up airtime for a bunch
of RTS retransmits - they really aren't intended for that type of
usage.

> But I think this is simply an attempt to balance total airtime
> between stations by accounting for airtime used in both directions,
> even if one of those directions isn't controlled so strictly.

My concern is that you don't have _any_ control over the TX from the
station, so the client can effectively starve itself by sending "too
much". I'm concerned that this can lead to pathological corner cases
where the client tries to retransmit things because there was no proper
ACK etc.?

johannes
Johannes Berg Oct. 6, 2017, 5:18 p.m. UTC | #4
On Fri, 2017-10-06 at 16:29 +0200, Toke Høiland-Jørgensen wrote:
> Johannes Berg <johannes@sipsolutions.net> writes:
> 
> > On Fri, 2017-10-06 at 13:52 +0200, Toke Høiland-Jørgensen wrote:
> > > This adds accounting of each station's airtime usage to mac80211.
> > > On
> > > the RX side, airtime is calculated from the packet length and
> > > duration.
> > 
> > I think you should probably try to get something here from the
> > driver?
> > Doing the calculations is really awkward, because you have to take
> > into account aggregation, etc.
> 
> Well, calculating the duration from the rate and size is what ath9k
> is currently doing, and that has all the information available to do
> so (I did verify that the airtime was almost identical on the TX and
> RX side for the same traffic).

It's still iffy - with aggregation you might have a better idea of the
total airtime, but not really see - in the higher layers - all the
padding that comes between A-MPDUs etc. I think many drivers could do
better by exposing the total airtime from the device/firmware, whereas
exposing _all_ the little details that you need to calculate it post-
facto will be really difficult, and make the calculation really hard.

> But yeah, I guess that is not necessarily the case for all drivers?
> Maybe having a field that the driver can fill in is better (that also
> has the nice property that a driver that doesn't supply any airtime
> information won't get the scheduling; which is probably desirable).

I think that'd be better, yes. Perhaps for some simple cases a helper
function can be provided, but I'd want to actually know about those
drivers before adding that.

> You're right that it does nothing for one-way UDP, of course. But not
> a lot of traffic is actually one-way (most is either TCP or something
> like QUIC that also has a feedback loop). So in most of our tests we
> saw a pretty nice effect from TCP back-pressure.

Yeah that makes some sense. I'm a bit concerned about pathological
corner cases, you don't really know about the protocol used on top. Can
a client starve itself by sending things? And perhaps, if it has a dumb
higher layer protocol, it'll retransmit rather than back off, if it
doesn't get the higher layer ACK?

> > Then again, I'm not even sure yet why you care about the AC at all?
> > Should fairness really be something that's within an AC?
> 
> No, ideally it shouldn't. This is mostly an artifact of how ath9k
> schedules TXQs independently for each AC. This means that if (say)
> three stations are back-logged on the BE queue, and one of those also
> has the occasional packet on the VO queue, if the deficit is shared
> between all levels, every time a VO packet arrives that station
> effectively gets its deficit reset to zero, meaning it will get too
> much airtime.
> 
> Having a separate deficit for each AC level is a crude workaround for
> this, which is what we went with for the initial version in ath9k;
> and this patch is just a straight-forward port of that. But I guess
> this is a good opportunity to figure out how to solve this properly;
> I'll think about how to do that (ideas welcome!)...

So ... no, I don't understand. You have a TXQ per _TID_, so splitting
up this per _AC_ still doesn't make sense since you have two TXQs for
each AC?

But then again I'm not sure I've sufficiently wrapped my head around
the whole algorithm, so ...

johannes
David Lang Oct. 6, 2017, 10:40 p.m. UTC | #5
On Fri, 6 Oct 2017, Johannes Berg wrote:

>> Well, calculating the duration from the rate and size is what ath9k
>> is currently doing, and that has all the information available to do
>> so (I did verify that the airtime was almost identical on the TX and
>> RX side for the same traffic).
>
> It's still iffy - with aggregation you might have a better idea of the
> total airtime, but not really see - in the higher layers - all the
> padding that comes between A-MPDUs etc. I think many drivers could do
> better by exposing the total airtime from the device/firmware, whereas
> exposing _all_ the little details that you need to calculate it post-
> facto will be really difficult, and make the calculation really hard.

perfect is the enemy of good enough :-)

I don't think the intent is to be a perfect accounting. If the total
calculated time ends up being > 100%, it doesn't really matter; what matters is
the relative behavior of the stations. And while the naive calculation fails to
properly reward a station that's being more efficient, it is still good enough
to punish stations using lower-bandwidth modes (which is a far larger cause of
problems).

So while it's ideal to have the driver provide the airtime, falling back to a
naive (but relatively cheap) calculation when a time isn't provided is good
enough.

David Lang
Toke Høiland-Jørgensen Oct. 7, 2017, 11:22 a.m. UTC | #6
Johannes Berg <johannes@sipsolutions.net> writes:

> On Fri, 2017-10-06 at 16:29 +0200, Toke Høiland-Jørgensen wrote:
>> Johannes Berg <johannes@sipsolutions.net> writes:
>> 
>> Well, calculating the duration from the rate and size is what ath9k
>> is currently doing, and that has all the information available to do
>> so (I did verify that the airtime was almost identical on the TX and
>> RX side for the same traffic).
>
> It's still iffy - with aggregation you might have a better idea of the
> total airtime, but not really see - in the higher layers - all the
> padding that comes between A-MPDUs etc. I think many drivers could do
> better by exposing the total airtime from the device/firmware, whereas
> exposing _all_ the little details that you need to calculate it post-
> facto will be really difficult, and make the calculation really hard.

Guess you are right that it will be difficult to get a completely
accurate number. But as David Lang notes, as long as we are off by the
same amount for all stations, that is fine - we're just interested in
relative numbers.

>> But yeah, I guess that is not necessarily the case for all drivers?
>> Maybe having a field that the driver can fill in is better (that also
>> has the nice property that a driver that doesn't supply any airtime
>> information won't get the scheduling; which is probably desirable).
>
> I think that'd be better, yes. Perhaps for some simple cases a helper
> function can be provided, but I'd want to actually know about those
> drivers before adding that.

Right, I'll add a field to rx_status instead that drivers can use.

>> You're right that it does nothing for one-way UDP, of course. But not
>> a lot of traffic is actually one-way (most is either TCP or something
>> like QUIC that also has a feedback loop). So in most of our tests we
>> saw a pretty nice effect from TCP back-pressure.
>
> Yeah that makes some sense. I'm a bit concerned about pathological
> corner cases, you don't really know about the protocol used on top.
> Can a client starve itself by sending things? And perhaps, if it has a
> dumb higher layer protocol, it'll retransmit rather than back off, if
> it doesn't get the higher layer ACK?

I suppose a station that keeps flooding the airwaves (say, an
upstream CBR UDP flow with no feedback loop, at a rate higher than it
has bandwidth for) can lock itself out to some extent. I'll see if I can
get this to happen in my testbed with the ath9k code and report back on
what happens.

But for anything TCP-like with a feedback loop, the starvation will be
temporary as the lack of ACKs will cause the sender to eventually slow
down enough to get below its fair share.

>> Having a separate deficit for each AC level is a crude workaround for
>> this, which is what we went with for the initial version in ath9k;
>> and this patch is just a straight-forward port of that. But I guess
>> this is a good opportunity to figure out how to solve this properly;
>> I'll think about how to do that (ideas welcome!)...
>
> So ... no, I don't understand. You have a TXQ per _TID_, so splitting
> up this per _AC_ still doesn't make sense since you have two TXQs for
> each AC?

Yeah, but ath9k schedules all TIDs on the same AC together. So if
station A has two TIDs active and station B one, the scheduling order
will be A1, A2, B1, basically. This is fine as long as they are all on
the same AC and scheduled together (in which case A1 and A2 will just
share the same deficit).

But if there is only one TXQ active, every time that queue is scheduled,
the scheduler will loop until its deficit becomes positive (which is
generally good; that is how we make sure the scheduler is
work-conserving). However, if the deficit is shared with another
scheduling group (which in ath9k is another AC), the station in the
group by itself will not be limited to its fair share, because every
time it is scheduled by itself, its deficit is "reset".

Ideally, I would prefer the scheduling to be "two-pass": First, decide
which physical station to send to, then decide which TID on that station
to service. But because everything is done at the TID/TXQ level, that is
not quite trivial to achieve I think...

-Toke
Johannes Berg Oct. 9, 2017, 7:15 a.m. UTC | #7
On Sat, 2017-10-07 at 13:22 +0200, Toke Høiland-Jørgensen wrote:

> Guess you are right that it will be difficult to get a completely
> accurate number. But as David Lang notes, as long as we are off by
> the same amount for all stations, that is fine - we're just
> interested in relative numbers.

That's not quite true though, you'd overestimate most on stations that
are using aggregation, assuming you take into account the whole frame
exchange sequence time. But maybe giving less than their fair share to
fast stations isn't really that much of a problem.

> > So ... no, I don't understand. You have a TXQ per _TID_, so
> > splitting up this per _AC_ still doesn't make sense since you have
> > two TXQs for each AC?
> 
> Yeah, but ath9k schedules all TIDs on the same AC together. So if
> station A has two TIDs active and station B one, the scheduling order
> will be A1, A2, B1, basically. This is fine as long as they are all
> on the same AC and scheduled together (in which case A1 and A2 will
> just share the same deficit).

Huh, I'm confused. Can you elaborate on this? How does it schedule two
TIDs for _different_ ACs then?

It seems to me that to actually get the right QoS behaviour you have to
schedule *stations*, and then while you're looking at a station, you
need to select the highest-priority TXQ that has data? Otherwise, don't
you end up doing fairness on a STA/AC rather than just on a STA, so
that a station that uses two ACs gets twice as much airtime as one
using just a single AC?

> But if there is only one TXQ active, every time that queue is
> scheduled, the scheduler will loop until its deficit becomes positive
> (which is generally good; that is how we make sure the scheduler is
> work-conserving). However, if the deficit is shared with another
> scheduling group (which in ath9k is another AC), the station in the
> group by itself will not be limited to its fair share, because every
> time it is scheduled by itself, its deficit is "reset".
> 
> Ideally, I would prefer the scheduling to be "two-pass": First,
> decide which physical station to send to, then decide which TID on
> that station to service. 

Yeah, that would make more sense.

> But because everything is done at the TID/TXQ level, that is not
> quite trivial to achieve I think...

Well you can group the TXQs, I guess. They all have a STA pointer, so
you could put a one- or two-bit "schedule color" field into each
station and if you find a TXQ with the same station color you just skip
it or something like that?

johannes
David Lang Oct. 9, 2017, 7:50 a.m. UTC | #8
On Mon, 9 Oct 2017, Johannes Berg wrote:

> On Sat, 2017-10-07 at 13:22 +0200, Toke Høiland-Jørgensen wrote:
>
>> Guess you are right that it will be difficult to get a completely
>> accurate number. But as David Lang notes, as long as we are off by
>> the same amount for all stations, that is fine - we're just
>> interested in relative numbers.
>
> That's not quite true though, you'd overestimate most on stations that
> are using aggregation, assuming you take into account the whole frame
> exchange sequence time. But maybe giving less than their fair share to
> fast stations isn't really that much of a problem.

How much error does this introduce?
Compared to the stations using 802.11b, this is a trivial difference.

That's why I said perfect is the enemy of good enough. Yes, it would be ideal to 
get the airtime from the driver, but if the driver doesn't provide it, I think 
ignoring aggregation in the name of simplicity is 'good enough'.

David Lang
Toke Høiland-Jørgensen Oct. 9, 2017, 9:42 a.m. UTC | #9
Johannes Berg <johannes@sipsolutions.net> writes:

> On Sat, 2017-10-07 at 13:22 +0200, Toke Høiland-Jørgensen wrote:
>
>> Guess you are right that it will be difficult to get a completely
>> accurate number. But as David Lang notes, as long as we are off by
>> the same amount for all stations, that is fine - we're just
>> interested in relative numbers.
>
> That's not quite true though, you'd overestimate most on stations that
> are using aggregation, assuming you take into account the whole frame
> exchange sequence time. But maybe giving less than their fair share to
> fast stations isn't really that much of a problem.

Well, the padding and spacing between frames is at most 11 bytes (4-byte
delimiter, 4-byte FCS and 3-byte padding), which is ~0.7% of a
full-sized frame. I'm not too worried about errors on that scale, TBH.
Sure, it would be better to have it be accurate, but there are other
imperfections, especially on the RX side (we can't count
retransmissions, for instance, since the receiver obviously doesn't see
those).

>> > So ... no, I don't understand. You have a TXQ per _TID_, so
>> > splitting up this per _AC_ still doesn't make sense since you have
>> > two TXQs for each AC?
>> 
>> Yeah, but ath9k schedules all TIDs on the same AC together. So if
>> station A has two TIDs active and station B one, the scheduling order
>> will be A1, A2, B1, basically. This is fine as long as they are all
>> on the same AC and scheduled together (in which case A1 and A2 will
>> just share the same deficit).
>
> Huh, I'm confused. Can you elaborate on this? How does it schedule two
> TIDs for _different_ ACs then?

There's a separate scheduling loop for each hardware queue (one per AC),
which only schedules all TXQs with that AC. The hardware will prioritise
higher ACs by dequeueing from the high-priority hardware queue first.

> It seems to me that to actually get the right QoS behaviour you have
> to schedule *stations*, and then while you're looking at a station,
> you need to select the highest-priority TXQ that has data? Otherwise,
> don't you end up doing fairness on a STA/AC rather than just on a STA,
> so that a station that uses two ACs gets twice as much airtime as one
> using just a single AC?

Yeah, that's what we have currently in ath9k. However, it's rare in
practice that a station transmits the same amount of data on all ACs
(for one, since the max aggregation size is smaller at the higher ACs
that becomes difficult). But you are quite right that this is something
that should be fixed :)

>> Ideally, I would prefer the scheduling to be "two-pass": First,
>> decide which physical station to send to, then decide which TID on
>> that station to service. 
>
> Yeah, that would make more sense.
>
>> But because everything is done at the TID/TXQ level, that is not
>> quite trivial to achieve I think...
>
> Well you can group the TXQs, I guess. They all have a STA pointer, so
> you could put a one- or two-bit "schedule color" field into each
> station and if you find a TXQ with the same station color you just
> skip it or something like that?

Couldn't we add something like a get_next_txq(phy) function to mac80211
that the drivers can call to get the queue to pull packets from? That
way, responsibility for scheduling both stations and QoS levels falls to
mac80211, which makes it possible to do clever scheduling stuff without
having to re-implement it in every driver. Also, this function could
handle all the special TXQs for PS and non-data frames that you were
talking about in your other email?

Unless there's some reason I'm missing that the driver really needs to
schedule the TXQs, I think this would make a lot of sense?

-Toke
Johannes Berg Oct. 9, 2017, 11:40 a.m. UTC | #10
On Mon, 2017-10-09 at 11:42 +0200, Toke Høiland-Jørgensen wrote:

> Well, the padding and spacing between frames is at most 11 bytes (4-
> byte delimiter, 4-byte FCS and 3-byte padding), which is ~0.7% of a
> full-sized frame. I'm not too worried about errors on that scale,
> TBH.

I'm not sure - this really should take the whole frame exchange
sequence into consideration, since the "dead" IFS time and the ACK etc.
is also airtime consumed for that station, even if there's no actual
transmission going on.

If you factor that in, the overhead reduction with aggregation is
considerable! With an 80 MHz 2x2 MCS 9 (866Mbps PHY rate) A-MPDU
containing 64 packets, you can reach >650Mbps (with protection),
without A-MPDU you can reach only about 45Mbps I think.

You'd think that a 1500 byte frame takes 1.5ms for the 1Mbps client,
and ~14µs for the above mentioned VHT rate.

In reality, however, the overhead for both is comparable in absolute
numbers, it's >200µs.

If you don't take any of this overhead into account at all, then you'll
vastly over-allocate time for clients sending small (non-aggregated)
frames, because for those - even with slow rates - the overhead will
dominate.

If you do take this overhead into account but don't account for
aggregation, you'll vastly under-allocate time for HT/VHT clients that
use aggregation.

I don't know if there's an easy answer. Perhaps not accounting for the
overhead but assuming that clients won't be stupid and will actually do
aggregation when they ramp up their rates is reasonable in most
scenarios, but I'm afraid that we'll find interop issues - we found for
example that if you enable U-APSD lots of devices won't do aggregation
any more ...

> Sure, it would be better to have it be accurate, but there are other
> imperfections, especially on the RX side (we can't count
> retransmissions, for instance, since the receiver obviously doesn't
> see those).

Sure.

> There's a separate scheduling loop for each hardware queue (one per
> AC), which only schedules all TXQs with that AC. The hardware will
> prioritise higher ACs by dequeueing from the high-priority hardware
> queue first.

Ok, I guess that addresses that issue.

> Yeah, that's what we have currently in ath9k. However, it's rare in
> practice that a station transmits the same amount of data on all ACs
> (for one, since the max aggregation size is smaller at the higher ACs
> that becomes difficult). But you are quite right that this is
> something that should be fixed :)

Not sure the amount of data matters that much, but ok.

> > > Ideally, I would prefer the scheduling to be "two-pass": First,
> > > decide which physical station to send to, then decide which TID
> > > on that station to service. 
> > 
> > Yeah, that would make more sense.
> > 
> > > But because everything is done at the TID/TXQ level, that is not
> > > quite trivial to achieve I think...
> > 
> > Well you can group the TXQs, I guess. They all have a STA pointer,
> > so
> > you could put a one- or two-bit "schedule color" field into each
> > station and if you find a TXQ with the same station color you just
> > skip it or something like that?
> 
> Couldn't we add something like a get_next_txq(phy) function to
> mac80211 that the drivers can call to get the queue to pull packets
> from? That way, responsibility for scheduling both stations and QoS
> levels falls to mac80211, which makes it possible to do clever
> scheduling stuff without having to re-implement it in every driver.
> Also, this function could handle all the special TXQs for PS and non-
> data frames that you were talking about in your other email?
> 
> Unless there's some reason I'm missing that the driver really needs
> to schedule the TXQs, I think this would make a lot of sense?

I have no idea, that's something you'll have to ask Felix I guess. I'd
think it should work, but the scheduling might have other constraints
like wanting to fill certain A-MPDU buffers, or getting a specific
channel (though that's already decided once you pick the station).

It might also be hard to combine that - if you have space on your VI
queue, how do you then pick the queue? We can't really go *all* the way
and do scheduling *entirely* in software, getting rid of per-AC queues,
since the per-AC queues also work to assign the EDCA parameters etc.

Also, in iwlwifi we actually have a HW queue per TID to facilitate
aggregation, though we could just let mac80211 pick the next TXQ to
serve and skip in the unlikely case that the HW queue for that is
already full (which really shouldn't happen).

johannes
Toke Høiland-Jørgensen Oct. 9, 2017, 12:38 p.m. UTC | #11
Johannes Berg <johannes@sipsolutions.net> writes:

> On Mon, 2017-10-09 at 11:42 +0200, Toke Høiland-Jørgensen wrote:
>
>> Well, the padding and spacing between frames is at most 11 bytes (4-
>> byte delimiter, 4-byte FCS and 3-byte padding), which is ~0.7% of a
>> full-sized frame. I'm not too worried about errors on that scale,
>> TBH.
>
> I'm not sure - this really should take the whole frame exchange
> sequence into consideration, since the "dead" IFS time and the ACK etc.
> is also airtime consumed for that station, even if there's no actual
> transmission going on.
>
> If you factor that in, the overhead reduction with aggregation is
> considerable! With an 80 MHz 2x2 MCS 9 (866Mbps PHY rate) A-MPDU
> containing 64 packets, you can reach >650Mbps (with protection),
> without A-MPDU you can reach only about 45Mbps I think.
>
> You'd think that a 1500 byte frame takes 1.5ms for the 1Mbps client,
> and ~14µs for the above mentioned VHT rate.
>
> In reality, however, the overhead for both is comparable in absolute
> numbers, it's >200µs.
>
> If you don't take any of this overhead into account at all, then you'll
> vastly over-allocate time for clients sending small (non-aggregated)
> frames, because for those - even with slow rates - the overhead will
> dominate.

Right, but most of these are constant values that are straightforward
to add as long as you know how the frame was received, no? Maybe not as
a general function in mac80211, but the driver should be able to
perform a reasonable computation in the absence of information from the
hardware.

What does iwl put into the status.tx_time field of ieee80211_tx_info,
BTW? That was the only driver I could find that used the field, and it
looks like it just writes something it gets from the hardware into it.
So does that value include overhead? And what about retransmissions?

> I don't know if there's an easy answer. Perhaps not accounting for the
> overhead but assuming that clients won't be stupid and will actually
> do aggregation when they ramp up their rates is reasonable in most
> scenarios, but I'm afraid that we'll find interop issues - we found
> for example that if you enable U-APSD lots of devices won't do
> aggregation any more ...

What do you mean by "interop" here, exactly? Just that stations doing
weird things will see reduced performance?

>> > > Ideally, I would prefer the scheduling to be "two-pass": First,
>> > > decide which physical station to send to, then decide which TID
>> > > on that station to service. 
>> > 
>> > Yeah, that would make more sense.
>> > 
>> > > But because everything is done at the TID/TXQ level, that is not
>> > > quite trivial to achieve I think...
>> > 
>> > Well you can group the TXQs, I guess. They all have a STA pointer,
>> > so
>> > you could put a one- or two-bit "schedule color" field into each
>> > station and if you find a TXQ with the same station color you just
>> > skip it or something like that?
>> 
>> Couldn't we add something like a get_next_txq(phy) function to
>> mac80211 that the drivers can call to get the queue to pull packets
>> from? That way, responsibility for scheduling both stations and QoS
>> levels falls to mac80211, which makes it possible to do clever
>> scheduling stuff without having to re-implement it in every driver.
>> Also, this function could handle all the special TXQs for PS and non-
>> data frames that you were talking about in your other email?
>> 
>> Unless there's some reason I'm missing that the driver really needs
>> to schedule the TXQs, I think this would make a lot of sense?
>
> I have no idea, that's something you'll have to ask Felix I guess. I'd
> think it should work, but the scheduling might have other constraints
> like wanting to fill certain A-MPDU buffers, or getting a specific
> channel (though that's already decided once you pick the station).

I'm pretty sure it will work for ath9k. That just picks a TXQ and keeps
pulling packets until it has filled an aggregate. That would still be
possible if mac80211 picks the TXQ instead of the driver itself. So I
was asking more generally, but if you don't see anything obvious that
would prevent me from doing this, I guess I'll go and try it out :)

> It might also be hard to combine that - if you have space on your VI
> queue, how do you then pick the queue? We can't really go *all* the
> way and do scheduling *entirely* in software, getting rid of per-AC
> queues, since the per-AC queues also work to assign the EDCA
> parameters etc.

We'll need to keep putting packets into different hardware queues, sure.
But deciding which one can be done at the last instant (i.e., for ath9k,
ask mac80211 for a TXQ, look at which AC that TXQ belongs to, and start
putting packets into that hardware queue).
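
To make those mechanics concrete, here is a toy userspace model of such a driver pull loop. Everything mac80211-related is a stand-in (fake_txq, fake_tx_dequeue(), the quantum and the per-packet airtime cost are all made-up numbers); only the shape of the interaction follows the proposal: dequeue until the queue is empty or until the ERR_PTR(-EAGAIN)-style marker says the station is out of airtime, and pick the hardware queue from the TXQ's AC at the last instant.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the TXQ: per-queue AC, backlog, and airtime deficit (usec). */
struct fake_txq {
	int ac;		/* 0..3, as in IEEE80211_NUM_ACS */
	int npkts;	/* packets waiting in this TXQ */
	int deficit;	/* airtime deficit in usec */
};

struct fake_skb { int len; };

/* Models the proposed ERR_PTR(-EAGAIN) return from ieee80211_tx_dequeue() */
#define TXQ_EAGAIN ((struct fake_skb *)1)

static struct fake_skb one_pkt = { 1500 };

/* Models ieee80211_tx_dequeue(): once the deficit goes non-positive, it
 * replenishes by one quantum and tells the driver to come back later. */
static struct fake_skb *fake_tx_dequeue(struct fake_txq *q)
{
	if (!q->npkts)
		return NULL;
	if (q->deficit <= 0) {
		q->deficit += 1000;	/* IEEE80211_AIRTIME_QUANTUM */
		return TXQ_EAGAIN;
	}
	q->npkts--;
	q->deficit -= 200;		/* pretend 200 usec per packet */
	return &one_pkt;
}

/* Driver-side pull loop: mac80211 picked the TXQ; the driver only looks
 * at txq->ac at the last instant to choose which hardware queue to fill. */
static int service_txq(struct fake_txq *q, int hwq_fill[4])
{
	int pulled = 0;
	struct fake_skb *skb;

	while ((skb = fake_tx_dequeue(q)) != NULL) {
		if (skb == TXQ_EAGAIN)	/* out of airtime; reschedule */
			break;
		hwq_fill[q->ac]++;	/* "fill the AC's hardware queue" */
		pulled++;
	}
	return pulled;
}
```

Running service_txq() on a TXQ holding ten packets with a 1000 usec deficit pulls five of them (5 x 200 usec), then stops on the EAGAIN marker with the deficit replenished; a round-robin caller would simply come back to this TXQ on a later pass.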

One of the things I would also like to try, is to sometimes promote or
demote packets between AC levels. E.g., if a station has one VO packet
and a bunch of BE packets queued, it may sometimes be more efficient to
just put the VO packet at the beginning of a BE aggregate. I still need
to figure out for which values of 'sometimes' this is a good idea, but
I'd like to at least be able to support this sort of shenanigans, which
I believe the scheme I proposed above will allow.

> Also, in iwlwifi we actually have a HW queue per TID to facilitate
> aggregation, though we could just let mac80211 pick the next TXQ to
> serve and skip in the unlikely case that the HW queue for that is
> already full (which really shouldn't happen).

Yeah, there may be a need for the driver to be able to express some
constraints on the queues it can accept currently; maybe a bitmap of
eligible TID numbers, or just a way of saying "can't use this TXQ,
please give me another". But it may also be that it's enough for the
driver to just give up and try again later if it can't use the TXQ it is
assigned...
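
As a toy illustration of the bitmap variant (the interface is entirely hypothetical; nothing like a tid_room_mask exists in mac80211):

```c
#include <assert.h>

/* Hypothetical constraint interface: the driver advertises which TIDs
 * its hardware queues can currently accept as a bitmap, and the
 * scheduler skips candidate TIDs outside it. */
static int pick_eligible_tid(const int *candidates, int n_candidates,
			     unsigned long tid_room_mask)
{
	int i;

	for (i = 0; i < n_candidates; i++)
		if (tid_room_mask & (1UL << candidates[i]))
			return candidates[i];
	return -1;	/* nothing serviceable right now; retry later */
}
```

With candidates ordered by priority, the first TID whose bit is set wins, which also matches the "just skip what you can't service" behaviour discussed below for iwlwifi-style per-TID hardware queues.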

-Toke
Johannes Berg Oct. 9, 2017, 6:50 p.m. UTC | #12
Hi,

> Right, but most of these are constant values that are straightforward
> to add as long as you know how the frame was received, no?
> Maybe not as a general function in mac80211, but the driver should be
> able to perform a reasonable computation in the absence of
> information from the hardware.

Yes, I think so.

> What does iwl put into the status.tx_time field of ieee80211_tx_info,
> BTW? That was the only driver I could find that used the field, and
> it looks like it just writes something it gets from the hardware into
> it.
> So does that value include overhead? And what about retransmissions?

It comes from the firmware - as far as I can tell, yes, it'll include
retransmissions and the whole frame exchange sequence (RTS-CTS-data-ack 
or RTS-CTS-ampdu-blockack).

> > I don't know if there's an easy answer. Perhaps not accounting for
> > the
> > overhead but assuming that clients won't be stupid and will
> > actually
> > do aggregation when they ramp up their rates is reasonable in most
> > scenarios, but I'm afraid that we'll find interop issues - we found
> > for example that if you enable U-APSD lots of devices won't do
> > aggregation any more ...
> 
> What do you mean by "interop" here, exactly? Just that stations doing
> weird things will see reduced performance?

Well, the case that we did find is that sometimes U-APSD kills
aggregation, so then you'd have a lot of single frames and
significantly under-estimate air-time usage in this case. Thus, this
station would get far more than its fair share of air time, because of
a bug that makes it not use aggregation... That doesn't sound very good
to me.

Perhaps at least taking aggregation into account would be doable - we
_should_ know this, at least partially.
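
A sketch of what such an estimate might look like, assuming the receive path knows the rate and how many subframes the A-MPDU carried. All the constants here are illustrative placeholders, not values from the 802.11 spec:

```c
#include <assert.h>

/* Back-of-the-envelope RX airtime estimate in usec: payload time at the
 * reported rate, plus a fixed per-PPDU overhead (preamble, IFS, the
 * block-ack exchange) shared across the A-MPDU's subframes. The 100 usec
 * overhead figure is an assumed placeholder, not a spec constant. */
static unsigned int est_airtime_usec(unsigned int len_bytes,
				     unsigned int rate_kbps,
				     unsigned int ampdu_subframes)
{
	const unsigned int ppdu_overhead_usec = 100;

	if (!rate_kbps)
		return 0;
	if (!ampdu_subframes)
		ampdu_subframes = 1;	/* not aggregated */

	return (len_bytes * 8u * 1000u) / rate_kbps +
	       ppdu_overhead_usec / ampdu_subframes;
}
```

At 120 Mbit/s a lone 1500-byte frame comes out at 200 usec while the same frame inside a 4-subframe A-MPDU costs 125 usec, which illustrates how badly the accounting skews if a station that was assumed to aggregate silently stops doing so.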

> I'm pretty sure it will work for ath9k. That just picks a TXQ and
> keeps pulling packets until it has filled an aggregate. 

OK

> That would still be possible if mac80211 picks the TXQ instead of the
> driver itself. 

Right.

> So I was asking more generally, but if you don't see anything obvious
> that would prevent me from doing this, I guess I'll go and try it out
> :)

:-)

> > It might also be hard to combine that - if you have space on your
> > VI
> > queue, how do you then pick the queue? We can't really go *all* the
> > way and do scheduling *entirely* in software, getting rid of per-AC
> > queues, since the per-AC queues also work to assign the EDCA
> > parameters etc.
> 
> We'll need to keep putting packets into different hardware queues,
> sure.

> But deciding which one can be done at the last instant (i.e., for
> ath9k,
> ask mac80211 for a TXQ, look at which AC that TXQ belongs to, and
> start
> putting packets into that hardware queue).

Right.

> One of the things I would also like to try, is to sometimes promote
> or demote packets between AC levels. E.g., if a station has one VO
> packet and a bunch of BE packets queued, it may sometimes be more
> efficient to just put the VO packet at the beginning of a BE
> aggregate. I still need to figure out for which values of 'sometimes'
> this is a good idea, but I'd like to at least be able to support this
> sort of shenanigans, which I believe the scheme I proposed above will
> allow.

I don't think that's a good idea. It's possible (and this can be done
at least in synthetic scenarios) that VO traffic _completely_ drowns
out all lower-priority traffic, so demoting frames that way would get
them the wrong EDCA parameters.

Yes, it might be better for throughput, but it would almost certainly
be worse for latency for those frames, which is kinda the whole point
of the ACs.

> > Also, in iwlwifi we actually have a HW queue per TID to facilitate
> > aggregation, though we could just let mac80211 pick the next TXQ to
> > serve and skip in the unlikely case that the HW queue for that is
> > already full (which really shouldn't happen).
> 
> Yeah, there may be a need for the driver to be able to express some
> constraints on the queues it can accept currently; maybe a bitmap of
> eligible TID numbers, or just a way of saying "can't use this TXQ,
> please give me another". But it may also be that it's enough for the
> driver to just give up and try again later if it can't use the TXQ it
> is assigned...

I guess it could just move on to the next TID. There doesn't seem to be
much point in saying "I can't service this high-priority TID, give me
something lower priority instead" because it really should be servicing
the higher priority first anyway, so just skipping if it can't be
serviced seems fine.

johannes
Toke Høiland-Jørgensen Oct. 9, 2017, 8:25 p.m. UTC | #13
Johannes Berg <johannes@sipsolutions.net> writes:

> Hi,
>
>> Right, but most of these are constant values that are straightforward
>> to add as long as you know how the frame was received, no?
>> Maybe not as a general function in mac80211, but the driver should be
>> able to perform a reasonable computation in the absence of
>> information from the hardware.
>
> Yes, I think so.

Cool. I'll go with that, then.

>> What does iwl put into the status.tx_time field of ieee80211_tx_info,
>> BTW? That was the only driver I could find that used the field, and
>> it looks like it just writes something it gets from the hardware into
>> it.
>> So does that value include overhead? And what about retransmissions?
>
> It comes from the firmware - as far as I can tell, yes, it'll include
> retransmissions and the whole frame exchange sequence
> (RTS-CTS-data-ack or RTS-CTS-ampdu-blockack).

Excellent! Just what we need :)

>> > I don't know if there's an easy answer. Perhaps not accounting for
>> > the
>> > overhead but assuming that clients won't be stupid and will
>> > actually
>> > do aggregation when they ramp up their rates is reasonable in most
>> > scenarios, but I'm afraid that we'll find interop issues - we found
>> > for example that if you enable U-APSD lots of devices won't do
>> > aggregation any more ...
>> 
>> What do you mean by "interop" here, exactly? Just that stations doing
>> weird things will see reduced performance?
>
> Well, the case that we did find is that sometimes U-APSD kills
> aggregation, so then you'd have a lot of single frames and
> significantly under-estimate air-time usage in this case. Thus, this
> station would get far more than its fair share of air time, because of
> a bug that makes it not use aggregation... That doesn't sound very
> good to me.
>
> Perhaps at least taking aggregation into account would be doable - we
> _should_ know this, at least partially.

Yeah, ath9k certainly gets that as part of the RX information from the
chip.

>> One of the things I would also like to try, is to sometimes promote
>> or demote packets between AC levels. E.g., if a station has one VO
>> packet and a bunch of BE packets queued, it may sometimes be more
>> efficient to just put the VO packet at the beginning of a BE
>> aggregate. I still need to figure out for which values of 'sometimes'
>> this is a good idea, but I'd like to at least be able to support this
>> sort of shenanigans, which I believe what I proposed above will.
>
> I don't think that's a good idea. It's possible (and this can be done
> at least in synthetic scenarios) that VO traffic _completely_ drowns
> out all lower-priority traffic, so demoting frames that way would get
> them the wrong EDCA parameters.
>
> Yes, it might be better for throughput, but it would almost certainly
> be worse for latency for those frames, which is kinda the whole point
> of the ACs.

Well, some of the tests I did while porting ath9k to the iTXQs indicated
that for voice-like traffic we can get very close to the same actual
end-to-end latency for BE-marked traffic that we do with VO-marked
traffic. This is in part because the FQ sparse flow prioritisation makes
sure that such flows get queueing priority.

Now obviously, there are going to be tradeoffs, and scenarios where
latency will suffer. But I would at least like to be able to explore
this, and I think with the API we are currently discussing that will be
possible. Another thing I want to explore is doing soft admission
control; i.e., if someone sends bulk traffic marked as VO, it will be
automatically demoted to BE traffic rather than locking everyone else
out. We've tested that with some success in the Cake scheduler, and it
may be applicable to WiFi as well.

>> > Also, in iwlwifi we actually have a HW queue per TID to facilitate
>> > aggregation, though we could just let mac80211 pick the next TXQ to
>> > serve and skip in the unlikely case that the HW queue for that is
>> > already full (which really shouldn't happen).
>> 
>> Yeah, there may be a need for the driver to be able to express some
>> constraints on the queues it can accept currently; maybe a bitmap of
>> eligible TID numbers, or just a way of saying "can't use this TXQ,
>> please give me another". But it may also be that it's enough for the
>> driver to just give up and try again later if it can't use the TXQ it
>> is assigned...
>
> I guess it could just move on to the next TID. There doesn't seem to
> be much point in saying "I can't service this high-priority TID, give
> me something lower priority instead" because it really should be
> servicing the higher priority first anyway, so just skipping if it
> can't be serviced seems fine.

Suppose so. I'll try it out :)

-Toke
Johannes Berg Oct. 11, 2017, 8:55 a.m. UTC | #14
On Mon, 2017-10-09 at 22:25 +0200, Toke Høiland-Jørgensen wrote:

> > Well, the case that we did find is that sometimes U-APSD kills
> > aggregation, so then you'd have a lot of single frames and
> > significantly under-estimate air-time usage in this case. Thus,
> > this
> > station would get far more than its fair share of air time, because
> > of
> > a bug that makes it not use aggregation... That doesn't sound very
> > good to me.
> > 
> > Perhaps at least taking aggregation into account would be doable -
> > we _should_ know this, at least partially.
> 
> Yeah, ath9k certainly gets that as part of the RX information from
> the chip.

Yeah, though there are more complex cases, e.g. with hardware A-MSDU
deaggregation like in ath10k.

> Well, some of the tests I did while porting ath9k to the iTXQs
> indicated that for voice-like traffic we can get very close to the
> same actual end-to-end latency for BE-marked traffic that we do with
> VO-marked traffic. This is in part because the FQ sparse flow
> prioritisation makes sure that such flows get queueing priority.

That really depends on medium saturation - if that's low, then yeah,
the difference is in the EDCA parameters and the minimum backoff, which
isn't a huge difference if you measure end-to-end.

Given saturation/congestion though, and I think we have to take this
into account since not everyone will be using this code, there are
measurable advantages to using VO - like in the test I mentioned where
you can block out BE "forever".

> Now obviously, there are going to be tradeoffs, and scenarios where
> latency will suffer. But I would at least like to be able to explore
> this, and I think with the API we are currently discussing that will
> be possible. 

I'm not sure how you envision this working with the API, since you
still have to pull from multiple TXQs, but I have no objection to the
API as is - I'm just not convinced that actually demoting traffic is a
good idea :)

> Another thing I want to explore is doing soft admission
> control; i.e., if someone sends bulk traffic marked as VO, it will be
> automatically demoted to BE traffic rather than locking everyone else
> out. We've tested that with some success in the Cake scheduler, and
> it may be applicable to WiFi as well.

It seems pretty strange to do that at this level - we control what TID
(and AC) is selected for the traffic? So why not just change at that
level for this type of thing?

You can probably even already do that with TC by re-marking the traffic
depending on type or throughput or whatever else TC can classify on.

johannes
Toke Høiland-Jørgensen Oct. 11, 2017, 1:50 p.m. UTC | #15
Johannes Berg <johannes@sipsolutions.net> writes:

>> Yeah, ath9k certainly gets that as part of the RX information from
>> the chip.
>
> Yeah, though there are more complex cases, e.g. with hardware A-MSDU
> deaggregation like in ath10k.

Here be dragons; noted :)

>> Well, some of the tests I did while porting ath9k to the iTXQs
>> indicated that for voice-like traffic we can get very close to the
>> same actual end-to-end latency for BE-marked traffic that we do with
>> VO-marked traffic. This is in part because the FQ sparse flow
>> prioritisation makes sure that such flows get queueing priority.
>
> That really depends on medium saturation - if that's low, then yeah,
> the difference is in the EDCA parameters and the minimum backoff,
> which isn't a huge difference if you measure end-to-end.
>
> Given saturation/congestion though, and I think we have to take this
> into account since not everyone will be using this code, there are
> measurable advantages to using VO - like in the test I mentioned where
> you can block out BE "forever".

Yup, when the medium is saturated the tradeoff might be different. My
hope is that it is possible to dynamically detect when it might be
beneficial and change behaviour based on this. I know this is maybe a
bit hand-wavy for now, but otherwise it wouldn't be research... ;)
>
>> Now obviously, there are going to be tradeoffs, and scenarios where
>> latency will suffer. But I would at least like to be able to explore
>> this, and I think with the API we are currently discussing that will
>> be possible. 
>
> I'm not sure how you envision this working with the API, since you
> still have to pull from multiple TXQs, but I have no objection to the
> API as is - I'm just not convinced that actually demoting traffic is a
> good idea :)

Sure, I realise I have some convincing to do on this point...

>> Another thing I want to explore is doing soft admission
>> control; i.e., if someone sends bulk traffic marked as VO, it will be
>> automatically demoted to BE traffic rather than locking everyone else
>> out. We've tested that with some success in the Cake scheduler, and
>> it may be applicable to WiFi as well.
>
> It seems pretty strange to do that at this level - we control what TID
> (and AC) is selected for the traffic? So why not just change at that
> level for this type of thing?
>
> You can probably even already do that with TC by re-marking the traffic
> depending on type or throughput or whatever else TC can classify on.

Mostly because I am expecting that the current queue state is an
important input variable. It's quite straightforward to (e.g.) re-mark
VO packets if they exceed a fixed rate. But what I want to do is demote
packets if they exceed what the network can currently handle. It may be
that visibility into the queueing is enough, though. I have yet to
actually experiment with this...

-Toke

Patch

diff --git a/net/mac80211/ieee80211_i.h b/net/mac80211/ieee80211_i.h
index 9675814f64db..2bc259cf2c30 100644
--- a/net/mac80211/ieee80211_i.h
+++ b/net/mac80211/ieee80211_i.h
@@ -90,6 +90,9 @@  extern const u8 ieee80211_ac_to_qos_mask[IEEE80211_NUM_ACS];
 
 #define IEEE80211_MAX_NAN_INSTANCE_ID 255
 
+/* How much to increase airtime deficit on each scheduling round */
+#define IEEE80211_AIRTIME_QUANTUM        1000 /* usec */
+
 struct ieee80211_fragment_entry {
 	struct sk_buff_head skb_list;
 	unsigned long first_frag_time;
diff --git a/net/mac80211/rx.c b/net/mac80211/rx.c
index 70e9d2ca8bbe..031dae70c7ad 100644
--- a/net/mac80211/rx.c
+++ b/net/mac80211/rx.c
@@ -1673,6 +1673,35 @@  ieee80211_rx_h_sta_process(struct ieee80211_rx_data *rx)
 	return RX_CONTINUE;
 } /* ieee80211_rx_h_sta_process */
 
+static ieee80211_rx_result debug_noinline
+ieee80211_rx_h_airtime(struct ieee80211_rx_data *rx)
+{
+	struct sta_info *sta = rx->sta;
+	struct sk_buff *skb = rx->skb;
+	struct ieee80211_rx_status *status = IEEE80211_SKB_RXCB(skb);
+	struct ieee80211_hdr *hdr = (struct ieee80211_hdr *)skb->data;
+	struct ieee80211_sub_if_data *sdata = rx->sdata;
+	const struct ieee80211_rate *rate;
+	u32 airtime = 0;
+	int i;
+	u8 ac;
+
+	if (!sta || !ieee80211_is_data(hdr->frame_control))
+		return RX_CONTINUE;
+
+	ac = ieee80211_select_queue_80211(sdata, skb, hdr);
+	/* FIXME: implement this
+	airtime = ieee80211_pkt_duration(status, skb->len); */
+
+	spin_lock_bh(&sta->lock);
+	sta->airtime_deficit[ac] -= airtime;
+	sta->airtime_stats.rx_airtime += airtime;
+	spin_unlock_bh(&sta->lock);
+
+	return RX_CONTINUE;
+} /* ieee80211_rx_h_airtime */
+
+
 static ieee80211_rx_result debug_noinline
 ieee80211_rx_h_decrypt(struct ieee80211_rx_data *rx)
 {
@@ -3392,6 +3421,7 @@  static void ieee80211_rx_handlers(struct ieee80211_rx_data *rx,
 		CALL_RXH(ieee80211_rx_h_check_more_data);
 		CALL_RXH(ieee80211_rx_h_uapsd_and_pspoll);
 		CALL_RXH(ieee80211_rx_h_sta_process);
+		CALL_RXH(ieee80211_rx_h_airtime);
 		CALL_RXH(ieee80211_rx_h_decrypt);
 		CALL_RXH(ieee80211_rx_h_defragment);
 		CALL_RXH(ieee80211_rx_h_michael_mic_verify);
diff --git a/net/mac80211/sta_info.c b/net/mac80211/sta_info.c
index ffcd25c4908c..cf33c2998bdd 100644
--- a/net/mac80211/sta_info.c
+++ b/net/mac80211/sta_info.c
@@ -388,6 +388,7 @@  struct sta_info *sta_info_alloc(struct ieee80211_sub_if_data *sdata,
 	for (i = 0; i < IEEE80211_NUM_ACS; i++) {
 		skb_queue_head_init(&sta->ps_tx_buf[i]);
 		skb_queue_head_init(&sta->tx_filtered[i]);
+		sta->airtime_deficit[i] = IEEE80211_AIRTIME_QUANTUM;
 	}
 
 	for (i = 0; i < IEEE80211_NUM_TIDS; i++)
diff --git a/net/mac80211/sta_info.h b/net/mac80211/sta_info.h
index a35c964f6217..43eb4b3da12a 100644
--- a/net/mac80211/sta_info.h
+++ b/net/mac80211/sta_info.h
@@ -548,6 +548,13 @@  struct sta_info {
 	} tx_stats;
 	u16 tid_seq[IEEE80211_QOS_CTL_TID_MASK + 1];
 
+	/* Airtime stats and deficit, protected by lock */
+	struct {
+		u64 rx_airtime;
+		u64 tx_airtime;
+	} airtime_stats;
+	s64 airtime_deficit[IEEE80211_NUM_ACS];
+
 	/*
 	 * Aggregation information, locked with lock.
 	 */
diff --git a/net/mac80211/status.c b/net/mac80211/status.c
index da7427a41529..5998004401f6 100644
--- a/net/mac80211/status.c
+++ b/net/mac80211/status.c
@@ -823,6 +823,14 @@  static void __ieee80211_tx_status(struct ieee80211_hw *hw,
 				ieee80211_lost_packet(sta, info);
 			}
 		}
+
+		if (info->status.tx_time) {
+			int ac = skb_get_queue_mapping(skb);
+			spin_lock_bh(&sta->lock);
+			sta->airtime_stats.tx_airtime += info->status.tx_time;
+			sta->airtime_deficit[ac] -= info->status.tx_time;
+			spin_unlock_bh(&sta->lock);
+		}
 	}
 
 	/* SNMP counters
diff --git a/net/mac80211/tx.c b/net/mac80211/tx.c
index 94826680cf2b..4b11d199bf8e 100644
--- a/net/mac80211/tx.c
+++ b/net/mac80211/tx.c
@@ -3410,6 +3410,7 @@  struct sk_buff *ieee80211_tx_dequeue(struct ieee80211_hw *hw,
 	struct ieee80211_tx_data tx;
 	ieee80211_tx_result r;
 	struct ieee80211_vif *vif;
+	struct sta_info *sta = NULL;
 
 	spin_lock_bh(&fq->lock);
 
@@ -3421,6 +3422,18 @@  struct sk_buff *ieee80211_tx_dequeue(struct ieee80211_hw *hw,
 	if (skb)
 		goto out;
 
+	if (txq->sta) {
+		sta = container_of(txq->sta, struct sta_info, sta);
+		spin_lock_bh(&sta->lock);
+		if (sta->airtime_deficit[txq->ac] <= 0) {
+			sta->airtime_deficit[txq->ac] += IEEE80211_AIRTIME_QUANTUM;
+			skb = ERR_PTR(-EAGAIN);
+			spin_unlock_bh(&sta->lock);
+			goto out;
+		}
+		spin_unlock_bh(&sta->lock);
+	}
+
 begin:
 	skb = fq_tin_dequeue(fq, tin, fq_tin_dequeue_func);
 	if (!skb)
@@ -3434,9 +3447,7 @@  struct sk_buff *ieee80211_tx_dequeue(struct ieee80211_hw *hw,
 	tx.local = local;
 	tx.skb = skb;
 	tx.sdata = vif_to_sdata(info->control.vif);
-
-	if (txq->sta)
-		tx.sta = container_of(txq->sta, struct sta_info, sta);
+	tx.sta = sta;
 
 	/*
 	 * The key can be removed while the packet was queued, so need to call