Message ID | 20171006115232.28688-1-toke@toke.dk (mailing list archive) |
---|---|
State | RFC |
Delegated to: | Johannes Berg |
On Fri, 2017-10-06 at 13:52 +0200, Toke Høiland-Jørgensen wrote:

> This adds accounting of each station's airtime usage to mac80211. On
> the RX side, airtime is calculated from the packet length and
> duration.

I think you should probably try to get something here from the driver? Doing the calculations is really awkward, because you have to take into account aggregation, etc.

However, I'm not even sure that you _want_ to use RX airtime at all? I guess this is a fairly fundamental philosophical question, but there's no way to feed back "you're sending too much" to a station, so if it starts consuming a lot of airtime with (what we see as) RX, and we throttle sending to it, then I'm not sure that will have much effect? Maybe with TCP, but not really with anything else.

You also can't really know what AC this was on, and it doesn't make all that much sense?

Then again, I'm not even sure yet why you care about the AC at all? Should fairness really be something that's within an AC?

johannes
Johannes Berg <johannes@sipsolutions.net> writes:

> On Fri, 2017-10-06 at 13:52 +0200, Toke Høiland-Jørgensen wrote:
>> This adds accounting of each station's airtime usage to mac80211. On
>> the RX side, airtime is calculated from the packet length and
>> duration.
>
> I think you should probably try to get something here from the driver?
> Doing the calculations is really awkward, because you have to take
> into account aggregation, etc.

Well, calculating the duration from the rate and size is what ath9k is currently doing, and it has all the information available to do so (I did verify that the airtime was almost identical on the TX and RX side for the same traffic). But yeah, I guess that is not necessarily the case for all drivers? Maybe having a field that the driver can fill in is better (that also has the nice property that a driver that doesn't supply any airtime information won't get the scheduling; which is probably desirable).

> However, I'm not even sure that you _want_ to use RX airtime at all? I
> guess this is a fairly fundamental philosophical question, but there's
> no way to feed back "you're sending too much" to a station, so if it
> starts consuming a lot of airtime with (what we see as) RX, and we
> throttle sending to it, then I'm not sure that will have much effect?
> Maybe with TCP, but not really with anything else.

You're right that it does nothing for one-way UDP, of course. But not a lot of traffic is actually one-way (most is either TCP or something like QUIC that also has a feedback loop). So in most of our tests we saw a pretty nice effect from TCP back-pressure.

> You also can't really know what AC this was on, and it doesn't make all
> that much sense?
>
> Then again, I'm not even sure yet why you care about the AC at all?
> Should fairness really be something that's within an AC?

No, ideally it shouldn't. This is mostly an artifact of how ath9k schedules TXQs independently for each AC. This means that if (say) three stations are back-logged on the BE queue, and one of those also has the occasional packet on the VO queue, then if the deficit is shared between all levels, every time a VO packet arrives that station effectively gets its deficit reset to zero, meaning it will get too much airtime.

Having a separate deficit for each AC level is a crude workaround for this, which is what we went with for the initial version in ath9k; and this patch is just a straightforward port of that. But I guess this is a good opportunity to figure out how to solve this properly; I'll think about how to do that (ideas welcome!)...

-Toke
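The deficit scheme under discussion is a standard airtime deficit round-robin. A toy model (purely illustrative Python; the names, quantum and per-packet airtimes are invented, and this is not ath9k's actual code) shows how the deficit equalises *airtime* rather than packet counts between a slow and a fast station:

```python
# Toy model of the airtime deficit round-robin discussed in this thread.
# Illustrative only: names, QUANTUM and the per-packet costs are invented.
from collections import deque

QUANTUM = 300  # airtime credit (µs) granted each time a station rotates

def airtime_fair_schedule(cost_us, n_packets):
    """Serve stations deficit-round-robin until n_packets have been sent.

    cost_us: dict station -> airtime (µs) one of its packets consumes
             (a slow station has a higher per-packet cost).
    Returns total airtime (µs) served per station."""
    deficit = {s: 0 for s in cost_us}
    served = {s: 0 for s in cost_us}
    active = deque(cost_us)
    sent = 0
    while sent < n_packets:
        sta = active[0]
        if deficit[sta] <= 0:
            # Out of credit: replenish and move to the back of the line.
            deficit[sta] += QUANTUM
            active.rotate(-1)
            continue
        # In credit: transmit one packet and charge its airtime.
        deficit[sta] -= cost_us[sta]
        served[sta] += cost_us[sta]
        sent += 1
    return served
```

With `{"slow": 300, "fast": 100}` both stations end up with identical total airtime (the fast one simply sends three packets per round). The failure mode described above corresponds to a second, independent per-AC scheduling loop touching the same deficit: a station that is alone in its group keeps getting replenished without competing against anyone, so its deficit never holds it back.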
[please don't use html format, the linux lists drop it]

> As I understand it, this code is mainly intended for stations that
> serve as an AP, rather than as a client.

Sure, it's pointless in all other cases since there's only one (L2) station to talk to.

> Since the MAC layer is the one with easiest access to the real
> airtime statistics (much easier than, say, a qdisc), there is no
> better place to implement it.

I'm not arguing anything else, what are you trying to say?

> If RTS/CTS probes are in use, then the AP does in fact have control
> over receive bandwidth on a per station basis, at least in theory.

That's super theoretical, there's no chipset I know of that'd easily let you use this to control it. You'd also use up airtime for a bunch of RTS retransmits - they really aren't intended for that type of usage.

> But I think this is simply an attempt to balance total airtime
> between stations by accounting for airtime used in both directions,
> even if one of those directions isn't controlled so strictly.

My concern is that you don't have _any_ control over the TX from the station, so the client can effectively starve itself by sending "too much". I'm concerned that this can lead to pathological corner cases where the client tries to retransmit things because there was no proper ACK etc.?

johannes
On Fri, 2017-10-06 at 16:29 +0200, Toke Høiland-Jørgensen wrote:

> Johannes Berg <johannes@sipsolutions.net> writes:
>
> > On Fri, 2017-10-06 at 13:52 +0200, Toke Høiland-Jørgensen wrote:
> > > This adds accounting of each station's airtime usage to mac80211.
> > > On the RX side, airtime is calculated from the packet length and
> > > duration.
> >
> > I think you should probably try to get something here from the driver?
> > Doing the calculations is really awkward, because you have to take
> > into account aggregation, etc.
>
> Well, calculating the duration from the rate and size is what ath9k
> is currently doing, and that has all the information available to do
> so (I did verify that the airtime was almost identical on the TX and
> RX side for the same traffic).

It's still iffy - with aggregation you might have a better idea of the total airtime, but not really see - in the higher layers - all the padding that comes between A-MPDUs etc. I think many drivers could do better by exposing the total airtime from the device/firmware, whereas exposing _all_ the little details that you need to calculate it post-facto will be really difficult, and make the calculation really hard.

> But yeah, I guess that is not necessarily the case for all drivers?
> Maybe having a field that the driver can fill in is better (that also
> has the nice property that a driver that doesn't supply any airtime
> information won't get the scheduling; which is probably desirable).

I think that'd be better, yes. Perhaps for some simple cases a helper function can be provided, but I'd want to actually know about those drivers before adding that.

> You're right that it does nothing for one-way UDP, of course. But not
> a lot of traffic is actually one-way (most is either TCP or something
> like QUIC that also has a feedback loop). So in most of our tests we
> saw a pretty nice effect from TCP back-pressure.

Yeah that makes some sense. I'm a bit concerned about pathological corner cases, you don't really know about the protocol used on top. Can a client starve itself by sending things? And perhaps, if it has a dumb higher layer protocol, it'll retransmit rather than back off, if it doesn't get the higher layer ACK?

> > Then again, I'm not even sure yet why you care about the AC at all?
> > Should fairness really be something that's within an AC?
>
> No, ideally it shouldn't. This is mostly an artifact of how ath9k
> schedules TXQs independently for each AC. This means that if (say)
> three stations are back-logged on the BE queue, and one of those also
> has the occasional packet on the VO queue, if the deficit is shared
> between all levels, every time a VO packet arrives that station
> effectively gets its deficit reset to zero, meaning it will get too
> much airtime.
>
> Having a separate deficit for each AC level is a crude workaround for
> this, which is what we went with for the initial version in ath9k;
> and this patch is just a straight-forward port of that. But I guess
> this is a good opportunity to figure out how to solve this properly;
> I'll think about how to do that (ideas welcome!)...

So ... no, I don't understand. You have a TXQ per _TID_, so splitting up this per _AC_ still doesn't make sense since you have two TXQs for each AC? But then again I'm not sure I've sufficiently wrapped my head around the whole algorithm, so ...

johannes
On Fri, 6 Oct 2017, Johannes Berg wrote:

>> Well, calculating the duration from the rate and size is what ath9k
>> is currently doing, and that has all the information available to do
>> so (I did verify that the airtime was almost identical on the TX and
>> RX side for the same traffic).
>
> It's still iffy - with aggregation you might have a better idea of the
> total airtime, but not really see - in the higher layers - all the
> padding that comes between A-MPDUs etc. I think many drivers could do
> better by exposing the total airtime from the device/firmware, whereas
> exposing _all_ the little details that you need to calculate it post-
> facto will be really difficult, and make the calculation really hard.

Perfect is the enemy of good enough :-)

I don't think the intent is to try to be a perfect accounting; if the total calculated time ends up being > 100%, it doesn't really matter. What matters is the relative behavior of the stations, and while the naive calculation fails to properly reward a station that's being more efficient, it is still good enough to punish stations using lower-bandwidth modes (which is a far larger cause of problems).

So while it's ideal to have the driver provide the airtime, falling back to a naive (but relatively cheap) calculation if a time isn't provided is good enough.

David Lang
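The "naive but relatively cheap" fallback being described is essentially duration = size / rate. A minimal sketch (illustrative only; a real driver would at least add preamble and per-rate constants):

```python
# Naive per-frame airtime estimate: duration from size and PHY rate only.
# Illustrative sketch; it deliberately ignores preambles, IFS, ACKs and
# A-MPDU padding, which is exactly the approximation debated here.

def naive_airtime_us(frame_bytes, rate_mbps):
    """Microseconds a frame's payload spends on the air at a given rate.

    The absolute value is an underestimate, but it is off in the same
    direction for every station, which is all that *relative* airtime
    fairness needs."""
    return frame_bytes * 8 / rate_mbps  # bits / (bits per µs) = µs
```

For a 1500-byte frame this gives 12000 µs at 1 Mbps versus ~14 µs at the 866 Mbps VHT rate mentioned later in the thread: a spread of nearly three orders of magnitude, which dwarfs the aggregation-padding error.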
Johannes Berg <johannes@sipsolutions.net> writes:

> On Fri, 2017-10-06 at 16:29 +0200, Toke Høiland-Jørgensen wrote:
>> Johannes Berg <johannes@sipsolutions.net> writes:
>>
>> Well, calculating the duration from the rate and size is what ath9k
>> is currently doing, and that has all the information available to do
>> so (I did verify that the airtime was almost identical on the TX and
>> RX side for the same traffic).
>
> It's still iffy - with aggregation you might have a better idea of the
> total airtime, but not really see - in the higher layers - all the
> padding that comes between A-MPDUs etc. I think many drivers could do
> better by exposing the total airtime from the device/firmware, whereas
> exposing _all_ the little details that you need to calculate it post-
> facto will be really difficult, and make the calculation really hard.

Guess you are right that it will be difficult to get a completely accurate number. But as David Lang notes, as long as we are off by the same amount for all stations, that is fine - we're just interested in relative numbers.

>> But yeah, I guess that is not necessarily the case for all drivers?
>> Maybe having a field that the driver can fill in is better (that also
>> has the nice property that a driver that doesn't supply any airtime
>> information won't get the scheduling; which is probably desirable).
>
> I think that'd be better, yes. Perhaps for some simple cases a helper
> function can be provided, but I'd want to actually know about those
> drivers before adding that.

Right, I'll add a field to rx_status instead that drivers can use.

>> You're right that it does nothing for one-way UDP, of course. But not
>> a lot of traffic is actually one-way (most is either TCP or something
>> like QUIC that also has a feedback loop). So in most of our tests we
>> saw a pretty nice effect from TCP back-pressure.
>
> Yeah that makes some sense. I'm a bit concerned about pathological
> corner cases, you don't really know about the protocol used on top.
> Can a client starve itself by sending things? And perhaps, if it has a
> dumb higher layer protocol, it'll retransmit rather than back off, if
> it doesn't get the higher layer ACK?

I suppose a station that keeps flooding the airwaves (say, with an upstream CBR UDP flow with no feedback loop, at a rate higher than it has bandwidth for) can lock itself out to some extent. I'll see if I can get this to happen in my testbed with the ath9k code and report back on what happens. But for anything TCP-like with a feedback loop, the starvation will be temporary, as the lack of ACKs will cause the sender to eventually slow down enough to get below its fair share.

>> Having a separate deficit for each AC level is a crude workaround for
>> this, which is what we went with for the initial version in ath9k;
>> and this patch is just a straight-forward port of that. But I guess
>> this is a good opportunity to figure out how to solve this properly;
>> I'll think about how to do that (ideas welcome!)...
>
> So ... no, I don't understand. You have a TXQ per _TID_, so splitting
> up this per _AC_ still doesn't make sense since you have two TXQs for
> each AC?

Yeah, but ath9k schedules all TIDs on the same AC together. So if station A has two TIDs active and station B one, the scheduling order will be A1, A2, B1, basically. This is fine as long as they are all on the same AC and scheduled together (in which case A1 and A2 will just share the same deficit).

But if there is only one TXQ active, every time that queue is scheduled, the scheduler will loop until its deficit becomes positive (which is generally good; that is how we make sure the scheduler is work-conserving). However, if the deficit is shared with another scheduling group (which in ath9k is another AC), the station in the group by itself will not be limited to its fair share, because every time it is scheduled by itself, its deficit is "reset".

Ideally, I would prefer the scheduling to be "two-pass": first, decide which physical station to send to, then decide which TID on that station to service. But because everything is done at the TID/TXQ level, that is not quite trivial to achieve, I think...

-Toke
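The "two-pass" idea could look roughly like this (a hypothetical sketch; `pick_next`, the data layout, and the largest-deficit rule are all invented for illustration and are not a proposed mac80211 API):

```python
# Hypothetical sketch of "two-pass" scheduling: fairness is decided per
# station, QoS is decided per TID within the chosen station.

AC_ORDER = ["VO", "VI", "BE", "BK"]  # highest priority first

def pick_next(stations):
    """Return (station, ac) to serve next, or None if all queues are empty.

    stations: dict name -> {"deficit": airtime credit,
                            "tids": {ac: queued packet count}}."""
    backlogged = {
        name: st for name, st in stations.items()
        if any(st["tids"].get(ac, 0) for ac in AC_ORDER)
    }
    if not backlogged:
        return None
    # Pass 1: the fairness decision is made per *station*...
    name = max(backlogged, key=lambda n: backlogged[n]["deficit"])
    # Pass 2: ...and the QoS decision within that station.
    for ac in AC_ORDER:
        if backlogged[name]["tids"].get(ac, 0):
            return name, ac
```

Under this shape a station active on two ACs is weighed once, not twice, so there is a single deficit per station and nothing for a stray VO packet to "reset".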
On Sat, 2017-10-07 at 13:22 +0200, Toke Høiland-Jørgensen wrote:

> Guess you are right that it will be difficult to get a completely
> accurate number. But as David Lang notes, as long as we are off by
> the same amount for all stations, that is fine - we're just
> interested in relative numbers.

That's not quite true though, you'd overestimate most on stations that are using aggregation, assuming you take into account the whole frame exchange sequence time. But maybe giving less than their fair share to fast stations isn't really that much of a problem.

>> So ... no, I don't understand. You have a TXQ per _TID_, so
>> splitting up this per _AC_ still doesn't make sense since you have
>> two TXQs for each AC?
>
> Yeah, but ath9k schedules all TIDs on the same AC together. So if
> station A has two TIDs active and station B one, the scheduling order
> will be A1, A2, B1, basically. This is fine as long as they are all
> on the same AC and scheduled together (in which case A1 and A2 will
> just share the same deficit).

Huh, I'm confused. Can you elaborate on this? How does it schedule two TIDs for _different_ ACs then?

It seems to me that to actually get the right QoS behaviour you have to schedule *stations*, and then while you're looking at a station, you need to select the highest-priority TXQ that has data? Otherwise, don't you end up doing fairness on a STA/AC rather than just on a STA, so that a station that uses two ACs gets twice as much airtime as one using just a single AC?

> But if there is only one TXQ active, every time that queue is
> scheduled, the scheduler will loop until its deficit becomes positive
> (which is generally good; that is how we make sure the scheduler is
> work-conserving). However, if the deficit is shared with another
> scheduling group (which in ath9k is another AC), the station in the
> group by itself will not be limited to its fair share, because every
> time it is scheduled by itself, its deficit is "reset".
>
> Ideally, I would prefer the scheduling to be "two-pass": First,
> decide which physical station to send to, then decide which TID on
> that station to service.

Yeah, that would make more sense.

> But because everything is done at the TID/TXQ level, that is not
> quite trivial to achieve I think...

Well you can group the TXQs, I guess. They all have a STA pointer, so you could put a one- or two-bit "schedule color" field into each station and if you find a TXQ with the same station color you just skip it or something like that?

johannes
On Mon, 9 Oct 2017, Johannes Berg wrote:

> On Sat, 2017-10-07 at 13:22 +0200, Toke Høiland-Jørgensen wrote:
>
>> Guess you are right that it will be difficult to get a completely
>> accurate number. But as David Lang notes, as long as we are off by
>> the same amount for all stations, that is fine - we're just
>> interested in relative numbers.
>
> That's not quite true though, you'd overestimate most on stations that
> are using aggregation, assuming you take into account the whole frame
> exchange sequence time. But maybe giving less than their fair share to
> fast stations isn't really that much of a problem.

How much error does this introduce? Compared to the stations using 802.11b, this is a trivial difference. That's why I said perfect is the enemy of good enough.

Yes, it would be ideal to get the airtime from the driver, but if the driver doesn't provide it, I think ignoring aggregation in the name of simplicity is 'good enough'.

David Lang
Johannes Berg <johannes@sipsolutions.net> writes:

> On Sat, 2017-10-07 at 13:22 +0200, Toke Høiland-Jørgensen wrote:
>
>> Guess you are right that it will be difficult to get a completely
>> accurate number. But as David Lang notes, as long as we are off by
>> the same amount for all stations, that is fine - we're just
>> interested in relative numbers.
>
> That's not quite true though, you'd overestimate most on stations that
> are using aggregation, assuming you take into account the whole frame
> exchange sequence time. But maybe giving less than their fair share to
> fast stations isn't really that much of a problem.

Well, the padding and spacing between frames is at most 11 bytes (4-byte delimiter, 4-byte FCS and 3-byte padding), which is ~0.7% of a full-sized frame. I'm not too worried about errors on that scale, TBH. Sure, it would be better to have it be accurate, but there are other imperfections, especially on the RX side (we can't count retransmissions, for instance, since the receiver obviously doesn't see those).

>> > So ... no, I don't understand. You have a TXQ per _TID_, so
>> > splitting up this per _AC_ still doesn't make sense since you have
>> > two TXQs for each AC?
>>
>> Yeah, but ath9k schedules all TIDs on the same AC together. So if
>> station A has two TIDs active and station B one, the scheduling order
>> will be A1, A2, B1, basically. This is fine as long as they are all
>> on the same AC and scheduled together (in which case A1 and A2 will
>> just share the same deficit).
>
> Huh, I'm confused. Can you elaborate on this? How does it schedule two
> TIDs for _different_ ACs then?

There's a separate scheduling loop for each hardware queue (one per AC), which only schedules the TXQs with that AC. The hardware will prioritise higher ACs by dequeueing from the high-priority hardware queue first.

> It seems to me that to actually get the right QoS behaviour you have
> to schedule *stations*, and then while you're looking at a station,
> you need to select the highest-priority TXQ that has data? Otherwise,
> don't you end up doing fairness on a STA/AC rather than just on a STA,
> so that a station that uses two ACs gets twice as much airtime as one
> using just a single AC?

Yeah, that's what we have currently in ath9k. However, it's rare in practice that a station transmits the same amount of data on all ACs (for one, since the max aggregation size is smaller at the higher ACs, that becomes difficult). But you are quite right that this is something that should be fixed :)

>> Ideally, I would prefer the scheduling to be "two-pass": First,
>> decide which physical station to send to, then decide which TID on
>> that station to service.
>
> Yeah, that would make more sense.
>
>> But because everything is done at the TID/TXQ level, that is not
>> quite trivial to achieve I think...
>
> Well you can group the TXQs, I guess. They all have a STA pointer, so
> you could put a one- or two-bit "schedule color" field into each
> station and if you find a TXQ with the same station color you just
> skip it or something like that?

Couldn't we add something like a get_next_txq(phy) function to mac80211 that the drivers can call to get the queue to pull packets from? That way, responsibility for scheduling both stations and QoS levels falls to mac80211, which makes it possible to do clever scheduling stuff without having to re-implement it in every driver. Also, this function could handle all the special TXQs for PS and non-data frames that you were talking about in your other email?

Unless there's some reason I'm missing that the driver really needs to schedule the TXQs, I think this would make a lot of sense?

-Toke
On Mon, 2017-10-09 at 11:42 +0200, Toke Høiland-Jørgensen wrote:

> Well, the padding and spacing between frames is at most 11 bytes (4-
> byte delimiter, 4-byte FCS and 3-byte padding), which is ~0.7% of a
> full-sized frame. I'm not too worried about errors on that scale,
> TBH.

I'm not sure - this really should take the whole frame exchange sequence into consideration, since the "dead" IFS time and the ACK etc. is also airtime consumed for that station, even if there's no actual transmission going on.

If you factor that in, the overhead reduction with aggregation is considerable! With an 80 MHz 2x2 MCS 9 (866Mbps PHY rate) A-MPDU containing 64 packets, you can reach >650Mbps (with protection); without A-MPDU you can reach only about 45Mbps, I think.

You'd think that a 1500 byte frame takes 1.5ms for the 1Mbps client, and ~14µs for the above mentioned VHT rate. In reality, however, the overhead for both is comparable in absolute numbers; it's >200µs.

If you don't take any of this overhead into account at all, then you'll vastly over-allocate time for clients sending small (non-aggregated) frames, because for those - even with slow rates - the overhead will dominate. If you do take this overhead into account but don't account for aggregation, you'll vastly under-allocate time for HT/VHT clients that use aggregation.

I don't know if there's an easy answer. Perhaps not accounting for the overhead but assuming that clients won't be stupid and will actually do aggregation when they ramp up their rates is reasonable in most scenarios, but I'm afraid that we'll find interop issues - we found for example that if you enable U-APSD lots of devices won't do aggregation any more ...

> Sure, it would be better to have it be accurate, but there are other
> imperfections, especially on the RX side (we can't count
> retransmissions, for instance, since the receiver obviously doesn't
> see those).

Sure.

> There's a separate scheduling loop for each hardware queue (one per
> AC), which only schedules all TXQs with that AC. The hardware will
> prioritise higher ACs by dequeueing from the high-priority hardware
> queue first.

Ok, I guess that addresses that issue.

> Yeah, that's what we have currently in ath9k. However, it's rare in
> practice that a station transmits the same amount of data on all ACs
> (for one, since the max aggregation size is smaller at the higher ACs
> that becomes difficult). But you are quite right that this is
> something that should be fixed :)

Not sure the amount of data matters that much, but ok.

> > > Ideally, I would prefer the scheduling to be "two-pass": First,
> > > decide which physical station to send to, then decide which TID
> > > on that station to service.
> >
> > Yeah, that would make more sense.
> >
> > > But because everything is done at the TID/TXQ level, that is not
> > > quite trivial to achieve I think...
> >
> > Well you can group the TXQs, I guess. They all have a STA pointer,
> > so you could put a one- or two-bit "schedule color" field into each
> > station and if you find a TXQ with the same station color you just
> > skip it or something like that?
>
> Couldn't we add something like a get_next_txq(phy) function to
> mac80211 that the drivers can call to get the queue to pull packets
> from? That way, responsibility for scheduling both stations and QoS
> levels falls to mac80211, which makes it possible to do clever
> scheduling stuff without having to re-implement it in every driver.
> Also, this function could handle all the special TXQs for PS and non-
> data frames that you were talking about in your other email?
>
> Unless there's some reason I'm missing that the driver really needs
> to schedule the TXQs, I think this would make a lot of sense?

I have no idea, that's something you'll have to ask Felix I guess. I'd think it should work, but the scheduling might have other constraints like wanting to fill certain A-MPDU buffers, or getting a specific channel (though that's already decided once you pick the station). It might also be hard to combine that - if you have space on your VI queue, how do you then pick the queue?

We can't really go *all* the way and do scheduling *entirely* in software, getting rid of per-AC queues, since the per-AC queues also work to assign the EDCA parameters etc. Also, in iwlwifi we actually have a HW queue per TID to facilitate aggregation, though we could just let mac80211 pick the next TXQ to serve and skip in the unlikely case that the HW queue for that is already full (which really shouldn't happen).

johannes
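The effect of the per-exchange overhead can be reproduced with a crude model: assume a fixed ~200 µs cost per frame exchange (the figure given above) on top of the raw payload time at the 866 Mbps PHY rate from the same example. The constants are rough assumptions for illustration, not measurements:

```python
# Back-of-the-envelope model of the aggregation-vs-overhead numbers
# discussed above. Constants are rough assumptions, not measurements.

PHY_RATE_MBPS = 866.7  # 80 MHz 2x2 VHT MCS 9, as in the example above
OVERHEAD_US = 200.0    # assumed fixed per-exchange cost: IFS, preamble,
                       # protection and (block-)ack

def goodput_mbps(n_mpdus, mpdu_bytes=1500):
    """Payload bits delivered per µs of airtime for one frame exchange."""
    payload_bits = n_mpdus * mpdu_bytes * 8
    data_us = payload_bits / PHY_RATE_MBPS  # time spent on payload bits
    return payload_bits / (data_us + OVERHEAD_US)
```

Under this toy model `goodput_mbps(1)` comes out around 56 Mbps and `goodput_mbps(64)` around 700 Mbps, the same ballpark as the ~45 and >650 Mbps quoted above. Equivalently, the single 1500-byte frame really costs ~214 µs of airtime rather than ~14 µs, so an estimator that ignores the exchange overhead undercounts a non-aggregating station by roughly 15x.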
Johannes Berg <johannes@sipsolutions.net> writes:

> On Mon, 2017-10-09 at 11:42 +0200, Toke Høiland-Jørgensen wrote:
>
>> Well, the padding and spacing between frames is at most 11 bytes (4-
>> byte delimiter, 4-byte FCS and 3-byte padding), which is ~0.7% of a
>> full-sized frame. I'm not too worried about errors on that scale,
>> TBH.
>
> I'm not sure - this really should take the whole frame exchange
> sequence into consideration, since the "dead" IFS time and the ACK
> etc. is also airtime consumed for that station, even if there's no
> actual transmission going on.
>
> If you factor that in, the overhead reduction with aggregation is
> considerable! With an 80 MHz 2x2 MCS 9 (866Mbps PHY rate) A-MPDU
> containing 64 packets, you can reach >650Mbps (with protection),
> without A-MPDU you can reach only about 45Mbps I think.
>
> You'd think that a 1500 byte frame takes 1.5ms for the 1Mbps client,
> and ~14µs for the above mentioned VHT rate.
>
> In reality, however, the overhead for both is comparable in absolute
> numbers, it's >200µs.
>
> If you don't take any of this overhead into account at all, then
> you'll vastly over-allocate time for clients sending small (non-
> aggregated) frames, because for those - even with slow rates - the
> overhead will dominate.

Right, but most of these are constant values that are straightforward to add as long as you know how the frame was received, no? Maybe not as a general function in mac80211, but the driver should be able to perform a reasonable computation in the absence of information from the hardware.

What does iwl put into the status.tx_time field of ieee80211_tx_info, BTW? That was the only driver I could find that used the field, and it looks like it just writes something it gets from the hardware into it. So does that value include overhead? And what about retransmissions?

> I don't know if there's an easy answer. Perhaps not accounting for
> the overhead but assuming that clients won't be stupid and will
> actually do aggregation when they ramp up their rates is reasonable
> in most scenarios, but I'm afraid that we'll find interop issues - we
> found for example that if you enable U-APSD lots of devices won't do
> aggregation any more ...

What do you mean by "interop" here, exactly? Just that stations doing weird things will see reduced performance?

>> > > Ideally, I would prefer the scheduling to be "two-pass": First,
>> > > decide which physical station to send to, then decide which TID
>> > > on that station to service.
>> >
>> > Yeah, that would make more sense.
>> >
>> > > But because everything is done at the TID/TXQ level, that is not
>> > > quite trivial to achieve I think...
>> >
>> > Well you can group the TXQs, I guess. They all have a STA pointer,
>> > so you could put a one- or two-bit "schedule color" field into each
>> > station and if you find a TXQ with the same station color you just
>> > skip it or something like that?
>>
>> Couldn't we add something like a get_next_txq(phy) function to
>> mac80211 that the drivers can call to get the queue to pull packets
>> from? That way, responsibility for scheduling both stations and QoS
>> levels falls to mac80211, which makes it possible to do clever
>> scheduling stuff without having to re-implement it in every driver.
>> Also, this function could handle all the special TXQs for PS and non-
>> data frames that you were talking about in your other email?
>>
>> Unless there's some reason I'm missing that the driver really needs
>> to schedule the TXQs, I think this would make a lot of sense?
>
> I have no idea, that's something you'll have to ask Felix I guess. I'd
> think it should work, but the scheduling might have other constraints
> like wanting to fill certain A-MPDU buffers, or getting a specific
> channel (though that's already decided once you pick the station).

I'm pretty sure it will work for ath9k. That just picks a TXQ and keeps pulling packets until it has filled an aggregate. That would still be possible if mac80211 picks the TXQ instead of the driver itself. So I was asking more generally, but if you don't see anything obvious that would prevent me from doing this, I guess I'll go and try it out :)

> It might also be hard to combine that - if you have space on your VI
> queue, how do you then pick the queue? We can't really go *all* the
> way and do scheduling *entirely* in software, getting rid of per-AC
> queues, since the per-AC queues also work to assign the EDCA
> parameters etc.

We'll need to keep putting packets into different hardware queues, sure. But deciding which one can be done at the last instant (i.e., for ath9k, ask mac80211 for a TXQ, look at which AC that TXQ belongs to, and start putting packets into that hardware queue).

One of the things I would also like to try is to sometimes promote or demote packets between AC levels. E.g., if a station has one VO packet and a bunch of BE packets queued, it may sometimes be more efficient to just put the VO packet at the beginning of a BE aggregate. I still need to figure out for which values of 'sometimes' this is a good idea, but I'd like to at least be able to support this sort of shenanigans, which I believe what I proposed above will.

> Also, in iwlwifi we actually have a HW queue per TID to facilitate
> aggregation, though we could just let mac80211 pick the next TXQ to
> serve and skip in the unlikely case that the HW queue for that is
> already full (which really shouldn't happen).

Yeah, there may be a need for the driver to be able to express some constraints on the queues it can accept currently; maybe a bitmap of eligible TID numbers, or just a way of saying "can't use this TXQ, please give me another". But it may also be that it's enough for the driver to just give up and try again later if it can't use the TXQ it is assigned...

-Toke
Hi,

> Right, but most of these are constant values that are straightforward
> to add as long as you know how the frame was received, no? Maybe not
> as a general function in mac80211, but the driver should be able to
> perform a reasonable computation in the absence of information from
> the hardware.

Yes, I think so.

> What does iwl put into the status.tx_time field of ieee80211_tx_info,
> BTW? That was the only driver I could find that used the field, and
> it looks like it just writes something it gets from the hardware into
> it. So does that value include overhead? And what about
> retransmissions?

It comes from the firmware - as far as I can tell, yes, it'll include retransmissions and the whole frame exchange sequence (RTS-CTS-data-ack or RTS-CTS-ampdu-blockack).

> > I don't know if there's an easy answer. Perhaps not accounting for
> > the overhead but assuming that clients won't be stupid and will
> > actually do aggregation when they ramp up their rates is reasonable
> > in most scenarios, but I'm afraid that we'll find interop issues -
> > we found for example that if you enable U-APSD lots of devices
> > won't do aggregation any more ...
>
> What do you mean by "interop" here, exactly? Just that stations doing
> weird things will see reduced performance?

Well, the case that we did find is that sometimes U-APSD kills aggregation, so then you'd have a lot of single frames and significantly under-estimate airtime usage in this case. Thus, this station would get far more than its fair share of airtime, because of a bug that makes it not use aggregation... That doesn't sound very good to me. Perhaps at least taking aggregation into account would be doable - we _should_ know this, at least partially.

> I'm pretty sure it will work for ath9k. That just picks a TXQ and
> keeps pulling packets until it has filled an aggregate.

OK

> That would still be possible if mac80211 picks the TXQ instead of the
> driver itself.

Right.

> So I was asking more generally, but if you don't see anything obvious
> that would prevent me from doing this, I guess I'll go and try it out
> :)

:-)

> > It might also be hard to combine that - if you have space on your
> > VI queue, how do you then pick the queue? We can't really go *all*
> > the way and do scheduling *entirely* in software, getting rid of
> > per-AC queues, since the per-AC queues also work to assign the EDCA
> > parameters etc.
>
> We'll need to keep putting packets into different hardware queues,
> sure. But deciding which one can be done at the last instant (i.e.,
> for ath9k, ask mac80211 for a TXQ, look at which AC that TXQ belongs
> to, and start putting packets into that hardware queue).

Right.

> One of the things I would also like to try is to sometimes promote
> or demote packets between AC levels. E.g., if a station has one VO
> packet and a bunch of BE packets queued, it may sometimes be more
> efficient to just put the VO packet at the beginning of a BE
> aggregate. I still need to figure out for which values of 'sometimes'
> this is a good idea, but I'd like to at least be able to support this
> sort of shenanigans, which I believe what I proposed above will.

I don't think that's a good idea. It's possible (and this can be done at least in synthetic scenarios) that VO traffic _completely_ drowns out all lower-priority traffic, so demoting frames that way would get them the wrong EDCA parameters. Yes, it might be better for throughput, but it would almost certainly be worse for latency for those frames, which is kinda the whole point of the ACs.

> > Also, in iwlwifi we actually have a HW queue per TID to facilitate
> > aggregation, though we could just let mac80211 pick the next TXQ to
> > serve and skip in the unlikely case that the HW queue for that is
> > already full (which really shouldn't happen).
>
> Yeah, there may be a need for the driver to be able to express some
> constraints on the queues it can accept currently; maybe a bitmap of
> eligible TID numbers, or just a way of saying "can't use this TXQ,
> please give me another". But it may also be that it's enough for the
> driver to just give up and try again later if it can't use the TXQ it
> is assigned...

I guess it could just move on to the next TID. There doesn't seem to be much point in saying "I can't service this high-priority TID, give me something lower priority instead", because it really should be servicing the higher-priority one first anyway, so just skipping if it can't be serviced seems fine.

johannes
Johannes Berg <johannes@sipsolutions.net> writes: > Hi, > >> Right, but most of these are constant values that are straight >> forward to add as long as you know how the frame was received, no? >> Maybe not as a general function in mac80211, but the driver should be >> able to perform a reasonable computation in the absence of >> information from the hardware. > > Yes, I think so. Cool. I'll go with that, then. >> What does iwl put into the status.tx_time field of ieee80211_tx_info, >> BTW? That was the only driver I could find that used the field, and >> it looks like it just writes something it gets from the hardware into >> it. >> So does that value include overhead? And what about retransmissions? > > It comes from the firmware - as far as I can tell, yes, it'll include > retransmissions and the whole frame exchange sequence > (RTS-CTS-data-ack or RTS-CTS-ampdu-blockack). Excellent! Just what we need :) >> > I don't know if there's an easy answer. Perhaps not accounting for >> > the >> > overhead but assuming that clients won't be stupid and will >> > actually >> > do aggregation when they ramp up their rates is reasonable in most >> > scenarios, but I'm afraid that we'll find interop issues - we found >> > for example that if you enable U-APSD lots of devices won't do >> > aggregation any more ... >> >> What do you mean by "interop" here, exactly? Just that stations doing >> weird things will see reduced performance? > > Well, the case that we did find is that sometimes U-APSD kills > aggregation, so then you'd have a lot of single frames and > significantly under-estimate air-time usage in this case. Thus, this > station would get far more than its fair share of air time, because of > a bug that makes it not use aggregation... That doesn't sound very > good to me. > > Perhaps at least taking aggregation into account would be doable - we > _should_ know this, at least partially. Yeah, ath9k certainly gets that as part of the RX information from the chip. 
>> One of the things I would also like to try, is to sometimes promote >> or demote packets between AC levels. E.g., if a station has one VO >> packet and a bunch of BE packets queued, it may sometimes be more >> efficient to just put the VO packet at the beginning of a BE >> aggregate. I still need to figure out for which values of 'sometimes' >> this is a good idea, but I'd like to at least be able to support this >> sort of shenanigans, which I believe what I proposed above will. > > I don't think that's a good idea. It's possible (and this can be done > at least in synthetic scenarios) that VO traffic _completely_ drowns > out all lower-priority traffic, so demoting frames that way would get > them the wrong EDCA parameters. > > Yes, it might be better for throughput, but it would almost certainly > be worse for latency for those frames, which is kinda the whole point > of the ACs. Well, some of the tests I did while porting ath9k to the iTXQs indicated that for voice-like traffic we can get very close to the same actual end-to-end latency for BE-marked traffic that we do with VO-marked traffic. This is in part because the FQ sparse flow prioritisation makes sure that such flows get queueing priority. Now obviously, there are going to be tradeoffs, and scenarios where latency will suffer. But I would at least like to be able to explore this, and I think with the API we are currently discussing that will be possible. Another thing I want to explore is doing soft admission control; i.e., if someone sends bulk traffic marked as VO, it will be automatically demoted to BE traffic rather than locking everyone else out. We've tested that with some success in the Cake scheduler, and it may be applicable to WiFi as well. 
>> > Also, in iwlwifi we actually have a HW queue per TID to facilitate >> > aggregation, though we could just let mac80211 pick the next TXQ to >> > serve and skip in the unlikely case that the HW queue for that is >> > already full (which really shouldn't happen). >> >> Yeah, there may be a need for the driver to be able to express some >> constraints on the queues it can accept currently; may a bitmap of >> eligible TID numbers, or just a way of saying "can't use this TXQ, >> please give me another". But it may also be that it's enough for the >> driver to just give up and try again later if it can't use the TXQ it >> is assigned... > > I guess it could just move on to the next TID. There doesn't seem to > be much point in saying "I can't service this high-priority TID, give > me something lower priority instead" because it really should be > servicing the higher priority first anyway, so just skipping if it > can't be serviced seems fine. Suppose so. I'll try it out :) -Toke
On Mon, 2017-10-09 at 22:25 +0200, Toke Høiland-Jørgensen wrote: > > Well, the case that we did find is that sometimes U-APSD kills > > aggregation, so then you'd have a lot of single frames and > > significantly under-estimate air-time usage in this case. Thus, > > this > > station would get far more than its fair share of air time, because > > of > > a bug that makes it not use aggregation... That doesn't sound very > > good to me. > > > > Perhaps at least taking aggregation into account would be doable - > > we _should_ know this, at least partially. > > Yeah, ath9k certainly gets that as part of the RX information from > the chip. Yeah, though there are more complex cases, e.g. with hardware A-MSDU deaggregation like in ath10k. > Well, some of the tests I did while porting ath9k to the iTXQs > indicated that for voice-like traffic we can get very close to the > same actual end-to-end latency for BE-marked traffic that we do with > VO-marked traffic. This is in part because the FQ sparse flow > prioritisation makes sure that such flows get queueing priority. That really depends on medium saturation - if that's low, then yeah, the difference is in the EDCA parameters and the minimum backoff, which isn't a huge difference if you measure end-to-end. Given saturation/congestion though, and I think we have to take this into account since not everyone will be using this code, there are measurable advantages to using VO - like in the test I mentioned where you can block out BE "forever". > Now obviously, there are going to be tradeoffs, and scenarios where > latency will suffer. But I would at least like to be able to explore > this, and I think with the API we are currently discussing that will > be possible. 
I'm not sure how you envision this working with the API, since you still have to pull from multiple TXQs, but I have no objection to the API as is - I'm just not convinced that actually demoting traffic is a good idea :) > Another thing I want to explore is doing soft admission > control; i.e., if someone sends bulk traffic marked as VO, it will be > automatically demoted to BE traffic rather than locking everyone else > out. We've tested that with some success in the Cake scheduler, and > it may be applicable to WiFi as well. It seems pretty strange to do that at this level - we control what TID (and AC) is selected for the traffic? So why not just change at that level for this type of thing? You can probably even already do that with TC by re-marking the traffic depending on type or throughput or whatever else TC can classify on. johannes
Johannes Berg <johannes@sipsolutions.net> writes: >> Yeah, ath9k certainly gets that as part of the RX information from >> the chip. > > Yeah, though there are more complex cases, e.g. with hardware A-MSDU > deaggregation like in ath10k. Here be dragons; noted :) >> Well, some of the tests I did while porting ath9k to the iTXQs >> indicated that for voice-like traffic we can get very close to the >> same actual end-to-end latency for BE-marked traffic that we do with >> VO-marked traffic. This is in part because the FQ sparse flow >> prioritisation makes sure that such flows get queueing priority. > > That really depends on medium saturation - if that's low, then yeah, > the difference is in the EDCA parameters and the minimum backoff, > which isn't a huge difference if you measure end-to-end. > > Given saturation/congestion though, and I think we have to take this > into account since not everyone will be using this code, there are > measurable advantages to using VO - like in the test I mentioned where > you can block out BE "forever". Yup, when the medium is saturated the tradeoff might be different. My hope is that it is possible to dynamically detect when it might be beneficial and change behaviour based on this. I know this is maybe a bit hand-wavy for now, but otherwise it wouldn't be research... ;) > >> Now obviously, there are going to be tradeoffs, and scenarios where >> latency will suffer. But I would at least like to be able to explore >> this, and I think with the API we are currently discussing that will >> be possible. > > I'm not sure how you envision this working with the API, since you > still have to pull from multiple TXQs, but I have no objection to the > API as is - I'm just not convinced that actually demoting traffic is a > good idea :) Sure, I realise I have some convincing to do on this point... 
>> Another thing I want to explore is doing soft admission
>> control; i.e., if someone sends bulk traffic marked as VO, it will be
>> automatically demoted to BE traffic rather than locking everyone else
>> out. We've tested that with some success in the Cake scheduler, and
>> it may be applicable to WiFi as well.
>
> It seems pretty strange to do that at this level - we control what TID
> (and AC) is selected for the traffic? So why not just change at that
> level for this type of thing?
>
> You can probably even already do that with TC by re-marking the
> traffic depending on type or throughput or whatever else TC can
> classify on.

Mostly because I am expecting that the current queue state is an
important input variable. It's quite straightforward to (e.g.) remark
VO packets if they exceed a fixed rate. But what I want to do is demote
packets if they exceed what the network can currently handle. It may be
that visibility into the queueing is enough, though. I have yet to
actually experiment with this...

-Toke
diff --git a/net/mac80211/ieee80211_i.h b/net/mac80211/ieee80211_i.h
index 9675814f64db..2bc259cf2c30 100644
--- a/net/mac80211/ieee80211_i.h
+++ b/net/mac80211/ieee80211_i.h
@@ -90,6 +90,9 @@ extern const u8 ieee80211_ac_to_qos_mask[IEEE80211_NUM_ACS];
 
 #define IEEE80211_MAX_NAN_INSTANCE_ID 255
 
+/* How much to increase airtime deficit on each scheduling round */
+#define IEEE80211_AIRTIME_QUANTUM 1000 /* usec */
+
 struct ieee80211_fragment_entry {
         struct sk_buff_head skb_list;
         unsigned long first_frag_time;
diff --git a/net/mac80211/rx.c b/net/mac80211/rx.c
index 70e9d2ca8bbe..031dae70c7ad 100644
--- a/net/mac80211/rx.c
+++ b/net/mac80211/rx.c
@@ -1673,6 +1673,35 @@ ieee80211_rx_h_sta_process(struct ieee80211_rx_data *rx)
         return RX_CONTINUE;
 } /* ieee80211_rx_h_sta_process */
 
+static ieee80211_rx_result debug_noinline
+ieee80211_rx_h_airtime(struct ieee80211_rx_data *rx)
+{
+        struct sta_info *sta = rx->sta;
+        struct sk_buff *skb = rx->skb;
+        struct ieee80211_rx_status *status = IEEE80211_SKB_RXCB(skb);
+        struct ieee80211_hdr *hdr = (struct ieee80211_hdr *)skb->data;
+        struct ieee80211_sub_if_data *sdata = rx->sdata;
+        const struct ieee80211_rate *rate;
+        u32 airtime = 0;
+        int i;
+        u8 ac;
+
+        if (!sta || !ieee80211_is_data(hdr->frame_control))
+                return RX_CONTINUE;
+
+        ac = ieee80211_select_queue_80211(sdata, skb, hdr);
+        /* FIXME: implement this
+        airtime = ieee80211_pkt_duration(status, skb->len); */
+
+        spin_lock_bh(&sta->lock);
+        sta->airtime_deficit[ac] -= airtime;
+        sta->airtime_stats.rx_airtime += airtime;
+        spin_unlock_bh(&sta->lock);
+
+        return RX_CONTINUE;
+} /* ieee80211_rx_h_airtime */
+
+
 static ieee80211_rx_result debug_noinline
 ieee80211_rx_h_decrypt(struct ieee80211_rx_data *rx)
 {
@@ -3392,6 +3421,7 @@ static void ieee80211_rx_handlers(struct ieee80211_rx_data *rx,
         CALL_RXH(ieee80211_rx_h_check_more_data);
         CALL_RXH(ieee80211_rx_h_uapsd_and_pspoll);
         CALL_RXH(ieee80211_rx_h_sta_process);
+        CALL_RXH(ieee80211_rx_h_airtime);
         CALL_RXH(ieee80211_rx_h_decrypt);
         CALL_RXH(ieee80211_rx_h_defragment);
         CALL_RXH(ieee80211_rx_h_michael_mic_verify);
diff --git a/net/mac80211/sta_info.c b/net/mac80211/sta_info.c
index ffcd25c4908c..cf33c2998bdd 100644
--- a/net/mac80211/sta_info.c
+++ b/net/mac80211/sta_info.c
@@ -388,6 +388,7 @@ struct sta_info *sta_info_alloc(struct ieee80211_sub_if_data *sdata,
         for (i = 0; i < IEEE80211_NUM_ACS; i++) {
                 skb_queue_head_init(&sta->ps_tx_buf[i]);
                 skb_queue_head_init(&sta->tx_filtered[i]);
+                sta->airtime_deficit[i] = IEEE80211_AIRTIME_QUANTUM;
         }
 
         for (i = 0; i < IEEE80211_NUM_TIDS; i++)
diff --git a/net/mac80211/sta_info.h b/net/mac80211/sta_info.h
index a35c964f6217..43eb4b3da12a 100644
--- a/net/mac80211/sta_info.h
+++ b/net/mac80211/sta_info.h
@@ -548,6 +548,13 @@ struct sta_info {
         } tx_stats;
         u16 tid_seq[IEEE80211_QOS_CTL_TID_MASK + 1];
 
+        /* Airtime stats and deficit, protected by lock */
+        struct {
+                u64 rx_airtime;
+                u64 tx_airtime;
+        } airtime_stats;
+        s64 airtime_deficit[IEEE80211_NUM_ACS];
+
         /*
          * Aggregation information, locked with lock.
          */
diff --git a/net/mac80211/status.c b/net/mac80211/status.c
index da7427a41529..5998004401f6 100644
--- a/net/mac80211/status.c
+++ b/net/mac80211/status.c
@@ -823,6 +823,14 @@ static void __ieee80211_tx_status(struct ieee80211_hw *hw,
                         ieee80211_lost_packet(sta, info);
                 }
         }
+
+        if (info->status.tx_time) {
+                int ac = skb_get_queue_mapping(skb);
+                spin_lock_bh(&sta->lock);
+                sta->airtime_stats.tx_airtime += info->status.tx_time;
+                sta->airtime_deficit[ac] -= info->status.tx_time;
+                spin_unlock_bh(&sta->lock);
+        }
 }
 
 /* SNMP counters
diff --git a/net/mac80211/tx.c b/net/mac80211/tx.c
index 94826680cf2b..4b11d199bf8e 100644
--- a/net/mac80211/tx.c
+++ b/net/mac80211/tx.c
@@ -3410,6 +3410,7 @@ struct sk_buff *ieee80211_tx_dequeue(struct ieee80211_hw *hw,
         struct ieee80211_tx_data tx;
         ieee80211_tx_result r;
         struct ieee80211_vif *vif;
+        struct sta_info *sta = NULL;
 
         spin_lock_bh(&fq->lock);
 
@@ -3421,6 +3422,18 @@ struct sk_buff *ieee80211_tx_dequeue(struct ieee80211_hw *hw,
         if (skb)
                 goto out;
 
+        if (txq->sta) {
+                sta = container_of(txq->sta, struct sta_info, sta);
+                spin_lock_bh(&sta->lock);
+                if (sta->airtime_deficit[txq->ac] <= 0) {
+                        sta->airtime_deficit[txq->ac] += IEEE80211_AIRTIME_QUANTUM;
+                        skb = ERR_PTR(-EAGAIN);
+                        spin_unlock_bh(&sta->lock);
+                        goto out;
+                }
+                spin_unlock_bh(&sta->lock);
+        }
+
 begin:
         skb = fq_tin_dequeue(fq, tin, fq_tin_dequeue_func);
         if (!skb)
@@ -3434,9 +3447,7 @@ struct sk_buff *ieee80211_tx_dequeue(struct ieee80211_hw *hw,
         tx.local = local;
         tx.skb = skb;
         tx.sdata = vif_to_sdata(info->control.vif);
-
-        if (txq->sta)
-                tx.sta = container_of(txq->sta, struct sta_info, sta);
+        tx.sta = sta;
 
         /*
          * The key can be removed while the packet was queued, so need to call
This adds accounting of each station's airtime usage to mac80211. On
the RX side, airtime is calculated from the packet length and duration.
On the TX side, the driver has to fill in status.tx_time with the
packet send duration.

When a station's airtime deficit runs into the negative, no more
packets will be returned from ieee80211_tx_dequeue(); instead it
returns ERR_PTR(-EAGAIN). The driver is expected to handle this and to
keep scheduling such a station. It is assumed that the driver schedules
active TXQs in a round-robin manner.

This is still incomplete. Missing:

- The actual calculation of RX airtime
- Updates of the drivers to check for the ERR_PTR and set tx_time

Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
---
 net/mac80211/ieee80211_i.h |  3 +++
 net/mac80211/rx.c          | 30 ++++++++++++++++++++++++++++++
 net/mac80211/sta_info.c    |  1 +
 net/mac80211/sta_info.h    |  7 +++++++
 net/mac80211/status.c      |  8 ++++++++
 net/mac80211/tx.c          | 17 ++++++++++++++---
 6 files changed, 63 insertions(+), 3 deletions(-)