diff mbox series

[RFC/RFT,4/4] mac80211: Apply Airtime-based Queue Limit (AQL) on packet dequeue

Message ID 156889576869.191202.510507546538322707.stgit@alrua-x1 (mailing list archive)
State RFC
Delegated to: Johannes Berg
Headers show
Series Add Airtime Queue Limits (AQL) to mac80211 | expand

Commit Message

Toke Høiland-Jørgensen Sept. 19, 2019, 12:22 p.m. UTC
From: Toke Høiland-Jørgensen <toke@redhat.com>

Some devices have deep buffers in firmware and/or hardware which prevents
the FQ structure in mac80211 from effectively limiting bufferbloat on the
link. For Ethernet devices we have BQL to limit the lower-level queues, but
this cannot be applied to mac80211 because transmit rates can vary wildly
between packets depending on which station we are transmitting it to.

To overcome this, we can use airtime-based queue limiting (AQL), where we
estimate the transmission time for each packet before dequeueing it, and
use that to limit the amount of data in-flight to the hardware. This idea
was originally implemented as part of the out-of-tree airtime fairness
patch to ath10k[0] in chromiumos.

This patch ports that idea over to mac80211. The basic idea is simple
enough: Whenever we dequeue a packet from the TXQs and send it to the
driver, we estimate its airtime usage, based on the last recorded TX rate
of the station that packet is destined for. We keep a running per-AC total
of airtime queued for the whole device, and when that total climbs above 8
ms' worth of data (corresponding to two maximum-sized aggregates), we
simply throttle the queues until it drops down again.

The estimated airtime for each skb is stored in the tx_info, so we can
subtract the same amount from the running total when the skb is freed or
recycled. The throttling mechanism relies on this accounting to be
accurate (i.e., that we are not freeing skbs without subtracting any
airtime they were accounted for), so we put the subtraction into
ieee80211_report_used_skb().

This patch does *not* include any mechanism to wake a throttled TXQ again,
on the assumption that this will happen anyway as a side effect of whatever
freed the skb (most commonly a TX completion).

The throttling mechanism only kicks in if the queued airtime total goes
above the limit. Since mac80211 calculates the time based on the reported
last_tx_time from the driver, the whole throttling mechanism only kicks in
for drivers that actually report this value. With the exception of
multicast, where we always calculate an estimated tx time on the assumption
that multicast is transmitted at the lowest (6 Mbps) rate.

The throttling added in this patch is in addition to any throttling already
performed by the airtime fairness mechanism, and in principle the two
mechanisms are orthogonal (and currently also uses two different sources of
airtime). In the future, we could amend this, using the airtime estimates
calculated by this mechanism as a fallback input to the airtime fairness
scheduler, to enable airtime fairness even on drivers that don't have a
hardware source of airtime usage for each station.

[0] https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/588190/13/drivers/net/wireless-4.2/ath/ath10k/mac.c#3845

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
---
 net/mac80211/debugfs.c     |   24 ++++++++++++++++++++++++
 net/mac80211/ieee80211_i.h |    7 +++++++
 net/mac80211/status.c      |   22 ++++++++++++++++++++++
 net/mac80211/tx.c          |   38 +++++++++++++++++++++++++++++++++++++-
 4 files changed, 90 insertions(+), 1 deletion(-)

Comments

Peter Oh Sept. 19, 2019, 5:44 p.m. UTC | #1
On 9/19/19 5:22 AM, Toke Høiland-Jørgensen wrote:
> From: Toke Høiland-Jørgensen <toke@redhat.com>
>
> This patch ports that idea over to mac80211. The basic idea is simple
> enough: Whenever we dequeue a packet from the TXQs and send it to the
> driver, we estimate its airtime usage, based on the last recorded TX rate
> of the station that packet is destined for.

The way to decide the last recorded TX rate could be vary among drivers. 
In terms of ath10k driver and FW, they use 4 PPDUs to update the Tx 
rate. Isn't it too small sampling number to be used for AQL?

Peter
Ben Greear Sept. 19, 2019, 5:46 p.m. UTC | #2
On 9/19/19 10:44 AM, Peter Oh wrote:
> On 9/19/19 5:22 AM, Toke Høiland-Jørgensen wrote:
>> From: Toke Høiland-Jørgensen <toke@redhat.com>
>>
>> This patch ports that idea over to mac80211. The basic idea is simple
>> enough: Whenever we dequeue a packet from the TXQs and send it to the
>> driver, we estimate its airtime usage, based on the last recorded TX rate
>> of the station that packet is destined for.
> 
> The way to decide the last recorded TX rate could be vary among drivers. In terms of ath10k driver and FW, they use 4 PPDUs to update the Tx rate. Isn't it too 
> small sampling number to be used for AQL?

Probably it is not exactly the last 4 either, since the report comes back indirectly and not
synchronized with the tx path?

Thanks,
Ben
Peter Oh Sept. 19, 2019, 5:54 p.m. UTC | #3
On 9/19/19 10:46 AM, Ben Greear wrote:
> On 9/19/19 10:44 AM, Peter Oh wrote:
>> On 9/19/19 5:22 AM, Toke Høiland-Jørgensen wrote:
>>> From: Toke Høiland-Jørgensen <toke@redhat.com>
>>>
>>> This patch ports that idea over to mac80211. The basic idea is simple
>>> enough: Whenever we dequeue a packet from the TXQs and send it to the
>>> driver, we estimate its airtime usage, based on the last recorded TX 
>>> rate
>>> of the station that packet is destined for.
>>
>> The way to decide the last recorded TX rate could be vary among 
>> drivers. In terms of ath10k driver and FW, they use 4 PPDUs to update 
>> the Tx rate. Isn't it too small sampling number to be used for AQL?
>
> Probably it is not exactly the last 4 either, since the report comes 
> back indirectly and not
> synchronized with the tx path?
>
The point of my question is "the last recorded Tx raith small nte is 
derived wumber of PPDUs and if it's ok to use it for AQL calculation or 
not".

Thanks,

Peter
Peter Oh Sept. 19, 2019, 6:03 p.m. UTC | #4
On 9/19/19 10:54 AM, Peter Oh wrote:
>
> On 9/19/19 10:46 AM, Ben Greear wrote:
>> On 9/19/19 10:44 AM, Peter Oh wrote:
>>> On 9/19/19 5:22 AM, Toke Høiland-Jørgensen wrote:
>>>> From: Toke Høiland-Jørgensen <toke@redhat.com>
>>>>
>>>> This patch ports that idea over to mac80211. The basic idea is simple
>>>> enough: Whenever we dequeue a packet from the TXQs and send it to the
>>>> driver, we estimate its airtime usage, based on the last recorded 
>>>> TX rate
>>>> of the station that packet is destined for.
>>>
>>> The way to decide the last recorded TX rate could be vary among 
>>> drivers. In terms of ath10k driver and FW, they use 4 PPDUs to 
>>> update the Tx rate. Isn't it too small sampling number to be used 
>>> for AQL?
>>
>> Probably it is not exactly the last 4 either, since the report comes 
>> back indirectly and not
>> synchronized with the tx path?
>>
The point of my question is "the last recorded Tx rate is derived from 
small number of PPDUs and if it's OK to use it for AQL calculation or not".

Thanks,

Peter
Toke Høiland-Jørgensen Sept. 19, 2019, 6:03 p.m. UTC | #5
Peter Oh <peter.oh@eero.com> writes:

> On 9/19/19 10:46 AM, Ben Greear wrote:
>> On 9/19/19 10:44 AM, Peter Oh wrote:
>>> On 9/19/19 5:22 AM, Toke Høiland-Jørgensen wrote:
>>>> From: Toke Høiland-Jørgensen <toke@redhat.com>
>>>>
>>>> This patch ports that idea over to mac80211. The basic idea is simple
>>>> enough: Whenever we dequeue a packet from the TXQs and send it to the
>>>> driver, we estimate its airtime usage, based on the last recorded TX 
>>>> rate
>>>> of the station that packet is destined for.
>>>
>>> The way to decide the last recorded TX rate could be vary among 
>>> drivers. In terms of ath10k driver and FW, they use 4 PPDUs to update 
>>> the Tx rate. Isn't it too small sampling number to be used for AQL?
>>
>> Probably it is not exactly the last 4 either, since the report comes 
>> back indirectly and not
>> synchronized with the tx path?
>>
> The point of my question is "the last recorded Tx raith small nte is
> derived wumber of PPDUs and if it's ok to use it for AQL calculation
> or not".

We're leaving a bit of slack in the system by limiting the buffering to
two aggregates' worth of buffering instead of just one. This is to
prevent starvation in case our estimate is off. In the other direction,
(i.e., if the rate drops suddenly), that will translate to more bloat
until the queue drains. Not much we can do about that; we can only work
with the data we have...

Still, the Google guys reported pretty good results using this method
for ath10k with their out of tree patch. So I think that in many cases,
doing this will be an improvement; obviously, it won't be perfect. But
it beats the 1000 pkt+ queue limit currently in (some versions of)
ath10k firmware.

In an ideal world, the firmware would enforce this minimum queueing
and throttle itself, but, well, sadly we don't live an ideal world...

-Toke
Dave Taht Sept. 19, 2019, 6:18 p.m. UTC | #6
For the record, this was the google report on their implementation in 3.18.

http://flent-newark.bufferbloat.net/~d/Airtime%20based%20queue%20limit%20for%20FQ_CoDel%20in%20wireless%20interface.pdf

The latency vs RvR graphs at the end were to die for.
John Yates Sept. 19, 2019, 8:09 p.m. UTC | #7
On Thu, Sep 19, 2019 at 2:18 PM Dave Taht <dave.taht@gmail.com> wrote:
>
> For the record, this was the google report on their implementation in 3.18.
>
> http://flent-newark.bufferbloat.net/~d/Airtime%20based%20queue%20limit%20for%20FQ_CoDel%20in%20wireless%20interface.pdf

From skimming that paper it sounds like this is shipping
in the current Google WiFi product.  Is that correct?

/john
Toke Høiland-Jørgensen Sept. 19, 2019, 8:15 p.m. UTC | #8
John Yates <john@yates-sheets.org> writes:

> On Thu, Sep 19, 2019 at 2:18 PM Dave Taht <dave.taht@gmail.com> wrote:
>>
>> For the record, this was the google report on their implementation in 3.18.
>>
>> http://flent-newark.bufferbloat.net/~d/Airtime%20based%20queue%20limit%20for%20FQ_CoDel%20in%20wireless%20interface.pdf
>
> From skimming that paper it sounds like this is shipping in the
> current Google WiFi product. Is that correct?

I believe so, yeah. The chromiumos patch tracker is here:
https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/588190/13

-Toke
Lorenzo Bianconi Sept. 20, 2019, 12:06 p.m. UTC | #9
> From: Toke Høiland-Jørgensen <toke@redhat.com>
> 
> Some devices have deep buffers in firmware and/or hardware which prevents
> the FQ structure in mac80211 from effectively limiting bufferbloat on the
> link. For Ethernet devices we have BQL to limit the lower-level queues, but
> this cannot be applied to mac80211 because transmit rates can vary wildly
> between packets depending on which station we are transmitting it to.
> 
> To overcome this, we can use airtime-based queue limiting (AQL), where we
> estimate the transmission time for each packet before dequeueing it, and
> use that to limit the amount of data in-flight to the hardware. This idea
> was originally implemented as part of the out-of-tree airtime fairness
> patch to ath10k[0] in chromiumos.
> 
> This patch ports that idea over to mac80211. The basic idea is simple
> enough: Whenever we dequeue a packet from the TXQs and send it to the
> driver, we estimate its airtime usage, based on the last recorded TX rate
> of the station that packet is destined for. We keep a running per-AC total
> of airtime queued for the whole device, and when that total climbs above 8
> ms' worth of data (corresponding to two maximum-sized aggregates), we
> simply throttle the queues until it drops down again.
> 
> The estimated airtime for each skb is stored in the tx_info, so we can
> subtract the same amount from the running total when the skb is freed or
> recycled. The throttling mechanism relies on this accounting to be
> accurate (i.e., that we are not freeing skbs without subtracting any
> airtime they were accounted for), so we put the subtraction into
> ieee80211_report_used_skb().
> 
> This patch does *not* include any mechanism to wake a throttled TXQ again,
> on the assumption that this will happen anyway as a side effect of whatever
> freed the skb (most commonly a TX completion).
> 
> The throttling mechanism only kicks in if the queued airtime total goes
> above the limit. Since mac80211 calculates the time based on the reported
> last_tx_time from the driver, the whole throttling mechanism only kicks in
> for drivers that actually report this value. With the exception of
> multicast, where we always calculate an estimated tx time on the assumption
> that multicast is transmitted at the lowest (6 Mbps) rate.
> 
> The throttling added in this patch is in addition to any throttling already
> performed by the airtime fairness mechanism, and in principle the two
> mechanisms are orthogonal (and currently also uses two different sources of
> airtime). In the future, we could amend this, using the airtime estimates
> calculated by this mechanism as a fallback input to the airtime fairness
> scheduler, to enable airtime fairness even on drivers that don't have a
> hardware source of airtime usage for each station.
> 
> [0] https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/588190/13/drivers/net/wireless-4.2/ath/ath10k/mac.c#3845
> 
> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
> ---
>  net/mac80211/debugfs.c     |   24 ++++++++++++++++++++++++
>  net/mac80211/ieee80211_i.h |    7 +++++++
>  net/mac80211/status.c      |   22 ++++++++++++++++++++++
>  net/mac80211/tx.c          |   38 +++++++++++++++++++++++++++++++++++++-
>  4 files changed, 90 insertions(+), 1 deletion(-)

Hi Toke,

Thx a lot for working on this. Few comments inline.

Regards,
Lorenzo

> 
> diff --git a/net/mac80211/debugfs.c b/net/mac80211/debugfs.c
> index 568b3b276931..c846c6e7f3e3 100644
> --- a/net/mac80211/debugfs.c
> +++ b/net/mac80211/debugfs.c
> @@ -148,6 +148,29 @@ static const struct file_operations aqm_ops = {
>  	.llseek = default_llseek,
>  };
>  

[...]

> @@ -3581,8 +3591,19 @@ struct sk_buff *ieee80211_tx_dequeue(struct ieee80211_hw *hw,
>  	tx.skb = skb;
>  	tx.sdata = vif_to_sdata(info->control.vif);
>  
> -	if (txq->sta)
> +	pktlen = skb->len + 38;
> +	if (txq->sta) {
>  		tx.sta = container_of(txq->sta, struct sta_info, sta);
> +		if (tx.sta->last_tx_bitrate) {
> +			airtime = (pktlen * 8 * 1000 *
> +				   tx.sta->last_tx_bitrate_reciprocal) >> IEEE80211_RECIPROCAL_SHIFT;
> +			airtime += IEEE80211_AIRTIME_OVERHEAD;

Here we are not taking into account aggregation burst size (it is done in a
rough way in chromeos implementation) and tx retries. I have not carried out
any tests so far but I think IEEE80211_AIRTIME_OVERHEAD will led to a
significant airtime overestimation. Do you think this can be improved? (..I
agree this is not a perfect world, but .. :))

Moreover, can this approach be affected by some interrupt coalescing implemented
by the chipset?

> +		}
> +	} else {
> +		airtime = (pktlen * 8 * 1000 *
> +			   IEEE80211_AIRTIME_MINRATE_RECIPROCAL) >> IEEE80211_RECIPROCAL_SHIFT;
> +		airtime += IEEE80211_AIRTIME_OVERHEAD;
> +	}
>  
>  	/*
>  	 * The key can be removed while the packet was queued, so need to call
> @@ -3659,6 +3680,15 @@ struct sk_buff *ieee80211_tx_dequeue(struct ieee80211_hw *hw,
>  	}
>  
>  	IEEE80211_SKB_CB(skb)->control.vif = vif;
> +
> +	if (airtime) {
> +		info->control.tx_time_est = airtime;
> +
> +		spin_lock_bh(&local->active_txq_lock[ac]);
> +		local->airtime_queued[ac] += airtime;
> +		spin_unlock_bh(&local->active_txq_lock[ac]);
> +	}
> +
>  	return skb;
>  
>  out:
> @@ -3676,6 +3706,9 @@ struct ieee80211_txq *ieee80211_next_txq(struct ieee80211_hw *hw, u8 ac)
>  
>  	spin_lock_bh(&local->active_txq_lock[ac]);
>  
> +	if (local->airtime_queued[ac] > IEEE80211_AIRTIME_QUEUE_LIMIT)
> +		goto out;
> +
>   begin:
>  	txqi = list_first_entry_or_null(&local->active_txqs[ac],
>  					struct txq_info,
> @@ -3753,6 +3786,9 @@ bool ieee80211_txq_may_transmit(struct ieee80211_hw *hw,
>  
>  	spin_lock_bh(&local->active_txq_lock[ac]);
>  
> +	if (local->airtime_queued[ac] > IEEE80211_AIRTIME_QUEUE_LIMIT)
> +		goto out;
> +
>  	if (!txqi->txq.sta)
>  		goto out;
>  
>
Toke Høiland-Jørgensen Sept. 20, 2019, 12:54 p.m. UTC | #10
Lorenzo Bianconi <lorenzo@kernel.org> writes:

>> From: Toke Høiland-Jørgensen <toke@redhat.com>
>> 
>> Some devices have deep buffers in firmware and/or hardware which prevents
>> the FQ structure in mac80211 from effectively limiting bufferbloat on the
>> link. For Ethernet devices we have BQL to limit the lower-level queues, but
>> this cannot be applied to mac80211 because transmit rates can vary wildly
>> between packets depending on which station we are transmitting it to.
>> 
>> To overcome this, we can use airtime-based queue limiting (AQL), where we
>> estimate the transmission time for each packet before dequeueing it, and
>> use that to limit the amount of data in-flight to the hardware. This idea
>> was originally implemented as part of the out-of-tree airtime fairness
>> patch to ath10k[0] in chromiumos.
>> 
>> This patch ports that idea over to mac80211. The basic idea is simple
>> enough: Whenever we dequeue a packet from the TXQs and send it to the
>> driver, we estimate its airtime usage, based on the last recorded TX rate
>> of the station that packet is destined for. We keep a running per-AC total
>> of airtime queued for the whole device, and when that total climbs above 8
>> ms' worth of data (corresponding to two maximum-sized aggregates), we
>> simply throttle the queues until it drops down again.
>> 
>> The estimated airtime for each skb is stored in the tx_info, so we can
>> subtract the same amount from the running total when the skb is freed or
>> recycled. The throttling mechanism relies on this accounting to be
>> accurate (i.e., that we are not freeing skbs without subtracting any
>> airtime they were accounted for), so we put the subtraction into
>> ieee80211_report_used_skb().
>> 
>> This patch does *not* include any mechanism to wake a throttled TXQ again,
>> on the assumption that this will happen anyway as a side effect of whatever
>> freed the skb (most commonly a TX completion).
>> 
>> The throttling mechanism only kicks in if the queued airtime total goes
>> above the limit. Since mac80211 calculates the time based on the reported
>> last_tx_time from the driver, the whole throttling mechanism only kicks in
>> for drivers that actually report this value. With the exception of
>> multicast, where we always calculate an estimated tx time on the assumption
>> that multicast is transmitted at the lowest (6 Mbps) rate.
>> 
>> The throttling added in this patch is in addition to any throttling already
>> performed by the airtime fairness mechanism, and in principle the two
>> mechanisms are orthogonal (and currently also uses two different sources of
>> airtime). In the future, we could amend this, using the airtime estimates
>> calculated by this mechanism as a fallback input to the airtime fairness
>> scheduler, to enable airtime fairness even on drivers that don't have a
>> hardware source of airtime usage for each station.
>> 
>> [0] https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/588190/13/drivers/net/wireless-4.2/ath/ath10k/mac.c#3845
>> 
>> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
>> ---
>>  net/mac80211/debugfs.c     |   24 ++++++++++++++++++++++++
>>  net/mac80211/ieee80211_i.h |    7 +++++++
>>  net/mac80211/status.c      |   22 ++++++++++++++++++++++
>>  net/mac80211/tx.c          |   38 +++++++++++++++++++++++++++++++++++++-
>>  4 files changed, 90 insertions(+), 1 deletion(-)
>
> Hi Toke,
>
> Thx a lot for working on this. Few comments inline.
>
> Regards,
> Lorenzo
>
>> 
>> diff --git a/net/mac80211/debugfs.c b/net/mac80211/debugfs.c
>> index 568b3b276931..c846c6e7f3e3 100644
>> --- a/net/mac80211/debugfs.c
>> +++ b/net/mac80211/debugfs.c
>> @@ -148,6 +148,29 @@ static const struct file_operations aqm_ops = {
>>  	.llseek = default_llseek,
>>  };
>>  
>
> [...]
>
>> @@ -3581,8 +3591,19 @@ struct sk_buff *ieee80211_tx_dequeue(struct ieee80211_hw *hw,
>>  	tx.skb = skb;
>>  	tx.sdata = vif_to_sdata(info->control.vif);
>>  
>> -	if (txq->sta)
>> +	pktlen = skb->len + 38;
>> +	if (txq->sta) {
>>  		tx.sta = container_of(txq->sta, struct sta_info, sta);
>> +		if (tx.sta->last_tx_bitrate) {
>> +			airtime = (pktlen * 8 * 1000 *
>> +				   tx.sta->last_tx_bitrate_reciprocal) >> IEEE80211_RECIPROCAL_SHIFT;
>> +			airtime += IEEE80211_AIRTIME_OVERHEAD;
>
> Here we are not taking into account aggregation burst size (it is done
> in a rough way in chromeos implementation) and tx retries. I have not
> carried out any tests so far but I think IEEE80211_AIRTIME_OVERHEAD
> will led to a significant airtime overestimation. Do you think this
> can be improved? (..I agree this is not a perfect world, but .. :))

Hmm, yeah, looking at this again, the way I'm going this now, I should
probably have used the low 16-us IFS overhead for every packet.

I guess we could do something similar to what the chromeos thing is
doing. I.e., adding a single "large" overhead value when we think the
packet is the first of a burst, and using the smaller value for the
rest.

One approach could be to couple the switch to the "scheduling rounds" we
already have. I.e., first packet after a call to
ieee8021_txq_schedule_start() will get the 100-us overhead, and every
subsequent one will get the low one. Not sure how this will fit with
what the driver actually does, though, so I guess some experimentation
is in order.

Ultimately,  I'm not sure it matters that much whether occasionally add
80 us extra to the estimate. But as you say, adding 100 us to every
packet is probably a bit much ;)

> Moreover, can this approach be affected by some interrupt coalescing
> implemented by the chipset?

Probably? Ultimately we don't really know what exactly the chipset is
doing, so we're guessing here, no?

-Toke
Lorenzo Bianconi Sept. 20, 2019, 1:06 p.m. UTC | #11
> Lorenzo Bianconi <lorenzo@kernel.org> writes:
> 
> >> From: Toke Høiland-Jørgensen <toke@redhat.com>
> >> 
> >> Some devices have deep buffers in firmware and/or hardware which prevents
> >> the FQ structure in mac80211 from effectively limiting bufferbloat on the
> >> link. For Ethernet devices we have BQL to limit the lower-level queues, but
> >> this cannot be applied to mac80211 because transmit rates can vary wildly
> >> between packets depending on which station we are transmitting it to.
> >> 
> >> To overcome this, we can use airtime-based queue limiting (AQL), where we
> >> estimate the transmission time for each packet before dequeueing it, and
> >> use that to limit the amount of data in-flight to the hardware. This idea
> >> was originally implemented as part of the out-of-tree airtime fairness
> >> patch to ath10k[0] in chromiumos.
> >> 
> >> This patch ports that idea over to mac80211. The basic idea is simple
> >> enough: Whenever we dequeue a packet from the TXQs and send it to the
> >> driver, we estimate its airtime usage, based on the last recorded TX rate
> >> of the station that packet is destined for. We keep a running per-AC total
> >> of airtime queued for the whole device, and when that total climbs above 8
> >> ms' worth of data (corresponding to two maximum-sized aggregates), we
> >> simply throttle the queues until it drops down again.
> >> 
> >> The estimated airtime for each skb is stored in the tx_info, so we can
> >> subtract the same amount from the running total when the skb is freed or
> >> recycled. The throttling mechanism relies on this accounting to be
> >> accurate (i.e., that we are not freeing skbs without subtracting any
> >> airtime they were accounted for), so we put the subtraction into
> >> ieee80211_report_used_skb().
> >> 
> >> This patch does *not* include any mechanism to wake a throttled TXQ again,
> >> on the assumption that this will happen anyway as a side effect of whatever
> >> freed the skb (most commonly a TX completion).
> >> 
> >> The throttling mechanism only kicks in if the queued airtime total goes
> >> above the limit. Since mac80211 calculates the time based on the reported
> >> last_tx_time from the driver, the whole throttling mechanism only kicks in
> >> for drivers that actually report this value. With the exception of
> >> multicast, where we always calculate an estimated tx time on the assumption
> >> that multicast is transmitted at the lowest (6 Mbps) rate.
> >> 
> >> The throttling added in this patch is in addition to any throttling already
> >> performed by the airtime fairness mechanism, and in principle the two
> >> mechanisms are orthogonal (and currently also uses two different sources of
> >> airtime). In the future, we could amend this, using the airtime estimates
> >> calculated by this mechanism as a fallback input to the airtime fairness
> >> scheduler, to enable airtime fairness even on drivers that don't have a
> >> hardware source of airtime usage for each station.
> >> 
> >> [0] https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/588190/13/drivers/net/wireless-4.2/ath/ath10k/mac.c#3845
> >> 
> >> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
> >> ---
> >>  net/mac80211/debugfs.c     |   24 ++++++++++++++++++++++++
> >>  net/mac80211/ieee80211_i.h |    7 +++++++
> >>  net/mac80211/status.c      |   22 ++++++++++++++++++++++
> >>  net/mac80211/tx.c          |   38 +++++++++++++++++++++++++++++++++++++-
> >>  4 files changed, 90 insertions(+), 1 deletion(-)
> >
> > Hi Toke,
> >
> > Thx a lot for working on this. Few comments inline.
> >
> > Regards,
> > Lorenzo
> >
> >> 
> >> diff --git a/net/mac80211/debugfs.c b/net/mac80211/debugfs.c
> >> index 568b3b276931..c846c6e7f3e3 100644
> >> --- a/net/mac80211/debugfs.c
> >> +++ b/net/mac80211/debugfs.c
> >> @@ -148,6 +148,29 @@ static const struct file_operations aqm_ops = {
> >>  	.llseek = default_llseek,
> >>  };
> >>  
> >
> > [...]
> >
> >> @@ -3581,8 +3591,19 @@ struct sk_buff *ieee80211_tx_dequeue(struct ieee80211_hw *hw,
> >>  	tx.skb = skb;
> >>  	tx.sdata = vif_to_sdata(info->control.vif);
> >>  
> >> -	if (txq->sta)
> >> +	pktlen = skb->len + 38;
> >> +	if (txq->sta) {
> >>  		tx.sta = container_of(txq->sta, struct sta_info, sta);
> >> +		if (tx.sta->last_tx_bitrate) {
> >> +			airtime = (pktlen * 8 * 1000 *
> >> +				   tx.sta->last_tx_bitrate_reciprocal) >> IEEE80211_RECIPROCAL_SHIFT;
> >> +			airtime += IEEE80211_AIRTIME_OVERHEAD;
> >
> > Here we are not taking into account aggregation burst size (it is done
> > in a rough way in chromeos implementation) and tx retries. I have not
> > carried out any tests so far but I think IEEE80211_AIRTIME_OVERHEAD
> > will led to a significant airtime overestimation. Do you think this
> > can be improved? (..I agree this is not a perfect world, but .. :))
> 
> Hmm, yeah, looking at this again, the way I'm going this now, I should
> probably have used the low 16-us IFS overhead for every packet.
> 
> I guess we could do something similar to what the chromeos thing is
> doing. I.e., adding a single "large" overhead value when we think the
> packet is the first of a burst, and using the smaller value for the
> rest.
> 
> One approach could be to couple the switch to the "scheduling rounds" we
> already have. I.e., first packet after a call to
> ieee8021_txq_schedule_start() will get the 100-us overhead, and every
> subsequent one will get the low one. Not sure how this will fit with
> what the driver actually does, though, so I guess some experimentation
> is in order.
> 
> Ultimately,  I'm not sure it matters that much whether occasionally add
> 80 us extra to the estimate. But as you say, adding 100 us to every
> packet is probably a bit much ;)

Would it be possible to use the previous tx airtime reported by the driver?
(not sure if it is feasible). Some drivers can report airtime compute in hw,
the issue is it can be no not linked to the given skb or aggregation burst, so
we should take into account burst size

> 
> > Moreover, can this approach be affected by some interrupt coalescing
> > implemented by the chipset?
> 
> Probably? Ultimately we don't really know what exactly the chipset is
> doing, so we're guessing here, no?

Here I mean if the hw relies on a 1:n tx interrupt/packet ratio (I guess most
driver do), it would probably affect throughput, right? (e.g TCP)

Regards,
Lorenzo

> 
> -Toke
>
Toke Høiland-Jørgensen Sept. 20, 2019, 1:32 p.m. UTC | #12
Lorenzo Bianconi <lorenzo@kernel.org> writes:

>> Lorenzo Bianconi <lorenzo@kernel.org> writes:
>> 
>> >> From: Toke Høiland-Jørgensen <toke@redhat.com>
>> >> 
>> >> Some devices have deep buffers in firmware and/or hardware which prevents
>> >> the FQ structure in mac80211 from effectively limiting bufferbloat on the
>> >> link. For Ethernet devices we have BQL to limit the lower-level queues, but
>> >> this cannot be applied to mac80211 because transmit rates can vary wildly
>> >> between packets depending on which station we are transmitting it to.
>> >> 
>> >> To overcome this, we can use airtime-based queue limiting (AQL), where we
>> >> estimate the transmission time for each packet before dequeueing it, and
>> >> use that to limit the amount of data in-flight to the hardware. This idea
>> >> was originally implemented as part of the out-of-tree airtime fairness
>> >> patch to ath10k[0] in chromiumos.
>> >> 
>> >> This patch ports that idea over to mac80211. The basic idea is simple
>> >> enough: Whenever we dequeue a packet from the TXQs and send it to the
>> >> driver, we estimate its airtime usage, based on the last recorded TX rate
>> >> of the station that packet is destined for. We keep a running per-AC total
>> >> of airtime queued for the whole device, and when that total climbs above 8
>> >> ms' worth of data (corresponding to two maximum-sized aggregates), we
>> >> simply throttle the queues until it drops down again.
>> >> 
>> >> The estimated airtime for each skb is stored in the tx_info, so we can
>> >> subtract the same amount from the running total when the skb is freed or
>> >> recycled. The throttling mechanism relies on this accounting to be
>> >> accurate (i.e., that we are not freeing skbs without subtracting any
>> >> airtime they were accounted for), so we put the subtraction into
>> >> ieee80211_report_used_skb().
>> >> 
>> >> This patch does *not* include any mechanism to wake a throttled TXQ again,
>> >> on the assumption that this will happen anyway as a side effect of whatever
>> >> freed the skb (most commonly a TX completion).
>> >> 
>> >> The throttling mechanism only kicks in if the queued airtime total goes
>> >> above the limit. Since mac80211 calculates the time based on the reported
>> >> last_tx_time from the driver, the whole throttling mechanism only kicks in
>> >> for drivers that actually report this value. With the exception of
>> >> multicast, where we always calculate an estimated tx time on the assumption
>> >> that multicast is transmitted at the lowest (6 Mbps) rate.
>> >> 
>> >> The throttling added in this patch is in addition to any throttling already
>> >> performed by the airtime fairness mechanism, and in principle the two
>> >> mechanisms are orthogonal (and currently also uses two different sources of
>> >> airtime). In the future, we could amend this, using the airtime estimates
>> >> calculated by this mechanism as a fallback input to the airtime fairness
>> >> scheduler, to enable airtime fairness even on drivers that don't have a
>> >> hardware source of airtime usage for each station.
>> >> 
>> >> [0] https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/588190/13/drivers/net/wireless-4.2/ath/ath10k/mac.c#3845
>> >> 
>> >> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
>> >> ---
>> >>  net/mac80211/debugfs.c     |   24 ++++++++++++++++++++++++
>> >>  net/mac80211/ieee80211_i.h |    7 +++++++
>> >>  net/mac80211/status.c      |   22 ++++++++++++++++++++++
>> >>  net/mac80211/tx.c          |   38 +++++++++++++++++++++++++++++++++++++-
>> >>  4 files changed, 90 insertions(+), 1 deletion(-)
>> >
>> > Hi Toke,
>> >
>> > Thx a lot for working on this. Few comments inline.
>> >
>> > Regards,
>> > Lorenzo
>> >
>> >> 
>> >> diff --git a/net/mac80211/debugfs.c b/net/mac80211/debugfs.c
>> >> index 568b3b276931..c846c6e7f3e3 100644
>> >> --- a/net/mac80211/debugfs.c
>> >> +++ b/net/mac80211/debugfs.c
>> >> @@ -148,6 +148,29 @@ static const struct file_operations aqm_ops = {
>> >>  	.llseek = default_llseek,
>> >>  };
>> >>  
>> >
>> > [...]
>> >
>> >> @@ -3581,8 +3591,19 @@ struct sk_buff *ieee80211_tx_dequeue(struct ieee80211_hw *hw,
>> >>  	tx.skb = skb;
>> >>  	tx.sdata = vif_to_sdata(info->control.vif);
>> >>  
>> >> -	if (txq->sta)
>> >> +	pktlen = skb->len + 38;
>> >> +	if (txq->sta) {
>> >>  		tx.sta = container_of(txq->sta, struct sta_info, sta);
>> >> +		if (tx.sta->last_tx_bitrate) {
>> >> +			airtime = (pktlen * 8 * 1000 *
>> >> +				   tx.sta->last_tx_bitrate_reciprocal) >> IEEE80211_RECIPROCAL_SHIFT;
>> >> +			airtime += IEEE80211_AIRTIME_OVERHEAD;
>> >
>> > Here we are not taking into account aggregation burst size (it is done
>> > in a rough way in chromeos implementation) and tx retries. I have not
>> > carried out any tests so far but I think IEEE80211_AIRTIME_OVERHEAD
>> > will led to a significant airtime overestimation. Do you think this
>> > can be improved? (..I agree this is not a perfect world, but .. :))
>> 
>> Hmm, yeah, looking at this again, the way I'm going this now, I should
>> probably have used the low 16-us IFS overhead for every packet.
>> 
>> I guess we could do something similar to what the chromeos thing is
>> doing. I.e., adding a single "large" overhead value when we think the
>> packet is the first of a burst, and using the smaller value for the
>> rest.
>> 
>> One approach could be to couple the switch to the "scheduling rounds" we
>> already have. I.e., first packet after a call to
>> ieee8021_txq_schedule_start() will get the 100-us overhead, and every
>> subsequent one will get the low one. Not sure how this will fit with
>> what the driver actually does, though, so I guess some experimentation
>> is in order.
>> 
>> Ultimately,  I'm not sure it matters that much whether occasionally add
>> 80 us extra to the estimate. But as you say, adding 100 us to every
>> packet is probably a bit much ;)
>
> Would it be possible to use the previous tx airtime reported by the
> driver? (not sure if it is feasible). Some drivers can report airtime
> compute in hw, the issue is it can be no not linked to the given skb
> or aggregation burst, so we should take into account burst size

That's what we do for the fairness scheduler. And yeah, if the HW can
report after-the-fact airtime usage that is bound to be more accurate,
so I think we should keep using that for fairness.

But for this AQL thing, we really need it ahead of time. However, I
don't think it's as important that it is super accurate. As long as we
have a reasonable estimate I think we'll be fine. We can solve any
inaccuracies by fiddling with the limit, I think. Similar to what BQL
does; dynamically adjusting it up and down.

So for a first pass, we can just err on the side of having the limit
higher, and then iterate from there.

>> > Moreover, can this approach be affected by some interrupt coalescing
>> > implemented by the chipset?
>> 
>> Probably? Ultimately we don't really know what exactly the chipset is
>> doing, so we're guessing here, no?
>
> Here I mean if the hw relies on a 1:n tx interrupt/packet ratio (I
> guess most driver do), it would probably affect throughput, right?
> (e.g TCP)

Yeah, this is what I alluded to above: If we set the limit too low, were
are going to kill TCP throughput. Ideally, we want the limit to be as
low as we can get it without hurting TCP (too much), but no lower. Just
doing the conversion to airtime is a way to achieve this: This will
scale the actual queue length with the achievable throughput as long as
the tx rate estimate is reasonably accurate. If needed, we can add
another layer of dynamic tuning on top using the existing BQL logic; but
I'd like to get the basic case working first...

-Toke
Kan Yan Sept. 20, 2019, 10:55 p.m. UTC | #13
Hi Toke,

There is an updated version of AQL in the chromiumos tree implemented
in the mac80211 driver, instead of in the ath10k driver as the
original version:
https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/1703105/7
https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/1703106/6

It is based on a more recent kernel (4.14) and integrated with the
airtime fairness tx scheduler in mac80211. This version has been
tested rather extensively.  I intended to use it as the basis for my
effort to bring AQL upstream, but get sidetracked by other things. I
can clean it up and send a patchset next week if you think that is the
right path. Sorry for the long delay and slack off on the upstream
effort.

There is some concern in this thread regarding the accuracy of the
estimated airtime using the last reported TX rate. It is indeed a
rather crude method and did not include retries in the calculation.
Besides, there are lags between firmware changing rate and host driver
get the rate update. The 16us IFS overhead is only correct for 5G and
it is actually 10us for 2.4 G. However, that hardly matters. The goal
of AQL is to prevent the firmware/hardware queue from getting bloated
or starved. There is a lot of headroom in the queue length limit (8-10
ms) to tolerate inaccuracy in the estimate airtime. AQL doesn't
control the fine grained TX packet scheduling. It is handled by the
airtime fairness scheduler and ultimately firmware.

There are two TX airtimes in the newer version (chromiumos 4.14
kernel): The estimated airtime for frames pending in the queue and the
airtime reported by the firmware for the frame transmitted, which
should be accurate as the firmware supposed to take retries and
aggregation into account. The airtime fairness scheduler that does the
TX packet scheduling should use the TX airtime reported by the
firmware. That's the reason why the original implementation in the
ChromiumOS tree tries to factor in aggregation size when estimate the
airtime overhead and the later version doesn't even bother with that.

Regards,
Kan


On Fri, Sep 20, 2019 at 6:32 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Lorenzo Bianconi <lorenzo@kernel.org> writes:
>
> >> Lorenzo Bianconi <lorenzo@kernel.org> writes:
> >>
> >> >> From: Toke Høiland-Jørgensen <toke@redhat.com>
> >> >>
> >> >> Some devices have deep buffers in firmware and/or hardware which prevents
> >> >> the FQ structure in mac80211 from effectively limiting bufferbloat on the
> >> >> link. For Ethernet devices we have BQL to limit the lower-level queues, but
> >> >> this cannot be applied to mac80211 because transmit rates can vary wildly
> >> >> between packets depending on which station we are transmitting it to.
> >> >>
> >> >> To overcome this, we can use airtime-based queue limiting (AQL), where we
> >> >> estimate the transmission time for each packet before dequeueing it, and
> >> >> use that to limit the amount of data in-flight to the hardware. This idea
> >> >> was originally implemented as part of the out-of-tree airtime fairness
> >> >> patch to ath10k[0] in chromiumos.
> >> >>
> >> >> This patch ports that idea over to mac80211. The basic idea is simple
> >> >> enough: Whenever we dequeue a packet from the TXQs and send it to the
> >> >> driver, we estimate its airtime usage, based on the last recorded TX rate
> >> >> of the station that packet is destined for. We keep a running per-AC total
> >> >> of airtime queued for the whole device, and when that total climbs above 8
> >> >> ms' worth of data (corresponding to two maximum-sized aggregates), we
> >> >> simply throttle the queues until it drops down again.
> >> >>
> >> >> The estimated airtime for each skb is stored in the tx_info, so we can
> >> >> subtract the same amount from the running total when the skb is freed or
> >> >> recycled. The throttling mechanism relies on this accounting to be
> >> >> accurate (i.e., that we are not freeing skbs without subtracting any
> >> >> airtime they were accounted for), so we put the subtraction into
> >> >> ieee80211_report_used_skb().
> >> >>
> >> >> This patch does *not* include any mechanism to wake a throttled TXQ again,
> >> >> on the assumption that this will happen anyway as a side effect of whatever
> >> >> freed the skb (most commonly a TX completion).
> >> >>
> >> >> The throttling mechanism only kicks in if the queued airtime total goes
> >> >> above the limit. Since mac80211 calculates the time based on the reported
> >> >> last_tx_time from the driver, the whole throttling mechanism only kicks in
> >> >> for drivers that actually report this value. With the exception of
> >> >> multicast, where we always calculate an estimated tx time on the assumption
> >> >> that multicast is transmitted at the lowest (6 Mbps) rate.
> >> >>
> >> >> The throttling added in this patch is in addition to any throttling already
> >> >> performed by the airtime fairness mechanism, and in principle the two
> >> >> mechanisms are orthogonal (and currently also uses two different sources of
> >> >> airtime). In the future, we could amend this, using the airtime estimates
> >> >> calculated by this mechanism as a fallback input to the airtime fairness
> >> >> scheduler, to enable airtime fairness even on drivers that don't have a
> >> >> hardware source of airtime usage for each station.
> >> >>
> >> >> [0] https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/588190/13/drivers/net/wireless-4.2/ath/ath10k/mac.c#3845
> >> >>
> >> >> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
> >> >> ---
> >> >>  net/mac80211/debugfs.c     |   24 ++++++++++++++++++++++++
> >> >>  net/mac80211/ieee80211_i.h |    7 +++++++
> >> >>  net/mac80211/status.c      |   22 ++++++++++++++++++++++
> >> >>  net/mac80211/tx.c          |   38 +++++++++++++++++++++++++++++++++++++-
> >> >>  4 files changed, 90 insertions(+), 1 deletion(-)
> >> >
> >> > Hi Toke,
> >> >
> >> > Thx a lot for working on this. Few comments inline.
> >> >
> >> > Regards,
> >> > Lorenzo
> >> >
> >> >>
> >> >> diff --git a/net/mac80211/debugfs.c b/net/mac80211/debugfs.c
> >> >> index 568b3b276931..c846c6e7f3e3 100644
> >> >> --- a/net/mac80211/debugfs.c
> >> >> +++ b/net/mac80211/debugfs.c
> >> >> @@ -148,6 +148,29 @@ static const struct file_operations aqm_ops = {
> >> >>   .llseek = default_llseek,
> >> >>  };
> >> >>
> >> >
> >> > [...]
> >> >
> >> >> @@ -3581,8 +3591,19 @@ struct sk_buff *ieee80211_tx_dequeue(struct ieee80211_hw *hw,
> >> >>   tx.skb = skb;
> >> >>   tx.sdata = vif_to_sdata(info->control.vif);
> >> >>
> >> >> - if (txq->sta)
> >> >> + pktlen = skb->len + 38;
> >> >> + if (txq->sta) {
> >> >>           tx.sta = container_of(txq->sta, struct sta_info, sta);
> >> >> +         if (tx.sta->last_tx_bitrate) {
> >> >> +                 airtime = (pktlen * 8 * 1000 *
> >> >> +                            tx.sta->last_tx_bitrate_reciprocal) >> IEEE80211_RECIPROCAL_SHIFT;
> >> >> +                 airtime += IEEE80211_AIRTIME_OVERHEAD;
> >> >
> >> > Here we are not taking into account aggregation burst size (it is done
> >> > in a rough way in chromeos implementation) and tx retries. I have not
> >> > carried out any tests so far but I think IEEE80211_AIRTIME_OVERHEAD
> >> > will led to a significant airtime overestimation. Do you think this
> >> > can be improved? (..I agree this is not a perfect world, but .. :))
> >>
> >> Hmm, yeah, looking at this again, the way I'm going this now, I should
> >> probably have used the low 16-us IFS overhead for every packet.
> >>
> >> I guess we could do something similar to what the chromeos thing is
> >> doing. I.e., adding a single "large" overhead value when we think the
> >> packet is the first of a burst, and using the smaller value for the
> >> rest.
> >>
> >> One approach could be to couple the switch to the "scheduling rounds" we
> >> already have. I.e., first packet after a call to
> >> ieee8021_txq_schedule_start() will get the 100-us overhead, and every
> >> subsequent one will get the low one. Not sure how this will fit with
> >> what the driver actually does, though, so I guess some experimentation
> >> is in order.
> >>
> >> Ultimately,  I'm not sure it matters that much whether occasionally add
> >> 80 us extra to the estimate. But as you say, adding 100 us to every
> >> packet is probably a bit much ;)
> >
> > Would it be possible to use the previous tx airtime reported by the
> > driver? (not sure if it is feasible). Some drivers can report airtime
> > compute in hw, the issue is it can be no not linked to the given skb
> > or aggregation burst, so we should take into account burst size
>
> That's what we do for the fairness scheduler. And yeah, if the HW can
> report after-the-fact airtime usage that is bound to be more accurate,
> so I think we should keep using that for fairness.
>
> But for this AQL thing, we really need it ahead of time. However, I
> don't think it's as important that it is super accurate. As long as we
> have a reasonable estimate I think we'll be fine. We can solve any
> inaccuracies by fiddling with the limit, I think. Similar to what BQL
> does; dynamically adjusting it up and down.
>
> So for a first pass, we can just err on the side of having the limit
> higher, and then iterate from there.
>
> >> > Moreover, can this approach be affected by some interrupt coalescing
> >> > implemented by the chipset?
> >>
> >> Probably? Ultimately we don't really know what exactly the chipset is
> >> doing, so we're guessing here, no?
> >
> > Here I mean if the hw relies on a 1:n tx interrupt/packet ratio (I
> > guess most driver do), it would probably affect throughput, right?
> > (e.g TCP)
>
> Yeah, this is what I alluded to above: If we set the limit too low, were
> are going to kill TCP throughput. Ideally, we want the limit to be as
> low as we can get it without hurting TCP (too much), but no lower. Just
> doing the conversion to airtime is a way to achieve this: This will
> scale the actual queue length with the achievable throughput as long as
> the tx rate estimate is reasonably accurate. If needed, we can add
> another layer of dynamic tuning on top using the existing BQL logic; but
> I'd like to get the basic case working first...
>
> -Toke
>
Toke Høiland-Jørgensen Sept. 21, 2019, 12:11 p.m. UTC | #14
Kan Yan <kyan@google.com> writes:

> Hi Toke,
>
> There is an updated version of AQL in the chromiumos tree implemented
> in the mac80211 driver, instead of in the ath10k driver as the
> original version:
> https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/1703105/7
> https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/1703106/6

Ah, that's awesome! Thank you for brining this up :)

> It is based on a more recent kernel (4.14) and integrated with the
> airtime fairness tx scheduler in mac80211. This version has been
> tested rather extensively. I intended to use it as the basis for my
> effort to bring AQL upstream, but get sidetracked by other things. I
> can clean it up and send a patchset next week if you think that is the
> right path.

Yes, please do! AFAICT, the main difference is that your version keeps
the airtime calculation itself in the driver, while mine passes up the
rate and lets mac80211 do the calculation of airtime. Other than that,
the differences are minor, no?

I'm not actually sure which approach is best; I suspect doing all the
accounting in mac80211 will help with integrating this into drivers that
use minstrel; we can just add a hook in that and be done with it.
Whereas if the driver has to do the accounting, we would need to add
that to each driver (mt76, iwl(?)).

But of course, doing things in mac80211 depends on stuffing even more
stuff into the already overloaded cb field; and I'm not actually
entirely sure what I've done with that will actually work. WDYT?

In any case, if you post your series we'll have something to contrast
against, which I think will be useful to help us converge on something
we can all be happy with.

Of course we'll also have to eventually integrate this with the other
series that Yibo recently re-posted (the virtual time scheduler). I
think that will be relatively straight forward, except I'm not sure your
atomic patches will work when we also have to update the rbtree. Any
thoughts on that series in general?

> Sorry for the long delay and slack off on the upstream effort.

Hehe, no worries. I only posted this because Dave finally bugged me into
doing something about this at LPC. And hey, we're making progress now,
so that's good! :)

> There is some concern in this thread regarding the accuracy of the
> estimated airtime using the last reported TX rate. It is indeed a
> rather crude method and did not include retries in the calculation.
> Besides, there are lags between firmware changing rate and host driver
> get the rate update. The 16us IFS overhead is only correct for 5G and
> it is actually 10us for 2.4 G. However, that hardly matters. The goal
> of AQL is to prevent the firmware/hardware queue from getting bloated
> or starved. There is a lot of headroom in the queue length limit (8-10
> ms) to tolerate inaccuracy in the estimate airtime. AQL doesn't
> control the fine grained TX packet scheduling. It is handled by the
> airtime fairness scheduler and ultimately firmware.

Yeah, this was basically the point I was trying to make; this limit
doesn't need to be that accurate, we just need a rough estimate. If we
want to get the latency even lower later, we're better off fiddling with
the queue limit value than trying to improve the airtime estimate.

> There are two TX airtimes in the newer version (chromiumos 4.14
> kernel): The estimated airtime for frames pending in the queue and the
> airtime reported by the firmware for the frame transmitted, which
> should be accurate as the firmware supposed to take retries and
> aggregation into account. The airtime fairness scheduler that does the
> TX packet scheduling should use the TX airtime reported by the
> firmware. That's the reason why the original implementation in the
> ChromiumOS tree tries to factor in aggregation size when estimate the
> airtime overhead and the later version doesn't even bother with that.

Yup, makes sense. Looking at the version you linked to, though, it seems
you're calling ieee80211_sta_register_airtime() with the estimated value
as well? So are you double-accounting airtime, or are you adjusting for
the accurate values somewhere else I don't see in that series?

-Toke
Yibo Zhao Sept. 25, 2019, 7:37 a.m. UTC | #15
On 2019-09-19 20:22, Toke Høiland-Jørgensen wrote:
> From: Toke Høiland-Jørgensen <toke@redhat.com>
> 
> Some devices have deep buffers in firmware and/or hardware which 
> prevents
> the FQ structure in mac80211 from effectively limiting bufferbloat on 
> the
> link. For Ethernet devices we have BQL to limit the lower-level queues, 
> but
> this cannot be applied to mac80211 because transmit rates can vary 
> wildly
> between packets depending on which station we are transmitting it to.
> 
> To overcome this, we can use airtime-based queue limiting (AQL), where 
> we
> estimate the transmission time for each packet before dequeueing it, 
> and
> use that to limit the amount of data in-flight to the hardware. This 
> idea
> was originally implemented as part of the out-of-tree airtime fairness
> patch to ath10k[0] in chromiumos.
> 
> This patch ports that idea over to mac80211. The basic idea is simple
> enough: Whenever we dequeue a packet from the TXQs and send it to the
> driver, we estimate its airtime usage, based on the last recorded TX 
> rate
> of the station that packet is destined for. We keep a running per-AC 
> total
> of airtime queued for the whole device, and when that total climbs 
> above 8
> ms' worth of data (corresponding to two maximum-sized aggregates), we
> simply throttle the queues until it drops down again.
> 
> The estimated airtime for each skb is stored in the tx_info, so we can
> subtract the same amount from the running total when the skb is freed 
> or
> recycled. The throttling mechanism relies on this accounting to be
> accurate (i.e., that we are not freeing skbs without subtracting any
> airtime they were accounted for), so we put the subtraction into
> ieee80211_report_used_skb().
> 
> This patch does *not* include any mechanism to wake a throttled TXQ 
> again,
> on the assumption that this will happen anyway as a side effect of 
> whatever
> freed the skb (most commonly a TX completion).
> 
> The throttling mechanism only kicks in if the queued airtime total goes
> above the limit. Since mac80211 calculates the time based on the 
> reported
> last_tx_time from the driver, the whole throttling mechanism only kicks 
> in
> for drivers that actually report this value. With the exception of
> multicast, where we always calculate an estimated tx time on the 
> assumption
> that multicast is transmitted at the lowest (6 Mbps) rate.
> 
> The throttling added in this patch is in addition to any throttling 
> already
> performed by the airtime fairness mechanism, and in principle the two
> mechanisms are orthogonal (and currently also uses two different 
> sources of
> airtime). In the future, we could amend this, using the airtime 
> estimates
> calculated by this mechanism as a fallback input to the airtime 
> fairness
> scheduler, to enable airtime fairness even on drivers that don't have a
> hardware source of airtime usage for each station.
> 
So if it is going to work together with virtual time based mechanism in 
the future, the Tx criteria will be met both of below conditions,
        1. Lower than g_vt
        2. Lower than IEEE80211_AIRTIME_QUEUE_LIMIT

Are we going to maintain two kinds of airtime that one is from 
estimation and the other is basically from FW reporting?

Meanwhile, airtime_queued will also limit the situation that we only 
have a station for transmission. Not sure if the peak throughput will be 
impacted. I believe it may work fine with 11ac in chromiumos, how about 
11n and 11a?

Anyway, I think this approach will help to improve performance of the 
virtual time based mechanism since it makes packets buffered in host 
instead of FW's deep queue. We have observed 2-clients case with 
different ratio in TCP fails to maintain the ratio because the packets 
arriving at host get pushed to FW immediately and thus the airtime 
weight sum is 0 in most of time meaning no TXQ in the rbtree. For UDP, 
if we pump load more than the PHY rate, the ratio can be maintained 
beacuse the FW queue is full and packtes begin to be buffered in host 
making TXQs staying on the rbtree for most of time. However, TCP has its 
own flow control that we can not push enough load like UDP.

>  out:
> @@ -3676,6 +3706,9 @@ struct ieee80211_txq *ieee80211_next_txq(struct
> ieee80211_hw *hw, u8 ac)
> 
>  	spin_lock_bh(&local->active_txq_lock[ac]);
> 
> +	if (local->airtime_queued[ac] > IEEE80211_AIRTIME_QUEUE_LIMIT)
> +		goto out;
> +
>   begin:
>  	txqi = list_first_entry_or_null(&local->active_txqs[ac],
>  					struct txq_info,
> @@ -3753,6 +3786,9 @@ bool ieee80211_txq_may_transmit(struct 
> ieee80211_hw *hw,
> 
>  	spin_lock_bh(&local->active_txq_lock[ac]);
> 
> +	if (local->airtime_queued[ac] > IEEE80211_AIRTIME_QUEUE_LIMIT)
> +		goto out;
> +
>  	if (!txqi->txq.sta)
>  		goto out;
Toke Høiland-Jørgensen Sept. 25, 2019, 8:11 a.m. UTC | #16
Yibo Zhao <yiboz@codeaurora.org> writes:

> So if it is going to work together with virtual time based mechanism in 
> the future, the Tx criteria will be met both of below conditions,
>         1. Lower than g_vt
>         2. Lower than IEEE80211_AIRTIME_QUEUE_LIMIT

> Are we going to maintain two kinds of airtime that one is from 
> estimation and the other is basically from FW reporting?

Yes, that was my plan. For devices that don't have FW reporting of
airtime, we can fall back to the estimation; but if we do have FW
reporting that is most likely going to be more accurate, so better to
use that for fairness...

> Meanwhile, airtime_queued will also limit the situation that we only
> have a station for transmission. Not sure if the peak throughput will
> be impacted. I believe it may work fine with 11ac in chromiumos, how
> about 11n and 11a?

Well, we will need to test that, of course. But ath9k shows that it's
quite possible to run with quite shallow buffers, so with a bit of
tuning I think we should be fine. If anything, slower networks need
*fewer* packets queued in the firmware, and it's *easier* for the host
to keep up with transmission.

> Anyway, I think this approach will help to improve performance of the 
> virtual time based mechanism since it makes packets buffered in host 
> instead of FW's deep queue. We have observed 2-clients case with 
> different ratio in TCP fails to maintain the ratio because the packets 
> arriving at host get pushed to FW immediately and thus the airtime 
> weight sum is 0 in most of time meaning no TXQ in the rbtree. For UDP, 
> if we pump load more than the PHY rate, the ratio can be maintained 
> beacuse the FW queue is full and packtes begin to be buffered in host 
> making TXQs staying on the rbtree for most of time. However, TCP has its 
> own flow control that we can not push enough load like UDP.

Yes, fixing that is exactly the point of this series :)

-Toke
Yibo Zhao Sept. 25, 2019, 11:54 a.m. UTC | #17
On 2019-09-25 16:11, Toke Høiland-Jørgensen wrote:
> Yibo Zhao <yiboz@codeaurora.org> writes:
> 
>> So if it is going to work together with virtual time based mechanism 
>> in
>> the future, the Tx criteria will be met both of below conditions,
>>         1. Lower than g_vt
>>         2. Lower than IEEE80211_AIRTIME_QUEUE_LIMIT
> 
>> Are we going to maintain two kinds of airtime that one is from
>> estimation and the other is basically from FW reporting?
> 
> Yes, that was my plan. For devices that don't have FW reporting of
> airtime, we can fall back to the estimation; but if we do have FW
> reporting that is most likely going to be more accurate, so better to
> use that for fairness...

Do you mean we will use airtime reported by FW to calculate 
local->airtime_queued in case we have FW reporting airtime?

> 
>> Meanwhile, airtime_queued will also limit the situation that we only
>> have a station for transmission. Not sure if the peak throughput will
>> be impacted. I believe it may work fine with 11ac in chromiumos, how
>> about 11n and 11a?
> 
> Well, we will need to test that, of course. But ath9k shows that it's
> quite possible to run with quite shallow buffers, so with a bit of
> tuning I think we should be fine. If anything, slower networks need
> *fewer* packets queued in the firmware, and it's *easier* for the host
> to keep up with transmission.
> 
>> Anyway, I think this approach will help to improve performance of the
>> virtual time based mechanism since it makes packets buffered in host
>> instead of FW's deep queue. We have observed 2-clients case with
>> different ratio in TCP fails to maintain the ratio because the packets
>> arriving at host get pushed to FW immediately and thus the airtime
>> weight sum is 0 in most of time meaning no TXQ in the rbtree. For UDP,
>> if we pump load more than the PHY rate, the ratio can be maintained
>> beacuse the FW queue is full and packtes begin to be buffered in host
>> making TXQs staying on the rbtree for most of time. However, TCP has 
>> its
>> own flow control that we can not push enough load like UDP.
> 
> Yes, fixing that is exactly the point of this series :)
> 
> -Toke
Toke Høiland-Jørgensen Sept. 25, 2019, 12:52 p.m. UTC | #18
Yibo Zhao <yiboz@codeaurora.org> writes:

> On 2019-09-25 16:11, Toke Høiland-Jørgensen wrote:
>> Yibo Zhao <yiboz@codeaurora.org> writes:
>> 
>>> So if it is going to work together with virtual time based mechanism 
>>> in
>>> the future, the Tx criteria will be met both of below conditions,
>>>         1. Lower than g_vt
>>>         2. Lower than IEEE80211_AIRTIME_QUEUE_LIMIT
>> 
>>> Are we going to maintain two kinds of airtime that one is from
>>> estimation and the other is basically from FW reporting?
>> 
>> Yes, that was my plan. For devices that don't have FW reporting of
>> airtime, we can fall back to the estimation; but if we do have FW
>> reporting that is most likely going to be more accurate, so better to
>> use that for fairness...
>
> Do you mean we will use airtime reported by FW to calculate 
> local->airtime_queued in case we have FW reporting airtime?

No, the opposite; if the firmware can't report airtime, we can use the
estimated values to feed into report_airtime() for the fairness
calculation...

-Toke
Kan Yan Sept. 26, 2019, 12:27 a.m. UTC | #19
> Yes, please do! AFAICT, the main difference is that your version keeps
> the airtime calculation itself in the driver, while mine passes up the
> rate and lets mac80211 do the calculation of airtime. Other than that,
> the differences are minor, no?
> I'm not actually sure which approach is best; I suspect doing all the
> accounting in mac80211 will help with integrating this into drivers that
> use minstrel; we can just add a hook in that and be done with it.
> Whereas if the driver has to do the accounting, we would need to add
> that to each driver (mt76, iwl(?)).

Yes, they are essentially doing the same thing. I kept the airtime
estimation code in the ath10k just because it is already there. It is
better to do that in mac80211, so it doesn't have to be duplicated for
each driver and avoids the overhead of updating the estimated airtime
from host driver to mac80211.

> But of course, doing things in mac80211 depends on stuffing even more
> stuff into the already overloaded cb field; and I'm not actually
> entirely sure what I've done with that will actually work. WDYT?
Either way a field in skb cb is needed to record the estimated
airtime. The  'tx_time_est' shares the space with the codel
'enque_time' looks fine to me, as their lifetime doesn't overlap.

There is another minor difference in the ChromiumOs version, which
actually address the issue Yibo just asked:
> Meanwhile, airtime_queued will also limit the situation that we only
> have a station for transmission. Not sure if the peak throughput will be
> impacted. I believe it may work fine with 11ac in chromiumos, how about
> 11n and 11a?

My version has two AQL limits, a smaller per station limit (4ms) and a
larger per interface limit (24 ms). When the per interface limit has
not been reached, stations are allowed to transmit up to 1/3 of the
interface limits (8ms). This way it balance the needs to control
latency when there are a lot of stations and to get good throughput
benchmark numbers with a single client. In my test, I found increasing
the AQL limit to beyond 8 ms doesn't helps peak throughput on 4x4
ath10k chipset.
https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/1734867/3/net/mac80211/tx.c#b3734

> Of course we'll also have to eventually integrate this with the other
> series that Yibo recently re-posted (the virtual time scheduler). I
> think that will be relatively straight forward, except I'm not sure your
> atomic patches will work when we also have to update the rbtree. Any
> thoughts on that series in general?
I do like the virtual time scheduler patchset. It makes it easier to
schedule an arbitrary tx queue and handles ath10k's firmware pulling
mode better. I will give it a try.

> Yup, makes sense. Looking at the version you linked to, though, it seems
> you're calling ieee80211_sta_register_airtime() with the estimated value
> as well? So are you double-accounting airtime, or are you adjusting for
> the accurate values somewhere else I don't see in that series?
It does not double count airtime, just both the airtime fairness
scheduler and AQL use the estimate airtime. It is on an older tree and
still doesn't have the patch that provides the fw airtime:
https://patchwork.kernel.org/patch/10684689


On Wed, Sep 25, 2019 at 5:52 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Yibo Zhao <yiboz@codeaurora.org> writes:
>
> > On 2019-09-25 16:11, Toke Høiland-Jørgensen wrote:
> >> Yibo Zhao <yiboz@codeaurora.org> writes:
> >>
> >>> So if it is going to work together with virtual time based mechanism
> >>> in
> >>> the future, the Tx criteria will be met both of below conditions,
> >>>         1. Lower than g_vt
> >>>         2. Lower than IEEE80211_AIRTIME_QUEUE_LIMIT
> >>
> >>> Are we going to maintain two kinds of airtime that one is from
> >>> estimation and the other is basically from FW reporting?
> >>
> >> Yes, that was my plan. For devices that don't have FW reporting of
> >> airtime, we can fall back to the estimation; but if we do have FW
> >> reporting that is most likely going to be more accurate, so better to
> >> use that for fairness...
> >
> > Do you mean we will use airtime reported by FW to calculate
> > local->airtime_queued in case we have FW reporting airtime?
>
> No, the opposite; if the firmware can't report airtime, we can use the
> estimated values to feed into report_airtime() for the fairness
> calculation...
>
> -Toke
>
Kan Yan Sept. 26, 2019, 12:34 a.m. UTC | #20
> > Do you mean we will use airtime reported by FW to calculate
>
> > local->airtime_queued in case we have FW reporting airtime?
> No, the opposite; if the firmware can't report airtime, we can use the
> estimated values to feed into report_airtime() for the fairness
> calculation...
The local->airtime_queued is the 'future' airtime for the packet
pending the queue. It can't be replaced by the after the fact airtime
reported from firmware for the frames transmitted.


On Wed, Sep 25, 2019 at 5:27 PM Kan Yan <kyan@google.com> wrote:
>
> > Yes, please do! AFAICT, the main difference is that your version keeps
> > the airtime calculation itself in the driver, while mine passes up the
> > rate and lets mac80211 do the calculation of airtime. Other than that,
> > the differences are minor, no?
> > I'm not actually sure which approach is best; I suspect doing all the
> > accounting in mac80211 will help with integrating this into drivers that
> > use minstrel; we can just add a hook in that and be done with it.
> > Whereas if the driver has to do the accounting, we would need to add
> > that to each driver (mt76, iwl(?)).
>
> Yes, they are essentially doing the same thing. I kept the airtime
> estimation code in the ath10k just because it is already there. It is
> better to do that in mac80211, so it doesn't have to be duplicated for
> each driver and avoids the overhead of updating the estimated airtime
> from host driver to mac80211.
>
> > But of course, doing things in mac80211 depends on stuffing even more
> > stuff into the already overloaded cb field; and I'm not actually
> > entirely sure what I've done with that will actually work. WDYT?
> Either way a field in skb cb is needed to record the estimated
> airtime. The  'tx_time_est' shares the space with the codel
> 'enque_time' looks fine to me, as their lifetime doesn't overlap.
>
> There is another minor difference in the ChromiumOs version, which
> actually address the issue Yibo just asked:
> > Meanwhile, airtime_queued will also limit the situation that we only
> > have a station for transmission. Not sure if the peak throughput will be
> > impacted. I believe it may work fine with 11ac in chromiumos, how about
> > 11n and 11a?
>
> My version has two AQL limits, a smaller per station limit (4ms) and a
> larger per interface limit (24 ms). When the per interface limit has
> not been reached, stations are allowed to transmit up to 1/3 of the
> interface limits (8ms). This way it balance the needs to control
> latency when there are a lot of stations and to get good throughput
> benchmark numbers with a single client. In my test, I found increasing
> the AQL limit to beyond 8 ms doesn't helps peak throughput on 4x4
> ath10k chipset.
> https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/1734867/3/net/mac80211/tx.c#b3734
>
> > Of course we'll also have to eventually integrate this with the other
> > series that Yibo recently re-posted (the virtual time scheduler). I
> > think that will be relatively straight forward, except I'm not sure your
> > atomic patches will work when we also have to update the rbtree. Any
> > thoughts on that series in general?
> I do like the virtual time scheduler patchset. It makes it easier to
> schedule an arbitrary tx queue and handles ath10k's firmware pulling
> mode better. I will give it a try.
>
> > Yup, makes sense. Looking at the version you linked to, though, it seems
> > you're calling ieee80211_sta_register_airtime() with the estimated value
> > as well? So are you double-accounting airtime, or are you adjusting for
> > the accurate values somewhere else I don't see in that series?
> It does not double count airtime, just both the airtime fairness
> scheduler and AQL use the estimate airtime. It is on an older tree and
> still doesn't have the patch that provides the fw airtime:
> https://patchwork.kernel.org/patch/10684689
>
>
> On Wed, Sep 25, 2019 at 5:52 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >
> > Yibo Zhao <yiboz@codeaurora.org> writes:
> >
> > > On 2019-09-25 16:11, Toke Høiland-Jørgensen wrote:
> > >> Yibo Zhao <yiboz@codeaurora.org> writes:
> > >>
> > >>> So if it is going to work together with virtual time based mechanism
> > >>> in
> > >>> the future, the Tx criteria will be met both of below conditions,
> > >>>         1. Lower than g_vt
> > >>>         2. Lower than IEEE80211_AIRTIME_QUEUE_LIMIT
> > >>
> > >>> Are we going to maintain two kinds of airtime that one is from
> > >>> estimation and the other is basically from FW reporting?
> > >>
> > >> Yes, that was my plan. For devices that don't have FW reporting of
> > >> airtime, we can fall back to the estimation; but if we do have FW
> > >> reporting that is most likely going to be more accurate, so better to
> > >> use that for fairness...
> > >
> > > Do you mean we will use airtime reported by FW to calculate
> > > local->airtime_queued in case we have FW reporting airtime?
> >
> > No, the opposite; if the firmware can't report airtime, we can use the
> > estimated values to feed into report_airtime() for the fairness
> > calculation...
> >
> > -Toke
> >
Toke Høiland-Jørgensen Sept. 26, 2019, 5:57 a.m. UTC | #21
Kan Yan <kyan@google.com> writes:

>> > Do you mean we will use airtime reported by FW to calculate
>>
>> > local->airtime_queued in case we have FW reporting airtime?
>> No, the opposite; if the firmware can't report airtime, we can use the
>> estimated values to feed into report_airtime() for the fairness
>> calculation...
> The local->airtime_queued is the 'future' airtime for the packet
> pending the queue. It can't be replaced by the after the fact airtime
> reported from firmware for the frames transmitted.

No, but on tx_completion we could do something like this:

airtime = CB(skb)->tx_time ?: CB(skb)->tx_time_est;
ieee80211_report_airtime(sta, airtime);

That way, if the driver sets the tx_time field to something the firmware
reports, we'll use that, and otherwise we'd fall back to the estimate.

Of course, there would need to be a way for the driver to opt out of
this, for drivers that report out of band airtime like ath10k does :)

-Toke
Toke Høiland-Jørgensen Sept. 26, 2019, 6:04 a.m. UTC | #22
Kan Yan <kyan@google.com> writes:

>> Yes, please do! AFAICT, the main difference is that your version keeps
>> the airtime calculation itself in the driver, while mine passes up the
>> rate and lets mac80211 do the calculation of airtime. Other than that,
>> the differences are minor, no?
>> I'm not actually sure which approach is best; I suspect doing all the
>> accounting in mac80211 will help with integrating this into drivers that
>> use minstrel; we can just add a hook in that and be done with it.
>> Whereas if the driver has to do the accounting, we would need to add
>> that to each driver (mt76, iwl(?)).
>
> Yes, they are essentially doing the same thing. I kept the airtime
> estimation code in the ath10k just because it is already there. It is
> better to do that in mac80211, so it doesn't have to be duplicated for
> each driver and avoids the overhead of updating the estimated airtime
> from host driver to mac80211.

Right, makes sense.

>> But of course, doing things in mac80211 depends on stuffing even more
>> stuff into the already overloaded cb field; and I'm not actually
>> entirely sure what I've done with that will actually work. WDYT?
> Either way a field in skb cb is needed to record the estimated
> airtime. The  'tx_time_est' shares the space with the codel
> 'enque_time' looks fine to me, as their lifetime doesn't overlap.

The kbuild bot pointed out that the current implementation doesn't work
as it's supposed to on m68k (which is big-endian, I think?). I guess
it's because the compiler puts the u16 in the "wrong half" of the space
being used by the u32 it shares with, so it doesn't line up? If so, that
may mean we'll need another layer struct/union wrapping; unless someone
else has an idea for how to force the compiler to put the u16 in a union
at the "start" of the u32 regardless of endianness?

> There is another minor difference in the ChromiumOs version, which
> actually address the issue Yibo just asked:
>> Meanwhile, airtime_queued will also limit the situation that we only
>> have a station for transmission. Not sure if the peak throughput will be
>> impacted. I believe it may work fine with 11ac in chromiumos, how about
>> 11n and 11a?
>
> My version has two AQL limits, a smaller per station limit (4ms) and a
> larger per interface limit (24 ms). When the per interface limit has
> not been reached, stations are allowed to transmit up to 1/3 of the
> interface limits (8ms). This way it balance the needs to control
> latency when there are a lot of stations and to get good throughput
> benchmark numbers with a single client. In my test, I found increasing
> the AQL limit to beyond 8 ms doesn't helps peak throughput on 4x4
> ath10k chipset.
> https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/1734867/3/net/mac80211/tx.c#b3734

Yeah, I was wondering about that. Makes sense. Why 24ms, exactly?

>> Of course we'll also have to eventually integrate this with the other
>> series that Yibo recently re-posted (the virtual time scheduler). I
>> think that will be relatively straight forward, except I'm not sure your
>> atomic patches will work when we also have to update the rbtree. Any
>> thoughts on that series in general?
> I do like the virtual time scheduler patchset. It makes it easier to
> schedule an arbitrary tx queue and handles ath10k's firmware pulling
> mode better. I will give it a try.

Yup, that was the idea. Note that the current version doesn't have the
more granular locking that Felix put in for the RR-based scheduler.
Guess I need to re-spin; will see if I can't get to that soon.

>> Yup, makes sense. Looking at the version you linked to, though, it seems
>> you're calling ieee80211_sta_register_airtime() with the estimated value
>> as well? So are you double-accounting airtime, or are you adjusting for
>> the accurate values somewhere else I don't see in that series?
> It does not double count airtime, just both the airtime fairness
> scheduler and AQL use the estimate airtime. It is on an older tree and
> still doesn't have the patch that provides the fw airtime:
> https://patchwork.kernel.org/patch/10684689

Ah, I see. I assumed that the other call to sta_register_airtime() was
still there...

-Toke
Felix Fietkau Sept. 26, 2019, 12:53 p.m. UTC | #23
On 2019-09-19 14:22, Toke Høiland-Jørgensen wrote:
> From: Toke Høiland-Jørgensen <toke@redhat.com>
> 
> Some devices have deep buffers in firmware and/or hardware which prevents
> the FQ structure in mac80211 from effectively limiting bufferbloat on the
> link. For Ethernet devices we have BQL to limit the lower-level queues, but
> this cannot be applied to mac80211 because transmit rates can vary wildly
> between packets depending on which station we are transmitting it to.
> 
> To overcome this, we can use airtime-based queue limiting (AQL), where we
> estimate the transmission time for each packet before dequeueing it, and
> use that to limit the amount of data in-flight to the hardware. This idea
> was originally implemented as part of the out-of-tree airtime fairness
> patch to ath10k[0] in chromiumos.
> 
> This patch ports that idea over to mac80211. The basic idea is simple
> enough: Whenever we dequeue a packet from the TXQs and send it to the
> driver, we estimate its airtime usage, based on the last recorded TX rate
> of the station that packet is destined for. We keep a running per-AC total
> of airtime queued for the whole device, and when that total climbs above 8
> ms' worth of data (corresponding to two maximum-sized aggregates), we
> simply throttle the queues until it drops down again.
> 
> The estimated airtime for each skb is stored in the tx_info, so we can
> subtract the same amount from the running total when the skb is freed or
> recycled. The throttling mechanism relies on this accounting to be
> accurate (i.e., that we are not freeing skbs without subtracting any
> airtime they were accounted for), so we put the subtraction into
> ieee80211_report_used_skb().
> 
> This patch does *not* include any mechanism to wake a throttled TXQ again,
> on the assumption that this will happen anyway as a side effect of whatever
> freed the skb (most commonly a TX completion).
> 
> The throttling mechanism only kicks in if the queued airtime total goes
> above the limit. Since mac80211 calculates the time based on the reported
> last_tx_time from the driver, the whole throttling mechanism only kicks in
> for drivers that actually report this value. With the exception of
> multicast, where we always calculate an estimated tx time on the assumption
> that multicast is transmitted at the lowest (6 Mbps) rate.
> 
> The throttling added in this patch is in addition to any throttling already
> performed by the airtime fairness mechanism, and in principle the two
> mechanisms are orthogonal (and currently also uses two different sources of
> airtime). In the future, we could amend this, using the airtime estimates
> calculated by this mechanism as a fallback input to the airtime fairness
> scheduler, to enable airtime fairness even on drivers that don't have a
> hardware source of airtime usage for each station.
> 
> [0] https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/588190/13/drivers/net/wireless-4.2/ath/ath10k/mac.c#3845
One thing that might be missing here is dealing with airtime accounting
of frames that remain queued in the driver/hardware because the station
is in powersave mode.

- Felix
Toke Høiland-Jørgensen Sept. 26, 2019, 1:17 p.m. UTC | #24
Felix Fietkau <nbd@nbd.name> writes:

> On 2019-09-19 14:22, Toke Høiland-Jørgensen wrote:
>> From: Toke Høiland-Jørgensen <toke@redhat.com>
>> 
>> Some devices have deep buffers in firmware and/or hardware which prevents
>> the FQ structure in mac80211 from effectively limiting bufferbloat on the
>> link. For Ethernet devices we have BQL to limit the lower-level queues, but
>> this cannot be applied to mac80211 because transmit rates can vary wildly
>> between packets depending on which station we are transmitting it to.
>> 
>> To overcome this, we can use airtime-based queue limiting (AQL), where we
>> estimate the transmission time for each packet before dequeueing it, and
>> use that to limit the amount of data in-flight to the hardware. This idea
>> was originally implemented as part of the out-of-tree airtime fairness
>> patch to ath10k[0] in chromiumos.
>> 
>> This patch ports that idea over to mac80211. The basic idea is simple
>> enough: Whenever we dequeue a packet from the TXQs and send it to the
>> driver, we estimate its airtime usage, based on the last recorded TX rate
>> of the station that packet is destined for. We keep a running per-AC total
>> of airtime queued for the whole device, and when that total climbs above 8
>> ms' worth of data (corresponding to two maximum-sized aggregates), we
>> simply throttle the queues until it drops down again.
>> 
>> The estimated airtime for each skb is stored in the tx_info, so we can
>> subtract the same amount from the running total when the skb is freed or
>> recycled. The throttling mechanism relies on this accounting to be
>> accurate (i.e., that we are not freeing skbs without subtracting any
>> airtime they were accounted for), so we put the subtraction into
>> ieee80211_report_used_skb().
>> 
>> This patch does *not* include any mechanism to wake a throttled TXQ again,
>> on the assumption that this will happen anyway as a side effect of whatever
>> freed the skb (most commonly a TX completion).
>> 
>> The throttling mechanism only kicks in if the queued airtime total goes
>> above the limit. Since mac80211 calculates the time based on the reported
>> last_tx_time from the driver, the whole throttling mechanism only kicks in
>> for drivers that actually report this value. With the exception of
>> multicast, where we always calculate an estimated tx time on the assumption
>> that multicast is transmitted at the lowest (6 Mbps) rate.
>> 
>> The throttling added in this patch is in addition to any throttling already
>> performed by the airtime fairness mechanism, and in principle the two
>> mechanisms are orthogonal (and currently also uses two different sources of
>> airtime). In the future, we could amend this, using the airtime estimates
>> calculated by this mechanism as a fallback input to the airtime fairness
>> scheduler, to enable airtime fairness even on drivers that don't have a
>> hardware source of airtime usage for each station.
>> 
>> [0] https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/588190/13/drivers/net/wireless-4.2/ath/ath10k/mac.c#3845
> One thing that might be missing here is dealing with airtime accounting
> of frames that remain queued in the driver/hardware because the station
> is in powersave mode.

Oh, right. Didn't know that could happen (I assumed those would be
flushed out or something). But if we're going to go with Kan's
suggestion of having per-station accounting as well as a global
accounting for the device, we could just subtract the station's
outstanding balance from the device total when it goes into powersave
mode, and add it back once it wakes up again. At least I think that
would work, no?

-Toke
Felix Fietkau Sept. 26, 2019, 1:47 p.m. UTC | #25
On 2019-09-26 15:17, Toke Høiland-Jørgensen wrote:
> Oh, right. Didn't know that could happen (I assumed those would be
> flushed out or something). But if we're going to go with Kan's
> suggestion of having per-station accounting as well as a global
> accounting for the device, we could just subtract the station's
> outstanding balance from the device total when it goes into powersave
> mode, and add it back once it wakes up again. At least I think that
> would work, no?Yes, I think that would work.

- Felix
Toke Høiland-Jørgensen Sept. 26, 2019, 3:19 p.m. UTC | #26
Felix Fietkau <nbd@nbd.name> writes:

> On 2019-09-26 15:17, Toke Høiland-Jørgensen wrote:
>> Oh, right. Didn't know that could happen (I assumed those would be
>> flushed out or something). But if we're going to go with Kan's
>> suggestion of having per-station accounting as well as a global
>> accounting for the device, we could just subtract the station's
>> outstanding balance from the device total when it goes into powersave
>> mode, and add it back once it wakes up again. At least I think that
>> would work, no?
>Yes, I think that would work.

Great! Will incorporate that, then.

-Toke
Kan Yan Sept. 27, 2019, 2:42 a.m. UTC | #27
> No, but on tx_completion we could do something like this:
> airtime = CB(skb)->tx_time ?: CB(skb)->tx_time_est;
> ieee80211_report_airtime(sta, airtime);
> That way, if the driver sets the tx_time field to something the firmware
> reports, we'll use that, and otherwise we'd fall back to the estimate.
> Of course, there would need to be a way for the driver to opt out of
> this, for drivers that report out of band airtime like ath10k does :)

I doubt that will work, unless firmware can do per frame airtime
update in the skb. It is more likely that firmware provides out of
band airtime update for a period of time, like an aggregation. That's
the case for ath10k: https://patchwork.kernel.org/patch/10684689

Regarding handling frame for station enters power saving mode:
>
> >> Oh, right. Didn't know that could happen (I assumed those would be
> >> flushed out or something). But if we're going to go with Kan's
> >> suggestion of having per-station accounting as well as a global
> >> accounting for the device, we could just subtract the station's
> >> outstanding balance from the device total when it goes into powersave
> >> mode, and add it back once it wakes up again. At least I think that
> >> would work, no?
> >Yes, I think that would work.
> Great! Will incorporate that, then.

I think that could work but things could be a bit more complicated.
Not sure I fully understand the usage of airtime_weight_sum in  [PATCH
V3 1/4] mac80211: Switch to a virtual time-based airtime scheduler:

in ieee80211_schedule_txq():
   local->airtime_weight_sum[ac] += sta->airtime_weight;

in ieee80211_sta_register_airtime():
   weight_sum = local->airtime_weight_sum[ac] ?: sta->airtime_weight;
   local->airtime_v_t[ac] += airtime / weight_sum;
   sta->airtime[ac].v_t += airtime / sta->airtime_weight;

in  __ieee80211_unschedule_txq
 local->airtime_weight_sum[ac] -= sta->airtime_weight;

I assume the purpose of airtime_weight_sum is to count station's
virtual airtime proportional to the number of active stations for
fairness.  My concern is the per interface
local->airtime_weight_sum[ac] get updated when packets are released
from a txq to lower layer, but it doesn't mean the airtime will be
consumed (packets get transmitted) shortly, due to events like station
enter power save mode, so the weight_sum used in
ieee80211_sta_register_airtime() maybe inaccurate. For architects
using firmware/hardware offloading, firmware ultimately controls
packet scheduling and has quite a lot of autonomy. The host driver's
airtime_weight_sum may get out of sync with the number of active
stations actually scheduled by firmware even without power saving
events.

Is this a correct understanding? I kind of think the original method
of airtime accounting using deficit maybe more robust in this regard.
Toke Høiland-Jørgensen Sept. 27, 2019, 6:12 a.m. UTC | #28
Kan Yan <kyan@google.com> writes:

>> No, but on tx_completion we could do something like this:
>> airtime = CB(skb)->tx_time ?: CB(skb)->tx_time_est;
>> ieee80211_report_airtime(sta, airtime);
>> That way, if the driver sets the tx_time field to something the firmware
>> reports, we'll use that, and otherwise we'd fall back to the estimate.
>> Of course, there would need to be a way for the driver to opt out of
>> this, for drivers that report out of band airtime like ath10k does :)
>
> I doubt that will work, unless firmware can do per frame airtime
> update in the skb. It is more likely that firmware provides out of
> band airtime update for a period of time, like an aggregation. That's
> the case for ath10k: https://patchwork.kernel.org/patch/10684689

No, ath10k would continue to do what it was always doing. Drivers that
can report after-the-fact airtime usage per-frame (like ath9k) will
continue to do that. In both of those cases, the airtime estimate is
only used to throttle the queue, not to schedule for fairness.

My point is just that for drivers that have *no* mechanism to report
airtime usage after-the-fact, we can add a flag that will allow the
values estimated by mac80211 to also be used for the fairness
scheduler...

> Regarding handling frame for station enters power saving mode:
>>
>> >> Oh, right. Didn't know that could happen (I assumed those would be
>> >> flushed out or something). But if we're going to go with Kan's
>> >> suggestion of having per-station accounting as well as a global
>> >> accounting for the device, we could just subtract the station's
>> >> outstanding balance from the device total when it goes into powersave
>> >> mode, and add it back once it wakes up again. At least I think that
>> >> would work, no?
>> >Yes, I think that would work.
>> Great! Will incorporate that, then.
>
> I think that could work but things could be a bit more complicated.
> Not sure I fully understand the usage of airtime_weight_sum in  [PATCH
> V3 1/4] mac80211: Switch to a virtual time-based airtime scheduler:
>
> in ieee80211_schedule_txq():
>    local->airtime_weight_sum[ac] += sta->airtime_weight;
>
> in ieee80211_sta_register_airtime():
>    weight_sum = local->airtime_weight_sum[ac] ?: sta->airtime_weight;
>    local->airtime_v_t[ac] += airtime / weight_sum;
>    sta->airtime[ac].v_t += airtime / sta->airtime_weight;
>
> in  __ieee80211_unschedule_txq
>  local->airtime_weight_sum[ac] -= sta->airtime_weight;
>
> I assume the purpose of airtime_weight_sum is to count station's
> virtual airtime proportional to the number of active stations for
> fairness.

Yup, the proportion between the station's airtime weight and the total
scheduled airtime weight indicates the station's fair share.

> My concern is the per interface local->airtime_weight_sum[ac] get
> updated when packets are released from a txq to lower layer, but it
> doesn't mean the airtime will be consumed (packets get transmitted)
> shortly, due to events like station enter power save mode, so the
> weight_sum used in ieee80211_sta_register_airtime() maybe inaccurate.
> For architects using firmware/hardware offloading, firmware ultimately
> controls packet scheduling and has quite a lot of autonomy. The host
> driver's airtime_weight_sum may get out of sync with the number of
> active stations actually scheduled by firmware even without power
> saving events.
>
> Is this a correct understanding? I kind of think the original method
> of airtime accounting using deficit maybe more robust in this regard.

You are right that this could happen, yes. However, the station is only
unscheduled when its mac80211 queue runs completely empty. So the
assumption is that stations that transmit continuously (which are really
the ones we care about for fairness purposes), would keep being
scheduled most of the time.

Now, you're quite right that this assumption might be wrong, which would
lead to bad results. I think the other (queue throttling) patch set
would help, though; that should push the queues up into mac80211 and
give the stations a higher probability of being scheduled when they are
in fact backlogged. I've only tested the virtual time scheduler on
ath9k, which inherently has shallow buffers in the hardware.

So yeah, it may be that the virtual time-thing turns out to not work
well. But the results looked encouraging on ath9k, and since it will
make it easier to schedule multiple stations, I think it has some merit
that makes it worth trying out. We should probably get the AQL stuff
done first, though, and try the virtual time scheduler on top of that.

BTW, I think Felix' concern about powersave was in relation to AQL: If
we don't handle power save in that, we can end up in a situation where
the budget for packets allowed to be queued in the firmware is taken up
entirely by stations that are currently in powersave mode; which would
throttle the device completely. So we should take that into account for
AQL; for the fairness scheduler, stations in powersave are already
unscheduled, so that should be fine.

-Toke
Yibo Zhao Sept. 27, 2019, 9:20 a.m. UTC | #29
On 2019-09-27 14:12, Toke Høiland-Jørgensen wrote:
> Kan Yan <kyan@google.com> writes:
> 
>>> No, but on tx_completion we could do something like this:
>>> airtime = CB(skb)->tx_time ?: CB(skb)->tx_time_est;
>>> ieee80211_report_airtime(sta, airtime);
>>> That way, if the driver sets the tx_time field to something the 
>>> firmware
>>> reports, we'll use that, and otherwise we'd fall back to the 
>>> estimate.
>>> Of course, there would need to be a way for the driver to opt out of
>>> this, for drivers that report out of band airtime like ath10k does :)
>> 
>> I doubt that will work, unless firmware can do per frame airtime
>> update in the skb. It is more likely that firmware provides out of
>> band airtime update for a period of time, like an aggregation. That's
>> the case for ath10k: https://patchwork.kernel.org/patch/10684689
> 
> No, ath10k would continue to do what it was always doing. Drivers that
> can report after-the-fact airtime usage per-frame (like ath9k) will
> continue to do that. In both of those cases, the airtime estimate is
> only used to throttle the queue, not to schedule for fairness.
> 
> My point is just that for drivers that have *no* mechanism to report
> airtime usage after-the-fact, we can add a flag that will allow the
> values estimated by mac80211 to also be used for the fairness
> scheduler...
> 
>> Regarding handling frame for station enters power saving mode:
>>> 
>>> >> Oh, right. Didn't know that could happen (I assumed those would be
>>> >> flushed out or something). But if we're going to go with Kan's
>>> >> suggestion of having per-station accounting as well as a global
>>> >> accounting for the device, we could just subtract the station's
>>> >> outstanding balance from the device total when it goes into powersave
>>> >> mode, and add it back once it wakes up again. At least I think that
>>> >> would work, no?
>>> >Yes, I think that would work.
>>> Great! Will incorporate that, then.
>> 
>> I think that could work but things could be a bit more complicated.
>> Not sure I fully understand the usage of airtime_weight_sum in  [PATCH
>> V3 1/4] mac80211: Switch to a virtual time-based airtime scheduler:
>> 
>> in ieee80211_schedule_txq():
>>    local->airtime_weight_sum[ac] += sta->airtime_weight;
>> 
>> in ieee80211_sta_register_airtime():
>>    weight_sum = local->airtime_weight_sum[ac] ?: sta->airtime_weight;
>>    local->airtime_v_t[ac] += airtime / weight_sum;
>>    sta->airtime[ac].v_t += airtime / sta->airtime_weight;
>> 
>> in  __ieee80211_unschedule_txq
>>  local->airtime_weight_sum[ac] -= sta->airtime_weight;
>> 
>> I assume the purpose of airtime_weight_sum is to count station's
>> virtual airtime proportional to the number of active stations for
>> fairness.
> 
> Yup, the proportion between the station's airtime weight and the total
> scheduled airtime weight indicates the station's fair share.
> 
>> My concern is the per interface local->airtime_weight_sum[ac] get
>> updated when packets are released from a txq to lower layer, but it
>> doesn't mean the airtime will be consumed (packets get transmitted)
>> shortly, due to events like station enter power save mode, so the
>> weight_sum used in ieee80211_sta_register_airtime() maybe inaccurate.
>> For architects using firmware/hardware offloading, firmware ultimately
>> controls packet scheduling and has quite a lot of autonomy. The host
>> driver's airtime_weight_sum may get out of sync with the number of
>> active stations actually scheduled by firmware even without power
>> saving events.
>> 
>> Is this a correct understanding? I kind of think the original method
>> of airtime accounting using deficit maybe more robust in this regard.
> 
> You are right that this could happen, yes. However, the station is only
> unscheduled when its mac80211 queue runs completely empty. So the
> assumption is that stations that transmit continuously (which are 
> really
> the ones we care about for fairness purposes), would keep being
> scheduled most of the time.
> 
> Now, you're quite right that this assumption might be wrong, which 
> would
> lead to bad results. I think the other (queue throttling) patch set
> would help, though; that should push the queues up into mac80211 and
> give the stations a higher probability of being scheduled when they are
> in fact backlogged. I've only tested the virtual time scheduler on
> ath9k, which inherently has shallow buffers in the hardware.
> 
> So yeah, it may be that the virtual time-thing turns out to not work
> well. But the results looked encouraging on ath9k, and since it will

I am not familiar with ath9k but SFAIK, ath9k is fine with virtual 
time-thing because it does not have firmware/hardware offloading. Thus 
host can see the packets backlogged in host queue and TXQs stay on 
rbtree for the most of time if there is continuous transmission. As a 
result, the algo fully works. While for ath10k like Kan said, it has 
firmware/hardware offloading and host cannot see the packets backlogged 
in host queue because they are already sent to FW queue as long as the 
ingress load less than PHY rate. Then TXQs are removed from the rbtree 
which leads to the algo not working so well. For driver that has 
firmware/hardware offloading, I think the key is to keep some of the 
packets buffered in the host even if FW queue is not full at that time. 
Also I believe DRR may have the same issue since only TXQs in the list 
contend for Tx chance.

> make it easier to schedule multiple stations, I think it has some merit
> that makes it worth trying out. We should probably get the AQL stuff
> done first, though, and try the virtual time scheduler on top of that.

Agree that we should get the AQL stuff done first since I believe it 
will help to fix the issue mentioned above.

> 
> BTW, I think Felix' concern about powersave was in relation to AQL: If
> we don't handle power save in that, we can end up in a situation where
> the budget for packets allowed to be queued in the firmware is taken up
> entirely by stations that are currently in powersave mode; which would
> throttle the device completely. So we should take that into account for
> AQL; for the fairness scheduler, stations in powersave are already
> unscheduled, so that should be fine.
> 
> -Toke
Kan Yan Sept. 28, 2019, 8:24 p.m. UTC | #30
> No, ath10k would continue to do what it was always doing. Drivers that
> can report after-the-fact airtime usage per-frame (like ath9k) will
> continue to do that. In both of those cases, the airtime estimate is
> only used to throttle the queue, not to schedule for fairness.
You are right, I didn't realize ath9k reports per frame airtime usage.

> Yeah, I was wondering about that. Makes sense. Why 24ms, exactly?
The per interface 24 ms queue limit is an empirical number that works
well for both achieve low latency when there is a lot of stations and
get high throughput when there is only 1-2 stations.  We could make it
configurable.

> BTW, I think Felix' concern about powersave was in relation to AQL: If
> we don't handle power save in that, we can end up in a situation where
>the budget for packets allowed to be queued in the firmware is taken up
> entirely by stations that are currently in powersave mode; which would
> throttle the device completely. So we should take that into account for
> AQL; for the fairness scheduler, stations in powersave are already
> unscheduled, so that should be fine.
I think the accounting for the airtime of frames in the power saving
queue could affect both the fairness scheduler and AQL.
For chipset with firmware offload, PS handling is mostly done by
firmware, so host driver's knowledge of PS state could be slightly
out-of-dated. The power save behavior also make it harder to the
airtime_weight correct for the fairness scheduler.
Powersave mode's impact to AQL is much smaller. The lower per station
queue limit is not impacted by other stations PS behavior, since the
estimated future airtime is not weighted for other stations and a
station won't get blocked due to others stations in PS mode.
Station in PS mode do affects AQL's higher per interface limit, but in
an inconsequential way. The per interface AQL queue limit is quite
large (24 ms), hence airtime from packets in PS queue is unlikely to
have a significant impact on it. Still, it will be better if the
packet in power saving queue can be taken into account.

> > make it easier to schedule multiple stations, I think it has some merit
> > that makes it worth trying out. We should probably get the AQL stuff
> > done first, though, and try the virtual time scheduler on top of that.
> Agree that we should get the AQL stuff done first since I believe it
> will help to fix the issue mentioned above.
That sounds like a good plan. The virtual time scheduler is more
involved and will take more work to get it right. It make sense to get
AQL done first.


On Fri, Sep 27, 2019 at 2:20 AM Yibo Zhao <yiboz@codeaurora.org> wrote:
>
> On 2019-09-27 14:12, Toke Høiland-Jørgensen wrote:
> > Kan Yan <kyan@google.com> writes:
> >
> >>> No, but on tx_completion we could do something like this:
> >>> airtime = CB(skb)->tx_time ?: CB(skb)->tx_time_est;
> >>> ieee80211_report_airtime(sta, airtime);
> >>> That way, if the driver sets the tx_time field to something the
> >>> firmware
> >>> reports, we'll use that, and otherwise we'd fall back to the
> >>> estimate.
> >>> Of course, there would need to be a way for the driver to opt out of
> >>> this, for drivers that report out of band airtime like ath10k does :)
> >>
> >> I doubt that will work, unless firmware can do per frame airtime
> >> update in the skb. It is more likely that firmware provides out of
> >> band airtime update for a period of time, like an aggregation. That's
> >> the case for ath10k: https://patchwork.kernel.org/patch/10684689
> >
> > No, ath10k would continue to do what it was always doing. Drivers that
> > can report after-the-fact airtime usage per-frame (like ath9k) will
> > continue to do that. In both of those cases, the airtime estimate is
> > only used to throttle the queue, not to schedule for fairness.
> >
> > My point is just that for drivers that have *no* mechanism to report
> > airtime usage after-the-fact, we can add a flag that will allow the
> > values estimated by mac80211 to also be used for the fairness
> > scheduler...
> >
> >> Regarding handling frame for station enters power saving mode:
> >>>
> >>> >> Oh, right. Didn't know that could happen (I assumed those would be
> >>> >> flushed out or something). But if we're going to go with Kan's
> >>> >> suggestion of having per-station accounting as well as a global
> >>> >> accounting for the device, we could just subtract the station's
> >>> >> outstanding balance from the device total when it goes into powersave
> >>> >> mode, and add it back once it wakes up again. At least I think that
> >>> >> would work, no?
> >>> >Yes, I think that would work.
> >>> Great! Will incorporate that, then.
> >>
> >> I think that could work but things could be a bit more complicated.
> >> Not sure I fully understand the usage of airtime_weight_sum in  [PATCH
> >> V3 1/4] mac80211: Switch to a virtual time-based airtime scheduler:
> >>
> >> in ieee80211_schedule_txq():
> >>    local->airtime_weight_sum[ac] += sta->airtime_weight;
> >>
> >> in ieee80211_sta_register_airtime():
> >>    weight_sum = local->airtime_weight_sum[ac] ?: sta->airtime_weight;
> >>    local->airtime_v_t[ac] += airtime / weight_sum;
> >>    sta->airtime[ac].v_t += airtime / sta->airtime_weight;
> >>
> >> in  __ieee80211_unschedule_txq
> >>  local->airtime_weight_sum[ac] -= sta->airtime_weight;
> >>
> >> I assume the purpose of airtime_weight_sum is to count station's
> >> virtual airtime proportional to the number of active stations for
> >> fairness.
> >
> > Yup, the proportion between the station's airtime weight and the total
> > scheduled airtime weight indicates the station's fair share.
> >
> >> My concern is the per interface local->airtime_weight_sum[ac] get
> >> updated when packets are released from a txq to lower layer, but it
> >> doesn't mean the airtime will be consumed (packets get transmitted)
> >> shortly, due to events like station enter power save mode, so the
> >> weight_sum used in ieee80211_sta_register_airtime() maybe inaccurate.
> >> For architects using firmware/hardware offloading, firmware ultimately
> >> controls packet scheduling and has quite a lot of autonomy. The host
> >> driver's airtime_weight_sum may get out of sync with the number of
> >> active stations actually scheduled by firmware even without power
> >> saving events.
> >>
> >> Is this a correct understanding? I kind of think the original method
> >> of airtime accounting using deficit maybe more robust in this regard.
> >
> > You are right that this could happen, yes. However, the station is only
> > unscheduled when its mac80211 queue runs completely empty. So the
> > assumption is that stations that transmit continuously (which are
> > really
> > the ones we care about for fairness purposes), would keep being
> > scheduled most of the time.
> >
> > Now, you're quite right that this assumption might be wrong, which
> > would
> > lead to bad results. I think the other (queue throttling) patch set
> > would help, though; that should push the queues up into mac80211 and
> > give the stations a higher probability of being scheduled when they are
> > in fact backlogged. I've only tested the virtual time scheduler on
> > ath9k, which inherently has shallow buffers in the hardware.
> >
> > So yeah, it may be that the virtual time-thing turns out to not work
> > well. But the results looked encouraging on ath9k, and since it will
>
> I am not familiar with ath9k but SFAIK, ath9k is fine with virtual
> time-thing because it does not have firmware/hardware offloading. Thus
> host can see the packets backlogged in host queue and TXQs stay on
> rbtree for the most of time if there is continuous transmission. As a
> result, the algo fully works. While for ath10k like Kan said, it has
> firmware/hardware offloading and host cannot see the packets backlogged
> in host queue because they are already sent to FW queue as long as the
> ingress load less than PHY rate. Then TXQs are removed from the rbtree
> which leads to the algo not working so well. For driver that has
> firmware/hardware offloading, I think the key is to keep some of the
> packets buffered in the host even if FW queue is not full at that time.
> Also I believe DRR may have the same issue since only TXQs in the list
> contend for Tx chance.
>
> > make it easier to schedule multiple stations, I think it has some merit
> > that makes it worth trying out. We should probably get the AQL stuff
> > done first, though, and try the virtual time scheduler on top of that.
>
> Agree that we should get the AQL stuff done first since I believe it
> will help to fix the issue mentioned above.
>
> >
> > BTW, I think Felix' concern about powersave was in relation to AQL: If
> > we don't handle power save in that, we can end up in a situation where
> > the budget for packets allowed to be queued in the firmware is taken up
> > entirely by stations that are currently in powersave mode; which would
> > throttle the device completely. So we should take that into account for
> > AQL; for the fairness scheduler, stations in powersave are already
> > unscheduled, so that should be fine.
> >
> > -Toke
>
> --
> Yibo
Toke Høiland-Jørgensen Sept. 29, 2019, 7:18 p.m. UTC | #31
Kan Yan <kyan@google.com> writes:

>> No, ath10k would continue to do what it was always doing. Drivers that
>> can report after-the-fact airtime usage per-frame (like ath9k) will
>> continue to do that. In both of those cases, the airtime estimate is
>> only used to throttle the queue, not to schedule for fairness.
> You are right, I didn't realize ath9k reports per frame airtime usage.
>
>> Yeah, I was wondering about that. Makes sense. Why 24ms, exactly?
> The per interface 24 ms queue limit is an empirical number that works
> well for both achieve low latency when there is a lot of stations and
> get high throughput when there is only 1-2 stations.  We could make it
> configurable.

Right. "Found by trial and error" is a fine answer as far as I'm
concerned :)

But yeah, this should probably be configurable, like BQL is.

>> BTW, I think Felix' concern about powersave was in relation to AQL: If
>> we don't handle power save in that, we can end up in a situation where
>>the budget for packets allowed to be queued in the firmware is taken up
>> entirely by stations that are currently in powersave mode; which would
>> throttle the device completely. So we should take that into account for
>> AQL; for the fairness scheduler, stations in powersave are already
>> unscheduled, so that should be fine.
> I think the accounting for the airtime of frames in the power saving
> queue could affect both the fairness scheduler and AQL.
> For chipset with firmware offload, PS handling is mostly done by
> firmware, so host driver's knowledge of PS state could be slightly
> out-of-dated. The power save behavior also make it harder to the
> airtime_weight correct for the fairness scheduler.

Hmm, maybe. I'm not sure how significant this effect would be, but I
guess we'll need to find out :)


> Powersave mode's impact to AQL is much smaller. The lower per station
> queue limit is not impacted by other stations PS behavior, since the
> estimated future airtime is not weighted for other stations and a
> station won't get blocked due to others stations in PS mode.
> Station in PS mode do affects AQL's higher per interface limit, but in
> an inconsequential way. The per interface AQL queue limit is quite
> large (24 ms), hence airtime from packets in PS queue is unlikely to
> have a significant impact on it. Still, it will be better if the
> packet in power saving queue can be taken into account.

I guess the risk is lower when with a 24ms per-iface limit; but with
enough stations I guess it could still happen, no? So we should probably
handle this case...

>> > make it easier to schedule multiple stations, I think it has some merit
>> > that makes it worth trying out. We should probably get the AQL stuff
>> > done first, though, and try the virtual time scheduler on top of that.
>> Agree that we should get the AQL stuff done first since I believe it
>> will help to fix the issue mentioned above.
> That sounds like a good plan. The virtual time scheduler is more
> involved and will take more work to get it right. It make sense to get
> AQL done first.

Cool. Are you going to submit a ported version of your implementation?
Then we can work from the two submissions and see if we can't converge
on something...

-Toke
Kan Yan Oct. 1, 2019, 4:47 a.m. UTC | #32
> I guess the risk is lower when with a 24ms per-iface limit; but with
> enough stations I guess it could still happen, no? So we should probably
> handle this case...
Each txq (per sta, per tid) is allowed to release at least the lower
AQL limit amount of packet (default 4ms), which is not affected by
other station's PS behavior and 4ms should be sufficient for most use
cases.
The 24ms per-interface limit is an optimization to get good benchmark
score in peak performance test, which usually only involve 1-2
stations. The higher limit probably won't matter anymore when there
are many stations. I haven't noticed side effects due to PS behavior
in the ChromiumOS version. Still, it is better to be able to take
frames in PS queue in to account,

> Cool. Are you going to submit a ported version of your implementation?
> Then we can work from the two submissions and see if we can't converge
> on something...
Working on porting, should have something ready before the end of this week.


On Sun, Sep 29, 2019 at 12:18 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Kan Yan <kyan@google.com> writes:
>
> >> No, ath10k would continue to do what it was always doing. Drivers that
> >> can report after-the-fact airtime usage per-frame (like ath9k) will
> >> continue to do that. In both of those cases, the airtime estimate is
> >> only used to throttle the queue, not to schedule for fairness.
> > You are right, I didn't realize ath9k reports per frame airtime usage.
> >
> >> Yeah, I was wondering about that. Makes sense. Why 24ms, exactly?
> > The per interface 24 ms queue limit is an empirical number that works
> > well for both achieve low latency when there is a lot of stations and
> > get high throughput when there is only 1-2 stations.  We could make it
> > configurable.
>
> Right. "Found by trial and error" is a fine answer as far as I'm
> concerned :)
>
> But yeah, this should probably be configurable, like BQL is.
>
> >> BTW, I think Felix' concern about powersave was in relation to AQL: If
> >> we don't handle power save in that, we can end up in a situation where
> >>the budget for packets allowed to be queued in the firmware is taken up
> >> entirely by stations that are currently in powersave mode; which would
> >> throttle the device completely. So we should take that into account for
> >> AQL; for the fairness scheduler, stations in powersave are already
> >> unscheduled, so that should be fine.
> > I think the accounting for the airtime of frames in the power saving
> > queue could affect both the fairness scheduler and AQL.
> > For chipset with firmware offload, PS handling is mostly done by
> > firmware, so host driver's knowledge of PS state could be slightly
> > out-of-dated. The power save behavior also make it harder to the
> > airtime_weight correct for the fairness scheduler.
>
> Hmm, maybe. I'm not sure how significant this effect would be, but I
> guess we'll need to find out :)
>
>
> > Powersave mode's impact to AQL is much smaller. The lower per station
> > queue limit is not impacted by other stations PS behavior, since the
> > estimated future airtime is not weighted for other stations and a
> > station won't get blocked due to others stations in PS mode.
> > Station in PS mode do affects AQL's higher per interface limit, but in
> > an inconsequential way. The per interface AQL queue limit is quite
> > large (24 ms), hence airtime from packets in PS queue is unlikely to
> > have a significant impact on it. Still, it will be better if the
> > packet in power saving queue can be taken into account.
>
> I guess the risk is lower when with a 24ms per-iface limit; but with
> enough stations I guess it could still happen, no? So we should probably
> handle this case...
>
> >> > make it easier to schedule multiple stations, I think it has some merit
> >> > that makes it worth trying out. We should probably get the AQL stuff
> >> > done first, though, and try the virtual time scheduler on top of that.
> >> Agree that we should get the AQL stuff done first since I believe it
> >> will help to fix the issue mentioned above.
> > That sounds like a good plan. The virtual time scheduler is more
> > involved and will take more work to get it right. It make sense to get
> > AQL done first.
>
> Cool. Are you going to submit a ported version of your implementation?
> Then we can work from the two submissions and see if we can't converge
> on something...
>
> -Toke
>
Toke Høiland-Jørgensen Oct. 1, 2019, 6:57 a.m. UTC | #33
Kan Yan <kyan@google.com> writes:

>> I guess the risk is lower when with a 24ms per-iface limit; but with
>> enough stations I guess it could still happen, no? So we should probably
>> handle this case...
> Each txq (per sta, per tid) is allowed to release at least the lower
> AQL limit amount of packet (default 4ms), which is not affected by
> other station's PS behavior and 4ms should be sufficient for most use
> cases.

Ah, I thought you'd meant each station can queue MIN(4ms, 24ms-<other
stations>). I see that is not the case; it's up to 10ms as long as the
total is less than 20ms, and up to 4ms otherwise. 

> The 24ms per-interface limit is an optimization to get good benchmark
> score in peak performance test, which usually only involve 1-2
> stations.

Gotta get those benchmark numbers in ;)

> The higher limit probably won't matter anymore when there are many
> stations. I haven't noticed side effects due to PS behavior in the
> ChromiumOS version. Still, it is better to be able to take frames in
> PS queue in to account,

As long as one station always gets its 4ms, I'm not too worried about
PS; but that was not the case in my patch :)

>> Cool. Are you going to submit a ported version of your implementation?
>> Then we can work from the two submissions and see if we can't converge
>> on something...
> Working on porting, should have something ready before the end of this
> week.

Great!

-Toke
diff mbox series

Patch

diff --git a/net/mac80211/debugfs.c b/net/mac80211/debugfs.c
index 568b3b276931..c846c6e7f3e3 100644
--- a/net/mac80211/debugfs.c
+++ b/net/mac80211/debugfs.c
@@ -148,6 +148,29 @@  static const struct file_operations aqm_ops = {
 	.llseek = default_llseek,
 };
 
+static ssize_t airtime_queued_read(struct file *file,
+				   char __user *user_buf,
+				   size_t count,
+				   loff_t *ppos)
+{
+	struct ieee80211_local *local = file->private_data;
+	char buf[32 * IEEE80211_NUM_ACS], *p = buf;
+	u8 ac;
+
+	for (ac = 0; ac < IEEE80211_NUM_ACS; ac++)
+		p += scnprintf(p, sizeof(buf)+buf-p, "AC%u: %u\n", ac,
+			       local->airtime_queued[ac]);
+
+	return simple_read_from_buffer(userbuf, count, ppos, buf, p - buf);
+
+}
+
+static const struct file_operations airtime_queued_ops = {
+	.read = airtime_queued_read,
+	.open = simple_open,
+	.llseek = default_llseek,
+};
+
 static ssize_t force_tx_status_read(struct file *file,
 				    char __user *user_buf,
 				    size_t count,
@@ -440,6 +463,7 @@  void debugfs_hw_add(struct ieee80211_local *local)
 
 	debugfs_create_u16("airtime_flags", 0600,
 			   phyd, &local->airtime_flags);
+	DEBUGFS_ADD(airtime_queued);
 
 	statsd = debugfs_create_dir("statistics", phyd);
 
diff --git a/net/mac80211/ieee80211_i.h b/net/mac80211/ieee80211_i.h
index 9de5390411ba..6bebfe80ed29 100644
--- a/net/mac80211/ieee80211_i.h
+++ b/net/mac80211/ieee80211_i.h
@@ -63,6 +63,12 @@  struct ieee80211_local;
 #define IEEE80211_RECIPROCAL_DIVISOR 0x100000000ULL
 #define IEEE80211_RECIPROCAL_SHIFT 32
 
+/* constants used for airtime queue limit */
+#define IEEE80211_AIRTIME_QUEUE_LIMIT 8000 /* 8 ms */
+#define IEEE80211_AIRTIME_OVERHEAD 100
+#define IEEE80211_AIRTIME_OVERHEAD_IFS 16
+#define IEEE80211_AIRTIME_MINRATE_RECIPROCAL (IEEE80211_RECIPROCAL_DIVISOR / 6000)
+
 /*
  * Some APs experience problems when working with U-APSD. Decreasing the
  * probability of that happening by using legacy mode for all ACs but VO isn't
@@ -1144,6 +1150,7 @@  struct ieee80211_local {
 	spinlock_t active_txq_lock[IEEE80211_NUM_ACS];
 	struct list_head active_txqs[IEEE80211_NUM_ACS];
 	u16 schedule_round[IEEE80211_NUM_ACS];
+	u32 airtime_queued[IEEE80211_NUM_ACS];
 
 	u16 airtime_flags;
 
diff --git a/net/mac80211/status.c b/net/mac80211/status.c
index ab8ba5835ca0..e63a70657050 100644
--- a/net/mac80211/status.c
+++ b/net/mac80211/status.c
@@ -711,6 +711,28 @@  static void ieee80211_report_used_skb(struct ieee80211_local *local,
 		skb->wifi_acked = acked;
 	}
 
+	if (info->control.tx_time_est) {
+		u8 *qc, ac;
+		int tid;
+
+		if (ieee80211_is_data_qos(hdr->frame_control)) {
+			qc = ieee80211_get_qos_ctl(hdr);
+			tid = qc[0] & 0xf;
+			ac = ieee80211_ac_from_tid(tid);
+		} else {
+			ac = IEEE80211_AC_BE;
+		}
+
+		spin_lock_bh(&local->active_txq_lock[ac]);
+		/* sanity check to make sure we don't underflow */
+		if (WARN_ON_ONCE(info->control.tx_time_est > local->airtime_queued[ac]))
+			local->airtime_queued[ac] = 0;
+		else
+			local->airtime_queued[ac] -= info->control.tx_time_est;
+		spin_unlock_bh(&local->active_txq_lock[ac]);
+
+	}
+
 	ieee80211_led_tx(local);
 
 	if (skb_has_frag_list(skb)) {
diff --git a/net/mac80211/tx.c b/net/mac80211/tx.c
index 1fa422782905..2b8564621ecf 100644
--- a/net/mac80211/tx.c
+++ b/net/mac80211/tx.c
@@ -3546,9 +3546,19 @@  struct sk_buff *ieee80211_tx_dequeue(struct ieee80211_hw *hw,
 	struct ieee80211_tx_data tx;
 	ieee80211_tx_result r;
 	struct ieee80211_vif *vif = txq->vif;
+	u32 airtime = 0, airtime_queued;
+	u8 ac = txq->ac;
+	u32 pktlen;
 
 	WARN_ON_ONCE(softirq_count() == 0);
 
+	spin_lock_bh(&local->active_txq_lock[ac]);
+	airtime_queued = local->airtime_queued[ac];
+	spin_unlock_bh(&local->active_txq_lock[ac]);
+
+	if (airtime_queued > IEEE80211_AIRTIME_QUEUE_LIMIT)
+		return NULL;
+
 begin:
 	spin_lock_bh(&fq->lock);
 
@@ -3581,8 +3591,19 @@  struct sk_buff *ieee80211_tx_dequeue(struct ieee80211_hw *hw,
 	tx.skb = skb;
 	tx.sdata = vif_to_sdata(info->control.vif);
 
-	if (txq->sta)
+	pktlen = skb->len + 38;
+	if (txq->sta) {
 		tx.sta = container_of(txq->sta, struct sta_info, sta);
+		if (tx.sta->last_tx_bitrate) {
+			airtime = (pktlen * 8 * 1000 *
+				   tx.sta->last_tx_bitrate_reciprocal) >> IEEE80211_RECIPROCAL_SHIFT;
+			airtime += IEEE80211_AIRTIME_OVERHEAD;
+		}
+	} else {
+		airtime = (pktlen * 8 * 1000 *
+			   IEEE80211_AIRTIME_MINRATE_RECIPROCAL) >> IEEE80211_RECIPROCAL_SHIFT;
+		airtime += IEEE80211_AIRTIME_OVERHEAD;
+	}
 
 	/*
 	 * The key can be removed while the packet was queued, so need to call
@@ -3659,6 +3680,15 @@  struct sk_buff *ieee80211_tx_dequeue(struct ieee80211_hw *hw,
 	}
 
 	IEEE80211_SKB_CB(skb)->control.vif = vif;
+
+	if (airtime) {
+		info->control.tx_time_est = airtime;
+
+		spin_lock_bh(&local->active_txq_lock[ac]);
+		local->airtime_queued[ac] += airtime;
+		spin_unlock_bh(&local->active_txq_lock[ac]);
+	}
+
 	return skb;
 
 out:
@@ -3676,6 +3706,9 @@  struct ieee80211_txq *ieee80211_next_txq(struct ieee80211_hw *hw, u8 ac)
 
 	spin_lock_bh(&local->active_txq_lock[ac]);
 
+	if (local->airtime_queued[ac] > IEEE80211_AIRTIME_QUEUE_LIMIT)
+		goto out;
+
  begin:
 	txqi = list_first_entry_or_null(&local->active_txqs[ac],
 					struct txq_info,
@@ -3753,6 +3786,9 @@  bool ieee80211_txq_may_transmit(struct ieee80211_hw *hw,
 
 	spin_lock_bh(&local->active_txq_lock[ac]);
 
+	if (local->airtime_queued[ac] > IEEE80211_AIRTIME_QUEUE_LIMIT)
+		goto out;
+
 	if (!txqi->txq.sta)
 		goto out;