[RFC/RFT] mac80211: implement fq_codel for software queuing

Message ID	1456492163-11437-1-git-send-email-michal.kazior@tieto.com (mailing list archive)
State	RFC
Delegated to:	Johannes Berg
Headers	show Return-Path: <linux-wireless-owner@kernel.org> From: Michal Kazior <michal.kazior@tieto.com> To: linux-wireless@vger.kernel.org Cc: johannes@sipsolutions.net, netdev@vger.kernel.org, eric.dumazet@gmail.com, dave.taht@gmail.com, emmanuel.grumbach@intel.com, nbd@openwrt.org, Tim Shepard <shep@alum.mit.edu>, Michal Kazior <michal.kazior@tieto.com> Subject: [RFC/RFT] mac80211: implement fq_codel for software queuing Date: Fri, 26 Feb 2016 14:09:23 +0100 Message-Id: <1456492163-11437-1-git-send-email-michal.kazior@tieto.com> Sender: linux-wireless-owner@vger.kernel.org Precedence: bulk

Michal Kazior Feb. 26, 2016, 1:09 p.m. UTC

Since 11n aggregation become important to get the
best out of txops. However aggregation inherently
requires buffering and queuing. Once variable
medium conditions to different associated stations
is considered it became apparent that bufferbloat
can't be simply fought with qdiscs for wireless
drivers. 11ac with MU-MIMO makes the problem
worse because the bandwidth-delay product becomes
even greater.

This bases on codel5 and sch_fq_codel.c. It may
not be the Right Thing yet but it should at least
provide a framework for more improvements.

I guess dropping rate could factor in per-station
rate control info but I don't know how this should
exactly be done. HW rate control drivers would
need extra work to take advantage of this.

This obviously works only with drivers that use
wake_tx_queue op.

Note: This uses IFF_NO_QUEUE to get rid of qdiscs
for wireless drivers that use mac80211 and
implement wake_tx_queue op.

Moreover the current txq_limit and latency setting
might need tweaking. Either from userspace or be
dynamically scaled with regard to, e.g. number of
associated stations.

FWIW This already works nicely with ath10k's (not
yey merged) pull-push congestion control for
MU-MIMO as far as throughput is concerned.

Evaluating latency improvements is a little tricky
at this point if a driver is using more queue
layering and/or its firmware controls tx
scheduling - hence I don't have any solid data on
this. I'm open for suggestions though.

It might also be a good idea to do the following
in the future:

 - make generic tx scheduling which does some RR
   over per-sta-tid queues and dequeues bursts of
   packets to form a PPDU to fit into designated
   txop timeframe and bytelimit

   This could in theory be shared and used by
   ath9k and (future) mt76.

   Moreover tx scheduling could factor in rate
   control info and keep per-station number of
   queued packets at a sufficient low threshold to
   avoid queue buildup for slow stations. Emmanuel
   already did similar experiment for iwlwifi's
   station mode and got promising results.

 - make software queueing default internally in
   mac80211. This could help other drivers to get
   at least some benefit from mac80211 smarter
   queueing.

Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
---
 include/net/mac80211.h     |  36 ++++-
 net/mac80211/agg-tx.c      |   8 +-
 net/mac80211/codel.h       | 260 +++++++++++++++++++++++++++++++
 net/mac80211/codel_i.h     |  89 +++++++++++
 net/mac80211/ieee80211_i.h |  27 +++-
 net/mac80211/iface.c       |  25 ++-
 net/mac80211/main.c        |   9 +-
 net/mac80211/rx.c          |   2 +-
 net/mac80211/sta_info.c    |  10 +-
 net/mac80211/sta_info.h    |  27 ++++
 net/mac80211/tx.c          | 370 ++++++++++++++++++++++++++++++++++++++++-----
 net/mac80211/util.c        |  20 ++-
 12 files changed, 805 insertions(+), 78 deletions(-)
 create mode 100644 net/mac80211/codel.h
 create mode 100644 net/mac80211/codel_i.h

Felix Fietkau Feb. 26, 2016, 4:48 p.m. UTC | #1

On 2016-02-26 14:09, Michal Kazior wrote:
> Since 11n aggregation become important to get the
> best out of txops. However aggregation inherently
> requires buffering and queuing. Once variable
> medium conditions to different associated stations
> is considered it became apparent that bufferbloat
> can't be simply fought with qdiscs for wireless
> drivers. 11ac with MU-MIMO makes the problem
> worse because the bandwidth-delay product becomes
> even greater.
> 
> This bases on codel5 and sch_fq_codel.c. It may
> not be the Right Thing yet but it should at least
> provide a framework for more improvements.
Nice work!

> I guess dropping rate could factor in per-station
> rate control info but I don't know how this should
> exactly be done. HW rate control drivers would
> need extra work to take advantage of this.
> 
> This obviously works only with drivers that use
> wake_tx_queue op.
> 
> Note: This uses IFF_NO_QUEUE to get rid of qdiscs
> for wireless drivers that use mac80211 and
> implement wake_tx_queue op.
> 
> Moreover the current txq_limit and latency setting
> might need tweaking. Either from userspace or be
> dynamically scaled with regard to, e.g. number of
> associated stations.
> 
> FWIW This already works nicely with ath10k's (not
> yey merged) pull-push congestion control for
> MU-MIMO as far as throughput is concerned.
> 
> Evaluating latency improvements is a little tricky
> at this point if a driver is using more queue
> layering and/or its firmware controls tx
> scheduling - hence I don't have any solid data on
> this. I'm open for suggestions though.
> 
> It might also be a good idea to do the following
> in the future:
> 
>  - make generic tx scheduling which does some RR
>    over per-sta-tid queues and dequeues bursts of
>    packets to form a PPDU to fit into designated
>    txop timeframe and bytelimit
> 
>    This could in theory be shared and used by
>    ath9k and (future) mt76.
> 
>    Moreover tx scheduling could factor in rate
>    control info and keep per-station number of
>    queued packets at a sufficient low threshold to
>    avoid queue buildup for slow stations. Emmanuel
>    already did similar experiment for iwlwifi's
>    station mode and got promising results.
> 
>  - make software queueing default internally in
>    mac80211. This could help other drivers to get
>    at least some benefit from mac80211 smarter
>    queueing.
> 
> Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
> ---
>  include/net/mac80211.h     |  36 ++++-
>  net/mac80211/agg-tx.c      |   8 +-
>  net/mac80211/codel.h       | 260 +++++++++++++++++++++++++++++++
>  net/mac80211/codel_i.h     |  89 +++++++++++
>  net/mac80211/ieee80211_i.h |  27 +++-
>  net/mac80211/iface.c       |  25 ++-
>  net/mac80211/main.c        |   9 +-
>  net/mac80211/rx.c          |   2 +-
>  net/mac80211/sta_info.c    |  10 +-
>  net/mac80211/sta_info.h    |  27 ++++
>  net/mac80211/tx.c          | 370 ++++++++++++++++++++++++++++++++++++++++-----
>  net/mac80211/util.c        |  20 ++-
>  12 files changed, 805 insertions(+), 78 deletions(-)
>  create mode 100644 net/mac80211/codel.h
>  create mode 100644 net/mac80211/codel_i.h
> 

> diff --git a/net/mac80211/tx.c b/net/mac80211/tx.c
> index af584f7cdd63..f42f898cb8b5 100644
> --- a/net/mac80211/tx.c
> +++ b/net/mac80211/tx.c
> + [...]
> +static void ieee80211_txq_enqueue(struct ieee80211_local *local,
> +				  struct txq_info *txqi,
> +				  struct sk_buff *skb)
> +{
> +	struct ieee80211_fq *fq = &local->fq;
> +	struct ieee80211_hw *hw = &local->hw;
> +	struct txq_flow *flow;
> +	struct txq_flow *i;
> +	size_t idx = fq_hash(fq, skb);
> +
> +	flow = &fq->flows[idx];
> +
> +	if (flow->txqi)
> +		flow = &txqi->flow;
I'm not sure I understand this part correctly, but shouldn't that be:
	if (flow->txqi && flow->txqi != txqi)

> +
> +	/* The following overwrites `vif` pointer effectively. It is later
> +	 * restored using txq structure.
> +	 */
> +	IEEE80211_SKB_CB(skb)->control.enqueue_time = codel_get_time();
> +
> +	flow->txqi = txqi;
> +	flow->backlog += skb->len;
> +	txqi->backlog_bytes += skb->len;
> +	txqi->backlog_packets++;
> +	fq->backlog++;
> +
> +	if (list_empty(&flow->backlogchain))
> +		i = list_last_entry(&fq->backlogs, struct txq_flow, backlogchain);
> +	else
> +		i = flow;
> +
> +	list_for_each_entry_continue_reverse(i, &fq->backlogs, backlogchain)
> +		if (i->backlog > flow->backlog)
> +			break;
> +
> +	list_move(&flow->backlogchain, &i->backlogchain);
> +
> +	if (list_empty(&flow->flowchain)) {
> +		flow->deficit = fq->quantum;
> +		list_add_tail(&flow->flowchain, &txqi->new_flows);
> +	}
> +
> +	__skb_queue_tail(&flow->queue, skb);
> +
> +	if (fq->backlog > hw->txq_limit)
> +		fq_drop(local);
> +}
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Michal Kazior Feb. 26, 2016, 6:54 p.m. UTC | #2

On 26 February 2016 at 17:48, Felix Fietkau <nbd@openwrt.org> wrote:
[...]
>> diff --git a/net/mac80211/tx.c b/net/mac80211/tx.c
>> index af584f7cdd63..f42f898cb8b5 100644
>> --- a/net/mac80211/tx.c
>> +++ b/net/mac80211/tx.c
>> + [...]
>> +static void ieee80211_txq_enqueue(struct ieee80211_local *local,
>> +                               struct txq_info *txqi,
>> +                               struct sk_buff *skb)
>> +{
>> +     struct ieee80211_fq *fq = &local->fq;
>> +     struct ieee80211_hw *hw = &local->hw;
>> +     struct txq_flow *flow;
>> +     struct txq_flow *i;
>> +     size_t idx = fq_hash(fq, skb);
>> +
>> +     flow = &fq->flows[idx];
>> +
>> +     if (flow->txqi)
>> +             flow = &txqi->flow;
> I'm not sure I understand this part correctly, but shouldn't that be:
>         if (flow->txqi && flow->txqi != txqi)

You're correct. Good catch, thanks!


Micha?
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Johannes Berg March 1, 2016, 2:02 p.m. UTC | #3

On Fri, 2016-02-26 at 14:09 +0100, Michal Kazior wrote:
> 
> +typedef u64 codel_time_t;

Do we really need this? And if yes, does it have to be in the public
header file? Why a typedef anyway?

> - * @txq_ac_max_pending: maximum number of frames per AC pending in
> all txq
> - *	entries for a vif.
> + * @txq_cparams: codel parameters to control tx queueing dropping
> behavior
> + * @txq_limit: maximum number of frames queuesd

typo - queued

> @@ -2133,7 +2155,8 @@ struct ieee80211_hw {
>  	u8 uapsd_max_sp_len;
>  	u8 n_cipher_schemes;
>  	const struct ieee80211_cipher_scheme *cipher_schemes;
> -	int txq_ac_max_pending;
> +	struct codel_params txq_cparams;

Should this really be a parameter the driver determines?

> +static void ieee80211_if_setup_no_queue(struct net_device *dev)
> +{
> +	ieee80211_if_setup(dev);
> +	dev->priv_flags |= IFF_NO_QUEUE;
> +	/* Note for backporters: use dev->tx_queue_len = 0 instead
> of IFF_ */

Heh. Remove that comment; we have an spatch in backports already :)

> --- a/net/mac80211/sta_info.h
> +++ b/net/mac80211/sta_info.h
> @@ -19,6 +19,7 @@
>  #include <linux/etherdevice.h>
>  #include <linux/rhashtable.h>
>  #include "key.h"
> +#include "codel_i.h"
>  
>  /**
>   * enum ieee80211_sta_info_flags - Stations flags
> @@ -327,6 +328,32 @@ struct mesh_sta {
>  
>  DECLARE_EWMA(signal, 1024, 8)
>  
> +struct txq_info;
> +
> +/**
> + * struct txq_flow - per traffic flow queue
> + *
> + * This structure is used to distinguish and queue different traffic
> flows
> + * separately for fair queueing/AQM purposes.
> + *
> + * @txqi: txq_info structure it is associated at given time
> + * @flowchain: can be linked to other flows for RR purposes
> + * @backlogchain: can be linked to other flows for backlog sorting
> purposes
> + * @queue: sk_buff queue
> + * @cvars: codel state vars
> + * @backlog: number of bytes pending in the queue
> + * @deficit: used for fair queueing balancing
> + */
> +struct txq_flow {
> +	struct txq_info *txqi;
> +	struct list_head flowchain;
> +	struct list_head backlogchain;
> +	struct sk_buff_head queue;
> +	struct codel_vars cvars;
> +	u32 backlog;
> +	u32 deficit;
> +};
> +
>  /**
>   * struct sta_info - STA information
>   *
> diff --git a/net/mac80211/tx.c b/net/mac80211/tx.c
> index af584f7cdd63..f42f898cb8b5 100644
> --- a/net/mac80211/tx.c
> +++ b/net/mac80211/tx.c
> @@ -34,6 +34,7 @@
>  #include "wpa.h"
>  #include "wme.h"
>  #include "rate.h"
> +#include "codel.h"

> +static inline codel_time_t
> +custom_codel_get_enqueue_time(struct sk_buff *skb)

There are a lot of inlines here - first of all, do they all need to be
inline?

And secondly, perhaps it would make some sense to move this into
another file?

johannes
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Michal Kazior March 2, 2016, 7:38 a.m. UTC | #4

On 1 March 2016 at 15:02, Johannes Berg <johannes@sipsolutions.net> wrote:
> On Fri, 2016-02-26 at 14:09 +0100, Michal Kazior wrote:
>>
>> +typedef u64 codel_time_t;
>
> Do we really need this? And if yes, does it have to be in the public
> header file? Why a typedef anyway?

Hmm.. I don't think it's strictly necessary. I just wanted to keep as
much from the original codel implementation as possible. I'm fine with
using just u64.


>> - * @txq_ac_max_pending: maximum number of frames per AC pending in
>> all txq
>> - *   entries for a vif.
>> + * @txq_cparams: codel parameters to control tx queueing dropping
>> behavior
>> + * @txq_limit: maximum number of frames queuesd
>
> typo - queued
>
>> @@ -2133,7 +2155,8 @@ struct ieee80211_hw {
>>       u8 uapsd_max_sp_len;
>>       u8 n_cipher_schemes;
>>       const struct ieee80211_cipher_scheme *cipher_schemes;
>> -     int txq_ac_max_pending;
>> +     struct codel_params txq_cparams;
>
> Should this really be a parameter the driver determines?

I would think so, or at least it should be able to influence it in
*some* way. You can have varying degree of induced latency depending
on fw/hw tx queue depth, air conditions and possible tx rates implying
different/varying RTT. Cake[1] even has a few RTT presets like: lan,
internet, satellite.

I don't really have a plan how exactly a driver could make use of it
for benefit though. It might end up not being necessary after all if
generic tx scheduling materializes in mac80211.


[1]: http://www.bufferbloat.net/projects/codel/wiki/Cake


>> +static void ieee80211_if_setup_no_queue(struct net_device *dev)
>> +{
>> +     ieee80211_if_setup(dev);
>> +     dev->priv_flags |= IFF_NO_QUEUE;
>> +     /* Note for backporters: use dev->tx_queue_len = 0 instead
>> of IFF_ */
>
> Heh. Remove that comment; we have an spatch in backports already :)

I've put it in the RFC specifically in case anyone wants to port this
bypassing backports, e.g. to openwrt's quilt (so when compilation
fails, you can quickly fix it up). I'll remove it before proper
submission obviously :)


>> --- a/net/mac80211/sta_info.h
>> +++ b/net/mac80211/sta_info.h
>> @@ -19,6 +19,7 @@
>>  #include <linux/etherdevice.h>
>>  #include <linux/rhashtable.h>
>>  #include "key.h"
>> +#include "codel_i.h"
>>
>>  /**
>>   * enum ieee80211_sta_info_flags - Stations flags
>> @@ -327,6 +328,32 @@ struct mesh_sta {
>>
>>  DECLARE_EWMA(signal, 1024, 8)
>>
>> +struct txq_info;
>> +
>> +/**
>> + * struct txq_flow - per traffic flow queue
>> + *
>> + * This structure is used to distinguish and queue different traffic
>> flows
>> + * separately for fair queueing/AQM purposes.
>> + *
>> + * @txqi: txq_info structure it is associated at given time
>> + * @flowchain: can be linked to other flows for RR purposes
>> + * @backlogchain: can be linked to other flows for backlog sorting
>> purposes
>> + * @queue: sk_buff queue
>> + * @cvars: codel state vars
>> + * @backlog: number of bytes pending in the queue
>> + * @deficit: used for fair queueing balancing
>> + */
>> +struct txq_flow {
>> +     struct txq_info *txqi;
>> +     struct list_head flowchain;
>> +     struct list_head backlogchain;
>> +     struct sk_buff_head queue;
>> +     struct codel_vars cvars;
>> +     u32 backlog;
>> +     u32 deficit;
>> +};
>> +
>>  /**
>>   * struct sta_info - STA information
>>   *
>> diff --git a/net/mac80211/tx.c b/net/mac80211/tx.c
>> index af584f7cdd63..f42f898cb8b5 100644
>> --- a/net/mac80211/tx.c
>> +++ b/net/mac80211/tx.c
>> @@ -34,6 +34,7 @@
>>  #include "wpa.h"
>>  #include "wme.h"
>>  #include "rate.h"
>> +#include "codel.h"
>
>> +static inline codel_time_t
>> +custom_codel_get_enqueue_time(struct sk_buff *skb)
>
> There are a lot of inlines here - first of all, do they all need to be
> inline?

I just followed the style of code found in codel implementation. I
figured there's a reason why is uses `inline` although I didn't do any
measurements myself.


> And secondly, perhaps it would make some sense to move this into
> another file?

I did consider it briefly but decided it'll be beneficial for the
compiler to have tx hotpath in a single file.


Micha?
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Dave Taht March 3, 2016, 5 p.m. UTC | #5

On Tue, Mar 1, 2016 at 11:38 PM, Michal Kazior <michal.kazior@tieto.com> wrote:
> On 1 March 2016 at 15:02, Johannes Berg <johannes@sipsolutions.net> wrote:
>> On Fri, 2016-02-26 at 14:09 +0100, Michal Kazior wrote:
>>>
>>> +typedef u64 codel_time_t;
>>
>> Do we really need this? And if yes, does it have to be in the public
>> header file? Why a typedef anyway?
>
> Hmm.. I don't think it's strictly necessary. I just wanted to keep as
> much from the original codel implementation as possible. I'm fine with
> using just u64.

This is an artifact of the original codel keeping time in (nsec >> 10)
to fit into a 32 bit int.

In codel5 we switched to native 64 bit timekeeping as simpler, to
improve logging and reason about.

u64 is fine.

>
>>> - * @txq_ac_max_pending: maximum number of frames per AC pending in
>>> all txq
>>> - *   entries for a vif.
>>> + * @txq_cparams: codel parameters to control tx queueing dropping
>>> behavior
>>> + * @txq_limit: maximum number of frames queuesd
>>
>> typo - queued
>>
>>> @@ -2133,7 +2155,8 @@ struct ieee80211_hw {
>>>       u8 uapsd_max_sp_len;
>>>       u8 n_cipher_schemes;
>>>       const struct ieee80211_cipher_scheme *cipher_schemes;
>>> -     int txq_ac_max_pending;
>>> +     struct codel_params txq_cparams;
>>
>> Should this really be a parameter the driver determines?
>
> I would think so, or at least it should be able to influence it in
> *some* way. You can have varying degree of induced latency depending
> on fw/hw tx queue depth, air conditions and possible tx rates implying
> different/varying RTT. Cake[1] even has a few RTT presets like: lan,
> internet, satellite.

While those presets have been useful in testing codel and (more
generically in cake we can rapidly change the bandwidth from userspace
for testing), in the real world you don't move from orbit to desktop
and back as rapidly as wifi does.

> I don't really have a plan how exactly a driver could make use of it
> for benefit though. It might end up not being necessary after all if
> generic tx scheduling materializes in mac80211.

What we envisioned here is ewma smoothing the target based on the
total service time needed for all active stations, per round. (There
are other possible approaches)

Say you serve 10 stations at 1ms each in one round, a codel target of
5ms will try to push things down too far.  If in the next round, you
only serve 2 stations at 1ms each (but get back 10 responses at .5ms
each), you're still too high. If it's just one station, well, you can
get below 2ms if the driver is only sending 1ms, but maybe it's
sending 5ms...

If you have a large multicast burst, that burp will cause lost packets.

Merely getting typical wifi latencies under load down below the 20ms
range would be a good start, after that some testing, hard thought,
and evaluation are going to be needed..... for early testing I think a
20ms fixed target would be safer than the existing 5ms.

Pushing the fq part of fq_codel on a per station basis as close to the
hardware as possible, and having better airtime fairness between
stations is a huge win in itself.





>
> [1]: http://www.bufferbloat.net/projects/codel/wiki/Cake
>
>
>>> +static void ieee80211_if_setup_no_queue(struct net_device *dev)
>>> +{
>>> +     ieee80211_if_setup(dev);
>>> +     dev->priv_flags |= IFF_NO_QUEUE;
>>> +     /* Note for backporters: use dev->tx_queue_len = 0 instead
>>> of IFF_ */
>>
>> Heh. Remove that comment; we have an spatch in backports already :)
>
> I've put it in the RFC specifically in case anyone wants to port this
> bypassing backports, e.g. to openwrt's quilt (so when compilation
> fails, you can quickly fix it up). I'll remove it before proper
> submission obviously :)
>
>
>>> --- a/net/mac80211/sta_info.h
>>> +++ b/net/mac80211/sta_info.h
>>> @@ -19,6 +19,7 @@
>>>  #include <linux/etherdevice.h>
>>>  #include <linux/rhashtable.h>
>>>  #include "key.h"
>>> +#include "codel_i.h"
>>>
>>>  /**
>>>   * enum ieee80211_sta_info_flags - Stations flags
>>> @@ -327,6 +328,32 @@ struct mesh_sta {
>>>
>>>  DECLARE_EWMA(signal, 1024, 8)
>>>
>>> +struct txq_info;
>>> +
>>> +/**
>>> + * struct txq_flow - per traffic flow queue
>>> + *
>>> + * This structure is used to distinguish and queue different traffic
>>> flows
>>> + * separately for fair queueing/AQM purposes.
>>> + *
>>> + * @txqi: txq_info structure it is associated at given time
>>> + * @flowchain: can be linked to other flows for RR purposes
>>> + * @backlogchain: can be linked to other flows for backlog sorting
>>> purposes
>>> + * @queue: sk_buff queue
>>> + * @cvars: codel state vars
>>> + * @backlog: number of bytes pending in the queue
>>> + * @deficit: used for fair queueing balancing
>>> + */
>>> +struct txq_flow {
>>> +     struct txq_info *txqi;
>>> +     struct list_head flowchain;
>>> +     struct list_head backlogchain;
>>> +     struct sk_buff_head queue;
>>> +     struct codel_vars cvars;
>>> +     u32 backlog;
>>> +     u32 deficit;
>>> +};
>>> +
>>>  /**
>>>   * struct sta_info - STA information
>>>   *
>>> diff --git a/net/mac80211/tx.c b/net/mac80211/tx.c
>>> index af584f7cdd63..f42f898cb8b5 100644
>>> --- a/net/mac80211/tx.c
>>> +++ b/net/mac80211/tx.c
>>> @@ -34,6 +34,7 @@
>>>  #include "wpa.h"
>>>  #include "wme.h"
>>>  #include "rate.h"
>>> +#include "codel.h"
>>
>>> +static inline codel_time_t
>>> +custom_codel_get_enqueue_time(struct sk_buff *skb)
>>
>> There are a lot of inlines here - first of all, do they all need to be
>> inline?
>
> I just followed the style of code found in codel implementation. I
> figured there's a reason why is uses `inline` although I didn't do any
> measurements myself.
>
>
>> And secondly, perhaps it would make some sense to move this into
>> another file?
>
> I did consider it briefly but decided it'll be beneficial for the
> compiler to have tx hotpath in a single file.
>
>
> Micha?
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Michal Kazior March 4, 2016, 6:32 a.m. UTC | #6

On 4 March 2016 at 03:48, Tim Shepard <shep@alum.mit.edu> wrote:
[...]
> (I am interested in knowing what other mac80211 drivers have been
>  modified to use the mac80211 intermediate software queues.   I know
>  Michal mentioned he has patches for ath10k that are not yet released,
>  and I know Felix is finishing up the mt76 driver which uses them.)

Patches for ath10k are under review since quite some time now (but are
not merged yet). The latest re-spin is:

  http://lists.infradead.org/pipermail/ath10k/2016-March/006923.html


Micha?
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Avery Pennarun March 7, 2016, 2:05 p.m. UTC | #7

On Fri, Mar 4, 2016 at 1:32 AM, Michal Kazior <michal.kazior@tieto.com> wrote:
> On 4 March 2016 at 03:48, Tim Shepard <shep@alum.mit.edu> wrote:
> [...]
>> (I am interested in knowing what other mac80211 drivers have been
>>  modified to use the mac80211 intermediate software queues.   I know
>>  Michal mentioned he has patches for ath10k that are not yet released,
>>  and I know Felix is finishing up the mt76 driver which uses them.)
>
> Patches for ath10k are under review since quite some time now (but are
> not merged yet). The latest re-spin is:
>
>   http://lists.infradead.org/pipermail/ath10k/2016-March/006923.html

Hi all, on Friday I had a chance to experiment with some of these
patches, specifically Tim's ath9k patch (to use intermediate queues),
plus MIchal's patch to use fq_codel with the intermediate queues.  I
didn't attempt any fine tuning; I just slapped them together to see
what happens.  (I tried applying Michal's ath10k patches too, but got
stuck since they seem to be applied against the upstream v4.4 kernel
and didn't merge cleanly with the latest mac80211 branch.  Maybe I was
doing something wrong.)

Test setup:
   AP (ath9k) -> 2x2 strong signal -> STA1 (mwifiex)
        -> attenuator (-40 dB) -> 1x1 weak signal  -> STA2 (mwifiex)

STA2 generally gets modulation levels around MCS0-2 and STA1 usually
gets something like MCS12-15.

With or without this patch, results with TCP iperf were fishy - I
think packet loss patterns were particularly bad and caused 2-second
TCP retry timeouts occasionally - so I removed TCP from the test and
switched the UDP iperf instead.

I ran isoping (https://gfiber.googlesource.com/vendor/google/platform/+/master/cmds/isoping.c)
from the AP to both stations to measure two-way latency during all
tests.  (I used -r2 for two packets/sec in each direction in order not
to affect the test results too much.)

Overall results:

- Running one iperf at a time, I saw ~45 Mbps to STA1 and ~7 Mbps to STA2.

- Running both iperfs at once, without the patches, latencies got
extremely high (~600ms sometimes) and results were closer to
byte-fairness than airtime-fairness (ie. ~7 Mbps each).

- Running both iperfs at once, with the patches, latencies were still
high (usually high 2-digit, sometimes low 3-digit latencies) but we
got closer to airtime-fairness than byte-fairness (~17 Mbps and ~2
Mbps).

- With only one iperf running, without the patches, latencies were
high to both stations.  With the patches, latency was
mid-double-digits to the non-iperf station (pretty good!) while being
low-mid triple-digits to the busy iperf station.  This suggests that
we are getting per-station queuing (yay!) but does make me question
whether the fq_ in fq_codel was working.

- More generally, the latencies were all still higher than expected.
I didn't investigate why this might be, but the obvious guess (which
Tim has agreed with) is that we need something BQL-like in addition to
the fq_codel layer.  The BQL-like thing is what Emmanuel's earlier
latency patch did with iwlwifi, with apparently good results.  If
someone wants to try making a similar patch for ath9k, I'd be happy to
help test it out.

Although things aren't yet nearly as good as I'd like to see them,
I'll note that Tim's and Michal's patches don't seem to make things
*worse*, at least in my setup, and do improve results in my test.  So
if they pass code review, it may make sense to apply them as one small
step forward to reducing wifi latency under load.

Have fun,

Avery
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Felix Fietkau March 7, 2016, 3:09 p.m. UTC | #8

On 2016-03-07 15:05, Avery Pennarun wrote:
> On Fri, Mar 4, 2016 at 1:32 AM, Michal Kazior <michal.kazior@tieto.com> wrote:
>> On 4 March 2016 at 03:48, Tim Shepard <shep@alum.mit.edu> wrote:
>> [...]
>>> (I am interested in knowing what other mac80211 drivers have been
>>>  modified to use the mac80211 intermediate software queues.   I know
>>>  Michal mentioned he has patches for ath10k that are not yet released,
>>>  and I know Felix is finishing up the mt76 driver which uses them.)
>>
>> Patches for ath10k are under review since quite some time now (but are
>> not merged yet). The latest re-spin is:
>>
>>   http://lists.infradead.org/pipermail/ath10k/2016-March/006923.html
> 
> Hi all, on Friday I had a chance to experiment with some of these
> patches, specifically Tim's ath9k patch (to use intermediate queues),
> plus MIchal's patch to use fq_codel with the intermediate queues.  I
> didn't attempt any fine tuning; I just slapped them together to see
> what happens.  (I tried applying Michal's ath10k patches too, but got
> stuck since they seem to be applied against the upstream v4.4 kernel
> and didn't merge cleanly with the latest mac80211 branch.  Maybe I was
> doing something wrong.)
> 
> Test setup:
>    AP (ath9k) -> 2x2 strong signal -> STA1 (mwifiex)
>         -> attenuator (-40 dB) -> 1x1 weak signal  -> STA2 (mwifiex)
> 
> STA2 generally gets modulation levels around MCS0-2 and STA1 usually
> gets something like MCS12-15.
> 
> With or without this patch, results with TCP iperf were fishy - I
> think packet loss patterns were particularly bad and caused 2-second
> TCP retry timeouts occasionally - so I removed TCP from the test and
> switched the UDP iperf instead.
> 
> I ran isoping (https://gfiber.googlesource.com/vendor/google/platform/+/master/cmds/isoping.c)
> from the AP to both stations to measure two-way latency during all
> tests.  (I used -r2 for two packets/sec in each direction in order not
> to affect the test results too much.)
> 
> Overall results:
> 
> - Running one iperf at a time, I saw ~45 Mbps to STA1 and ~7 Mbps to STA2.
> 
> - Running both iperfs at once, without the patches, latencies got
> extremely high (~600ms sometimes) and results were closer to
> byte-fairness than airtime-fairness (ie. ~7 Mbps each).
> 
> - Running both iperfs at once, with the patches, latencies were still
> high (usually high 2-digit, sometimes low 3-digit latencies) but we
> got closer to airtime-fairness than byte-fairness (~17 Mbps and ~2
> Mbps).
> 
> - With only one iperf running, without the patches, latencies were
> high to both stations.  With the patches, latency was
> mid-double-digits to the non-iperf station (pretty good!) while being
> low-mid triple-digits to the busy iperf station.  This suggests that
> we are getting per-station queuing (yay!) but does make me question
> whether the fq_ in fq_codel was working.
Please change the 'if (flow->txqi)' check in ieee80211_txq_enqueue to:
if (flow->txqi && flow->txqi != txqi)
This should hopefully fix the fq_ part ;)

- Felix
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Avery Pennarun March 7, 2016, 4:25 p.m. UTC | #9

On Mon, Mar 7, 2016 at 10:09 AM, Felix Fietkau <nbd@openwrt.org> wrote:
> On 2016-03-07 15:05, Avery Pennarun wrote:
>> On Fri, Mar 4, 2016 at 1:32 AM, Michal Kazior <michal.kazior@tieto.com> wrote:
>>> On 4 March 2016 at 03:48, Tim Shepard <shep@alum.mit.edu> wrote:
>>> [...]
>>>> (I am interested in knowing what other mac80211 drivers have been
>>>>  modified to use the mac80211 intermediate software queues.   I know
>>>>  Michal mentioned he has patches for ath10k that are not yet released,
>>>>  and I know Felix is finishing up the mt76 driver which uses them.)
>>>
>>> Patches for ath10k are under review since quite some time now (but are
>>> not merged yet). The latest re-spin is:
>>>
>>>   http://lists.infradead.org/pipermail/ath10k/2016-March/006923.html
>>
>> Hi all, on Friday I had a chance to experiment with some of these
>> patches, specifically Tim's ath9k patch (to use intermediate queues),
>> plus MIchal's patch to use fq_codel with the intermediate queues.  I
>> didn't attempt any fine tuning; I just slapped them together to see
>> what happens.  (I tried applying Michal's ath10k patches too, but got
>> stuck since they seem to be applied against the upstream v4.4 kernel
>> and didn't merge cleanly with the latest mac80211 branch.  Maybe I was
>> doing something wrong.)
>>
>> Test setup:
>>    AP (ath9k) -> 2x2 strong signal -> STA1 (mwifiex)
>>         -> attenuator (-40 dB) -> 1x1 weak signal  -> STA2 (mwifiex)
>>
>> STA2 generally gets modulation levels around MCS0-2 and STA1 usually
>> gets something like MCS12-15.
>>
>> With or without this patch, results with TCP iperf were fishy - I
>> think packet loss patterns were particularly bad and caused 2-second
>> TCP retry timeouts occasionally - so I removed TCP from the test and
>> switched the UDP iperf instead.
>>
>> I ran isoping (https://gfiber.googlesource.com/vendor/google/platform/+/master/cmds/isoping.c)
>> from the AP to both stations to measure two-way latency during all
>> tests.  (I used -r2 for two packets/sec in each direction in order not
>> to affect the test results too much.)
>>
>> Overall results:
>>
>> - Running one iperf at a time, I saw ~45 Mbps to STA1 and ~7 Mbps to STA2.
>>
>> - Running both iperfs at once, without the patches, latencies got
>> extremely high (~600ms sometimes) and results were closer to
>> byte-fairness than airtime-fairness (ie. ~7 Mbps each).
>>
>> - Running both iperfs at once, with the patches, latencies were still
>> high (usually high 2-digit, sometimes low 3-digit latencies) but we
>> got closer to airtime-fairness than byte-fairness (~17 Mbps and ~2
>> Mbps).
>>
>> - With only one iperf running, without the patches, latencies were
>> high to both stations.  With the patches, latency was
>> mid-double-digits to the non-iperf station (pretty good!) while being
>> low-mid triple-digits to the busy iperf station.  This suggests that
>> we are getting per-station queuing (yay!) but does make me question
>> whether the fq_ in fq_codel was working.
>
> Please change the 'if (flow->txqi)' check in ieee80211_txq_enqueue to:
> if (flow->txqi && flow->txqi != txqi)
> This should hopefully fix the fq_ part ;)

Oops, I saw your message about that earlier and totally forgot to
apply the change.  But maybe that was for the best, because it doesn't
seem to uniformly make things better.

*Without* your change, I observe that my iperf3 session to STA1 (high
speed) seems to complain about a lot of out-of-order packets.  *With*
your change, the out-of-order complaints seem to go away, which is
nice.  The throughput measurements look about the same both ways.

However, *without* your change, isoping latency to STA1 (low speed)
seems to be pretty stable in the ~100ms range (although it fluctuates
a bit).  *With* your change, STA2 latency fluctuates wildly as low as
1.x ms (yay!) but as high as 800ms (boo).  STA1 latency is fairly low
in both cases.

I have to admit, I haven't read any of this code in enough detail to
have a guess as to why this might be. But I did switch back and forth
between the two versions a few times to confirm that it seems to be
repeatable.

Just to compare, I went back to a version that contains only Tim's
patch (intermediate queues) but not fq_codel.  That one seems to have
much less variability in the isoping times (~50-100ms under load).
The best case isn't as good, but the worst case is much less bad.
This suggests to me that maybe codel's per-station drop rate is
oscillating (perhaps it needs to ramp less quickly?).  I wonder if the
competing codels between stations also confuse each other: as one
ramps down, maybe the other one would be encouraged to ramp up?
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Dave Taht March 7, 2016, 4:54 p.m. UTC | #10

If I can just get a coherent patch set that I can build, I will gladly
join you on the testing ath9k in particular... can probably do ath10k,
too - and do a bit of code review... this week. it's very exciting
watching all this activity...

but I confess that I've totally lost track of what set of trees and
patchwork I should try to pull from. wireless-drivers-next? ath10k?
wireless-next? net-next? toke and I have a ton of x86 platforms
available to test on....

Avery - which patches did you use??? on top of what?

In terms of "smoothing" codel...

I emphatically do not think codel in it's current form is "ready" for
wireless, at the very least the target should not be much lower than
20ms in your 2 station tests.  There is another bit in codel where the
algo "turns off" with only a single MTU's worth of packets
outstanding, which could get bumped to the ideal size of the
aggregate. "ideal" kind of being a variable based on a ton of other
factors...

the underlying code needs to be striving successfully for per-station
airtime fairness for this to work at all, and the driver/card
interface nearly as tight as BQL is for the fq portion to behave
sanely. I'd configure codel at a higher target and try to observe what
is going on at the fq level til that got saner.

There are so many other variables and things left unaccounted for, as yet.
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Avery Pennarun March 7, 2016, 5:14 p.m. UTC | #11

On Mon, Mar 7, 2016 at 11:54 AM, Dave Taht <dave.taht@gmail.com> wrote:
> If I can just get a coherent patch set that I can build, I will gladly
> join you on the testing ath9k in particular... can probably do ath10k,
> too - and do a bit of code review... this week. it's very exciting
> watching all this activity...
>
> but I confess that I've totally lost track of what set of trees and
> patchwork I should try to pull from. wireless-drivers-next? ath10k?
> wireless-next? net-next? toke and I have a ton of x86 platforms
> available to test on....
>
> Avery - which patches did you use??? on top of what?

The patch series I'm currently using can be found here:

  git fetch https://gfiber.googlesource.com/vendor/opensource/backports
ath9k_txq+fq_codel

That's again backports-20160122, which comes from linux-next as of
20160122.  You can either build backports against whatever kernel
you're using (probably easiest) or try to use that version of
linux-next, or rebase the patches onto your favourite kernel.

> In terms of "smoothing" codel...
>
> I emphatically do not think codel in it's current form is "ready" for
> wireless, at the very least the target should not be much lower than
> 20ms in your 2 station tests.  There is another bit in codel where the
> algo "turns off" with only a single MTU's worth of packets
> outstanding, which could get bumped to the ideal size of the
> aggregate. "ideal" kind of being a variable based on a ton of other
> factors...

Yeah, I figured that sort of thing would come up.  I'm feeling forward
progress just by finally seeing the buggy oscillations finally happen,
though. :)

> the underlying code needs to be striving successfully for per-station
> airtime fairness for this to work at all, and the driver/card
> interface nearly as tight as BQL is for the fq portion to behave
> sanely. I'd configure codel at a higher target and try to observe what
> is going on at the fq level til that got saner.

That seems like two good goals.  So Emmanuel's BQL-like thing seems
like we'll need it soon.

As for per-station airtime fairness, what's a good approximation of
that?  Perhaps round-robin between stations, one aggregate per turn,
where each aggregate has a maximum allowed latency?  I don't know how
the current code works, but it's probably almost like that, as long as
we only put one aggregate's worth of stuff into each hwq, which I
guess is what the BQL-like thing will do.

So if I understand correctly, what we need is, in the following order:
1) Disable fq_codel for now, and get BQL-like thing working in ath9k
(and ensure we're getting airtime fairness even without fq_codel);
2) Re-enable fq_codel and increase fq_codel's target up to 20ms for now;
3) Tweak fq_codel's "turn off" size to be larger (how important is this?)

Is that right?
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Emmanuel Grumbach March 7, 2016, 5:22 p.m. UTC | #12

On 03/07/2016 07:15 PM, Avery Pennarun wrote:
> On Mon, Mar 7, 2016 at 11:54 AM, Dave Taht <dave.taht@gmail.com> wrote:
>> If I can just get a coherent patch set that I can build, I will gladly
>> join you on the testing ath9k in particular... can probably do ath10k,
>> too - and do a bit of code review... this week. it's very exciting
>> watching all this activity...
>>
>> but I confess that I've totally lost track of what set of trees and
>> patchwork I should try to pull from. wireless-drivers-next? ath10k?
>> wireless-next? net-next? toke and I have a ton of x86 platforms
>> available to test on....
>>
>> Avery - which patches did you use??? on top of what?
>
> The patch series I'm currently using can be found here:
>
>    git fetch https://gfiber.googlesource.com/vendor/opensource/backports
> ath9k_txq+fq_codel
>
> That's again backports-20160122, which comes from linux-next as of
> 20160122.  You can either build backports against whatever kernel
> you're using (probably easiest) or try to use that version of
> linux-next, or rebase the patches onto your favourite kernel.
>
>> In terms of "smoothing" codel...
>>
>> I emphatically do not think codel in it's current form is "ready" for
>> wireless, at the very least the target should not be much lower than
>> 20ms in your 2 station tests.  There is another bit in codel where the
>> algo "turns off" with only a single MTU's worth of packets
>> outstanding, which could get bumped to the ideal size of the
>> aggregate. "ideal" kind of being a variable based on a ton of other
>> factors...
>
> Yeah, I figured that sort of thing would come up.  I'm feeling forward
> progress just by finally seeing the buggy oscillations finally happen,
> though. :)
>
>> the underlying code needs to be striving successfully for per-station
>> airtime fairness for this to work at all, and the driver/card
>> interface nearly as tight as BQL is for the fq portion to behave
>> sanely. I'd configure codel at a higher target and try to observe what
>> is going on at the fq level til that got saner.
>
> That seems like two good goals.  So Emmanuel's BQL-like thing seems
> like we'll need it soon.

Well... I am going to do that for station only, and only for iwlwifi.
I haven't had a chance to work on that since then :( but I hope to get 
back to it. I also need to check what happens in multiple channels 
scenarios (in which there is inherent latency).
AP and stations have different challenges.

>
> As for per-station airtime fairness, what's a good approximation of
> that?  Perhaps round-robin between stations, one aggregate per turn,
> where each aggregate has a maximum allowed latency?  I don't know how
> the current code works, but it's probably almost like that, as long as
> we only put one aggregate's worth of stuff into each hwq, which I
> guess is what the BQL-like thing will do.
>
> So if I understand correctly, what we need is, in the following order:
> 1) Disable fq_codel for now, and get BQL-like thing working in ath9k
> (and ensure we're getting airtime fairness even without fq_codel);
> 2) Re-enable fq_codel and increase fq_codel's target up to 20ms for now;
> 3) Tweak fq_codel's "turn off" size to be larger (how important is this?)
>
> Is that right?
>
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Dave Taht March 7, 2016, 6:28 p.m. UTC | #13

On Mon, Mar 7, 2016 at 9:14 AM, Avery Pennarun <apenwarr@gmail.com> wrote:
> On Mon, Mar 7, 2016 at 11:54 AM, Dave Taht <dave.taht@gmail.com> wrote:
>> If I can just get a coherent patch set that I can build, I will gladly
>> join you on the testing ath9k in particular... can probably do ath10k,
>> too - and do a bit of code review... this week. it's very exciting
>> watching all this activity...
>>
>> but I confess that I've totally lost track of what set of trees and
>> patchwork I should try to pull from. wireless-drivers-next? ath10k?
>> wireless-next? net-next? toke and I have a ton of x86 platforms
>> available to test on....
>>
>> Avery - which patches did you use??? on top of what?
>
> The patch series I'm currently using can be found here:
>
>   git fetch https://gfiber.googlesource.com/vendor/opensource/backports
> ath9k_txq+fq_codel

No common commits, but ok, thx for a buildable-looking tree.

d@dancer:~/git/linux$ git clone -b ath9k_txq+fq_codel --reference
net-next https://gfiber.googlesource.com/vendor/opensource/backports
Cloning into 'backports'...
warning: no common commits
remote: Sending approximately 30.48 MiB ...
remote: Counting objects: 4758, done
remote: Finding sources: 100% (5/5)
remote: Total 19312 (delta 12999), reused 19308 (delta 12999)
Receiving objects: 100% (19312/19312), 30.48 MiB | 6.23 MiB/s, done.
Resolving deltas: 100% (12999/12999), done.


>
> That's again backports-20160122, which comes from linux-next as of
> 20160122.  You can either build backports against whatever kernel
> you're using (probably easiest) or try to use that version of
> linux-next, or rebase the patches onto your favourite kernel.
>
>> In terms of "smoothing" codel...
>>
>> I emphatically do not think codel in it's current form is "ready" for
>> wireless, at the very least the target should not be much lower than
>> 20ms in your 2 station tests.  There is another bit in codel where the
>> algo "turns off" with only a single MTU's worth of packets
>> outstanding, which could get bumped to the ideal size of the
>> aggregate. "ideal" kind of being a variable based on a ton of other
>> factors...
>
> Yeah, I figured that sort of thing would come up.  I'm feeling forward
> progress just by finally seeing the buggy oscillations finally happen,
> though. :)

It's *very* exciting to see y'all break things in a measurable, yet
positive direction.

>
>> the underlying code needs to be striving successfully for per-station
>> airtime fairness for this to work at all, and the driver/card
>> interface nearly as tight as BQL is for the fq portion to behave
>> sanely. I'd configure codel at a higher target and try to observe what
>> is going on at the fq level til that got saner.
>
> That seems like two good goals.  So Emmanuel's BQL-like thing seems
> like we'll need it soon.
>
> As for per-station airtime fairness, what's a good approximation of
> that?  Perhaps round-robin between stations, one aggregate per turn,
> where each aggregate has a maximum allowed latency?

Strict round robin is a start, and simplest, yes. Sure.

"Oldest station queues first" on a round (probably) has higher
potential for maximizing txops, but requires more overhead. (shortest
queue first would be bad). There's another algo based on last received
packets from a station possibly worth fiddling with in the long run...

as "maximum allowed latency" - well, to me that is eventually also a
variable, based on the number of stations that have to be scheduled on
that round. Trying to get away from 10 stations eating 5.7ms each +
return traffic on a round would be nicer. If you want a constant, for
now, aim for 2048us or 1TU.

> I don't know how
> the current code works, but it's probably almost like that, as long as
> we only put one aggregate's worth of stuff into each hwq, which I
> guess is what the BQL-like thing will do.

I would avoid trying to think about or using 802.11e's 4 queues at the
moment[1]. We also have fallout from mu-mimo to deal with, eventually,
also, but gang scheduling starts to fall out naturally from these
structures and methods...

>
> So if I understand correctly, what we need is, in the following order:
> 1) Disable fq_codel for now, and get BQL-like thing working in ath9k
> (and ensure we're getting airtime fairness even without fq_codel);
> 2) Re-enable fq_codel and increase fq_codel's target up to 20ms for now;
> 3) Tweak fq_codel's "turn off" size to be larger (how important is this?)
>
> Is that right?

Sounds good. I have not reviewed the codel5 based implementation, it
may not even have idea "#3" in it at the moment at all.  The relevant
line in  codel.h is in codel_should_drop

" if (codel_time_before(vars->ldelay, params->target) ||
            sch->qstats.backlog <= stats->maxpacket) {"

where instead of maxpacket you might use "some currently good looking
number for the a good aggregate size for this station". I sadly note
that in wifi you also have to worry about packets (42 max on n) AND
bytes....


[1] I've published a lot of stuff showing how damaging 802.11e's edca
scheduling can be - I lean towards, at most, 2-3 aggregates being in
the hardware, essentially disabling the VO queue on 802.11n (not sure
on ac), in favor of VI, promoting or demoting an assembled aggregate
from BE to BK or VI as needed at the last second before submitting it
to the hardware, trying harder to only have one aggregate outstanding
to one station at a time, etc.
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Dave Taht March 7, 2016, 11:06 p.m. UTC | #14

Dear Michal:

Going through this patchset... (while watching it compile)


+       if (!local->hw.txq_cparams.target)
+               local->hw.txq_cparams.target = MS2TIME(5);

MS2TIME(20) for now and/or add something saner to this than !*backlog

target will not be a constant in the long run.

+       if (now - custom_codel_get_enqueue_time(skb) < p->target ||
+           !*backlog) {
+               /* went below - stay below for at least interval */
+               vars->first_above_time = 0;
+               return false;
+       }


*backlog < some_sane_value_for_an_aggregate_for_this_station

Unlike regular codel *backlog should be a ptr to the queuesize for
this station, not the total queue.

regular codel, by using the shared backlog for all queues, is trying
to get to a 1 packet depth for all queues, (see commit:
865ec5523dadbedefbc5710a68969f686a28d928 ), and store the flow in the
network, not the queue...

BUT in wifi's case you want to provide good service to all stations,
which is filling up an aggregate
for each... (and varying the "sane_value_for_the_aggregate" to suit
per sta service time requirements in a given round of all stations).

...

+       fq->flows_cnt = 4096;

regular fq_codel uses 1024 and there has not been much reason to
change it. In the case of an AP which has more limited memory, 256 or
1024 would be a good setting, per station. I'd stick to 1024 for now.

With large values for flows_cnt, fq, dominates, for small values, aqm
does. We did quite a lot of testing at 16 and 32 queues in the early
days, with pretty good results, except when we didn't. Cake went whole
hog with an 8 way set associative hash leading to "near perfect" fq,
which, at the cost of more cpu overhead, could cut the number of
queues down by a lot, also. Eric did "perfect" fq with sch_fq...

(btw: another way to test how codel is working is to set flows_cnt to
1. I will probably end up doing that at some point)

+       fq->perturbation = prandom_u32();
+       fq->quantum = 300;

quantum 300 is a good compromise to maximize delivery of small packets
from different flows. Probably the right thing on a station.

It also has cpu overhead. Quantum=1514 is more of the right thing on an AP.

(but to wax philosophical, per packet fairness rather than byte
fairness probably spreads errors across more flows in a wifi aggregate
than byte fairness, thus 300 remains a decent compromise if you can
spare the cpu)

...

where would be a suitable place to make (count, ecn_marks, drops)
visible in this subsystem?

...

Is this "per station" or per station, per 802.11e queue?

Dave Täht
Let's go make home routers and wifi faster! With better software!
https://www.gofundme.com/savewifi


On Fri, Feb 26, 2016 at 5:09 AM, Michal Kazior <michal.kazior@tieto.com> wrote:
> Since 11n aggregation become important to get the
> best out of txops. However aggregation inherently
> requires buffering and queuing. Once variable
> medium conditions to different associated stations
> is considered it became apparent that bufferbloat
> can't be simply fought with qdiscs for wireless
> drivers. 11ac with MU-MIMO makes the problem
> worse because the bandwidth-delay product becomes
> even greater.
>
> This bases on codel5 and sch_fq_codel.c. It may
> not be the Right Thing yet but it should at least
> provide a framework for more improvements.
>
> I guess dropping rate could factor in per-station
> rate control info but I don't know how this should
> exactly be done. HW rate control drivers would
> need extra work to take advantage of this.
>
> This obviously works only with drivers that use
> wake_tx_queue op.
>
> Note: This uses IFF_NO_QUEUE to get rid of qdiscs
> for wireless drivers that use mac80211 and
> implement wake_tx_queue op.
>
> Moreover the current txq_limit and latency setting
> might need tweaking. Either from userspace or be
> dynamically scaled with regard to, e.g. number of
> associated stations.
>
> FWIW This already works nicely with ath10k's (not
> yey merged) pull-push congestion control for
> MU-MIMO as far as throughput is concerned.
>
> Evaluating latency improvements is a little tricky
> at this point if a driver is using more queue
> layering and/or its firmware controls tx
> scheduling - hence I don't have any solid data on
> this. I'm open for suggestions though.
>
> It might also be a good idea to do the following
> in the future:
>
>  - make generic tx scheduling which does some RR
>    over per-sta-tid queues and dequeues bursts of
>    packets to form a PPDU to fit into designated
>    txop timeframe and bytelimit
>
>    This could in theory be shared and used by
>    ath9k and (future) mt76.
>
>    Moreover tx scheduling could factor in rate
>    control info and keep per-station number of
>    queued packets at a sufficient low threshold to
>    avoid queue buildup for slow stations. Emmanuel
>    already did similar experiment for iwlwifi's
>    station mode and got promising results.
>
>  - make software queueing default internally in
>    mac80211. This could help other drivers to get
>    at least some benefit from mac80211 smarter
>    queueing.
>
> Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
> ---
>  include/net/mac80211.h     |  36 ++++-
>  net/mac80211/agg-tx.c      |   8 +-
>  net/mac80211/codel.h       | 260 +++++++++++++++++++++++++++++++
>  net/mac80211/codel_i.h     |  89 +++++++++++
>  net/mac80211/ieee80211_i.h |  27 +++-
>  net/mac80211/iface.c       |  25 ++-
>  net/mac80211/main.c        |   9 +-
>  net/mac80211/rx.c          |   2 +-
>  net/mac80211/sta_info.c    |  10 +-
>  net/mac80211/sta_info.h    |  27 ++++
>  net/mac80211/tx.c          | 370 ++++++++++++++++++++++++++++++++++++++++-----
>  net/mac80211/util.c        |  20 ++-
>  12 files changed, 805 insertions(+), 78 deletions(-)
>  create mode 100644 net/mac80211/codel.h
>  create mode 100644 net/mac80211/codel_i.h
>
> diff --git a/include/net/mac80211.h b/include/net/mac80211.h
> index 6617516a276f..4667d2bad356 100644
> --- a/include/net/mac80211.h
> +++ b/include/net/mac80211.h
> @@ -565,6 +565,18 @@ struct ieee80211_bss_conf {
>         struct ieee80211_p2p_noa_attr p2p_noa_attr;
>  };
>
> +typedef u64 codel_time_t;
> +
> +/*
> + * struct codel_params - contains codel parameters
> + * @interval:  initial drop rate
> + * @target:     maximum persistent sojourn time
> + */
> +struct codel_params {
> +       codel_time_t    interval;
> +       codel_time_t    target;
> +};
> +
>  /**
>   * enum mac80211_tx_info_flags - flags to describe transmission information/status
>   *
> @@ -886,8 +898,18 @@ struct ieee80211_tx_info {
>                                 /* only needed before rate control */
>                                 unsigned long jiffies;
>                         };
> -                       /* NB: vif can be NULL for injected frames */
> -                       struct ieee80211_vif *vif;
> +                       union {
> +                               /* NB: vif can be NULL for injected frames */
> +                               struct ieee80211_vif *vif;
> +
> +                               /* When packets are enqueued on txq it's easy
> +                                * to re-construct the vif pointer. There's no
> +                                * more space in tx_info so it can be used to
> +                                * store the necessary enqueue time for packet
> +                                * sojourn time computation.
> +                                */
> +                               codel_time_t enqueue_time;
> +                       };
>                         struct ieee80211_key_conf *hw_key;
>                         u32 flags;
>                         /* 4 bytes free */
> @@ -2102,8 +2124,8 @@ enum ieee80211_hw_flags {
>   * @cipher_schemes: a pointer to an array of cipher scheme definitions
>   *     supported by HW.
>   *
> - * @txq_ac_max_pending: maximum number of frames per AC pending in all txq
> - *     entries for a vif.
> + * @txq_cparams: codel parameters to control tx queueing dropping behavior
> + * @txq_limit: maximum number of frames queuesd
>   */
>  struct ieee80211_hw {
>         struct ieee80211_conf conf;
> @@ -2133,7 +2155,8 @@ struct ieee80211_hw {
>         u8 uapsd_max_sp_len;
>         u8 n_cipher_schemes;
>         const struct ieee80211_cipher_scheme *cipher_schemes;
> -       int txq_ac_max_pending;
> +       struct codel_params txq_cparams;
> +       u32 txq_limit;
>  };
>
>  static inline bool _ieee80211_hw_check(struct ieee80211_hw *hw,
> @@ -5602,6 +5625,9 @@ struct sk_buff *ieee80211_tx_dequeue(struct ieee80211_hw *hw,
>   * txq state can change half-way of this function and the caller may end up
>   * with "new" frame_cnt and "old" byte_cnt or vice-versa.
>   *
> + * Moreover returned values are best-case, i.e. assuming queueing algorithm
> + * will not drop frames due to excess latency.
> + *
>   * @txq: pointer obtained from station or virtual interface
>   * @frame_cnt: pointer to store frame count
>   * @byte_cnt: pointer to store byte count
> diff --git a/net/mac80211/agg-tx.c b/net/mac80211/agg-tx.c
> index 4932e9f243a2..b9d0cee2a786 100644
> --- a/net/mac80211/agg-tx.c
> +++ b/net/mac80211/agg-tx.c
> @@ -194,17 +194,21 @@ static void
>  ieee80211_agg_stop_txq(struct sta_info *sta, int tid)
>  {
>         struct ieee80211_txq *txq = sta->sta.txq[tid];
> +       struct ieee80211_sub_if_data *sdata;
> +       struct ieee80211_fq *fq;
>         struct txq_info *txqi;
>
>         if (!txq)
>                 return;
>
>         txqi = to_txq_info(txq);
> +       sdata = vif_to_sdata(txq->vif);
> +       fq = &sdata->local->fq;
>
>         /* Lock here to protect against further seqno updates on dequeue */
> -       spin_lock_bh(&txqi->queue.lock);
> +       spin_lock_bh(&fq->lock);
>         set_bit(IEEE80211_TXQ_STOP, &txqi->flags);
> -       spin_unlock_bh(&txqi->queue.lock);
> +       spin_unlock_bh(&fq->lock);
>  }
>
>  static void
> diff --git a/net/mac80211/codel.h b/net/mac80211/codel.h
> new file mode 100644
> index 000000000000..f6f1b9b73a9a
> --- /dev/null
> +++ b/net/mac80211/codel.h
> @@ -0,0 +1,260 @@
> +#ifndef __NET_MAC80211_CODEL_H
> +#define __NET_MAC80211_CODEL_H
> +
> +/*
> + * Codel - The Controlled-Delay Active Queue Management algorithm
> + *
> + *  Copyright (C) 2011-2012 Kathleen Nichols <nichols@pollere.com>
> + *  Copyright (C) 2011-2012 Van Jacobson <van@pollere.net>
> + *  Copyright (C) 2016 Michael D. Taht <dave.taht@bufferbloat.net>
> + *  Copyright (C) 2012 Eric Dumazet <edumazet@google.com>
> + *  Copyright (C) 2015 Jonathan Morton <chromatix99@gmail.com>
> + *
> + * Redistribution and use in source and binary forms, with or without
> + * modification, are permitted provided that the following conditions
> + * are met:
> + * 1. Redistributions of source code must retain the above copyright
> + *    notice, this list of conditions, and the following disclaimer,
> + *    without modification.
> + * 2. Redistributions in binary form must reproduce the above copyright
> + *    notice, this list of conditions and the following disclaimer in the
> + *    documentation and/or other materials provided with the distribution.
> + * 3. The names of the authors may not be used to endorse or promote products
> + *    derived from this software without specific prior written permission.
> + *
> + * Alternatively, provided that this notice is retained in full, this
> + * software may be distributed under the terms of the GNU General
> + * Public License ("GPL") version 2, in which case the provisions of the
> + * GPL apply INSTEAD OF those given above.
> + *
> + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
> + * DAMAGE.
> + *
> + */
> +
> +#include <linux/version.h>
> +#include <linux/types.h>
> +#include <linux/ktime.h>
> +#include <linux/skbuff.h>
> +#include <net/pkt_sched.h>
> +#include <net/inet_ecn.h>
> +#include <linux/reciprocal_div.h>
> +
> +#include "codel_i.h"
> +
> +/* Controlling Queue Delay (CoDel) algorithm
> + * =========================================
> + * Source : Kathleen Nichols and Van Jacobson
> + * http://queue.acm.org/detail.cfm?id=2209336
> + *
> + * Implemented on linux by Dave Taht and Eric Dumazet
> + */
> +
> +/* CoDel5 uses a real clock, unlike codel */
> +
> +static inline codel_time_t codel_get_time(void)
> +{
> +       return ktime_get_ns();
> +}
> +
> +static inline u32 codel_time_to_us(codel_time_t val)
> +{
> +       do_div(val, NSEC_PER_USEC);
> +       return (u32)val;
> +}
> +
> +/* sizeof_in_bits(rec_inv_sqrt) */
> +#define REC_INV_SQRT_BITS (8 * sizeof(u16))
> +/* needed shift to get a Q0.32 number from rec_inv_sqrt */
> +#define REC_INV_SQRT_SHIFT (32 - REC_INV_SQRT_BITS)
> +
> +/* Newton approximation method needs more iterations at small inputs,
> + * so cache them.
> + */
> +
> +static void codel_vars_init(struct codel_vars *vars)
> +{
> +       memset(vars, 0, sizeof(*vars));
> +}
> +
> +/*
> + * http://en.wikipedia.org/wiki/Methods_of_computing_square_roots#Iterative_methods_for_reciprocal_square_roots
> + * new_invsqrt = (invsqrt / 2) * (3 - count * invsqrt^2)
> + *
> + * Here, invsqrt is a fixed point number (< 1.0), 32bit mantissa, aka Q0.32
> + */
> +static inline void codel_Newton_step(struct codel_vars *vars)
> +{
> +       u32 invsqrt = ((u32)vars->rec_inv_sqrt) << REC_INV_SQRT_SHIFT;
> +       u32 invsqrt2 = ((u64)invsqrt * invsqrt) >> 32;
> +       u64 val = (3LL << 32) - ((u64)vars->count * invsqrt2);
> +
> +       val >>= 2; /* avoid overflow in following multiply */
> +       val = (val * invsqrt) >> (32 - 2 + 1);
> +
> +       vars->rec_inv_sqrt = val >> REC_INV_SQRT_SHIFT;
> +}
> +
> +/*
> + * CoDel control_law is t + interval/sqrt(count)
> + * We maintain in rec_inv_sqrt the reciprocal value of sqrt(count) to avoid
> + * both sqrt() and divide operation.
> + */
> +static codel_time_t codel_control_law(codel_time_t t,
> +                                     codel_time_t interval,
> +                                     u32 rec_inv_sqrt)
> +{
> +       return t + reciprocal_scale(interval, rec_inv_sqrt <<
> +                                   REC_INV_SQRT_SHIFT);
> +}
> +
> +/* Forward declaration of this for use elsewhere */
> +
> +static inline codel_time_t
> +custom_codel_get_enqueue_time(struct sk_buff *skb);
> +
> +static inline struct sk_buff *
> +custom_dequeue(struct codel_vars *vars, void *ptr);
> +
> +static inline void
> +custom_drop(struct sk_buff *skb, void *ptr);
> +
> +static bool codel_should_drop(struct sk_buff *skb,
> +                             __u32 *backlog,
> +                             struct codel_vars *vars,
> +                             const struct codel_params *p,
> +                             codel_time_t now)
> +{
> +       if (!skb) {
> +               vars->first_above_time = 0;
> +               return false;
> +       }
> +
> +       if (now - custom_codel_get_enqueue_time(skb) < p->target ||
> +           !*backlog) {
> +               /* went below - stay below for at least interval */
> +               vars->first_above_time = 0;
> +               return false;
> +       }
> +
> +       if (vars->first_above_time == 0) {
> +               /* just went above from below; mark the time */
> +               vars->first_above_time = now + p->interval;
> +
> +       } else if (now > vars->first_above_time) {
> +               return true;
> +       }
> +
> +       return false;
> +}
> +
> +static struct sk_buff *codel_dequeue(void *ptr,
> +                                    __u32 *backlog,
> +                                    struct codel_vars *vars,
> +                                    struct codel_params *p,
> +                                    codel_time_t now,
> +                                    bool overloaded)
> +{
> +       struct sk_buff *skb = custom_dequeue(vars, ptr);
> +       bool drop;
> +
> +       if (!skb) {
> +               vars->dropping = false;
> +               return skb;
> +       }
> +       drop = codel_should_drop(skb, backlog, vars, p, now);
> +       if (vars->dropping) {
> +               if (!drop) {
> +                       /* sojourn time below target - leave dropping state */
> +                       vars->dropping = false;
> +               } else if (now >= vars->drop_next) {
> +                       /* It's time for the next drop. Drop the current
> +                        * packet and dequeue the next. The dequeue might
> +                        * take us out of dropping state.
> +                        * If not, schedule the next drop.
> +                        * A large backlog might result in drop rates so high
> +                        * that the next drop should happen now,
> +                        * hence the while loop.
> +                        */
> +
> +                       /* saturating increment */
> +                       vars->count++;
> +                       if (!vars->count)
> +                               vars->count--;
> +
> +                       codel_Newton_step(vars);
> +                       vars->drop_next = codel_control_law(vars->drop_next,
> +                                                           p->interval,
> +                                                           vars->rec_inv_sqrt);
> +                       do {
> +                               if (INET_ECN_set_ce(skb) && !overloaded) {
> +                                       vars->ecn_mark++;
> +                                       /* and schedule the next drop */
> +                                       vars->drop_next = codel_control_law(
> +                                               vars->drop_next, p->interval,
> +                                               vars->rec_inv_sqrt);
> +                                       goto end;
> +                               }
> +                               custom_drop(skb, ptr);
> +                               vars->drop_count++;
> +                               skb = custom_dequeue(vars, ptr);
> +                               if (skb && !codel_should_drop(skb, backlog, vars,
> +                                                             p, now)) {
> +                                       /* leave dropping state */
> +                                       vars->dropping = false;
> +                               } else {
> +                                       /* schedule the next drop */
> +                                       vars->drop_next = codel_control_law(
> +                                               vars->drop_next, p->interval,
> +                                               vars->rec_inv_sqrt);
> +                               }
> +                       } while (skb && vars->dropping && now >=
> +                                vars->drop_next);
> +
> +                       /* Mark the packet regardless */
> +                       if (skb && INET_ECN_set_ce(skb))
> +                               vars->ecn_mark++;
> +               }
> +       } else if (drop) {
> +               if (INET_ECN_set_ce(skb) && !overloaded) {
> +                       vars->ecn_mark++;
> +               } else {
> +                       custom_drop(skb, ptr);
> +                       vars->drop_count++;
> +
> +                       skb = custom_dequeue(vars, ptr);
> +                       drop = codel_should_drop(skb, backlog, vars, p, now);
> +                       if (skb && INET_ECN_set_ce(skb))
> +                               vars->ecn_mark++;
> +               }
> +               vars->dropping = true;
> +               /* if min went above target close to when we last went below
> +                * assume that the drop rate that controlled the queue on the
> +                * last cycle is a good starting point to control it now.
> +                */
> +               if (vars->count > 2 &&
> +                   now - vars->drop_next < 8 * p->interval) {
> +                       vars->count -= 2;
> +                       codel_Newton_step(vars);
> +               } else {
> +                       vars->count = 1;
> +                       vars->rec_inv_sqrt = ~0U >> REC_INV_SQRT_SHIFT;
> +               }
> +               codel_Newton_step(vars);
> +               vars->drop_next = codel_control_law(now, p->interval,
> +                                                   vars->rec_inv_sqrt);
> +       }
> +end:
> +       return skb;
> +}
> +#endif
> diff --git a/net/mac80211/codel_i.h b/net/mac80211/codel_i.h
> new file mode 100644
> index 000000000000..83da7aa5fd9a
> --- /dev/null
> +++ b/net/mac80211/codel_i.h
> @@ -0,0 +1,89 @@
> +#ifndef __NET_MAC80211_CODEL_I_H
> +#define __NET_MAC80211_CODEL_I_H
> +
> +/*
> + * Codel - The Controlled-Delay Active Queue Management algorithm
> + *
> + *  Copyright (C) 2011-2012 Kathleen Nichols <nichols@pollere.com>
> + *  Copyright (C) 2011-2012 Van Jacobson <van@pollere.net>
> + *  Copyright (C) 2016 Michael D. Taht <dave.taht@bufferbloat.net>
> + *  Copyright (C) 2012 Eric Dumazet <edumazet@google.com>
> + *  Copyright (C) 2015 Jonathan Morton <chromatix99@gmail.com>
> + *  Copyright (C) 2016 Michal Kazior <michal.kazior@tieto.com>
> + *
> + * Redistribution and use in source and binary forms, with or without
> + * modification, are permitted provided that the following conditions
> + * are met:
> + * 1. Redistributions of source code must retain the above copyright
> + *    notice, this list of conditions, and the following disclaimer,
> + *    without modification.
> + * 2. Redistributions in binary form must reproduce the above copyright
> + *    notice, this list of conditions and the following disclaimer in the
> + *    documentation and/or other materials provided with the distribution.
> + * 3. The names of the authors may not be used to endorse or promote products
> + *    derived from this software without specific prior written permission.
> + *
> + * Alternatively, provided that this notice is retained in full, this
> + * software may be distributed under the terms of the GNU General
> + * Public License ("GPL") version 2, in which case the provisions of the
> + * GPL apply INSTEAD OF those given above.
> + *
> + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
> + * DAMAGE.
> + *
> + */
> +
> +#include <linux/version.h>
> +#include <linux/types.h>
> +#include <linux/ktime.h>
> +#include <linux/skbuff.h>
> +#include <net/pkt_sched.h>
> +#include <net/inet_ecn.h>
> +#include <linux/reciprocal_div.h>
> +
> +/* Controlling Queue Delay (CoDel) algorithm
> + * =========================================
> + * Source : Kathleen Nichols and Van Jacobson
> + * http://queue.acm.org/detail.cfm?id=2209336
> + *
> + * Implemented on linux by Dave Taht and Eric Dumazet
> + */
> +
> +/* CoDel5 uses a real clock, unlike codel */
> +
> +#define MS2TIME(a) (a * (u64) NSEC_PER_MSEC)
> +#define US2TIME(a) (a * (u64) NSEC_PER_USEC)
> +
> +/**
> + * struct codel_vars - contains codel variables
> + * @count:             how many drops we've done since the last time we
> + *                     entered dropping state
> + * @dropping:          set to > 0 if in dropping state
> + * @rec_inv_sqrt:      reciprocal value of sqrt(count) >> 1
> + * @first_above_time:  when we went (or will go) continuously above target
> + *                     for interval
> + * @drop_next:         time to drop next packet, or when we dropped last
> + * @drop_count:        temp count of dropped packets in dequeue()
> + * @ecn_mark:  number of packets we ECN marked instead of dropping
> + */
> +
> +struct codel_vars {
> +       u32             count;
> +       u16             dropping;
> +       u16             rec_inv_sqrt;
> +       codel_time_t    first_above_time;
> +       codel_time_t    drop_next;
> +       u16             drop_count;
> +       u16             ecn_mark;
> +};
> +#endif
> diff --git a/net/mac80211/ieee80211_i.h b/net/mac80211/ieee80211_i.h
> index a96f8c0461f6..c099b81d5a27 100644
> --- a/net/mac80211/ieee80211_i.h
> +++ b/net/mac80211/ieee80211_i.h
> @@ -802,9 +802,12 @@ enum txq_info_flags {
>  };
>
>  struct txq_info {
> -       struct sk_buff_head queue;
> +       struct txq_flow flow;
> +       struct list_head new_flows;
> +       struct list_head old_flows;
> +       u32 backlog_bytes;
> +       u32 backlog_packets;
>         unsigned long flags;
> -       unsigned long byte_cnt;
>
>         /* keep last! */
>         struct ieee80211_txq txq;
> @@ -852,7 +855,6 @@ struct ieee80211_sub_if_data {
>         bool control_port_no_encrypt;
>         int encrypt_headroom;
>
> -       atomic_t txqs_len[IEEE80211_NUM_ACS];
>         struct ieee80211_tx_queue_params tx_conf[IEEE80211_NUM_ACS];
>         struct mac80211_qos_map __rcu *qos_map;
>
> @@ -1089,11 +1091,25 @@ enum mac80211_scan_state {
>         SCAN_ABORT,
>  };
>
> +struct ieee80211_fq {
> +       struct txq_flow *flows;
> +       struct list_head backlogs;
> +       spinlock_t lock;
> +       u32 flows_cnt;
> +       u32 perturbation;
> +       u32 quantum;
> +       u32 backlog;
> +
> +       u32 drop_overlimit;
> +       u32 drop_codel;
> +};
> +
>  struct ieee80211_local {
>         /* embed the driver visible part.
>          * don't cast (use the static inlines below), but we keep
>          * it first anyway so they become a no-op */
>         struct ieee80211_hw hw;
> +       struct ieee80211_fq fq;
>
>         const struct ieee80211_ops *ops;
>
> @@ -1935,6 +1951,11 @@ static inline bool ieee80211_can_run_worker(struct ieee80211_local *local)
>  void ieee80211_init_tx_queue(struct ieee80211_sub_if_data *sdata,
>                              struct sta_info *sta,
>                              struct txq_info *txq, int tid);
> +void ieee80211_purge_txq(struct ieee80211_local *local, struct txq_info *txqi);
> +void ieee80211_init_flow(struct txq_flow *flow);
> +int ieee80211_setup_flows(struct ieee80211_local *local);
> +void ieee80211_teardown_flows(struct ieee80211_local *local);
> +
>  void ieee80211_send_auth(struct ieee80211_sub_if_data *sdata,
>                          u16 transaction, u16 auth_alg, u16 status,
>                          const u8 *extra, size_t extra_len, const u8 *bssid,
> diff --git a/net/mac80211/iface.c b/net/mac80211/iface.c
> index 453b4e741780..d1063b50f12c 100644
> --- a/net/mac80211/iface.c
> +++ b/net/mac80211/iface.c
> @@ -779,6 +779,7 @@ static void ieee80211_do_stop(struct ieee80211_sub_if_data *sdata,
>                               bool going_down)
>  {
>         struct ieee80211_local *local = sdata->local;
> +       struct ieee80211_fq *fq = &local->fq;
>         unsigned long flags;
>         struct sk_buff *skb, *tmp;
>         u32 hw_reconf_flags = 0;
> @@ -977,12 +978,9 @@ static void ieee80211_do_stop(struct ieee80211_sub_if_data *sdata,
>         if (sdata->vif.txq) {
>                 struct txq_info *txqi = to_txq_info(sdata->vif.txq);
>
> -               spin_lock_bh(&txqi->queue.lock);
> -               ieee80211_purge_tx_queue(&local->hw, &txqi->queue);
> -               txqi->byte_cnt = 0;
> -               spin_unlock_bh(&txqi->queue.lock);
> -
> -               atomic_set(&sdata->txqs_len[txqi->txq.ac], 0);
> +               spin_lock_bh(&fq->lock);
> +               ieee80211_purge_txq(local, txqi);
> +               spin_unlock_bh(&fq->lock);
>         }
>
>         if (local->open_count == 0)
> @@ -1198,6 +1196,13 @@ static void ieee80211_if_setup(struct net_device *dev)
>         dev->destructor = ieee80211_if_free;
>  }
>
> +static void ieee80211_if_setup_no_queue(struct net_device *dev)
> +{
> +       ieee80211_if_setup(dev);
> +       dev->priv_flags |= IFF_NO_QUEUE;
> +       /* Note for backporters: use dev->tx_queue_len = 0 instead of IFF_ */
> +}
> +
>  static void ieee80211_iface_work(struct work_struct *work)
>  {
>         struct ieee80211_sub_if_data *sdata =
> @@ -1707,6 +1712,7 @@ int ieee80211_if_add(struct ieee80211_local *local, const char *name,
>         struct net_device *ndev = NULL;
>         struct ieee80211_sub_if_data *sdata = NULL;
>         struct txq_info *txqi;
> +       void (*if_setup)(struct net_device *dev);
>         int ret, i;
>         int txqs = 1;
>
> @@ -1734,12 +1740,17 @@ int ieee80211_if_add(struct ieee80211_local *local, const char *name,
>                         txq_size += sizeof(struct txq_info) +
>                                     local->hw.txq_data_size;
>
> +               if (local->ops->wake_tx_queue)
> +                       if_setup = ieee80211_if_setup_no_queue;
> +               else
> +                       if_setup = ieee80211_if_setup;
> +
>                 if (local->hw.queues >= IEEE80211_NUM_ACS)
>                         txqs = IEEE80211_NUM_ACS;
>
>                 ndev = alloc_netdev_mqs(size + txq_size,
>                                         name, name_assign_type,
> -                                       ieee80211_if_setup, txqs, 1);
> +                                       if_setup, txqs, 1);
>                 if (!ndev)
>                         return -ENOMEM;
>                 dev_net_set(ndev, wiphy_net(local->hw.wiphy));
> diff --git a/net/mac80211/main.c b/net/mac80211/main.c
> index 8190bf27ebff..9fd3b10ae52b 100644
> --- a/net/mac80211/main.c
> +++ b/net/mac80211/main.c
> @@ -1053,9 +1053,6 @@ int ieee80211_register_hw(struct ieee80211_hw *hw)
>
>         local->dynamic_ps_forced_timeout = -1;
>
> -       if (!local->hw.txq_ac_max_pending)
> -               local->hw.txq_ac_max_pending = 64;
> -
>         result = ieee80211_wep_init(local);
>         if (result < 0)
>                 wiphy_debug(local->hw.wiphy, "Failed to initialize wep: %d\n",
> @@ -1087,6 +1084,10 @@ int ieee80211_register_hw(struct ieee80211_hw *hw)
>
>         rtnl_unlock();
>
> +       result = ieee80211_setup_flows(local);
> +       if (result)
> +               goto fail_flows;
> +
>  #ifdef CONFIG_INET
>         local->ifa_notifier.notifier_call = ieee80211_ifa_changed;
>         result = register_inetaddr_notifier(&local->ifa_notifier);
> @@ -1112,6 +1113,8 @@ int ieee80211_register_hw(struct ieee80211_hw *hw)
>  #if defined(CONFIG_INET) || defined(CONFIG_IPV6)
>   fail_ifa:
>  #endif
> +       ieee80211_teardown_flows(local);
> + fail_flows:
>         rtnl_lock();
>         rate_control_deinitialize(local);
>         ieee80211_remove_interfaces(local);
> diff --git a/net/mac80211/rx.c b/net/mac80211/rx.c
> index 664e8861edbe..66c36dc389ec 100644
> --- a/net/mac80211/rx.c
> +++ b/net/mac80211/rx.c
> @@ -1248,7 +1248,7 @@ static void sta_ps_start(struct sta_info *sta)
>         for (tid = 0; tid < ARRAY_SIZE(sta->sta.txq); tid++) {
>                 struct txq_info *txqi = to_txq_info(sta->sta.txq[tid]);
>
> -               if (!skb_queue_len(&txqi->queue))
> +               if (!txqi->backlog_packets)
>                         set_bit(tid, &sta->txq_buffered_tids);
>                 else
>                         clear_bit(tid, &sta->txq_buffered_tids);
> diff --git a/net/mac80211/sta_info.c b/net/mac80211/sta_info.c
> index 7bbcf5919fe4..456c9fb113fb 100644
> --- a/net/mac80211/sta_info.c
> +++ b/net/mac80211/sta_info.c
> @@ -112,11 +112,7 @@ static void __cleanup_single_sta(struct sta_info *sta)
>         if (sta->sta.txq[0]) {
>                 for (i = 0; i < ARRAY_SIZE(sta->sta.txq); i++) {
>                         struct txq_info *txqi = to_txq_info(sta->sta.txq[i]);
> -                       int n = skb_queue_len(&txqi->queue);
> -
> -                       ieee80211_purge_tx_queue(&local->hw, &txqi->queue);
> -                       atomic_sub(n, &sdata->txqs_len[txqi->txq.ac]);
> -                       txqi->byte_cnt = 0;
> +                       ieee80211_purge_txq(local, txqi);
>                 }
>         }
>
> @@ -1185,7 +1181,7 @@ void ieee80211_sta_ps_deliver_wakeup(struct sta_info *sta)
>                 for (i = 0; i < ARRAY_SIZE(sta->sta.txq); i++) {
>                         struct txq_info *txqi = to_txq_info(sta->sta.txq[i]);
>
> -                       if (!skb_queue_len(&txqi->queue))
> +                       if (!txqi->backlog_packets)
>                                 continue;
>
>                         drv_wake_tx_queue(local, txqi);
> @@ -1622,7 +1618,7 @@ ieee80211_sta_ps_deliver_response(struct sta_info *sta,
>                 for (tid = 0; tid < ARRAY_SIZE(sta->sta.txq); tid++) {
>                         struct txq_info *txqi = to_txq_info(sta->sta.txq[tid]);
>
> -                       if (!(tids & BIT(tid)) || skb_queue_len(&txqi->queue))
> +                       if (!(tids & BIT(tid)) || txqi->backlog_packets)
>                                 continue;
>
>                         sta_info_recalc_tim(sta);
> diff --git a/net/mac80211/sta_info.h b/net/mac80211/sta_info.h
> index f4d38994ecee..65431ea5a78d 100644
> --- a/net/mac80211/sta_info.h
> +++ b/net/mac80211/sta_info.h
> @@ -19,6 +19,7 @@
>  #include <linux/etherdevice.h>
>  #include <linux/rhashtable.h>
>  #include "key.h"
> +#include "codel_i.h"
>
>  /**
>   * enum ieee80211_sta_info_flags - Stations flags
> @@ -327,6 +328,32 @@ struct mesh_sta {
>
>  DECLARE_EWMA(signal, 1024, 8)
>
> +struct txq_info;
> +
> +/**
> + * struct txq_flow - per traffic flow queue
> + *
> + * This structure is used to distinguish and queue different traffic flows
> + * separately for fair queueing/AQM purposes.
> + *
> + * @txqi: txq_info structure it is associated at given time
> + * @flowchain: can be linked to other flows for RR purposes
> + * @backlogchain: can be linked to other flows for backlog sorting purposes
> + * @queue: sk_buff queue
> + * @cvars: codel state vars
> + * @backlog: number of bytes pending in the queue
> + * @deficit: used for fair queueing balancing
> + */
> +struct txq_flow {
> +       struct txq_info *txqi;
> +       struct list_head flowchain;
> +       struct list_head backlogchain;
> +       struct sk_buff_head queue;
> +       struct codel_vars cvars;
> +       u32 backlog;
> +       u32 deficit;
> +};
> +
>  /**
>   * struct sta_info - STA information
>   *
> diff --git a/net/mac80211/tx.c b/net/mac80211/tx.c
> index af584f7cdd63..f42f898cb8b5 100644
> --- a/net/mac80211/tx.c
> +++ b/net/mac80211/tx.c
> @@ -34,6 +34,7 @@
>  #include "wpa.h"
>  #include "wme.h"
>  #include "rate.h"
> +#include "codel.h"
>
>  /* misc utils */
>
> @@ -1228,26 +1229,312 @@ ieee80211_tx_prepare(struct ieee80211_sub_if_data *sdata,
>         return TX_CONTINUE;
>  }
>
> -static void ieee80211_drv_tx(struct ieee80211_local *local,
> -                            struct ieee80211_vif *vif,
> -                            struct ieee80211_sta *pubsta,
> -                            struct sk_buff *skb)
> +static inline codel_time_t
> +custom_codel_get_enqueue_time(struct sk_buff *skb)
> +{
> +       return IEEE80211_SKB_CB(skb)->control.enqueue_time;
> +}
> +
> +static inline struct sk_buff *
> +flow_dequeue(struct ieee80211_local *local, struct txq_flow *flow)
> +{
> +       struct ieee80211_fq *fq = &local->fq;
> +       struct txq_info *txqi = flow->txqi;
> +       struct txq_flow *i;
> +       struct sk_buff *skb;
> +
> +       skb = __skb_dequeue(&flow->queue);
> +       if (!skb)
> +               return NULL;
> +
> +       txqi->backlog_bytes -= skb->len;
> +       txqi->backlog_packets--;
> +       flow->backlog -= skb->len;
> +       fq->backlog--;
> +
> +       if (flow->backlog == 0) {
> +               list_del_init(&flow->backlogchain);
> +       } else {
> +               i = flow;
> +
> +               list_for_each_entry_continue(i, &fq->backlogs, backlogchain) {
> +                       if (i->backlog < flow->backlog)
> +                               break;
> +               }
> +
> +               list_move_tail(&flow->backlogchain, &i->backlogchain);
> +       }
> +
> +       return skb;
> +}
> +
> +static inline struct sk_buff *
> +custom_dequeue(struct codel_vars *vars, void *ptr)
> +{
> +       struct txq_flow *flow = ptr;
> +       struct txq_info *txqi = flow->txqi;
> +       struct ieee80211_vif *vif = txqi->txq.vif;
> +       struct ieee80211_sub_if_data *sdata = vif_to_sdata(vif);
> +       struct ieee80211_local *local = sdata->local;
> +
> +       return flow_dequeue(local, flow);
> +}
> +
> +static inline void
> +custom_drop(struct sk_buff *skb, void *ptr)
> +{
> +       struct txq_flow *flow = ptr;
> +       struct txq_info *txqi = flow->txqi;
> +       struct ieee80211_vif *vif = txqi->txq.vif;
> +       struct ieee80211_sub_if_data *sdata = vif_to_sdata(vif);
> +       struct ieee80211_local *local = sdata->local;
> +       struct ieee80211_hw *hw = &local->hw;
> +
> +       ieee80211_free_txskb(hw, skb);
> +       local->fq.drop_codel++;
> +}
> +
> +static u32 fq_hash(struct ieee80211_fq *fq, struct sk_buff *skb)
> +{
> +       u32 hash = skb_get_hash_perturb(skb, fq->perturbation);
> +       return reciprocal_scale(hash, fq->flows_cnt);
> +}
> +
> +static void fq_drop(struct ieee80211_local *local)
> +{
> +       struct ieee80211_hw *hw = &local->hw;
> +       struct ieee80211_fq *fq = &local->fq;
> +       struct txq_flow *flow;
> +       struct sk_buff *skb;
> +
> +       flow = list_first_entry_or_null(&fq->backlogs, struct txq_flow,
> +                                       backlogchain);
> +       if (WARN_ON_ONCE(!flow))
> +               return;
> +
> +       skb = flow_dequeue(local, flow);
> +       if (WARN_ON_ONCE(!skb))
> +               return;
> +
> +       ieee80211_free_txskb(hw, skb);
> +       fq->drop_overlimit++;
> +}
> +
> +void ieee80211_init_flow(struct txq_flow *flow)
> +{
> +       INIT_LIST_HEAD(&flow->flowchain);
> +       INIT_LIST_HEAD(&flow->backlogchain);
> +       __skb_queue_head_init(&flow->queue);
> +       codel_vars_init(&flow->cvars);
> +}
> +
> +int ieee80211_setup_flows(struct ieee80211_local *local)
> +{
> +       struct ieee80211_fq *fq = &local->fq;
> +       int i;
> +
> +       if (!local->ops->wake_tx_queue)
> +               return 0;
> +
> +       if (!local->hw.txq_limit)
> +               local->hw.txq_limit = 8192;
> +
> +       if (!local->hw.txq_cparams.target)
> +               local->hw.txq_cparams.target = MS2TIME(5);
> +
> +       if (!local->hw.txq_cparams.interval)
> +               local->hw.txq_cparams.interval = MS2TIME(100);
> +
> +       memset(fq, 0, sizeof(fq[0]));
> +       INIT_LIST_HEAD(&fq->backlogs);
> +       spin_lock_init(&fq->lock);
> +       fq->flows_cnt = 4096;
> +       fq->perturbation = prandom_u32();
> +       fq->quantum = 300;
> +
> +       fq->flows = kzalloc(fq->flows_cnt * sizeof(fq->flows[0]), GFP_KERNEL);
> +       if (!fq->flows)
> +               return -ENOMEM;
> +
> +       for (i = 0; i < fq->flows_cnt; i++)
> +               ieee80211_init_flow(&fq->flows[i]);
> +
> +       return 0;
> +}
> +
> +static void ieee80211_reset_flow(struct ieee80211_local *local,
> +                                struct txq_flow *flow)
> +{
> +       if (!list_empty(&flow->flowchain))
> +               list_del_init(&flow->flowchain);
> +
> +       if (!list_empty(&flow->backlogchain))
> +               list_del_init(&flow->backlogchain);
> +
> +       ieee80211_purge_tx_queue(&local->hw, &flow->queue);
> +
> +       flow->deficit = 0;
> +       flow->txqi = NULL;
> +}
> +
> +void ieee80211_purge_txq(struct ieee80211_local *local, struct txq_info *txqi)
> +{
> +       struct txq_flow *flow;
> +       int i;
> +
> +       for (i = 0; i < local->fq.flows_cnt; i++) {
> +               flow = &local->fq.flows[i];
> +
> +               if (flow->txqi != txqi)
> +                       continue;
> +
> +               ieee80211_reset_flow(local, flow);
> +       }
> +
> +       ieee80211_reset_flow(local, &txqi->flow);
> +
> +       txqi->backlog_bytes = 0;
> +       txqi->backlog_packets = 0;
> +}
> +
> +void ieee80211_teardown_flows(struct ieee80211_local *local)
> +{
> +       struct ieee80211_fq *fq = &local->fq;
> +       struct ieee80211_sub_if_data *sdata;
> +       struct sta_info *sta;
> +       int i;
> +
> +       if (!local->ops->wake_tx_queue)
> +               return;
> +
> +       list_for_each_entry_rcu(sta, &local->sta_list, list)
> +               for (i = 0; i < IEEE80211_NUM_TIDS; i++)
> +                       ieee80211_purge_txq(local,
> +                                           to_txq_info(sta->sta.txq[i]));
> +
> +       list_for_each_entry_rcu(sdata, &local->interfaces, list)
> +               ieee80211_purge_txq(local, to_txq_info(sdata->vif.txq));
> +
> +       for (i = 0; i < fq->flows_cnt; i++)
> +               ieee80211_reset_flow(local, &fq->flows[i]);
> +
> +       kfree(fq->flows);
> +
> +       fq->flows = NULL;
> +       fq->flows_cnt = 0;
> +}
> +
> +static void ieee80211_txq_enqueue(struct ieee80211_local *local,
> +                                 struct txq_info *txqi,
> +                                 struct sk_buff *skb)
> +{
> +       struct ieee80211_fq *fq = &local->fq;
> +       struct ieee80211_hw *hw = &local->hw;
> +       struct txq_flow *flow;
> +       struct txq_flow *i;
> +       size_t idx = fq_hash(fq, skb);
> +
> +       flow = &fq->flows[idx];
> +
> +       if (flow->txqi)
> +               flow = &txqi->flow;
> +
> +       /* The following overwrites `vif` pointer effectively. It is later
> +        * restored using txq structure.
> +        */
> +       IEEE80211_SKB_CB(skb)->control.enqueue_time = codel_get_time();
> +
> +       flow->txqi = txqi;
> +       flow->backlog += skb->len;
> +       txqi->backlog_bytes += skb->len;
> +       txqi->backlog_packets++;
> +       fq->backlog++;
> +
> +       if (list_empty(&flow->backlogchain))
> +               i = list_last_entry(&fq->backlogs, struct txq_flow, backlogchain);
> +       else
> +               i = flow;
> +
> +       list_for_each_entry_continue_reverse(i, &fq->backlogs, backlogchain)
> +               if (i->backlog > flow->backlog)
> +                       break;
> +
> +       list_move(&flow->backlogchain, &i->backlogchain);
> +
> +       if (list_empty(&flow->flowchain)) {
> +               flow->deficit = fq->quantum;
> +               list_add_tail(&flow->flowchain, &txqi->new_flows);
> +       }
> +
> +       __skb_queue_tail(&flow->queue, skb);
> +
> +       if (fq->backlog > hw->txq_limit)
> +               fq_drop(local);
> +}
> +
> +static struct sk_buff *ieee80211_txq_dequeue(struct ieee80211_local *local,
> +                                            struct txq_info *txqi)
> +{
> +       struct ieee80211_fq *fq = &local->fq;
> +       struct ieee80211_hw *hw = &local->hw;
> +       struct txq_flow *flow;
> +       struct list_head *head;
> +       struct sk_buff *skb;
> +
> +begin:
> +       head = &txqi->new_flows;
> +       if (list_empty(head)) {
> +               head = &txqi->old_flows;
> +               if (list_empty(head))
> +                       return NULL;
> +       }
> +
> +       flow = list_first_entry(head, struct txq_flow, flowchain);
> +
> +       if (flow->deficit <= 0) {
> +               flow->deficit += fq->quantum;
> +               list_move_tail(&flow->flowchain, &txqi->old_flows);
> +               goto begin;
> +       }
> +
> +       skb = codel_dequeue(flow, &flow->backlog, &flow->cvars,
> +                           &hw->txq_cparams, codel_get_time(), false);
> +       if (!skb) {
> +               if ((head == &txqi->new_flows) &&
> +                   !list_empty(&txqi->old_flows)) {
> +                       list_move_tail(&flow->flowchain, &txqi->old_flows);
> +               } else {
> +                       list_del_init(&flow->flowchain);
> +                       flow->txqi = NULL;
> +               }
> +               goto begin;
> +       }
> +
> +       flow->deficit -= skb->len;
> +
> +       /* The `vif` pointer was overwritten with enqueue time during
> +        * enqueuing. Restore it before handing to driver.
> +        */
> +       IEEE80211_SKB_CB(skb)->control.vif = flow->txqi->txq.vif;
> +
> +       return skb;
> +}
> +
> +static struct txq_info *
> +ieee80211_get_txq(struct ieee80211_local *local,
> +                 struct ieee80211_vif *vif,
> +                 struct ieee80211_sta *pubsta,
> +                 struct sk_buff *skb)
>  {
>         struct ieee80211_hdr *hdr = (struct ieee80211_hdr *) skb->data;
> -       struct ieee80211_sub_if_data *sdata = vif_to_sdata(vif);
>         struct ieee80211_tx_info *info = IEEE80211_SKB_CB(skb);
> -       struct ieee80211_tx_control control = {
> -               .sta = pubsta,
> -       };
>         struct ieee80211_txq *txq = NULL;
> -       struct txq_info *txqi;
> -       u8 ac;
>
>         if (info->control.flags & IEEE80211_TX_CTRL_PS_RESPONSE)
> -               goto tx_normal;
> +               return NULL;
>
>         if (!ieee80211_is_data(hdr->frame_control))
> -               goto tx_normal;
> +               return NULL;
>
>         if (pubsta) {
>                 u8 tid = skb->priority & IEEE80211_QOS_CTL_TID_MASK;
> @@ -1258,52 +1545,29 @@ static void ieee80211_drv_tx(struct ieee80211_local *local,
>         }
>
>         if (!txq)
> -               goto tx_normal;
> +               return NULL;
>
> -       ac = txq->ac;
> -       txqi = to_txq_info(txq);
> -       atomic_inc(&sdata->txqs_len[ac]);
> -       if (atomic_read(&sdata->txqs_len[ac]) >= local->hw.txq_ac_max_pending)
> -               netif_stop_subqueue(sdata->dev, ac);
> -
> -       spin_lock_bh(&txqi->queue.lock);
> -       txqi->byte_cnt += skb->len;
> -       __skb_queue_tail(&txqi->queue, skb);
> -       spin_unlock_bh(&txqi->queue.lock);
> -
> -       drv_wake_tx_queue(local, txqi);
> -
> -       return;
> -
> -tx_normal:
> -       drv_tx(local, &control, skb);
> +       return to_txq_info(txq);
>  }
>
>  struct sk_buff *ieee80211_tx_dequeue(struct ieee80211_hw *hw,
>                                      struct ieee80211_txq *txq)
>  {
>         struct ieee80211_local *local = hw_to_local(hw);
> -       struct ieee80211_sub_if_data *sdata = vif_to_sdata(txq->vif);
> +       struct ieee80211_fq *fq = &local->fq;
>         struct txq_info *txqi = container_of(txq, struct txq_info, txq);
>         struct ieee80211_hdr *hdr;
>         struct sk_buff *skb = NULL;
> -       u8 ac = txq->ac;
>
> -       spin_lock_bh(&txqi->queue.lock);
> +       spin_lock_bh(&fq->lock);
>
>         if (test_bit(IEEE80211_TXQ_STOP, &txqi->flags))
>                 goto out;
>
> -       skb = __skb_dequeue(&txqi->queue);
> +       skb = ieee80211_txq_dequeue(local, txqi);
>         if (!skb)
>                 goto out;
>
> -       txqi->byte_cnt -= skb->len;
> -
> -       atomic_dec(&sdata->txqs_len[ac]);
> -       if (__netif_subqueue_stopped(sdata->dev, ac))
> -               ieee80211_propagate_queue_wake(local, sdata->vif.hw_queue[ac]);
> -
>         hdr = (struct ieee80211_hdr *)skb->data;
>         if (txq->sta && ieee80211_is_data_qos(hdr->frame_control)) {
>                 struct sta_info *sta = container_of(txq->sta, struct sta_info,
> @@ -1318,7 +1582,7 @@ struct sk_buff *ieee80211_tx_dequeue(struct ieee80211_hw *hw,
>         }
>
>  out:
> -       spin_unlock_bh(&txqi->queue.lock);
> +       spin_unlock_bh(&fq->lock);
>
>         return skb;
>  }
> @@ -1330,7 +1594,10 @@ static bool ieee80211_tx_frags(struct ieee80211_local *local,
>                                struct sk_buff_head *skbs,
>                                bool txpending)
>  {
> +       struct ieee80211_fq *fq = &local->fq;
> +       struct ieee80211_tx_control control = {};
>         struct sk_buff *skb, *tmp;
> +       struct txq_info *txqi;
>         unsigned long flags;
>
>         skb_queue_walk_safe(skbs, skb, tmp) {
> @@ -1345,6 +1612,24 @@ static bool ieee80211_tx_frags(struct ieee80211_local *local,
>                 }
>  #endif
>
> +               /* XXX: This changes behavior for offchan-tx. Is this really a
> +                *      problem with per-sta-tid queueing now?
> +                */
> +               txqi = ieee80211_get_txq(local, vif, sta, skb);
> +               if (txqi) {
> +                       info->control.vif = vif;
> +
> +                       __skb_unlink(skb, skbs);
> +
> +                       spin_lock_bh(&fq->lock);
> +                       ieee80211_txq_enqueue(local, txqi, skb);
> +                       spin_unlock_bh(&fq->lock);
> +
> +                       drv_wake_tx_queue(local, txqi);
> +
> +                       continue;
> +               }
> +
>                 spin_lock_irqsave(&local->queue_stop_reason_lock, flags);
>                 if (local->queue_stop_reasons[q] ||
>                     (!txpending && !skb_queue_empty(&local->pending[q]))) {
> @@ -1387,9 +1672,10 @@ static bool ieee80211_tx_frags(struct ieee80211_local *local,
>                 spin_unlock_irqrestore(&local->queue_stop_reason_lock, flags);
>
>                 info->control.vif = vif;
> +               control.sta = sta;
>
>                 __skb_unlink(skb, skbs);
> -               ieee80211_drv_tx(local, vif, sta, skb);
> +               drv_tx(local, &control, skb);
>         }
>
>         return true;
> diff --git a/net/mac80211/util.c b/net/mac80211/util.c
> index 323d300878ca..0d33cb7339a2 100644
> --- a/net/mac80211/util.c
> +++ b/net/mac80211/util.c
> @@ -244,6 +244,9 @@ void ieee80211_propagate_queue_wake(struct ieee80211_local *local, int queue)
>         struct ieee80211_sub_if_data *sdata;
>         int n_acs = IEEE80211_NUM_ACS;
>
> +       if (local->ops->wake_tx_queue)
> +               return;
> +
>         if (local->hw.queues < IEEE80211_NUM_ACS)
>                 n_acs = 1;
>
> @@ -260,11 +263,6 @@ void ieee80211_propagate_queue_wake(struct ieee80211_local *local, int queue)
>                 for (ac = 0; ac < n_acs; ac++) {
>                         int ac_queue = sdata->vif.hw_queue[ac];
>
> -                       if (local->ops->wake_tx_queue &&
> -                           (atomic_read(&sdata->txqs_len[ac]) >
> -                            local->hw.txq_ac_max_pending))
> -                               continue;
> -
>                         if (ac_queue == queue ||
>                             (sdata->vif.cab_queue == queue &&
>                              local->queue_stop_reasons[ac_queue] == 0 &&
> @@ -352,6 +350,9 @@ static void __ieee80211_stop_queue(struct ieee80211_hw *hw, int queue,
>         if (__test_and_set_bit(reason, &local->queue_stop_reasons[queue]))
>                 return;
>
> +       if (local->ops->wake_tx_queue)
> +               return;
> +
>         if (local->hw.queues < IEEE80211_NUM_ACS)
>                 n_acs = 1;
>
> @@ -3364,8 +3365,11 @@ void ieee80211_init_tx_queue(struct ieee80211_sub_if_data *sdata,
>                              struct sta_info *sta,
>                              struct txq_info *txqi, int tid)
>  {
> -       skb_queue_head_init(&txqi->queue);
> +       INIT_LIST_HEAD(&txqi->old_flows);
> +       INIT_LIST_HEAD(&txqi->new_flows);
> +       ieee80211_init_flow(&txqi->flow);
>         txqi->txq.vif = &sdata->vif;
> +       txqi->flow.txqi = txqi;
>
>         if (sta) {
>                 txqi->txq.sta = &sta->sta;
> @@ -3386,9 +3390,9 @@ void ieee80211_txq_get_depth(struct ieee80211_txq *txq,
>         struct txq_info *txqi = to_txq_info(txq);
>
>         if (frame_cnt)
> -               *frame_cnt = txqi->queue.qlen;
> +               *frame_cnt = txqi->backlog_packets;
>
>         if (byte_cnt)
> -               *byte_cnt = txqi->byte_cnt;
> +               *byte_cnt = txqi->backlog_bytes;
>  }
>  EXPORT_SYMBOL(ieee80211_txq_get_depth);
> --
> 2.1.4
>
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Michal Kazior March 8, 2016, 7:12 a.m. UTC | #15

On 8 March 2016 at 00:06, Dave Taht <dave.taht@gmail.com> wrote:
> Dear Michal:
>
> Going through this patchset... (while watching it compile)
>
>
> +       if (!local->hw.txq_cparams.target)
> +               local->hw.txq_cparams.target = MS2TIME(5);
>
> MS2TIME(20) for now and/or add something saner to this than !*backlog

Yep, makes sense.


> target will not be a constant in the long run.
>
> +       if (now - custom_codel_get_enqueue_time(skb) < p->target ||
> +           !*backlog) {
> +               /* went below - stay below for at least interval */
> +               vars->first_above_time = 0;
> +               return false;
> +       }
>
> *backlog < some_sane_value_for_an_aggregate_for_this_station
>
> Unlike regular codel *backlog should be a ptr to the queuesize for
> this station, not the total queue.
>
> regular codel, by using the shared backlog for all queues, is trying
> to get to a 1 packet depth for all queues, (see commit:
> 865ec5523dadbedefbc5710a68969f686a28d928 ), and store the flow in the
> network, not the queue...
>
> BUT in wifi's case you want to provide good service to all stations,
> which is filling up an aggregate
> for each... (and varying the "sane_value_for_the_aggregate" to suit
> per sta service time requirements in a given round of all stations).

This is tricky. Drivers that use minstrel can probably do the estimate
rather easily.

However other drivers (e.g. ath10k) have offloaded rate control on
device. There's currently no way of doing this calculation. I was
thinking of drivers exporting tx_rate to mac80211 in some way - either
via a simple sta->tx_rate scalar that the driver is responsible for
updating, or a EWMA that driver updates (hopefully) periodically and
often enough. This should in theory at least allow an estimate how
much data on average you can fit into given time frame (e.g. txop, or
hardcoded 1ms).

I'll try looking more into this. Any hints/suggestion welcome.


>
> ...
>
> +       fq->flows_cnt = 4096;
>
> regular fq_codel uses 1024 and there has not been much reason to
> change it. In the case of an AP which has more limited memory, 256 or
> 1024 would be a good setting, per station. I'd stick to 1024 for now.

Do note that the 4096 is shared _across_ station-tid queues. It is not
per-station. If you have 10 stations you still have 4096 flows
(actually 4096 + 16*10, because each tid - and there are 16 - has it's
own fallback flow in case of hash collision on the global flowmap to
maintain per-sta-tid queuing).

With that in mind do you still think 1024 is enough?


> With large values for flows_cnt, fq, dominates, for small values, aqm
> does. We did quite a lot of testing at 16 and 32 queues in the early
> days, with pretty good results, except when we didn't. Cake went whole
> hog with an 8 way set associative hash leading to "near perfect" fq,
> which, at the cost of more cpu overhead, could cut the number of
> queues down by a lot, also. Eric did "perfect" fq with sch_fq...

Out of curiosity - do you have any numbers to compare against
fq_codel? Like hash collision probability vs number of active flows?


> (btw: another way to test how codel is working is to set flows_cnt to
> 1. I will probably end up doing that at some point)
>
> +       fq->perturbation = prandom_u32();
> +       fq->quantum = 300;
>
> quantum 300 is a good compromise to maximize delivery of small packets
> from different flows. Probably the right thing on a station.
>
> It also has cpu overhead. Quantum=1514 is more of the right thing on an AP.
>
> (but to wax philosophical, per packet fairness rather than byte
> fairness probably spreads errors across more flows in a wifi aggregate
> than byte fairness, thus 300 remains a decent compromise if you can
> spare the cpu)
>
> ...
>
> where would be a suitable place to make (count, ecn_marks, drops)
> visible in this subsystem?

I was thinking of using debugfs for starters and then, once things
settle down, move to nl80211. Maybe we could re-use some enums like
TCA_FQ_CODEL_TARGET? Not sure if that's the best idea though. We might
need extra knobs to control and/or have in mind we might want to
replace the queuing algorithm at some point in the future as well
(which will need a different set of knobs).


>
> ...
>
> Is this "per station" or per station, per 802.11e queue?

What do you have in mind by "this"?



Micha?

>
> Dave Täht
> Let's go make home routers and wifi faster! With better software!
> https://www.gofundme.com/savewifi
>
>
> On Fri, Feb 26, 2016 at 5:09 AM, Michal Kazior <michal.kazior@tieto.com> wrote:
>> Since 11n aggregation become important to get the
>> best out of txops. However aggregation inherently
>> requires buffering and queuing. Once variable
>> medium conditions to different associated stations
>> is considered it became apparent that bufferbloat
>> can't be simply fought with qdiscs for wireless
>> drivers. 11ac with MU-MIMO makes the problem
>> worse because the bandwidth-delay product becomes
>> even greater.
>>
>> This bases on codel5 and sch_fq_codel.c. It may
>> not be the Right Thing yet but it should at least
>> provide a framework for more improvements.
>>
>> I guess dropping rate could factor in per-station
>> rate control info but I don't know how this should
>> exactly be done. HW rate control drivers would
>> need extra work to take advantage of this.
>>
>> This obviously works only with drivers that use
>> wake_tx_queue op.
>>
>> Note: This uses IFF_NO_QUEUE to get rid of qdiscs
>> for wireless drivers that use mac80211 and
>> implement wake_tx_queue op.
>>
>> Moreover the current txq_limit and latency setting
>> might need tweaking. Either from userspace or be
>> dynamically scaled with regard to, e.g. number of
>> associated stations.
>>
>> FWIW This already works nicely with ath10k's (not
>> yey merged) pull-push congestion control for
>> MU-MIMO as far as throughput is concerned.
>>
>> Evaluating latency improvements is a little tricky
>> at this point if a driver is using more queue
>> layering and/or its firmware controls tx
>> scheduling - hence I don't have any solid data on
>> this. I'm open for suggestions though.
>>
>> It might also be a good idea to do the following
>> in the future:
>>
>>  - make generic tx scheduling which does some RR
>>    over per-sta-tid queues and dequeues bursts of
>>    packets to form a PPDU to fit into designated
>>    txop timeframe and bytelimit
>>
>>    This could in theory be shared and used by
>>    ath9k and (future) mt76.
>>
>>    Moreover tx scheduling could factor in rate
>>    control info and keep per-station number of
>>    queued packets at a sufficient low threshold to
>>    avoid queue buildup for slow stations. Emmanuel
>>    already did similar experiment for iwlwifi's
>>    station mode and got promising results.
>>
>>  - make software queueing default internally in
>>    mac80211. This could help other drivers to get
>>    at least some benefit from mac80211 smarter
>>    queueing.
>>
>> Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
>> ---
>>  include/net/mac80211.h     |  36 ++++-
>>  net/mac80211/agg-tx.c      |   8 +-
>>  net/mac80211/codel.h       | 260 +++++++++++++++++++++++++++++++
>>  net/mac80211/codel_i.h     |  89 +++++++++++
>>  net/mac80211/ieee80211_i.h |  27 +++-
>>  net/mac80211/iface.c       |  25 ++-
>>  net/mac80211/main.c        |   9 +-
>>  net/mac80211/rx.c          |   2 +-
>>  net/mac80211/sta_info.c    |  10 +-
>>  net/mac80211/sta_info.h    |  27 ++++
>>  net/mac80211/tx.c          | 370 ++++++++++++++++++++++++++++++++++++++++-----
>>  net/mac80211/util.c        |  20 ++-
>>  12 files changed, 805 insertions(+), 78 deletions(-)
>>  create mode 100644 net/mac80211/codel.h
>>  create mode 100644 net/mac80211/codel_i.h
>>
>> diff --git a/include/net/mac80211.h b/include/net/mac80211.h
>> index 6617516a276f..4667d2bad356 100644
>> --- a/include/net/mac80211.h
>> +++ b/include/net/mac80211.h
>> @@ -565,6 +565,18 @@ struct ieee80211_bss_conf {
>>         struct ieee80211_p2p_noa_attr p2p_noa_attr;
>>  };
>>
>> +typedef u64 codel_time_t;
>> +
>> +/*
>> + * struct codel_params - contains codel parameters
>> + * @interval:  initial drop rate
>> + * @target:     maximum persistent sojourn time
>> + */
>> +struct codel_params {
>> +       codel_time_t    interval;
>> +       codel_time_t    target;
>> +};
>> +
>>  /**
>>   * enum mac80211_tx_info_flags - flags to describe transmission information/status
>>   *
>> @@ -886,8 +898,18 @@ struct ieee80211_tx_info {
>>                                 /* only needed before rate control */
>>                                 unsigned long jiffies;
>>                         };
>> -                       /* NB: vif can be NULL for injected frames */
>> -                       struct ieee80211_vif *vif;
>> +                       union {
>> +                               /* NB: vif can be NULL for injected frames */
>> +                               struct ieee80211_vif *vif;
>> +
>> +                               /* When packets are enqueued on txq it's easy
>> +                                * to re-construct the vif pointer. There's no
>> +                                * more space in tx_info so it can be used to
>> +                                * store the necessary enqueue time for packet
>> +                                * sojourn time computation.
>> +                                */
>> +                               codel_time_t enqueue_time;
>> +                       };
>>                         struct ieee80211_key_conf *hw_key;
>>                         u32 flags;
>>                         /* 4 bytes free */
>> @@ -2102,8 +2124,8 @@ enum ieee80211_hw_flags {
>>   * @cipher_schemes: a pointer to an array of cipher scheme definitions
>>   *     supported by HW.
>>   *
>> - * @txq_ac_max_pending: maximum number of frames per AC pending in all txq
>> - *     entries for a vif.
>> + * @txq_cparams: codel parameters to control tx queueing dropping behavior
>> + * @txq_limit: maximum number of frames queuesd
>>   */
>>  struct ieee80211_hw {
>>         struct ieee80211_conf conf;
>> @@ -2133,7 +2155,8 @@ struct ieee80211_hw {
>>         u8 uapsd_max_sp_len;
>>         u8 n_cipher_schemes;
>>         const struct ieee80211_cipher_scheme *cipher_schemes;
>> -       int txq_ac_max_pending;
>> +       struct codel_params txq_cparams;
>> +       u32 txq_limit;
>>  };
>>
>>  static inline bool _ieee80211_hw_check(struct ieee80211_hw *hw,
>> @@ -5602,6 +5625,9 @@ struct sk_buff *ieee80211_tx_dequeue(struct ieee80211_hw *hw,
>>   * txq state can change half-way of this function and the caller may end up
>>   * with "new" frame_cnt and "old" byte_cnt or vice-versa.
>>   *
>> + * Moreover returned values are best-case, i.e. assuming queueing algorithm
>> + * will not drop frames due to excess latency.
>> + *
>>   * @txq: pointer obtained from station or virtual interface
>>   * @frame_cnt: pointer to store frame count
>>   * @byte_cnt: pointer to store byte count
>> diff --git a/net/mac80211/agg-tx.c b/net/mac80211/agg-tx.c
>> index 4932e9f243a2..b9d0cee2a786 100644
>> --- a/net/mac80211/agg-tx.c
>> +++ b/net/mac80211/agg-tx.c
>> @@ -194,17 +194,21 @@ static void
>>  ieee80211_agg_stop_txq(struct sta_info *sta, int tid)
>>  {
>>         struct ieee80211_txq *txq = sta->sta.txq[tid];
>> +       struct ieee80211_sub_if_data *sdata;
>> +       struct ieee80211_fq *fq;
>>         struct txq_info *txqi;
>>
>>         if (!txq)
>>                 return;
>>
>>         txqi = to_txq_info(txq);
>> +       sdata = vif_to_sdata(txq->vif);
>> +       fq = &sdata->local->fq;
>>
>>         /* Lock here to protect against further seqno updates on dequeue */
>> -       spin_lock_bh(&txqi->queue.lock);
>> +       spin_lock_bh(&fq->lock);
>>         set_bit(IEEE80211_TXQ_STOP, &txqi->flags);
>> -       spin_unlock_bh(&txqi->queue.lock);
>> +       spin_unlock_bh(&fq->lock);
>>  }
>>
>>  static void
>> diff --git a/net/mac80211/codel.h b/net/mac80211/codel.h
>> new file mode 100644
>> index 000000000000..f6f1b9b73a9a
>> --- /dev/null
>> +++ b/net/mac80211/codel.h
>> @@ -0,0 +1,260 @@
>> +#ifndef __NET_MAC80211_CODEL_H
>> +#define __NET_MAC80211_CODEL_H
>> +
>> +/*
>> + * Codel - The Controlled-Delay Active Queue Management algorithm
>> + *
>> + *  Copyright (C) 2011-2012 Kathleen Nichols <nichols@pollere.com>
>> + *  Copyright (C) 2011-2012 Van Jacobson <van@pollere.net>
>> + *  Copyright (C) 2016 Michael D. Taht <dave.taht@bufferbloat.net>
>> + *  Copyright (C) 2012 Eric Dumazet <edumazet@google.com>
>> + *  Copyright (C) 2015 Jonathan Morton <chromatix99@gmail.com>
>> + *
>> + * Redistribution and use in source and binary forms, with or without
>> + * modification, are permitted provided that the following conditions
>> + * are met:
>> + * 1. Redistributions of source code must retain the above copyright
>> + *    notice, this list of conditions, and the following disclaimer,
>> + *    without modification.
>> + * 2. Redistributions in binary form must reproduce the above copyright
>> + *    notice, this list of conditions and the following disclaimer in the
>> + *    documentation and/or other materials provided with the distribution.
>> + * 3. The names of the authors may not be used to endorse or promote products
>> + *    derived from this software without specific prior written permission.
>> + *
>> + * Alternatively, provided that this notice is retained in full, this
>> + * software may be distributed under the terms of the GNU General
>> + * Public License ("GPL") version 2, in which case the provisions of the
>> + * GPL apply INSTEAD OF those given above.
>> + *
>> + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
>> + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
>> + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
>> + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
>> + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
>> + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
>> + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
>> + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
>> + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
>> + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
>> + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
>> + * DAMAGE.
>> + *
>> + */
>> +
>> +#include <linux/version.h>
>> +#include <linux/types.h>
>> +#include <linux/ktime.h>
>> +#include <linux/skbuff.h>
>> +#include <net/pkt_sched.h>
>> +#include <net/inet_ecn.h>
>> +#include <linux/reciprocal_div.h>
>> +
>> +#include "codel_i.h"
>> +
>> +/* Controlling Queue Delay (CoDel) algorithm
>> + * =========================================
>> + * Source : Kathleen Nichols and Van Jacobson
>> + * http://queue.acm.org/detail.cfm?id=2209336
>> + *
>> + * Implemented on linux by Dave Taht and Eric Dumazet
>> + */
>> +
>> +/* CoDel5 uses a real clock, unlike codel */
>> +
>> +static inline codel_time_t codel_get_time(void)
>> +{
>> +       return ktime_get_ns();
>> +}
>> +
>> +static inline u32 codel_time_to_us(codel_time_t val)
>> +{
>> +       do_div(val, NSEC_PER_USEC);
>> +       return (u32)val;
>> +}
>> +
>> +/* sizeof_in_bits(rec_inv_sqrt) */
>> +#define REC_INV_SQRT_BITS (8 * sizeof(u16))
>> +/* needed shift to get a Q0.32 number from rec_inv_sqrt */
>> +#define REC_INV_SQRT_SHIFT (32 - REC_INV_SQRT_BITS)
>> +
>> +/* Newton approximation method needs more iterations at small inputs,
>> + * so cache them.
>> + */
>> +
>> +static void codel_vars_init(struct codel_vars *vars)
>> +{
>> +       memset(vars, 0, sizeof(*vars));
>> +}
>> +
>> +/*
>> + * http://en.wikipedia.org/wiki/Methods_of_computing_square_roots#Iterative_methods_for_reciprocal_square_roots
>> + * new_invsqrt = (invsqrt / 2) * (3 - count * invsqrt^2)
>> + *
>> + * Here, invsqrt is a fixed point number (< 1.0), 32bit mantissa, aka Q0.32
>> + */
>> +static inline void codel_Newton_step(struct codel_vars *vars)
>> +{
>> +       u32 invsqrt = ((u32)vars->rec_inv_sqrt) << REC_INV_SQRT_SHIFT;
>> +       u32 invsqrt2 = ((u64)invsqrt * invsqrt) >> 32;
>> +       u64 val = (3LL << 32) - ((u64)vars->count * invsqrt2);
>> +
>> +       val >>= 2; /* avoid overflow in following multiply */
>> +       val = (val * invsqrt) >> (32 - 2 + 1);
>> +
>> +       vars->rec_inv_sqrt = val >> REC_INV_SQRT_SHIFT;
>> +}
>> +
>> +/*
>> + * CoDel control_law is t + interval/sqrt(count)
>> + * We maintain in rec_inv_sqrt the reciprocal value of sqrt(count) to avoid
>> + * both sqrt() and divide operation.
>> + */
>> +static codel_time_t codel_control_law(codel_time_t t,
>> +                                     codel_time_t interval,
>> +                                     u32 rec_inv_sqrt)
>> +{
>> +       return t + reciprocal_scale(interval, rec_inv_sqrt <<
>> +                                   REC_INV_SQRT_SHIFT);
>> +}
>> +
>> +/* Forward declaration of this for use elsewhere */
>> +
>> +static inline codel_time_t
>> +custom_codel_get_enqueue_time(struct sk_buff *skb);
>> +
>> +static inline struct sk_buff *
>> +custom_dequeue(struct codel_vars *vars, void *ptr);
>> +
>> +static inline void
>> +custom_drop(struct sk_buff *skb, void *ptr);
>> +
>> +static bool codel_should_drop(struct sk_buff *skb,
>> +                             __u32 *backlog,
>> +                             struct codel_vars *vars,
>> +                             const struct codel_params *p,
>> +                             codel_time_t now)
>> +{
>> +       if (!skb) {
>> +               vars->first_above_time = 0;
>> +               return false;
>> +       }
>> +
>> +       if (now - custom_codel_get_enqueue_time(skb) < p->target ||
>> +           !*backlog) {
>> +               /* went below - stay below for at least interval */
>> +               vars->first_above_time = 0;
>> +               return false;
>> +       }
>> +
>> +       if (vars->first_above_time == 0) {
>> +               /* just went above from below; mark the time */
>> +               vars->first_above_time = now + p->interval;
>> +
>> +       } else if (now > vars->first_above_time) {
>> +               return true;
>> +       }
>> +
>> +       return false;
>> +}
>> +
>> +static struct sk_buff *codel_dequeue(void *ptr,
>> +                                    __u32 *backlog,
>> +                                    struct codel_vars *vars,
>> +                                    struct codel_params *p,
>> +                                    codel_time_t now,
>> +                                    bool overloaded)
>> +{
>> +       struct sk_buff *skb = custom_dequeue(vars, ptr);
>> +       bool drop;
>> +
>> +       if (!skb) {
>> +               vars->dropping = false;
>> +               return skb;
>> +       }
>> +       drop = codel_should_drop(skb, backlog, vars, p, now);
>> +       if (vars->dropping) {
>> +               if (!drop) {
>> +                       /* sojourn time below target - leave dropping state */
>> +                       vars->dropping = false;
>> +               } else if (now >= vars->drop_next) {
>> +                       /* It's time for the next drop. Drop the current
>> +                        * packet and dequeue the next. The dequeue might
>> +                        * take us out of dropping state.
>> +                        * If not, schedule the next drop.
>> +                        * A large backlog might result in drop rates so high
>> +                        * that the next drop should happen now,
>> +                        * hence the while loop.
>> +                        */
>> +
>> +                       /* saturating increment */
>> +                       vars->count++;
>> +                       if (!vars->count)
>> +                               vars->count--;
>> +
>> +                       codel_Newton_step(vars);
>> +                       vars->drop_next = codel_control_law(vars->drop_next,
>> +                                                           p->interval,
>> +                                                           vars->rec_inv_sqrt);
>> +                       do {
>> +                               if (INET_ECN_set_ce(skb) && !overloaded) {
>> +                                       vars->ecn_mark++;
>> +                                       /* and schedule the next drop */
>> +                                       vars->drop_next = codel_control_law(
>> +                                               vars->drop_next, p->interval,
>> +                                               vars->rec_inv_sqrt);
>> +                                       goto end;
>> +                               }
>> +                               custom_drop(skb, ptr);
>> +                               vars->drop_count++;
>> +                               skb = custom_dequeue(vars, ptr);
>> +                               if (skb && !codel_should_drop(skb, backlog, vars,
>> +                                                             p, now)) {
>> +                                       /* leave dropping state */
>> +                                       vars->dropping = false;
>> +                               } else {
>> +                                       /* schedule the next drop */
>> +                                       vars->drop_next = codel_control_law(
>> +                                               vars->drop_next, p->interval,
>> +                                               vars->rec_inv_sqrt);
>> +                               }
>> +                       } while (skb && vars->dropping && now >=
>> +                                vars->drop_next);
>> +
>> +                       /* Mark the packet regardless */
>> +                       if (skb && INET_ECN_set_ce(skb))
>> +                               vars->ecn_mark++;
>> +               }
>> +       } else if (drop) {
>> +               if (INET_ECN_set_ce(skb) && !overloaded) {
>> +                       vars->ecn_mark++;
>> +               } else {
>> +                       custom_drop(skb, ptr);
>> +                       vars->drop_count++;
>> +
>> +                       skb = custom_dequeue(vars, ptr);
>> +                       drop = codel_should_drop(skb, backlog, vars, p, now);
>> +                       if (skb && INET_ECN_set_ce(skb))
>> +                               vars->ecn_mark++;
>> +               }
>> +               vars->dropping = true;
>> +               /* if min went above target close to when we last went below
>> +                * assume that the drop rate that controlled the queue on the
>> +                * last cycle is a good starting point to control it now.
>> +                */
>> +               if (vars->count > 2 &&
>> +                   now - vars->drop_next < 8 * p->interval) {
>> +                       vars->count -= 2;
>> +                       codel_Newton_step(vars);
>> +               } else {
>> +                       vars->count = 1;
>> +                       vars->rec_inv_sqrt = ~0U >> REC_INV_SQRT_SHIFT;
>> +               }
>> +               codel_Newton_step(vars);
>> +               vars->drop_next = codel_control_law(now, p->interval,
>> +                                                   vars->rec_inv_sqrt);
>> +       }
>> +end:
>> +       return skb;
>> +}
>> +#endif
>> diff --git a/net/mac80211/codel_i.h b/net/mac80211/codel_i.h
>> new file mode 100644
>> index 000000000000..83da7aa5fd9a
>> --- /dev/null
>> +++ b/net/mac80211/codel_i.h
>> @@ -0,0 +1,89 @@
>> +#ifndef __NET_MAC80211_CODEL_I_H
>> +#define __NET_MAC80211_CODEL_I_H
>> +
>> +/*
>> + * Codel - The Controlled-Delay Active Queue Management algorithm
>> + *
>> + *  Copyright (C) 2011-2012 Kathleen Nichols <nichols@pollere.com>
>> + *  Copyright (C) 2011-2012 Van Jacobson <van@pollere.net>
>> + *  Copyright (C) 2016 Michael D. Taht <dave.taht@bufferbloat.net>
>> + *  Copyright (C) 2012 Eric Dumazet <edumazet@google.com>
>> + *  Copyright (C) 2015 Jonathan Morton <chromatix99@gmail.com>
>> + *  Copyright (C) 2016 Michal Kazior <michal.kazior@tieto.com>
>> + *
>> + * Redistribution and use in source and binary forms, with or without
>> + * modification, are permitted provided that the following conditions
>> + * are met:
>> + * 1. Redistributions of source code must retain the above copyright
>> + *    notice, this list of conditions, and the following disclaimer,
>> + *    without modification.
>> + * 2. Redistributions in binary form must reproduce the above copyright
>> + *    notice, this list of conditions and the following disclaimer in the
>> + *    documentation and/or other materials provided with the distribution.
>> + * 3. The names of the authors may not be used to endorse or promote products
>> + *    derived from this software without specific prior written permission.
>> + *
>> + * Alternatively, provided that this notice is retained in full, this
>> + * software may be distributed under the terms of the GNU General
>> + * Public License ("GPL") version 2, in which case the provisions of the
>> + * GPL apply INSTEAD OF those given above.
>> + *
>> + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
>> + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
>> + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
>> + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
>> + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
>> + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
>> + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
>> + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
>> + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
>> + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
>> + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
>> + * DAMAGE.
>> + *
>> + */
>> +
>> +#include <linux/version.h>
>> +#include <linux/types.h>
>> +#include <linux/ktime.h>
>> +#include <linux/skbuff.h>
>> +#include <net/pkt_sched.h>
>> +#include <net/inet_ecn.h>
>> +#include <linux/reciprocal_div.h>
>> +
>> +/* Controlling Queue Delay (CoDel) algorithm
>> + * =========================================
>> + * Source : Kathleen Nichols and Van Jacobson
>> + * http://queue.acm.org/detail.cfm?id=2209336
>> + *
>> + * Implemented on linux by Dave Taht and Eric Dumazet
>> + */
>> +
>> +/* CoDel5 uses a real clock, unlike codel */
>> +
>> +#define MS2TIME(a) (a * (u64) NSEC_PER_MSEC)
>> +#define US2TIME(a) (a * (u64) NSEC_PER_USEC)
>> +
>> +/**
>> + * struct codel_vars - contains codel variables
>> + * @count:             how many drops we've done since the last time we
>> + *                     entered dropping state
>> + * @dropping:          set to > 0 if in dropping state
>> + * @rec_inv_sqrt:      reciprocal value of sqrt(count) >> 1
>> + * @first_above_time:  when we went (or will go) continuously above target
>> + *                     for interval
>> + * @drop_next:         time to drop next packet, or when we dropped last
>> + * @drop_count:        temp count of dropped packets in dequeue()
>> + * @ecn_mark:  number of packets we ECN marked instead of dropping
>> + */
>> +
>> +struct codel_vars {
>> +       u32             count;
>> +       u16             dropping;
>> +       u16             rec_inv_sqrt;
>> +       codel_time_t    first_above_time;
>> +       codel_time_t    drop_next;
>> +       u16             drop_count;
>> +       u16             ecn_mark;
>> +};
>> +#endif
>> diff --git a/net/mac80211/ieee80211_i.h b/net/mac80211/ieee80211_i.h
>> index a96f8c0461f6..c099b81d5a27 100644
>> --- a/net/mac80211/ieee80211_i.h
>> +++ b/net/mac80211/ieee80211_i.h
>> @@ -802,9 +802,12 @@ enum txq_info_flags {
>>  };
>>
>>  struct txq_info {
>> -       struct sk_buff_head queue;
>> +       struct txq_flow flow;
>> +       struct list_head new_flows;
>> +       struct list_head old_flows;
>> +       u32 backlog_bytes;
>> +       u32 backlog_packets;
>>         unsigned long flags;
>> -       unsigned long byte_cnt;
>>
>>         /* keep last! */
>>         struct ieee80211_txq txq;
>> @@ -852,7 +855,6 @@ struct ieee80211_sub_if_data {
>>         bool control_port_no_encrypt;
>>         int encrypt_headroom;
>>
>> -       atomic_t txqs_len[IEEE80211_NUM_ACS];
>>         struct ieee80211_tx_queue_params tx_conf[IEEE80211_NUM_ACS];
>>         struct mac80211_qos_map __rcu *qos_map;
>>
>> @@ -1089,11 +1091,25 @@ enum mac80211_scan_state {
>>         SCAN_ABORT,
>>  };
>>
>> +struct ieee80211_fq {
>> +       struct txq_flow *flows;
>> +       struct list_head backlogs;
>> +       spinlock_t lock;
>> +       u32 flows_cnt;
>> +       u32 perturbation;
>> +       u32 quantum;
>> +       u32 backlog;
>> +
>> +       u32 drop_overlimit;
>> +       u32 drop_codel;
>> +};
>> +
>>  struct ieee80211_local {
>>         /* embed the driver visible part.
>>          * don't cast (use the static inlines below), but we keep
>>          * it first anyway so they become a no-op */
>>         struct ieee80211_hw hw;
>> +       struct ieee80211_fq fq;
>>
>>         const struct ieee80211_ops *ops;
>>
>> @@ -1935,6 +1951,11 @@ static inline bool ieee80211_can_run_worker(struct ieee80211_local *local)
>>  void ieee80211_init_tx_queue(struct ieee80211_sub_if_data *sdata,
>>                              struct sta_info *sta,
>>                              struct txq_info *txq, int tid);
>> +void ieee80211_purge_txq(struct ieee80211_local *local, struct txq_info *txqi);
>> +void ieee80211_init_flow(struct txq_flow *flow);
>> +int ieee80211_setup_flows(struct ieee80211_local *local);
>> +void ieee80211_teardown_flows(struct ieee80211_local *local);
>> +
>>  void ieee80211_send_auth(struct ieee80211_sub_if_data *sdata,
>>                          u16 transaction, u16 auth_alg, u16 status,
>>                          const u8 *extra, size_t extra_len, const u8 *bssid,
>> diff --git a/net/mac80211/iface.c b/net/mac80211/iface.c
>> index 453b4e741780..d1063b50f12c 100644
>> --- a/net/mac80211/iface.c
>> +++ b/net/mac80211/iface.c
>> @@ -779,6 +779,7 @@ static void ieee80211_do_stop(struct ieee80211_sub_if_data *sdata,
>>                               bool going_down)
>>  {
>>         struct ieee80211_local *local = sdata->local;
>> +       struct ieee80211_fq *fq = &local->fq;
>>         unsigned long flags;
>>         struct sk_buff *skb, *tmp;
>>         u32 hw_reconf_flags = 0;
>> @@ -977,12 +978,9 @@ static void ieee80211_do_stop(struct ieee80211_sub_if_data *sdata,
>>         if (sdata->vif.txq) {
>>                 struct txq_info *txqi = to_txq_info(sdata->vif.txq);
>>
>> -               spin_lock_bh(&txqi->queue.lock);
>> -               ieee80211_purge_tx_queue(&local->hw, &txqi->queue);
>> -               txqi->byte_cnt = 0;
>> -               spin_unlock_bh(&txqi->queue.lock);
>> -
>> -               atomic_set(&sdata->txqs_len[txqi->txq.ac], 0);
>> +               spin_lock_bh(&fq->lock);
>> +               ieee80211_purge_txq(local, txqi);
>> +               spin_unlock_bh(&fq->lock);
>>         }
>>
>>         if (local->open_count == 0)
>> @@ -1198,6 +1196,13 @@ static void ieee80211_if_setup(struct net_device *dev)
>>         dev->destructor = ieee80211_if_free;
>>  }
>>
>> +static void ieee80211_if_setup_no_queue(struct net_device *dev)
>> +{
>> +       ieee80211_if_setup(dev);
>> +       dev->priv_flags |= IFF_NO_QUEUE;
>> +       /* Note for backporters: use dev->tx_queue_len = 0 instead of IFF_ */
>> +}
>> +
>>  static void ieee80211_iface_work(struct work_struct *work)
>>  {
>>         struct ieee80211_sub_if_data *sdata =
>> @@ -1707,6 +1712,7 @@ int ieee80211_if_add(struct ieee80211_local *local, const char *name,
>>         struct net_device *ndev = NULL;
>>         struct ieee80211_sub_if_data *sdata = NULL;
>>         struct txq_info *txqi;
>> +       void (*if_setup)(struct net_device *dev);
>>         int ret, i;
>>         int txqs = 1;
>>
>> @@ -1734,12 +1740,17 @@ int ieee80211_if_add(struct ieee80211_local *local, const char *name,
>>                         txq_size += sizeof(struct txq_info) +
>>                                     local->hw.txq_data_size;
>>
>> +               if (local->ops->wake_tx_queue)
>> +                       if_setup = ieee80211_if_setup_no_queue;
>> +               else
>> +                       if_setup = ieee80211_if_setup;
>> +
>>                 if (local->hw.queues >= IEEE80211_NUM_ACS)
>>                         txqs = IEEE80211_NUM_ACS;
>>
>>                 ndev = alloc_netdev_mqs(size + txq_size,
>>                                         name, name_assign_type,
>> -                                       ieee80211_if_setup, txqs, 1);
>> +                                       if_setup, txqs, 1);
>>                 if (!ndev)
>>                         return -ENOMEM;
>>                 dev_net_set(ndev, wiphy_net(local->hw.wiphy));
>> diff --git a/net/mac80211/main.c b/net/mac80211/main.c
>> index 8190bf27ebff..9fd3b10ae52b 100644
>> --- a/net/mac80211/main.c
>> +++ b/net/mac80211/main.c
>> @@ -1053,9 +1053,6 @@ int ieee80211_register_hw(struct ieee80211_hw *hw)
>>
>>         local->dynamic_ps_forced_timeout = -1;
>>
>> -       if (!local->hw.txq_ac_max_pending)
>> -               local->hw.txq_ac_max_pending = 64;
>> -
>>         result = ieee80211_wep_init(local);
>>         if (result < 0)
>>                 wiphy_debug(local->hw.wiphy, "Failed to initialize wep: %d\n",
>> @@ -1087,6 +1084,10 @@ int ieee80211_register_hw(struct ieee80211_hw *hw)
>>
>>         rtnl_unlock();
>>
>> +       result = ieee80211_setup_flows(local);
>> +       if (result)
>> +               goto fail_flows;
>> +
>>  #ifdef CONFIG_INET
>>         local->ifa_notifier.notifier_call = ieee80211_ifa_changed;
>>         result = register_inetaddr_notifier(&local->ifa_notifier);
>> @@ -1112,6 +1113,8 @@ int ieee80211_register_hw(struct ieee80211_hw *hw)
>>  #if defined(CONFIG_INET) || defined(CONFIG_IPV6)
>>   fail_ifa:
>>  #endif
>> +       ieee80211_teardown_flows(local);
>> + fail_flows:
>>         rtnl_lock();
>>         rate_control_deinitialize(local);
>>         ieee80211_remove_interfaces(local);
>> diff --git a/net/mac80211/rx.c b/net/mac80211/rx.c
>> index 664e8861edbe..66c36dc389ec 100644
>> --- a/net/mac80211/rx.c
>> +++ b/net/mac80211/rx.c
>> @@ -1248,7 +1248,7 @@ static void sta_ps_start(struct sta_info *sta)
>>         for (tid = 0; tid < ARRAY_SIZE(sta->sta.txq); tid++) {
>>                 struct txq_info *txqi = to_txq_info(sta->sta.txq[tid]);
>>
>> -               if (!skb_queue_len(&txqi->queue))
>> +               if (!txqi->backlog_packets)
>>                         set_bit(tid, &sta->txq_buffered_tids);
>>                 else
>>                         clear_bit(tid, &sta->txq_buffered_tids);
>> diff --git a/net/mac80211/sta_info.c b/net/mac80211/sta_info.c
>> index 7bbcf5919fe4..456c9fb113fb 100644
>> --- a/net/mac80211/sta_info.c
>> +++ b/net/mac80211/sta_info.c
>> @@ -112,11 +112,7 @@ static void __cleanup_single_sta(struct sta_info *sta)
>>         if (sta->sta.txq[0]) {
>>                 for (i = 0; i < ARRAY_SIZE(sta->sta.txq); i++) {
>>                         struct txq_info *txqi = to_txq_info(sta->sta.txq[i]);
>> -                       int n = skb_queue_len(&txqi->queue);
>> -
>> -                       ieee80211_purge_tx_queue(&local->hw, &txqi->queue);
>> -                       atomic_sub(n, &sdata->txqs_len[txqi->txq.ac]);
>> -                       txqi->byte_cnt = 0;
>> +                       ieee80211_purge_txq(local, txqi);
>>                 }
>>         }
>>
>> @@ -1185,7 +1181,7 @@ void ieee80211_sta_ps_deliver_wakeup(struct sta_info *sta)
>>                 for (i = 0; i < ARRAY_SIZE(sta->sta.txq); i++) {
>>                         struct txq_info *txqi = to_txq_info(sta->sta.txq[i]);
>>
>> -                       if (!skb_queue_len(&txqi->queue))
>> +                       if (!txqi->backlog_packets)
>>                                 continue;
>>
>>                         drv_wake_tx_queue(local, txqi);
>> @@ -1622,7 +1618,7 @@ ieee80211_sta_ps_deliver_response(struct sta_info *sta,
>>                 for (tid = 0; tid < ARRAY_SIZE(sta->sta.txq); tid++) {
>>                         struct txq_info *txqi = to_txq_info(sta->sta.txq[tid]);
>>
>> -                       if (!(tids & BIT(tid)) || skb_queue_len(&txqi->queue))
>> +                       if (!(tids & BIT(tid)) || txqi->backlog_packets)
>>                                 continue;
>>
>>                         sta_info_recalc_tim(sta);
>> diff --git a/net/mac80211/sta_info.h b/net/mac80211/sta_info.h
>> index f4d38994ecee..65431ea5a78d 100644
>> --- a/net/mac80211/sta_info.h
>> +++ b/net/mac80211/sta_info.h
>> @@ -19,6 +19,7 @@
>>  #include <linux/etherdevice.h>
>>  #include <linux/rhashtable.h>
>>  #include "key.h"
>> +#include "codel_i.h"
>>
>>  /**
>>   * enum ieee80211_sta_info_flags - Stations flags
>> @@ -327,6 +328,32 @@ struct mesh_sta {
>>
>>  DECLARE_EWMA(signal, 1024, 8)
>>
>> +struct txq_info;
>> +
>> +/**
>> + * struct txq_flow - per traffic flow queue
>> + *
>> + * This structure is used to distinguish and queue different traffic flows
>> + * separately for fair queueing/AQM purposes.
>> + *
>> + * @txqi: txq_info structure it is associated at given time
>> + * @flowchain: can be linked to other flows for RR purposes
>> + * @backlogchain: can be linked to other flows for backlog sorting purposes
>> + * @queue: sk_buff queue
>> + * @cvars: codel state vars
>> + * @backlog: number of bytes pending in the queue
>> + * @deficit: used for fair queueing balancing
>> + */
>> +struct txq_flow {
>> +       struct txq_info *txqi;
>> +       struct list_head flowchain;
>> +       struct list_head backlogchain;
>> +       struct sk_buff_head queue;
>> +       struct codel_vars cvars;
>> +       u32 backlog;
>> +       u32 deficit;
>> +};
>> +
>>  /**
>>   * struct sta_info - STA information
>>   *
>> diff --git a/net/mac80211/tx.c b/net/mac80211/tx.c
>> index af584f7cdd63..f42f898cb8b5 100644
>> --- a/net/mac80211/tx.c
>> +++ b/net/mac80211/tx.c
>> @@ -34,6 +34,7 @@
>>  #include "wpa.h"
>>  #include "wme.h"
>>  #include "rate.h"
>> +#include "codel.h"
>>
>>  /* misc utils */
>>
>> @@ -1228,26 +1229,312 @@ ieee80211_tx_prepare(struct ieee80211_sub_if_data *sdata,
>>         return TX_CONTINUE;
>>  }
>>
>> -static void ieee80211_drv_tx(struct ieee80211_local *local,
>> -                            struct ieee80211_vif *vif,
>> -                            struct ieee80211_sta *pubsta,
>> -                            struct sk_buff *skb)
>> +static inline codel_time_t
>> +custom_codel_get_enqueue_time(struct sk_buff *skb)
>> +{
>> +       return IEEE80211_SKB_CB(skb)->control.enqueue_time;
>> +}
>> +
>> +static inline struct sk_buff *
>> +flow_dequeue(struct ieee80211_local *local, struct txq_flow *flow)
>> +{
>> +       struct ieee80211_fq *fq = &local->fq;
>> +       struct txq_info *txqi = flow->txqi;
>> +       struct txq_flow *i;
>> +       struct sk_buff *skb;
>> +
>> +       skb = __skb_dequeue(&flow->queue);
>> +       if (!skb)
>> +               return NULL;
>> +
>> +       txqi->backlog_bytes -= skb->len;
>> +       txqi->backlog_packets--;
>> +       flow->backlog -= skb->len;
>> +       fq->backlog--;
>> +
>> +       if (flow->backlog == 0) {
>> +               list_del_init(&flow->backlogchain);
>> +       } else {
>> +               i = flow;
>> +
>> +               list_for_each_entry_continue(i, &fq->backlogs, backlogchain) {
>> +                       if (i->backlog < flow->backlog)
>> +                               break;
>> +               }
>> +
>> +               list_move_tail(&flow->backlogchain, &i->backlogchain);
>> +       }
>> +
>> +       return skb;
>> +}
>> +
>> +static inline struct sk_buff *
>> +custom_dequeue(struct codel_vars *vars, void *ptr)
>> +{
>> +       struct txq_flow *flow = ptr;
>> +       struct txq_info *txqi = flow->txqi;
>> +       struct ieee80211_vif *vif = txqi->txq.vif;
>> +       struct ieee80211_sub_if_data *sdata = vif_to_sdata(vif);
>> +       struct ieee80211_local *local = sdata->local;
>> +
>> +       return flow_dequeue(local, flow);
>> +}
>> +
>> +static inline void
>> +custom_drop(struct sk_buff *skb, void *ptr)
>> +{
>> +       struct txq_flow *flow = ptr;
>> +       struct txq_info *txqi = flow->txqi;
>> +       struct ieee80211_vif *vif = txqi->txq.vif;
>> +       struct ieee80211_sub_if_data *sdata = vif_to_sdata(vif);
>> +       struct ieee80211_local *local = sdata->local;
>> +       struct ieee80211_hw *hw = &local->hw;
>> +
>> +       ieee80211_free_txskb(hw, skb);
>> +       local->fq.drop_codel++;
>> +}
>> +
>> +static u32 fq_hash(struct ieee80211_fq *fq, struct sk_buff *skb)
>> +{
>> +       u32 hash = skb_get_hash_perturb(skb, fq->perturbation);
>> +       return reciprocal_scale(hash, fq->flows_cnt);
>> +}
>> +
>> +static void fq_drop(struct ieee80211_local *local)
>> +{
>> +       struct ieee80211_hw *hw = &local->hw;
>> +       struct ieee80211_fq *fq = &local->fq;
>> +       struct txq_flow *flow;
>> +       struct sk_buff *skb;
>> +
>> +       flow = list_first_entry_or_null(&fq->backlogs, struct txq_flow,
>> +                                       backlogchain);
>> +       if (WARN_ON_ONCE(!flow))
>> +               return;
>> +
>> +       skb = flow_dequeue(local, flow);
>> +       if (WARN_ON_ONCE(!skb))
>> +               return;
>> +
>> +       ieee80211_free_txskb(hw, skb);
>> +       fq->drop_overlimit++;
>> +}
>> +
>> +void ieee80211_init_flow(struct txq_flow *flow)
>> +{
>> +       INIT_LIST_HEAD(&flow->flowchain);
>> +       INIT_LIST_HEAD(&flow->backlogchain);
>> +       __skb_queue_head_init(&flow->queue);
>> +       codel_vars_init(&flow->cvars);
>> +}
>> +
>> +int ieee80211_setup_flows(struct ieee80211_local *local)
>> +{
>> +       struct ieee80211_fq *fq = &local->fq;
>> +       int i;
>> +
>> +       if (!local->ops->wake_tx_queue)
>> +               return 0;
>> +
>> +       if (!local->hw.txq_limit)
>> +               local->hw.txq_limit = 8192;
>> +
>> +       if (!local->hw.txq_cparams.target)
>> +               local->hw.txq_cparams.target = MS2TIME(5);
>> +
>> +       if (!local->hw.txq_cparams.interval)
>> +               local->hw.txq_cparams.interval = MS2TIME(100);
>> +
>> +       memset(fq, 0, sizeof(fq[0]));
>> +       INIT_LIST_HEAD(&fq->backlogs);
>> +       spin_lock_init(&fq->lock);
>> +       fq->flows_cnt = 4096;
>> +       fq->perturbation = prandom_u32();
>> +       fq->quantum = 300;
>> +
>> +       fq->flows = kzalloc(fq->flows_cnt * sizeof(fq->flows[0]), GFP_KERNEL);
>> +       if (!fq->flows)
>> +               return -ENOMEM;
>> +
>> +       for (i = 0; i < fq->flows_cnt; i++)
>> +               ieee80211_init_flow(&fq->flows[i]);
>> +
>> +       return 0;
>> +}
>> +
>> +static void ieee80211_reset_flow(struct ieee80211_local *local,
>> +                                struct txq_flow *flow)
>> +{
>> +       if (!list_empty(&flow->flowchain))
>> +               list_del_init(&flow->flowchain);
>> +
>> +       if (!list_empty(&flow->backlogchain))
>> +               list_del_init(&flow->backlogchain);
>> +
>> +       ieee80211_purge_tx_queue(&local->hw, &flow->queue);
>> +
>> +       flow->deficit = 0;
>> +       flow->txqi = NULL;
>> +}
>> +
>> +void ieee80211_purge_txq(struct ieee80211_local *local, struct txq_info *txqi)
>> +{
>> +       struct txq_flow *flow;
>> +       int i;
>> +
>> +       for (i = 0; i < local->fq.flows_cnt; i++) {
>> +               flow = &local->fq.flows[i];
>> +
>> +               if (flow->txqi != txqi)
>> +                       continue;
>> +
>> +               ieee80211_reset_flow(local, flow);
>> +       }
>> +
>> +       ieee80211_reset_flow(local, &txqi->flow);
>> +
>> +       txqi->backlog_bytes = 0;
>> +       txqi->backlog_packets = 0;
>> +}
>> +
>> +void ieee80211_teardown_flows(struct ieee80211_local *local)
>> +{
>> +       struct ieee80211_fq *fq = &local->fq;
>> +       struct ieee80211_sub_if_data *sdata;
>> +       struct sta_info *sta;
>> +       int i;
>> +
>> +       if (!local->ops->wake_tx_queue)
>> +               return;
>> +
>> +       list_for_each_entry_rcu(sta, &local->sta_list, list)
>> +               for (i = 0; i < IEEE80211_NUM_TIDS; i++)
>> +                       ieee80211_purge_txq(local,
>> +                                           to_txq_info(sta->sta.txq[i]));
>> +
>> +       list_for_each_entry_rcu(sdata, &local->interfaces, list)
>> +               ieee80211_purge_txq(local, to_txq_info(sdata->vif.txq));
>> +
>> +       for (i = 0; i < fq->flows_cnt; i++)
>> +               ieee80211_reset_flow(local, &fq->flows[i]);
>> +
>> +       kfree(fq->flows);
>> +
>> +       fq->flows = NULL;
>> +       fq->flows_cnt = 0;
>> +}
>> +
>> +static void ieee80211_txq_enqueue(struct ieee80211_local *local,
>> +                                 struct txq_info *txqi,
>> +                                 struct sk_buff *skb)
>> +{
>> +       struct ieee80211_fq *fq = &local->fq;
>> +       struct ieee80211_hw *hw = &local->hw;
>> +       struct txq_flow *flow;
>> +       struct txq_flow *i;
>> +       size_t idx = fq_hash(fq, skb);
>> +
>> +       flow = &fq->flows[idx];
>> +
>> +       if (flow->txqi)
>> +               flow = &txqi->flow;
>> +
>> +       /* The following overwrites `vif` pointer effectively. It is later
>> +        * restored using txq structure.
>> +        */
>> +       IEEE80211_SKB_CB(skb)->control.enqueue_time = codel_get_time();
>> +
>> +       flow->txqi = txqi;
>> +       flow->backlog += skb->len;
>> +       txqi->backlog_bytes += skb->len;
>> +       txqi->backlog_packets++;
>> +       fq->backlog++;
>> +
>> +       if (list_empty(&flow->backlogchain))
>> +               i = list_last_entry(&fq->backlogs, struct txq_flow, backlogchain);
>> +       else
>> +               i = flow;
>> +
>> +       list_for_each_entry_continue_reverse(i, &fq->backlogs, backlogchain)
>> +               if (i->backlog > flow->backlog)
>> +                       break;
>> +
>> +       list_move(&flow->backlogchain, &i->backlogchain);
>> +
>> +       if (list_empty(&flow->flowchain)) {
>> +               flow->deficit = fq->quantum;
>> +               list_add_tail(&flow->flowchain, &txqi->new_flows);
>> +       }
>> +
>> +       __skb_queue_tail(&flow->queue, skb);
>> +
>> +       if (fq->backlog > hw->txq_limit)
>> +               fq_drop(local);
>> +}
>> +
>> +static struct sk_buff *ieee80211_txq_dequeue(struct ieee80211_local *local,
>> +                                            struct txq_info *txqi)
>> +{
>> +       struct ieee80211_fq *fq = &local->fq;
>> +       struct ieee80211_hw *hw = &local->hw;
>> +       struct txq_flow *flow;
>> +       struct list_head *head;
>> +       struct sk_buff *skb;
>> +
>> +begin:
>> +       head = &txqi->new_flows;
>> +       if (list_empty(head)) {
>> +               head = &txqi->old_flows;
>> +               if (list_empty(head))
>> +                       return NULL;
>> +       }
>> +
>> +       flow = list_first_entry(head, struct txq_flow, flowchain);
>> +
>> +       if (flow->deficit <= 0) {
>> +               flow->deficit += fq->quantum;
>> +               list_move_tail(&flow->flowchain, &txqi->old_flows);
>> +               goto begin;
>> +       }
>> +
>> +       skb = codel_dequeue(flow, &flow->backlog, &flow->cvars,
>> +                           &hw->txq_cparams, codel_get_time(), false);
>> +       if (!skb) {
>> +               if ((head == &txqi->new_flows) &&
>> +                   !list_empty(&txqi->old_flows)) {
>> +                       list_move_tail(&flow->flowchain, &txqi->old_flows);
>> +               } else {
>> +                       list_del_init(&flow->flowchain);
>> +                       flow->txqi = NULL;
>> +               }
>> +               goto begin;
>> +       }
>> +
>> +       flow->deficit -= skb->len;
>> +
>> +       /* The `vif` pointer was overwritten with enqueue time during
>> +        * enqueuing. Restore it before handing to driver.
>> +        */
>> +       IEEE80211_SKB_CB(skb)->control.vif = flow->txqi->txq.vif;
>> +
>> +       return skb;
>> +}
>> +
>> +static struct txq_info *
>> +ieee80211_get_txq(struct ieee80211_local *local,
>> +                 struct ieee80211_vif *vif,
>> +                 struct ieee80211_sta *pubsta,
>> +                 struct sk_buff *skb)
>>  {
>>         struct ieee80211_hdr *hdr = (struct ieee80211_hdr *) skb->data;
>> -       struct ieee80211_sub_if_data *sdata = vif_to_sdata(vif);
>>         struct ieee80211_tx_info *info = IEEE80211_SKB_CB(skb);
>> -       struct ieee80211_tx_control control = {
>> -               .sta = pubsta,
>> -       };
>>         struct ieee80211_txq *txq = NULL;
>> -       struct txq_info *txqi;
>> -       u8 ac;
>>
>>         if (info->control.flags & IEEE80211_TX_CTRL_PS_RESPONSE)
>> -               goto tx_normal;
>> +               return NULL;
>>
>>         if (!ieee80211_is_data(hdr->frame_control))
>> -               goto tx_normal;
>> +               return NULL;
>>
>>         if (pubsta) {
>>                 u8 tid = skb->priority & IEEE80211_QOS_CTL_TID_MASK;
>> @@ -1258,52 +1545,29 @@ static void ieee80211_drv_tx(struct ieee80211_local *local,
>>         }
>>
>>         if (!txq)
>> -               goto tx_normal;
>> +               return NULL;
>>
>> -       ac = txq->ac;
>> -       txqi = to_txq_info(txq);
>> -       atomic_inc(&sdata->txqs_len[ac]);
>> -       if (atomic_read(&sdata->txqs_len[ac]) >= local->hw.txq_ac_max_pending)
>> -               netif_stop_subqueue(sdata->dev, ac);
>> -
>> -       spin_lock_bh(&txqi->queue.lock);
>> -       txqi->byte_cnt += skb->len;
>> -       __skb_queue_tail(&txqi->queue, skb);
>> -       spin_unlock_bh(&txqi->queue.lock);
>> -
>> -       drv_wake_tx_queue(local, txqi);
>> -
>> -       return;
>> -
>> -tx_normal:
>> -       drv_tx(local, &control, skb);
>> +       return to_txq_info(txq);
>>  }
>>
>>  struct sk_buff *ieee80211_tx_dequeue(struct ieee80211_hw *hw,
>>                                      struct ieee80211_txq *txq)
>>  {
>>         struct ieee80211_local *local = hw_to_local(hw);
>> -       struct ieee80211_sub_if_data *sdata = vif_to_sdata(txq->vif);
>> +       struct ieee80211_fq *fq = &local->fq;
>>         struct txq_info *txqi = container_of(txq, struct txq_info, txq);
>>         struct ieee80211_hdr *hdr;
>>         struct sk_buff *skb = NULL;
>> -       u8 ac = txq->ac;
>>
>> -       spin_lock_bh(&txqi->queue.lock);
>> +       spin_lock_bh(&fq->lock);
>>
>>         if (test_bit(IEEE80211_TXQ_STOP, &txqi->flags))
>>                 goto out;
>>
>> -       skb = __skb_dequeue(&txqi->queue);
>> +       skb = ieee80211_txq_dequeue(local, txqi);
>>         if (!skb)
>>                 goto out;
>>
>> -       txqi->byte_cnt -= skb->len;
>> -
>> -       atomic_dec(&sdata->txqs_len[ac]);
>> -       if (__netif_subqueue_stopped(sdata->dev, ac))
>> -               ieee80211_propagate_queue_wake(local, sdata->vif.hw_queue[ac]);
>> -
>>         hdr = (struct ieee80211_hdr *)skb->data;
>>         if (txq->sta && ieee80211_is_data_qos(hdr->frame_control)) {
>>                 struct sta_info *sta = container_of(txq->sta, struct sta_info,
>> @@ -1318,7 +1582,7 @@ struct sk_buff *ieee80211_tx_dequeue(struct ieee80211_hw *hw,
>>         }
>>
>>  out:
>> -       spin_unlock_bh(&txqi->queue.lock);
>> +       spin_unlock_bh(&fq->lock);
>>
>>         return skb;
>>  }
>> @@ -1330,7 +1594,10 @@ static bool ieee80211_tx_frags(struct ieee80211_local *local,
>>                                struct sk_buff_head *skbs,
>>                                bool txpending)
>>  {
>> +       struct ieee80211_fq *fq = &local->fq;
>> +       struct ieee80211_tx_control control = {};
>>         struct sk_buff *skb, *tmp;
>> +       struct txq_info *txqi;
>>         unsigned long flags;
>>
>>         skb_queue_walk_safe(skbs, skb, tmp) {
>> @@ -1345,6 +1612,24 @@ static bool ieee80211_tx_frags(struct ieee80211_local *local,
>>                 }
>>  #endif
>>
>> +               /* XXX: This changes behavior for offchan-tx. Is this really a
>> +                *      problem with per-sta-tid queueing now?
>> +                */
>> +               txqi = ieee80211_get_txq(local, vif, sta, skb);
>> +               if (txqi) {
>> +                       info->control.vif = vif;
>> +
>> +                       __skb_unlink(skb, skbs);
>> +
>> +                       spin_lock_bh(&fq->lock);
>> +                       ieee80211_txq_enqueue(local, txqi, skb);
>> +                       spin_unlock_bh(&fq->lock);
>> +
>> +                       drv_wake_tx_queue(local, txqi);
>> +
>> +                       continue;
>> +               }
>> +
>>                 spin_lock_irqsave(&local->queue_stop_reason_lock, flags);
>>                 if (local->queue_stop_reasons[q] ||
>>                     (!txpending && !skb_queue_empty(&local->pending[q]))) {
>> @@ -1387,9 +1672,10 @@ static bool ieee80211_tx_frags(struct ieee80211_local *local,
>>                 spin_unlock_irqrestore(&local->queue_stop_reason_lock, flags);
>>
>>                 info->control.vif = vif;
>> +               control.sta = sta;
>>
>>                 __skb_unlink(skb, skbs);
>> -               ieee80211_drv_tx(local, vif, sta, skb);
>> +               drv_tx(local, &control, skb);
>>         }
>>
>>         return true;
>> diff --git a/net/mac80211/util.c b/net/mac80211/util.c
>> index 323d300878ca..0d33cb7339a2 100644
>> --- a/net/mac80211/util.c
>> +++ b/net/mac80211/util.c
>> @@ -244,6 +244,9 @@ void ieee80211_propagate_queue_wake(struct ieee80211_local *local, int queue)
>>         struct ieee80211_sub_if_data *sdata;
>>         int n_acs = IEEE80211_NUM_ACS;
>>
>> +       if (local->ops->wake_tx_queue)
>> +               return;
>> +
>>         if (local->hw.queues < IEEE80211_NUM_ACS)
>>                 n_acs = 1;
>>
>> @@ -260,11 +263,6 @@ void ieee80211_propagate_queue_wake(struct ieee80211_local *local, int queue)
>>                 for (ac = 0; ac < n_acs; ac++) {
>>                         int ac_queue = sdata->vif.hw_queue[ac];
>>
>> -                       if (local->ops->wake_tx_queue &&
>> -                           (atomic_read(&sdata->txqs_len[ac]) >
>> -                            local->hw.txq_ac_max_pending))
>> -                               continue;
>> -
>>                         if (ac_queue == queue ||
>>                             (sdata->vif.cab_queue == queue &&
>>                              local->queue_stop_reasons[ac_queue] == 0 &&
>> @@ -352,6 +350,9 @@ static void __ieee80211_stop_queue(struct ieee80211_hw *hw, int queue,
>>         if (__test_and_set_bit(reason, &local->queue_stop_reasons[queue]))
>>                 return;
>>
>> +       if (local->ops->wake_tx_queue)
>> +               return;
>> +
>>         if (local->hw.queues < IEEE80211_NUM_ACS)
>>                 n_acs = 1;
>>
>> @@ -3364,8 +3365,11 @@ void ieee80211_init_tx_queue(struct ieee80211_sub_if_data *sdata,
>>                              struct sta_info *sta,
>>                              struct txq_info *txqi, int tid)
>>  {
>> -       skb_queue_head_init(&txqi->queue);
>> +       INIT_LIST_HEAD(&txqi->old_flows);
>> +       INIT_LIST_HEAD(&txqi->new_flows);
>> +       ieee80211_init_flow(&txqi->flow);
>>         txqi->txq.vif = &sdata->vif;
>> +       txqi->flow.txqi = txqi;
>>
>>         if (sta) {
>>                 txqi->txq.sta = &sta->sta;
>> @@ -3386,9 +3390,9 @@ void ieee80211_txq_get_depth(struct ieee80211_txq *txq,
>>         struct txq_info *txqi = to_txq_info(txq);
>>
>>         if (frame_cnt)
>> -               *frame_cnt = txqi->queue.qlen;
>> +               *frame_cnt = txqi->backlog_packets;
>>
>>         if (byte_cnt)
>> -               *byte_cnt = txqi->byte_cnt;
>> +               *byte_cnt = txqi->backlog_bytes;
>>  }
>>  EXPORT_SYMBOL(ieee80211_txq_get_depth);
>> --
>> 2.1.4
>>
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Michal Kazior March 8, 2016, 7:41 a.m. UTC | #16

On 7 March 2016 at 19:28, Dave Taht <dave.taht@gmail.com> wrote:
> On Mon, Mar 7, 2016 at 9:14 AM, Avery Pennarun <apenwarr@gmail.com> wrote:
>> On Mon, Mar 7, 2016 at 11:54 AM, Dave Taht <dave.taht@gmail.com> wrote:
[...]
>>> the underlying code needs to be striving successfully for per-station
>>> airtime fairness for this to work at all, and the driver/card
>>> interface nearly as tight as BQL is for the fq portion to behave
>>> sanely. I'd configure codel at a higher target and try to observe what
>>> is going on at the fq level til that got saner.
>>
>> That seems like two good goals.  So Emmanuel's BQL-like thing seems
>> like we'll need it soon.
>>
>> As for per-station airtime fairness, what's a good approximation of
>> that?  Perhaps round-robin between stations, one aggregate per turn,
>> where each aggregate has a maximum allowed latency?
>
> Strict round robin is a start, and simplest, yes. Sure.
>
> "Oldest station queues first" on a round (probably) has higher
> potential for maximizing txops, but requires more overhead. (shortest
> queue first would be bad). There's another algo based on last received
> packets from a station possibly worth fiddling with in the long run...
>
> as "maximum allowed latency" - well, to me that is eventually also a
> variable, based on the number of stations that have to be scheduled on
> that round. Trying to get away from 10 stations eating 5.7ms each +
> return traffic on a round would be nicer. If you want a constant, for
> now, aim for 2048us or 1TU.

The "one aggregate per turn" is a tricky.

I guess you can guarantee this sort of thing on ath9k/mt76.

This isn't the case for other drivers, e.g. ath10k has a flat tx
queue. You don't really know if the 40 frames you submitted will be
sent with 1, 2 or 3 aggregates. They might not be aggregated at all.

Best thing you can do is to estimate how much bytes can you fit into a
txop on target sta-tid/ac assuming you can get last/avg tx rate to
given station (should be doable on ath10k at least).

Moreover, for MU-MIMO you typically want to burst a few aggregates in
txop to make the sounding to pay off. And this is again tricky on flat
tx queue where you don't really know if target stations can do an
efficient MU transmission and worst case you'll end up with stacking
up 3 txops worth of data in queues.

Oh, and the unfortunate thing is ath10k does offloaded powersaving
which means some frames can clog up tx queues unnecessarily until next
TBTT. This is something that will need to be addressed as well in tx
scheduling. Not sure how yet.

A quick idea - perhaps we could unify ps_tx_buf with txqs and make use
of txqs internally regardless of wake_tx_queue implementation?

[...]
> [1] I've published a lot of stuff showing how damaging 802.11e's edca
> scheduling can be - I lean towards, at most, 2-3 aggregates being in
> the hardware, essentially disabling the VO queue on 802.11n (not sure
> on ac), in favor of VI, promoting or demoting an assembled aggregate
> from BE to BK or VI as needed at the last second before submitting it
> to the hardware, trying harder to only have one aggregate outstanding
> to one station at a time, etc.

Makes sense, but (again) tricky for drivers such as ath10k which have
a flat tx queue. Perhaps I could maintain a simulation of aggregates
or some sort of barriers and hope it's "good enough".

Micha?
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Toke Høiland-Jørgensen March 8, 2016, 10:19 a.m. UTC | #17

Michal Kazior <michal.kazior@tieto.com> writes:

>> With large values for flows_cnt, fq, dominates, for small values, aqm
>> does. We did quite a lot of testing at 16 and 32 queues in the early
>> days, with pretty good results, except when we didn't. Cake went whole
>> hog with an 8 way set associative hash leading to "near perfect" fq,
>> which, at the cost of more cpu overhead, could cut the number of
>> queues down by a lot, also. Eric did "perfect" fq with sch_fq...
>
> Out of curiosity - do you have any numbers to compare against
> fq_codel? Like hash collision probability vs number of active flows?

Basically, the analytical expression for hash collisions is fairly
straight forward (though I can't take credit for coming up with it
myself):

Given N bins with M items being hashed into them by a hypothetical
perfectly uniform hash, you get:

  Expected number of bins with x items = N * (1/N)^x * (1 - 1/N) ^ (M - x) * C(M, x)

where C(M, x) is the combinatorial function = M! / (x! * (M-x)!).

By expanding this expression for x=1 and dividing by M, you get the
probability that one of your M items is in its own bin. Subtract this
from 1 and you get the collision probability.

I have a neat spreadsheet to compute this for arbitrary numbers; but for
a 1024-bin FQ-Codel this gives a collision probability of just under 1%
for 10 flows, and just over 9% for 100 flows. This is not too far off
from actual values in a real-world hashing function.

Now, to add to the confusion, you also have to take into account that an
active flow (from an end-to-end perspective) does not necessarily
translate into an active flow from the queue perspective. And that in
fact the number of active flows in a router can be significantly less
than the number of active end-to-end flows, and scales sub-linearly...
There has been at least one paper demonstrating this, but right now I
can't recall who wrote it.

-Toke
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Michal Kazior March 8, 2016, 10:57 a.m. UTC | #18

On 8 March 2016 at 00:06, Dave Taht <dave.taht@gmail.com> wrote:
> Dear Michal:
>
> Going through this patchset... (while watching it compile)
>
>
> +       if (!local->hw.txq_cparams.target)
> +               local->hw.txq_cparams.target = MS2TIME(5);
>
> MS2TIME(20) for now and/or add something saner to this than !*backlog
>
> target will not be a constant in the long run.
>
> +       if (now - custom_codel_get_enqueue_time(skb) < p->target ||
> +           !*backlog) {
> +               /* went below - stay below for at least interval */
> +               vars->first_above_time = 0;
> +               return false;
> +       }
>
>
> *backlog < some_sane_value_for_an_aggregate_for_this_station
>
> Unlike regular codel *backlog should be a ptr to the queuesize for
> this station, not the total queue.
>
> regular codel, by using the shared backlog for all queues, is trying
> to get to a 1 packet depth for all queues, (see commit:
> 865ec5523dadbedefbc5710a68969f686a28d928 ), and store the flow in the
> network, not the queue...
>
> BUT in wifi's case you want to provide good service to all stations,
> which is filling up an aggregate
> for each... (and varying the "sane_value_for_the_aggregate" to suit
> per sta service time requirements in a given round of all stations).

Hmm.. This actually makes me think that:

        skb = codel_dequeue(flow, &flow->backlog, &flow->cvars,
                            &hw->txq_cparams, codel_get_time(), false);

is kind of wrong.

If you want to maintain per-sta aggregate-long queue this probably
needs a sta->backlog instead of flow->backlog (because you can have
multiple flows per station) in the first place.

Not quite sure about cvars though. Thoughts?


Micha?
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Bob Copeland March 8, 2016, 1:14 p.m. UTC | #19

On Tue, Mar 08, 2016 at 08:12:21AM +0100, Michal Kazior wrote:
> However other drivers (e.g. ath10k) have offloaded rate control on
> device. There's currently no way of doing this calculation. I was
> thinking of drivers exporting tx_rate to mac80211 in some way - either
> via a simple sta->tx_rate scalar that the driver is responsible for
> updating, or a EWMA that driver updates (hopefully) periodically and
> often enough. This should in theory at least allow an estimate how
> much data on average you can fit into given time frame (e.g. txop, or
> hardcoded 1ms).

What about implementing ops->get_expected_throughput?  This would be
useful for mesh (both 11s and batman) as well since they need to
estimate link quality to pick a path.

Michal Kazior March 8, 2016, 1:27 p.m. UTC | #20

On 8 March 2016 at 14:14, Bob Copeland <me@bobcopeland.com> wrote:
> On Tue, Mar 08, 2016 at 08:12:21AM +0100, Michal Kazior wrote:
>> However other drivers (e.g. ath10k) have offloaded rate control on
>> device. There's currently no way of doing this calculation. I was
>> thinking of drivers exporting tx_rate to mac80211 in some way - either
>> via a simple sta->tx_rate scalar that the driver is responsible for
>> updating, or a EWMA that driver updates (hopefully) periodically and
>> often enough. This should in theory at least allow an estimate how
>> much data on average you can fit into given time frame (e.g. txop, or
>> hardcoded 1ms).
>
> What about implementing ops->get_expected_throughput?  This would be
> useful for mesh (both 11s and batman) as well since they need to
> estimate link quality to pick a path.

Oh, good point! I totally forgot about this because - in the back of
my head - I was convinced this is available only for minstrel. I guess
this could work as well.


Micha?
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Dave Taht March 10, 2016, 6:57 p.m. UTC | #21

>> regular fq_codel uses 1024 and there has not been much reason to
>> change it. In the case of an AP which has more limited memory, 256 or
>> 1024 would be a good setting, per station. I'd stick to 1024 for now.
>
> Do note that the 4096 is shared _across_ station-tid queues. It is not
> per-station. If you have 10 stations you still have 4096 flows
> (actually 4096 + 16*10, because each tid - and there are 16 - has it's
> own fallback flow in case of hash collision on the global flowmap to
> maintain per-sta-tid queuing).

I have to admit I didn't parse this well - still haven't, I think I
need to draw. (got a picture?)

Where is this part happening in the code (or firmware?)

" because each tid - and there are 16 - has it's
 own fallback flow in case of hash collision on the global flowmap to
 maintain per-sta-tid queuing"

"fallback flow - hash collision on global flowmap" - huh?

> With that in mind do you still think 1024 is enough?

Can't answer that question without understanding what you said above.

I assembled a few of the patches to date (your fq_codel patch, avery's
and tims ath9k stuff) and tested them, to no measurable effect,
against linus's tree a day or two back. I also acquired an ath10k card
- would one of these suit?

http://www.amazon.com/gp/product/B011SIMFR8?psc=1&redirect=true&ref_=oh_aui_detailpage_o08_s00
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Michal Kazior March 11, 2016, 8:32 a.m. UTC | #22

On 10 March 2016 at 19:57, Dave Taht <dave.taht@gmail.com> wrote:
>>> regular fq_codel uses 1024 and there has not been much reason to
>>> change it. In the case of an AP which has more limited memory, 256 or
>>> 1024 would be a good setting, per station. I'd stick to 1024 for now.
>>
>> Do note that the 4096 is shared _across_ station-tid queues. It is not
>> per-station. If you have 10 stations you still have 4096 flows
>> (actually 4096 + 16*10, because each tid - and there are 16 - has it's
>> own fallback flow in case of hash collision on the global flowmap to
>> maintain per-sta-tid queuing).
>
> I have to admit I didn't parse this well - still haven't, I think I
> need to draw. (got a picture?)

No picture, sorry. Let me rehash instead:

 - there's a pool of 4096 tubes (flows),
 - tubes in the pool are empty,
 - each tube has an id sticker (0,1,2,3,...),
 - each wi-fi station has 16 boxes (tids) for tubes as well,
 - each station box has 1 extra immovable tube (flow),
 - packets are hashed into tube id stickers,
 - matching tube can be:
   - in the pool;
     - move the tube into station-box the packet is associated with
     - put the packet into tube
   - in the station-box the packet is associated with
     - just put the packet into tube
   - in a station-box different from the one the packet is associated with
     - (this is the collision case for fallback flow)
     - put the packet into the immovable tube of station box the
packet is associated with
 - whenever a tube becomes empty it is moved back to the pool


> Where is this part happening in the code (or firmware?)

It's in the ieee80211_txq_enqueue(). My patch originally contains a
mistake (pointed out by Felix) and the function should actually look
like this:

> static void ieee80211_txq_enqueue(struct ieee80211_local *local,
>                                   struct txq_info *txqi,
>                                   struct sk_buff *skb)
> {
>         struct ieee80211_fq *fq = &local->fq;
>         struct ieee80211_hw *hw = &local->hw;
>         struct txq_flow *flow;
>         struct txq_flow *i;
>         size_t idx = fq_hash(fq, skb);
>
>         flow = &fq->flows[idx];
>
>         if (flow->txqi && flow->txqi != txqi)
>                 flow = &txqi->flow;
[...]


> I assembled a few of the patches to date (your fq_codel patch, avery's
> and tims ath9k stuff) and tested them, to no measurable effect,
> against linus's tree a day or two back. I also acquired an ath10k card
> - would one of these suit?
>
> http://www.amazon.com/gp/product/B011SIMFR8?psc=1&redirect=true&ref_=oh_aui_detailpage_o08_s00

I haven't used this particular card. It's an older chip (no MU-MIMO)
but otherwise should work fine.


Micha?
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[RFC/RFT] mac80211: implement fq_codel for software queuing

Commit Message

Comments

Patch