[RFC,v2] mac80211: add A-MSDU tx support

Message ID	1454668864-31777-1-git-send-email-nbd@openwrt.org (mailing list archive)
State	RFC
Delegated to:	Johannes Berg
Headers:
Return-Path: <linux-wireless-owner@kernel.org>
From: Felix Fietkau <nbd@openwrt.org>
To: linux-wireless@vger.kernel.org
Cc: johannes@sipsolutions.net
Subject: [RFC v2] mac80211: add A-MSDU tx support
Date: Fri, 5 Feb 2016 11:41:04 +0100
Message-Id: <1454668864-31777-1-git-send-email-nbd@openwrt.org>
Sender: linux-wireless-owner@vger.kernel.org
Precedence: bulk

Felix Fietkau Feb. 5, 2016, 10:41 a.m. UTC

Requires software tx queueing support. frag_list support (for zero-copy)
is optional.

Signed-off-by: Felix Fietkau <nbd@openwrt.org>
---
 include/net/mac80211.h     |  14 +++++
 net/mac80211/agg-tx.c      |   5 ++
 net/mac80211/debugfs.c     |   2 +
 net/mac80211/ieee80211_i.h |   1 +
 net/mac80211/tx.c          | 151 +++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 173 insertions(+)

Emmanuel Grumbach Feb. 7, 2016, 7:25 a.m. UTC | #1

On Fri, Feb 5, 2016 at 12:41 PM, Felix Fietkau <nbd@openwrt.org> wrote:
> Requires software tx queueing support. frag_list support (for zero-copy)
> is optional.
>
> Signed-off-by: Felix Fietkau <nbd@openwrt.org>

Looks nice!
This would allow us to create aggregates of TCP Acks, the problem is
that when you are mostly receiving data, the hardware queues are
pretty much empty (nothing besides the TCP Acks which should go out
quickly) so that packets don't pile up in the software queues and
hence you don't have enough material to build an A-MSDU.
I guess that for AP oriented devices, this is an ideal solution since you
can't rely on TSO (packets are not locally generated) and this allows you
to build an A-MSDU without adding more latency since you build an
A-MSDU with packets that are already in the queue waiting instead of
delaying transmission of the first packet.
IIRC, the latter was the approach chosen by the new Marvell driver
posted a few weeks ago. This approach is better in my eyes.
For iwlwifi, which is much more station oriented (or GO, which is
basically an AP with locally generated traffic), I took the TSO
approach. I guess we could try to change iwlwifi to use your tx
queues, and check how that works. This would allow us to have A-MSDU
on bridged traffic as well, although this use case is much less common
for Intel devices.
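
A minimal sketch of the "aggregate only what is already waiting" idea described above, under loud assumptions: `struct sw_queue`, `try_aggregate()`, `enqueue()` and `QUEUE_DEPTH` are hypothetical names for illustration and are not mac80211 code. Lengths are plain byte counts; subframe headers and padding are ignored.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_DEPTH 16	/* hypothetical queue size, for illustration only */

/* Simplified stand-in for one per-station, per-TID software tx queue. */
struct sw_queue {
	int frame_len[QUEUE_DEPTH];	/* total length of each queued frame */
	int subframes[QUEUE_DEPTH];	/* MSDUs merged into each queued frame */
	size_t count;			/* frames currently waiting */
};

/*
 * Try to merge a new packet into the frame at the tail of the queue.
 * Returns 1 if merged, 0 if the caller must queue it on its own.
 * An empty queue means no aggregation: nothing is ever held back.
 */
static int try_aggregate(struct sw_queue *q, int pkt_len, int max_amsdu_len)
{
	size_t tail;

	if (q->count == 0)
		return 0;

	tail = q->count - 1;
	if (q->frame_len[tail] + pkt_len > max_amsdu_len)
		return 0;

	q->frame_len[tail] += pkt_len;
	q->subframes[tail]++;
	return 1;
}

/* Queue a packet, merging when possible. Returns 0 on success, -1 if full. */
static int enqueue(struct sw_queue *q, int pkt_len, int max_amsdu_len)
{
	assert(pkt_len > 0);

	if (try_aggregate(q, pkt_len, max_amsdu_len))
		return 0;
	if (q->count == QUEUE_DEPTH)
		return -1;

	q->frame_len[q->count] = pkt_len;
	q->subframes[q->count] = 1;
	q->count++;
	return 0;
}
```

In this toy model, with a 4095 byte limit, two queued 1500 byte packets merge into one frame and a third starts a new frame. If the hardware drains the queue faster than packets arrive, `try_aggregate()` always sees an empty queue, which is the TCP Ack concern raised above.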

One small question below.

[snip]

> +
> +static bool ieee80211_amsdu_aggregate(struct ieee80211_sub_if_data *sdata,
> +                                     struct sta_info *sta,
> +                                     struct ieee80211_fast_tx *fast_tx,
> +                                     struct sk_buff *skb)
> +{
> +       struct ieee80211_local *local = sdata->local;
> +       u8 tid = skb->priority & IEEE80211_QOS_CTL_TAG1D_MASK;
> +       struct ieee80211_txq *txq = sta->sta.txq[tid];
> +       struct txq_info *txqi;
> +       struct sk_buff **frag_tail, *head;
> +       int subframe_len = skb->len - ETH_ALEN;
> +       int max_amsdu_len;
> +       __be16 len;
> +       void *data;
> +       bool ret = false;
> +       int n = 1;
> +
> +       if (!ieee80211_hw_check(&local->hw, TX_AMSDU))
> +               return false;
> +
> +       if (!txq)
> +               return false;
> +
> +       txqi = to_txq_info(txq);
> +       if (test_bit(IEEE80211_TXQ_NO_AMSDU, &txqi->flags))
> +               return false;
> +
> +       /*
> +        * A-MPDU limits maximum MPDU size to 4095 bytes. Since aggregation
> +        * sessions are started/stopped without txq flush, use the limit here
> +        * to avoid having to de-aggregate later.
> +        */
> +       max_amsdu_len = min_t(int, sta->sta.max_amsdu_len, 4095);

So you can't get 10K A-MSDUs? I don't see where you check that you
have an A-MPDU session here. You seem to be applying the 4095 limit
also for streams that are not an A-MPDU?
I guess you could check if the sta is a VHT peer, in that case, no
limit applies.

> +
> +       spin_lock_bh(&txqi->queue.lock);
> +
> +       head = skb_peek_tail(&txqi->queue);
> +       if (!head)
> +               goto out;
> +
> +       if (skb->len + head->len > max_amsdu_len)
> +               goto out;
> +
> +       if (!ieee80211_amsdu_prepare_head(sdata, fast_tx, head))
> +               goto out;
> +
> +       frag_tail = &skb_shinfo(head)->frag_list;
> +       while (*frag_tail) {
> +               frag_tail = &(*frag_tail)->next;
> +                n++;
> +       }
> +
> +       if (local->hw.max_tx_amsdu_subframes &&
> +           n > local->hw.max_tx_amsdu_subframes)
> +               goto out;
> +
> +       if (skb_headroom(skb) < 8 || skb_tailroom(skb) < 3) {
> +               I802_DEBUG_INC(local->tx_expand_skb_head);
> +
> +               if (pskb_expand_head(skb, 8, 3, GFP_ATOMIC)) {
> +                       wiphy_debug(local->hw.wiphy,
> +                                   "failed to reallocate TX buffer\n");
> +                       goto out;
> +               }
> +       }
> +
> +       subframe_len += ieee80211_amsdu_pad(skb, subframe_len);
> +
> +       ret = true;
> +       data = skb_push(skb, ETH_ALEN + 2);
> +       memmove(data, data + ETH_ALEN + 2, 2 * ETH_ALEN);
> +
> +       data += 2 * ETH_ALEN;
> +       len = cpu_to_be16(subframe_len);
> +       memcpy(data, &len, 2);
> +       memcpy(data + 2, rfc1042_header, ETH_ALEN);
> +
> +       head->len += skb->len;
> +       head->data_len += skb->len;
> +       *frag_tail = skb;
> +
> +out:
> +       spin_unlock_bh(&txqi->queue.lock);
> +
> +       return ret;
> +}
> +
>  static bool ieee80211_xmit_fast(struct ieee80211_sub_if_data *sdata,
>                                 struct net_device *dev, struct sta_info *sta,
>                                 struct ieee80211_fast_tx *fast_tx,
> @@ -2817,6 +2964,10 @@ static bool ieee80211_xmit_fast(struct ieee80211_sub_if_data *sdata,
>
>         ieee80211_tx_stats(dev, skb->len + extra_head);
>
> +       if ((hdr->frame_control & cpu_to_le16(IEEE80211_STYPE_QOS_DATA)) &&
> +           ieee80211_amsdu_aggregate(sdata, sta, fast_tx, skb))
> +               return true;
> +
>         /* will not be crypto-handled beyond what we do here, so use false
>          * as the may-encrypt argument for the resize to not account for
>          * more room than we already have in 'extra_head'
> --
> 2.2.2
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Felix Fietkau Feb. 7, 2016, 9:08 a.m. UTC | #2

On 2016-02-07 08:25, Emmanuel Grumbach wrote:
> On Fri, Feb 5, 2016 at 12:41 PM, Felix Fietkau <nbd@openwrt.org> wrote:
>> Requires software tx queueing support. frag_list support (for zero-copy)
>> is optional.
>>
>> Signed-off-by: Felix Fietkau <nbd@openwrt.org>
> 
> Looks nice!
> This would allow us to create aggregates of TCP Acks, the problem is
> that when you are mostly receiving data, the hardware queues are
> pretty much empty (nothing besides the TCP Acks which should go out
> quickly) so that packets don't pile up in the software queues and
> hence you don't have enough material to build an A-MSDU.
> I guess that for AP oriented devices, this is an ideal solution since you
> can't rely on TSO (packets are not locally generated) and this allows you
> to build an A-MSDU without adding more latency since you build an
> A-MSDU with packets that are already in the queue waiting instead of
> delaying transmission of the first packet.
> IIRC, the latter was the approach chosen by the new Marvell driver
> posted a few weeks ago. This approach is better in my eyes.
> For iwlwifi, which is much more station oriented (or GO, which is
> basically an AP with locally generated traffic), I took the TSO
> approach. I guess we could try to change iwlwifi to use your tx
> queues, and check how that works. This would allow us to have A-MSDU
> on bridged traffic as well, although this use case is much less common
> for Intel devices.
Can the iwlwifi firmware maintain per-sta per-tid queues? Because that
way you would get the most benefits from using that tx queueing
infrastructure.

>> +
>> +static bool ieee80211_amsdu_aggregate(struct ieee80211_sub_if_data *sdata,
>> +                                     struct sta_info *sta,
>> +                                     struct ieee80211_fast_tx *fast_tx,
>> +                                     struct sk_buff *skb)
>> +{
>> +       struct ieee80211_local *local = sdata->local;
>> +       u8 tid = skb->priority & IEEE80211_QOS_CTL_TAG1D_MASK;
>> +       struct ieee80211_txq *txq = sta->sta.txq[tid];
>> +       struct txq_info *txqi;
>> +       struct sk_buff **frag_tail, *head;
>> +       int subframe_len = skb->len - ETH_ALEN;
>> +       int max_amsdu_len;
>> +       __be16 len;
>> +       void *data;
>> +       bool ret = false;
>> +       int n = 1;
>> +
>> +       if (!ieee80211_hw_check(&local->hw, TX_AMSDU))
>> +               return false;
>> +
>> +       if (!txq)
>> +               return false;
>> +
>> +       txqi = to_txq_info(txq);
>> +       if (test_bit(IEEE80211_TXQ_NO_AMSDU, &txqi->flags))
>> +               return false;
>> +
>> +       /*
>> +        * A-MPDU limits maximum MPDU size to 4095 bytes. Since aggregation
>> +        * sessions are started/stopped without txq flush, use the limit here
>> +        * to avoid having to de-aggregate later.
>> +        */
>> +       max_amsdu_len = min_t(int, sta->sta.max_amsdu_len, 4095);
> 
> So you can't get 10K A-MSDUs? I don't see where you check that you
> have an A-MPDU session here. You seem to be applying the 4095 limit
> also for streams that are not an A-MPDU?
> I guess you could check if the sta is a VHT peer, in that case, no
> limit applies.
The explanation for the missing A-MPDU change is in that comment -
checking for an active A-MPDU session would make it unnecessarily complex.
Good point about checking for VHT capabilities to remove this limit, I
will add that.
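
A rough sketch of what such a check could look like, assuming the peer's advertised maximum A-MSDU length and its VHT status are already known. `amsdu_len_limit()` and `HT_AMPDU_MAX_MPDU_LEN` are hypothetical names for illustration, not the actual follow-up patch.

```c
#include <assert.h>
#include <stdbool.h>

/* MPDU size cap taken from the comment in the patch. */
#define HT_AMPDU_MAX_MPDU_LEN 4095

/*
 * Pick the A-MSDU length limit for a peer.
 * Non-VHT peers are capped at 4095 bytes because an A-MPDU session may
 * start at any time without a txq flush. For VHT peers the cap is
 * dropped, following the suggestion in this thread.
 */
static int amsdu_len_limit(int peer_max_amsdu_len, bool peer_is_vht)
{
	assert(peer_max_amsdu_len > 0);

	if (peer_is_vht)
		return peer_max_amsdu_len;

	return peer_max_amsdu_len < HT_AMPDU_MAX_MPDU_LEN ?
	       peer_max_amsdu_len : HT_AMPDU_MAX_MPDU_LEN;
}
```

For example, an HT peer advertising 7935 bytes would still be limited to 4095, while a VHT peer advertising 11454 bytes would keep its full limit.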

- Felix

Emmanuel Grumbach Feb. 7, 2016, 10:06 a.m. UTC | #3

On Sun, Feb 7, 2016 at 11:08 AM, Felix Fietkau <nbd@openwrt.org> wrote:
> On 2016-02-07 08:25, Emmanuel Grumbach wrote:
>> On Fri, Feb 5, 2016 at 12:41 PM, Felix Fietkau <nbd@openwrt.org> wrote:
>>> Requires software tx queueing support. frag_list support (for zero-copy)
>>> is optional.
>>>
>>> Signed-off-by: Felix Fietkau <nbd@openwrt.org>
>>
>> Looks nice!
>> This would allow us to create aggregates of TCP Acks, the problem is
>> that when you are mostly receiving data, the hardware queues are
>> pretty much empty (nothing besides the TCP Acks which should go out
>> quickly) so that packets don't pile up in the software queues and
>> hence you don't have enough material to build an A-MSDU.
>> I guess that for AP oriented devices, this is an ideal solution since you
>> can't rely on TSO (packets are not locally generated) and this allows you
>> to build an A-MSDU without adding more latency since you build an
>> A-MSDU with packets that are already in the queue waiting instead of
>> delaying transmission of the first packet.
>> IIRC, the latter was the approach chosen by the new Marvell driver
>> posted a few weeks ago. This approach is better in my eyes.
>> For iwlwifi, which is much more station oriented (or GO, which is
>> basically an AP with locally generated traffic), I took the TSO
>> approach. I guess we could try to change iwlwifi to use your tx
>> queues, and check how that works. This would allow us to have A-MSDU
>> on bridged traffic as well, although this use case is much less common
>> for Intel devices.


> Can the iwlwifi firmware maintain per-sta per-tid queues? Because that
> way you would get the most benefits from using that tx queueing
> infrastructure.

Well... iwlwifi and athXk are very different. iwlwifi really has the
concept of queues and not flat descriptors.
Any Tx descriptor lives in the context of a Tx queue, which is
per-sta-per-tid when A-MPDU is enabled.
We are now moving to a scheme in which we will have a per-sta-per-tid
queue regardless of the A-MPDU state, which will make a lot of sense to
tie to the tx queueing infrastructure.
Thing is that in that case I am afraid we will not have enough packets
in the software tx queue to get A-MSDUs from your code. With TSO, it
is easier :) Still worth trying to work with this instead of TSO and
see how it goes. That won't happen anytime soon though.

>
>>> +
>>> +static bool ieee80211_amsdu_aggregate(struct ieee80211_sub_if_data *sdata,
>>> +                                     struct sta_info *sta,
>>> +                                     struct ieee80211_fast_tx *fast_tx,
>>> +                                     struct sk_buff *skb)
>>> +{
>>> +       struct ieee80211_local *local = sdata->local;
>>> +       u8 tid = skb->priority & IEEE80211_QOS_CTL_TAG1D_MASK;
>>> +       struct ieee80211_txq *txq = sta->sta.txq[tid];
>>> +       struct txq_info *txqi;
>>> +       struct sk_buff **frag_tail, *head;
>>> +       int subframe_len = skb->len - ETH_ALEN;
>>> +       int max_amsdu_len;
>>> +       __be16 len;
>>> +       void *data;
>>> +       bool ret = false;
>>> +       int n = 1;
>>> +
>>> +       if (!ieee80211_hw_check(&local->hw, TX_AMSDU))
>>> +               return false;
>>> +
>>> +       if (!txq)
>>> +               return false;
>>> +
>>> +       txqi = to_txq_info(txq);
>>> +       if (test_bit(IEEE80211_TXQ_NO_AMSDU, &txqi->flags))
>>> +               return false;
>>> +
>>> +       /*
>>> +        * A-MPDU limits maximum MPDU size to 4095 bytes. Since aggregation
>>> +        * sessions are started/stopped without txq flush, use the limit here
>>> +        * to avoid having to de-aggregate later.
>>> +        */
>>> +       max_amsdu_len = min_t(int, sta->sta.max_amsdu_len, 4095);
>>
>> So you can't get 10K A-MSDUs? I don't see where you check that you
>> have an A-MPDU session here. You seem to be applying the 4095 limit
>> also for streams that are not an A-MPDU?
>> I guess you could check if the sta is a VHT peer, in that case, no
>> limit applies.
> The explanation for the missing A-MPDU change is in that comment -
> checking for an active A-MPDU session would make it unnecessarily complex.
> Good point about checking for VHT capabilities to remove this limit, I
> will add that.
>
> - Felix

Emmanuel Grumbach Feb. 7, 2016, 10:22 a.m. UTC | #4

On Sun, Feb 7, 2016 at 12:06 PM, Emmanuel Grumbach <egrumbach@gmail.com> wrote:
> On Sun, Feb 7, 2016 at 11:08 AM, Felix Fietkau <nbd@openwrt.org> wrote:
>> On 2016-02-07 08:25, Emmanuel Grumbach wrote:
>>> On Fri, Feb 5, 2016 at 12:41 PM, Felix Fietkau <nbd@openwrt.org> wrote:
>>>> Requires software tx queueing support. frag_list support (for zero-copy)
>>>> is optional.
>>>>
>>>> Signed-off-by: Felix Fietkau <nbd@openwrt.org>
>>>
>>> Looks nice!
>>> This would allow us to create aggregates of TCP Acks, the problem is
>>> that when you are mostly receiving data, the hardware queues are
>>> pretty much empty (nothing besides the TCP Acks which should go out
>>> quickly) so that packets don't pile up in the software queues and
>>> hence you don't have enough material to build an A-MSDU.
>>> I guess that for AP oriented devices, this is an ideal solution since you
>>> can't rely on TSO (packets are not locally generated) and this allows you
>>> to build an A-MSDU without adding more latency since you build an
>>> A-MSDU with packets that are already in the queue waiting instead of
>>> delaying transmission of the first packet.
>>> IIRC, the latter was the approach chosen by the new Marvell driver
>>> posted a few weeks ago. This approach is better in my eyes.
>>> For iwlwifi, which is much more station oriented (or GO, which is
>>> basically an AP with locally generated traffic), I took the TSO
>>> approach. I guess we could try to change iwlwifi to use your tx
>>> queues, and check how that works. This would allow us to have A-MSDU
>>> on bridged traffic as well, although this use case is much less common
>>> for Intel devices.
>
>
>> Can the iwlwifi firmware maintain per-sta per-tid queues? Because that
>> way you would get the most benefits from using that tx queueing
>> infrastructure.
>
> Well... iwlwifi and athXk are very different. iwlwifi really has the
> concept of queues and not flat descriptors.
> Any Tx descriptor lives in the context of a Tx queue, which is
> per-sta-per-tid when A-MPDU is enabled.
> We are now moving to a scheme in which we will have a per-sta-per-tid
> queue regardless of the A-MPDU state, which will make a lot of sense to
> tie to the tx queueing infrastructure.
> Thing is that in that case I am afraid we will not have enough packets
> in the software tx queue to get A-MSDUs from your code. With TSO, it
> is easier :) Still worth trying to work with this instead of TSO and
> see how it goes. That won't happen anytime soon though.
>
>>
>>>> +
>>>> +static bool ieee80211_amsdu_aggregate(struct ieee80211_sub_if_data *sdata,
>>>> +                                     struct sta_info *sta,
>>>> +                                     struct ieee80211_fast_tx *fast_tx,
>>>> +                                     struct sk_buff *skb)
>>>> +{
>>>> +       struct ieee80211_local *local = sdata->local;
>>>> +       u8 tid = skb->priority & IEEE80211_QOS_CTL_TAG1D_MASK;
>>>> +       struct ieee80211_txq *txq = sta->sta.txq[tid];
>>>> +       struct txq_info *txqi;
>>>> +       struct sk_buff **frag_tail, *head;
>>>> +       int subframe_len = skb->len - ETH_ALEN;
>>>> +       int max_amsdu_len;
>>>> +       __be16 len;
>>>> +       void *data;
>>>> +       bool ret = false;
>>>> +       int n = 1;
>>>> +
>>>> +       if (!ieee80211_hw_check(&local->hw, TX_AMSDU))
>>>> +               return false;
>>>> +
>>>> +       if (!txq)
>>>> +               return false;
>>>> +
>>>> +       txqi = to_txq_info(txq);
>>>> +       if (test_bit(IEEE80211_TXQ_NO_AMSDU, &txqi->flags))
>>>> +               return false;
>>>> +
>>>> +       /*
>>>> +        * A-MPDU limits maximum MPDU size to 4095 bytes. Since aggregation
>>>> +        * sessions are started/stopped without txq flush, use the limit here
>>>> +        * to avoid having to de-aggregate later.
>>>> +        */
>>>> +       max_amsdu_len = min_t(int, sta->sta.max_amsdu_len, 4095);
>>>
>>> So you can't get 10K A-MSDUs? I don't see where you check that you
>>> have an A-MPDU session here. You seem to be applying the 4095 limit
>>> also for streams that are not an A-MPDU?
>>> I guess you could check if the sta is a VHT peer, in that case, no
>>> limit applies.
>> The explanation for the missing A-MPDU change is in that comment -
>> checking for an active A-MPDU session would make it unnecessarily complex.
>> Good point about checking for VHT capabilities to remove this limit, I
>> will add that.

Yes - I read the comment, but it seemed very sub-optimal to limit all
the A-MSDUs to 4K. With TSO I can get up to 10K and it really helps
TPT.
One more point. In VHT, there may be a limit on the number of
subframes in the A-MSDU. I don't see you handle that. Maybe I missed
it?

And... in case the driver doesn't handle frag_list, you linearize the
skb, which is pretty much the only thing you can do at this stage. But
when you lift the 4095 byte limit, you'll get 11K A-MSDUs, and
linearizing such a long packet really puts the memory manager
under pressure. This is an order 4 allocation, for each A-MSDU. Note
that iwlwifi (and probably other drivers) can handle gather DMA in Tx,
but they have a limited number of frags they can handle. iwlwifi e.g.
can handle up to 20 frags, but 3 are taken for "paperwork". You'll
have 2 frags per subframe at least (assuming that each subframe's
payload is nicely contiguous and not on a page boundary). I think that
it may be worthwhile to ask the driver how many frags it is supposed
to handle. I can't promise iwlwifi will use it, but I guess it will be
useful for someone.
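
The frag arithmetic above can be sketched as follows. This is only an illustration under stated assumptions: `max_subframes_for_frags()` is a hypothetical helper, not an existing mac80211 or iwlwifi interface, and it takes the best case of a fixed number of frags per subframe.

```c
#include <assert.h>

/*
 * How many A-MSDU subframes fit into a driver's scatter-gather budget.
 * hw_frags:           frags the hardware can handle per frame
 * reserved_frags:     frags the driver needs for its own bookkeeping
 * frags_per_subframe: frags consumed by each subframe (at least 2 in the
 *                     best case described above: header plus payload)
 */
static int max_subframes_for_frags(int hw_frags, int reserved_frags,
				   int frags_per_subframe)
{
	int usable = hw_frags - reserved_frags;

	if (usable <= 0 || frags_per_subframe <= 0)
		return 0;

	return usable / frags_per_subframe;
}
```

With the iwlwifi numbers quoted above (20 frags, 3 reserved, 2 per subframe) this gives 8 subframes. If a payload crosses a page boundary and needs a third frag, the same budget only covers 5.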


>>
>> - Felix

Felix Fietkau Feb. 7, 2016, 10:48 a.m. UTC | #5

On 2016-02-07 11:22, Emmanuel Grumbach wrote:
> On Sun, Feb 7, 2016 at 12:06 PM, Emmanuel Grumbach <egrumbach@gmail.com> wrote:
>> On Sun, Feb 7, 2016 at 11:08 AM, Felix Fietkau <nbd@openwrt.org> wrote:
>>> On 2016-02-07 08:25, Emmanuel Grumbach wrote:
>>>> On Fri, Feb 5, 2016 at 12:41 PM, Felix Fietkau <nbd@openwrt.org> wrote:
>>>>> Requires software tx queueing support. frag_list support (for zero-copy)
>>>>> is optional.
>>>>>
>>>>> Signed-off-by: Felix Fietkau <nbd@openwrt.org>
>>>>
>>>> Looks nice!
>>>> This would allow us to create aggregates of TCP Acks, the problem is
>>>> that when you are mostly receiving data, the hardware queues are
>>>> pretty much empty (nothing besides the TCP Acks which should go out
>>>> quickly) so that packets don't pile up in the software queues and
>>>> hence you don't have enough material to build an A-MSDU.
>>>> I guess that for AP oriented devices, this is an ideal solution since you
>>>> can't rely on TSO (packets are not locally generated) and this allows you
>>>> to build an A-MSDU without adding more latency since you build an
>>>> A-MSDU with packets that are already in the queue waiting instead of
>>>> delaying transmission of the first packet.
>>>> IIRC, the latter was the approach chosen by the new Marvell driver
>>>> posted a few weeks ago. This approach is better in my eyes.
>>>> For iwlwifi, which is much more station oriented (or GO, which is
>>>> basically an AP with locally generated traffic), I took the TSO
>>>> approach. I guess we could try to change iwlwifi to use your tx
>>>> queues, and check how that works. This would allow us to have A-MSDU
>>>> on bridged traffic as well, although this use case is much less common
>>>> for Intel devices.
>>
>>
>>> Can the iwlwifi firmware maintain per-sta per-tid queues? Because that
>>> way you would get the most benefits from using that tx queueing
>>> infrastructure.
>>
>> Well... iwlwifi and athXk are very different. iwlwifi really has the
>> concept of queues and not flat descriptors.
>> Any Tx descriptor lives in the context of a Tx queue, which is
>> per-sta-per-tid when A-MPDU is enabled.
>> We are now moving to a scheme in which we will have a per-sta-per-tid
>> queue regardless of the A-MPDU state, which will make a lot of sense to
>> tie to the tx queueing infrastructure.
>> Thing is that in that case I am afraid we will not have enough packets
>> in the software tx queue to get A-MSDUs from your code. With TSO, it
>> is easier :) Still worth trying to work with this instead of TSO and
>> see how it goes. That won't happen anytime soon though.
>>
>>>
>>>>> +
>>>>> +static bool ieee80211_amsdu_aggregate(struct ieee80211_sub_if_data *sdata,
>>>>> +                                     struct sta_info *sta,
>>>>> +                                     struct ieee80211_fast_tx *fast_tx,
>>>>> +                                     struct sk_buff *skb)
>>>>> +{
>>>>> +       struct ieee80211_local *local = sdata->local;
>>>>> +       u8 tid = skb->priority & IEEE80211_QOS_CTL_TAG1D_MASK;
>>>>> +       struct ieee80211_txq *txq = sta->sta.txq[tid];
>>>>> +       struct txq_info *txqi;
>>>>> +       struct sk_buff **frag_tail, *head;
>>>>> +       int subframe_len = skb->len - ETH_ALEN;
>>>>> +       int max_amsdu_len;
>>>>> +       __be16 len;
>>>>> +       void *data;
>>>>> +       bool ret = false;
>>>>> +       int n = 1;
>>>>> +
>>>>> +       if (!ieee80211_hw_check(&local->hw, TX_AMSDU))
>>>>> +               return false;
>>>>> +
>>>>> +       if (!txq)
>>>>> +               return false;
>>>>> +
>>>>> +       txqi = to_txq_info(txq);
>>>>> +       if (test_bit(IEEE80211_TXQ_NO_AMSDU, &txqi->flags))
>>>>> +               return false;
>>>>> +
>>>>> +       /*
>>>>> +        * A-MPDU limits maximum MPDU size to 4095 bytes. Since aggregation
>>>>> +        * sessions are started/stopped without txq flush, use the limit here
>>>>> +        * to avoid having to de-aggregate later.
>>>>> +        */
>>>>> +       max_amsdu_len = min_t(int, sta->sta.max_amsdu_len, 4095);
>>>>
>>>> So you can't get 10K A-MSDUs? I don't see where you check that you
>>>> have an A-MPDU session here. You seem to be applying the 4095 limit
>>>> also for streams that are not an A-MPDU?
>>>> I guess you could check if the sta is a VHT peer, in that case, no
>>>> limit applies.
>>> The explanation for the missing A-MPDU change is in that comment -
>>> checking for an active A-MPDU session would make it unnecessarily complex.
>>> Good point about checking for VHT capabilities to remove this limit, I
>>> will add that.
> 
> Yes - I read the comment, but it seemed very sub-optimal to limit all
> the A-MSDUs to 4K. With TSO I can get up to 10K and it really helps
> TPT.
This was built with the assumption that most scenarios use A-MPDU anyway
and thus don't need really large A-MSDUs.

> One more point. In VHT, there may be a limit on the number of
> subframes in the A-MSDU. I don't see you handle that. Maybe I missed
> it?
I haven't looked at that much yet. Right now the driver can only specify
a limit for the number of subframes.

> And... in case the driver doesn't handle frag_list, you linearize the
> skb, which is pretty much the only thing you can do at this stage. But
> when you lift the 4095 byte limit, you'll get 11K A-MSDUs, and
> linearizing such a long packet really puts the memory manager
> under pressure.
I added no-frag_list support primarily for debugging purposes, it's not
supposed to perform well.

> This is an order 4 allocation, for each A-MSDU. Note
> that iwlwifi (and probably other drivers) can handle gather DMA in Tx,
> but they have a limited number of frags they can handle. iwlwifi e.g.
> can handle up to 20 frags, but 3 are taken for "paperwork". You'll
> have 2 frags per subframe at least (assuming that each subframe's
> payload is nicely contiguous and not on a page boundary). I think that
> it may be worthwhile to ask the driver how many frags it is supposed
> to handle. I can't promise iwlwifi will use it, but I guess it will be
> useful for someone.
You mean an extra frag limit in addition to the driver subframe limit,
in case individual subframes are fragmented as well?

- Felix

Emmanuel Grumbach Feb. 7, 2016, 11:32 a.m. UTC | #6

On Sun, Feb 7, 2016 at 12:48 PM, Felix Fietkau <nbd@openwrt.org> wrote:
> On 2016-02-07 11:22, Emmanuel Grumbach wrote:
>> On Sun, Feb 7, 2016 at 12:06 PM, Emmanuel Grumbach <egrumbach@gmail.com> wrote:
>>> On Sun, Feb 7, 2016 at 11:08 AM, Felix Fietkau <nbd@openwrt.org> wrote:
>>>> On 2016-02-07 08:25, Emmanuel Grumbach wrote:
>>>>> On Fri, Feb 5, 2016 at 12:41 PM, Felix Fietkau <nbd@openwrt.org> wrote:
>>>>>> Requires software tx queueing support. frag_list support (for zero-copy)
>>>>>> is optional.
>>>>>>
>>>>>> Signed-off-by: Felix Fietkau <nbd@openwrt.org>
>>>>>
>>>>> Looks nice!
>>>>> This would allow us to create aggregates of TCP Acks, the problem is
>>>>> that when you are mostly receiving data, the hardware queues are
>>>>> pretty much empty (nothing besides the TCP Acks which should go out
>>>>> quickly) so that packets don't pile up in the software queues and
>>>>> hence you don't have enough material to build an A-MSDU.
>>>>> I guess that for AP oriented devices, this is an ideal solution since you
>>>>> can't rely on TSO (packets are not locally generated) and this allows you
>>>>> to build an A-MSDU without adding more latency since you build an
>>>>> A-MSDU with packets that are already in the queue waiting instead of
>>>>> delaying transmission of the first packet.
>>>>> IIRC, the latter was the approach chosen by the new Marvell driver
>>>>> posted a few weeks ago. This approach is better in my eyes.
>>>>> For iwlwifi, which is much more station oriented (or GO, which is
>>>>> basically an AP with locally generated traffic), I took the TSO
>>>>> approach. I guess we could try to change iwlwifi to use your tx
>>>>> queues, and check how that works. This would allow us to have A-MSDU
>>>>> on bridged traffic as well, although this use case is much less common
>>>>> for Intel devices.
>>>
>>>
>>>> Can the iwlwifi firmware maintain per-sta per-tid queues? Because that
>>>> way you would get the most benefits from using that tx queueing
>>>> infrastructure.
>>>
>>> Well... iwlwifi and athXk are very different. iwlwifi really has the
>>> concept of queues and not flat descriptors.
>>> Any Tx descriptor lives in the context of a Tx queue, which is
>>> per-sta-per-tid when A-MPDU is enabled.
>>> We are now moving to a scheme in which we will have a per-sta-per-tid
>>> queue regardless of the A-MPDU state, which will make a lot of sense to
>>> tie to the tx queueing infrastructure.
>>> Thing is that in that case I am afraid we will not have enough packets
>>> in the software tx queue to get A-MSDUs from your code. With TSO, it
>>> is easier :) Still worth trying to work with this instead of TSO and
>>> see how it goes. That won't happen anytime soon though.
>>>
>>>>
>>>>>> +
>>>>>> +static bool ieee80211_amsdu_aggregate(struct ieee80211_sub_if_data *sdata,
>>>>>> +                                     struct sta_info *sta,
>>>>>> +                                     struct ieee80211_fast_tx *fast_tx,
>>>>>> +                                     struct sk_buff *skb)
>>>>>> +{
>>>>>> +       struct ieee80211_local *local = sdata->local;
>>>>>> +       u8 tid = skb->priority & IEEE80211_QOS_CTL_TAG1D_MASK;
>>>>>> +       struct ieee80211_txq *txq = sta->sta.txq[tid];
>>>>>> +       struct txq_info *txqi;
>>>>>> +       struct sk_buff **frag_tail, *head;
>>>>>> +       int subframe_len = skb->len - ETH_ALEN;
>>>>>> +       int max_amsdu_len;
>>>>>> +       __be16 len;
>>>>>> +       void *data;
>>>>>> +       bool ret = false;
>>>>>> +       int n = 1;
>>>>>> +
>>>>>> +       if (!ieee80211_hw_check(&local->hw, TX_AMSDU))
>>>>>> +               return false;
>>>>>> +
>>>>>> +       if (!txq)
>>>>>> +               return false;
>>>>>> +
>>>>>> +       txqi = to_txq_info(txq);
>>>>>> +       if (test_bit(IEEE80211_TXQ_NO_AMSDU, &txqi->flags))
>>>>>> +               return false;
>>>>>> +
>>>>>> +       /*
>>>>>> +        * A-MPDU limits maximum MPDU size to 4095 bytes. Since aggregation
>>>>>> +        * sessions are started/stopped without txq flush, use the limit here
>>>>>> +        * to avoid having to de-aggregate later.
>>>>>> +        */
>>>>>> +       max_amsdu_len = min_t(int, sta->sta.max_amsdu_len, 4095);
>>>>>
>>>>> So you can't get 10K A-MSDUs? I don't see where you check that you
>>>>> have an A-MPDU session here. You seem to be applying the 4095 limit
>>>>> also for streams that are not an A-MPDU?
>>>>> I guess you could check if the sta is a VHT peer, in that case, no
>>>>> limit applies.
>>>> The explanation for the missing A-MPDU change is in that comment -
>>>> checking for an active A-MPDU session would make it unnecessarily complex.
>>>> Good point about checking for VHT capabilities to remove this limit, I
>>>> will add that.
>>
>> Yes - I read the comment, but it seemed very sub-optimal to limit all
>> the A-MSDUs to 4K. With TSO I can get up to 10K and it really helps
>> TPT.
> This was built with the assumption that most scenarios use A-MPDU anyway
> and thus don't need really large A-MSDUs.

Yes - so that's interesting. We can choose to have long A-MSDUs inside
a short (in terms of number of MPDUs) A-MPDU, or go with shorter A-MSDUs
and squeeze more of these into a single A-MPDU.
The first intuition says that we'd better have more MPDUs because of
the CRC check for each MPDU which doesn't exist in A-MSDU. OTOH, I
remember I could clearly see that I get a higher TPT with longer
A-MSDUs. Maybe I wasn't looking right at the size of the A-MPDU? I
guess I'd need to go back to the table with all the values we had, but
since we pretty much got what we wanted, I am not sure I will be able to
find time for this :)
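
For reference, the length cap debated above, combined with the VHT exception suggested earlier in the thread, could be sketched roughly as follows. This is only an illustration under stated assumptions: struct peer_caps, its fields and amsdu_len_limit() are hypothetical names, not part of the patch or of mac80211.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-station state used by the patch. */
struct peer_caps {
	int max_amsdu_len;  /* largest A-MSDU the peer accepts, in bytes */
	bool vht_supported; /* peer advertised VHT capabilities */
};

/* The patch comment gives 4095 bytes as the MPDU limit under A-MPDU. */
#define AMPDU_MAX_MPDU_LEN 4095

/* Pick the A-MSDU length limit to use for this peer. */
static int amsdu_len_limit(const struct peer_caps *peer)
{
	int limit = peer->max_amsdu_len;

	/* Assumption from the thread: VHT peers are exempt from the cap. */
	if (!peer->vht_supported && limit > AMPDU_MAX_MPDU_LEN)
		limit = AMPDU_MAX_MPDU_LEN;

	return limit;
}
```

With this shape, a non-VHT peer that accepts 7935-byte A-MSDUs is still held to 4095 bytes, while a VHT peer keeps whatever it advertised.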

>
>> One more point. In VHT, there may be a limit on the numbers of
>> subframes in the A-MSDU. I don't see you handle that. Maybe I missed
>> it?
> I haven't looked at that much yet. Right now the driver can only specify
> a limit for the number of subframes.

I am talking about a limitation the peer can advertise. Check this out:
https://git.kernel.org/cgit/linux/kernel/git/jberg/mac80211-next.git/tree/net/mac80211/cfg.c#n1134

I couldn't see the limit the driver can specify in your code. I may
very well have missed it.
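
To make the two limits concrete, a check that honours both a peer-advertised and a driver-specified subframe count might look something like the sketch below. The helper name and its parameters are hypothetical, and treating 0 as "no limit" is an assumption of this sketch rather than something taken from the patch.

```c
#include <stdbool.h>

/*
 * Hypothetical helper: may one more subframe be appended to an A-MSDU
 * that already holds cur_subframes subframes?
 * Assumption: a limit of 0 means "no limit".
 */
static bool amsdu_can_add_subframe(int cur_subframes,
				   int peer_limit, int driver_limit)
{
	/* Limit advertised by the receiving peer. */
	if (peer_limit && cur_subframes + 1 > peer_limit)
		return false;

	/* Limit specified by the local driver. */
	if (driver_limit && cur_subframes + 1 > driver_limit)
		return false;

	return true;
}
```

The aggregation loop would stop appending as soon as this returns false, so the stricter of the two limits wins.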

>
>> And... in case the driver doesn't handle frag_list, you linearize the
>> skb which is pretty much the only thing you can do at this stage. But,
>> when you'll lift the 4095 bytes limit, you'll get 11K A-MSDU,
>> linearizing such a long packet is really putting the memory manager
>> under pressure.
> I added no-frag_list support primarily for debugging purposes, it's not
> supposed to perform well.

ok.

>
>> This is an order 4 allocation, for each A-MSDU. Note
>> that iwlwifi (and probably other drivers) can handle gather DMA in Tx,
>> but they have a limited number of frags they can handle. iwlwifi e.g.
>> can handle up to 20 frags, but 3 are taken for "paperwork". You'll
>> have 2 frags per subframe at least (assuming that each subframe's
>> payload is nicely contiguous and not on a page boundary). I think that
>> it may be worthwhile to ask the driver how many frags it is supposed
>> to handle. I can't promise iwlwifi will use it, but I guess it will be
>> useful for someone.
> You mean an extra frag limit in addition to the driver subframe limit,
> in case individual subframes are fragmented as well?
>

well.. Yes, you can't assume that you'll have one descriptor for one
MSDU payload  (unless the driver doesn't advertise SG to the
netstack).

> - Felix
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Felix Fietkau Feb. 7, 2016, 11:49 a.m. UTC | #7

On 2016-02-07 12:32, Emmanuel Grumbach wrote:
>>>>>>> +
>>>>>>> +static bool ieee80211_amsdu_aggregate(struct ieee80211_sub_if_data *sdata,
>>>>>>> +                                     struct sta_info *sta,
>>>>>>> +                                     struct ieee80211_fast_tx *fast_tx,
>>>>>>> +                                     struct sk_buff *skb)
>>>>>>> +{
>>>>>>> +       struct ieee80211_local *local = sdata->local;
>>>>>>> +       u8 tid = skb->priority & IEEE80211_QOS_CTL_TAG1D_MASK;
>>>>>>> +       struct ieee80211_txq *txq = sta->sta.txq[tid];
>>>>>>> +       struct txq_info *txqi;
>>>>>>> +       struct sk_buff **frag_tail, *head;
>>>>>>> +       int subframe_len = skb->len - ETH_ALEN;
>>>>>>> +       int max_amsdu_len;
>>>>>>> +       __be16 len;
>>>>>>> +       void *data;
>>>>>>> +       bool ret = false;
>>>>>>> +       int n = 1;
>>>>>>> +
>>>>>>> +       if (!ieee80211_hw_check(&local->hw, TX_AMSDU))
>>>>>>> +               return false;
>>>>>>> +
>>>>>>> +       if (!txq)
>>>>>>> +               return false;
>>>>>>> +
>>>>>>> +       txqi = to_txq_info(txq);
>>>>>>> +       if (test_bit(IEEE80211_TXQ_NO_AMSDU, &txqi->flags))
>>>>>>> +               return false;
>>>>>>> +
>>>>>>> +       /*
>>>>>>> +        * A-MPDU limits maximum MPDU size to 4095 bytes. Since aggregation
>>>>>>> +        * sessions are started/stopped without txq flush, use the limit here
>>>>>>> +        * to avoid having to de-aggregate later.
>>>>>>> +        */
>>>>>>> +       max_amsdu_len = min_t(int, sta->sta.max_amsdu_len, 4095);
>>>>>>
>>>>>> So you can't get 10K A-MSDUs? I don't see where you check that you
>>>>>> have an A-MPDU session here. You seem to be applying the 4095 limit
>>>>>> also for streams that are not an A-MPDU?
>>>>>> I guess you could check if the sta is a VHT peer, in that case, no
>>>>>> limit applies.
>>>>> The explanation for the missing A-MPDU change is in that comment -
>>>>> checking for an active A-MPDU session would make it unnecessarily complex.
>>>>> Good point about checking for VHT capabilities to remove this limit, I
>>>>> will add that.
>>>
>>> Yes - I read the comment, but it seemed very sub-optimal to limit all
>>> the A-MSDUs to 4K. With TSO I can get up to 10K and it really helps
>>> TPT.
>> This was built with the assumption that most scenarios use A-MPDU anyway
>> and thus don't need really large A-MSDUs.
> 
> Yes - so that's interesting. We can choose to have long A-MSDUs inside
> a short (in terms of number of MPDUs) A-MPDU, or go with shorter A-MSDUs
> and squeeze more of these into a single A-MPDU.
> The first intuition says that we'd better have more MPDUs because of
> the CRC check for each MPDU which doesn't exist in A-MSDU. OTOH, I
> remember I could clearly see that I get a higher TPT with longer
> A-MSDUs. Maybe I wasn't looking right at the size of the A-MPDU? I
> guess I'd need to go back to the table with all the values we had, but
> since we pretty much got what we wanted, I am not sure I will be able to
> find time for this :)
I think it also depends on the environment. I'd guess that under very
ideal conditions with very few retransmissions, really long A-MSDU might
have some performance benefits, but I don't think that'll hold if the
conditions are less than ideal and you have rate fluctuation and
retransmissions.

>>> One more point. In VHT, there may be a limit on the numbers of
>>> subframes in the A-MSDU. I don't see you handle that. Maybe I missed
>>> it?
>> I haven't looked at that much yet. Right now the driver can only specify
>> a limit for the number of subframes.
> 
> I am talking about a limitation the peer can advertise. Check this out:
> https://git.kernel.org/cgit/linux/kernel/git/jberg/mac80211-next.git/tree/net/mac80211/cfg.c#n1134
> 
> I couldn't see the limit the driver can specify in your code. I may
> very well have missed it.
I missed that one. Will add it in the next patch.

>>> This is an order 4 allocation, for each A-MSDU. Note
>>> that iwlwifi (and probably other drivers) can handle gather DMA in Tx,
>>> but they have a limited number of frags they can handle. iwlwifi e.g.
>>> can handle up to 20 frags, but 3 are taken for "paperwork". You'll
>>> have 2 frags per subframe at least (assuming that each subframe's
>>> payload is nicely contiguous and not on a page boundary). I think that
>>> it may be worthwhile to ask the driver how many frags it is supposed
>>> to handle. I can't promise iwlwifi will use it, but I guess it will be
>>> useful for someone.
>> You mean an extra frag limit in addition to the driver subframe limit,
>> in case individual subframes are fragmented as well?
>>
> 
> well.. Yes, you can't assume that you'll have one descriptor for one
> MSDU payload  (unless the driver doesn't advertise SG to the
> netstack).
Okay, please make a suggestion describing the exact kinds of limits you
would need for iwlwifi.

- Felix

Emmanuel Grumbach Feb. 7, 2016, 11:56 a.m. UTC | #8

On Sun, Feb 7, 2016 at 1:49 PM, Felix Fietkau <nbd@openwrt.org> wrote:
> On 2016-02-07 12:32, Emmanuel Grumbach wrote:
>>>>>>>> +
>>>>>>>> +static bool ieee80211_amsdu_aggregate(struct ieee80211_sub_if_data *sdata,
>>>>>>>> +                                     struct sta_info *sta,
>>>>>>>> +                                     struct ieee80211_fast_tx *fast_tx,
>>>>>>>> +                                     struct sk_buff *skb)
>>>>>>>> +{
>>>>>>>> +       struct ieee80211_local *local = sdata->local;
>>>>>>>> +       u8 tid = skb->priority & IEEE80211_QOS_CTL_TAG1D_MASK;
>>>>>>>> +       struct ieee80211_txq *txq = sta->sta.txq[tid];
>>>>>>>> +       struct txq_info *txqi;
>>>>>>>> +       struct sk_buff **frag_tail, *head;
>>>>>>>> +       int subframe_len = skb->len - ETH_ALEN;
>>>>>>>> +       int max_amsdu_len;
>>>>>>>> +       __be16 len;
>>>>>>>> +       void *data;
>>>>>>>> +       bool ret = false;
>>>>>>>> +       int n = 1;
>>>>>>>> +
>>>>>>>> +       if (!ieee80211_hw_check(&local->hw, TX_AMSDU))
>>>>>>>> +               return false;
>>>>>>>> +
>>>>>>>> +       if (!txq)
>>>>>>>> +               return false;
>>>>>>>> +
>>>>>>>> +       txqi = to_txq_info(txq);
>>>>>>>> +       if (test_bit(IEEE80211_TXQ_NO_AMSDU, &txqi->flags))
>>>>>>>> +               return false;
>>>>>>>> +
>>>>>>>> +       /*
>>>>>>>> +        * A-MPDU limits maximum MPDU size to 4095 bytes. Since aggregation
>>>>>>>> +        * sessions are started/stopped without txq flush, use the limit here
>>>>>>>> +        * to avoid having to de-aggregate later.
>>>>>>>> +        */
>>>>>>>> +       max_amsdu_len = min_t(int, sta->sta.max_amsdu_len, 4095);
>>>>>>>
>>>>>>> So you can't get 10K A-MSDUs? I don't see where you check that you
>>>>>>> have an A-MPDU session here. You seem to be applying the 4095 limit
>>>>>>> also for streams that are not an A-MPDU?
>>>>>>> I guess you could check if the sta is a VHT peer, in that case, no
>>>>>>> limit applies.
>>>>>> The explanation for the missing A-MPDU change is in that comment -
>>>>>> checking for an active A-MPDU session would make it unnecessarily complex.
>>>>>> Good point about checking for VHT capabilities to remove this limit, I
>>>>>> will add that.
>>>>
>>>> Yes - I read the comment, but it seemed very sub-optimal to limit all
>>>> the A-MSDUs to 4K. With TSO I can get up to 10K and it really helps
>>>> TPT.
>>> This was built with the assumption that most scenarios use A-MPDU anyway
>>> and thus don't need really large A-MSDUs.
>>
>> Yes - so that's interesting. We can choose to have long A-MSDUs inside
>> a short (in terms of number of MPDUs) A-MPDU, or go with shorter A-MSDUs
>> and squeeze more of these into a single A-MPDU.
>> The first intuition says that we'd better have more MPDUs because of
>> the CRC check for each MPDU which doesn't exist in A-MSDU. OTOH, I
>> remember I could clearly see that I get a higher TPT with longer
>> A-MSDUs. Maybe I wasn't looking right at the size of the A-MPDU? I
>> guess I'd need to go back to the table with all the values we had, but
>> since we pretty much got what we wanted, I am not sure I will be able to
>> find time for this :)
> I think it also depends on the environment. I'd guess that under very
> ideal conditions with very few retransmissions, really long A-MSDU might
> have some performance benefits, but I don't think that'll hold if the
> conditions are less than ideal and you have rate fluctuation and
> retransmissions.
>
>>>> One more point. In VHT, there may be a limit on the numbers of
>>>> subframes in the A-MSDU. I don't see you handle that. Maybe I missed
>>>> it?
>>> I haven't looked at that much yet. Right now the driver can only specify
>>> a limit for the number of subframes.
>>
>> I am talking about a limitation the peer can advertise. Check this out:
>> https://git.kernel.org/cgit/linux/kernel/git/jberg/mac80211-next.git/tree/net/mac80211/cfg.c#n1134
>>
>> I couldn't see the limit the driver can specify in your code. I may
>> very well have missed it.
> I missed that one. Will add it in the next patch.
>
>>>> This is an order 4 allocation, for each A-MSDU. Note
>>>> that iwlwifi (and probably other drivers) can handle gather DMA in Tx,
>>>> but they have a limited number of frags they can handle. iwlwifi e.g.
>>>> can handle up to 20 frags, but 3 are taken for "paperwork". You'll
>>>> have 2 frags per subframe at least (assuming that each subframe's
>>>> payload is nicely contiguous and not on a page boundary). I think that
>>>> it may be worthwhile to ask the driver how many frags it is supposed
>>>> to handle. I can't promise iwlwifi will use it, but I guess it will be
>>>> useful for someone.
>>> You mean an extra frag limit in addition to the driver subframe limit,
>>> in case individual subframes are fragmented as well?
>>>
>>
>> well.. Yes, you can't assume that you'll have one descriptor for one
>> MSDU payload  (unless the driver doesn't advertise SG to the
>> netstack).
> Okay, please make a suggestion describing the exact kinds of limits you
> would need for iwlwifi.

Are athX devices able to handle MPDUs with any number of frags? Say if
you have 30 different physically contiguous fragments, the DMA would
be able to load all these into one single packet and send it to the
air?
iwlwifi currently has the limitation of 20 Transmit Buffers (TBs)
which I mentioned earlier. I guess it'd be nice if the driver would be
able to advertise how many fragments it can handle. Then, you'd need
to stop the A-MSDU building if you'd cross this boundary?

You can look at skb_shinfo(skb)->nr_frags to know how many frags you
have for each skb. On top of that, you need 1 frag for each subframe
(subframe header).
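
A rough sketch of that accounting, assuming the driver could advertise a total descriptor count plus the number it reserves for itself (the 20 and 3 mentioned for iwlwifi above). The struct, the function names and the way the linear part of an skb is counted are all hypothetical choices for illustration; nr_frags stands for the value read from skb_shinfo(skb)->nr_frags.

```c
#include <stdbool.h>

/* Hypothetical per-driver limits, e.g. { 20, 3 } for the iwlwifi figures. */
struct frag_budget {
	int max_frags;      /* descriptors the hardware can gather per frame */
	int reserved_frags; /* descriptors the driver needs for itself */
};

/*
 * Descriptors one subframe would use: one for the subframe header,
 * one for linear data if present (an assumption of this sketch),
 * plus one per paged fragment.
 */
static int subframe_frag_cost(int nr_frags, bool has_linear_data)
{
	return 1 + (has_linear_data ? 1 : 0) + nr_frags;
}

/* Would appending this subframe keep the A-MSDU within the budget? */
static bool amsdu_frags_fit(const struct frag_budget *b, int used_frags,
			    int nr_frags, bool has_linear_data)
{
	int avail = b->max_frags - b->reserved_frags;

	return used_frags + subframe_frag_cost(nr_frags, has_linear_data) <= avail;
}
```

A fully linear subframe costs two descriptors under this model, which matches the "2 frags per subframe at least" estimate given earlier in the thread.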

Felix Fietkau Feb. 7, 2016, 1:21 p.m. UTC | #9

On 2016-02-07 12:56, Emmanuel Grumbach wrote:
>>> well.. Yes, you can't assume that you'll have one descriptor for one
>>> MSDU payload  (unless the driver doesn't advertise SG to the
>>> netstack).
>> Okay, please make a suggestion describing the exact kinds of limits you
>> would need for iwlwifi.
> 
> Are athX devices able to handle MPDUs with any number of frags? Say if
> you have 30 different physically contiguous fragments, the DMA would
> be able to load all these into one single packet and send it to the
> air?
I think athX devices have no limitations there. I'm not testing this
with atheros devices though - ath9k does not have mac80211
per-sta-per-tid queueing support yet. I'm working with MediaTek MT76x2
chipsets with my mt76 driver, which I will upstream soon.

> iwlwifi currently has the limitation of 20 Transmit Buffers (TBs)
> which I mentioned earlier. I guess it'd be nice if the driver would be
> able to advertise how many fragments it can handle. Then, you'd need
> to stop the A-MSDU building if you'd cross this boundary?
> 
> You can look at skb_shinfo(skb)->nr_frags to know how many frags you
> have for each skb. On top of that, you need 1 frag for each subframe
> (subframe header).
I implemented all of your suggestions in RFC v3, let me know if
anything's missing.

- Felix
