diff mbox

[v8] mac80211: Move reorder-sensitive TX handlers to after TXQ dequeue.

Message ID 20160906114426.25520-1-toke@toke.dk (mailing list archive)
State Changes Requested
Delegated to: Johannes Berg
Headers show

Commit Message

Toke Høiland-Jørgensen Sept. 6, 2016, 11:44 a.m. UTC
The TXQ intermediate queues can cause packet reordering when more than
one flow is active to a single station. Since some of the wifi-specific
packet handling (notably sequence number and encryption handling) is
sensitive to re-ordering, things break if they are applied before the
TXQ.

This splits up the TX handlers and fast_xmit logic into two parts: An
early part and a late part. The former is applied before TXQ enqueue,
and the latter after dequeue. The non-TXQ path just applies both parts
at once.

Because fragments shouldn't be split up or reordered, the fragmentation
handler is run after dequeue. Any fragments are then kept in the TXQ and
on subsequent dequeues they take precedence over dequeueing from the FQ
structure.

This approach avoids having to scatter special cases for when TXQ is
enabled, at the cost of making the fast_xmit and TX handler code
slightly more complex.

Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
---
Changes since v8:
- Don't rely on the fast_tx pointer on TXQ dequeue; it can go away while
  the packet was queued, and we don't actually need it, since we can get
  the key configuration and offset from the packet info.

 include/net/mac80211.h     |   2 +
 net/mac80211/ieee80211_i.h |   2 +
 net/mac80211/tx.c          | 255 +++++++++++++++++++++++++++++++++------------
 3 files changed, 195 insertions(+), 64 deletions(-)

Comments

Felix Fietkau Sept. 6, 2016, 10:04 p.m. UTC | #1
On 2016-09-06 13:44, Toke Høiland-Jørgensen wrote:
> The TXQ intermediate queues can cause packet reordering when more than
> one flow is active to a single station. Since some of the wifi-specific
> packet handling (notably sequence number and encryption handling) is
> sensitive to re-ordering, things break if they are applied before the
> TXQ.
> 
> This splits up the TX handlers and fast_xmit logic into two parts: An
> early part and a late part. The former is applied before TXQ enqueue,
> and the latter after dequeue. The non-TXQ path just applies both parts
> at once.
> 
> Because fragments shouldn't be split up or reordered, the fragmentation
> handler is run after dequeue. Any fragments are then kept in the TXQ and
> on subsequent dequeues they take precedence over dequeueing from the FQ
> structure.
> 
> This approach avoids having to scatter special cases for when TXQ is
> enabled, at the cost of making the fast_xmit and TX handler code
> slightly more complex.
> 
> Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
Acked-by: Felix Fietkau <nbd@nbd.name>
Johannes Berg Sept. 12, 2016, 12:35 p.m. UTC | #2
> +static int invoke_tx_handlers_late(struct ieee80211_tx_data *tx);
> +static bool ieee80211_xmit_fast_finish(struct ieee80211_sub_if_data *sdata,
> +				       struct sta_info *sta, u8 pn_offs,
> +				       struct ieee80211_key_conf *key_conf,
> +				       struct sk_buff *skb);
> +

I'm not very happy with this - I think you should do some
refactoring/code move in a separate prior patch to avoid this.

> +	if (txq->sta && info->control.flags & IEEE80211_TX_CTRL_FAST_XMIT) {
>  		struct sta_info *sta = container_of(txq->sta, struct sta_info,
>  						    sta);
> -		struct ieee80211_tx_info *info = IEEE80211_SKB_CB(skb);
> +		u8 pn_offs = 0;
>  
> -		hdr->seq_ctrl = ieee80211_tx_next_seq(sta, txq->tid);
> -		if (test_bit(IEEE80211_TXQ_AMPDU, &txqi->flags))
> -			info->flags |= IEEE80211_TX_CTL_AMPDU;
> -		else
> -			info->flags &= ~IEEE80211_TX_CTL_AMPDU;
> +		if (info->control.hw_key)
> +			pn_offs = ieee80211_hdrlen(hdr->frame_control);

Not very happy with this either - the fast-xmit path explicitly tries
to avoid all these calculations.

I suppose I don't have to care all that much about the TXQs, but ...

Then again, adding a field in the skb->cb for the sake of this? No, not really either.


> +		ieee80211_xmit_fast_finish(sta->sdata, sta, pn_offs,
> +					   info->control.hw_key, skb);

I don't see how keeping the info->control.hw_key pointer across the
TXQ/FQ/Codel queueing isn't a potential bug? Probably one that already
exists in your code today, before this patch, of course.


> +	} else {
> +		struct ieee80211_tx_data tx = { };
> +
> +		__skb_queue_head_init(&tx.skbs);
> +		tx.local = local;
> +		tx.skb = skb;

an empty initializer is weird - why not at least move local/skb
initializations into it? Even txq->sta, I guess, since you can assign
txq->sta either way.

> -	CALL_TXH(ieee80211_tx_h_select_key);
> +
>  	if (!ieee80211_hw_check(&tx->local->hw, HAS_RATE_CONTROL))
>  		CALL_TXH(ieee80211_tx_h_rate_ctrl);
[...]
> 	if (unlikely(info->flags & IEEE80211_TX_INTFL_RETRANSMISSION)) {
>  		__skb_queue_tail(&tx->skbs, tx->skb);
>  		tx->skb = NULL;
>  		goto txh_done;
>  	}
> 
> +	CALL_TXH(ieee80211_tx_h_select_key);

What happens for the IEEE80211_TX_INTFL_RETRANSMISSION packets wrt. key
selection? Why is it OK to change this?

johannes
Toke Høiland-Jørgensen Sept. 12, 2016, 1:08 p.m. UTC | #3
Johannes Berg <johannes@sipsolutions.net> writes:

>> +static int invoke_tx_handlers_late(struct ieee80211_tx_data *tx);
>> +static bool ieee80211_xmit_fast_finish(struct ieee80211_sub_if_data *sdata,
>> +				       struct sta_info *sta, u8 pn_offs,
>> +				       struct ieee80211_key_conf *key_conf,
>> +				       struct sk_buff *skb);
>> +
>
> I'm not very happy with this - I think you should do some
> refactoring/code move in a separate prior patch to avoid this.

Noted, will do.

>> +	if (txq->sta && info->control.flags & IEEE80211_TX_CTRL_FAST_XMIT) {
>>  		struct sta_info *sta = container_of(txq->sta, struct sta_info,
>>  						    sta);
>> -		struct ieee80211_tx_info *info = IEEE80211_SKB_CB(skb);
>> +		u8 pn_offs = 0;
>>  
>> -		hdr->seq_ctrl = ieee80211_tx_next_seq(sta, txq->tid);
>> -		if (test_bit(IEEE80211_TXQ_AMPDU, &txqi->flags))
>> -			info->flags |= IEEE80211_TX_CTL_AMPDU;
>> -		else
>> -			info->flags &= ~IEEE80211_TX_CTL_AMPDU;
>> +		if (info->control.hw_key)
>> +			pn_offs = ieee80211_hdrlen(hdr->frame_control);
>
> Not very happy with this either - the fast-xmit path explicitly tries
> to avoid all these calculations.

Well, the TXQ already adds a lot of other overhead (hashing on the
packet header, for one), so my guess would be that this would be
negligible compared to all that? 

> I suppose I don't have to care all that much about the TXQs, but ...
>
> Then again, adding a field in the skb->cb for the sake of this? No,
> not really either.

So that's a "keep it", then? :)

>> +		ieee80211_xmit_fast_finish(sta->sdata, sta, pn_offs,
>> +					   info->control.hw_key, skb);
>
> I don't see how keeping the info->control.hw_key pointer across the
> TXQ/FQ/Codel queueing isn't a potential bug? Probably one that already
> exists in your code today, before this patch, of course.

You mean the key could get removed from the hardware while the packet
was queued? Can certainly add a check for that. Under what conditions
does that happen? Does it make sense to try to recover from it (I guess
by calling tx_h_select_key), or is it rare enough that giving up and
dropping the packet makes more sense?

>> +	} else {
>> +		struct ieee80211_tx_data tx = { };
>> +
>> +		__skb_queue_head_init(&tx.skbs);
>> +		tx.local = local;
>> +		tx.skb = skb;
>
> an empty initializer is weird - why not at least move local/skb
> initializations into it? Even txq->sta, I guess, since you can assign
> txq->sta either way.

Yup, makes sense. Noted.

>> -	CALL_TXH(ieee80211_tx_h_select_key);
>> +
>>  	if (!ieee80211_hw_check(&tx->local->hw, HAS_RATE_CONTROL))
>>  		CALL_TXH(ieee80211_tx_h_rate_ctrl);
> [...]
>> 	if (unlikely(info->flags & IEEE80211_TX_INTFL_RETRANSMISSION)) {
>>  		__skb_queue_tail(&tx->skbs, tx->skb);
>>  		tx->skb = NULL;
>>  		goto txh_done;
>>  	}
>> 
>> +	CALL_TXH(ieee80211_tx_h_select_key);
>
> What happens for the IEEE80211_TX_INTFL_RETRANSMISSION packets wrt.
> key selection? Why is it OK to change this?

You're right, that's an oversight on my part. Will fix.

-Toke
Johannes Berg Sept. 12, 2016, 1:19 p.m. UTC | #4
> Well, the TXQ already adds a lot of other overhead (hashing on the
> packet header, for one), so my guess would be that this would be
> negligible compared to all that? 
> 
> > 
> > I suppose I don't have to care all that much about the TXQs, but
> > ...
> > 
> > Then again, adding a field in the skb->cb for the sake of this? No,
> > not really either.
> 
> So that's a "keep it", then? :)

Yeah I think so :)

> > > +		ieee80211_xmit_fast_finish(sta->sdata, sta,
> > > pn_offs,
> > > +					   info->control.hw_key, 
> > > skb);
> > 
> > I don't see how keeping the info->control.hw_key pointer across the
> > TXQ/FQ/Codel queueing isn't a potential bug? Probably one that
> > already exists in your code today, before this patch, of course.
> 
> You mean the key could get removed from the hardware while the packet
> was queued? Can certainly add a check for that. Under what conditions
> does that happen? Does it make sense to try to recover from it (I
> guess by calling tx_h_select_key), or is it rare enough that giving
> up and dropping the packet makes more sense?

Not just from the hardware, more importantly the whole key structure
can be kfree()d, leading to use-after-free here, no?

Fast-xmit solves this by invalidating the fast-xmit cache when the key
pointer changes/goes away and possibly punting some frames to the slow
path, but you've absolutely no protection on these pointers here within
the TXQs, afaict?

A similar situation occurs with other pointers, like stations and vifs,
but when those are removed then obviously the entire TXQs are flushed,
so they're not relevant.

With the key though, frames can be on the queue while a key is removed,
and even before this patch, drivers would consequently access an
invalid key pointer.

Mind you, as I just wrote I think that issue exists even before this
patch, so you should probably look at it separately. Felix might know
better too.

johannes
Toke Høiland-Jørgensen Sept. 22, 2016, 5:04 p.m. UTC | #5
This is the ninth iteration of my attempts to reorder the TXQ dequeue
path to avoid issues with reorder-sensitive operations. This version is
split into two patches; the first one moves ieee80211_tx_dequeue() to
avoid adding function stubs at the top of tx.c.

Changes since v8:
- Don't add function stubs to the beginning of tx.c
- Don't use control.hw_key from the dequeued packet, since that can go
  away while the packet is queued. Instead, run the select_key handler
  on dequeue and use the key from that.
- Change places that check tin.backlog_packets as an indication of
  whether the TXQ has anything queued to also look at the 'frags' queue.
- Don't change the order of the select_key handler with respect to the
  other handlers.
- Rebase on current mac80211-next tree.

Toke Høiland-Jørgensen (2):
  mac80211: Move ieee802111_tx_dequeue() to later in tx.c
  mac80211: Move reorder-sensitive TX handlers to after TXQ dequeue

 include/net/mac80211.h     |   2 +
 net/mac80211/ieee80211_i.h |   8 ++
 net/mac80211/rx.c          |   4 +-
 net/mac80211/sta_info.c    |  10 +-
 net/mac80211/tx.c          | 335 +++++++++++++++++++++++++++++++--------------
 net/mac80211/util.c        |  11 +-
 6 files changed, 256 insertions(+), 114 deletions(-)
diff mbox

Patch

diff --git a/include/net/mac80211.h b/include/net/mac80211.h
index cca510a..9a6a3e9 100644
--- a/include/net/mac80211.h
+++ b/include/net/mac80211.h
@@ -715,6 +715,7 @@  enum mac80211_tx_info_flags {
  *	frame (PS-Poll or uAPSD).
  * @IEEE80211_TX_CTRL_RATE_INJECT: This frame is injected with rate information
  * @IEEE80211_TX_CTRL_AMSDU: This frame is an A-MSDU frame
+ * @IEEE80211_TX_CTRL_FAST_XMIT: This frame is going through the fast_xmit path
  *
  * These flags are used in tx_info->control.flags.
  */
@@ -723,6 +724,7 @@  enum mac80211_tx_control_flags {
 	IEEE80211_TX_CTRL_PS_RESPONSE		= BIT(1),
 	IEEE80211_TX_CTRL_RATE_INJECT		= BIT(2),
 	IEEE80211_TX_CTRL_AMSDU			= BIT(3),
+	IEEE80211_TX_CTRL_FAST_XMIT		= BIT(4),
 };
 
 /*
diff --git a/net/mac80211/ieee80211_i.h b/net/mac80211/ieee80211_i.h
index 9211cce..d36f3b1 100644
--- a/net/mac80211/ieee80211_i.h
+++ b/net/mac80211/ieee80211_i.h
@@ -813,11 +813,13 @@  enum txq_info_flags {
  * @def_flow: used as a fallback flow when a packet destined to @tin hashes to
  *	a fq_flow which is already owned by a different tin
  * @def_cvars: codel vars for @def_flow
+ * @frags: used to keep fragments created after dequeue
  */
 struct txq_info {
 	struct fq_tin tin;
 	struct fq_flow def_flow;
 	struct codel_vars def_cvars;
+	struct sk_buff_head frags;
 	unsigned long flags;
 
 	/* keep last! */
diff --git a/net/mac80211/tx.c b/net/mac80211/tx.c
index efc38e7..f8eec60 100644
--- a/net/mac80211/tx.c
+++ b/net/mac80211/tx.c
@@ -38,6 +38,12 @@ 
 #include "wme.h"
 #include "rate.h"
 
+static int invoke_tx_handlers_late(struct ieee80211_tx_data *tx);
+static bool ieee80211_xmit_fast_finish(struct ieee80211_sub_if_data *sdata,
+				       struct sta_info *sta, u8 pn_offs,
+				       struct ieee80211_key_conf *key_conf,
+				       struct sk_buff *skb);
+
 /* misc utils */
 
 static inline void ieee80211_tx_stats(struct net_device *dev, u32 len)
@@ -853,8 +859,7 @@  ieee80211_tx_h_sequence(struct ieee80211_tx_data *tx)
 	tid = *qc & IEEE80211_QOS_CTL_TID_MASK;
 	tx->sta->tx_stats.msdu[tid]++;
 
-	if (!tx->sta->sta.txq[0])
-		hdr->seq_ctrl = ieee80211_tx_next_seq(tx->sta, tid);
+	hdr->seq_ctrl = ieee80211_tx_next_seq(tx->sta, tid);
 
 	return TX_CONTINUE;
 }
@@ -1403,6 +1408,7 @@  void ieee80211_txq_init(struct ieee80211_sub_if_data *sdata,
 	fq_tin_init(&txqi->tin);
 	fq_flow_init(&txqi->def_flow);
 	codel_vars_init(&txqi->def_cvars);
+	__skb_queue_head_init(&txqi->frags);
 
 	txqi->txq.vif = &sdata->vif;
 
@@ -1425,6 +1431,7 @@  void ieee80211_txq_purge(struct ieee80211_local *local,
 	struct fq_tin *tin = &txqi->tin;
 
 	fq_tin_reset(fq, tin, fq_skb_free_func);
+	ieee80211_purge_tx_queue(&local->hw, &txqi->frags);
 }
 
 int ieee80211_txq_setup_flows(struct ieee80211_local *local)
@@ -1485,12 +1492,19 @@  struct sk_buff *ieee80211_tx_dequeue(struct ieee80211_hw *hw,
 	struct sk_buff *skb = NULL;
 	struct fq *fq = &local->fq;
 	struct fq_tin *tin = &txqi->tin;
+	struct ieee80211_tx_info *info;
 
 	spin_lock_bh(&fq->lock);
 
 	if (test_bit(IEEE80211_TXQ_STOP, &txqi->flags))
 		goto out;
 
+	/* Make sure fragments stay together. */
+	skb = __skb_dequeue(&txqi->frags);
+	if (skb)
+		goto out;
+
+begin:
 	skb = fq_tin_dequeue(fq, tin, fq_tin_dequeue_func);
 	if (!skb)
 		goto out;
@@ -1498,16 +1512,37 @@  struct sk_buff *ieee80211_tx_dequeue(struct ieee80211_hw *hw,
 	ieee80211_set_skb_vif(skb, txqi);
 
 	hdr = (struct ieee80211_hdr *)skb->data;
-	if (txq->sta && ieee80211_is_data_qos(hdr->frame_control)) {
+	info = IEEE80211_SKB_CB(skb);
+	if (txq->sta && info->control.flags & IEEE80211_TX_CTRL_FAST_XMIT) {
 		struct sta_info *sta = container_of(txq->sta, struct sta_info,
 						    sta);
-		struct ieee80211_tx_info *info = IEEE80211_SKB_CB(skb);
+		u8 pn_offs = 0;
 
-		hdr->seq_ctrl = ieee80211_tx_next_seq(sta, txq->tid);
-		if (test_bit(IEEE80211_TXQ_AMPDU, &txqi->flags))
-			info->flags |= IEEE80211_TX_CTL_AMPDU;
-		else
-			info->flags &= ~IEEE80211_TX_CTL_AMPDU;
+		if (info->control.hw_key)
+			pn_offs = ieee80211_hdrlen(hdr->frame_control);
+
+		ieee80211_xmit_fast_finish(sta->sdata, sta, pn_offs,
+					   info->control.hw_key, skb);
+	} else {
+		struct ieee80211_tx_data tx = { };
+
+		__skb_queue_head_init(&tx.skbs);
+		tx.local = local;
+		tx.skb = skb;
+		if (txq->sta) {
+			tx.sta = container_of(txq->sta, struct sta_info, sta);
+			tx.sdata = tx.sta->sdata;
+		} else {
+			tx.sdata = vif_to_sdata(info->control.vif);
+		}
+
+		if (invoke_tx_handlers_late(&tx))
+			goto begin;
+
+		skb = __skb_dequeue(&tx.skbs);
+
+		if (!skb_queue_empty(&tx.skbs))
+			skb_queue_splice_tail(&tx.skbs, &txqi->frags);
 	}
 
 out:
@@ -1521,6 +1556,47 @@  out:
 }
 EXPORT_SYMBOL(ieee80211_tx_dequeue);
 
+static bool ieee80211_queue_skb(struct ieee80211_local *local,
+				struct ieee80211_sub_if_data *sdata,
+				struct sta_info *sta,
+				struct sk_buff *skb)
+{
+	struct ieee80211_tx_info *info = IEEE80211_SKB_CB(skb);
+	struct fq *fq = &local->fq;
+	struct ieee80211_vif *vif;
+	struct txq_info *txqi;
+	struct ieee80211_sta *pubsta;
+
+	if (!local->ops->wake_tx_queue ||
+	    sdata->vif.type == NL80211_IFTYPE_MONITOR)
+		return false;
+
+	if (sta && sta->uploaded)
+		pubsta = &sta->sta;
+	else
+		pubsta = NULL;
+
+	if (sdata->vif.type == NL80211_IFTYPE_AP_VLAN)
+		sdata = container_of(sdata->bss,
+				     struct ieee80211_sub_if_data, u.ap);
+
+	vif = &sdata->vif;
+	txqi = ieee80211_get_txq(local, vif, pubsta, skb);
+
+	if (!txqi)
+		return false;
+
+	info->control.vif = vif;
+
+	spin_lock_bh(&fq->lock);
+	ieee80211_txq_enqueue(local, txqi, skb);
+	spin_unlock_bh(&fq->lock);
+
+	drv_wake_tx_queue(local, txqi);
+
+	return true;
+}
+
 static bool ieee80211_tx_frags(struct ieee80211_local *local,
 			       struct ieee80211_vif *vif,
 			       struct ieee80211_sta *sta,
@@ -1528,9 +1604,7 @@  static bool ieee80211_tx_frags(struct ieee80211_local *local,
 			       bool txpending)
 {
 	struct ieee80211_tx_control control = {};
-	struct fq *fq = &local->fq;
 	struct sk_buff *skb, *tmp;
-	struct txq_info *txqi;
 	unsigned long flags;
 
 	skb_queue_walk_safe(skbs, skb, tmp) {
@@ -1545,21 +1619,6 @@  static bool ieee80211_tx_frags(struct ieee80211_local *local,
 		}
 #endif
 
-		txqi = ieee80211_get_txq(local, vif, sta, skb);
-		if (txqi) {
-			info->control.vif = vif;
-
-			__skb_unlink(skb, skbs);
-
-			spin_lock_bh(&fq->lock);
-			ieee80211_txq_enqueue(local, txqi, skb);
-			spin_unlock_bh(&fq->lock);
-
-			drv_wake_tx_queue(local, txqi);
-
-			continue;
-		}
-
 		spin_lock_irqsave(&local->queue_stop_reason_lock, flags);
 		if (local->queue_stop_reasons[q] ||
 		    (!txpending && !skb_queue_empty(&local->pending[q]))) {
@@ -1680,10 +1739,13 @@  static bool __ieee80211_tx(struct ieee80211_local *local,
 /*
  * Invoke TX handlers, return 0 on success and non-zero if the
  * frame was dropped or queued.
+ *
+ * The handlers are split into an early and late part. The latter is everything
+ * that can be sensitive to reordering, and will be deferred to after packets
+ * are dequeued from the intermediate queues (when they are enabled).
  */
-static int invoke_tx_handlers(struct ieee80211_tx_data *tx)
+static int invoke_tx_handlers_early(struct ieee80211_tx_data *tx)
 {
-	struct ieee80211_tx_info *info = IEEE80211_SKB_CB(tx->skb);
 	ieee80211_tx_result res = TX_DROP;
 
 #define CALL_TXH(txh) \
@@ -1697,16 +1759,42 @@  static int invoke_tx_handlers(struct ieee80211_tx_data *tx)
 	CALL_TXH(ieee80211_tx_h_check_assoc);
 	CALL_TXH(ieee80211_tx_h_ps_buf);
 	CALL_TXH(ieee80211_tx_h_check_control_port_protocol);
-	CALL_TXH(ieee80211_tx_h_select_key);
+
 	if (!ieee80211_hw_check(&tx->local->hw, HAS_RATE_CONTROL))
 		CALL_TXH(ieee80211_tx_h_rate_ctrl);
 
+ txh_done:
+	if (unlikely(res == TX_DROP)) {
+		I802_DEBUG_INC(tx->local->tx_handlers_drop);
+		if (tx->skb)
+			ieee80211_free_txskb(&tx->local->hw, tx->skb);
+		else
+			ieee80211_purge_tx_queue(&tx->local->hw, &tx->skbs);
+		return -1;
+	} else if (unlikely(res == TX_QUEUED)) {
+		I802_DEBUG_INC(tx->local->tx_handlers_queued);
+		return -1;
+	}
+
+	return 0;
+}
+
+/*
+ * Late handlers can be called while the sta lock is held. Handlers that can
+ * cause packets to be generated will cause deadlock!
+ */
+static int invoke_tx_handlers_late(struct ieee80211_tx_data *tx)
+{
+	struct ieee80211_tx_info *info = IEEE80211_SKB_CB(tx->skb);
+	ieee80211_tx_result res = TX_CONTINUE;
+
 	if (unlikely(info->flags & IEEE80211_TX_INTFL_RETRANSMISSION)) {
 		__skb_queue_tail(&tx->skbs, tx->skb);
 		tx->skb = NULL;
 		goto txh_done;
 	}
 
+	CALL_TXH(ieee80211_tx_h_select_key);
 	CALL_TXH(ieee80211_tx_h_michael_mic_add);
 	CALL_TXH(ieee80211_tx_h_sequence);
 	CALL_TXH(ieee80211_tx_h_fragment);
@@ -1733,6 +1821,15 @@  static int invoke_tx_handlers(struct ieee80211_tx_data *tx)
 	return 0;
 }
 
+static int invoke_tx_handlers(struct ieee80211_tx_data *tx)
+{
+	int r = invoke_tx_handlers_early(tx);
+
+	if (r)
+		return r;
+	return invoke_tx_handlers_late(tx);
+}
+
 bool ieee80211_tx_prepare_skb(struct ieee80211_hw *hw,
 			      struct ieee80211_vif *vif, struct sk_buff *skb,
 			      int band, struct ieee80211_sta **sta)
@@ -1807,7 +1904,13 @@  static bool ieee80211_tx(struct ieee80211_sub_if_data *sdata,
 		info->hw_queue =
 			sdata->vif.hw_queue[skb_get_queue_mapping(skb)];
 
-	if (!invoke_tx_handlers(&tx))
+	if (invoke_tx_handlers_early(&tx))
+		return false;
+
+	if (ieee80211_queue_skb(local, sdata, tx.sta, tx.skb))
+		return true;
+
+	if (!invoke_tx_handlers_late(&tx))
 		result = __ieee80211_tx(local, &tx.skbs, led_len,
 					tx.sta, txpending);
 
@@ -3159,7 +3262,7 @@  out:
 }
 
 static bool ieee80211_xmit_fast(struct ieee80211_sub_if_data *sdata,
-				struct net_device *dev, struct sta_info *sta,
+				struct sta_info *sta,
 				struct ieee80211_fast_tx *fast_tx,
 				struct sk_buff *skb)
 {
@@ -3170,9 +3273,9 @@  static bool ieee80211_xmit_fast(struct ieee80211_sub_if_data *sdata,
 	struct ethhdr eth;
 	struct ieee80211_tx_info *info = IEEE80211_SKB_CB(skb);
 	struct ieee80211_hdr *hdr = (void *)fast_tx->hdr;
-	struct ieee80211_tx_data tx;
-	ieee80211_tx_result r;
 	struct tid_ampdu_tx *tid_tx = NULL;
+	ieee80211_tx_result r;
+	struct ieee80211_tx_data tx;
 	u8 tid = IEEE80211_NUM_TIDS;
 
 	/* control port protocol needs a lot of special handling */
@@ -3210,8 +3313,6 @@  static bool ieee80211_xmit_fast(struct ieee80211_sub_if_data *sdata,
 			return true;
 	}
 
-	ieee80211_tx_stats(dev, skb->len + extra_head);
-
 	if ((hdr->frame_control & cpu_to_le16(IEEE80211_STYPE_QOS_DATA)) &&
 	    ieee80211_amsdu_aggregate(sdata, sta, fast_tx, skb))
 		return true;
@@ -3240,24 +3341,7 @@  static bool ieee80211_xmit_fast(struct ieee80211_sub_if_data *sdata,
 	info->flags = IEEE80211_TX_CTL_FIRST_FRAGMENT |
 		      IEEE80211_TX_CTL_DONTFRAG |
 		      (tid_tx ? IEEE80211_TX_CTL_AMPDU : 0);
-
-	if (hdr->frame_control & cpu_to_le16(IEEE80211_STYPE_QOS_DATA)) {
-		*ieee80211_get_qos_ctl(hdr) = tid;
-		if (!sta->sta.txq[0])
-			hdr->seq_ctrl = ieee80211_tx_next_seq(sta, tid);
-	} else {
-		info->flags |= IEEE80211_TX_CTL_ASSIGN_SEQ;
-		hdr->seq_ctrl = cpu_to_le16(sdata->sequence_number);
-		sdata->sequence_number += 0x10;
-	}
-
-	if (skb_shinfo(skb)->gso_size)
-		sta->tx_stats.msdu[tid] +=
-			DIV_ROUND_UP(skb->len, skb_shinfo(skb)->gso_size);
-	else
-		sta->tx_stats.msdu[tid]++;
-
-	info->hw_queue = sdata->vif.hw_queue[skb_get_queue_mapping(skb)];
+	info->control.flags = IEEE80211_TX_CTRL_FAST_XMIT;
 
 	__skb_queue_head_init(&tx.skbs);
 
@@ -3283,22 +3367,71 @@  static bool ieee80211_xmit_fast(struct ieee80211_sub_if_data *sdata,
 		}
 	}
 
+	if (ieee80211_queue_skb(local, sdata, sta, skb))
+		return true;
+
+	ieee80211_xmit_fast_finish(sdata, sta, fast_tx->pn_offs,
+				   &fast_tx->key->conf, skb);
+
+	if (sdata->vif.type == NL80211_IFTYPE_AP_VLAN)
+		sdata = container_of(sdata->bss,
+				     struct ieee80211_sub_if_data, u.ap);
+
+	__skb_queue_tail(&tx.skbs, skb);
+	ieee80211_tx_frags(local, &sdata->vif, &sta->sta, &tx.skbs, false);
+
+	return true;
+}
+
+/*
+ * Can be called while the sta lock is held. Anything that can cause packets to
+ * be generated will cause deadlock!
+ */
+static bool ieee80211_xmit_fast_finish(struct ieee80211_sub_if_data *sdata,
+				       struct sta_info *sta, u8 pn_offs,
+				       struct ieee80211_key_conf *key_conf,
+				       struct sk_buff *skb)
+{
+	struct ieee80211_tx_info *info = IEEE80211_SKB_CB(skb);
+	struct ieee80211_hdr *hdr = (void *)skb->data;
+	u8 tid = IEEE80211_NUM_TIDS;
+
+	ieee80211_tx_stats(skb->dev, skb->len);
+
+	if (hdr->frame_control & cpu_to_le16(IEEE80211_STYPE_QOS_DATA)) {
+		tid = skb->priority & IEEE80211_QOS_CTL_TAG1D_MASK;
+		*ieee80211_get_qos_ctl(hdr) = tid;
+		hdr->seq_ctrl = ieee80211_tx_next_seq(sta, tid);
+	} else {
+		info->flags |= IEEE80211_TX_CTL_ASSIGN_SEQ;
+		hdr->seq_ctrl = cpu_to_le16(sdata->sequence_number);
+		sdata->sequence_number += 0x10;
+	}
+
+	if (skb_shinfo(skb)->gso_size)
+		sta->tx_stats.msdu[tid] +=
+			DIV_ROUND_UP(skb->len, skb_shinfo(skb)->gso_size);
+	else
+		sta->tx_stats.msdu[tid]++;
+
+	info->hw_queue = sdata->vif.hw_queue[skb_get_queue_mapping(skb)];
+
 	/* statistics normally done by ieee80211_tx_h_stats (but that
 	 * has to consider fragmentation, so is more complex)
 	 */
 	sta->tx_stats.bytes[skb_get_queue_mapping(skb)] += skb->len;
 	sta->tx_stats.packets[skb_get_queue_mapping(skb)]++;
 
-	if (fast_tx->pn_offs) {
+	if (pn_offs) {
 		u64 pn;
-		u8 *crypto_hdr = skb->data + fast_tx->pn_offs;
+		u8 *crypto_hdr = skb->data + pn_offs;
 
-		switch (fast_tx->key->conf.cipher) {
+		switch (key_conf->cipher) {
 		case WLAN_CIPHER_SUITE_CCMP:
 		case WLAN_CIPHER_SUITE_CCMP_256:
 		case WLAN_CIPHER_SUITE_GCMP:
 		case WLAN_CIPHER_SUITE_GCMP_256:
-			pn = atomic64_inc_return(&fast_tx->key->conf.tx_pn);
+			pn = atomic64_inc_return(&key_conf->tx_pn);
 			crypto_hdr[0] = pn;
 			crypto_hdr[1] = pn >> 8;
 			crypto_hdr[4] = pn >> 16;
@@ -3309,12 +3442,6 @@  static bool ieee80211_xmit_fast(struct ieee80211_sub_if_data *sdata,
 		}
 	}
 
-	if (sdata->vif.type == NL80211_IFTYPE_AP_VLAN)
-		sdata = container_of(sdata->bss,
-				     struct ieee80211_sub_if_data, u.ap);
-
-	__skb_queue_tail(&tx.skbs, skb);
-	ieee80211_tx_frags(local, &sdata->vif, &sta->sta, &tx.skbs, false);
 	return true;
 }
 
@@ -3342,7 +3469,7 @@  void __ieee80211_subif_start_xmit(struct sk_buff *skb,
 		fast_tx = rcu_dereference(sta->fast_tx);
 
 		if (fast_tx &&
-		    ieee80211_xmit_fast(sdata, dev, sta, fast_tx, skb))
+		    ieee80211_xmit_fast(sdata, sta, fast_tx, skb))
 			goto out;
 	}