diff mbox series

wifi: mac80211: Serialize calls to drv_wake_tx_queue()

Message ID 20230313201542.72325-1-alexander@wetzel-home.de (mailing list archive)
State Rejected
Delegated to: Johannes Berg
Headers show
Series wifi: mac80211: Serialize calls to drv_wake_tx_queue() | expand

Commit Message

Alexander Wetzel March 13, 2023, 8:15 p.m. UTC
drv_wake_tx_queue() has no protection against running concurrent
multiple times. It's normally executed within the task calling
ndo_start_xmit(). But wake_txqs_tasklet is also calling into it,
regardless if the function is already running on another CPU or not.

While drivers with native iTXQ support are able to handle that, calls to
ieee80211_handle_wake_tx_queue() - used by the drivers without
native iTXQ support - must be serialized. Otherwise drivers can get
unexpected overlapping drv_tx() calls from mac80211. Which causes issues
for at least rt2800usb.

To avoid what seems to be a not needed distinction between native and
drivers using ieee80211_handle_wake_tx_queue(), the serialization is
done for drv_wake_tx_queue() here.

The serialization works by detecting and blocking concurrent calls into
drv_wake_tx_queue() and - when needed - restarting all queues after the
wake_tx_queue ops returned from the driver.

This fix here is only required when a tree has 'c850e31f79f0 ("wifi:
mac80211: add internal handler for wake_tx_queue")', which introduced
the buggy code path in mac80211. Drivers were switched to it with
'a790cc3a4fad ("wifi: mac80211: add wake_tx_queue callback to
drivers")'. But only after fixing an independent bug with commit
'4444bc2116ae ("wifi: mac80211: Proper mark iTXQs for resumption")'
problematic concurrent calls to drv_wake_tx_queue() really happened and
exposed the initial issue.

Fixes: c850e31f79f0 ("wifi: mac80211: add internal handler for wake_tx_queue")
Reported-by: Thomas Mann <rauchwolke@gmx.net>
Tested-by: Thomas Mann <rauchwolke@gmx.net>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217119
Link: https://lore.kernel.org/r/b8efebc6-4399-d0b8-b2a0-66843314616b@leemhuis.info/
CC: <stable@vger.kernel.org>
Signed-off-by: Alexander Wetzel <alexander@wetzel-home.de>
---

There are multiple variations how we can fix the issue.

But this fix has to go directly into 6.2 to solve an ongoing
regression.
I tried to prevent an outright redesign while still having a
full fix:

Most questionable decision here probably is, if we should fix it in
drv_wake_tx_queue() or only in ieee80211_handle_wake_tx_queue().
But the decision how to serialize - tasklet vs some kind of locking
- may also be an issue.

Personally, I did not like the overhead of checking all iTXQ for every
drv_wake_tx_queue(). These are happening per-packet AND can be easily
avoided when we are not using a tasklet. Most of the time.

I also assume that drivers don't *want* concurrent drv_wake_tx_queue()
calls and it's a benefit to all drivers to serialize it.

If we consider concurrent calls for drv_wake_tx_queue() to be a feature,
I would just move the code into ieee80211_handle_wake_tx_queue().

Which may also be the solution for the regression in 6.2:
Do it now for ieee80211_handle_wake_tx_queue() and apply this patch
to the development tree only.

We could probably also do it without the queue stops. But then
__ieee80211_wake_queue() needs more modifications.

Thus I ended up with the approach in this patch...

Alexander
---
 net/mac80211/driver-ops.h  | 21 +++++++++++++++++++++
 net/mac80211/ieee80211_i.h |  9 +++++++++
 net/mac80211/util.c        |  3 ++-
 3 files changed, 32 insertions(+), 1 deletion(-)

Comments

Johannes Berg March 13, 2023, 8:33 p.m. UTC | #1
On Mon, 2023-03-13 at 21:15 +0100, Alexander Wetzel wrote:
> 
> While drivers with native iTXQ support are able to handle that, 
> 

questionable. Looking at iwlwifi:

void iwl_mvm_mac_wake_tx_queue(struct ieee80211_hw *hw,
                               struct ieee80211_txq *txq)
{
        struct iwl_mvm *mvm = IWL_MAC80211_GET_MVM(hw);
...
        list_add_tail(&mvmtxq->list, &mvm->add_stream_txqs);
...
}

which might explain some rare and hard to debug list corruptions in the
driver that we've seen reports of ...

> To avoid what seems to be a not needed distinction between native and
> drivers using ieee80211_handle_wake_tx_queue(), the serialization is
> done for drv_wake_tx_queue() here.

So probably no objection to that, at least for now, though in the common
path (what I showed above was the 'first use' path), iwlwifi actually
hits different HW resources, so it _could_ benefit from concurrent TX
after fixing that bug.

> The serialization works by detecting and blocking concurrent calls into
> drv_wake_tx_queue() and - when needed - restarting all queues after the
> wake_tx_queue ops returned from the driver.

This seems ... overly complex? It feels like you're implementing a kind
of spinlock (using atomic bit ops) with very expensive handling of
contention?

Since drivers are supposed to handle concurrent TX per AC, you could
almost just do something like this:

diff --git a/include/net/mac80211.h b/include/net/mac80211.h
index f07a3c1b4d9a..1946e28868ea 100644
--- a/include/net/mac80211.h
+++ b/include/net/mac80211.h
@@ -7108,10 +7108,8 @@ struct ieee80211_txq *ieee80211_next_txq(struct ieee80211_hw *hw, u8 ac);
  */
 void ieee80211_txq_schedule_start(struct ieee80211_hw *hw, u8 ac);
 
-/* (deprecated) */
-static inline void ieee80211_txq_schedule_end(struct ieee80211_hw *hw, u8 ac)
-{
-}
+/** ... */
+void ieee80211_txq_schedule_end(struct ieee80211_hw *hw, u8 ac);
 
 void __ieee80211_schedule_txq(struct ieee80211_hw *hw,
 			      struct ieee80211_txq *txq, bool force);
diff --git a/net/mac80211/tx.c b/net/mac80211/tx.c
index 1fae44fb1be6..606ca8620d20 100644
--- a/net/mac80211/tx.c
+++ b/net/mac80211/tx.c
@@ -4250,11 +4250,17 @@ void ieee80211_txq_schedule_start(struct ieee80211_hw *hw, u8 ac)
 	} else {
 		local->schedule_round[ac] = 0;
 	}
-
-	spin_unlock_bh(&local->active_txq_lock[ac]);
 }
 EXPORT_SYMBOL(ieee80211_txq_schedule_start);
 
+void ieee80211_txq_schedule_end(struct ieee80211_hw *hw, u8 ac)
+{
+	struct ieee80211_local *local = hw_to_local(hw);
+
+	spin_unlock_bh(&local->active_txq_lock[ac]);
+}
+EXPORT_SYMBOL(ieee80211_txq_schedule_end);
+
 void __ieee80211_subif_start_xmit(struct sk_buff *skb,
 				  struct net_device *dev,
 				  u32 info_flags,



assuming that TXQ drivers actually still call
ieee80211_txq_schedule_end() which says it's deprecated.

That even has _bh() so the tasklet can't be running anyway ...

So if the concurrency really is only TX vs. tasklet, then you could even
just keep the BHs disabled (in _start spin_unlock only and then in _end
local_bh_disable)?

> Which may also be the solution for the regression in 6.2:
> Do it now for ieee80211_handle_wake_tx_queue() and apply this patch
> to the development tree only.

I'd argue the other way around - do it for all to fix these issues, and
then audit drivers such as iwlwifi or even make concurrency here opt-in.

Felix did see some benefits of the concurrency I think though, so he
might have a different opinion.

johannes
Felix Fietkau March 13, 2023, 9:04 p.m. UTC | #2
On 13.03.23 21:15, Alexander Wetzel wrote:
> drv_wake_tx_queue() has no protection against running concurrent
> multiple times. It's normally executed within the task calling
> ndo_start_xmit(). But wake_txqs_tasklet is also calling into it,
> regardless if the function is already running on another CPU or not.
> 
> While drivers with native iTXQ support are able to handle that, calls to
> ieee80211_handle_wake_tx_queue() - used by the drivers without
> native iTXQ support - must be serialized. Otherwise drivers can get
> unexpected overlapping drv_tx() calls from mac80211. Which causes issues
> for at least rt2800usb.
> 
> To avoid what seems to be a not needed distinction between native and
> drivers using ieee80211_handle_wake_tx_queue(), the serialization is
> done for drv_wake_tx_queue() here.
> 
> The serialization works by detecting and blocking concurrent calls into
> drv_wake_tx_queue() and - when needed - restarting all queues after the
> wake_tx_queue ops returned from the driver.
> 
> This fix here is only required when a tree has 'c850e31f79f0 ("wifi:
> mac80211: add internal handler for wake_tx_queue")', which introduced
> the buggy code path in mac80211. Drivers were switched to it with
> 'a790cc3a4fad ("wifi: mac80211: add wake_tx_queue callback to
> drivers")'. But only after fixing an independent bug with commit
> '4444bc2116ae ("wifi: mac80211: Proper mark iTXQs for resumption")'
> problematic concurrent calls to drv_wake_tx_queue() really happened and
> exposed the initial issue.
> 
> Fixes: c850e31f79f0 ("wifi: mac80211: add internal handler for wake_tx_queue")
> Reported-by: Thomas Mann <rauchwolke@gmx.net>
> Tested-by: Thomas Mann <rauchwolke@gmx.net>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=217119
> Link: https://lore.kernel.org/r/b8efebc6-4399-d0b8-b2a0-66843314616b@leemhuis.info/
> CC: <stable@vger.kernel.org>
> Signed-off-by: Alexander Wetzel <alexander@wetzel-home.de>
> ---
> 
> There are multiple variations how we can fix the issue.
> 
> But this fix has to go directly into 6.2 to solve an ongoing
> regression.
> I tried to prevent an outright redesign while still having a
> full fix:

This serialization approach confuses me a bit, and I worry that it might 
cause performance regressions on mt76 somehow, though I can't point to 
any specific scenario in which this would happen.

> Most questionable decision here probably is, if we should fix it in
> drv_wake_tx_queue() or only in ieee80211_handle_wake_tx_queue().
> But the decision how to serialize - tasklet vs some kind of locking
> - may also be an issue.
> 
> Personally, I did not like the overhead of checking all iTXQ for every
> drv_wake_tx_queue(). These are happening per-packet AND can be easily
> avoided when we are not using a tasklet. Most of the time.

Moving the scheduling to a tasklet wouldn't involve checking all iTXQ 
for every wake call, only the active ones that have packets queued.
If you use a kthread instead, multiple concurrent events (and in some 
cases even consecutive ones when busy enough) can be handled in a single 
scheduling run, which can be flexible about which CPU to run on as well.

> I also assume that drivers don't *want* concurrent drv_wake_tx_queue()
> calls and it's a benefit to all drivers to serialize it.

I think the main problem with your patch is that it does not solve the 
issue completely, just one instance of it. Concurrent calls to 
ieee80211_handle_wake_tx_queue for multiple iTXQs belonging to the same 
WMM AC are just as problematic as concurrent calls for the same iTXQ. 
Your patch does not seem to handle that.

Also, mt76 is completely fine with concurrent drv_wake_tx_queue calls, 
as it only uses them to schedule the main tx kthread.

In my opinion, the best approach is still to use a single kthread for 
the ieee80211_handle_wake_tx_queue case and leave drivers alone that 
take care of scheduling on their own.

- Felix
Alexander Wetzel March 14, 2023, 11:20 a.m. UTC | #3
On 13.03.23 21:33, Johannes Berg wrote:
> On Mon, 2023-03-13 at 21:15 +0100, Alexander Wetzel wrote:
>>
>> While drivers with native iTXQ support are able to handle that,
>>
> 
> questionable. Looking at iwlwifi:
> 
> void iwl_mvm_mac_wake_tx_queue(struct ieee80211_hw *hw,
>                                 struct ieee80211_txq *txq)
> {
>          struct iwl_mvm *mvm = IWL_MAC80211_GET_MVM(hw);
> ...
>          list_add_tail(&mvmtxq->list, &mvm->add_stream_txqs);
> ...
> }
> 
> which might explain some rare and hard to debug list corruptions in the
> driver that we've seen reports of ...
> 

Shall I change the scope of the fix from "only 6.2" to any stable kernel?
'Fixes: ba8c3d6f16a1 ("mac80211: add an intermediate software queue 
implementation")'

Or is that a overreaction and we better stick to what we know for sure 
and keep the 'Fixes: c850e31f79f0 ("wifi: mac80211: add internal handler 
for wake_tx_queue")'?

>> To avoid what seems to be a not needed distinction between native and
>> drivers using ieee80211_handle_wake_tx_queue(), the serialization is
>> done for drv_wake_tx_queue() here.
> 
> So probably no objection to that, at least for now, though in the common
> path (what I showed above was the 'first use' path), iwlwifi actually
> hits different HW resources, so it _could_ benefit from concurrent TX
> after fixing that bug.
> 

I could also directly add a driver a driver flag, allowing drivers to 
opt in to overlapping calls. And then set the flag for mt76, since Felix 
prefers to not change the behavior and knows this driver works with it.

>> The serialization works by detecting and blocking concurrent calls into
>> drv_wake_tx_queue() and - when needed - restarting all queues after the
>> wake_tx_queue ops returned from the driver.
> 
> This seems ... overly complex? It feels like you're implementing a kind
> of spinlock (using atomic bit ops) with very expensive handling of
> contention?
> 

Exactly. With the benefit that when we run uncontested the overhead is 
close to zero.
But this should also be true for spinlocks. And when we spin on 
contention it even better... So I'll change it to use a spinlock instead.

> Since drivers are supposed to handle concurrent TX per AC, you could
I assume you mean multiple drv_wake_tx_queue() can run in parallel, as 
long as they are serving different ACs?
In a prior test patch I just blocked concurrent calls for the same ac.

While that may be enough for full iTXQ drivers I have doubts that this 
is sufficient for the ones using ieee80211_handle_wake_tx_queue.

I at least was unable to identify any code in the rt2800usb driver which 
looked dangerous for concurrent execution. (Large parts are protected by 
a driver spinlock)

I ended up with the assumption, that the DMA handover - or something 
related to that - must cause the queue freezes. (Or my ability to detect 
critical calls is not good enough, yet.)

The patch still fixed the issue, of course. All races in the examined 
regression are for IEEE80211_AC_BE, so it's sufficient fix when we 
decide that's save.

Nevertheless I decided to better serialize any calls to 
drv_wake_tx_queue(). When we don't do that we may get a much harder to 
detect/trigger regression. Even more rarely hit and due to that probably 
never reported and fixed.

> almost just do something like this:
> 
> diff --git a/include/net/mac80211.h b/include/net/mac80211.h
> index f07a3c1b4d9a..1946e28868ea 100644
> --- a/include/net/mac80211.h
> +++ b/include/net/mac80211.h
> @@ -7108,10 +7108,8 @@ struct ieee80211_txq *ieee80211_next_txq(struct ieee80211_hw *hw, u8 ac);
>    */
>   void ieee80211_txq_schedule_start(struct ieee80211_hw *hw, u8 ac);
>   
> -/* (deprecated) */
> -static inline void ieee80211_txq_schedule_end(struct ieee80211_hw *hw, u8 ac)
> -{
> -}
> +/** ... */
> +void ieee80211_txq_schedule_end(struct ieee80211_hw *hw, u8 ac);
>   
>   void __ieee80211_schedule_txq(struct ieee80211_hw *hw,
>   			      struct ieee80211_txq *txq, bool force);
> diff --git a/net/mac80211/tx.c b/net/mac80211/tx.c
> index 1fae44fb1be6..606ca8620d20 100644
> --- a/net/mac80211/tx.c
> +++ b/net/mac80211/tx.c
> @@ -4250,11 +4250,17 @@ void ieee80211_txq_schedule_start(struct ieee80211_hw *hw, u8 ac)
>   	} else {
>   		local->schedule_round[ac] = 0;
>   	}
> -
> -	spin_unlock_bh(&local->active_txq_lock[ac]);
>   }
>   EXPORT_SYMBOL(ieee80211_txq_schedule_start);
>   
> +void ieee80211_txq_schedule_end(struct ieee80211_hw *hw, u8 ac)
> +{
> +	struct ieee80211_local *local = hw_to_local(hw);
> +
> +	spin_unlock_bh(&local->active_txq_lock[ac]);
> +}
> +EXPORT_SYMBOL(ieee80211_txq_schedule_end);
> +
>   void __ieee80211_subif_start_xmit(struct sk_buff *skb,
>   				  struct net_device *dev,
>   				  u32 info_flags,
>

Holding active_txq_lock[ac] all that time - it's normally acquired and 
released multiple times in between - seems to be a bit too daring for a 
"stable" patch. I would feel better using a new spinlock - one drivers 
can opt out of. Would that be acceptable, too?

Scheduling it in ieee80211_schedule_txq() also requires that lock.
Would your idea not often force userspace TX to spin for 
active_txq_lock[ac] after the skb has been added to the iTXQ, just to 
then find a empty queue?

I also still would place the spinlock in drv_wake_tx_queue(). After all 
ieee80211_txq_schedule_start/end is kind of optional and e.g. iwlwifi is 
not using it.

> 
> 
> assuming that TXQ drivers actually still call
> ieee80211_txq_schedule_end() which says it's deprecated.
> 
> That even has _bh() so the tasklet can't be running anyway ...
> 
> So if the concurrency really is only TX vs. tasklet, then you could even
> just keep the BHs disabled (in _start spin_unlock only and then in _end
> local_bh_disable)?
> 
>> Which may also be the solution for the regression in 6.2:
>> Do it now for ieee80211_handle_wake_tx_queue() and apply this patch
>> to the development tree only.
> 
> I'd argue the other way around - do it for all to fix these issues, and
> then audit drivers such as iwlwifi or even make concurrency here opt-in.
> 
> Felix did see some benefits of the concurrency I think though, so he
> might have a different opinion.

What about handling it that way:

Keep serializing drv_wake_tx_queue(). But use a new spinlock the drivers 
can opt out from.
  1) No flag set:
     drv_wake_tx_queue() can be running only once at any time

  2) IEEE80211_HW_CONCURRENT_AC_TX
     drv_wake_tx_queue() can be running once per AC at any time

  3) IEEE80211_HW_CONCURRENT
     current behavior.

Alexander
Johannes Berg March 14, 2023, 12:22 p.m. UTC | #4
Hi,

On Tue, 2023-03-14 at 12:20 +0100, Alexander Wetzel wrote:
> On 13.03.23 21:33, Johannes Berg wrote:
> > On Mon, 2023-03-13 at 21:15 +0100, Alexander Wetzel wrote:
> > > 
> > > While drivers with native iTXQ support are able to handle that,
> > > 
> > 
> > questionable. Looking at iwlwifi:
> > 
> > void iwl_mvm_mac_wake_tx_queue(struct ieee80211_hw *hw,
> >                                 struct ieee80211_txq *txq)
> > {
> >          struct iwl_mvm *mvm = IWL_MAC80211_GET_MVM(hw);
> > ...
> >          list_add_tail(&mvmtxq->list, &mvm->add_stream_txqs);
> > ...
> > }
> > 
> > which might explain some rare and hard to debug list corruptions in the
> > driver that we've seen reports of ...
> > 
> 
> Shall I change the scope of the fix from "only 6.2" to any stable kernel?
> 'Fixes: ba8c3d6f16a1 ("mac80211: add an intermediate software queue 
> implementation")'
> 
> Or is that a overreaction and we better stick to what we know for sure 
> and keep the 'Fixes: c850e31f79f0 ("wifi: mac80211: add internal handler 
> for wake_tx_queue")'?

I think we stick with the latter. I already have a fix for iwlwifi (see
https://lore.kernel.org/r/20230314103840.30771-1-jtornosm@redhat.com)
and other drivers should be fine.

Also that means we should probably restrict the fix to actually be for
mac80211 only.

> > > To avoid what seems to be a not needed distinction between native and
> > > drivers using ieee80211_handle_wake_tx_queue(), the serialization is
> > > done for drv_wake_tx_queue() here.
> > 
> > So probably no objection to that, at least for now, though in the common
> > path (what I showed above was the 'first use' path), iwlwifi actually
> > hits different HW resources, so it _could_ benefit from concurrent TX
> > after fixing that bug.
> > 
> 
> I could also directly add a driver a driver flag, allowing drivers to 
> opt in to overlapping calls. And then set the flag for mt76, since Felix 
> prefers to not change the behavior and knows this driver works with it.

Too much complexity. I change my position - let's fix mac80211 and
iwlwifi separately.

> Exactly. With the benefit that when we run uncontested the overhead is 
> close to zero.
> But this should also be true for spinlocks. And when we spin on 
> contention it even better... So I'll change it to use a spinlock instead.

Yeah that was kind of my thought process here too.


> > Since drivers are supposed to handle concurrent TX per AC, you could
> I assume you mean multiple drv_wake_tx_queue() can run in parallel, as 
> long as they are serving different ACs?
> In a prior test patch I just blocked concurrent calls for the same ac.

No, I meant that drv_tx() was previously allowed concurrently for
different HW queues, which in practice means at least different ACs.

> While that may be enough for full iTXQ drivers I have doubts that this 
> is sufficient for the ones using ieee80211_handle_wake_tx_queue.

No no, that's what I mean, it should've been sufficient for old-style
drivers.

> I at least was unable to identify any code in the rt2800usb driver which 
> looked dangerous for concurrent execution. (Large parts are protected by 
> a driver spinlock)

Indeed, that's not so clear.

> I ended up with the assumption, that the DMA handover - or something 
> related to that - must cause the queue freezes. (Or my ability to detect 
> critical calls is not good enough, yet.)

That part is protected by a per-queue lock in
rt2x00queue_write_tx_frame() though, but I don't see any problem either.

> The patch still fixed the issue, of course. All races in the examined 
> regression are for IEEE80211_AC_BE, so it's sufficient fix when we 
> decide that's save.

It seems to me it should be safe, again, since we previously assumed you
could do TX per HW queue concurrently. However, that's not precisely per
AC, so we might need to be careful, and maybe that's not worth it.

> Holding active_txq_lock[ac] all that time - it's normally acquired and 
> released multiple times in between - seems to be a bit too daring for a 
> "stable" patch.

Fair enough :)

> I also still would place the spinlock in drv_wake_tx_queue(). After all 
> ieee80211_txq_schedule_start/end is kind of optional and e.g. iwlwifi is 
> not using it.

True.

> What about handling it that way:
> 
> Keep serializing drv_wake_tx_queue(). But use a new spinlock the drivers 
> can opt out from.
>   1) No flag set:
>      drv_wake_tx_queue() can be running only once at any time
> 
>   2) IEEE80211_HW_CONCURRENT_AC_TX
>      drv_wake_tx_queue() can be running once per AC at any time
> 
>   3) IEEE80211_HW_CONCURRENT
>      current behavior.
> 

I don't like the flags, to be honest.

Have you considered Felix's proposal of having a separate thread to
handle all the transmits? Or maybe that's too much for stable too?


I think my personal order of options would be:

1) for stable, just add a spinlock to ieee80211_handle_wake_tx_queue()

2) for later, maybe consider making ieee80211_handle_wake_tx_queue()
   just wake up a kthread or something, and handle the schedule loop
   there

However we may need to think more about 2), maybe we could even map to
hardware queues and do concurrency there, but better we just say screw
it and remove all that cruft instead ...

johannes
Felix Fietkau March 14, 2023, 12:28 p.m. UTC | #5
On 14.03.23 12:20, Alexander Wetzel wrote:
>> assuming that TXQ drivers actually still call
>> ieee80211_txq_schedule_end() which says it's deprecated.
>> 
>> That even has _bh() so the tasklet can't be running anyway ...
>> 
>> So if the concurrency really is only TX vs. tasklet, then you could even
>> just keep the BHs disabled (in _start spin_unlock only and then in _end
>> local_bh_disable)?
>> 
>>> Which may also be the solution for the regression in 6.2:
>>> Do it now for ieee80211_handle_wake_tx_queue() and apply this patch
>>> to the development tree only.
>> 
>> I'd argue the other way around - do it for all to fix these issues, and
>> then audit drivers such as iwlwifi or even make concurrency here opt-in.
>> 
>> Felix did see some benefits of the concurrency I think though, so he
>> might have a different opinion.
> 
> What about handling it that way:
> 
> Keep serializing drv_wake_tx_queue(). But use a new spinlock the drivers
> can opt out from.
>    1) No flag set:
>       drv_wake_tx_queue() can be running only once at any time
> 
>    2) IEEE80211_HW_CONCURRENT_AC_TX
>       drv_wake_tx_queue() can be running once per AC at any time
> 
>    3) IEEE80211_HW_CONCURRENT
>       current behavior.

I think if you really insist on handling this through drv_wake_tx_queue 
locking, it really shouldn't be done through extra hw flags, because the 
locking requirements are too different.

I took a quick look at all the drivers that implement iTXQ themselves 
and what their requirements are.

Only two drivers need changes:
- ath10k: needs a per-AC lock, because it does a scheduling round in the 
wake_tx_queue call.
- iwlwifi: needs a global lock, since it touches a common list. None of 
your proposed locking types are enough for this one.

The rest seem fine to me:
- ath9k has a per-hw-queue lock
- mt76 only schedules a kthread
- rtw88 uses a global lock + workqueue for scheduling
- rtw89 uses a (potentially unnecessary) ieee80211_schedule_txq call + 
workqueue for scheduling

If you want to address this in the least invasive way, add a per-AC lock 
to ath10k's wake_tx_queue handler, a global lock to iwlwifi, and a 
per-AC lock inside ieee80211_handle_wake_tx_queue().

That way no locking is needed around the drv_wake_tx_queue calls.

- Felix
Johannes Berg March 14, 2023, 12:32 p.m. UTC | #6
On Tue, 2023-03-14 at 13:28 +0100, Felix Fietkau wrote:
> If you want to address this in the least invasive way, add [...],
> a global lock to iwlwifi
> 

I'm already fixing this, see
https://lore.kernel.org/r/5674c40151267fea1333f0eda1701b141bbaa170.camel@sipsolutions.net

> , and a 
> per-AC lock inside ieee80211_handle_wake_tx_queue().

I'm not *entirely* sure per AC is sufficient given that you could
technically map two ACs to the same HW queue with vif->hw_queue[]?

But again, not really sure all that complexity is still worth it.

johannes
Felix Fietkau March 14, 2023, 12:36 p.m. UTC | #7
On 14.03.23 13:32, Johannes Berg wrote:
> On Tue, 2023-03-14 at 13:28 +0100, Felix Fietkau wrote:
>> If you want to address this in the least invasive way, add [...],
>> a global lock to iwlwifi
>> 
> 
> I'm already fixing this, see
> https://lore.kernel.org/r/5674c40151267fea1333f0eda1701b141bbaa170.camel@sipsolutions.net
> 
>> , and a 
>> per-AC lock inside ieee80211_handle_wake_tx_queue().
> 
> I'm not *entirely* sure per AC is sufficient given that you could
> technically map two ACs to the same HW queue with vif->hw_queue[]?
> 
> But again, not really sure all that complexity is still worth it.
Per AC should be sufficient even in that case, because scheduling 
happens per AC regardless of vif hw queue mapping.

- Felix
Alexander Wetzel March 14, 2023, 4:44 p.m. UTC | #8
On 14.03.23 13:22, Johannes Berg wrote:
> Hi,
> 
> On Tue, 2023-03-14 at 12:20 +0100, Alexander Wetzel wrote:
>> On 13.03.23 21:33, Johannes Berg wrote:
>>> On Mon, 2023-03-13 at 21:15 +0100, Alexander Wetzel wrote:
>>>>
>>>> While drivers with native iTXQ support are able to handle that,
>>>>
>>>
>>> questionable. Looking at iwlwifi:
>>>
>>> void iwl_mvm_mac_wake_tx_queue(struct ieee80211_hw *hw,
>>>                                  struct ieee80211_txq *txq)
>>> {
>>>           struct iwl_mvm *mvm = IWL_MAC80211_GET_MVM(hw);
>>> ...
>>>           list_add_tail(&mvmtxq->list, &mvm->add_stream_txqs);
>>> ...
>>> }
>>>
>>> which might explain some rare and hard to debug list corruptions in the
>>> driver that we've seen reports of ...
>>>
>>
>> Shall I change the scope of the fix from "only 6.2" to any stable kernel?
>> 'Fixes: ba8c3d6f16a1 ("mac80211: add an intermediate software queue
>> implementation")'
>>
>> Or is that a overreaction and we better stick to what we know for sure
>> and keep the 'Fixes: c850e31f79f0 ("wifi: mac80211: add internal handler
>> for wake_tx_queue")'?
> 
> I think we stick with the latter. I already have a fix for iwlwifi (see
> https://lore.kernel.org/r/20230314103840.30771-1-jtornosm@redhat.com)
> and other drivers should be fine.
> 
> Also that means we should probably restrict the fix to actually be for
> mac80211 only.
> 
>>>> To avoid what seems to be a not needed distinction between native and
>>>> drivers using ieee80211_handle_wake_tx_queue(), the serialization is
>>>> done for drv_wake_tx_queue() here.
>>>
>>> So probably no objection to that, at least for now, though in the common
>>> path (what I showed above was the 'first use' path), iwlwifi actually
>>> hits different HW resources, so it _could_ benefit from concurrent TX
>>> after fixing that bug.
>>>
>>
>> I could also directly add a driver a driver flag, allowing drivers to
>> opt in to overlapping calls. And then set the flag for mt76, since Felix
>> prefers to not change the behavior and knows this driver works with it.
> 
> Too much complexity. I change my position - let's fix mac80211 and
> iwlwifi separately.
> 
>> Exactly. With the benefit that when we run uncontested the overhead is
>> close to zero.
>> But this should also be true for spinlocks. And when we spin on
>> contention it even better... So I'll change it to use a spinlock instead.
> 
> Yeah that was kind of my thought process here too.
> 
> 
>>> Since drivers are supposed to handle concurrent TX per AC, you could
>> I assume you mean multiple drv_wake_tx_queue() can run in parallel, as
>> long as they are serving different ACs?
>> In a prior test patch I just blocked concurrent calls for the same ac.
> 
> No, I meant that drv_tx() was previously allowed concurrently for
> different HW queues, which in practice means at least different ACs.
> 
>> While that may be enough for full iTXQ drivers I have doubts that this
>> is sufficient for the ones using ieee80211_handle_wake_tx_queue.
> 
> No no, that's what I mean, it should've been sufficient for old-style
> drivers.
> 
>> I at least was unable to identify any code in the rt2800usb driver which
>> looked dangerous for concurrent execution. (Large parts are protected by
>> a driver spinlock)
> 
> Indeed, that's not so clear.
> 
>> I ended up with the assumption, that the DMA handover - or something
>> related to that - must cause the queue freezes. (Or my ability to detect
>> critical calls is not good enough, yet.)
> 
> That part is protected by a per-queue lock in
> rt2x00queue_write_tx_frame() though, but I don't see any problem either.
> 

Not really relevant here but to elaborate a bit:
The Hw itself seems to suffer from random freezes. But Thomas tested the 
watchdog in the driver for those. It was not helping.

On the other end the presence of such a watchdog is not exactly an 
indication for trustworthy card...

And I'm also quite surprised that Thomas triggers the issue so reliable 
while may close to be identical card I dug up works fine.
We even both use Gentoo and the same kernel. Build with the same 
compiler. Down to the version and build flags, as far as I can tell.

So nifty run-times differences caused by quite different Intel CPUs - or 
another USB controller/speed - is actually high on the list of suspects.

Effectively I just assume now, that two TX operations too close together 
are enough to kill TX. (This could be missing lock in the firmware or 
even a HW issue the firmware is not (able to) working around.)

If I could reproduce the issue I would try to track that down out of 
sheer curiosity. As things are it's not worth the effort.

Fact is, that the Thomas's driver is frequently running out of internal 
queue buffers, stop the associated mac80211 queue to pause TX. And when 
the queue is stopped during a concurrent TX operation the driver somehow 
needs >30s to start the queue again. Without logging any complaint or a 
enabled watchdog in the driver detecting the hung.

>> The patch still fixed the issue, of course. All races in the examined
>> regression are for IEEE80211_AC_BE, so it's sufficient fix when we
>> decide that's save.
> 
> It seems to me it should be safe, again, since we previously assumed you
> could do TX per HW queue concurrently. However, that's not precisely per
> AC, so we might need to be careful, and maybe that's not worth it.
> 
>> Holding active_txq_lock[ac] all that time - it's normally acquired and
>> released multiple times in between - seems to be a bit too daring for a
>> "stable" patch.
> 
> Fair enough :)
> 
>> I also still would place the spinlock in drv_wake_tx_queue(). After all
>> ieee80211_txq_schedule_start/end is kind of optional and e.g. iwlwifi is
>> not using it.
> 
> True.
> 
>> What about handling it that way:
>>
>> Keep serializing drv_wake_tx_queue(). But use a new spinlock the drivers
>> can opt out from.
>>    1) No flag set:
>>       drv_wake_tx_queue() can be running only once at any time
>>
>>    2) IEEE80211_HW_CONCURRENT_AC_TX
>>       drv_wake_tx_queue() can be running once per AC at any time
>>
>>    3) IEEE80211_HW_CONCURRENT
>>       current behavior.
>>
> 
> I don't like the flags, to be honest.
> 
> Have you considered Felix's proposal of having a separate thread to
> handle all the transmits? Or maybe that's too much for stable too?

I was planning to respond to Felix mail, too.
But guess it makes more sense here now and not have multiple concurrent 
discussions:-)

For the immediate fix I do not like the kthread. Which simply may be due 
to the fact that I would need more time to delve into that.

Not so sure how we can get the queue wakes as cheap as they currently 
are, though. Each time we kick off the kthread it would have to check 
the scheduling lists for each AC. Or use one kthread per AC... (That 
from someone who knows close to nothing about kthread and needs do more 
reading...)

Generally it sounds like improvement for mac80211 roadmap. But nothing I 
would want to pickup right now.
I first want to wrap up the migration to iTXQ. (Which now also redesigns 
how filtered and pending frames are handled, btw.)

> 
> I think my personal order of options would be:
> 
> 1) for stable, just add a spinlock to ieee80211_handle_wake_tx_queue()
> 
> 2) for later, maybe consider making ieee80211_handle_wake_tx_queue()
>     just wake up a kthread or something, and handle the schedule loop
>     there
> 
> However we may need to think more about 2), maybe we could even map to
> hardware queues and do concurrency there, but better we just say screw
> it and remove all that cruft instead ...
> 

I like that.

I try to work on that this evening and hopefully can share a new patch 
later today or tomorrow. (No V2, title and solution has changed.)

I'll just use per-ac spinlocks in ieee80211_handle_wake_tx_queue(), 
allowing multiple ACs to TX at the same time.

That's probably not able to prevent the stopped queue issue Thomas got 
with 6.2 kernels when mutliply ACs have concurrent TX.
But it will bring back the driver to the level it operated on kernel 
<6.1. Which sounds acceptable for me.

Someone planning to fix the issue for ath10k?
If not I'll look into that, too. (Since we don't have known issues for 
that it's not exactly a priority.)

Alexander
Johannes Berg March 14, 2023, 4:50 p.m. UTC | #9
On Tue, 2023-03-14 at 17:44 +0100, Alexander Wetzel wrote:
> > 
> > Have you considered Felix's proposal of having a separate thread to
> > handle all the transmits? Or maybe that's too much for stable too?
> 
> I was planning to respond to Felix mail, too.
> But guess it makes more sense here now and not have multiple concurrent 
> discussions:-)

:-)

> For the immediate fix I do not like the kthread. Which simply may be due 
> to the fact that I would need more time to delve into that.

Agree.

> Not so sure how we can get the queue wakes as cheap as they currently 
> are, though. Each time we kick off the kthread it would have to check 
> the scheduling lists for each AC. Or use one kthread per AC... (That 
> from someone who knows close to nothing about kthread and needs do more 
> reading...)

Yes it might not be as cheap, but a kthread could also handle multiple
packets together if there were many enqueued, so might be even be a net
win in a kind of high traffic case. Not sure.

> Generally it sounds like improvement for mac80211 roadmap. But nothing I 
> would want to pickup right now.
> I first want to wrap up the migration to iTXQ. (Which now also redesigns 
> how filtered and pending frames are handled, btw.)

Also agree with that.

> I'll just use per-ac spinlocks in ieee80211_handle_wake_tx_queue(), 
> allowing multiple ACs to TX at the same time.

Not sure it's even worth splitting it across ACs?

> That's probably not able to prevent the stopped queue issue Thomas got 
> with 6.2 kernels when mutliply ACs have concurrent TX.
> But it will bring back the driver to the level it operated on kernel 
> <6.1. Which sounds acceptable for me.

Sure.

> Someone planning to fix the issue for ath10k?
> If not I'll look into that, too. (Since we don't have known issues for 
> that it's not exactly a priority.)

Not me, already have my hands full with iwlwifi :-)

johannes
diff mbox series

Patch

diff --git a/net/mac80211/driver-ops.h b/net/mac80211/driver-ops.h
index 5d13a3dfd366..8bc4904c32e2 100644
--- a/net/mac80211/driver-ops.h
+++ b/net/mac80211/driver-ops.h
@@ -1192,6 +1192,16 @@  drv_tdls_recv_channel_switch(struct ieee80211_local *local,
 	trace_drv_return_void(local);
 }
 
+static void handle_aborted_drv_wake_tx_queue(struct ieee80211_hw *hw)
+{
+	ieee80211_stop_queues_by_reason(hw, IEEE80211_MAX_QUEUE_MAP,
+					IEEE80211_QUEUE_STOP_REASON_SERIALIZE,
+					false);
+	ieee80211_wake_queues_by_reason(hw, IEEE80211_MAX_QUEUE_MAP,
+					IEEE80211_QUEUE_STOP_REASON_SERIALIZE,
+					false);
+}
+
 static inline void drv_wake_tx_queue(struct ieee80211_local *local,
 				     struct txq_info *txq)
 {
@@ -1206,8 +1216,19 @@  static inline void drv_wake_tx_queue(struct ieee80211_local *local,
 	if (!check_sdata_in_driver(sdata))
 		return;
 
+	/* Serialize calls to wake_tx_queue */
+	if (test_and_set_bit(ITXQ_RUN_WAKE_TX_QUEUE_ACTIVE,
+			     &local->itxq_run_flags)) {
+		set_bit(ITXQ_RUN_WAKE_TX_QUEUE_NEEDED, &local->itxq_run_flags);
+		return;
+	}
 	trace_drv_wake_tx_queue(local, sdata, txq);
 	local->ops->wake_tx_queue(&local->hw, &txq->txq);
+	clear_bit(ITXQ_RUN_WAKE_TX_QUEUE_ACTIVE, &local->itxq_run_flags);
+
+	if (test_and_clear_bit(ITXQ_RUN_WAKE_TX_QUEUE_NEEDED,
+			       &local->itxq_run_flags))
+		handle_aborted_drv_wake_tx_queue(&local->hw);
 }
 
 static inline void schedule_and_wake_txq(struct ieee80211_local *local,
diff --git a/net/mac80211/ieee80211_i.h b/net/mac80211/ieee80211_i.h
index ecc232eb1ee8..42cd32892186 100644
--- a/net/mac80211/ieee80211_i.h
+++ b/net/mac80211/ieee80211_i.h
@@ -1197,6 +1197,7 @@  enum queue_stop_reason {
 	IEEE80211_QUEUE_STOP_REASON_TDLS_TEARDOWN,
 	IEEE80211_QUEUE_STOP_REASON_RESERVE_TID,
 	IEEE80211_QUEUE_STOP_REASON_IFTYPE_CHANGE,
+	IEEE80211_QUEUE_STOP_REASON_SERIALIZE,
 
 	IEEE80211_QUEUE_STOP_REASONS,
 };
@@ -1269,6 +1270,11 @@  enum mac80211_scan_state {
 
 DECLARE_STATIC_KEY_FALSE(aql_disable);
 
+enum itxq_run_flags {
+	ITXQ_RUN_WAKE_TX_QUEUE_ACTIVE,
+	ITXQ_RUN_WAKE_TX_QUEUE_NEEDED,
+};
+
 struct ieee80211_local {
 	/* embed the driver visible part.
 	 * don't cast (use the static inlines below), but we keep
@@ -1284,6 +1290,9 @@  struct ieee80211_local {
 	struct list_head active_txqs[IEEE80211_NUM_ACS];
 	u16 schedule_round[IEEE80211_NUM_ACS];
 
+	/* Used to handle/prevent overlapping calls to drv_wake_tx_queue() */
+	unsigned long itxq_run_flags;
+
 	u16 airtime_flags;
 	u32 aql_txq_limit_low[IEEE80211_NUM_ACS];
 	u32 aql_txq_limit_high[IEEE80211_NUM_ACS];
diff --git a/net/mac80211/util.c b/net/mac80211/util.c
index 1a28fe5cb614..48483eed4826 100644
--- a/net/mac80211/util.c
+++ b/net/mac80211/util.c
@@ -474,7 +474,8 @@  static void __ieee80211_wake_queue(struct ieee80211_hw *hw, int queue,
 	 * release someone's lock, but it is fine because all the callers of
 	 * __ieee80211_wake_queue call it right before releasing the lock.
 	 */
-	if (reason == IEEE80211_QUEUE_STOP_REASON_DRIVER)
+	if (reason == IEEE80211_QUEUE_STOP_REASON_DRIVER ||
+	    reason == IEEE80211_QUEUE_STOP_REASON_SERIALIZE)
 		tasklet_schedule(&local->wake_txqs_tasklet);
 	else
 		_ieee80211_wake_txqs(local, flags);