diff mbox series

[v2] mac80211: fix regression in sta connection monitor

Message ID 20200922102434.42727-1-nbd@nbd.name (mailing list archive)
State Superseded
Delegated to: Johannes Berg
Headers show
Series [v2] mac80211: fix regression in sta connection monitor | expand

Commit Message

Felix Fietkau Sept. 22, 2020, 10:24 a.m. UTC
When a frame was acked and probe frames were sent, the connection monitoring
needs to be reset, otherwise it will keep probing until the connection is
considered dead, even though frames have been acked in the mean time.

Fixes: 9abf4e49830d ("mac80211: optimize station connection monitor")
Reported-by: Georgi Valkov <gvalkov@abv.bg>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
---
v2: reset connection monitor when a frame was acked (not just for nulldata)

 net/mac80211/status.c | 9 +++++++++
 1 file changed, 9 insertions(+)

Comments

Georgi Valkov Sept. 26, 2020, 4:41 a.m. UTC | #1
Hi Felix!

With your latest suggestion, it takes between 10 and 17 hours for the connection to drop, then long five minutes to reconnect.
Notice the order of code execution in the original code of ieee80211_sta_tx_notify():
probe_send_count is always cleared when ack is true. But before clearing probe_send_count, we need to check if ieee80211_is_any_nullfunc() returns true and probe_send_count > 0: if yes, we must call ieee80211_queue_work(). You cleared probe_send_count ahead of time, so the condition to run ieee80211_queue_work() will never be met.

To spare your time, I did spend one week to find the cause, then another learning every detail about the code and testing various solutions, including those you proposed. While I do not have experience with mac80211’s design, I’m quite good at preserving the exact behaviour during large scale refactoring. And in my fix I tried changing as little as possible to keep the patch small, preserving both your changes and the original design behaviour.

Good luck, and thank you for being a long time supporter of this amazing project!

Georgi

> On 2020-09-22, at 1:24 PM, Felix Fietkau <nbd@nbd.name> wrote:
> 
> When a frame was acked and probe frames were sent, the connection monitoring
> needs to be reset, otherwise it will keep probing until the connection is
> considered dead, even though frames have been acked in the mean time.
> 
> Fixes: 9abf4e49830d ("mac80211: optimize station connection monitor")
> Reported-by: Georgi Valkov <gvalkov@abv.bg>
> Signed-off-by: Felix Fietkau <nbd@nbd.name>
> ---
> v2: reset connection monitor when a frame was acked (not just for nulldata)
> 
> net/mac80211/status.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
> 
> diff --git a/net/mac80211/status.c b/net/mac80211/status.c
> index 7fe5bececfd9..cc870d1f7db6 100644
> --- a/net/mac80211/status.c
> +++ b/net/mac80211/status.c
> @@ -1120,6 +1120,8 @@ void ieee80211_tx_status_ext(struct ieee80211_hw *hw,
> 	noack_success = !!(info->flags & IEEE80211_TX_STAT_NOACK_TRANSMITTED);
> 
> 	if (pubsta) {
> +		struct ieee80211_sub_if_data *sdata = sta->sdata;
> +
> 		if (!acked && !noack_success)
> 			sta->status_stats.retry_failed++;
> 		sta->status_stats.retry_count += retry_count;
> @@ -1134,6 +1136,13 @@ void ieee80211_tx_status_ext(struct ieee80211_hw *hw,
> 				/* Track when last packet was ACKed */
> 				sta->status_stats.last_pkt_time = jiffies;
> 
> +				/* Reset connection monitor */
> +				if (sdata->vif.type == NL80211_IFTYPE_STATION &&
> +				    unlikely(sdata->u.mgd.probe_send_count > 0)) {
> +					sdata->u.mgd.probe_send_count = 0;
> +					ieee80211_queue_work(&local->hw, &sdata->work);
> +				}
> +
> 				if (info->status.is_valid_ack_signal) {
> 					sta->status_stats.last_ack_signal =
> 							 (s8)info->status.ack_signal;
> -- 
> 2.28.0
> 
>
Felix Fietkau Sept. 26, 2020, 6:11 a.m. UTC | #2
Hi Georgi,

thanks for testing and for your insight into this issue.

On 2020-09-26 06:41, Georgi Valkov wrote:
> Hi Felix!
> 
> With your latest suggestion, it takes between 10 and 17 hours for the connection to drop, then long five minutes to reconnect.
> Notice the order of code execution in the original code of ieee80211_sta_tx_notify():
> probe_send_count is always cleared when ack is true. But before clearing probe_send_count, we need to check if ieee80211_is_any_nullfunc() returns true and probe_send_count > 0: if yes, we must call ieee80211_queue_work(). You cleared probe_send_count ahead of time, so the condition to run ieee80211_queue_work() will never be met.
The condition in ieee80211_sta_tx_notify may not be met, but
ieee80211_queue_work is called from ieee80211_tx_status_ext in that
case. Any idea why that is not enough?

> To spare your time, I did spend one week to find the cause, then another learning every detail about the code and testing various solutions, including those you proposed. While I do not have experience with mac80211’s design, I’m quite good at preserving the exact behaviour during large scale refactoring. And in my fix I tried changing as little as possible to keep the patch small, preserving both your changes and the original design behaviour.
And the reason I'm still proposing changes to it is because your patch
does not take into account no-skb or 802.3 tx status.
I'll try to make a v3 patch that leaves more of the original code intact
at the cost of a little more duplication.

- Felix
Georgi Valkov Sept. 26, 2020, 7:31 a.m. UTC | #3
> thanks for testing and for your insight into this issue.
I’m happy to help!

> On 2020-09-26 06:41, Georgi Valkov wrote:
>> Hi Felix!
>> 
>> With your latest suggestion, it takes between 10 and 17 hours for the connection to drop, then long five minutes to reconnect.
>> Notice the order of code execution in the original code of ieee80211_sta_tx_notify():
>> probe_send_count is always cleared when ack is true. But before clearing probe_send_count, we need to check if ieee80211_is_any_nullfunc() returns true and probe_send_count > 0: if yes, we must call ieee80211_queue_work(). You cleared probe_send_count ahead of time, so the condition to run ieee80211_queue_work() will never be met.
> The condition in ieee80211_sta_tx_notify may not be met, but
> ieee80211_queue_work is called from ieee80211_tx_status_ext in that
> case. Any idea why that is not enough?

Whenever that condition is met and probe_send_count is reset there, ieee80211_queue_work() is called for sure. Could that be the problem? Because in the original code we also check the result from ieee80211_is_any_nullfunc(). We can also add a trace e.g. printk() to shine some light on the order of events, hopefully without flooding the kernel log too much and without impact on the performance?
You can send me patches directly without committing to master and I will test them.

>> To spare your time, I did spend one week to find the cause, then another learning every detail about the code and testing various solutions, including those you proposed. While I do not have experience with mac80211’s design, I’m quite good at preserving the exact behaviour during large scale refactoring. And in my fix I tried changing as little as possible to keep the patch small, preserving both your changes and the original design behaviour.
> And the reason I'm still proposing changes to it is because your patch
> does not take into account no-skb or 802.3 tx status.
> I'll try to make a v3 patch that leaves more of the original code intact
> at the cost of a little more duplication.
Good point. You are familiar with how it should work, and I only know the changes. You can add the common code into a new function to make it easier to follow.

I’m curious: the original RX code used to call ieee80211_sta_tx_notify(). Do we need to clear probe_send_count or do anything else there? I suppose you removed that because it will be performed later. Either way we should not touch that before the issue is resolved.


Georgi
diff mbox series

Patch

diff --git a/net/mac80211/status.c b/net/mac80211/status.c
index 7fe5bececfd9..cc870d1f7db6 100644
--- a/net/mac80211/status.c
+++ b/net/mac80211/status.c
@@ -1120,6 +1120,8 @@  void ieee80211_tx_status_ext(struct ieee80211_hw *hw,
 	noack_success = !!(info->flags & IEEE80211_TX_STAT_NOACK_TRANSMITTED);
 
 	if (pubsta) {
+		struct ieee80211_sub_if_data *sdata = sta->sdata;
+
 		if (!acked && !noack_success)
 			sta->status_stats.retry_failed++;
 		sta->status_stats.retry_count += retry_count;
@@ -1134,6 +1136,13 @@  void ieee80211_tx_status_ext(struct ieee80211_hw *hw,
 				/* Track when last packet was ACKed */
 				sta->status_stats.last_pkt_time = jiffies;
 
+				/* Reset connection monitor */
+				if (sdata->vif.type == NL80211_IFTYPE_STATION &&
+				    unlikely(sdata->u.mgd.probe_send_count > 0)) {
+					sdata->u.mgd.probe_send_count = 0;
+					ieee80211_queue_work(&local->hw, &sdata->work);
+				}
+
 				if (info->status.is_valid_ack_signal) {
 					sta->status_stats.last_ack_signal =
 							 (s8)info->status.ack_signal;