diff mbox

mac80211: fix Tx BA session stuck issue during sw scanning

Message ID 1479887968-17473-1-git-send-email-chiu@endlessm.com (mailing list archive)
State Changes Requested
Delegated to: Johannes Berg
Headers show

Commit Message

Chris Chiu Nov. 23, 2016, 7:59 a.m. UTC
ieee80211_iface_work() will check if sw scanning is in progress
before handling block ack session. In our case, the RTL8821AE
operate in station mode, when tx session expired, DELBA packet
stuck during sw scanning and so do other data packets.

ieee80211_scan_state_decision() will take lots of time in
SCAN_SUSPEND/SCAN_RESUME state due to !tx_empty or bad_latency.
Then the sw scanning mostly take > 20 seconds to finish or even
worse in our case RTL8821AE have 37 channels for 2G+5G to scan
and tx stalls ~300 seconds. AP side still thinks the connection
is alive because it still receives the QoS NULL packet from STA.
So the link state will never change but actually no data tx/rx
during this long time.

This commit tries to send out packet in SCAN_SUSPEND state so the
sw scanning can complete more efficiently and less affect on Block
Ack session handling. Verified on RTL8821AE for > 30000 pings and
no Tx BA session stuck observed.

Signed-off-by: Chris Chiu <chiu@endlessm.com>
---
 net/mac80211/ieee80211_i.h | 1 +
 net/mac80211/iface.c       | 4 +++-
 net/mac80211/scan.c        | 4 ++++
 3 files changed, 8 insertions(+), 1 deletion(-)

Comments

Johannes Berg Nov. 29, 2016, 12:59 p.m. UTC | #1
On Wed, 2016-11-23 at 15:59 +0800, Chris Chiu wrote:
> ieee80211_iface_work() will check if sw scanning is in progress
> before handling block ack session. In our case, the RTL8821AE
> operate in station mode, when tx session expired, DELBA packet
> stuck during sw scanning and so do other data packets.
> 
> ieee80211_scan_state_decision() will take lots of time in
> SCAN_SUSPEND/SCAN_RESUME state due to !tx_empty or bad_latency.
> Then the sw scanning mostly take > 20 seconds to finish or even
> worse in our case RTL8821AE have 37 channels for 2G+5G to scan
> and tx stalls ~300 seconds. AP side still thinks the connection
> is alive because it still receives the QoS NULL packet from STA.
> So the link state will never change but actually no data tx/rx
> during this long time.
> 
> This commit tries to send out packet in SCAN_SUSPEND state so the
> sw scanning can complete more efficiently and less affect on Block
> Ack session handling. Verified on RTL8821AE for > 30000 pings and
> no Tx BA session stuck observed.

The premise seems fairly reasonable, although I'm a little worried that
if so much new traffic is coming in we never finish the scan suspend?
Actually, the queues are still stopped, so it's only management frames
that can come in, so that should be ok?

> +	test_and_clear_bit(SCAN_SUSPEND_SCANNING, &local->scanning);
> 

That makes no sense, you're not checking the return value, just clear
the bit without test.
 
> @@ -844,6 +846,8 @@ static void ieee80211_scan_state_suspend(struct
> ieee80211_local *local,
>  	/* disable PS */
>  	ieee80211_offchannel_return(local);
>  
> +	__set_bit(SCAN_SUSPEND_SCANNING, &local->scanning);

Why are you using the non-atomic version here, vs. the atomic one
above?

johannes
Arend van Spriel Nov. 29, 2016, 1:38 p.m. UTC | #2
On 29-11-2016 13:59, Johannes Berg wrote:
> On Wed, 2016-11-23 at 15:59 +0800, Chris Chiu wrote:
>> ieee80211_iface_work() will check if sw scanning is in progress
>> before handling block ack session. In our case, the RTL8821AE
>> operate in station mode, when tx session expired, DELBA packet
>> stuck during sw scanning and so do other data packets.
>>
>> ieee80211_scan_state_decision() will take lots of time in
>> SCAN_SUSPEND/SCAN_RESUME state due to !tx_empty or bad_latency.
>> Then the sw scanning mostly take > 20 seconds to finish or even
>> worse in our case RTL8821AE have 37 channels for 2G+5G to scan
>> and tx stalls ~300 seconds. AP side still thinks the connection
>> is alive because it still receives the QoS NULL packet from STA.
>> So the link state will never change but actually no data tx/rx
>> during this long time.
>>
>> This commit tries to send out packet in SCAN_SUSPEND state so the
>> sw scanning can complete more efficiently and less affect on Block
>> Ack session handling. Verified on RTL8821AE for > 30000 pings and
>> no Tx BA session stuck observed.
> 
> The premise seems fairly reasonable, although I'm a little worried that
> if so much new traffic is coming in we never finish the scan suspend?
> Actually, the queues are still stopped, so it's only management frames
> that can come in, so that should be ok?

So are pings a good way to verify block-ack session handling. How often
was a scan issued within those 30000 pings or was that left to
wpa_supplicant/network-manager/whatever?

Regards,
Arend

>> +	test_and_clear_bit(SCAN_SUSPEND_SCANNING, &local->scanning);
>>
> 
> That makes no sense, you're not checking the return value, just clear
> the bit without test.
>  
>> @@ -844,6 +846,8 @@ static void ieee80211_scan_state_suspend(struct
>> ieee80211_local *local,
>>  	/* disable PS */
>>  	ieee80211_offchannel_return(local);
>>  
>> +	__set_bit(SCAN_SUSPEND_SCANNING, &local->scanning);
> 
> Why are you using the non-atomic version here, vs. the atomic one
> above?
> 
> johannes
>
Chris Chiu Dec. 1, 2016, 11:24 a.m. UTC | #3
On Tue, Nov 29, 2016 at 8:59 PM, Johannes Berg
<johannes@sipsolutions.net> wrote:
> On Wed, 2016-11-23 at 15:59 +0800, Chris Chiu wrote:
>> ieee80211_iface_work() will check if sw scanning is in progress
>> before handling block ack session. In our case, the RTL8821AE
>> operate in station mode, when tx session expired, DELBA packet
>> stuck during sw scanning and so do other data packets.
>>
>> ieee80211_scan_state_decision() will take lots of time in
>> SCAN_SUSPEND/SCAN_RESUME state due to !tx_empty or bad_latency.
>> Then the sw scanning mostly take > 20 seconds to finish or even
>> worse in our case RTL8821AE have 37 channels for 2G+5G to scan
>> and tx stalls ~300 seconds. AP side still thinks the connection
>> is alive because it still receives the QoS NULL packet from STA.
>> So the link state will never change but actually no data tx/rx
>> during this long time.
>>
>> This commit tries to send out packet in SCAN_SUSPEND state so the
>> sw scanning can complete more efficiently and less affect on Block
>> Ack session handling. Verified on RTL8821AE for > 30000 pings and
>> no Tx BA session stuck observed.
>
> The premise seems fairly reasonable, although I'm a little worried that
> if so much new traffic is coming in we never finish the scan suspend?
> Actually, the queues are still stopped, so it's only management frames
> that can come in, so that should be ok?
>

Actually it will finish scan eventually and back to SCAN_DECISION
state but almost 20~30 seconds elapsed. The local->scanning should be
cleared after all channels been scanned. However, from the debug
messages I added in ieee80211_iface_work(), it still returns when
check local->scanning and the DELBA still has no chance to be
transferred then stuck again at the next scan state machine. Supposed
to be another scan request issued but I don't know who's the killer.
Except to find who issue the next scan request, BA session have no
chance to reset in this long scan period (>20s) still need to be taken
care.

>> +     test_and_clear_bit(SCAN_SUSPEND_SCANNING, &local->scanning);
>>
>
> That makes no sense, you're not checking the return value, just clear
> the bit without test.
>
>> @@ -844,6 +846,8 @@ static void ieee80211_scan_state_suspend(struct
>> ieee80211_local *local,
>>       /* disable PS */
>>       ieee80211_offchannel_return(local);
>>
>> +     __set_bit(SCAN_SUSPEND_SCANNING, &local->scanning);
>
> Why are you using the non-atomic version here, vs. the atomic one
> above?
>
You're right. I just want to clear_bit and set_bit in this case, sorry
for that confusing. Or any better suggestion?

> johannes
Johannes Berg Dec. 5, 2016, 2:57 p.m. UTC | #4
> > The premise seems fairly reasonable, although I'm a little worried
> > that if so much new traffic is coming in we never finish the scan
> > suspend? Actually, the queues are still stopped, so it's only
> > management frames that can come in, so that should be ok?
> > 
> 
> Actually it will finish scan eventually and back to SCAN_DECISION
> state but almost 20~30 seconds elapsed. The local->scanning should be
> cleared after all channels been scanned. However, from the debug
> messages I added in ieee80211_iface_work(), it still returns when
> check local->scanning and the DELBA still has no chance to be
> transferred then stuck again at the next scan state machine. Supposed
> to be another scan request issued but I don't know who's the killer.
> Except to find who issue the next scan request, BA session have no
> chance to reset in this long scan period (>20s) still need to be
> taken
> care.

No no, you misunderstood. My question is more like:

Where is this traffic coming from, since netdev queues should be
stopped?

And then, if there's so much traffic coming in that we can take 20-30
seconds to send it out, could we - with the change - get stuck forever?

> You're right. I just want to clear_bit and set_bit in this case,
> sorry for that confusing. Or any better suggestion?

We seem to be using set_bit/clear_bit so that seems reasonable, unless
you can prove that all of those are under the lock and we can remove
the atomics entirely ... Not that it matters hugely, we don't scan all
the time after all!

johannes
Chris Chiu Dec. 5, 2016, 4:56 p.m. UTC | #5
On Mon, Dec 5, 2016 at 10:57 PM, Johannes Berg
<johannes@sipsolutions.net> wrote:
>> > The premise seems fairly reasonable, although I'm a little worried
>> > that if so much new traffic is coming in we never finish the scan
>> > suspend? Actually, the queues are still stopped, so it's only
>> > management frames that can come in, so that should be ok?
>> >
>>
>> Actually it will finish scan eventually and back to SCAN_DECISION
>> state but almost 20~30 seconds elapsed. The local->scanning should be
>> cleared after all channels been scanned. However, from the debug
>> messages I added in ieee80211_iface_work(), it still returns when
>> check local->scanning and the DELBA still has no chance to be
>> transferred then stuck again at the next scan state machine. Supposed
>> to be another scan request issued but I don't know who's the killer.
>> Except to find who issue the next scan request, BA session have no
>> chance to reset in this long scan period (>20s) still need to be
>> taken
>> care.
>
> No no, you misunderstood. My question is more like:
>
> Where is this traffic coming from, since netdev queues should be
> stopped?
>
> And then, if there's so much traffic coming in that we can take 20-30
> seconds to send it out, could we - with the change - get stuck forever?
>
My test scenario is just simply ping (not greedy ping). When it
stalls, you just see 1 ping packet has been retried on the air for 3
times w/o ack from AP. Then no more tx QoS data packet observed on the
air. From the log, it shows the tx ba session been expired and a DELBA
is queued but never sent out. So the STA wants to terminate current BA
session but can't make it. The air capture pcap file shows It can't
deal with the coming "Action" frame for "ADDBA" also. The Block Ack
state machine just stops until the whole sw scan done. In my case,
it's >300 seconds.

So this patch just tries to seek a chance to send out the DELBA and
process the BA related packets in ieee80211_iface_work(). It tries to
avoid the stuck forever thing. I tried 30000+ pings and verified no
stuck observed.

>> You're right. I just want to clear_bit and set_bit in this case,
>> sorry for that confusing. Or any better suggestion?
>
> We seem to be using set_bit/clear_bit so that seems reasonable, unless
> you can prove that all of those are under the lock and we can remove
> the atomics entirely ... Not that it matters hugely, we don't scan all
> the time after all!
>
I agree. Will modify this patch and then send again for approval.
> johannes
diff mbox

Patch

diff --git a/net/mac80211/ieee80211_i.h b/net/mac80211/ieee80211_i.h
index f56d342..78c1a13 100644
--- a/net/mac80211/ieee80211_i.h
+++ b/net/mac80211/ieee80211_i.h
@@ -1076,6 +1076,7 @@  enum {
 	SCAN_SW_SCANNING,
 	SCAN_HW_SCANNING,
 	SCAN_ONCHANNEL_SCANNING,
+	SCAN_SUSPEND_SCANNING,
 	SCAN_COMPLETED,
 	SCAN_ABORTED,
 	SCAN_HW_CANCELLED,
diff --git a/net/mac80211/iface.c b/net/mac80211/iface.c
index b123a9e..0a43997 100644
--- a/net/mac80211/iface.c
+++ b/net/mac80211/iface.c
@@ -1215,8 +1215,10 @@  static void ieee80211_iface_work(struct work_struct *work)
 	if (!ieee80211_sdata_running(sdata))
 		return;
 
-	if (test_bit(SCAN_SW_SCANNING, &local->scanning))
+	if (test_bit(SCAN_SW_SCANNING, &local->scanning) &&
+	    !test_bit(SCAN_SUSPEND_SCANNING, &local->scanning)) {
 		return;
+	}
 
 	if (!ieee80211_can_run_worker(local))
 		return;
diff --git a/net/mac80211/scan.c b/net/mac80211/scan.c
index 070b40f..ebd32a0 100644
--- a/net/mac80211/scan.c
+++ b/net/mac80211/scan.c
@@ -800,6 +800,8 @@  static void ieee80211_scan_state_set_channel(struct ieee80211_local *local,
 		break;
 	}
 
+	test_and_clear_bit(SCAN_SUSPEND_SCANNING, &local->scanning);
+
 	if (ieee80211_hw_config(local, IEEE80211_CONF_CHANGE_CHANNEL))
 		skip = 1;
 
@@ -844,6 +846,8 @@  static void ieee80211_scan_state_suspend(struct ieee80211_local *local,
 	/* disable PS */
 	ieee80211_offchannel_return(local);
 
+	__set_bit(SCAN_SUSPEND_SCANNING, &local->scanning);
+
 	*next_delay = HZ / 5;
 	/* afterwards, resume scan & go to next channel */
 	local->next_scan_state = SCAN_RESUME;