diff mbox series

[v2] 802.11n IBSS: wlan0 stops receiving packets due to aggregation after sender reboot

Message ID m37e4tjfbu.fsf@t19.piap.pl (mailing list archive)
State Rejected
Delegated to: Johannes Berg
Headers show
Series [v2] 802.11n IBSS: wlan0 stops receiving packets due to aggregation after sender reboot | expand

Commit Message

Krzysztof Hałasa Oct. 25, 2019, 10:21 a.m. UTC
Fix a bug where the mac80211 RX aggregation code sets a new aggregation
"session" at the remote station's request, but the head_seq_num
(the sequence number the receiver expects to receive) isn't reset.

Spotted on a pair of AR9580 in IBSS mode.

Signed-off-by: Krzysztof Halasa <khalasa@piap.pl>

Comments

Johannes Berg Oct. 28, 2019, 12:21 p.m. UTC | #1
On Fri, 2019-10-25 at 12:21 +0200, Krzysztof Hałasa wrote:
> Fix a bug where the mac80211 RX aggregation code sets a new aggregation
> "session" at the remote station's request, but the head_seq_num
> (the sequence number the receiver expects to receive) isn't reset.
> 
> Spotted on a pair of AR9580 in IBSS mode.
> 
> Signed-off-by: Krzysztof Halasa <khalasa@piap.pl>
> 
> diff --git a/net/mac80211/agg-rx.c b/net/mac80211/agg-rx.c
> index 4d1c335e06e5..67733bd61297 100644
> --- a/net/mac80211/agg-rx.c
> +++ b/net/mac80211/agg-rx.c
> @@ -354,10 +354,13 @@ void ___ieee80211_start_rx_ba_session(struct sta_info *sta,
>  			 */
>  			rcu_read_lock();
>  			tid_rx = rcu_dereference(sta->ampdu_mlme.tid_rx[tid]);
> -			if (tid_rx && tid_rx->timeout == timeout)
> +			if (tid_rx && tid_rx->timeout == timeout) {
> +				tid_rx->ssn = start_seq_num;
> +				tid_rx->head_seq_num = start_seq_num;
>  				status = WLAN_STATUS_SUCCESS;

This is wrong, this is the case of *updating an existing session*, we
must not reset the head SN then.

I think you just got very lucky (or unlucky) to have the same dialog
token, because we start from 0 - maybe we should initialize it to a
random value to flush out such issues.

Really what I think probably happened is that one of your stations lost
the connection to the other, and didn't tell it about it in any way - so
the other kept all the status alive.

I suspect to make all this work well we need to not only have the fixes
I made recently to actually send and parse deauth frames, but also to
even send an auth and reset the state when we receive that, so if we
move out of range and even the deauth frame is lost, we can still reset
properly.

In any case, this is not the right approach - we need to handle the
"lost connection" case better I suspect, but since you don't say what
really happened I don't really know that that's what you're seeing.

johannes
Koen Vandeputte Oct. 29, 2019, 8:41 a.m. UTC | #2
On 28.10.19 13:21, Johannes Berg wrote:
> On Fri, 2019-10-25 at 12:21 +0200, Krzysztof Hałasa wrote:
>> Fix a bug where the mac80211 RX aggregation code sets a new aggregation
>> "session" at the remote station's request, but the head_seq_num
>> (the sequence number the receiver expects to receive) isn't reset.
>>
>> Spotted on a pair of AR9580 in IBSS mode.
>>
>> Signed-off-by: Krzysztof Halasa <khalasa@piap.pl>
>>
>> diff --git a/net/mac80211/agg-rx.c b/net/mac80211/agg-rx.c
>> index 4d1c335e06e5..67733bd61297 100644
>> --- a/net/mac80211/agg-rx.c
>> +++ b/net/mac80211/agg-rx.c
>> @@ -354,10 +354,13 @@ void ___ieee80211_start_rx_ba_session(struct sta_info *sta,
>>   			 */
>>   			rcu_read_lock();
>>   			tid_rx = rcu_dereference(sta->ampdu_mlme.tid_rx[tid]);
>> -			if (tid_rx && tid_rx->timeout == timeout)
>> +			if (tid_rx && tid_rx->timeout == timeout) {
>> +				tid_rx->ssn = start_seq_num;
>> +				tid_rx->head_seq_num = start_seq_num;
>>   				status = WLAN_STATUS_SUCCESS;
> This is wrong, this is the case of *updating an existing session*, we
> must not reset the head SN then.
>
> I think you just got very lucky (or unlucky) to have the same dialog
> token, because we start from 0 - maybe we should initialize it to a
> random value to flush out such issues.
>
> Really what I think probably happened is that one of your stations lost
> the connection to the other, and didn't tell it about it in any way - so
> the other kept all the status alive.
>
> I suspect to make all this work well we need to not only have the fixes
> I made recently to actually send and parse deauth frames, but also to
> even send an auth and reset the state when we receive that, so if we
> move out of range and even the deauth frame is lost, we can still reset
> properly.
>
> In any case, this is not the right approach - we need to handle the
> "lost connection" case better I suspect, but since you don't say what
> really happened I don't really know that that's what you're seeing.
>
> johannes

Hi all,

I can confirm the issue as I'm also seeing this sometimes in the field here.

Sometimes when a devices goes out of range and then re-enters,
the link refuses to "come up", as in rx looks to be "stuck" without any 
reports in system log or locking issues (lockdep enabled)

I have dozens of devices installed offshore (802.11n based), both on 
static and moving assets,
which cover from short (250m) up to very long distances (~35km)

So .. while there is some momentum for this issue,
I'm more than happy to provide extensive testing should fixes be posted 
regarding IBSS in general.

Regards,

Koen
Krzysztof Hałasa Oct. 29, 2019, 8:54 a.m. UTC | #3
Johannes Berg <johannes@sipsolutions.net> writes:

> I think you just got very lucky (or unlucky) to have the same dialog
> token, because we start from 0

Right, it seems to be the case.

> - maybe we should initialize it to a
> random value to flush out such issues.

The problem I can see is that the dialog_tokens are 8-bit, way too small
to eliminate conflicts.

> Really what I think probably happened is that one of your stations lost
> the connection to the other, and didn't tell it about it in any way - so
> the other kept all the status alive.

You must have missed my previous mail - I simply rebooted that station,
and alternatively rmmoded/modprobed ath9k. But the problem originated in
a station going out of and back in range, in fact.

> I suspect to make all this work well we need to not only have the fixes
> I made recently to actually send and parse deauth frames, but also to
> even send an auth and reset the state when we receive that, so if we
> move out of range and even the deauth frame is lost, we can still reset
> properly.

That's one thing. The other is a station trying ADDBA for the first time
after boot (while the local station has seen it before that reboot).

> In any case, this is not the right approach - we need to handle the
> "lost connection" case better I suspect, but since you don't say what
> really happened I don't really know that that's what you're seeing.

I guess we need to identify "new connection" reliably. Otherwise,
the new connections are treated as old ones and it doesn't work.

Now how can it be fixed?
Sebastian Gottschall Oct. 29, 2019, 8:58 a.m. UTC | #4
35 km? for 802.11n with ht40 this is out of the ack timing range the 
chipset supports. so this should be considered at any troubles with 
connections

Am 29.10.2019 um 09:41 schrieb Koen Vandeputte:
>
> On 28.10.19 13:21, Johannes Berg wrote:
>> On Fri, 2019-10-25 at 12:21 +0200, Krzysztof Hałasa wrote:
>>> Fix a bug where the mac80211 RX aggregation code sets a new aggregation
>>> "session" at the remote station's request, but the head_seq_num
>>> (the sequence number the receiver expects to receive) isn't reset.
>>>
>>> Spotted on a pair of AR9580 in IBSS mode.
>>>
>>> Signed-off-by: Krzysztof Halasa <khalasa@piap.pl>
>>>
>>> diff --git a/net/mac80211/agg-rx.c b/net/mac80211/agg-rx.c
>>> index 4d1c335e06e5..67733bd61297 100644
>>> --- a/net/mac80211/agg-rx.c
>>> +++ b/net/mac80211/agg-rx.c
>>> @@ -354,10 +354,13 @@ void ___ieee80211_start_rx_ba_session(struct 
>>> sta_info *sta,
>>>                */
>>>               rcu_read_lock();
>>>               tid_rx = rcu_dereference(sta->ampdu_mlme.tid_rx[tid]);
>>> -            if (tid_rx && tid_rx->timeout == timeout)
>>> +            if (tid_rx && tid_rx->timeout == timeout) {
>>> +                tid_rx->ssn = start_seq_num;
>>> +                tid_rx->head_seq_num = start_seq_num;
>>>                   status = WLAN_STATUS_SUCCESS;
>> This is wrong, this is the case of *updating an existing session*, we
>> must not reset the head SN then.
>>
>> I think you just got very lucky (or unlucky) to have the same dialog
>> token, because we start from 0 - maybe we should initialize it to a
>> random value to flush out such issues.
>>
>> Really what I think probably happened is that one of your stations lost
>> the connection to the other, and didn't tell it about it in any way - so
>> the other kept all the status alive.
>>
>> I suspect to make all this work well we need to not only have the fixes
>> I made recently to actually send and parse deauth frames, but also to
>> even send an auth and reset the state when we receive that, so if we
>> move out of range and even the deauth frame is lost, we can still reset
>> properly.
>>
>> In any case, this is not the right approach - we need to handle the
>> "lost connection" case better I suspect, but since you don't say what
>> really happened I don't really know that that's what you're seeing.
>>
>> johannes
>
> Hi all,
>
> I can confirm the issue as I'm also seeing this sometimes in the field 
> here.
>
> Sometimes when a devices goes out of range and then re-enters,
> the link refuses to "come up", as in rx looks to be "stuck" without 
> any reports in system log or locking issues (lockdep enabled)
>
> I have dozens of devices installed offshore (802.11n based), both on 
> static and moving assets,
> which cover from short (250m) up to very long distances (~35km)
>
> So .. while there is some momentum for this issue,
> I'm more than happy to provide extensive testing should fixes be 
> posted regarding IBSS in general.
>
> Regards,
>
> Koen
>
>
Johannes Berg Oct. 29, 2019, 9:03 a.m. UTC | #5
On Tue, 2019-10-29 at 09:41 +0100, Koen Vandeputte wrote:

> I can confirm the issue as I'm also seeing this sometimes in the field here.
> 
> Sometimes when a devices goes out of range and then re-enters,
> the link refuses to "come up", as in rx looks to be "stuck" without any 
> reports in system log or locking issues (lockdep enabled)

Right. I've recently debugged this due to issues in distributed
beaconing (rather than moving in/out of range), but I guess it would be
relatively simple to reproduce this with wmediumd, if that can be
controlled dynamically?

What kernel are you running? You could check if you have

95697f9907bf ("mac80211: accept deauth frames in IBSS mode")
4b08d1b6a994 ("mac80211: IBSS: send deauth when expiring inactive STAs")

which might help somewhat, but don't fully cover the case of moving out
of range.

johannes
Johannes Berg Oct. 29, 2019, 9:07 a.m. UTC | #6
On Tue, 2019-10-29 at 09:54 +0100, Krzysztof Hałasa wrote:

> The problem I can see is that the dialog_tokens are 8-bit, way too small
> to eliminate conflicts.

Well, they're also per station, we could just randomize the start and
then we'd delete the old session and start a new one, on the receiver.

So that would improve robustness somewhat (down to a 1/256 chance to hit
this problem).

> > Really what I think probably happened is that one of your stations lost
> > the connection to the other, and didn't tell it about it in any way - so
> > the other kept all the status alive.
> 
> You must have missed my previous mail - I simply rebooted that station,
> and alternatively rmmoded/modprobed ath9k. But the problem originated in
> a station going out of and back in range, in fact.

I was on vacation, so yeah, quite possible I missed it.

Sounds like we need not just
4b08d1b6a994 ("mac80211: IBSS: send deauth when expiring inactive STAs")

but also send a deauth when we disconnect. Surprising we don't do that,
actually.

> > I suspect to make all this work well we need to not only have the fixes
> > I made recently to actually send and parse deauth frames, but also to
> > even send an auth and reset the state when we receive that, so if we
> > move out of range and even the deauth frame is lost, we can still reset
> > properly.
> 
> That's one thing. The other is a station trying ADDBA for the first time
> after boot (while the local station has seen it before that reboot).

That's the situation though - the local station needs to know that it
has in fact *not* seen the same instance of the station, but that the
station has reset and needs to be removed & re-added.

> I guess we need to identify "new connection" reliably. Otherwise,
> the new connections are treated as old ones and it doesn't work.

Right. But we can implement the (optional) authentication (which you
actually already get when you implement [encrypted] IBSS with wpa_s),
and reset the station state when we get an authentication frame.

johannes
Koen Vandeputte Oct. 29, 2019, 9:40 a.m. UTC | #7
On 29.10.19 09:58, Sebastian Gottschall wrote:
> 35 km? for 802.11n with ht40 this is out of the ack timing range the 
> chipset supports. so this should be considered at any troubles with 
> connections
>
(Please don't top-post)

When we know a link can exceed ~ 21km, it's set to HT20 for this reason.

Koen

> Am 29.10.2019 um 09:41 schrieb Koen Vandeputte:
>>
>> On 28.10.19 13:21, Johannes Berg wrote:
>>> On Fri, 2019-10-25 at 12:21 +0200, Krzysztof Hałasa wrote:
>>>> Fix a bug where the mac80211 RX aggregation code sets a new 
>>>> aggregation
>>>> "session" at the remote station's request, but the head_seq_num
>>>> (the sequence number the receiver expects to receive) isn't reset.
>>>>
>>>> Spotted on a pair of AR9580 in IBSS mode.
>>>>
>>>> Signed-off-by: Krzysztof Halasa <khalasa@piap.pl>
>>>>
>>>> diff --git a/net/mac80211/agg-rx.c b/net/mac80211/agg-rx.c
>>>> index 4d1c335e06e5..67733bd61297 100644
>>>> --- a/net/mac80211/agg-rx.c
>>>> +++ b/net/mac80211/agg-rx.c
>>>> @@ -354,10 +354,13 @@ void ___ieee80211_start_rx_ba_session(struct 
>>>> sta_info *sta,
>>>>                */
>>>>               rcu_read_lock();
>>>>               tid_rx = rcu_dereference(sta->ampdu_mlme.tid_rx[tid]);
>>>> -            if (tid_rx && tid_rx->timeout == timeout)
>>>> +            if (tid_rx && tid_rx->timeout == timeout) {
>>>> +                tid_rx->ssn = start_seq_num;
>>>> +                tid_rx->head_seq_num = start_seq_num;
>>>>                   status = WLAN_STATUS_SUCCESS;
>>> This is wrong, this is the case of *updating an existing session*, we
>>> must not reset the head SN then.
>>>
>>> I think you just got very lucky (or unlucky) to have the same dialog
>>> token, because we start from 0 - maybe we should initialize it to a
>>> random value to flush out such issues.
>>>
>>> Really what I think probably happened is that one of your stations lost
>>> the connection to the other, and didn't tell it about it in any way 
>>> - so
>>> the other kept all the status alive.
>>>
>>> I suspect to make all this work well we need to not only have the fixes
>>> I made recently to actually send and parse deauth frames, but also to
>>> even send an auth and reset the state when we receive that, so if we
>>> move out of range and even the deauth frame is lost, we can still reset
>>> properly.
>>>
>>> In any case, this is not the right approach - we need to handle the
>>> "lost connection" case better I suspect, but since you don't say what
>>> really happened I don't really know that that's what you're seeing.
>>>
>>> johannes
>>
>> Hi all,
>>
>> I can confirm the issue as I'm also seeing this sometimes in the 
>> field here.
>>
>> Sometimes when a devices goes out of range and then re-enters,
>> the link refuses to "come up", as in rx looks to be "stuck" without 
>> any reports in system log or locking issues (lockdep enabled)
>>
>> I have dozens of devices installed offshore (802.11n based), both on 
>> static and moving assets,
>> which cover from short (250m) up to very long distances (~35km)
>>
>> So .. while there is some momentum for this issue,
>> I'm more than happy to provide extensive testing should fixes be 
>> posted regarding IBSS in general.
>>
>> Regards,
>>
>> Koen
>>
>>
Koen Vandeputte Oct. 29, 2019, 9:47 a.m. UTC | #8
On 29.10.19 10:03, Johannes Berg wrote:
> On Tue, 2019-10-29 at 09:41 +0100, Koen Vandeputte wrote:
>
>> I can confirm the issue as I'm also seeing this sometimes in the field here.
>>
>> Sometimes when a devices goes out of range and then re-enters,
>> the link refuses to "come up", as in rx looks to be "stuck" without any
>> reports in system log or locking issues (lockdep enabled)
> Right. I've recently debugged this due to issues in distributed
> beaconing (rather than moving in/out of range), but I guess it would be
> relatively simple to reproduce this with wmediumd, if that can be
> controlled dynamically?
>
> What kernel are you running? You could check if you have
>
> 95697f9907bf ("mac80211: accept deauth frames in IBSS mode")
> 4b08d1b6a994 ("mac80211: IBSS: send deauth when expiring inactive STAs")
>
> which might help somewhat, but don't fully cover the case of moving out
> of range.
>
> johannes
>
I'm running OpenWrt (kernel 4.14.150 with 4.19.79 mac80211)
I noticed these fixes last week and made a build 2 days ago with them 
backported to it.
Running in the field on roughly 4 devices since a day.

Koen
Krzysztof Hałasa Oct. 29, 2019, 10:51 a.m. UTC | #9
Johannes Berg <johannes@sipsolutions.net> writes:

>> The problem I can see is that the dialog_tokens are 8-bit, way too small
>> to eliminate conflicts.
>
> Well, they're also per station, we could just randomize the start and
> then we'd delete the old session and start a new one, on the receiver.
>
> So that would improve robustness somewhat (down to a 1/256 chance to hit
> this problem).

That was what I meant. Still, 1/256 seems hardly acceptable to me -
unless there is some work around (a short timeout or something similar).
Remember that when it doesn't work, it doesn't work - it won't recover
until the sequence catches up, which may mean basically forever.

Or, maybe the remote station can request de-aggregation first, so the
subsequent aggregation request is always treated as new?

Alternatively, perhaps the remote can signal that it's a new request and
not merely an existing session?

> That's the situation though - the local station needs to know that it
> has in fact *not* seen the same instance of the station, but that the
> station has reset and needs to be removed & re-added.

Precisely. And it seems to me that the first time the local station
learns of this is when a new, regular, non-aggregated packet arrives.
Or, when a new aggregation request arrives.
Johannes Berg Oct. 29, 2019, 10:57 a.m. UTC | #10
On Tue, 2019-10-29 at 11:51 +0100, Krzysztof Hałasa wrote:
> Johannes Berg <johannes@sipsolutions.net> writes:
> 
> > > The problem I can see is that the dialog_tokens are 8-bit, way too small
> > > to eliminate conflicts.
> > 
> > Well, they're also per station, we could just randomize the start and
> > then we'd delete the old session and start a new one, on the receiver.
> > 
> > So that would improve robustness somewhat (down to a 1/256 chance to hit
> > this problem).
> 
> That was what I meant. Still, 1/256 seems hardly acceptable to me -
> unless there is some work around (a short timeout or something similar).
> Remember that when it doesn't work, it doesn't work - it won't recover
> until the sequence catches up, which may mean basically forever.

Agree, it just helps in "most" cases to do this. Perhaps we shouldn't do
this then so that we find the problem more easily...

> Or, maybe the remote station can request de-aggregation first, so the
> subsequent aggregation request is always treated as new?

> Alternatively, perhaps the remote can signal that it's a new request and
> not merely an existing session?

I think we should just implement authentication and reset of the station
properly, instead of fudging around with aggregation. This is just one
possible problematic scenario ... what if the station was reconfigured
with a different number of antennas in the meantime, for example, or
whatnot. There's a lot of state we keep for each station.

> > That's the situation though - the local station needs to know that it
> > has in fact *not* seen the same instance of the station, but that the
> > station has reset and needs to be removed & re-added.
> 
> Precisely. And it seems to me that the first time the local station
> learns of this is when a new, regular, non-aggregated packet arrives.
> Or, when a new aggregation request arrives.

Well, it should learn about the station when there's a beacon from it,
or if not ... we have a patch to force a probe request/response cycle so
we have all the capabilities properly. We should upstream that patch,
but need to do something to avoid being able to use this for traffic
amplification attacks.

johannes
diff mbox series

Patch

diff --git a/net/mac80211/agg-rx.c b/net/mac80211/agg-rx.c
index 4d1c335e06e5..67733bd61297 100644
--- a/net/mac80211/agg-rx.c
+++ b/net/mac80211/agg-rx.c
@@ -354,10 +354,13 @@  void ___ieee80211_start_rx_ba_session(struct sta_info *sta,
 			 */
 			rcu_read_lock();
 			tid_rx = rcu_dereference(sta->ampdu_mlme.tid_rx[tid]);
-			if (tid_rx && tid_rx->timeout == timeout)
+			if (tid_rx && tid_rx->timeout == timeout) {
+				tid_rx->ssn = start_seq_num;
+				tid_rx->head_seq_num = start_seq_num;
 				status = WLAN_STATUS_SUCCESS;
-			else
+			} else {
 				status = WLAN_STATUS_REQUEST_DECLINED;
+			}
 			rcu_read_unlock();
 			goto end;
 		}