Message ID | m37e4tjfbu.fsf@t19.piap.pl (mailing list archive) |
---|---|
State | Rejected |
Delegated to: | Johannes Berg |
Headers | show |
Series | [v2] 802.11n IBSS: wlan0 stops receiving packets due to aggregation after sender reboot | expand |
On Fri, 2019-10-25 at 12:21 +0200, Krzysztof Hałasa wrote: > Fix a bug where the mac80211 RX aggregation code sets a new aggregation > "session" at the remote station's request, but the head_seq_num > (the sequence number the receiver expects to receive) isn't reset. > > Spotted on a pair of AR9580 in IBSS mode. > > Signed-off-by: Krzysztof Halasa <khalasa@piap.pl> > > diff --git a/net/mac80211/agg-rx.c b/net/mac80211/agg-rx.c > index 4d1c335e06e5..67733bd61297 100644 > --- a/net/mac80211/agg-rx.c > +++ b/net/mac80211/agg-rx.c > @@ -354,10 +354,13 @@ void ___ieee80211_start_rx_ba_session(struct sta_info *sta, > */ > rcu_read_lock(); > tid_rx = rcu_dereference(sta->ampdu_mlme.tid_rx[tid]); > - if (tid_rx && tid_rx->timeout == timeout) > + if (tid_rx && tid_rx->timeout == timeout) { > + tid_rx->ssn = start_seq_num; > + tid_rx->head_seq_num = start_seq_num; > status = WLAN_STATUS_SUCCESS; This is wrong, this is the case of *updating an existing session*, we must not reset the head SN then. I think you just got very lucky (or unlucky) to have the same dialog token, because we start from 0 - maybe we should initialize it to a random value to flush out such issues. Really what I think probably happened is that one of your stations lost the connection to the other, and didn't tell it about it in any way - so the other kept all the status alive. I suspect to make all this work well we need to not only have the fixes I made recently to actually send and parse deauth frames, but also to even send an auth and reset the state when we receive that, so if we move out of range and even the deauth frame is lost, we can still reset properly. In any case, this is not the right approach - we need to handle the "lost connection" case better I suspect, but since you don't say what really happened I don't really know that that's what you're seeing. johannes
On 28.10.19 13:21, Johannes Berg wrote: > On Fri, 2019-10-25 at 12:21 +0200, Krzysztof Hałasa wrote: >> Fix a bug where the mac80211 RX aggregation code sets a new aggregation >> "session" at the remote station's request, but the head_seq_num >> (the sequence number the receiver expects to receive) isn't reset. >> >> Spotted on a pair of AR9580 in IBSS mode. >> >> Signed-off-by: Krzysztof Halasa <khalasa@piap.pl> >> >> diff --git a/net/mac80211/agg-rx.c b/net/mac80211/agg-rx.c >> index 4d1c335e06e5..67733bd61297 100644 >> --- a/net/mac80211/agg-rx.c >> +++ b/net/mac80211/agg-rx.c >> @@ -354,10 +354,13 @@ void ___ieee80211_start_rx_ba_session(struct sta_info *sta, >> */ >> rcu_read_lock(); >> tid_rx = rcu_dereference(sta->ampdu_mlme.tid_rx[tid]); >> - if (tid_rx && tid_rx->timeout == timeout) >> + if (tid_rx && tid_rx->timeout == timeout) { >> + tid_rx->ssn = start_seq_num; >> + tid_rx->head_seq_num = start_seq_num; >> status = WLAN_STATUS_SUCCESS; > This is wrong, this is the case of *updating an existing session*, we > must not reset the head SN then. > > I think you just got very lucky (or unlucky) to have the same dialog > token, because we start from 0 - maybe we should initialize it to a > random value to flush out such issues. > > Really what I think probably happened is that one of your stations lost > the connection to the other, and didn't tell it about it in any way - so > the other kept all the status alive. > > I suspect to make all this work well we need to not only have the fixes > I made recently to actually send and parse deauth frames, but also to > even send an auth and reset the state when we receive that, so if we > move out of range and even the deauth frame is lost, we can still reset > properly. > > In any case, this is not the right approach - we need to handle the > "lost connection" case better I suspect, but since you don't say what > really happened I don't really know that that's what you're seeing. > > johannes Hi all, I can confirm the issue as I'm also seeing this sometimes in the field here. Sometimes when a devices goes out of range and then re-enters, the link refuses to "come up", as in rx looks to be "stuck" without any reports in system log or locking issues (lockdep enabled) I have dozens of devices installed offshore (802.11n based), both on static and moving assets, which cover from short (250m) up to very long distances (~35km) So .. while there is some momentum for this issue, I'm more than happy to provide extensive testing should fixes be posted regarding IBSS in general. Regards, Koen
Johannes Berg <johannes@sipsolutions.net> writes: > I think you just got very lucky (or unlucky) to have the same dialog > token, because we start from 0 Right, it seems to be the case. > - maybe we should initialize it to a > random value to flush out such issues. The problem I can see is that the dialog_tokens are 8-bit, way too small to eliminate conflicts. > Really what I think probably happened is that one of your stations lost > the connection to the other, and didn't tell it about it in any way - so > the other kept all the status alive. You must have missed my previous mail - I simply rebooted that station, and alternatively rmmoded/modprobed ath9k. But the problem originated in a station going out of and back in range, in fact. > I suspect to make all this work well we need to not only have the fixes > I made recently to actually send and parse deauth frames, but also to > even send an auth and reset the state when we receive that, so if we > move out of range and even the deauth frame is lost, we can still reset > properly. That's one thing. The other is a station trying ADDBA for the first time after boot (while the local station has seen it before that reboot). > In any case, this is not the right approach - we need to handle the > "lost connection" case better I suspect, but since you don't say what > really happened I don't really know that that's what you're seeing. I guess we need to identify "new connection" reliably. Otherwise, the new connections are treated as old ones and it doesn't work. Now how can it be fixed?
35 km? for 802.11n with ht40 this is out of the ack timing range the chipset supports. so this should be considered at any troubles with connections Am 29.10.2019 um 09:41 schrieb Koen Vandeputte: > > On 28.10.19 13:21, Johannes Berg wrote: >> On Fri, 2019-10-25 at 12:21 +0200, Krzysztof Hałasa wrote: >>> Fix a bug where the mac80211 RX aggregation code sets a new aggregation >>> "session" at the remote station's request, but the head_seq_num >>> (the sequence number the receiver expects to receive) isn't reset. >>> >>> Spotted on a pair of AR9580 in IBSS mode. >>> >>> Signed-off-by: Krzysztof Halasa <khalasa@piap.pl> >>> >>> diff --git a/net/mac80211/agg-rx.c b/net/mac80211/agg-rx.c >>> index 4d1c335e06e5..67733bd61297 100644 >>> --- a/net/mac80211/agg-rx.c >>> +++ b/net/mac80211/agg-rx.c >>> @@ -354,10 +354,13 @@ void ___ieee80211_start_rx_ba_session(struct >>> sta_info *sta, >>> */ >>> rcu_read_lock(); >>> tid_rx = rcu_dereference(sta->ampdu_mlme.tid_rx[tid]); >>> - if (tid_rx && tid_rx->timeout == timeout) >>> + if (tid_rx && tid_rx->timeout == timeout) { >>> + tid_rx->ssn = start_seq_num; >>> + tid_rx->head_seq_num = start_seq_num; >>> status = WLAN_STATUS_SUCCESS; >> This is wrong, this is the case of *updating an existing session*, we >> must not reset the head SN then. >> >> I think you just got very lucky (or unlucky) to have the same dialog >> token, because we start from 0 - maybe we should initialize it to a >> random value to flush out such issues. >> >> Really what I think probably happened is that one of your stations lost >> the connection to the other, and didn't tell it about it in any way - so >> the other kept all the status alive. >> >> I suspect to make all this work well we need to not only have the fixes >> I made recently to actually send and parse deauth frames, but also to >> even send an auth and reset the state when we receive that, so if we >> move out of range and even the deauth frame is lost, we can still reset >> properly. >> >> In any case, this is not the right approach - we need to handle the >> "lost connection" case better I suspect, but since you don't say what >> really happened I don't really know that that's what you're seeing. >> >> johannes > > Hi all, > > I can confirm the issue as I'm also seeing this sometimes in the field > here. > > Sometimes when a devices goes out of range and then re-enters, > the link refuses to "come up", as in rx looks to be "stuck" without > any reports in system log or locking issues (lockdep enabled) > > I have dozens of devices installed offshore (802.11n based), both on > static and moving assets, > which cover from short (250m) up to very long distances (~35km) > > So .. while there is some momentum for this issue, > I'm more than happy to provide extensive testing should fixes be > posted regarding IBSS in general. > > Regards, > > Koen > >
On Tue, 2019-10-29 at 09:41 +0100, Koen Vandeputte wrote: > I can confirm the issue as I'm also seeing this sometimes in the field here. > > Sometimes when a devices goes out of range and then re-enters, > the link refuses to "come up", as in rx looks to be "stuck" without any > reports in system log or locking issues (lockdep enabled) Right. I've recently debugged this due to issues in distributed beaconing (rather than moving in/out of range), but I guess it would be relatively simple to reproduce this with wmediumd, if that can be controlled dynamically? What kernel are you running? You could check if you have 95697f9907bf ("mac80211: accept deauth frames in IBSS mode") 4b08d1b6a994 ("mac80211: IBSS: send deauth when expiring inactive STAs") which might help somewhat, but don't fully cover the case of moving out of range. johannes
On Tue, 2019-10-29 at 09:54 +0100, Krzysztof Hałasa wrote: > The problem I can see is that the dialog_tokens are 8-bit, way too small > to eliminate conflicts. Well, they're also per station, we could just randomize the start and then we'd delete the old session and start a new one, on the receiver. So that would improve robustness somewhat (down to a 1/256 chance to hit this problem). > > Really what I think probably happened is that one of your stations lost > > the connection to the other, and didn't tell it about it in any way - so > > the other kept all the status alive. > > You must have missed my previous mail - I simply rebooted that station, > and alternatively rmmoded/modprobed ath9k. But the problem originated in > a station going out of and back in range, in fact. I was on vacation, so yeah, quite possible I missed it. Sounds like we need not just 4b08d1b6a994 ("mac80211: IBSS: send deauth when expiring inactive STAs") but also send a deauth when we disconnect. Surprising we don't do that, actually. > > I suspect to make all this work well we need to not only have the fixes > > I made recently to actually send and parse deauth frames, but also to > > even send an auth and reset the state when we receive that, so if we > > move out of range and even the deauth frame is lost, we can still reset > > properly. > > That's one thing. The other is a station trying ADDBA for the first time > after boot (while the local station has seen it before that reboot). That's the situation though - the local station needs to know that it has in fact *not* seen the same instance of the station, but that the station has reset and needs to be removed & re-added. > I guess we need to identify "new connection" reliably. Otherwise, > the new connections are treated as old ones and it doesn't work. Right. But we can implement the (optional) authentication (which you actually already get when you implement [encrypted] IBSS with wpa_s), and reset the station state when we get an authentication frame. johannes
On 29.10.19 09:58, Sebastian Gottschall wrote: > 35 km? for 802.11n with ht40 this is out of the ack timing range the > chipset supports. so this should be considered at any troubles with > connections > (Please don't top-post) When we know a link can exceed ~ 21km, it's set to HT20 for this reason. Koen > Am 29.10.2019 um 09:41 schrieb Koen Vandeputte: >> >> On 28.10.19 13:21, Johannes Berg wrote: >>> On Fri, 2019-10-25 at 12:21 +0200, Krzysztof Hałasa wrote: >>>> Fix a bug where the mac80211 RX aggregation code sets a new >>>> aggregation >>>> "session" at the remote station's request, but the head_seq_num >>>> (the sequence number the receiver expects to receive) isn't reset. >>>> >>>> Spotted on a pair of AR9580 in IBSS mode. >>>> >>>> Signed-off-by: Krzysztof Halasa <khalasa@piap.pl> >>>> >>>> diff --git a/net/mac80211/agg-rx.c b/net/mac80211/agg-rx.c >>>> index 4d1c335e06e5..67733bd61297 100644 >>>> --- a/net/mac80211/agg-rx.c >>>> +++ b/net/mac80211/agg-rx.c >>>> @@ -354,10 +354,13 @@ void ___ieee80211_start_rx_ba_session(struct >>>> sta_info *sta, >>>> */ >>>> rcu_read_lock(); >>>> tid_rx = rcu_dereference(sta->ampdu_mlme.tid_rx[tid]); >>>> - if (tid_rx && tid_rx->timeout == timeout) >>>> + if (tid_rx && tid_rx->timeout == timeout) { >>>> + tid_rx->ssn = start_seq_num; >>>> + tid_rx->head_seq_num = start_seq_num; >>>> status = WLAN_STATUS_SUCCESS; >>> This is wrong, this is the case of *updating an existing session*, we >>> must not reset the head SN then. >>> >>> I think you just got very lucky (or unlucky) to have the same dialog >>> token, because we start from 0 - maybe we should initialize it to a >>> random value to flush out such issues. >>> >>> Really what I think probably happened is that one of your stations lost >>> the connection to the other, and didn't tell it about it in any way >>> - so >>> the other kept all the status alive. >>> >>> I suspect to make all this work well we need to not only have the fixes >>> I made recently to actually send and parse deauth frames, but also to >>> even send an auth and reset the state when we receive that, so if we >>> move out of range and even the deauth frame is lost, we can still reset >>> properly. >>> >>> In any case, this is not the right approach - we need to handle the >>> "lost connection" case better I suspect, but since you don't say what >>> really happened I don't really know that that's what you're seeing. >>> >>> johannes >> >> Hi all, >> >> I can confirm the issue as I'm also seeing this sometimes in the >> field here. >> >> Sometimes when a devices goes out of range and then re-enters, >> the link refuses to "come up", as in rx looks to be "stuck" without >> any reports in system log or locking issues (lockdep enabled) >> >> I have dozens of devices installed offshore (802.11n based), both on >> static and moving assets, >> which cover from short (250m) up to very long distances (~35km) >> >> So .. while there is some momentum for this issue, >> I'm more than happy to provide extensive testing should fixes be >> posted regarding IBSS in general. >> >> Regards, >> >> Koen >> >>
On 29.10.19 10:03, Johannes Berg wrote: > On Tue, 2019-10-29 at 09:41 +0100, Koen Vandeputte wrote: > >> I can confirm the issue as I'm also seeing this sometimes in the field here. >> >> Sometimes when a devices goes out of range and then re-enters, >> the link refuses to "come up", as in rx looks to be "stuck" without any >> reports in system log or locking issues (lockdep enabled) > Right. I've recently debugged this due to issues in distributed > beaconing (rather than moving in/out of range), but I guess it would be > relatively simple to reproduce this with wmediumd, if that can be > controlled dynamically? > > What kernel are you running? You could check if you have > > 95697f9907bf ("mac80211: accept deauth frames in IBSS mode") > 4b08d1b6a994 ("mac80211: IBSS: send deauth when expiring inactive STAs") > > which might help somewhat, but don't fully cover the case of moving out > of range. > > johannes > I'm running OpenWrt (kernel 4.14.150 with 4.19.79 mac80211) I noticed these fixes last week and made a build 2 days ago with them backported to it. Running in the field on roughly 4 devices since a day. Koen
Johannes Berg <johannes@sipsolutions.net> writes: >> The problem I can see is that the dialog_tokens are 8-bit, way too small >> to eliminate conflicts. > > Well, they're also per station, we could just randomize the start and > then we'd delete the old session and start a new one, on the receiver. > > So that would improve robustness somewhat (down to a 1/256 chance to hit > this problem). That was what I meant. Still, 1/256 seems hardly acceptable to me - unless there is some work around (a short timeout or something similar). Remember that when it doesn't work, it doesn't work - it won't recover until the sequence catches up, which may mean basically forever. Or, maybe the remote station can request de-aggregation first, so the subsequent aggregation request is always treated as new? Alternatively, perhaps the remote can signal that it's a new request and not merely an existing session? > That's the situation though - the local station needs to know that it > has in fact *not* seen the same instance of the station, but that the > station has reset and needs to be removed & re-added. Precisely. And it seems to me that the first time the local station learns of this is when a new, regular, non-aggregated packet arrives. Or, when a new aggregation request arrives.
On Tue, 2019-10-29 at 11:51 +0100, Krzysztof Hałasa wrote: > Johannes Berg <johannes@sipsolutions.net> writes: > > > > The problem I can see is that the dialog_tokens are 8-bit, way too small > > > to eliminate conflicts. > > > > Well, they're also per station, we could just randomize the start and > > then we'd delete the old session and start a new one, on the receiver. > > > > So that would improve robustness somewhat (down to a 1/256 chance to hit > > this problem). > > That was what I meant. Still, 1/256 seems hardly acceptable to me - > unless there is some work around (a short timeout or something similar). > Remember that when it doesn't work, it doesn't work - it won't recover > until the sequence catches up, which may mean basically forever. Agree, it just helps in "most" cases to do this. Perhaps we shouldn't do this then so that we find the problem more easily... > Or, maybe the remote station can request de-aggregation first, so the > subsequent aggregation request is always treated as new? > Alternatively, perhaps the remote can signal that it's a new request and > not merely an existing session? I think we should just implement authentication and reset of the station properly, instead of fudging around with aggregation. This is just one possible problematic scenario ... what if the station was reconfigured with a different number of antennas in the meantime, for example, or whatnot. There's a lot of state we keep for each station. > > That's the situation though - the local station needs to know that it > > has in fact *not* seen the same instance of the station, but that the > > station has reset and needs to be removed & re-added. > > Precisely. And it seems to me that the first time the local station > learns of this is when a new, regular, non-aggregated packet arrives. > Or, when a new aggregation request arrives. Well, it should learn about the station when there's a beacon from it, or if not ... we have a patch to force a probe request/response cycle so we have all the capabilities properly. We should upstream that patch, but need to do something to avoid being able to use this for traffic amplification attacks. johannes
diff --git a/net/mac80211/agg-rx.c b/net/mac80211/agg-rx.c index 4d1c335e06e5..67733bd61297 100644 --- a/net/mac80211/agg-rx.c +++ b/net/mac80211/agg-rx.c @@ -354,10 +354,13 @@ void ___ieee80211_start_rx_ba_session(struct sta_info *sta, */ rcu_read_lock(); tid_rx = rcu_dereference(sta->ampdu_mlme.tid_rx[tid]); - if (tid_rx && tid_rx->timeout == timeout) + if (tid_rx && tid_rx->timeout == timeout) { + tid_rx->ssn = start_seq_num; + tid_rx->head_seq_num = start_seq_num; status = WLAN_STATUS_SUCCESS; - else + } else { status = WLAN_STATUS_REQUEST_DECLINED; + } rcu_read_unlock(); goto end; }
Fix a bug where the mac80211 RX aggregation code sets a new aggregation "session" at the remote station's request, but the head_seq_num (the sequence number the receiver expects to receive) isn't reset. Spotted on a pair of AR9580 in IBSS mode. Signed-off-by: Krzysztof Halasa <khalasa@piap.pl>