mbox series

[0/4] Packet/beacon loss roaming improvements

Message ID 20231030134837.452957-1-prestwoj@gmail.com (mailing list archive)
Headers show
Series Packet/beacon loss roaming improvements | expand

Message

James Prestwood Oct. 30, 2023, 1:48 p.m. UTC
The rate limiting patch shouldn't be too controversial, but the
forced roam on beacon loss may be so I wanted to provde some
background. This behavior was only tested on ath10k, so I understand
if we don't want to impose the same behavior for all drivers.

We were seeing beacon loss events not resulting in an immediate
disconnnect (as I have always expected), still eventually but after
plenty of time to roam. I initially added handling for
beacon loss identical to packet loss (try and find a better BSS) but
noticed that if IWD did not find a better candidate it resulted in a
disconnect 100% of the time. I watched a client for a full day and
whenever beacon loss events arrived they were followed by a
disconnect within ~5-6 seconds if IWD did not roam. This led me to
believe that at least on ath10k a beacon loss is still very much a
sign that a disconnect is going to come, we just have a bit more time
than other drivers. This was the motivation behind re-using/re-naming
the 'ap_directed_roam' flag in order to force IWD off the BSS.

Again, this is just one driver. Maybe other drivers can
handle/recover from beacon loss. If we instead want to keep the
behavior the same as packet loss I'm ok with that (I can keep the
patch locally), or put the forced roam behavior behind a user
option e.g. [Roam].ForceRoamOnBeaconLoss

James Prestwood (4):
  station: rename ap_directed_roam to force_roam
  station: start roam on beacon loss event
  netdev: handle/send beacon loss event
  station: rate limit packet loss roam scans

 src/netdev.c  |  6 ++++-
 src/netdev.h  |  1 +
 src/station.c | 61 +++++++++++++++++++++++++++++++++++++++++++--------
 3 files changed, 58 insertions(+), 10 deletions(-)

Comments

Denis Kenzior Oct. 30, 2023, 3 p.m. UTC | #1
Hi James,

> 
> We were seeing beacon loss events not resulting in an immediate
> disconnnect (as I have always expected), still eventually but after

If I recall correctly, Lost Beacon is sent after several beacons in succession 
were lost.  You are right that this could just be bad luck and doesn't actually 
mean that no packets are getting through.  However, in practice mac80211 almost 
always disconnected us soon after.  Didn't we test this pretty thoroughly?

My memory is fuzzy here, but I seem to recall that power save has an effect on 
how lost beacon events are treated by mac80211.  Maybe your recent power save 
patches had an effect?

> plenty of time to roam. I initially added handling for
> beacon loss identical to packet loss (try and find a better BSS) but
> noticed that if IWD did not find a better candidate it resulted in a
> disconnect 100% of the time. I watched a client for a full day and
> whenever beacon loss events arrived they were followed by a
> disconnect within ~5-6 seconds if IWD did not roam. This led me to
> believe that at least on ath10k a beacon loss is still very much a
> sign that a disconnect is going to come, we just have a bit more time
> than other drivers. This was the motivation behind re-using/re-naming
> the 'ap_directed_roam' flag in order to force IWD off the BSS.
> 

ath10k is still a mac80211 driver, no?  Given that we did test Lost Beacon event 
behavior before, I would like some more data points before being convinced it is 
a driver behavior change.

> Again, this is just one driver. Maybe other drivers can
> handle/recover from beacon loss. If we instead want to keep the
> behavior the same as packet loss I'm ok with that (I can keep the
> patch locally), or put the forced roam behavior behind a user
> option e.g. [Roam].ForceRoamOnBeaconLoss

If this is a driver behavior quirk, then this belongs in src/wiphy.c 
driver_infos table somehow.  I'd really rather not add a bazillion config 
options that address the bug-of-the-day.

> 
> James Prestwood (4):
>    station: rename ap_directed_roam to force_roam
>    station: start roam on beacon loss event
>    netdev: handle/send beacon loss event
>    station: rate limit packet loss roam scans
> 
>   src/netdev.c  |  6 ++++-
>   src/netdev.h  |  1 +
>   src/station.c | 61 +++++++++++++++++++++++++++++++++++++++++++--------
>   3 files changed, 58 insertions(+), 10 deletions(-)
> 

Regards,
-Denis
James Prestwood Oct. 30, 2023, 3:37 p.m. UTC | #2
Hi Denis,

On 10/30/23 8:00 AM, Denis Kenzior wrote:
> Hi James,
> 
>>
>> We were seeing beacon loss events not resulting in an immediate
>> disconnnect (as I have always expected), still eventually but after
> 
> If I recall correctly, Lost Beacon is sent after several beacons in 
> succession were lost.  You are right that this could just be bad luck 
> and doesn't actually mean that no packets are getting through.  However, 
> in practice mac80211 almost always disconnected us soon after.  Didn't 
> we test this pretty thoroughly?

Yes, it appears mac80211 by default waits for 7 missed beacons before 
sending the event. It then probes the AP (either nullfunc or probe 
request) so apparently the connection could be recovered if the AP 
responded. Unfortunately we don't get any notification in userspace if 
the AP responded or not...

I can't remember what hardware was tested. But there really wasn't a 
consistent way to test this. The testing involved me disabling roaming 
and walking away from the AP until I got disconnected. Sometimes this 
was due to beacon loss, sometimes the AP disconnected explicitly. But 
what I do remember is when beacon loss occurred, a local disconnect 
followed near immediately. This is why (I think) we thought there was no 
reason to handle this event.

> 
> My memory is fuzzy here, but I seem to recall that power save has an 
> effect on how lost beacon events are treated by mac80211.  Maybe your 
> recent power save patches had an effect?

 From what I can tell in mac80211 power save doesn't change handling. 
Its the driver that tells mac80211 of the beacon loss but maybe the 
driver (or firmware) could handle it differently depending on power save.

When I was watching this device power save was disabled.

> 
>> plenty of time to roam. I initially added handling for
>> beacon loss identical to packet loss (try and find a better BSS) but
>> noticed that if IWD did not find a better candidate it resulted in a
>> disconnect 100% of the time. I watched a client for a full day and
>> whenever beacon loss events arrived they were followed by a
>> disconnect within ~5-6 seconds if IWD did not roam. This led me to
>> believe that at least on ath10k a beacon loss is still very much a
>> sign that a disconnect is going to come, we just have a bit more time
>> than other drivers. This was the motivation behind re-using/re-naming
>> the 'ap_directed_roam' flag in order to force IWD off the BSS.
>>
> 
> ath10k is still a mac80211 driver, no?  Given that we did test Lost 
> Beacon event behavior before, I would like some more data points before 
> being convinced it is a driver behavior change.
> 
>> Again, this is just one driver. Maybe other drivers can
>> handle/recover from beacon loss. If we instead want to keep the
>> behavior the same as packet loss I'm ok with that (I can keep the
>> patch locally), or put the forced roam behavior behind a user
>> option e.g. [Roam].ForceRoamOnBeaconLoss
> 
> If this is a driver behavior quirk, then this belongs in src/wiphy.c 
> driver_infos table somehow.  I'd really rather not add a bazillion 
> config options that address the bug-of-the-day.

Yeah, adding a driver specific quirk doesn't seem like the right route.

I think for now there is no harm in trying to roam on beacon loss, 
basically the same handling as packet loss. If a disconnect comes 
immediately the scan would be canceled. Otherwise maybe we get lucky and 
be able to roam.

Since our specific hardware/use case seems to benefit from the forced 
roam I can keep that change out of tree (just set ap_directed_roaming), 
at least until more testing can be done, or if others report similar 
behavior.

Thanks,
James

> 
>>
>> James Prestwood (4):
>>    station: rename ap_directed_roam to force_roam
>>    station: start roam on beacon loss event
>>    netdev: handle/send beacon loss event
>>    station: rate limit packet loss roam scans
>>
>>   src/netdev.c  |  6 ++++-
>>   src/netdev.h  |  1 +
>>   src/station.c | 61 +++++++++++++++++++++++++++++++++++++++++++--------
>>   3 files changed, 58 insertions(+), 10 deletions(-)
>>
> 
> Regards,
> -Denis
Denis Kenzior Oct. 30, 2023, 5:05 p.m. UTC | #3
Hi James,

On 10/30/23 10:37, James Prestwood wrote:
> Hi Denis,
> 
> On 10/30/23 8:00 AM, Denis Kenzior wrote:
>> Hi James,
>>
>>>
>>> We were seeing beacon loss events not resulting in an immediate
>>> disconnnect (as I have always expected), still eventually but after
>>
>> If I recall correctly, Lost Beacon is sent after several beacons in succession 
>> were lost.  You are right that this could just be bad luck and doesn't 
>> actually mean that no packets are getting through.  However, in practice 
>> mac80211 almost always disconnected us soon after.  Didn't we test this pretty 
>> thoroughly?
> 
> Yes, it appears mac80211 by default waits for 7 missed beacons before sending 
> the event. It then probes the AP (either nullfunc or probe request) so 
> apparently the connection could be recovered if the AP responded. Unfortunately 
> we don't get any notification in userspace if the AP responded or not...

So this magic here?
https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next.git/tree/net/mac80211/mlme.c#n3215

> 
> I can't remember what hardware was tested. But there really wasn't a consistent 
> way to test this. The testing involved me disabling roaming and walking away 
> from the AP until I got disconnected. Sometimes this was due to beacon loss, 
> sometimes the AP disconnected explicitly. But what I do remember is when beacon 
> loss occurred, a local disconnect followed near immediately. This is why (I 
> think) we thought there was no reason to handle this event.

So what does your ath10k driver/hw do?  Does it send nullfuncs or probe requests?

> 
>>
>> My memory is fuzzy here, but I seem to recall that power save has an effect on 
>> how lost beacon events are treated by mac80211.  Maybe your recent power save 
>> patches had an effect?
> 
>  From what I can tell in mac80211 power save doesn't change handling. Its the 
> driver that tells mac80211 of the beacon loss but maybe the driver (or firmware) 
> could handle it differently depending on power save.
> 
> When I was watching this device power save was disabled.

Okay, fair enough.

>> If this is a driver behavior quirk, then this belongs in src/wiphy.c 
>> driver_infos table somehow.  I'd really rather not add a bazillion config 
>> options that address the bug-of-the-day.
> 
> Yeah, adding a driver specific quirk doesn't seem like the right route.
> 
> I think for now there is no harm in trying to roam on beacon loss, basically the 
> same handling as packet loss. If a disconnect comes immediately the scan would 
> be canceled. Otherwise maybe we get lucky and be able to roam.

So the problem is, we had the _exact_ same behavior you're proposing here.  We 
took it out.  See commit:
836beb1276d1 ("station/wsc: remove beacon loss handling")

So when we do that, alarm bells start going off.  Why did we get rid of it if it 
was useful?

7 consecutive lost beacons is actually a lot.  That's ~700ms with no connection 
with default settings.  And you can maintain the connection after that for 
another 5-6?  Something smells fishy.

If the kernel has a hard limit after which it expects the connection to be 
disconnected, we can start a timer for 2-4x that limit?  Looks like kernel uses 
probe_wait_ms parameter for this with a default of 500ms.  Is your setup using 
the default values for beacon_loss_count and probe_wait_ms?

Regards,
-Denis
James Prestwood Oct. 30, 2023, 5:37 p.m. UTC | #4
Hi Denis,

On 10/30/23 10:05 AM, Denis Kenzior wrote:
> Hi James,
> 
> On 10/30/23 10:37, James Prestwood wrote:
>> Hi Denis,
>>
>> On 10/30/23 8:00 AM, Denis Kenzior wrote:
>>> Hi James,
>>>
>>>>
>>>> We were seeing beacon loss events not resulting in an immediate
>>>> disconnnect (as I have always expected), still eventually but after
>>>
>>> If I recall correctly, Lost Beacon is sent after several beacons in 
>>> succession were lost.  You are right that this could just be bad luck 
>>> and doesn't actually mean that no packets are getting through.  
>>> However, in practice mac80211 almost always disconnected us soon 
>>> after.  Didn't we test this pretty thoroughly?
>>
>> Yes, it appears mac80211 by default waits for 7 missed beacons before 
>> sending the event. It then probes the AP (either nullfunc or probe 
>> request) so apparently the connection could be recovered if the AP 
>> responded. Unfortunately we don't get any notification in userspace if 
>> the AP responded or not...
> 
> So this magic here?
> https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next.git/tree/net/mac80211/mlme.c#n3215

Yep

> 
>>
>> I can't remember what hardware was tested. But there really wasn't a 
>> consistent way to test this. The testing involved me disabling roaming 
>> and walking away from the AP until I got disconnected. Sometimes this 
>> was due to beacon loss, sometimes the AP disconnected explicitly. But 
>> what I do remember is when beacon loss occurred, a local disconnect 
>> followed near immediately. This is why (I think) we thought there was 
>> no reason to handle this event.
> 
> So what does your ath10k driver/hw do?  Does it send nullfuncs or probe 
> requests?

Based on the driver it sets REPORTS_TX_ACK_STATUS, so nullfunc. Which 
would add an extra second before disconnect by default (2 nullfuncs 
500ms apart).

> 
>>
>>>
>>> My memory is fuzzy here, but I seem to recall that power save has an 
>>> effect on how lost beacon events are treated by mac80211.  Maybe your 
>>> recent power save patches had an effect?
>>
>>  From what I can tell in mac80211 power save doesn't change handling. 
>> Its the driver that tells mac80211 of the beacon loss but maybe the 
>> driver (or firmware) could handle it differently depending on power save.
>>
>> When I was watching this device power save was disabled.
> 
> Okay, fair enough.
> 
>>> If this is a driver behavior quirk, then this belongs in src/wiphy.c 
>>> driver_infos table somehow.  I'd really rather not add a bazillion 
>>> config options that address the bug-of-the-day.
>>
>> Yeah, adding a driver specific quirk doesn't seem like the right route.
>>
>> I think for now there is no harm in trying to roam on beacon loss, 
>> basically the same handling as packet loss. If a disconnect comes 
>> immediately the scan would be canceled. Otherwise maybe we get lucky 
>> and be able to roam.
> 
> So the problem is, we had the _exact_ same behavior you're proposing 
> here.  We took it out.  See commit:
> 836beb1276d1 ("station/wsc: remove beacon loss handling")
> 
> So when we do that, alarm bells start going off.  Why did we get rid of 
> it if it was useful?

Huh, apparently it was handled previously. I really have no idea, 
apparently I thought something changed in the kernel. Maybe based on 
testing only certain hardware.

Looking back to kernel 5.3 (since 5.4+ was mentioned in that commit) I 
really don't see much difference either. This device was on 5.11, and 
again, not much different than current upstream in that regard.

> 
> 7 consecutive lost beacons is actually a lot.  That's ~700ms with no 
> connection with default settings.  And you can maintain the connection 
> after that for another 5-6?  Something smells fishy.

According to the logs, yes. Looking back to the logs in the commit it 
was actually 4 seconds.

Still, I'm not sure how it could get that far without a second lost 
beacon event (e.g. miss beacons, but get a nullfunc reply, miss 7 more, 
nullfunc reply etc.). With a single event the longest it could get drug 
out would be 1.7 seconds (7 beacons + 2 nullfuncs) before either 
disconnecting or recovering but missing more beacons and generating an 
event, unless there is some other reason=4 failure path I'm missing.


> If the kernel has a hard limit after which it expects the connection to 
> be disconnected, we can start a timer for 2-4x that limit?  Looks like 
> kernel uses probe_wait_ms parameter for this with a default of 500ms.  
> Is your setup using the default values for beacon_loss_count and 
> probe_wait_ms?

Yep, everything is default as far as nl80211/driver options.

> 
> Regards,
> -Denis
James Prestwood Nov. 1, 2023, 12:07 p.m. UTC | #5
Hi Denis,

On 10/30/23 10:37 AM, James Prestwood wrote:
> Hi Denis,
> 
> On 10/30/23 10:05 AM, Denis Kenzior wrote:
>> Hi James,
>>
>> On 10/30/23 10:37, James Prestwood wrote:
>>> Hi Denis,
>>>
>>> On 10/30/23 8:00 AM, Denis Kenzior wrote:
>>>> Hi James,
>>>>
>>>>>
>>>>> We were seeing beacon loss events not resulting in an immediate
>>>>> disconnnect (as I have always expected), still eventually but after
>>>>
>>>> If I recall correctly, Lost Beacon is sent after several beacons in 
>>>> succession were lost.  You are right that this could just be bad 
>>>> luck and doesn't actually mean that no packets are getting through. 
>>>> However, in practice mac80211 almost always disconnected us soon 
>>>> after.  Didn't we test this pretty thoroughly?
>>>
>>> Yes, it appears mac80211 by default waits for 7 missed beacons before 
>>> sending the event. It then probes the AP (either nullfunc or probe 
>>> request) so apparently the connection could be recovered if the AP 
>>> responded. Unfortunately we don't get any notification in userspace 
>>> if the AP responded or not...
>>
>> So this magic here?
>> https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next.git/tree/net/mac80211/mlme.c#n3215
> 
> Yep
> 
>>
>>>
>>> I can't remember what hardware was tested. But there really wasn't a 
>>> consistent way to test this. The testing involved me disabling 
>>> roaming and walking away from the AP until I got disconnected. 
>>> Sometimes this was due to beacon loss, sometimes the AP disconnected 
>>> explicitly. But what I do remember is when beacon loss occurred, a 
>>> local disconnect followed near immediately. This is why (I think) we 
>>> thought there was no reason to handle this event.
>>
>> So what does your ath10k driver/hw do?  Does it send nullfuncs or 
>> probe requests?
> 
> Based on the driver it sets REPORTS_TX_ACK_STATUS, so nullfunc. Which 
> would add an extra second before disconnect by default (2 nullfuncs 
> 500ms apart).
> 
>>
>>>
>>>>
>>>> My memory is fuzzy here, but I seem to recall that power save has an 
>>>> effect on how lost beacon events are treated by mac80211.  Maybe 
>>>> your recent power save patches had an effect?
>>>
>>>  From what I can tell in mac80211 power save doesn't change handling. 
>>> Its the driver that tells mac80211 of the beacon loss but maybe the 
>>> driver (or firmware) could handle it differently depending on power 
>>> save.
>>>
>>> When I was watching this device power save was disabled.
>>
>> Okay, fair enough.
>>
>>>> If this is a driver behavior quirk, then this belongs in src/wiphy.c 
>>>> driver_infos table somehow.  I'd really rather not add a bazillion 
>>>> config options that address the bug-of-the-day.
>>>
>>> Yeah, adding a driver specific quirk doesn't seem like the right route.
>>>
>>> I think for now there is no harm in trying to roam on beacon loss, 
>>> basically the same handling as packet loss. If a disconnect comes 
>>> immediately the scan would be canceled. Otherwise maybe we get lucky 
>>> and be able to roam.
>>
>> So the problem is, we had the _exact_ same behavior you're proposing 
>> here.  We took it out.  See commit:
>> 836beb1276d1 ("station/wsc: remove beacon loss handling")
>>
>> So when we do that, alarm bells start going off.  Why did we get rid 
>> of it if it was useful?
> 
> Huh, apparently it was handled previously. I really have no idea, 
> apparently I thought something changed in the kernel. Maybe based on 
> testing only certain hardware.
> 
> Looking back to kernel 5.3 (since 5.4+ was mentioned in that commit) I 
> really don't see much difference either. This device was on 5.11, and 
> again, not much different than current upstream in that regard.
> 
>>
>> 7 consecutive lost beacons is actually a lot.  That's ~700ms with no 
>> connection with default settings.  And you can maintain the connection 
>> after that for another 5-6?  Something smells fishy.
> 
> According to the logs, yes. Looking back to the logs in the commit it 
> was actually 4 seconds.
> 
> Still, I'm not sure how it could get that far without a second lost 
> beacon event (e.g. miss beacons, but get a nullfunc reply, miss 7 more, 
> nullfunc reply etc.). With a single event the longest it could get drug 
> out would be 1.7 seconds (7 beacons + 2 nullfuncs) before either 
> disconnecting or recovering but missing more beacons and generating an 
> event, unless there is some other reason=4 failure path I'm missing.
> 
> 
>> If the kernel has a hard limit after which it expects the connection 
>> to be disconnected, we can start a timer for 2-4x that limit?  Looks 
>> like kernel uses probe_wait_ms parameter for this with a default of 
>> 500ms. Is your setup using the default values for beacon_loss_count 
>> and probe_wait_ms?
> 
> Yep, everything is default as far as nl80211/driver options.

I found an old thread I asked on linux-wireless back when I removed the 
beacon loss handling [1]. Johannes explained that behavior did 
apparently change in order to support power save (your memory was correct).

I identified the behavior change with hwsim, and apparently took his 
word when he said:

"I'm pretty sure real hardware will behave just like hwsim here, albeit 
perhaps a bit slower"

And I never added "proper" testing of beacon loss, i.e. block several 
beacons as opposed to just tearing down the AP. And this appears closer 
to the behavior I'm seeing in reality (AP not going down, just dropping 
beacons).

So this is my bad, I didn't take into account the situation of beacons 
being dropped but then being picked up again, or the nullfuncs/probes 
coming back successfully.

[1] 
https://lore.kernel.org/linux-wireless/ada14dfad76b93d654606c3b397de059d968096b.camel@gmail.com/

Thanks,
James
> 
>>
>> Regards,
>> -Denis
Denis Kenzior Nov. 2, 2023, 1:39 a.m. UTC | #6
Hi James,

>>
>>> If the kernel has a hard limit after which it expects the connection to be 
>>> disconnected, we can start a timer for 2-4x that limit?  Looks like kernel 
>>> uses probe_wait_ms parameter for this with a default of 500ms. Is your setup 
>>> using the default values for beacon_loss_count and probe_wait_ms?
>>
>> Yep, everything is default as far as nl80211/driver options.
> 
> I found an old thread I asked on linux-wireless back when I removed the beacon 
> loss handling [1]. Johannes explained that behavior did apparently change in 
> order to support power save (your memory was correct).
> 

Hah, funny.

> I identified the behavior change with hwsim, and apparently took his word when 
> he said:
> 
> "I'm pretty sure real hardware will behave just like hwsim here, albeit perhaps 
> a bit slower"
> 
> And I never added "proper" testing of beacon loss, i.e. block several beacons as 
> opposed to just tearing down the AP. And this appears closer to the behavior I'm 
> seeing in reality (AP not going down, just dropping beacons).

Right.

> 
> So this is my bad, I didn't take into account the situation of beacons being 
> dropped but then being picked up again, or the nullfuncs/probes coming back 
> successfully.

So the question is still how we handle this properly.  When the kernel sends the 
nullfunc / probe request, are we going to be interfering with that logic by 
initiating a scan right away?  A scan would force the device to go offchannel, 
preventing it from receiving the AP's response.  Given that nullfunc should be 
extremely fast, maybe we should delay any roam action we take by a second or 
two?  If the AP is truly lost, then we'll get disconnected soon after the 
LOST_BEACON event.  If it isn't, then an extra second/two is not going to make 
much difference since the scan should be limited and quite fast most of the time.

Regards,
-Denis
James Prestwood Nov. 2, 2023, 11:58 a.m. UTC | #7
Hi Denis,

On 11/1/23 6:39 PM, Denis Kenzior wrote:
> Hi James,
> 
>>>
>>>> If the kernel has a hard limit after which it expects the connection 
>>>> to be disconnected, we can start a timer for 2-4x that limit?  Looks 
>>>> like kernel uses probe_wait_ms parameter for this with a default of 
>>>> 500ms. Is your setup using the default values for beacon_loss_count 
>>>> and probe_wait_ms?
>>>
>>> Yep, everything is default as far as nl80211/driver options.
>>
>> I found an old thread I asked on linux-wireless back when I removed 
>> the beacon loss handling [1]. Johannes explained that behavior did 
>> apparently change in order to support power save (your memory was 
>> correct).
>>
> 
> Hah, funny.
> 
>> I identified the behavior change with hwsim, and apparently took his 
>> word when he said:
>>
>> "I'm pretty sure real hardware will behave just like hwsim here, 
>> albeit perhaps a bit slower"
>>
>> And I never added "proper" testing of beacon loss, i.e. block several 
>> beacons as opposed to just tearing down the AP. And this appears 
>> closer to the behavior I'm seeing in reality (AP not going down, just 
>> dropping beacons).
> 
> Right.
> 
>>
>> So this is my bad, I didn't take into account the situation of beacons 
>> being dropped but then being picked up again, or the nullfuncs/probes 
>> coming back successfully.
> 
> So the question is still how we handle this properly.  When the kernel 
> sends the nullfunc / probe request, are we going to be interfering with 
> that logic by initiating a scan right away?  A scan would force the 
> device to go offchannel, preventing it from receiving the AP's 
> response.  Given that nullfunc should be extremely fast, maybe we should 
> delay any roam action we take by a second or two?  If the AP is truly 
> lost, then we'll get disconnected soon after the LOST_BEACON event.  If 
> it isn't, then an extra second/two is not going to make much difference 
> since the scan should be limited and quite fast most of the time.

Yes scanning likely kills all chances of recovery. And this may very 
well be why I was seeing a 100% rate of disconnections. Granted, our 
devices are always moving so its very likely that once beacon/packet 
loss starts its only going to get worse, but that's not the case for 
everyone. Scanning just makes things worse.

I'm fine adding similar handling that I added for packet loss, except 
always delay rather than only on additional events. But I would like to 
explore other options in the future.

I'm not sure how, but being able to detect if the AP responded to 
nullfunc/probes prior to the kernel blowing away the connection would be 
great. (like send our own nullfunc frames or something, not really sure...)

Its taking IWD about 4-5 seconds to reconnect, 3 seconds are the quick 
scan and ~1-2 seconds for DHCP. (I need to look at why the quick scan is 
taking that long, that seems like something isn't right). So if its at 
all possible to roam that is best, obviously.

Thanks,
James


> 
> Regards,
> -Denis
Denis Kenzior Nov. 2, 2023, 2:10 p.m. UTC | #8
Hi James,

> 
> I'm fine adding similar handling that I added for packet loss, except always 
> delay rather than only on additional events. But I would like to explore other 
> options in the future.

I guess the question is, does adding LOST BEACON handling actually help or 
you're speculating that it does?  I don't mind if we add this back in with a 
delay, but I'm worried it doesn't actually do anything.  7 beacons lost in a row 
is likely not recoverable territory.

I'm actually surprised the driver doesn't give you any other indications prior 
to the lost beacon event.  I would have expected RSSI or packet loss to manifest 
itself prior?

> 
> I'm not sure how, but being able to detect if the AP responded to 
> nullfunc/probes prior to the kernel blowing away the connection would be great. 
> (like send our own nullfunc frames or something, not really sure...)

Yes, the current implementation of this event in the kernel is pretty useless. 
What we really need is an additional threshold that generates an event out to 
userspace _before_ the kernel starts taking potentially irreversible actions. 
Something like a pre-beacon-loss event that gets generated when 2-3 beacons are 
lost in a row.

> 
> Its taking IWD about 4-5 seconds to reconnect, 3 seconds are the quick scan and 
> ~1-2 seconds for DHCP. (I need to look at why the quick scan is taking that 

That's awful.  How many frequencies are you generating?  Even a full scan should 
be ~1-2 seconds at most.  And 1-2 seconds for DHCP also seems fishy.  I've 
tested our implementation to be sub-300ms or so.

> long, that seems like something isn't right). So if its at all possible to roam 
> that is best, obviously.
> 

Yep.

Regards,
-Denis
James Prestwood Nov. 2, 2023, 2:33 p.m. UTC | #9
Hi Denis,

On 11/2/23 7:10 AM, Denis Kenzior wrote:
> Hi James,
> 
>>
>> I'm fine adding similar handling that I added for packet loss, except 
>> always delay rather than only on additional events. But I would like 
>> to explore other options in the future.
> 
> I guess the question is, does adding LOST BEACON handling actually help 
> or you're speculating that it does?  I don't mind if we add this back in 
> with a delay, but I'm worried it doesn't actually do anything.  7 
> beacons lost in a row is likely not recoverable territory.

In the behavior I witnessed which was actually quite reproducible at the 
time, and yes roaming on beacon loss definitely helped. It avoided an 
inevitable disconnect.

> 
> I'm actually surprised the driver doesn't give you any other indications 
> prior to the lost beacon event.  I would have expected RSSI or packet 
> loss to manifest itself prior?

There were packet loss events in some cases, and I'd say 75% of the time 
IWD did end up roaming anyways (with a patch that roamed on beacon loss, 
identical to packet loss). It was only the cases when we got a beacon 
loss and IWD did not roam that the disconnect was always inevitable. 
This was the motivation to force the roam for beacon loss.

RSSI wasn't always bad, but it was a very high load environment so I 
assume that wasn't helping.

> 
>>
>> I'm not sure how, but being able to detect if the AP responded to 
>> nullfunc/probes prior to the kernel blowing away the connection would 
>> be great. (like send our own nullfunc frames or something, not really 
>> sure...)
> 
> Yes, the current implementation of this event in the kernel is pretty 
> useless. What we really need is an additional threshold that generates 
> an event out to userspace _before_ the kernel starts taking potentially 
> irreversible actions. Something like a pre-beacon-loss event that gets 
> generated when 2-3 beacons are lost in a row.

Best would be an event like "Disconnection Imminent" (after the 2 
nullfuncs failed) so userspace could start a roam. Like you mentioned 
even if we knew a few beacons were lost, scanning is probably going to 
ruin any chance of recovery.

On the bright side a disconnect isn't that terrible since IWD is super 
fast at reconnecting, I was just looking for any optimization I could to 
maintain the connection.

> 
>>
>> Its taking IWD about 4-5 seconds to reconnect, 3 seconds are the quick 
>> scan and ~1-2 seconds for DHCP. (I need to look at why the quick scan 
>> is taking that 
> 
> That's awful.  How many frequencies are you generating?  Even a full 
> scan should be ~1-2 seconds at most.  And 1-2 seconds for DHCP also 
> seems fishy.  I've tested our implementation to be sub-300ms or so.

In theory only 5:
known_freq_set = known_networks_get_recent_frequencies(5);

But I'm also seeing a random New Scan Results event prior to the quick 
scan. Anyways, its something I gotta look into.

> 
>> long, that seems like something isn't right). So if its at all 
>> possible to roam that is best, obviously.
>>
> 
> Yep.
> 
> Regards,
> -Denis
Denis Kenzior Nov. 2, 2023, 3:17 p.m. UTC | #10
Hi James,

> 
> In the behavior I witnessed which was actually quite reproducible at the time, 
> and yes roaming on beacon loss definitely helped. It avoided an inevitable 
> disconnect.
> 

Okay, then lets put this in with a 1-2 delay to let the kernel disconnect if it 
is going to.

>>
>>>
>>> I'm not sure how, but being able to detect if the AP responded to 
>>> nullfunc/probes prior to the kernel blowing away the connection would be 
>>> great. (like send our own nullfunc frames or something, not really sure...)
>>
>> Yes, the current implementation of this event in the kernel is pretty useless. 
>> What we really need is an additional threshold that generates an event out to 
>> userspace _before_ the kernel starts taking potentially irreversible actions. 
>> Something like a pre-beacon-loss event that gets generated when 2-3 beacons 
>> are lost in a row.
> 
> Best would be an event like "Disconnection Imminent" (after the 2 nullfuncs 

It is too late by that point.

> failed) so userspace could start a roam. Like you mentioned even if we knew a 
> few beacons were lost, scanning is probably going to ruin any chance of recovery.

<snip>

> 
> In theory only 5:
> known_freq_set = known_networks_get_recent_frequencies(5);
> 

No, that picks the top 5 most recently used networks.  The known frequencies for 
each of the top 5 are added to the quick scan.  This works okay most of the 
time, but we do not rotate out known frequencies, so if you have a network with 
lots of APs spread out over the spectrum, it might end up scanning quite a bit.

Regards,
-Denis
James Prestwood Nov. 2, 2023, 3:41 p.m. UTC | #11
Hi Denis,

>>
>> In theory only 5:
>> known_freq_set = known_networks_get_recent_frequencies(5);
>>
> 
> No, that picks the top 5 most recently used networks.  The known 
> frequencies for each of the top 5 are added to the quick scan.  This 
> works okay most of the time, but we do not rotate out known frequencies, 
> so if you have a network with lots of APs spread out over the spectrum, 
> it might end up scanning quite a bit.

Ah, that is exactly the problem then.

I wonder if we should be limiting both known networks and frequencies in 
those networks. We do push new frequencies in order that they are added 
so it may be nice to just pick the first 5 or so in the queue. Then we'd 
be capping 25 frequencies at most instead of the whole spectrum in the 
worst case.

> 
> Regards,
> -Denis
Denis Kenzior Nov. 2, 2023, 4:10 p.m. UTC | #12
Hi James,

> 
> I wonder if we should be limiting both known networks and frequencies in those 
> networks. We do push new frequencies in order that they are added so it may be 
> nice to just pick the first 5 or so in the queue. Then we'd be capping 25 
> frequencies at most instead of the whole spectrum in the worst case.

We should, but we don't track them in LRU order at the moment.

Regards,
-Denis
James Prestwood Nov. 2, 2023, 4:13 p.m. UTC | #13
Hi Denis,

On 11/2/23 9:10 AM, Denis Kenzior wrote:
> Hi James,
> 
>>
>> I wonder if we should be limiting both known networks and frequencies 
>> in those networks. We do push new frequencies in order that they are 
>> added so it may be nice to just pick the first 5 or so in the queue. 
>> Then we'd be capping 25 frequencies at most instead of the whole 
>> spectrum in the worst case.
> 
> We should, but we don't track them in LRU order at the moment.

Yeah I see that now, we just add them as we see BSS's, not when we 
connect. I'll put it in the backlog and see if I can come up with something.

> 
> Regards,
> -Denis
>