Message ID | 20231030134837.452957-1-prestwoj@gmail.com (mailing list archive) |
---|---|
Headers | show |
Series | Packet/beacon loss roaming improvements | expand |
Hi James, > > We were seeing beacon loss events not resulting in an immediate > disconnnect (as I have always expected), still eventually but after If I recall correctly, Lost Beacon is sent after several beacons in succession were lost. You are right that this could just be bad luck and doesn't actually mean that no packets are getting through. However, in practice mac80211 almost always disconnected us soon after. Didn't we test this pretty thoroughly? My memory is fuzzy here, but I seem to recall that power save has an effect on how lost beacon events are treated by mac80211. Maybe your recent power save patches had an effect? > plenty of time to roam. I initially added handling for > beacon loss identical to packet loss (try and find a better BSS) but > noticed that if IWD did not find a better candidate it resulted in a > disconnect 100% of the time. I watched a client for a full day and > whenever beacon loss events arrived they were followed by a > disconnect within ~5-6 seconds if IWD did not roam. This led me to > believe that at least on ath10k a beacon loss is still very much a > sign that a disconnect is going to come, we just have a bit more time > than other drivers. This was the motivation behind re-using/re-naming > the 'ap_directed_roam' flag in order to force IWD off the BSS. > ath10k is still a mac80211 driver, no? Given that we did test Lost Beacon event behavior before, I would like some more data points before being convinced it is a driver behavior change. > Again, this is just one driver. Maybe other drivers can > handle/recover from beacon loss. If we instead want to keep the > behavior the same as packet loss I'm ok with that (I can keep the > patch locally), or put the forced roam behavior behind a user > option e.g. [Roam].ForceRoamOnBeaconLoss If this is a driver behavior quirk, then this belongs in src/wiphy.c driver_infos table somehow. I'd really rather not add a bazillion config options that address the bug-of-the-day. > > James Prestwood (4): > station: rename ap_directed_roam to force_roam > station: start roam on beacon loss event > netdev: handle/send beacon loss event > station: rate limit packet loss roam scans > > src/netdev.c | 6 ++++- > src/netdev.h | 1 + > src/station.c | 61 +++++++++++++++++++++++++++++++++++++++++++-------- > 3 files changed, 58 insertions(+), 10 deletions(-) > Regards, -Denis
Hi Denis, On 10/30/23 8:00 AM, Denis Kenzior wrote: > Hi James, > >> >> We were seeing beacon loss events not resulting in an immediate >> disconnnect (as I have always expected), still eventually but after > > If I recall correctly, Lost Beacon is sent after several beacons in > succession were lost. You are right that this could just be bad luck > and doesn't actually mean that no packets are getting through. However, > in practice mac80211 almost always disconnected us soon after. Didn't > we test this pretty thoroughly? Yes, it appears mac80211 by default waits for 7 missed beacons before sending the event. It then probes the AP (either nullfunc or probe request) so apparently the connection could be recovered if the AP responded. Unfortunately we don't get any notification in userspace if the AP responded or not... I can't remember what hardware was tested. But there really wasn't a consistent way to test this. The testing involved me disabling roaming and walking away from the AP until I got disconnected. Sometimes this was due to beacon loss, sometimes the AP disconnected explicitly. But what I do remember is when beacon loss occurred, a local disconnect followed near immediately. This is why (I think) we thought there was no reason to handle this event. > > My memory is fuzzy here, but I seem to recall that power save has an > effect on how lost beacon events are treated by mac80211. Maybe your > recent power save patches had an effect? From what I can tell in mac80211 power save doesn't change handling. Its the driver that tells mac80211 of the beacon loss but maybe the driver (or firmware) could handle it differently depending on power save. When I was watching this device power save was disabled. > >> plenty of time to roam. I initially added handling for >> beacon loss identical to packet loss (try and find a better BSS) but >> noticed that if IWD did not find a better candidate it resulted in a >> disconnect 100% of the time. I watched a client for a full day and >> whenever beacon loss events arrived they were followed by a >> disconnect within ~5-6 seconds if IWD did not roam. This led me to >> believe that at least on ath10k a beacon loss is still very much a >> sign that a disconnect is going to come, we just have a bit more time >> than other drivers. This was the motivation behind re-using/re-naming >> the 'ap_directed_roam' flag in order to force IWD off the BSS. >> > > ath10k is still a mac80211 driver, no? Given that we did test Lost > Beacon event behavior before, I would like some more data points before > being convinced it is a driver behavior change. > >> Again, this is just one driver. Maybe other drivers can >> handle/recover from beacon loss. If we instead want to keep the >> behavior the same as packet loss I'm ok with that (I can keep the >> patch locally), or put the forced roam behavior behind a user >> option e.g. [Roam].ForceRoamOnBeaconLoss > > If this is a driver behavior quirk, then this belongs in src/wiphy.c > driver_infos table somehow. I'd really rather not add a bazillion > config options that address the bug-of-the-day. Yeah, adding a driver specific quirk doesn't seem like the right route. I think for now there is no harm in trying to roam on beacon loss, basically the same handling as packet loss. If a disconnect comes immediately the scan would be canceled. Otherwise maybe we get lucky and be able to roam. Since our specific hardware/use case seems to benefit from the forced roam I can keep that change out of tree (just set ap_directed_roaming), at least until more testing can be done, or if others report similar behavior. Thanks, James > >> >> James Prestwood (4): >> station: rename ap_directed_roam to force_roam >> station: start roam on beacon loss event >> netdev: handle/send beacon loss event >> station: rate limit packet loss roam scans >> >> src/netdev.c | 6 ++++- >> src/netdev.h | 1 + >> src/station.c | 61 +++++++++++++++++++++++++++++++++++++++++++-------- >> 3 files changed, 58 insertions(+), 10 deletions(-) >> > > Regards, > -Denis
Hi James, On 10/30/23 10:37, James Prestwood wrote: > Hi Denis, > > On 10/30/23 8:00 AM, Denis Kenzior wrote: >> Hi James, >> >>> >>> We were seeing beacon loss events not resulting in an immediate >>> disconnnect (as I have always expected), still eventually but after >> >> If I recall correctly, Lost Beacon is sent after several beacons in succession >> were lost. You are right that this could just be bad luck and doesn't >> actually mean that no packets are getting through. However, in practice >> mac80211 almost always disconnected us soon after. Didn't we test this pretty >> thoroughly? > > Yes, it appears mac80211 by default waits for 7 missed beacons before sending > the event. It then probes the AP (either nullfunc or probe request) so > apparently the connection could be recovered if the AP responded. Unfortunately > we don't get any notification in userspace if the AP responded or not... So this magic here? https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next.git/tree/net/mac80211/mlme.c#n3215 > > I can't remember what hardware was tested. But there really wasn't a consistent > way to test this. The testing involved me disabling roaming and walking away > from the AP until I got disconnected. Sometimes this was due to beacon loss, > sometimes the AP disconnected explicitly. But what I do remember is when beacon > loss occurred, a local disconnect followed near immediately. This is why (I > think) we thought there was no reason to handle this event. So what does your ath10k driver/hw do? Does it send nullfuncs or probe requests? > >> >> My memory is fuzzy here, but I seem to recall that power save has an effect on >> how lost beacon events are treated by mac80211. Maybe your recent power save >> patches had an effect? > > From what I can tell in mac80211 power save doesn't change handling. Its the > driver that tells mac80211 of the beacon loss but maybe the driver (or firmware) > could handle it differently depending on power save. > > When I was watching this device power save was disabled. Okay, fair enough. >> If this is a driver behavior quirk, then this belongs in src/wiphy.c >> driver_infos table somehow. I'd really rather not add a bazillion config >> options that address the bug-of-the-day. > > Yeah, adding a driver specific quirk doesn't seem like the right route. > > I think for now there is no harm in trying to roam on beacon loss, basically the > same handling as packet loss. If a disconnect comes immediately the scan would > be canceled. Otherwise maybe we get lucky and be able to roam. So the problem is, we had the _exact_ same behavior you're proposing here. We took it out. See commit: 836beb1276d1 ("station/wsc: remove beacon loss handling") So when we do that, alarm bells start going off. Why did we get rid of it if it was useful? 7 consecutive lost beacons is actually a lot. That's ~700ms with no connection with default settings. And you can maintain the connection after that for another 5-6? Something smells fishy. If the kernel has a hard limit after which it expects the connection to be disconnected, we can start a timer for 2-4x that limit? Looks like kernel uses probe_wait_ms parameter for this with a default of 500ms. Is your setup using the default values for beacon_loss_count and probe_wait_ms? Regards, -Denis
Hi Denis, On 10/30/23 10:05 AM, Denis Kenzior wrote: > Hi James, > > On 10/30/23 10:37, James Prestwood wrote: >> Hi Denis, >> >> On 10/30/23 8:00 AM, Denis Kenzior wrote: >>> Hi James, >>> >>>> >>>> We were seeing beacon loss events not resulting in an immediate >>>> disconnnect (as I have always expected), still eventually but after >>> >>> If I recall correctly, Lost Beacon is sent after several beacons in >>> succession were lost. You are right that this could just be bad luck >>> and doesn't actually mean that no packets are getting through. >>> However, in practice mac80211 almost always disconnected us soon >>> after. Didn't we test this pretty thoroughly? >> >> Yes, it appears mac80211 by default waits for 7 missed beacons before >> sending the event. It then probes the AP (either nullfunc or probe >> request) so apparently the connection could be recovered if the AP >> responded. Unfortunately we don't get any notification in userspace if >> the AP responded or not... > > So this magic here? > https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next.git/tree/net/mac80211/mlme.c#n3215 Yep > >> >> I can't remember what hardware was tested. But there really wasn't a >> consistent way to test this. The testing involved me disabling roaming >> and walking away from the AP until I got disconnected. Sometimes this >> was due to beacon loss, sometimes the AP disconnected explicitly. But >> what I do remember is when beacon loss occurred, a local disconnect >> followed near immediately. This is why (I think) we thought there was >> no reason to handle this event. > > So what does your ath10k driver/hw do? Does it send nullfuncs or probe > requests? Based on the driver it sets REPORTS_TX_ACK_STATUS, so nullfunc. Which would add an extra second before disconnect by default (2 nullfuncs 500ms apart). > >> >>> >>> My memory is fuzzy here, but I seem to recall that power save has an >>> effect on how lost beacon events are treated by mac80211. Maybe your >>> recent power save patches had an effect? >> >> From what I can tell in mac80211 power save doesn't change handling. >> Its the driver that tells mac80211 of the beacon loss but maybe the >> driver (or firmware) could handle it differently depending on power save. >> >> When I was watching this device power save was disabled. > > Okay, fair enough. > >>> If this is a driver behavior quirk, then this belongs in src/wiphy.c >>> driver_infos table somehow. I'd really rather not add a bazillion >>> config options that address the bug-of-the-day. >> >> Yeah, adding a driver specific quirk doesn't seem like the right route. >> >> I think for now there is no harm in trying to roam on beacon loss, >> basically the same handling as packet loss. If a disconnect comes >> immediately the scan would be canceled. Otherwise maybe we get lucky >> and be able to roam. > > So the problem is, we had the _exact_ same behavior you're proposing > here. We took it out. See commit: > 836beb1276d1 ("station/wsc: remove beacon loss handling") > > So when we do that, alarm bells start going off. Why did we get rid of > it if it was useful? Huh, apparently it was handled previously. I really have no idea, apparently I thought something changed in the kernel. Maybe based on testing only certain hardware. Looking back to kernel 5.3 (since 5.4+ was mentioned in that commit) I really don't see much difference either. This device was on 5.11, and again, not much different than current upstream in that regard. > > 7 consecutive lost beacons is actually a lot. That's ~700ms with no > connection with default settings. And you can maintain the connection > after that for another 5-6? Something smells fishy. According to the logs, yes. Looking back to the logs in the commit it was actually 4 seconds. Still, I'm not sure how it could get that far without a second lost beacon event (e.g. miss beacons, but get a nullfunc reply, miss 7 more, nullfunc reply etc.). With a single event the longest it could get drug out would be 1.7 seconds (7 beacons + 2 nullfuncs) before either disconnecting or recovering but missing more beacons and generating an event, unless there is some other reason=4 failure path I'm missing. > If the kernel has a hard limit after which it expects the connection to > be disconnected, we can start a timer for 2-4x that limit? Looks like > kernel uses probe_wait_ms parameter for this with a default of 500ms. > Is your setup using the default values for beacon_loss_count and > probe_wait_ms? Yep, everything is default as far as nl80211/driver options. > > Regards, > -Denis
Hi Denis, On 10/30/23 10:37 AM, James Prestwood wrote: > Hi Denis, > > On 10/30/23 10:05 AM, Denis Kenzior wrote: >> Hi James, >> >> On 10/30/23 10:37, James Prestwood wrote: >>> Hi Denis, >>> >>> On 10/30/23 8:00 AM, Denis Kenzior wrote: >>>> Hi James, >>>> >>>>> >>>>> We were seeing beacon loss events not resulting in an immediate >>>>> disconnnect (as I have always expected), still eventually but after >>>> >>>> If I recall correctly, Lost Beacon is sent after several beacons in >>>> succession were lost. You are right that this could just be bad >>>> luck and doesn't actually mean that no packets are getting through. >>>> However, in practice mac80211 almost always disconnected us soon >>>> after. Didn't we test this pretty thoroughly? >>> >>> Yes, it appears mac80211 by default waits for 7 missed beacons before >>> sending the event. It then probes the AP (either nullfunc or probe >>> request) so apparently the connection could be recovered if the AP >>> responded. Unfortunately we don't get any notification in userspace >>> if the AP responded or not... >> >> So this magic here? >> https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next.git/tree/net/mac80211/mlme.c#n3215 > > Yep > >> >>> >>> I can't remember what hardware was tested. But there really wasn't a >>> consistent way to test this. The testing involved me disabling >>> roaming and walking away from the AP until I got disconnected. >>> Sometimes this was due to beacon loss, sometimes the AP disconnected >>> explicitly. But what I do remember is when beacon loss occurred, a >>> local disconnect followed near immediately. This is why (I think) we >>> thought there was no reason to handle this event. >> >> So what does your ath10k driver/hw do? Does it send nullfuncs or >> probe requests? > > Based on the driver it sets REPORTS_TX_ACK_STATUS, so nullfunc. Which > would add an extra second before disconnect by default (2 nullfuncs > 500ms apart). > >> >>> >>>> >>>> My memory is fuzzy here, but I seem to recall that power save has an >>>> effect on how lost beacon events are treated by mac80211. Maybe >>>> your recent power save patches had an effect? >>> >>> From what I can tell in mac80211 power save doesn't change handling. >>> Its the driver that tells mac80211 of the beacon loss but maybe the >>> driver (or firmware) could handle it differently depending on power >>> save. >>> >>> When I was watching this device power save was disabled. >> >> Okay, fair enough. >> >>>> If this is a driver behavior quirk, then this belongs in src/wiphy.c >>>> driver_infos table somehow. I'd really rather not add a bazillion >>>> config options that address the bug-of-the-day. >>> >>> Yeah, adding a driver specific quirk doesn't seem like the right route. >>> >>> I think for now there is no harm in trying to roam on beacon loss, >>> basically the same handling as packet loss. If a disconnect comes >>> immediately the scan would be canceled. Otherwise maybe we get lucky >>> and be able to roam. >> >> So the problem is, we had the _exact_ same behavior you're proposing >> here. We took it out. See commit: >> 836beb1276d1 ("station/wsc: remove beacon loss handling") >> >> So when we do that, alarm bells start going off. Why did we get rid >> of it if it was useful? > > Huh, apparently it was handled previously. I really have no idea, > apparently I thought something changed in the kernel. Maybe based on > testing only certain hardware. > > Looking back to kernel 5.3 (since 5.4+ was mentioned in that commit) I > really don't see much difference either. This device was on 5.11, and > again, not much different than current upstream in that regard. > >> >> 7 consecutive lost beacons is actually a lot. That's ~700ms with no >> connection with default settings. And you can maintain the connection >> after that for another 5-6? Something smells fishy. > > According to the logs, yes. Looking back to the logs in the commit it > was actually 4 seconds. > > Still, I'm not sure how it could get that far without a second lost > beacon event (e.g. miss beacons, but get a nullfunc reply, miss 7 more, > nullfunc reply etc.). With a single event the longest it could get drug > out would be 1.7 seconds (7 beacons + 2 nullfuncs) before either > disconnecting or recovering but missing more beacons and generating an > event, unless there is some other reason=4 failure path I'm missing. > > >> If the kernel has a hard limit after which it expects the connection >> to be disconnected, we can start a timer for 2-4x that limit? Looks >> like kernel uses probe_wait_ms parameter for this with a default of >> 500ms. Is your setup using the default values for beacon_loss_count >> and probe_wait_ms? > > Yep, everything is default as far as nl80211/driver options. I found an old thread I asked on linux-wireless back when I removed the beacon loss handling [1]. Johannes explained that behavior did apparently change in order to support power save (your memory was correct). I identified the behavior change with hwsim, and apparently took his word when he said: "I'm pretty sure real hardware will behave just like hwsim here, albeit perhaps a bit slower" And I never added "proper" testing of beacon loss, i.e. block several beacons as opposed to just tearing down the AP. And this appears closer to the behavior I'm seeing in reality (AP not going down, just dropping beacons). So this is my bad, I didn't take into account the situation of beacons being dropped but then being picked up again, or the nullfuncs/probes coming back successfully. [1] https://lore.kernel.org/linux-wireless/ada14dfad76b93d654606c3b397de059d968096b.camel@gmail.com/ Thanks, James > >> >> Regards, >> -Denis
Hi James, >> >>> If the kernel has a hard limit after which it expects the connection to be >>> disconnected, we can start a timer for 2-4x that limit? Looks like kernel >>> uses probe_wait_ms parameter for this with a default of 500ms. Is your setup >>> using the default values for beacon_loss_count and probe_wait_ms? >> >> Yep, everything is default as far as nl80211/driver options. > > I found an old thread I asked on linux-wireless back when I removed the beacon > loss handling [1]. Johannes explained that behavior did apparently change in > order to support power save (your memory was correct). > Hah, funny. > I identified the behavior change with hwsim, and apparently took his word when > he said: > > "I'm pretty sure real hardware will behave just like hwsim here, albeit perhaps > a bit slower" > > And I never added "proper" testing of beacon loss, i.e. block several beacons as > opposed to just tearing down the AP. And this appears closer to the behavior I'm > seeing in reality (AP not going down, just dropping beacons). Right. > > So this is my bad, I didn't take into account the situation of beacons being > dropped but then being picked up again, or the nullfuncs/probes coming back > successfully. So the question is still how we handle this properly. When the kernel sends the nullfunc / probe request, are we going to be interfering with that logic by initiating a scan right away? A scan would force the device to go offchannel, preventing it from receiving the AP's response. Given that nullfunc should be extremely fast, maybe we should delay any roam action we take by a second or two? If the AP is truly lost, then we'll get disconnected soon after the LOST_BEACON event. If it isn't, then an extra second/two is not going to make much difference since the scan should be limited and quite fast most of the time. Regards, -Denis
Hi Denis, On 11/1/23 6:39 PM, Denis Kenzior wrote: > Hi James, > >>> >>>> If the kernel has a hard limit after which it expects the connection >>>> to be disconnected, we can start a timer for 2-4x that limit? Looks >>>> like kernel uses probe_wait_ms parameter for this with a default of >>>> 500ms. Is your setup using the default values for beacon_loss_count >>>> and probe_wait_ms? >>> >>> Yep, everything is default as far as nl80211/driver options. >> >> I found an old thread I asked on linux-wireless back when I removed >> the beacon loss handling [1]. Johannes explained that behavior did >> apparently change in order to support power save (your memory was >> correct). >> > > Hah, funny. > >> I identified the behavior change with hwsim, and apparently took his >> word when he said: >> >> "I'm pretty sure real hardware will behave just like hwsim here, >> albeit perhaps a bit slower" >> >> And I never added "proper" testing of beacon loss, i.e. block several >> beacons as opposed to just tearing down the AP. And this appears >> closer to the behavior I'm seeing in reality (AP not going down, just >> dropping beacons). > > Right. > >> >> So this is my bad, I didn't take into account the situation of beacons >> being dropped but then being picked up again, or the nullfuncs/probes >> coming back successfully. > > So the question is still how we handle this properly. When the kernel > sends the nullfunc / probe request, are we going to be interfering with > that logic by initiating a scan right away? A scan would force the > device to go offchannel, preventing it from receiving the AP's > response. Given that nullfunc should be extremely fast, maybe we should > delay any roam action we take by a second or two? If the AP is truly > lost, then we'll get disconnected soon after the LOST_BEACON event. If > it isn't, then an extra second/two is not going to make much difference > since the scan should be limited and quite fast most of the time. Yes scanning likely kills all chances of recovery. And this may very well be why I was seeing a 100% rate of disconnections. Granted, our devices are always moving so its very likely that once beacon/packet loss starts its only going to get worse, but that's not the case for everyone. Scanning just makes things worse. I'm fine adding similar handling that I added for packet loss, except always delay rather than only on additional events. But I would like to explore other options in the future. I'm not sure how, but being able to detect if the AP responded to nullfunc/probes prior to the kernel blowing away the connection would be great. (like send our own nullfunc frames or something, not really sure...) Its taking IWD about 4-5 seconds to reconnect, 3 seconds are the quick scan and ~1-2 seconds for DHCP. (I need to look at why the quick scan is taking that long, that seems like something isn't right). So if its at all possible to roam that is best, obviously. Thanks, James > > Regards, > -Denis
Hi James, > > I'm fine adding similar handling that I added for packet loss, except always > delay rather than only on additional events. But I would like to explore other > options in the future. I guess the question is, does adding LOST BEACON handling actually help or you're speculating that it does? I don't mind if we add this back in with a delay, but I'm worried it doesn't actually do anything. 7 beacons lost in a row is likely not recoverable territory. I'm actually surprised the driver doesn't give you any other indications prior to the lost beacon event. I would have expected RSSI or packet loss to manifest itself prior? > > I'm not sure how, but being able to detect if the AP responded to > nullfunc/probes prior to the kernel blowing away the connection would be great. > (like send our own nullfunc frames or something, not really sure...) Yes, the current implementation of this event in the kernel is pretty useless. What we really need is an additional threshold that generates an event out to userspace _before_ the kernel starts taking potentially irreversible actions. Something like a pre-beacon-loss event that gets generated when 2-3 beacons are lost in a row. > > Its taking IWD about 4-5 seconds to reconnect, 3 seconds are the quick scan and > ~1-2 seconds for DHCP. (I need to look at why the quick scan is taking that That's awful. How many frequencies are you generating? Even a full scan should be ~1-2 seconds at most. And 1-2 seconds for DHCP also seems fishy. I've tested our implementation to be sub-300ms or so. > long, that seems like something isn't right). So if its at all possible to roam > that is best, obviously. > Yep. Regards, -Denis
Hi Denis, On 11/2/23 7:10 AM, Denis Kenzior wrote: > Hi James, > >> >> I'm fine adding similar handling that I added for packet loss, except >> always delay rather than only on additional events. But I would like >> to explore other options in the future. > > I guess the question is, does adding LOST BEACON handling actually help > or you're speculating that it does? I don't mind if we add this back in > with a delay, but I'm worried it doesn't actually do anything. 7 > beacons lost in a row is likely not recoverable territory. In the behavior I witnessed which was actually quite reproducible at the time, and yes roaming on beacon loss definitely helped. It avoided an inevitable disconnect. > > I'm actually surprised the driver doesn't give you any other indications > prior to the lost beacon event. I would have expected RSSI or packet > loss to manifest itself prior? There were packet loss events in some cases, and I'd say 75% of the time IWD did end up roaming anyways (with a patch that roamed on beacon loss, identical to packet loss). It was only the cases when we got a beacon loss and IWD did not roam that the disconnect was always inevitable. This was the motivation to force the roam for beacon loss. RSSI wasn't always bad, but it was a very high load environment so I assume that wasn't helping. > >> >> I'm not sure how, but being able to detect if the AP responded to >> nullfunc/probes prior to the kernel blowing away the connection would >> be great. (like send our own nullfunc frames or something, not really >> sure...) > > Yes, the current implementation of this event in the kernel is pretty > useless. What we really need is an additional threshold that generates > an event out to userspace _before_ the kernel starts taking potentially > irreversible actions. Something like a pre-beacon-loss event that gets > generated when 2-3 beacons are lost in a row. Best would be an event like "Disconnection Imminent" (after the 2 nullfuncs failed) so userspace could start a roam. Like you mentioned even if we knew a few beacons were lost, scanning is probably going to ruin any chance of recovery. On the bright side a disconnect isn't that terrible since IWD is super fast at reconnecting, I was just looking for any optimization I could to maintain the connection. > >> >> Its taking IWD about 4-5 seconds to reconnect, 3 seconds are the quick >> scan and ~1-2 seconds for DHCP. (I need to look at why the quick scan >> is taking that > > That's awful. How many frequencies are you generating? Even a full > scan should be ~1-2 seconds at most. And 1-2 seconds for DHCP also > seems fishy. I've tested our implementation to be sub-300ms or so. In theory only 5: known_freq_set = known_networks_get_recent_frequencies(5); But I'm also seeing a random New Scan Results event prior to the quick scan. Anyways, its something I gotta look into. > >> long, that seems like something isn't right). So if its at all >> possible to roam that is best, obviously. >> > > Yep. > > Regards, > -Denis
Hi James, > > In the behavior I witnessed which was actually quite reproducible at the time, > and yes roaming on beacon loss definitely helped. It avoided an inevitable > disconnect. > Okay, then lets put this in with a 1-2 delay to let the kernel disconnect if it is going to. >> >>> >>> I'm not sure how, but being able to detect if the AP responded to >>> nullfunc/probes prior to the kernel blowing away the connection would be >>> great. (like send our own nullfunc frames or something, not really sure...) >> >> Yes, the current implementation of this event in the kernel is pretty useless. >> What we really need is an additional threshold that generates an event out to >> userspace _before_ the kernel starts taking potentially irreversible actions. >> Something like a pre-beacon-loss event that gets generated when 2-3 beacons >> are lost in a row. > > Best would be an event like "Disconnection Imminent" (after the 2 nullfuncs It is too late by that point. > failed) so userspace could start a roam. Like you mentioned even if we knew a > few beacons were lost, scanning is probably going to ruin any chance of recovery. <snip> > > In theory only 5: > known_freq_set = known_networks_get_recent_frequencies(5); > No, that picks the top 5 most recently used networks. The known frequencies for each of the top 5 are added to the quick scan. This works okay most of the time, but we do not rotate out known frequencies, so if you have a network with lots of APs spread out over the spectrum, it might end up scanning quite a bit. Regards, -Denis
Hi Denis, >> >> In theory only 5: >> known_freq_set = known_networks_get_recent_frequencies(5); >> > > No, that picks the top 5 most recently used networks. The known > frequencies for each of the top 5 are added to the quick scan. This > works okay most of the time, but we do not rotate out known frequencies, > so if you have a network with lots of APs spread out over the spectrum, > it might end up scanning quite a bit. Ah, that is exactly the problem then. I wonder if we should be limiting both known networks and frequencies in those networks. We do push new frequencies in order that they are added so it may be nice to just pick the first 5 or so in the queue. Then we'd be capping 25 frequencies at most instead of the whole spectrum in the worst case. > > Regards, > -Denis
Hi James, > > I wonder if we should be limiting both known networks and frequencies in those > networks. We do push new frequencies in order that they are added so it may be > nice to just pick the first 5 or so in the queue. Then we'd be capping 25 > frequencies at most instead of the whole spectrum in the worst case. We should, but we don't track them in LRU order at the moment. Regards, -Denis
Hi Denis, On 11/2/23 9:10 AM, Denis Kenzior wrote: > Hi James, > >> >> I wonder if we should be limiting both known networks and frequencies >> in those networks. We do push new frequencies in order that they are >> added so it may be nice to just pick the first 5 or so in the queue. >> Then we'd be capping 25 frequencies at most instead of the whole >> spectrum in the worst case. > > We should, but we don't track them in LRU order at the moment. Yeah I see that now, we just add them as we see BSS's, not when we connect. I'll put it in the backlog and see if I can come up with something. > > Regards, > -Denis >