Message ID | 20240103184638.533221-3-prestwoj@gmail.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [1/8] station: handle netconfig after roaming for FW roams | expand |
Context | Check | Description |
---|---|---|
tedd_an/pre-ci_am | success | Success |
prestwoj/iwd-ci-gitlint | success | GitLint |
Hi James, On 1/3/24 12:46, James Prestwood wrote: > There was an unhandled corner case if netconfig was running and > multiple roam conditions happened in sequence, all before netconfig > had completed. A single roam before netconfig was already handled > (23f0f5717c) but this did not take into account any additional roam > conditions. So if netconfig hasn't completed, we're in the 'connecting' state. Any subsequent roams should still be treated as if we are in 'connecting' state. Are we transitioning from connecting -> roaming at the D-Bus API level? We shouldn't be. Another weirdness is that I think we're sending the d-bus reply after connecting to the AP, but before netconfig runs. > > If IWD is in this state, having started netconfig, then roamed, and > again restarted netconfig it is still in a roaming state which will > prevent any further roams. IWD will remain "stuck" on the current > BSS until netconfig completes or gets disconnected. Makes sense, since roaming means netconfig isn't really doing anything. > > To fix this a new internal station state was added (no changes to > the DBus API) to distinguish between a purely WiFi connecting state > (STATION_STATE_CONNECTING/AUTO) and netconfig > (STATION_STATE_NETCONFIG). This allows IWD roam as needed if > netconfig is still running. Okay, but how would you distinguish between connecting -> netconfig and netconfig->roaming->netconfig? > > The change is mainly just adding STATION_STATE_NETCONFIG anywhere > that STATION_STATE_CONNECTING is to maintain the same behavior, > except within the netconfig event handler. In this case we should > never get here without being in either a NETCONFIG or ROAMING > state. > > For some background this scenario happens if the DHCP server goes > down for an extended period, e.g. if its being upgraded/serviced. > --- > src/station.c | 23 +++++++++++++++++------ > 1 file changed, 17 insertions(+), 6 deletions(-) > > diff --git a/src/station.c b/src/station.c > index 57d22e91..8f310ec8 100644 > --- a/src/station.c > +++ b/src/station.c > @@ -1768,6 +1768,7 @@ static void station_reset_connection_state(struct station *station) > if (station->state == STATION_STATE_CONNECTED || > station->state == STATION_STATE_CONNECTING || > station->state == STATION_STATE_CONNECTING_AUTO || > + station->state == STATION_STATE_NETCONFIG || > station_is_roaming(station)) > network_disconnected(network); > } > @@ -2043,8 +2044,9 @@ static void station_netconfig_event_handler(enum netconfig_event event, > dbus_pending_reply(&station->connect_pending, reply); > } > > - if (L_IN_SET(station->state, STATION_STATE_CONNECTING, > - STATION_STATE_CONNECTING_AUTO)) > + if (L_IN_SET(station->state, STATION_STATE_NETCONFIG, > + STATION_STATE_ROAMING, STATION_STATE_FT_ROAMING, > + STATION_STATE_FW_ROAMING)) I understand why NETCONFIG state is in the set, but why the others? > network_connect_failed(station->connected_network, > false); > > @@ -2070,9 +2072,14 @@ static bool netconfig_after_roam(struct station *station) > network_get_settings(network))) > return false; > > - return netconfig_configure(station->netconfig, > + if (L_WARN_ON(!netconfig_configure(station->netconfig, > station_netconfig_event_handler, > - station); > + station))) You already have an L_WARN_ON in the single call site of netconfig_after_roam? > + return false; > + > + station_enter_state(station, STATION_STATE_NETCONFIG); > + > + return true; > } > > static void station_roamed(struct station *station) > @@ -3255,6 +3262,8 @@ static void station_connect_ok(struct station *station) > station_netconfig_event_handler, > station))) > return; > + > + station_enter_state(station, STATION_STATE_NETCONFIG); > } else > station_enter_state(station, STATION_STATE_CONNECTED); > } > @@ -4067,7 +4076,8 @@ static struct l_dbus_message *station_dbus_scan(struct l_dbus *dbus, > return dbus_error_busy(message); > > if (station->state == STATION_STATE_CONNECTING || > - station->state == STATION_STATE_CONNECTING_AUTO) > + station->state == STATION_STATE_CONNECTING_AUTO || > + station->state == STATION_STATE_NETCONFIG) Might as well use L_IN_SET here > return dbus_error_busy(message); > > station->dbus_scan_subset_idx = 0; > @@ -5025,7 +5035,8 @@ static struct l_dbus_message *station_debug_scan(struct l_dbus *dbus, > return dbus_error_busy(message); > > if (station->state == STATION_STATE_CONNECTING || > - station->state == STATION_STATE_CONNECTING_AUTO) > + station->state == STATION_STATE_CONNECTING_AUTO || > + station->state == STATION_STATE_NETCONFIG) Also, shouldn't this also cover the roaming states? And do we still need this check given wiphy_work? Does netconfig use wiphy work to make sure nothing tries to scan or go off-channel? > return dbus_error_busy(message); > > if (!l_dbus_message_get_arguments(message, "aq", &iter)) Regards, -Denis
On 1/4/24 10:14 AM, Denis Kenzior wrote: > Hi James, > > On 1/3/24 12:46, James Prestwood wrote: >> There was an unhandled corner case if netconfig was running and >> multiple roam conditions happened in sequence, all before netconfig >> had completed. A single roam before netconfig was already handled >> (23f0f5717c) but this did not take into account any additional roam >> conditions. > > So if netconfig hasn't completed, we're in the 'connecting' state. > Any subsequent roams should still be treated as if we are in > 'connecting' state. Are we transitioning from connecting -> roaming at > the D-Bus API level? We shouldn't be. Yes we are as it is today. > > Another weirdness is that I think we're sending the d-bus reply after > connecting to the AP, but before netconfig runs. Probably this as well... But I'm not sure we can really send it after netconfig unless we want to add a timeout error or something. If netconfig takes a long time DBus will be unhappy. IMO netconfig is somewhat of a special case in this regard, and consumers of the API should be waiting on the connected state, not only the DBus method return. > >> >> If IWD is in this state, having started netconfig, then roamed, and >> again restarted netconfig it is still in a roaming state which will >> prevent any further roams. IWD will remain "stuck" on the current >> BSS until netconfig completes or gets disconnected. > > Makes sense, since roaming means netconfig isn't really doing anything. > >> >> To fix this a new internal station state was added (no changes to >> the DBus API) to distinguish between a purely WiFi connecting state >> (STATION_STATE_CONNECTING/AUTO) and netconfig >> (STATION_STATE_NETCONFIG). This allows IWD roam as needed if >> netconfig is still running. > > Okay, but how would you distinguish between connecting -> netconfig > and netconfig->roaming->netconfig? I hadn't had a need to distinguish, but given the above of wanting to remain the connecting state I think we'll need to. > >> >> The change is mainly just adding STATION_STATE_NETCONFIG anywhere >> that STATION_STATE_CONNECTING is to maintain the same behavior, >> except within the netconfig event handler. In this case we should >> never get here without being in either a NETCONFIG or ROAMING >> state. >> >> For some background this scenario happens if the DHCP server goes >> down for an extended period, e.g. if its being upgraded/serviced. >> --- >> src/station.c | 23 +++++++++++++++++------ >> 1 file changed, 17 insertions(+), 6 deletions(-) >> >> diff --git a/src/station.c b/src/station.c >> index 57d22e91..8f310ec8 100644 >> --- a/src/station.c >> +++ b/src/station.c >> @@ -1768,6 +1768,7 @@ static void >> station_reset_connection_state(struct station *station) >> if (station->state == STATION_STATE_CONNECTED || >> station->state == STATION_STATE_CONNECTING || >> station->state == STATION_STATE_CONNECTING_AUTO || >> + station->state == STATION_STATE_NETCONFIG || >> station_is_roaming(station)) >> network_disconnected(network); >> } >> @@ -2043,8 +2044,9 @@ static void >> station_netconfig_event_handler(enum netconfig_event event, >> dbus_pending_reply(&station->connect_pending, reply); >> } >> - if (L_IN_SET(station->state, STATION_STATE_CONNECTING, >> - STATION_STATE_CONNECTING_AUTO)) >> + if (L_IN_SET(station->state, STATION_STATE_NETCONFIG, >> + STATION_STATE_ROAMING, STATION_STATE_FT_ROAMING, >> + STATION_STATE_FW_ROAMING)) > > I understand why NETCONFIG state is in the set, but why the others? This was because if we are roaming and netconfig fails for some reason we want to disconnect, right? > >> network_connect_failed(station->connected_network, >> false); >> @@ -2070,9 +2072,14 @@ static bool netconfig_after_roam(struct >> station *station) >> network_get_settings(network))) >> return false; >> - return netconfig_configure(station->netconfig, >> + if (L_WARN_ON(!netconfig_configure(station->netconfig, >> station_netconfig_event_handler, >> - station); >> + station))) > > You already have an L_WARN_ON in the single call site of > netconfig_after_roam? > >> + return false; >> + >> + station_enter_state(station, STATION_STATE_NETCONFIG); >> + >> + return true; >> } >> static void station_roamed(struct station *station) >> @@ -3255,6 +3262,8 @@ static void station_connect_ok(struct station >> *station) >> station_netconfig_event_handler, >> station))) >> return; >> + >> + station_enter_state(station, STATION_STATE_NETCONFIG); >> } else >> station_enter_state(station, STATION_STATE_CONNECTED); >> } >> @@ -4067,7 +4076,8 @@ static struct l_dbus_message >> *station_dbus_scan(struct l_dbus *dbus, >> return dbus_error_busy(message); >> if (station->state == STATION_STATE_CONNECTING || >> - station->state == STATION_STATE_CONNECTING_AUTO) >> + station->state == STATION_STATE_CONNECTING_AUTO || >> + station->state == STATION_STATE_NETCONFIG) > > Might as well use L_IN_SET here > >> return dbus_error_busy(message); >> station->dbus_scan_subset_idx = 0; >> @@ -5025,7 +5035,8 @@ static struct l_dbus_message >> *station_debug_scan(struct l_dbus *dbus, >> return dbus_error_busy(message); >> if (station->state == STATION_STATE_CONNECTING || >> - station->state == STATION_STATE_CONNECTING_AUTO) >> + station->state == STATION_STATE_CONNECTING_AUTO || >> + station->state == STATION_STATE_NETCONFIG) > > Also, shouldn't this also cover the roaming states? And do we still > need this check given wiphy_work? Does netconfig use wiphy work to > make sure nothing tries to scan or go off-channel? Yes, we shouldn't scan/offchannel while roaming. And no netconfig doesn't uses the wiphy queue, but it really should. > >> return dbus_error_busy(message); >> if (!l_dbus_message_get_arguments(message, "aq", &iter)) So sounds like I opened up a can of worms here :) We only noticed these issues cropping up recently because of extended server upgrade times, i.e. the DHCP server was down for a long time. Clients roamed and if they didn't have the recent changes to restart netconfig they'd be stuck connected without an IP. I then noticed some of these other limitations now, but at least currently being unable to roam is better than having no IP and requiring physical attention. > Regards, > -Denis
Hi James, >> Another weirdness is that I think we're sending the d-bus reply after >> connecting to the AP, but before netconfig runs. > Probably this as well... But I'm not sure we can really send it after netconfig > unless we want to add a timeout error or something. If netconfig takes a long We already should be disconnecting if netconfig fails. It is part of the NETCONFIG_EVENT_FAILED handling this patch was touching actually. I'm not sure how this was tested exactly, but Andrew was pretty adamant that we should be doing so and I thought his arguments were compelling enough. > time DBus will be unhappy. IMO netconfig is somewhat of a special case in this > regard, and consumers of the API should be waiting on the connected state, not > only the DBus method return. netconfig should take a max of a few seconds under normal circumstances. Capping the DHCP time might make sense anyway and I think there are even some TODO items inside ell/netconfig.c about this. Anyhow, this is a smaller issue compared to the one above. >>> @@ -2043,8 +2044,9 @@ static void station_netconfig_event_handler(enum >>> netconfig_event event, >>> dbus_pending_reply(&station->connect_pending, reply); >>> } >>> - if (L_IN_SET(station->state, STATION_STATE_CONNECTING, >>> - STATION_STATE_CONNECTING_AUTO)) >>> + if (L_IN_SET(station->state, STATION_STATE_NETCONFIG, >>> + STATION_STATE_ROAMING, STATION_STATE_FT_ROAMING, >>> + STATION_STATE_FW_ROAMING)) >> >> I understand why NETCONFIG state is in the set, but why the others? > This was because if we are roaming and netconfig fails for some reason we want > to disconnect, right? Well, AFAIK, the original intent of this piece was to discard secrets if the initial connection failed. This is now covered by the NETCONFIG state which is roughly equivalent to CONNECTING. The addition of roaming states modifies the original intent. Hence why I'm wondering why this is being added? >> Also, shouldn't this also cover the roaming states? And do we still need this >> check given wiphy_work? Does netconfig use wiphy work to make sure nothing >> tries to scan or go off-channel? > Yes, we shouldn't scan/offchannel while roaming. And no netconfig doesn't uses > the wiphy queue, but it really should. >> >>> return dbus_error_busy(message); >>> if (!l_dbus_message_get_arguments(message, "aq", &iter)) > So sounds like I opened up a can of worms here :) We only noticed these issues > cropping up recently because of extended server upgrade times, i.e. the DHCP Yep, there's always going to be nasty corner cases to deal with. I suspect your earlier change to re-start netconfig exposed some of these inconsistencies as well. > server was down for a long time. Clients roamed and if they didn't have the > recent changes to restart netconfig they'd be stuck connected without an IP. I Hence why I think it is better to treat this as an initial connection failure rather than pretending that we roamed and dealing with the consequences. > then noticed some of these other limitations now, but at least currently being > unable to roam is better than having no IP and requiring physical attention. Regards, -Denis
Hi Denis, On 1/4/24 10:55 AM, Denis Kenzior wrote: > Hi James, > >>> Another weirdness is that I think we're sending the d-bus reply >>> after connecting to the AP, but before netconfig runs. >> Probably this as well... But I'm not sure we can really send it after >> netconfig unless we want to add a timeout error or something. If >> netconfig takes a long > > We already should be disconnecting if netconfig fails. It is part of > the NETCONFIG_EVENT_FAILED handling this patch was touching actually. > I'm not sure how this was tested exactly, but Andrew was pretty > adamant that we should be doing so and I thought his arguments were > compelling enough. > >> time DBus will be unhappy. IMO netconfig is somewhat of a special >> case in this regard, and consumers of the API should be waiting on >> the connected state, not only the DBus method return. > > netconfig should take a max of a few seconds under normal > circumstances. Capping the DHCP time might make sense anyway and I > think there are even some TODO items inside ell/netconfig.c about this. This seems much simpler to implement and maintain... > > Anyhow, this is a smaller issue compared to the one above. > >>>> @@ -2043,8 +2044,9 @@ static void >>>> station_netconfig_event_handler(enum netconfig_event event, >>>> dbus_pending_reply(&station->connect_pending, reply); >>>> } >>>> - if (L_IN_SET(station->state, STATION_STATE_CONNECTING, >>>> - STATION_STATE_CONNECTING_AUTO)) >>>> + if (L_IN_SET(station->state, STATION_STATE_NETCONFIG, >>>> + STATION_STATE_ROAMING, STATION_STATE_FT_ROAMING, >>>> + STATION_STATE_FW_ROAMING)) >>> >>> I understand why NETCONFIG state is in the set, but why the others? >> This was because if we are roaming and netconfig fails for some >> reason we want to disconnect, right? > > Well, AFAIK, the original intent of this piece was to discard secrets > if the initial connection failed. This is now covered by the > NETCONFIG state which is roughly equivalent to CONNECTING. > > The addition of roaming states modifies the original intent. Hence why > I'm wondering why this is being added? > >>> Also, shouldn't this also cover the roaming states? And do we still >>> need this check given wiphy_work? Does netconfig use wiphy work to >>> make sure nothing tries to scan or go off-channel? >> Yes, we shouldn't scan/offchannel while roaming. And no netconfig >> doesn't uses the wiphy queue, but it really should. >>> >>>> return dbus_error_busy(message); >>>> if (!l_dbus_message_get_arguments(message, "aq", &iter)) >> So sounds like I opened up a can of worms here :) We only noticed >> these issues cropping up recently because of extended server upgrade >> times, i.e. the DHCP > > Yep, there's always going to be nasty corner cases to deal with. I > suspect your earlier change to re-start netconfig exposed some of > these inconsistencies as well. > >> server was down for a long time. Clients roamed and if they didn't >> have the recent changes to restart netconfig they'd be stuck >> connected without an IP. I > > Hence why I think it is better to treat this as an initial connection > failure rather than pretending that we roamed and dealing with the > consequences. So when I originally sent the netconfig after roam change I wanted to actually prevent any roaming until netconfig finished, but instead we allowed the roam and restarted netconfig. So yes, I feel like this wasn't the right direction to go. I'm thinking instead we should prevent any roaming, cap the netconfig time (e.g. 15 seconds) and fail the connection if netconfig doesn't finish. Then we avoid all these nasty details of state changes and trying to manage roams but appear as "connecting" on DBus etc. >> then noticed some of these other limitations now, but at least >> currently being unable to roam is better than having no IP and >> requiring physical attention. > > Regards, > -Denis
Hi James, >> Hence why I think it is better to treat this as an initial connection failure >> rather than pretending that we roamed and dealing with the consequences. > So when I originally sent the netconfig after roam change I wanted to actually > prevent any roaming until netconfig finished, but instead we allowed the roam > and restarted netconfig. So yes, I feel like this wasn't the right direction to > go. I'm thinking instead we should prevent any roaming, cap the netconfig time > (e.g. 15 seconds) and fail the connection if netconfig doesn't finish. Then we Preventing roaming would certainly be easier to implement, but the fact that you had roams during the initial connection (the original problem which prompted this change), probably means that we do want to allow roaming eventually. 6G in particular would benefit, since attaching to 2.4/5GHz frequencies first, then asking for neighbor reports and roaming to 6G is probably the best strategy compared to passively scanning on the 6 GHz band. Regards, -Denis
diff --git a/src/station.c b/src/station.c index 57d22e91..8f310ec8 100644 --- a/src/station.c +++ b/src/station.c @@ -1768,6 +1768,7 @@ static void station_reset_connection_state(struct station *station) if (station->state == STATION_STATE_CONNECTED || station->state == STATION_STATE_CONNECTING || station->state == STATION_STATE_CONNECTING_AUTO || + station->state == STATION_STATE_NETCONFIG || station_is_roaming(station)) network_disconnected(network); } @@ -2043,8 +2044,9 @@ static void station_netconfig_event_handler(enum netconfig_event event, dbus_pending_reply(&station->connect_pending, reply); } - if (L_IN_SET(station->state, STATION_STATE_CONNECTING, - STATION_STATE_CONNECTING_AUTO)) + if (L_IN_SET(station->state, STATION_STATE_NETCONFIG, + STATION_STATE_ROAMING, STATION_STATE_FT_ROAMING, + STATION_STATE_FW_ROAMING)) network_connect_failed(station->connected_network, false); @@ -2070,9 +2072,14 @@ static bool netconfig_after_roam(struct station *station) network_get_settings(network))) return false; - return netconfig_configure(station->netconfig, + if (L_WARN_ON(!netconfig_configure(station->netconfig, station_netconfig_event_handler, - station); + station))) + return false; + + station_enter_state(station, STATION_STATE_NETCONFIG); + + return true; } static void station_roamed(struct station *station) @@ -3255,6 +3262,8 @@ static void station_connect_ok(struct station *station) station_netconfig_event_handler, station))) return; + + station_enter_state(station, STATION_STATE_NETCONFIG); } else station_enter_state(station, STATION_STATE_CONNECTED); } @@ -4067,7 +4076,8 @@ static struct l_dbus_message *station_dbus_scan(struct l_dbus *dbus, return dbus_error_busy(message); if (station->state == STATION_STATE_CONNECTING || - station->state == STATION_STATE_CONNECTING_AUTO) + station->state == STATION_STATE_CONNECTING_AUTO || + station->state == STATION_STATE_NETCONFIG) return dbus_error_busy(message); station->dbus_scan_subset_idx = 0; @@ -5025,7 +5035,8 @@ static struct l_dbus_message *station_debug_scan(struct l_dbus *dbus, return dbus_error_busy(message); if (station->state == STATION_STATE_CONNECTING || - station->state == STATION_STATE_CONNECTING_AUTO) + station->state == STATION_STATE_CONNECTING_AUTO || + station->state == STATION_STATE_NETCONFIG) return dbus_error_busy(message); if (!l_dbus_message_get_arguments(message, "aq", &iter))