[3/8] station: add handling for new NETCONFIG state

Message ID	20240103184638.533221-3-prestwoj@gmail.com (mailing list archive)
State	New
Headers	show Received: from mail-qk1-f176.google.com (mail-qk1-f176.google.com [209.85.222.176]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 5448B1C6BC for <iwd@lists.linux.dev>; Wed, 3 Jan 2024 18:47:00 +0000 (UTC) From: James Prestwood <prestwoj@gmail.com> To: iwd@lists.linux.dev Cc: James Prestwood <prestwoj@gmail.com> Subject: [PATCH 3/8] station: add handling for new NETCONFIG state Date: Wed, 3 Jan 2024 10:46:33 -0800 Message-Id: <20240103184638.533221-3-prestwoj@gmail.com> In-Reply-To: <20240103184638.533221-1-prestwoj@gmail.com> References: <20240103184638.533221-1-prestwoj@gmail.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	[1/8] station: handle netconfig after roaming for FW roams \| expand [1/8] station: handle netconfig after roaming for FW roams [2/8] station: add additional internal state, STATION_STATE_NETCONFIG [3/8] station: add handling for new NETCONFIG state [4/8] station: add debug events for internal states [5/8] auto-t: update roam test to use new debug events [6/8] auto-t: add test for roaming + netconfig [7/8] auto-t: improve failure handling in testPSK-roam [8/8] auto-t: fix random testPSK-roam failure

Context	Check	Description
tedd_an/pre-ci_am	success	Success
prestwoj/iwd-ci-gitlint	success	GitLint

James Prestwood Jan. 3, 2024, 6:46 p.m. UTC

There was an unhandled corner case if netconfig was running and
multiple roam conditions happened in sequence, all before netconfig
had completed. A single roam before netconfig was already handled
(23f0f5717c) but this did not take into account any additional roam
conditions.

If IWD is in this state, having started netconfig, then roamed, and
again restarted netconfig it is still in a roaming state which will
prevent any further roams. IWD will remain "stuck" on the current
BSS until netconfig completes or gets disconnected.

To fix this a new internal station state was added (no changes to
the DBus API) to distinguish between a purely WiFi connecting state
(STATION_STATE_CONNECTING/AUTO) and netconfig
(STATION_STATE_NETCONFIG). This allows IWD roam as needed if
netconfig is still running.

The change is mainly just adding STATION_STATE_NETCONFIG anywhere
that STATION_STATE_CONNECTING is to maintain the same behavior,
except within the netconfig event handler. In this case we should
never get here without being in either a NETCONFIG or ROAMING
state.

For some background this scenario happens if the DHCP server goes
down for an extended period, e.g. if its being upgraded/serviced.
---
 src/station.c | 23 +++++++++++++++++------
 1 file changed, 17 insertions(+), 6 deletions(-)

Denis Kenzior Jan. 4, 2024, 6:14 p.m. UTC | #1

Hi James,

On 1/3/24 12:46, James Prestwood wrote:
> There was an unhandled corner case if netconfig was running and
> multiple roam conditions happened in sequence, all before netconfig
> had completed. A single roam before netconfig was already handled
> (23f0f5717c) but this did not take into account any additional roam
> conditions.

So if netconfig hasn't completed, we're in the 'connecting' state.  Any 
subsequent roams should still be treated as if we are in 'connecting' state. 
Are we transitioning from connecting -> roaming at the D-Bus API level? We 
shouldn't be.

Another weirdness is that I think we're sending the d-bus reply after connecting 
to the AP, but before netconfig runs.

> 
> If IWD is in this state, having started netconfig, then roamed, and
> again restarted netconfig it is still in a roaming state which will
> prevent any further roams. IWD will remain "stuck" on the current
> BSS until netconfig completes or gets disconnected.

Makes sense, since roaming means netconfig isn't really doing anything.

> 
> To fix this a new internal station state was added (no changes to
> the DBus API) to distinguish between a purely WiFi connecting state
> (STATION_STATE_CONNECTING/AUTO) and netconfig
> (STATION_STATE_NETCONFIG). This allows IWD roam as needed if
> netconfig is still running.

Okay, but how would you distinguish between connecting -> netconfig and 
netconfig->roaming->netconfig?

> 
> The change is mainly just adding STATION_STATE_NETCONFIG anywhere
> that STATION_STATE_CONNECTING is to maintain the same behavior,
> except within the netconfig event handler. In this case we should
> never get here without being in either a NETCONFIG or ROAMING
> state.
> 
> For some background this scenario happens if the DHCP server goes
> down for an extended period, e.g. if its being upgraded/serviced.
> ---
>   src/station.c | 23 +++++++++++++++++------
>   1 file changed, 17 insertions(+), 6 deletions(-)
> 
> diff --git a/src/station.c b/src/station.c
> index 57d22e91..8f310ec8 100644
> --- a/src/station.c
> +++ b/src/station.c
> @@ -1768,6 +1768,7 @@ static void station_reset_connection_state(struct station *station)
>   	if (station->state == STATION_STATE_CONNECTED ||
>   			station->state == STATION_STATE_CONNECTING ||
>   			station->state == STATION_STATE_CONNECTING_AUTO ||
> +			station->state == STATION_STATE_NETCONFIG ||
>   			station_is_roaming(station))
>   		network_disconnected(network);
>   }
> @@ -2043,8 +2044,9 @@ static void station_netconfig_event_handler(enum netconfig_event event,
>   			dbus_pending_reply(&station->connect_pending, reply);
>   		}
>   
> -		if (L_IN_SET(station->state, STATION_STATE_CONNECTING,
> -				STATION_STATE_CONNECTING_AUTO))
> +		if (L_IN_SET(station->state, STATION_STATE_NETCONFIG,
> +				STATION_STATE_ROAMING, STATION_STATE_FT_ROAMING,
> +				STATION_STATE_FW_ROAMING))

I understand why NETCONFIG state is in the set, but why the others?

>   			network_connect_failed(station->connected_network,
>   						false);
>   
> @@ -2070,9 +2072,14 @@ static bool netconfig_after_roam(struct station *station)
>   					network_get_settings(network)))
>   		return false;
>   
> -	return netconfig_configure(station->netconfig,
> +	if (L_WARN_ON(!netconfig_configure(station->netconfig,
>   					station_netconfig_event_handler,
> -					station);
> +					station)))

You already have an L_WARN_ON in the single call site of netconfig_after_roam?

> +		return false;
> +
> +	station_enter_state(station, STATION_STATE_NETCONFIG);
> +
> +	return true;
>   }
>   
>   static void station_roamed(struct station *station)
> @@ -3255,6 +3262,8 @@ static void station_connect_ok(struct station *station)
>   						station_netconfig_event_handler,
>   						station)))
>   			return;
> +
> +		station_enter_state(station, STATION_STATE_NETCONFIG);
>   	} else
>   		station_enter_state(station, STATION_STATE_CONNECTED);
>   }
> @@ -4067,7 +4076,8 @@ static struct l_dbus_message *station_dbus_scan(struct l_dbus *dbus,
>   		return dbus_error_busy(message);
>   
>   	if (station->state == STATION_STATE_CONNECTING ||
> -			station->state == STATION_STATE_CONNECTING_AUTO)
> +			station->state == STATION_STATE_CONNECTING_AUTO ||
> +			station->state == STATION_STATE_NETCONFIG)

Might as well use L_IN_SET here

>   		return dbus_error_busy(message);
>   
>   	station->dbus_scan_subset_idx = 0;
> @@ -5025,7 +5035,8 @@ static struct l_dbus_message *station_debug_scan(struct l_dbus *dbus,
>   		return dbus_error_busy(message);
>   
>   	if (station->state == STATION_STATE_CONNECTING ||
> -			station->state == STATION_STATE_CONNECTING_AUTO)
> +			station->state == STATION_STATE_CONNECTING_AUTO ||
> +			station->state == STATION_STATE_NETCONFIG)

Also, shouldn't this also cover the roaming states?  And do we still need this 
check given wiphy_work?  Does netconfig use wiphy work to make sure nothing 
tries to scan or go off-channel?

>   		return dbus_error_busy(message);
>   
>   	if (!l_dbus_message_get_arguments(message, "aq", &iter))

Regards,
-Denis

James Prestwood Jan. 4, 2024, 6:31 p.m. UTC | #2

On 1/4/24 10:14 AM, Denis Kenzior wrote:
> Hi James,
>
> On 1/3/24 12:46, James Prestwood wrote:
>> There was an unhandled corner case if netconfig was running and
>> multiple roam conditions happened in sequence, all before netconfig
>> had completed. A single roam before netconfig was already handled
>> (23f0f5717c) but this did not take into account any additional roam
>> conditions.
>
> So if netconfig hasn't completed, we're in the 'connecting' state.  
> Any subsequent roams should still be treated as if we are in 
> 'connecting' state. Are we transitioning from connecting -> roaming at 
> the D-Bus API level? We shouldn't be.
Yes we are as it is today.
>
> Another weirdness is that I think we're sending the d-bus reply after 
> connecting to the AP, but before netconfig runs.
Probably this as well... But I'm not sure we can really send it after 
netconfig unless we want to add a timeout error or something. If 
netconfig takes a long time DBus will be unhappy. IMO netconfig is 
somewhat of a special case in this regard, and consumers of the API 
should be waiting on the connected state, not only the DBus method return.
>
>>
>> If IWD is in this state, having started netconfig, then roamed, and
>> again restarted netconfig it is still in a roaming state which will
>> prevent any further roams. IWD will remain "stuck" on the current
>> BSS until netconfig completes or gets disconnected.
>
> Makes sense, since roaming means netconfig isn't really doing anything.
>
>>
>> To fix this a new internal station state was added (no changes to
>> the DBus API) to distinguish between a purely WiFi connecting state
>> (STATION_STATE_CONNECTING/AUTO) and netconfig
>> (STATION_STATE_NETCONFIG). This allows IWD roam as needed if
>> netconfig is still running.
>
> Okay, but how would you distinguish between connecting -> netconfig 
> and netconfig->roaming->netconfig?
I hadn't had a need to distinguish, but given the above of wanting to 
remain the connecting state I think we'll need to.
>
>>
>> The change is mainly just adding STATION_STATE_NETCONFIG anywhere
>> that STATION_STATE_CONNECTING is to maintain the same behavior,
>> except within the netconfig event handler. In this case we should
>> never get here without being in either a NETCONFIG or ROAMING
>> state.
>>
>> For some background this scenario happens if the DHCP server goes
>> down for an extended period, e.g. if its being upgraded/serviced.
>> ---
>>   src/station.c | 23 +++++++++++++++++------
>>   1 file changed, 17 insertions(+), 6 deletions(-)
>>
>> diff --git a/src/station.c b/src/station.c
>> index 57d22e91..8f310ec8 100644
>> --- a/src/station.c
>> +++ b/src/station.c
>> @@ -1768,6 +1768,7 @@ static void 
>> station_reset_connection_state(struct station *station)
>>       if (station->state == STATION_STATE_CONNECTED ||
>>               station->state == STATION_STATE_CONNECTING ||
>>               station->state == STATION_STATE_CONNECTING_AUTO ||
>> +            station->state == STATION_STATE_NETCONFIG ||
>>               station_is_roaming(station))
>>           network_disconnected(network);
>>   }
>> @@ -2043,8 +2044,9 @@ static void 
>> station_netconfig_event_handler(enum netconfig_event event,
>> dbus_pending_reply(&station->connect_pending, reply);
>>           }
>>   -        if (L_IN_SET(station->state, STATION_STATE_CONNECTING,
>> -                STATION_STATE_CONNECTING_AUTO))
>> +        if (L_IN_SET(station->state, STATION_STATE_NETCONFIG,
>> +                STATION_STATE_ROAMING, STATION_STATE_FT_ROAMING,
>> +                STATION_STATE_FW_ROAMING))
>
> I understand why NETCONFIG state is in the set, but why the others?
This was because if we are roaming and netconfig fails for some reason 
we want to disconnect, right?
>
>> network_connect_failed(station->connected_network,
>>                           false);
>>   @@ -2070,9 +2072,14 @@ static bool netconfig_after_roam(struct 
>> station *station)
>>                       network_get_settings(network)))
>>           return false;
>>   -    return netconfig_configure(station->netconfig,
>> +    if (L_WARN_ON(!netconfig_configure(station->netconfig,
>>                       station_netconfig_event_handler,
>> -                    station);
>> +                    station)))
>
> You already have an L_WARN_ON in the single call site of 
> netconfig_after_roam?
>
>> +        return false;
>> +
>> +    station_enter_state(station, STATION_STATE_NETCONFIG);
>> +
>> +    return true;
>>   }
>>     static void station_roamed(struct station *station)
>> @@ -3255,6 +3262,8 @@ static void station_connect_ok(struct station 
>> *station)
>>                           station_netconfig_event_handler,
>>                           station)))
>>               return;
>> +
>> +        station_enter_state(station, STATION_STATE_NETCONFIG);
>>       } else
>>           station_enter_state(station, STATION_STATE_CONNECTED);
>>   }
>> @@ -4067,7 +4076,8 @@ static struct l_dbus_message 
>> *station_dbus_scan(struct l_dbus *dbus,
>>           return dbus_error_busy(message);
>>         if (station->state == STATION_STATE_CONNECTING ||
>> -            station->state == STATION_STATE_CONNECTING_AUTO)
>> +            station->state == STATION_STATE_CONNECTING_AUTO ||
>> +            station->state == STATION_STATE_NETCONFIG)
>
> Might as well use L_IN_SET here
>
>>           return dbus_error_busy(message);
>>         station->dbus_scan_subset_idx = 0;
>> @@ -5025,7 +5035,8 @@ static struct l_dbus_message 
>> *station_debug_scan(struct l_dbus *dbus,
>>           return dbus_error_busy(message);
>>         if (station->state == STATION_STATE_CONNECTING ||
>> -            station->state == STATION_STATE_CONNECTING_AUTO)
>> +            station->state == STATION_STATE_CONNECTING_AUTO ||
>> +            station->state == STATION_STATE_NETCONFIG)
>
> Also, shouldn't this also cover the roaming states?  And do we still 
> need this check given wiphy_work?  Does netconfig use wiphy work to 
> make sure nothing tries to scan or go off-channel?
Yes, we shouldn't scan/offchannel while roaming. And no netconfig 
doesn't uses the wiphy queue, but it really should.
>
>>           return dbus_error_busy(message);
>>         if (!l_dbus_message_get_arguments(message, "aq", &iter))
So sounds like I opened up a can of worms here :) We only noticed these 
issues cropping up recently because of extended server upgrade times, 
i.e. the DHCP server was down for a long time. Clients roamed and if 
they didn't have the recent changes to restart netconfig they'd be stuck 
connected without an IP. I then noticed some of these other limitations 
now, but at least currently being unable to roam is better than having 
no IP and requiring physical attention.
> Regards,
> -Denis

Denis Kenzior Jan. 4, 2024, 6:55 p.m. UTC | #3

Hi James,

>> Another weirdness is that I think we're sending the d-bus reply after 
>> connecting to the AP, but before netconfig runs.
> Probably this as well... But I'm not sure we can really send it after netconfig 
> unless we want to add a timeout error or something. If netconfig takes a long 

We already should be disconnecting if netconfig fails.  It is part of the 
NETCONFIG_EVENT_FAILED handling this patch was touching actually.  I'm not sure 
how this was tested exactly, but Andrew was pretty adamant that we should be 
doing so and I thought his arguments were compelling enough.

> time DBus will be unhappy. IMO netconfig is somewhat of a special case in this 
> regard, and consumers of the API should be waiting on the connected state, not 
> only the DBus method return.

netconfig should take a max of a few seconds under normal circumstances. 
Capping the DHCP time might make sense anyway and I think there are even some 
TODO items inside ell/netconfig.c about this.

Anyhow, this is a smaller issue compared to the one above.

>>> @@ -2043,8 +2044,9 @@ static void station_netconfig_event_handler(enum 
>>> netconfig_event event,
>>> dbus_pending_reply(&station->connect_pending, reply);
>>>           }
>>>   -        if (L_IN_SET(station->state, STATION_STATE_CONNECTING,
>>> -                STATION_STATE_CONNECTING_AUTO))
>>> +        if (L_IN_SET(station->state, STATION_STATE_NETCONFIG,
>>> +                STATION_STATE_ROAMING, STATION_STATE_FT_ROAMING,
>>> +                STATION_STATE_FW_ROAMING))
>>
>> I understand why NETCONFIG state is in the set, but why the others?
> This was because if we are roaming and netconfig fails for some reason we want 
> to disconnect, right?

Well, AFAIK, the original intent of this piece was to discard secrets if the 
initial connection failed.  This is now covered by the NETCONFIG state which is 
roughly equivalent to CONNECTING.

The addition of roaming states modifies the original intent.  Hence why I'm 
wondering why this is being added?

>> Also, shouldn't this also cover the roaming states?  And do we still need this 
>> check given wiphy_work?  Does netconfig use wiphy work to make sure nothing 
>> tries to scan or go off-channel?
> Yes, we shouldn't scan/offchannel while roaming. And no netconfig doesn't uses 
> the wiphy queue, but it really should.
>>
>>>           return dbus_error_busy(message);
>>>         if (!l_dbus_message_get_arguments(message, "aq", &iter))
> So sounds like I opened up a can of worms here :) We only noticed these issues 
> cropping up recently because of extended server upgrade times, i.e. the DHCP 

Yep, there's always going to be nasty corner cases to deal with.  I suspect your 
earlier change to re-start netconfig exposed some of these inconsistencies as well.

> server was down for a long time. Clients roamed and if they didn't have the 
> recent changes to restart netconfig they'd be stuck connected without an IP. I 

Hence why I think it is better to treat this as an initial connection failure 
rather than pretending that we roamed and dealing with the consequences.

> then noticed some of these other limitations now, but at least currently being 
> unable to roam is better than having no IP and requiring physical attention.

Regards,
-Denis

James Prestwood Jan. 4, 2024, 7:55 p.m. UTC | #4

Hi Denis,

On 1/4/24 10:55 AM, Denis Kenzior wrote:
> Hi James,
>
>>> Another weirdness is that I think we're sending the d-bus reply 
>>> after connecting to the AP, but before netconfig runs.
>> Probably this as well... But I'm not sure we can really send it after 
>> netconfig unless we want to add a timeout error or something. If 
>> netconfig takes a long 
>
> We already should be disconnecting if netconfig fails.  It is part of 
> the NETCONFIG_EVENT_FAILED handling this patch was touching actually.  
> I'm not sure how this was tested exactly, but Andrew was pretty 
> adamant that we should be doing so and I thought his arguments were 
> compelling enough.
>
>> time DBus will be unhappy. IMO netconfig is somewhat of a special 
>> case in this regard, and consumers of the API should be waiting on 
>> the connected state, not only the DBus method return.
>
> netconfig should take a max of a few seconds under normal 
> circumstances. Capping the DHCP time might make sense anyway and I 
> think there are even some TODO items inside ell/netconfig.c about this.
This seems much simpler to implement and maintain...
>
> Anyhow, this is a smaller issue compared to the one above.
>
>>>> @@ -2043,8 +2044,9 @@ static void 
>>>> station_netconfig_event_handler(enum netconfig_event event,
>>>> dbus_pending_reply(&station->connect_pending, reply);
>>>>           }
>>>>   -        if (L_IN_SET(station->state, STATION_STATE_CONNECTING,
>>>> -                STATION_STATE_CONNECTING_AUTO))
>>>> +        if (L_IN_SET(station->state, STATION_STATE_NETCONFIG,
>>>> +                STATION_STATE_ROAMING, STATION_STATE_FT_ROAMING,
>>>> +                STATION_STATE_FW_ROAMING))
>>>
>>> I understand why NETCONFIG state is in the set, but why the others?
>> This was because if we are roaming and netconfig fails for some 
>> reason we want to disconnect, right?
>
> Well, AFAIK, the original intent of this piece was to discard secrets 
> if the initial connection failed.  This is now covered by the 
> NETCONFIG state which is roughly equivalent to CONNECTING.
>
> The addition of roaming states modifies the original intent. Hence why 
> I'm wondering why this is being added?
>
>>> Also, shouldn't this also cover the roaming states?  And do we still 
>>> need this check given wiphy_work?  Does netconfig use wiphy work to 
>>> make sure nothing tries to scan or go off-channel?
>> Yes, we shouldn't scan/offchannel while roaming. And no netconfig 
>> doesn't uses the wiphy queue, but it really should.
>>>
>>>>           return dbus_error_busy(message);
>>>>         if (!l_dbus_message_get_arguments(message, "aq", &iter))
>> So sounds like I opened up a can of worms here :) We only noticed 
>> these issues cropping up recently because of extended server upgrade 
>> times, i.e. the DHCP 
>
> Yep, there's always going to be nasty corner cases to deal with. I 
> suspect your earlier change to re-start netconfig exposed some of 
> these inconsistencies as well.
>
>> server was down for a long time. Clients roamed and if they didn't 
>> have the recent changes to restart netconfig they'd be stuck 
>> connected without an IP. I 
>
> Hence why I think it is better to treat this as an initial connection 
> failure rather than pretending that we roamed and dealing with the 
> consequences.
So when I originally sent the netconfig after roam change I wanted to 
actually prevent any roaming until netconfig finished, but instead we 
allowed the roam and restarted netconfig. So yes, I feel like this 
wasn't the right direction to go. I'm thinking instead we should prevent 
any roaming, cap the netconfig time (e.g. 15 seconds) and fail the 
connection if netconfig doesn't finish. Then we avoid all these nasty 
details of state changes and trying to manage roams but appear as 
"connecting" on DBus etc.
>> then noticed some of these other limitations now, but at least 
>> currently being unable to roam is better than having no IP and 
>> requiring physical attention.
>
> Regards,
> -Denis

Denis Kenzior Jan. 4, 2024, 9:01 p.m. UTC | #5

Hi James,

>> Hence why I think it is better to treat this as an initial connection failure 
>> rather than pretending that we roamed and dealing with the consequences.
> So when I originally sent the netconfig after roam change I wanted to actually 
> prevent any roaming until netconfig finished, but instead we allowed the roam 
> and restarted netconfig. So yes, I feel like this wasn't the right direction to 
> go. I'm thinking instead we should prevent any roaming, cap the netconfig time 
> (e.g. 15 seconds) and fail the connection if netconfig doesn't finish. Then we 

Preventing roaming would certainly be easier to implement, but the fact that you 
had roams during the initial connection (the original problem which prompted 
this change), probably means that we do want to allow roaming eventually.

6G in particular would benefit, since attaching to 2.4/5GHz frequencies first, 
then asking for neighbor reports and roaming to 6G is probably the best strategy 
compared to passively scanning on the 6 GHz band.

Regards,
-Denis

[3/8] station: add handling for new NETCONFIG state

Checks

Commit Message

Comments

Patch