diff mbox series

station: handle ROAMING state in disconnect event (again)

Message ID 20230403141927.235014-1-alvin@pqrs.dk (mailing list archive)
State New
Headers show
Series station: handle ROAMING state in disconnect event (again) | expand

Checks

Context Check Description
tedd_an/pre-ci_am success Success
prestwoj/iwd-alpine-ci-fetch success Fetch PR
prestwoj/iwd-ci-gitlint success GitLint
prestwoj/iwd-ci-fetch success Fetch PR
prestwoj/iwd-alpine-ci-makedistcheck fail Make Distcheck Make FAIL: make[2]: *** No rule to make target 'ell/sysctl.h', needed by 'distdir-am'. Stop. make[1]: *** [Makefile:3218: distdir] Error 2 make: *** [Makefile:3298: dist] Error 2
prestwoj/iwd-alpine-ci-incremental_build success Incremental build not run PASS
prestwoj/iwd-ci-incremental_build success Incremental build not run PASS
prestwoj/iwd-alpine-ci-build success Build - Configure
prestwoj/iwd-alpine-ci-makecheckvalgrind fail Make FAIL: make[1]: *** No rule to make target 'ell/sysctl.c', needed by 'ell/sysctl.lo'. Stop. make[1]: *** Waiting for unfinished jobs.... make: *** [Makefile:1710: all] Error 2
prestwoj/iwd-alpine-ci-makecheck pending makecheck SKIP
prestwoj/iwd-ci-build success Build - Configure
prestwoj/iwd-ci-makecheckvalgrind fail Make FAIL: make[1]: *** No rule to make target 'ell/sysctl.c', needed by 'ell/sysctl.lo'. Stop. make[1]: *** Waiting for unfinished jobs.... make: *** [Makefile:1709: all] Error 2
prestwoj/iwd-ci-clang fail Clang IWD - make FAIL: make[1]: *** No rule to make target 'ell/sysctl.c', needed by 'ell/sysctl.lo'. Stop. make[1]: *** Waiting for unfinished jobs.... make: *** [Makefile:1709: all] Error 2
prestwoj/iwd-ci-makecheck pending makecheck SKIP
prestwoj/iwd-ci-makedistcheck fail Make Distcheck Make FAIL: make[2]: *** No rule to make target 'ell/sysctl.h', needed by 'distdir-am'. Stop. make[1]: *** [Makefile:3217: distdir] Error 2 make: *** [Makefile:3297: dist] Error 2
prestwoj/iwd-ci-testrunner pending testrunner SKIP

Commit Message

Alvin Šipraga April 3, 2023, 2:19 p.m. UTC
From: Alvin Šipraga <alsi@bang-olufsen.dk>

The blamed commit clobbered the non-FT (Reassociation) roaming case
in the disconnect event handler, causing iwd to potentially get stuck in
the ROAMING state after a failed reassociation during non-FT roaming.
Add it back.

The subject is inspired by a previous commit which did the same thing,
but introduced before the new roaming states were added, namely commit
8758cc89487b ("station: handle ROAMING state in disconnect event"). The
rationale here is the same.

Fixes: edf556cc7bb9 ("station: add two new roaming states (FT/FW)")
---
 src/station.c | 1 +
 1 file changed, 1 insertion(+)

Comments

James Prestwood April 3, 2023, 10:35 p.m. UTC | #1
Hi Alvin,

On 4/3/23 07:19, Alvin Šipraga wrote:
> From: Alvin Šipraga <alsi@bang-olufsen.dk>
>
> The blamed commit clobbered the non-FT (Reassociation) roaming case
> in the disconnect event handler, causing iwd to potentially get stuck in
> the ROAMING state after a failed reassociation during non-FT roaming.
> Add it back.
>
> The subject is inspired by a previous commit which did the same thing,
> but introduced before the new roaming states were added, namely commit
> 8758cc89487b ("station: handle ROAMING state in disconnect event"). The
> rationale here is the same.
>
> Fixes: edf556cc7bb9 ("station: add two new roaming states (FT/FW)")

IIRC leaving out the old ROAMING state in this switch was actually done 
intentionally. This is because netdev_reassociate() takes a callback and 
any failures with the roam *should* show up in netdev_connect_event(), 
and in turn call the callback which then leads to the same 
station_disassociated() call. We wanted to avoid the potential of a 
disconnect event AND the connect callback both calling 
station_disassocated().

Do you happen to have an iwmon pcap/log of this behavior? If its a 
kernel bug and a CMD_CONNECT connect event never comes then we probably 
do need this patch, but maybe instead with an L_WARN_ON or a big debug 
print because it really shouldn't be happening.

Thanks,

James

> ---
>   src/station.c | 1 +
>   1 file changed, 1 insertion(+)
>
> diff --git a/src/station.c b/src/station.c
> index d5650a2995a9..69f9be79880d 100644
> --- a/src/station.c
> +++ b/src/station.c
> @@ -3201,6 +3201,7 @@ static void station_disconnect_event(struct station *station, void *event_data)
>   					event_data, station);
>   		return;
>   	case STATION_STATE_CONNECTED:
> +	case STATION_STATE_ROAMING:
>   	case STATION_STATE_FT_ROAMING:
>   	case STATION_STATE_FW_ROAMING:
>   		station_disassociated(station);
Alvin Šipraga April 5, 2023, 9:12 a.m. UTC | #2
Hi James,

On Mon, Apr 03, 2023 at 03:35:28PM -0700, James Prestwood wrote:
> Hi Alvin,
> 
> On 4/3/23 07:19, Alvin Šipraga wrote:
> > From: Alvin Šipraga <alsi@bang-olufsen.dk>
> > 
> > The blamed commit clobbered the non-FT (Reassociation) roaming case
> > in the disconnect event handler, causing iwd to potentially get stuck in
> > the ROAMING state after a failed reassociation during non-FT roaming.
> > Add it back.
> > 
> > The subject is inspired by a previous commit which did the same thing,
> > but introduced before the new roaming states were added, namely commit
> > 8758cc89487b ("station: handle ROAMING state in disconnect event"). The
> > rationale here is the same.
> > 
> > Fixes: edf556cc7bb9 ("station: add two new roaming states (FT/FW)")
> 
> IIRC leaving out the old ROAMING state in this switch was actually done
> intentionally. This is because netdev_reassociate() takes a callback and any
> failures with the roam *should* show up in netdev_connect_event(), and in
> turn call the callback which then leads to the same station_disassociated()
> call. We wanted to avoid the potential of a disconnect event AND the connect
> callback both calling station_disassocated().

Yes, that makes sense.

> 
> Do you happen to have an iwmon pcap/log of this behavior? If its a kernel
> bug and a CMD_CONNECT connect event never comes then we probably do need
> this patch, but maybe instead with an L_WARN_ON or a big debug print because
> it really shouldn't be happening.

Yes, this will be our next direction of inquiry. Easter is coming up and the
issue is reproducible by colleagues who are on holiday. But I will follow up.

Thanks for the quick reply.

Kind regards,
Alvin

> 
> Thanks,
> 
> James
> 
> > ---
> >   src/station.c | 1 +
> >   1 file changed, 1 insertion(+)
> > 
> > diff --git a/src/station.c b/src/station.c
> > index d5650a2995a9..69f9be79880d 100644
> > --- a/src/station.c
> > +++ b/src/station.c
> > @@ -3201,6 +3201,7 @@ static void station_disconnect_event(struct station *station, void *event_data)
> >   					event_data, station);
> >   		return;
> >   	case STATION_STATE_CONNECTED:
> > +	case STATION_STATE_ROAMING:
> >   	case STATION_STATE_FT_ROAMING:
> >   	case STATION_STATE_FW_ROAMING:
> >   		station_disassociated(station);
Alvin Šipraga April 30, 2023, 10:22 a.m. UTC | #3
Hi again,

On Wed, Apr 05, 2023 at 09:12:20AM +0000, Alvin Šipraga wrote:
> Hi James,
> 
> On Mon, Apr 03, 2023 at 03:35:28PM -0700, James Prestwood wrote:
> > Hi Alvin,
> > 
> > On 4/3/23 07:19, Alvin Šipraga wrote:
> > > From: Alvin Šipraga <alsi@bang-olufsen.dk>
> > > 
> > > The blamed commit clobbered the non-FT (Reassociation) roaming case
> > > in the disconnect event handler, causing iwd to potentially get stuck in
> > > the ROAMING state after a failed reassociation during non-FT roaming.
> > > Add it back.
> > > 
> > > The subject is inspired by a previous commit which did the same thing,
> > > but introduced before the new roaming states were added, namely commit
> > > 8758cc89487b ("station: handle ROAMING state in disconnect event"). The
> > > rationale here is the same.
> > > 
> > > Fixes: edf556cc7bb9 ("station: add two new roaming states (FT/FW)")
> > 
> > IIRC leaving out the old ROAMING state in this switch was actually done
> > intentionally. This is because netdev_reassociate() takes a callback and any
> > failures with the roam *should* show up in netdev_connect_event(), and in
> > turn call the callback which then leads to the same station_disassociated()
> > call. We wanted to avoid the potential of a disconnect event AND the connect
> > callback both calling station_disassocated().
> 
> Yes, that makes sense.
> 
> > 
> > Do you happen to have an iwmon pcap/log of this behavior? If its a kernel
> > bug and a CMD_CONNECT connect event never comes then we probably do need
> > this patch, but maybe instead with an L_WARN_ON or a big debug print because
> > it really shouldn't be happening.
> 
> Yes, this will be our next direction of inquiry. Easter is coming up and the
> issue is reproducible by colleagues who are on holiday. But I will follow up.

Just a short update on this. Currently we root-caused the failure to a bug where
the 2nd packet of the 4-Way Handshake gets silently dropped by the firmware we
are using and the vendor is looking into it. Until this is fixed we can't point
any fingers at IWD. I'll follow up once that is fixed. Thanks!

Kind regards,
Alvin

> 
> Thanks for the quick reply.
> 
> Kind regards,
> Alvin
> 
> > 
> > Thanks,
> > 
> > James
> > 
> > > ---
> > >   src/station.c | 1 +
> > >   1 file changed, 1 insertion(+)
> > > 
> > > diff --git a/src/station.c b/src/station.c
> > > index d5650a2995a9..69f9be79880d 100644
> > > --- a/src/station.c
> > > +++ b/src/station.c
> > > @@ -3201,6 +3201,7 @@ static void station_disconnect_event(struct station *station, void *event_data)
> > >   					event_data, station);
> > >   		return;
> > >   	case STATION_STATE_CONNECTED:
> > > +	case STATION_STATE_ROAMING:
> > >   	case STATION_STATE_FT_ROAMING:
> > >   	case STATION_STATE_FW_ROAMING:
> > >   		station_disassociated(station);
James Prestwood May 1, 2023, 3:53 p.m. UTC | #4
Hi Alvin,

On 4/30/23 3:22 AM, Alvin Šipraga wrote:
> Hi again,
> 
> On Wed, Apr 05, 2023 at 09:12:20AM +0000, Alvin Šipraga wrote:
>> Hi James,
>>
>> On Mon, Apr 03, 2023 at 03:35:28PM -0700, James Prestwood wrote:
>>> Hi Alvin,
>>>
>>> On 4/3/23 07:19, Alvin Šipraga wrote:
>>>> From: Alvin Šipraga <alsi@bang-olufsen.dk>
>>>>
>>>> The blamed commit clobbered the non-FT (Reassociation) roaming case
>>>> in the disconnect event handler, causing iwd to potentially get stuck in
>>>> the ROAMING state after a failed reassociation during non-FT roaming.
>>>> Add it back.
>>>>
>>>> The subject is inspired by a previous commit which did the same thing,
>>>> but introduced before the new roaming states were added, namely commit
>>>> 8758cc89487b ("station: handle ROAMING state in disconnect event"). The
>>>> rationale here is the same.
>>>>
>>>> Fixes: edf556cc7bb9 ("station: add two new roaming states (FT/FW)")
>>>
>>> IIRC leaving out the old ROAMING state in this switch was actually done
>>> intentionally. This is because netdev_reassociate() takes a callback and any
>>> failures with the roam *should* show up in netdev_connect_event(), and in
>>> turn call the callback which then leads to the same station_disassociated()
>>> call. We wanted to avoid the potential of a disconnect event AND the connect
>>> callback both calling station_disassocated().
>>
>> Yes, that makes sense.
>>
>>>
>>> Do you happen to have an iwmon pcap/log of this behavior? If its a kernel
>>> bug and a CMD_CONNECT connect event never comes then we probably do need
>>> this patch, but maybe instead with an L_WARN_ON or a big debug print because
>>> it really shouldn't be happening.
>>
>> Yes, this will be our next direction of inquiry. Easter is coming up and the
>> issue is reproducible by colleagues who are on holiday. But I will follow up.
> 
> Just a short update on this. Currently we root-caused the failure to a bug where
> the 2nd packet of the 4-Way Handshake gets silently dropped by the firmware we
> are using and the vendor is looking into it. Until this is fixed we can't point
> any fingers at IWD. I'll follow up once that is fixed. Thanks!

So was IWD not handling this? We keep a timer for eapol, so if packet 
2/4 was lost IWD should recover regardless. Even if its a FW bug IWD 
shouldn't get stuck.

> 
> Kind regards,
> Alvin
> 
>>
>> Thanks for the quick reply.
>>
>> Kind regards,
>> Alvin
>>
>>>
>>> Thanks,
>>>
>>> James
>>>
>>>> ---
>>>>    src/station.c | 1 +
>>>>    1 file changed, 1 insertion(+)
>>>>
>>>> diff --git a/src/station.c b/src/station.c
>>>> index d5650a2995a9..69f9be79880d 100644
>>>> --- a/src/station.c
>>>> +++ b/src/station.c
>>>> @@ -3201,6 +3201,7 @@ static void station_disconnect_event(struct station *station, void *event_data)
>>>>    					event_data, station);
>>>>    		return;
>>>>    	case STATION_STATE_CONNECTED:
>>>> +	case STATION_STATE_ROAMING:
>>>>    	case STATION_STATE_FT_ROAMING:
>>>>    	case STATION_STATE_FW_ROAMING:
>>>>    		station_disassociated(station);
Alvin Šipraga Nov. 13, 2023, 12:46 p.m. UTC | #5
Hello again James,

I'm resurrecting this old thread because we are hitting another roaming related
issue and I am not sure whether the problem is in iwd or in the wireless
driver/fw. I hope you can help me with some log analysis.

On Mon, Apr 03, 2023 at 03:35:28PM -0700, James Prestwood wrote:
> Hi Alvin,
> 
> On 4/3/23 07:19, Alvin Šipraga wrote:
> > From: Alvin Šipraga <alsi@bang-olufsen.dk>
> > 
> > The blamed commit clobbered the non-FT (Reassociation) roaming case
> > in the disconnect event handler, causing iwd to potentially get stuck in
> > the ROAMING state after a failed reassociation during non-FT roaming.
> > Add it back.
> > 
> > The subject is inspired by a previous commit which did the same thing,
> > but introduced before the new roaming states were added, namely commit
> > 8758cc89487b ("station: handle ROAMING state in disconnect event"). The
> > rationale here is the same.
> > 
> > Fixes: edf556cc7bb9 ("station: add two new roaming states (FT/FW)")

The last update I gave was on a silently dropped frame in the firmware when
doing the 4-Way Handshake. This was fixed inside the vendor's firmware so there
is nothing further to discuss there. See below for the current issue we are facing.

> 
> IIRC leaving out the old ROAMING state in this switch was actually done
> intentionally. This is because netdev_reassociate() takes a callback and any
> failures with the roam *should* show up in netdev_connect_event(), and in
> turn call the callback which then leads to the same station_disassociated()
> call. We wanted to avoid the potential of a disconnect event AND the connect
> callback both calling station_disassocated().

OK, so this makes sense, but I think we are hitting an edge case in our test
setup which slips through this logic. Allow me to explain.

We have an automated test setup with two BSSes in the same ESS and programmable
attenuators used to simulate a low-RSSI event on the station. The sequence of
events below begins with the station connected to the first BSS (BSS0), and it
is expected that iwd will then direct the driver to associate to the second BSS
(BSS1). This is without FT roaming.

1. iwd receives a CQM event and enters roaming state.

2. iwd issues a CMD_CONNECT. The driver defines .auth and .assoc ops, not
   .connect, so it takes the SME path in the kernel.

3. Auth/Assoc are successful and BSS1 kicks off the 4-Way Handshake.

4. After entering eapol_handle_ptk_3_of_4() iwd asynchronously installs the keys
   with CMD_NEW_KEY and then deletes the EAPOL timeout.

5. The group key gets installed successfully (CMD_NEW_KEY returns success).

6. In the mean time however the station receives a Deauth or Deassoc (not clear
   from our driver logs) and the driver sends this to iwd. In
   netdev_disconnect_event() iwd then follows this path:
    -> netdev_connect_free()
      -> handshake_state_free()
        -> netdev_handshake_state_free()
	  -> netdev_handshake_state_cancel_all()
	    -> l_genl_family_cancel(nl80211, nhs->pairwise_new_key_cmd_id);

   Then, re: the original patch in this thread, iwd's station code ignores this
   event, because the station state is ROAMING.

7. The subsequent CMD_NEW_KEY then fails in the driver, but the above genl
   callback is never invoked because it was cancelled.

In this way neither the station nor the handshake code then report the error, so
we get stuck in a roaming state.

> 
> Do you happen to have an iwmon pcap/log of this behavior? If its a kernel
> bug and a CMD_CONNECT connect event never comes then we probably do need
> this patch, but maybe instead with an L_WARN_ON or a big debug print because
> it really shouldn't be happening.

Yes, I have an iwmon pcap and also a journal log with iwd and kernel/driver
debug messages. Hope it's OK with you that I send it in a private mail.

I've put some engineers from NXP on cc as well, as they are helping us with this
case on the driver/FW side. The driver in question is the out-of-tree
Marvell/moal/mlan driver.

I have not yet tested the patch in this thread for this particular issue. First
I would like to hear your thoughts/feedback.

Thanks!

Kind regards,
Alvin
James Prestwood Nov. 13, 2023, 1:10 p.m. UTC | #6
Hi Alvin,

On 11/13/23 4:46 AM, Alvin Šipraga wrote:
> Hello again James,
> 
> I'm resurrecting this old thread because we are hitting another roaming related
> issue and I am not sure whether the problem is in iwd or in the wireless
> driver/fw. I hope you can help me with some log analysis.
> 
> On Mon, Apr 03, 2023 at 03:35:28PM -0700, James Prestwood wrote:
>> Hi Alvin,
>>
>> On 4/3/23 07:19, Alvin Šipraga wrote:
>>> From: Alvin Šipraga <alsi@bang-olufsen.dk>
>>>
>>> The blamed commit clobbered the non-FT (Reassociation) roaming case
>>> in the disconnect event handler, causing iwd to potentially get stuck in
>>> the ROAMING state after a failed reassociation during non-FT roaming.
>>> Add it back.
>>>
>>> The subject is inspired by a previous commit which did the same thing,
>>> but introduced before the new roaming states were added, namely commit
>>> 8758cc89487b ("station: handle ROAMING state in disconnect event"). The
>>> rationale here is the same.
>>>
>>> Fixes: edf556cc7bb9 ("station: add two new roaming states (FT/FW)")
> 
> The last update I gave was on a silently dropped frame in the firmware when
> doing the 4-Way Handshake. This was fixed inside the vendor's firmware so there
> is nothing further to discuss there. See below for the current issue we are facing.
> 
>>
>> IIRC leaving out the old ROAMING state in this switch was actually done
>> intentionally. This is because netdev_reassociate() takes a callback and any
>> failures with the roam *should* show up in netdev_connect_event(), and in
>> turn call the callback which then leads to the same station_disassociated()
>> call. We wanted to avoid the potential of a disconnect event AND the connect
>> callback both calling station_disassocated().
> 
> OK, so this makes sense, but I think we are hitting an edge case in our test
> setup which slips through this logic. Allow me to explain.
> 
> We have an automated test setup with two BSSes in the same ESS and programmable
> attenuators used to simulate a low-RSSI event on the station. The sequence of
> events below begins with the station connected to the first BSS (BSS0), and it
> is expected that iwd will then direct the driver to associate to the second BSS
> (BSS1). This is without FT roaming.
> 
> 1. iwd receives a CQM event and enters roaming state.
> 
> 2. iwd issues a CMD_CONNECT. The driver defines .auth and .assoc ops, not
>     .connect, so it takes the SME path in the kernel.
> 
> 3. Auth/Assoc are successful and BSS1 kicks off the 4-Way Handshake.
> 
> 4. After entering eapol_handle_ptk_3_of_4() iwd asynchronously installs the keys
>     with CMD_NEW_KEY and then deletes the EAPOL timeout.
> 
> 5. The group key gets installed successfully (CMD_NEW_KEY returns success).
> 
> 6. In the mean time however the station receives a Deauth or Deassoc (not clear
>     from our driver logs) and the driver sends this to iwd. In
>     netdev_disconnect_event() iwd then follows this path:
>      -> netdev_connect_free()
>        -> handshake_state_free()
>          -> netdev_handshake_state_free()
> 	  -> netdev_handshake_state_cancel_all()
> 	    -> l_genl_family_cancel(nl80211, nhs->pairwise_new_key_cmd_id);
> 
>     Then, re: the original patch in this thread, iwd's station code ignores this
>     event, because the station state is ROAMING.
> 
> 7. The subsequent CMD_NEW_KEY then fails in the driver, but the above genl
>     callback is never invoked because it was cancelled.

Ok this does make sense to me. A disconnect from the AP could come in at 
any point during reassociation (during the ROAMING state). I think we 
assume the kernel handles this during the actual association step, i.e. 
relays this in the connect event, but during eapol/key setting I think 
your original patch holds some merit.

Just so we have public record I've sanitized the logs you sent (removed 
dates, hostnames, driver prints), and its clear that a disconnect at 
this moment would trigger this "Unexpected disconnect" case:

iwd[3589]: ../iwd-2.8/src/eapol.c:eapol_start()
iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_link_notify() event 16 on 
ifindex 8
iwd[3589]: ../iwd-2.8/src/eapol.c:eapol_handle_ptk_1_of_4() ifindex=8
iwd[3589]: ../iwd-2.8/src/eapol.c:eapol_handle_ptk_3_of_4() ifindex=8
iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_set_gtk() ifindex=8 key_idx=1
iwd[3589]: ../iwd-2.8/src/station.c:station_handshake_event() Setting keys
iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_set_tk() ifindex=8 key_idx=0
iwd[3589]: ../iwd-2.8/src/station.c:station_handshake_event() Setting keys
iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_set_tk() ifindex=8 key_idx=0
iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_new_group_key_cb() ifindex: 8, 
err: 0
iwd[3589]: ../iwd-2.8/src/netdev.c:try_handshake_complete() 
ptk_installed: 0, gtk_installed: 1, igtk_installed: 1
iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_mlme_notify() MLME 
notification Disassociate(40)
iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_mlme_notify() MLME 
notification Disconnect(48)
iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_disconnect_event()
iwd[3589]: Received Deauthentication event, reason: 1, from_ap: true
iwd[3589]: ../iwd-2.8/src/wiphy.c:wiphy_radio_work_done() Work item 15 done
iwd[3589]: ../iwd-2.8/src/station.c:station_disconnect_event() 8
iwd[3589]: Unexpected disconnect event

So I'm thinking we should merge the original patch adding ROAMING to the 
switch statement. Maybe Denis has some additional insight

Thanks,
James
Denis Kenzior Nov. 13, 2023, 3:20 p.m. UTC | #7
Hi Alvin,

> 
> 1. iwd receives a CQM event and enters roaming state.
> 
> 2. iwd issues a CMD_CONNECT. The driver defines .auth and .assoc ops, not
>     .connect, so it takes the SME path in the kernel.
> 
> 3. Auth/Assoc are successful and BSS1 kicks off the 4-Way Handshake.
> 
> 4. After entering eapol_handle_ptk_3_of_4() iwd asynchronously installs the keys
>     with CMD_NEW_KEY and then deletes the EAPOL timeout.
> 
> 5. The group key gets installed successfully (CMD_NEW_KEY returns success).
> 
> 6. In the mean time however the station receives a Deauth or Deassoc (not clear
>     from our driver logs) and the driver sends this to iwd. In

Do we know why?  And the Deauth/Deassoc is from BSS1, not BSS0, correct?

>     netdev_disconnect_event() iwd then follows this path:
>      -> netdev_connect_free()
>        -> handshake_state_free()
>          -> netdev_handshake_state_free()
> 	  -> netdev_handshake_state_cancel_all()
> 	    -> l_genl_family_cancel(nl80211, nhs->pairwise_new_key_cmd_id);
> 
>     Then, re: the original patch in this thread, iwd's station code ignores this
>     event, because the station state is ROAMING.
> 
> 7. The subsequent CMD_NEW_KEY then fails in the driver, but the above genl
>     callback is never invoked because it was cancelled.

There should be three NEW_KEYs generated:
set_gtk
set_igtk (if you have MFP enabled)
set_tk

Normally in that order.  Which CMD_NEW_KEY fails?

> 
> In this way neither the station nor the handshake code then report the error, so
> we get stuck in a roaming state.
> 

Ok, I can see that happening.

Regards,
-Denis
Alvin Šipraga Nov. 13, 2023, 3:22 p.m. UTC | #8
Hi James,

On Mon, Nov 13, 2023 at 05:10:20AM -0800, James Prestwood wrote:
> Hi Alvin,
> 
> On 11/13/23 4:46 AM, Alvin Šipraga wrote:
> > Hello again James,
> > 
> > I'm resurrecting this old thread because we are hitting another roaming related
> > issue and I am not sure whether the problem is in iwd or in the wireless
> > driver/fw. I hope you can help me with some log analysis.
> > 
> > On Mon, Apr 03, 2023 at 03:35:28PM -0700, James Prestwood wrote:
> > > Hi Alvin,
> > > 
> > > On 4/3/23 07:19, Alvin Šipraga wrote:
> > > > From: Alvin Šipraga <alsi@bang-olufsen.dk>
> > > > 
> > > > The blamed commit clobbered the non-FT (Reassociation) roaming case
> > > > in the disconnect event handler, causing iwd to potentially get stuck in
> > > > the ROAMING state after a failed reassociation during non-FT roaming.
> > > > Add it back.
> > > > 
> > > > The subject is inspired by a previous commit which did the same thing,
> > > > but introduced before the new roaming states were added, namely commit
> > > > 8758cc89487b ("station: handle ROAMING state in disconnect event"). The
> > > > rationale here is the same.
> > > > 
> > > > Fixes: edf556cc7bb9 ("station: add two new roaming states (FT/FW)")
> > 
> > The last update I gave was on a silently dropped frame in the firmware when
> > doing the 4-Way Handshake. This was fixed inside the vendor's firmware so there
> > is nothing further to discuss there. See below for the current issue we are facing.
> > 
> > > 
> > > IIRC leaving out the old ROAMING state in this switch was actually done
> > > intentionally. This is because netdev_reassociate() takes a callback and any
> > > failures with the roam *should* show up in netdev_connect_event(), and in
> > > turn call the callback which then leads to the same station_disassociated()
> > > call. We wanted to avoid the potential of a disconnect event AND the connect
> > > callback both calling station_disassocated().
> > 
> > OK, so this makes sense, but I think we are hitting an edge case in our test
> > setup which slips through this logic. Allow me to explain.
> > 
> > We have an automated test setup with two BSSes in the same ESS and programmable
> > attenuators used to simulate a low-RSSI event on the station. The sequence of
> > events below begins with the station connected to the first BSS (BSS0), and it
> > is expected that iwd will then direct the driver to associate to the second BSS
> > (BSS1). This is without FT roaming.
> > 
> > 1. iwd receives a CQM event and enters roaming state.
> > 
> > 2. iwd issues a CMD_CONNECT. The driver defines .auth and .assoc ops, not
> >     .connect, so it takes the SME path in the kernel.
> > 
> > 3. Auth/Assoc are successful and BSS1 kicks off the 4-Way Handshake.
> > 
> > 4. After entering eapol_handle_ptk_3_of_4() iwd asynchronously installs the keys
> >     with CMD_NEW_KEY and then deletes the EAPOL timeout.
> > 
> > 5. The group key gets installed successfully (CMD_NEW_KEY returns success).
> > 
> > 6. In the mean time however the station receives a Deauth or Deassoc (not clear
> >     from our driver logs) and the driver sends this to iwd. In
> >     netdev_disconnect_event() iwd then follows this path:
> >      -> netdev_connect_free()
> >        -> handshake_state_free()
> >          -> netdev_handshake_state_free()
> > 	  -> netdev_handshake_state_cancel_all()
> > 	    -> l_genl_family_cancel(nl80211, nhs->pairwise_new_key_cmd_id);
> > 
> >     Then, re: the original patch in this thread, iwd's station code ignores this
> >     event, because the station state is ROAMING.
> > 
> > 7. The subsequent CMD_NEW_KEY then fails in the driver, but the above genl
> >     callback is never invoked because it was cancelled.
> 
> Ok this does make sense to me. A disconnect from the AP could come in at any
> point during reassociation (during the ROAMING state). I think we assume the
> kernel handles this during the actual association step, i.e. relays this in
> the connect event, but during eapol/key setting I think your original patch
> holds some merit.
> 
> Just so we have public record I've sanitized the logs you sent (removed
> dates, hostnames, driver prints), and its clear that a disconnect at this
> moment would trigger this "Unexpected disconnect" case:
> 
> iwd[3589]: ../iwd-2.8/src/eapol.c:eapol_start()
> iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_link_notify() event 16 on ifindex
> 8
> iwd[3589]: ../iwd-2.8/src/eapol.c:eapol_handle_ptk_1_of_4() ifindex=8
> iwd[3589]: ../iwd-2.8/src/eapol.c:eapol_handle_ptk_3_of_4() ifindex=8
> iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_set_gtk() ifindex=8 key_idx=1
> iwd[3589]: ../iwd-2.8/src/station.c:station_handshake_event() Setting keys
> iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_set_tk() ifindex=8 key_idx=0
> iwd[3589]: ../iwd-2.8/src/station.c:station_handshake_event() Setting keys
> iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_set_tk() ifindex=8 key_idx=0
> iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_new_group_key_cb() ifindex: 8,
> err: 0
> iwd[3589]: ../iwd-2.8/src/netdev.c:try_handshake_complete() ptk_installed:
> 0, gtk_installed: 1, igtk_installed: 1
> iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_mlme_notify() MLME notification
> Disassociate(40)
> iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_mlme_notify() MLME notification
> Disconnect(48)
> iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_disconnect_event()
> iwd[3589]: Received Deauthentication event, reason: 1, from_ap: true
> iwd[3589]: ../iwd-2.8/src/wiphy.c:wiphy_radio_work_done() Work item 15 done
> iwd[3589]: ../iwd-2.8/src/station.c:station_disconnect_event() 8
> iwd[3589]: Unexpected disconnect event
> 
> So I'm thinking we should merge the original patch adding ROAMING to the
> switch statement. Maybe Denis has some additional insight

Great, thank you very much James for the swift response. Let's see what Denis
says too. In the mean time we will run the test with the patch applied (+
logging) and see if the problem is solved in this way. For the record the test
also triggers other roaming bugs with the same overall symptom (stuck in
roaming), but we suspect these bugs are on the driver or FW level. It might
complicate validation of the patch though. I will keep you posted.

Kind regards,
Alvin

> 
> Thanks,
> James
Denis Kenzior Nov. 13, 2023, 3:26 p.m. UTC | #9
Hi James,

> 
> Ok this does make sense to me. A disconnect from the AP could come in at any 
> point during reassociation (during the ROAMING state). I think we assume the 
> kernel handles this during the actual association step, i.e. relays this in the 
> connect event, but during eapol/key setting I think your original patch holds 
> some merit.
> 
> Just so we have public record I've sanitized the logs you sent (removed dates, 
> hostnames, driver prints), and its clear that a disconnect at this moment would 
> trigger this "Unexpected disconnect" case:
> 
> iwd[3589]: ../iwd-2.8/src/eapol.c:eapol_start()
> iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_link_notify() event 16 on ifindex 8
> iwd[3589]: ../iwd-2.8/src/eapol.c:eapol_handle_ptk_1_of_4() ifindex=8
> iwd[3589]: ../iwd-2.8/src/eapol.c:eapol_handle_ptk_3_of_4() ifindex=8
> iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_set_gtk() ifindex=8 key_idx=1
> iwd[3589]: ../iwd-2.8/src/station.c:station_handshake_event() Setting keys
> iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_set_tk() ifindex=8 key_idx=0
> iwd[3589]: ../iwd-2.8/src/station.c:station_handshake_event() Setting keys
> iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_set_tk() ifindex=8 key_idx=0

Why is set_tk being called twice?  Or is this a copy-paste artifact?

Regards,
-Denis
James Prestwood Nov. 13, 2023, 3:29 p.m. UTC | #10
Hi Denis,

On 11/13/23 7:26 AM, Denis Kenzior wrote:
> Hi James,
> 
>>
>> Ok this does make sense to me. A disconnect from the AP could come in 
>> at any point during reassociation (during the ROAMING state). I think 
>> we assume the kernel handles this during the actual association step, 
>> i.e. relays this in the connect event, but during eapol/key setting I 
>> think your original patch holds some merit.
>>
>> Just so we have public record I've sanitized the logs you sent 
>> (removed dates, hostnames, driver prints), and its clear that a 
>> disconnect at this moment would trigger this "Unexpected disconnect" 
>> case:
>>
>> iwd[3589]: ../iwd-2.8/src/eapol.c:eapol_start()
>> iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_link_notify() event 16 on 
>> ifindex 8
>> iwd[3589]: ../iwd-2.8/src/eapol.c:eapol_handle_ptk_1_of_4() ifindex=8
>> iwd[3589]: ../iwd-2.8/src/eapol.c:eapol_handle_ptk_3_of_4() ifindex=8
>> iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_set_gtk() ifindex=8 key_idx=1
>> iwd[3589]: ../iwd-2.8/src/station.c:station_handshake_event() Setting 
>> keys
>> iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_set_tk() ifindex=8 key_idx=0
>> iwd[3589]: ../iwd-2.8/src/station.c:station_handshake_event() Setting 
>> keys
>> iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_set_tk() ifindex=8 key_idx=0
> 
> Why is set_tk being called twice?  Or is this a copy-paste artifact?

Hmm, good question. I didn't notice that but its called 2x in the logs 
Alvin sent me. Alvin, was it a copy-paste error on your side? If not 
something weird is going on.

Thanks,
James

> 
> Regards,
> -Denis
Denis Kenzior Nov. 13, 2023, 3:38 p.m. UTC | #11
Hi James, Alvin,

On 11/13/23 09:29, James Prestwood wrote:
> Hi Denis,
> 
> On 11/13/23 7:26 AM, Denis Kenzior wrote:
>> Hi James,
>>
>>>
>>> Ok this does make sense to me. A disconnect from the AP could come in at any 
>>> point during reassociation (during the ROAMING state). I think we assume the 
>>> kernel handles this during the actual association step, i.e. relays this in 
>>> the connect event, but during eapol/key setting I think your original patch 
>>> holds some merit.
>>>
>>> Just so we have public record I've sanitized the logs you sent (removed 
>>> dates, hostnames, driver prints), and its clear that a disconnect at this 
>>> moment would trigger this "Unexpected disconnect" case:
>>>
>>> iwd[3589]: ../iwd-2.8/src/eapol.c:eapol_start()
>>> iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_link_notify() event 16 on ifindex 8
>>> iwd[3589]: ../iwd-2.8/src/eapol.c:eapol_handle_ptk_1_of_4() ifindex=8
>>> iwd[3589]: ../iwd-2.8/src/eapol.c:eapol_handle_ptk_3_of_4() ifindex=8
>>> iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_set_gtk() ifindex=8 key_idx=1
>>> iwd[3589]: ../iwd-2.8/src/station.c:station_handshake_event() Setting keys
>>> iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_set_tk() ifindex=8 key_idx=0
>>> iwd[3589]: ../iwd-2.8/src/station.c:station_handshake_event() Setting keys
>>> iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_set_tk() ifindex=8 key_idx=0
>>
>> Why is set_tk being called twice?  Or is this a copy-paste artifact?
> 
> Hmm, good question. I didn't notice that but its called 2x in the logs Alvin 
> sent me. Alvin, was it a copy-paste error on your side? If not something weird 
> is going on.
> 

We're generating the Setting keys event twice.  I don't really understand how 
this can happen, honestly.  I hope this is an artifact of some sort.

Anyhow, I think overall the patch approach is incorrect.  When we are roaming, 
we expect the callback provided to netdev_reassociate() to be invoked on any 
error.  So the real fix is in netdev.c somewhere.

Regards,
-Denis
Alvin Šipraga Nov. 13, 2023, 3:39 p.m. UTC | #12
Hi Denis,

On Mon, Nov 13, 2023 at 09:20:14AM -0600, Denis Kenzior wrote:
> Hi Alvin,
> 
> > 
> > 1. iwd receives a CQM event and enters roaming state.
> > 
> > 2. iwd issues a CMD_CONNECT. The driver defines .auth and .assoc ops, not
> >     .connect, so it takes the SME path in the kernel.
> > 
> > 3. Auth/Assoc are successful and BSS1 kicks off the 4-Way Handshake.
> > 
> > 4. After entering eapol_handle_ptk_3_of_4() iwd asynchronously installs the keys
> >     with CMD_NEW_KEY and then deletes the EAPOL timeout.
> > 
> > 5. The group key gets installed successfully (CMD_NEW_KEY returns success).
> > 
> > 6. In the mean time however the station receives a Deauth or Deassoc (not clear
> >     from our driver logs) and the driver sends this to iwd. In
> 
> Do we know why?  And the Deauth/Deassoc is from BSS1, not BSS0, correct?

No, we don't know why. But it's coming from BSS1, correct, at least if the iwmon
pcap is to be believed. The driver debug logs themselves don't print the MAC
address and we don't have a sniffer log. Also, now I had a second look and it is
indeed a Disassoc frame with unspecified reason code (1).

> 
> >     netdev_disconnect_event() iwd then follows this path:
> >      -> netdev_connect_free()
> >        -> handshake_state_free()
> >          -> netdev_handshake_state_free()
> > 	  -> netdev_handshake_state_cancel_all()
> > 	    -> l_genl_family_cancel(nl80211, nhs->pairwise_new_key_cmd_id);
> > 
> >     Then, re: the original patch in this thread, iwd's station code ignores this
> >     event, because the station state is ROAMING.
> > 
> > 7. The subsequent CMD_NEW_KEY then fails in the driver, but the above genl
> >     callback is never invoked because it was cancelled.
> 
> There should be three NEW_KEYs generated:
> set_gtk
> set_igtk (if you have MFP enabled)
> set_tk
> 
> Normally in that order.  Which CMD_NEW_KEY fails?

set_tk is failing. Here you can see the order (no set_igtk called):

iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_set_gtk() ifindex=8 key_idx=1
iwd[3589]: ../iwd-2.8/src/station.c:station_handshake_event() Setting keys
iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_set_tk() ifindex=8 key_idx=0
iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_set_rekey_offload() 8
iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_new_group_key_cb() ifindex: 8, err: 0
iwd[3589]: ../iwd-2.8/src/netdev.c:try_handshake_complete() ptk_installed: 0, gtk_installed: 1, igtk_installed: 1
iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_mlme_notify() MLME notification Disassociate(40)
iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_mlme_notify() MLME notification Disconnect(48)
iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_disconnect_event()
iwd[3589]: Received Deauthentication event, reason: 1, from_ap: true
iwd[3589]: ../iwd-2.8/src/wiphy.c:wiphy_radio_work_done() Work item 15 done
iwd[3589]: ../iwd-2.8/src/station.c:station_disconnect_event() 8
iwd[3589]: Unexpected disconnect event

Here you can see that set_gtk succeeds. You don't see the callback for set_tk
for the reasons I explained in mail above, but in the iwmon pcap I can see that
the CMD_NEW_KEY fails with -ENOLINK. Makes sense, since the driver got the
Deauth/Disassoc in the mean time.

> 
> > 
> > In this way neither the station nor the handshake code then report the error, so
> > we get stuck in a roaming state.
> > 
> 
> Ok, I can see that happening.
> 
> Regards,
> -Denis
Alvin Šipraga Nov. 13, 2023, 3:45 p.m. UTC | #13
Hi James,

On Mon, Nov 13, 2023 at 07:29:14AM -0800, James Prestwood wrote:
> Hi Denis,
> 
> On 11/13/23 7:26 AM, Denis Kenzior wrote:
> > Hi James,
> > 
> > > 
> > > Ok this does make sense to me. A disconnect from the AP could come
> > > in at any point during reassociation (during the ROAMING state). I
> > > think we assume the kernel handles this during the actual
> > > association step, i.e. relays this in the connect event, but during
> > > eapol/key setting I think your original patch holds some merit.
> > > 
> > > Just so we have public record I've sanitized the logs you sent
> > > (removed dates, hostnames, driver prints), and its clear that a
> > > disconnect at this moment would trigger this "Unexpected disconnect"
> > > case:
> > > 
> > > iwd[3589]: ../iwd-2.8/src/eapol.c:eapol_start()
> > > iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_link_notify() event 16 on
> > > ifindex 8
> > > iwd[3589]: ../iwd-2.8/src/eapol.c:eapol_handle_ptk_1_of_4() ifindex=8
> > > iwd[3589]: ../iwd-2.8/src/eapol.c:eapol_handle_ptk_3_of_4() ifindex=8
> > > iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_set_gtk() ifindex=8 key_idx=1
> > > iwd[3589]: ../iwd-2.8/src/station.c:station_handshake_event()
> > > Setting keys
> 
> Hmm, good question. I didn't notice that but its called 2x in the logs Alvin
> sent me. Alvin, was it a copy-paste error on your side? If not something
> weird is going on.

It's my bad. I sent James a snippet in my email when attaching the logs and he
just pasted it back. Something must have gotten screwed up on my end while
copy-pasting. Here is the proper snippet one more time:

iwd[3589]: ../iwd-2.8/src/station.c:station_enter_state() Old State: connected, new state: roaming
iwd[3589]: ../iwd-2.8/src/wiphy.c:wiphy_radio_work_done() Work item 14 done
iwd[3589]: ../iwd-2.8/src/wiphy.c:wiphy_radio_work_next() Starting work item 15
iwd[3589]: ../iwd-2.8/src/station.c:station_netdev_event() Associating
iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_mlme_notify() MLME notification Authenticate(37)
iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_authenticate_event()
iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_mlme_notify() MLME notification Associate(38)
iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_associate_event()
iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_mlme_notify() MLME notification Connect(46)
iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_connect_event()
iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_connect_event() aborting and ignore_connect_event not set, proceed
iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_connect_event() expect_connect_failure not set, proceed
iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_connect_event() Request / Response IEs parsed
iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_get_oci()
iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_link_notify() event 16 on ifindex 8
iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_get_oci_cb() Obtained OCI: freq: 2412, width: 1, center1: 2412, center2: 0
iwd[3589]: ../iwd-2.8/src/eapol.c:eapol_start()
iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_link_notify() event 16 on ifindex 8
iwd[3589]: ../iwd-2.8/src/eapol.c:eapol_handle_ptk_1_of_4() ifindex=8
iwd[3589]: ../iwd-2.8/src/eapol.c:eapol_handle_ptk_3_of_4() ifindex=8
iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_set_gtk() ifindex=8 key_idx=1
iwd[3589]: ../iwd-2.8/src/station.c:station_handshake_event() Setting keys
iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_set_tk() ifindex=8 key_idx=0
iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_set_rekey_offload() 8
iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_new_group_key_cb() ifindex: 8, err: 0
iwd[3589]: ../iwd-2.8/src/netdev.c:try_handshake_complete() ptk_installed: 0, gtk_installed: 1, igtk_installed: 1
iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_mlme_notify() MLME notification Disassociate(40)
iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_mlme_notify() MLME notification Disconnect(48)
iwd[3589]: ../iwd-2.8/src/netdev.c:netdev_disconnect_event()
iwd[3589]: Received Deauthentication event, reason: 1, from_ap: true
iwd[3589]: ../iwd-2.8/src/wiphy.c:wiphy_radio_work_done() Work item 15 done
iwd[3589]: ../iwd-2.8/src/station.c:station_disconnect_event() 8
iwd[3589]: Unexpected disconnect event

Sorry about that.

Kind regards,
Alvin
diff mbox series

Patch

diff --git a/src/station.c b/src/station.c
index d5650a2995a9..69f9be79880d 100644
--- a/src/station.c
+++ b/src/station.c
@@ -3201,6 +3201,7 @@  static void station_disconnect_event(struct station *station, void *event_data)
 					event_data, station);
 		return;
 	case STATION_STATE_CONNECTED:
+	case STATION_STATE_ROAMING:
 	case STATION_STATE_FT_ROAMING:
 	case STATION_STATE_FW_ROAMING:
 		station_disassociated(station);