diff mbox series

[v2] offchannel: handle out of order ACKs/events

Message ID 20231023122054.34100-1-prestwoj@gmail.com (mailing list archive)
State New
Headers show
Series [v2] offchannel: handle out of order ACKs/events | expand

Checks

Context Check Description
tedd_an/pre-ci_am success Success
prestwoj/iwd-alpine-ci-fetch success Fetch PR
prestwoj/iwd-ci-gitlint success GitLint
prestwoj/iwd-ci-fetch success Fetch PR
prestwoj/iwd-alpine-ci-makedistcheck fail Make Distcheck Make FAIL: ../../build-aux/test-driver: line 112: 20884 Segmentation fault (core dumped) "$@" >> "$log_file" 2>&1 make[4]: *** [Makefile:2924: test-suite.log] Error 1 make[3]: *** [Makefile:3032: check-TESTS] Error 2 make[2]: *** [Makefile:3389: check-am] Error 2 make[1]: *** [Makefile:3391: check] Error 2 make: *** [Makefile:3310: distcheck] Error 1
prestwoj/iwd-alpine-ci-incremental_build success Incremental build not run PASS
prestwoj/iwd-ci-makedistcheck success Make Distcheck
prestwoj/iwd-ci-incremental_build success Incremental build not run PASS
prestwoj/iwd-alpine-ci-build success Build - Configure
prestwoj/iwd-ci-build success Build - Configure
prestwoj/iwd-alpine-ci-makecheckvalgrind fail Make Check FAIL: ./build-aux/test-driver: line 112: 38335 Segmentation fault (core dumped) "$@" >> "$log_file" 2>&1 make[3]: *** [Makefile:2924: test-suite.log] Error 1 make[2]: *** [Makefile:3032: check-TESTS] Error 2 make[1]: *** [Makefile:3389: check-am] Error 2 make: *** [Makefile:3391: check] Error 2
prestwoj/iwd-alpine-ci-makecheck fail Make Check FAIL: ./build-aux/test-driver: line 112: 38488 Segmentation fault (core dumped) "$@" >> "$log_file" 2>&1 make[3]: *** [Makefile:2924: test-suite.log] Error 1 make[2]: *** [Makefile:3032: check-TESTS] Error 2 make[1]: *** [Makefile:3389: check-am] Error 2 make: *** [Makefile:3391: check] Error 2
prestwoj/iwd-ci-makecheckvalgrind success Make Check w/Valgrind
prestwoj/iwd-ci-clang success clang PASS
prestwoj/iwd-ci-makecheck success Make Check
prestwoj/iwd-ci-testrunner fail test-runner - FAIL: testPSK-roam

Commit Message

James Prestwood Oct. 23, 2023, 12:20 p.m. UTC
Its been seen (so far only in mac80211_hwsim + UML) where an
offchannel requests ACK comes after the ROC started event. This
causes the ROC started event to never call back to notify since
info->roc_cookie is unset and it appears to be coming from an
external process.

We can detect this situation in a few ways, first by looking up
the offchannel info by both wdev and cookie. If found, continue
as normal. If not lookup by only wdev and if there is still no
roc_cookie set the cookie from the notify callback to a temporary
'early_cookie' placeholder and return. This can be checked in
the ACK and if it matches we know the ACK came out of order and
the started event can be sent.

This also handles external process case. If we got a spurious
notify event before the ACK, the ACK callback will fail to
verify the early cookie, set roc_cookie correctly, and the expected
ROC event will come and issue started. Since there should be at
most one ROC session per wdev we can assume the temporary
early_cookie won't be overwritten.

Another minor change was made to the lookup on the cancel path.
Instead of looking up by wdev, lookup by the ID itself. We
shouldn't ever have more than one info per wdev in the queue but
looking up the _exact_ info structure doesn't hurt in case things
change in the future.
---
 src/offchannel.c | 73 ++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 65 insertions(+), 8 deletions(-)

v2:
 * Skip started if the event came early, and start in the ACK
 * Add check for roc_cookie && roc_cmd_id in notify event. This
   verifies that we actually have a pending request, if not it
   came externally.
 * Add 'early_cookie' placeholder to guard against the case of an
   external event coming before the ACK. Prior, this would have
   set roc_cookie to an incorrect cookie and called the started
   callback prematurely.

Comments

Denis Kenzior Oct. 25, 2023, 2:39 a.m. UTC | #1
Hi James,

On 10/23/23 07:20, James Prestwood wrote:
> Its been seen (so far only in mac80211_hwsim + UML) where an
> offchannel requests ACK comes after the ROC started event. This
> causes the ROC started event to never call back to notify since
> info->roc_cookie is unset and it appears to be coming from an
> external process.
> 
> We can detect this situation in a few ways, first by looking up
> the offchannel info by both wdev and cookie. If found, continue
> as normal. If not lookup by only wdev and if there is still no
> roc_cookie set the cookie from the notify callback to a temporary
> 'early_cookie' placeholder and return. This can be checked in
> the ACK and if it matches we know the ACK came out of order and
> the started event can be sent.
> 
> This also handles external process case. If we got a spurious
> notify event before the ACK, the ACK callback will fail to
> verify the early cookie, set roc_cookie correctly, and the expected
> ROC event will come and issue started. Since there should be at
> most one ROC session per wdev we can assume the temporary
> early_cookie won't be overwritten.
> 
> Another minor change was made to the lookup on the cancel path.
> Instead of looking up by wdev, lookup by the ID itself. We
> shouldn't ever have more than one info per wdev in the queue but
> looking up the _exact_ info structure doesn't hurt in case things
> change in the future.
> ---
>   src/offchannel.c | 73 ++++++++++++++++++++++++++++++++++++++++++------
>   1 file changed, 65 insertions(+), 8 deletions(-)
> 
> v2:
>   * Skip started if the event came early, and start in the ACK
>   * Add check for roc_cookie && roc_cmd_id in notify event. This
>     verifies that we actually have a pending request, if not it
>     came externally.
>   * Add 'early_cookie' placeholder to guard against the case of an
>     external event coming before the ACK. Prior, this would have
>     set roc_cookie to an incorrect cookie and called the started
>     callback prematurely.
> 

<snip>

> @@ -98,9 +118,19 @@ static void offchannel_roc_cb(struct l_genl_msg *msg, void *user_data)
>   		goto work_done;
>   	}
>   
> -	/* This request was cancelled, and ROC needs to be cancelled */
> +	/*
> +	 * If this request was cancelled prior to the request ever hitting the
> +	 * kernel, cancel now.

This comment makes no sense.  If we're here, then the request hit the kernel as 
best we can tell (it was sent on the socket).

> +	 *
> +	 * If the ROC event came before the ACK, call back now since the
> +	 * callback was skipped in the notify event. There is the potential that
> +	 * an external process issued the ROC, but if the cookies don't match
> +	 * here we can be sure it wasn't for us.
> +	 */
>   	if (info->needs_cancel)
>   		offchannel_cancel_roc(info);
> +	else if (info->early_cookie == info->roc_cookie && info->started)
> +		info->started(info->user_data);
>   
>   	return;
>   
> @@ -191,7 +221,8 @@ void offchannel_cancel(uint64_t wdev_id, uint32_t id)
>   	else if (ret == false)
>   		goto work_done;
>   
> -	info = l_queue_find(offchannel_list, match_wdev, &wdev_id);
> +
> +	info = l_queue_find(offchannel_list, match_id, &id);
>   	if (!info)
>   		return;
>   

So I split this chunk out into a separate commit and applied this.  Please 
double check that I didn't screw anything up.

> @@ -246,6 +277,7 @@ work_done:
>   static void offchannel_mlme_notify(struct l_genl_msg *msg, void *user_data)
>   {
>   	struct offchannel_info *info;
> +	struct match_data match = {0};
>   	uint64_t wdev_id;
>   	uint64_t cookie;
>   	uint8_t cmd;
> @@ -261,12 +293,37 @@ static void offchannel_mlme_notify(struct l_genl_msg *msg, void *user_data)
>   					NL80211_ATTR_UNSPEC) < 0)
>   		return;
>   
> -	info = l_queue_find(offchannel_list, match_wdev, &wdev_id);
> +	match.wdev_id = wdev_id;
> +	match.cookie = cookie;
> +
> +	info = l_queue_find(offchannel_list, match_info, &match);
> +	if (!info) {
> +		/* Try again without cookie */
> +		match.cookie = 0;
> +		info = l_queue_find(offchannel_list, match_info, &match);
> +	}
> +

I think this can be further nailed down as follows:

- Search by cookie, wdev and info->roc_cmd_id being 0.  That is our 'normal' 
case where we sent the ROC, get the ack, and the ROC event comes in.  Handle it 
normally and return.

If that fails, then search by wdev, info->rcmd_id_id != 0 and 
l_genl_family_request_sent is true.  If you don't find anything, then this 
request is from outside iwd.  This logic might be better as a for loop based on 
l_queue_get_entries().

Otherwise, set the early_cookie and return.

>   	if (!info)
>   		return;
>   
> -	/* ROC must have been started elsewhere, not by IWD */
> -	if (info->roc_cookie != cookie)
> +	/*
> +	 * If the cookie is zero and there is a pending ROC command there are
> +	 * two possibilities:
> +	 *   - The ACK callback came out of order. This has been seen in UML
> +	 *     when an offchannel request is canceled followed by another
> +	 *     request on the same channel. To handle this delay the started
> +	 *     callback until the ACK comes in when we can check the cookie.
> +	 *
> +	 *   - Event came from external process doing ROC. Checking the cookie
> +	 *     in the ACK lets us verify if this is the case.
> +	 *
> +	 * If the cookie is set but does not match, this ROC request came from
> +	 * outside IWD.

This last sentence probably belongs in the "If you don't find anything" part above.

> +	 */
> +	if (!info->roc_cookie && info->roc_cmd_id) {
> +		info->early_cookie = cookie;
> +		return;
> +	} else if (info->roc_cookie != cookie)

What does this do?

>   		return;
>   
>   	switch (cmd) {

Regards,
-Denis
James Prestwood Oct. 25, 2023, 12:39 p.m. UTC | #2
Hi Denis

On 10/24/23 7:39 PM, Denis Kenzior wrote:
> Hi James,
> 
> On 10/23/23 07:20, James Prestwood wrote:
>> Its been seen (so far only in mac80211_hwsim + UML) where an
>> offchannel requests ACK comes after the ROC started event. This
>> causes the ROC started event to never call back to notify since
>> info->roc_cookie is unset and it appears to be coming from an
>> external process.
>>
>> We can detect this situation in a few ways, first by looking up
>> the offchannel info by both wdev and cookie. If found, continue
>> as normal. If not lookup by only wdev and if there is still no
>> roc_cookie set the cookie from the notify callback to a temporary
>> 'early_cookie' placeholder and return. This can be checked in
>> the ACK and if it matches we know the ACK came out of order and
>> the started event can be sent.
>>
>> This also handles external process case. If we got a spurious
>> notify event before the ACK, the ACK callback will fail to
>> verify the early cookie, set roc_cookie correctly, and the expected
>> ROC event will come and issue started. Since there should be at
>> most one ROC session per wdev we can assume the temporary
>> early_cookie won't be overwritten.
>>
>> Another minor change was made to the lookup on the cancel path.
>> Instead of looking up by wdev, lookup by the ID itself. We
>> shouldn't ever have more than one info per wdev in the queue but
>> looking up the _exact_ info structure doesn't hurt in case things
>> change in the future.
>> ---
>>   src/offchannel.c | 73 ++++++++++++++++++++++++++++++++++++++++++------
>>   1 file changed, 65 insertions(+), 8 deletions(-)
>>
>> v2:
>>   * Skip started if the event came early, and start in the ACK
>>   * Add check for roc_cookie && roc_cmd_id in notify event. This
>>     verifies that we actually have a pending request, if not it
>>     came externally.
>>   * Add 'early_cookie' placeholder to guard against the case of an
>>     external event coming before the ACK. Prior, this would have
>>     set roc_cookie to an incorrect cookie and called the started
>>     callback prematurely.
>>
> 
> <snip>
> 
>> @@ -98,9 +118,19 @@ static void offchannel_roc_cb(struct l_genl_msg 
>> *msg, void *user_data)
>>           goto work_done;
>>       }
>> -    /* This request was cancelled, and ROC needs to be cancelled */
>> +    /*
>> +     * If this request was cancelled prior to the request ever 
>> hitting the
>> +     * kernel, cancel now.
> 
> This comment makes no sense.  If we're here, then the request hit the 
> kernel as best we can tell (it was sent on the socket).

Yeah what I meant was:

The request was canceled before the kernel ACKed.

> 
>> +     *
>> +     * If the ROC event came before the ACK, call back now since the
>> +     * callback was skipped in the notify event. There is the 
>> potential that
>> +     * an external process issued the ROC, but if the cookies don't 
>> match
>> +     * here we can be sure it wasn't for us.
>> +     */
>>       if (info->needs_cancel)
>>           offchannel_cancel_roc(info);
>> +    else if (info->early_cookie == info->roc_cookie && info->started)
>> +        info->started(info->user_data);
>>       return;
>> @@ -191,7 +221,8 @@ void offchannel_cancel(uint64_t wdev_id, uint32_t id)
>>       else if (ret == false)
>>           goto work_done;
>> -    info = l_queue_find(offchannel_list, match_wdev, &wdev_id);
>> +
>> +    info = l_queue_find(offchannel_list, match_id, &id);
>>       if (!info)
>>           return;
> 
> So I split this chunk out into a separate commit and applied this.  
> Please double check that I didn't screw anything up.
> 
>> @@ -246,6 +277,7 @@ work_done:
>>   static void offchannel_mlme_notify(struct l_genl_msg *msg, void 
>> *user_data)
>>   {
>>       struct offchannel_info *info;
>> +    struct match_data match = {0};
>>       uint64_t wdev_id;
>>       uint64_t cookie;
>>       uint8_t cmd;
>> @@ -261,12 +293,37 @@ static void offchannel_mlme_notify(struct 
>> l_genl_msg *msg, void *user_data)
>>                       NL80211_ATTR_UNSPEC) < 0)
>>           return;
>> -    info = l_queue_find(offchannel_list, match_wdev, &wdev_id);
>> +    match.wdev_id = wdev_id;
>> +    match.cookie = cookie;
>> +
>> +    info = l_queue_find(offchannel_list, match_info, &match);
>> +    if (!info) {
>> +        /* Try again without cookie */
>> +        match.cookie = 0;
>> +        info = l_queue_find(offchannel_list, match_info, &match);
>> +    }
>> +
> 
> I think this can be further nailed down as follows:
> 
> - Search by cookie, wdev and info->roc_cmd_id being 0.  That is our 
> 'normal' case where we sent the ROC, get the ack, and the ROC event 
> comes in.  Handle it normally and return.
> 
> If that fails, then search by wdev, info->rcmd_id_id != 0 and 
> l_genl_family_request_sent is true.  If you don't find anything, then 
> this request is from outside iwd.  This logic might be better as a for 
> loop based on l_queue_get_entries().
> 
> Otherwise, set the early_cookie and return.
> 
>>       if (!info)
>>           return;
>> -    /* ROC must have been started elsewhere, not by IWD */
>> -    if (info->roc_cookie != cookie)
>> +    /*
>> +     * If the cookie is zero and there is a pending ROC command there 
>> are
>> +     * two possibilities:
>> +     *   - The ACK callback came out of order. This has been seen in UML
>> +     *     when an offchannel request is canceled followed by another
>> +     *     request on the same channel. To handle this delay the started
>> +     *     callback until the ACK comes in when we can check the cookie.
>> +     *
>> +     *   - Event came from external process doing ROC. Checking the 
>> cookie
>> +     *     in the ACK lets us verify if this is the case.
>> +     *
>> +     * If the cookie is set but does not match, this ROC request came 
>> from
>> +     * outside IWD.
> 
> This last sentence probably belongs in the "If you don't find anything" 
> part above.
> 
>> +     */
>> +    if (!info->roc_cookie && info->roc_cmd_id) {
>> +        info->early_cookie = cookie;
>> +        return;
>> +    } else if (info->roc_cookie != cookie)
> 
> What does this do?

Handling the external process case. But since I added && 
info->roc_cmd_id above it should actually check that info->roc_cookie 
!=0 && roc_cookie != cookie. But your suggestion above covers all the 
bases, I'll do that.

Thanks,
James

> 
>>           return;
>>       switch (cmd) {
> 
> Regards,
> -Denis
diff mbox series

Patch

diff --git a/src/offchannel.c b/src/offchannel.c
index b9cdc117..98cbe560 100644
--- a/src/offchannel.c
+++ b/src/offchannel.c
@@ -43,6 +43,7 @@  struct offchannel_info {
 
 	uint32_t roc_cmd_id;
 	uint64_t roc_cookie;
+	uint64_t early_cookie;
 
 	offchannel_started_cb_t started;
 	offchannel_destroy_cb_t destroy;
@@ -54,15 +55,34 @@  struct offchannel_info {
 	bool needs_cancel : 1;
 };
 
+struct match_data {
+	uint64_t wdev_id;
+	uint64_t cookie;
+};
+
 static struct l_genl_family *nl80211;
 static struct l_queue *offchannel_list;
 
-static bool match_wdev(const void *a, const void *user_data)
+static bool match_info(const void *a, const void *user_data)
 {
+	const struct match_data *match = user_data;
 	const struct offchannel_info *info = a;
-	const uint64_t *wdev_id = user_data;
 
-	return info->wdev_id == *wdev_id;
+	if (match->wdev_id != info->wdev_id)
+		return false;
+
+	if (!match->cookie)
+		return true;
+
+	return match->cookie == info->roc_cookie;
+}
+
+static bool match_id(const void *a, const void *user_data)
+{
+	const uint32_t *id = user_data;
+	const struct offchannel_info *info = a;
+
+	return *id == info->work.id;
 }
 
 static void offchannel_cancel_roc(struct offchannel_info *info)
@@ -98,9 +118,19 @@  static void offchannel_roc_cb(struct l_genl_msg *msg, void *user_data)
 		goto work_done;
 	}
 
-	/* This request was cancelled, and ROC needs to be cancelled */
+	/*
+	 * If this request was cancelled prior to the request ever hitting the
+	 * kernel, cancel now.
+	 *
+	 * If the ROC event came before the ACK, call back now since the
+	 * callback was skipped in the notify event. There is the potential that
+	 * an external process issued the ROC, but if the cookies don't match
+	 * here we can be sure it wasn't for us.
+	 */
 	if (info->needs_cancel)
 		offchannel_cancel_roc(info);
+	else if (info->early_cookie == info->roc_cookie && info->started)
+		info->started(info->user_data);
 
 	return;
 
@@ -191,7 +221,8 @@  void offchannel_cancel(uint64_t wdev_id, uint32_t id)
 	else if (ret == false)
 		goto work_done;
 
-	info = l_queue_find(offchannel_list, match_wdev, &wdev_id);
+
+	info = l_queue_find(offchannel_list, match_id, &id);
 	if (!info)
 		return;
 
@@ -246,6 +277,7 @@  work_done:
 static void offchannel_mlme_notify(struct l_genl_msg *msg, void *user_data)
 {
 	struct offchannel_info *info;
+	struct match_data match = {0};
 	uint64_t wdev_id;
 	uint64_t cookie;
 	uint8_t cmd;
@@ -261,12 +293,37 @@  static void offchannel_mlme_notify(struct l_genl_msg *msg, void *user_data)
 					NL80211_ATTR_UNSPEC) < 0)
 		return;
 
-	info = l_queue_find(offchannel_list, match_wdev, &wdev_id);
+	match.wdev_id = wdev_id;
+	match.cookie = cookie;
+
+	info = l_queue_find(offchannel_list, match_info, &match);
+	if (!info) {
+		/* Try again without cookie */
+		match.cookie = 0;
+		info = l_queue_find(offchannel_list, match_info, &match);
+	}
+
 	if (!info)
 		return;
 
-	/* ROC must have been started elsewhere, not by IWD */
-	if (info->roc_cookie != cookie)
+	/*
+	 * If the cookie is zero and there is a pending ROC command there are
+	 * two possibilities:
+	 *   - The ACK callback came out of order. This has been seen in UML
+	 *     when an offchannel request is canceled followed by another
+	 *     request on the same channel. To handle this delay the started
+	 *     callback until the ACK comes in when we can check the cookie.
+	 *
+	 *   - Event came from external process doing ROC. Checking the cookie
+	 *     in the ACK lets us verify if this is the case.
+	 *
+	 * If the cookie is set but does not match, this ROC request came from
+	 * outside IWD.
+	 */
+	if (!info->roc_cookie && info->roc_cmd_id) {
+		info->early_cookie = cookie;
+		return;
+	} else if (info->roc_cookie != cookie)
 		return;
 
 	switch (cmd) {