diff mbox series

[v2,1/1] wiphy: fix regdom change wiphy dump logic

Message ID 20230510204508.715558-2-prestwoj@gmail.com (mailing list archive)
State New
Headers show
Series Fix wiphy regdom logic | expand

Checks

Context Check Description
tedd_an/pre-ci_am success Success
prestwoj/iwd-ci-gitlint success GitLint
prestwoj/iwd-ci-fetch success Fetch PR
prestwoj/iwd-alpine-ci-fetch success Fetch PR
prestwoj/iwd-ci-incremental_build success Incremental build not run PASS
prestwoj/iwd-alpine-ci-makedistcheck fail Make Distcheck Make FAIL: make[2]: *** No rule to make target 'ell/sysctl.h', needed by 'distdir-am'. Stop. make[1]: *** [Makefile:3218: distdir] Error 2 make: *** [Makefile:3298: dist] Error 2
prestwoj/iwd-alpine-ci-incremental_build success Incremental build not run PASS
prestwoj/iwd-alpine-ci-build success Build - Configure
prestwoj/iwd-alpine-ci-makecheckvalgrind fail Make FAIL: make[1]: *** No rule to make target 'ell/sysctl.c', needed by 'ell/sysctl.lo'. Stop. make[1]: *** Waiting for unfinished jobs.... make: *** [Makefile:1710: all] Error 2
prestwoj/iwd-alpine-ci-makecheck pending makecheck SKIP
prestwoj/iwd-ci-build success Build - Configure
prestwoj/iwd-ci-makecheckvalgrind fail Make FAIL: make[1]: *** No rule to make target 'ell/sysctl.c', needed by 'ell/sysctl.lo'. Stop. make[1]: *** Waiting for unfinished jobs.... make: *** [Makefile:1709: all] Error 2
prestwoj/iwd-ci-makecheck pending makecheck SKIP
prestwoj/iwd-ci-clang fail Clang IWD - make FAIL: make[1]: *** No rule to make target 'ell/sysctl.c', needed by 'ell/sysctl.lo'. Stop. make[1]: *** Waiting for unfinished jobs.... make: *** [Makefile:1709: all] Error 2
prestwoj/iwd-ci-makedistcheck fail Make Distcheck Make FAIL: make[2]: *** No rule to make target 'ell/sysctl.h', needed by 'distdir-am'. Stop. make[1]: *** [Makefile:3217: distdir] Error 2 make: *** [Makefile:3297: dist] Error 2
prestwoj/iwd-ci-testrunner pending testrunner SKIP

Commit Message

James Prestwood May 10, 2023, 8:45 p.m. UTC
Certain drivers change the regulatory domain when the client roams.
First XX, then the country the roam target is using. Depending on the
timing the second regdom event can come while the first is still
dumping. This was not handled gracefully/nicely in IWD and it would
just issue another dump (it tries to cancel but if the dump started
there isn't anything that can be done). AFAIK the kernel should be
able to handle this but based on the behavior we saw it seems to not
be able to, or at least not reliably.

When the second dump comes in and GET_WIPHY is issued the kernel
stops replying to any messages which obviously causes IWD to completely
hang up waiting for various acks (set_station, set_key, etc.).

Now if another dump is ongoing (and cant be safely canceled) a flag
will be set when another dump is needed. The first dump will be
waited for and only once completed will another start. This also
ensures there is at most 1 pending dump when before any number of
dumps could be queued which is pointless since only the last dump
request matters.
---
 src/wiphy.c | 103 +++++++++++++++++++++++++++-------------------------
 1 file changed, 54 insertions(+), 49 deletions(-)

Comments

Denis Kenzior May 17, 2023, 12:53 a.m. UTC | #1
Hi James,

On 5/10/23 15:45, James Prestwood wrote:
> Certain drivers change the regulatory domain when the client roams.
> First XX, then the country the roam target is using. Depending on the
> timing the second regdom event can come while the first is still
> dumping. This was not handled gracefully/nicely in IWD and it would
> just issue another dump (it tries to cancel but if the dump started
> there isn't anything that can be done). AFAIK the kernel should be
> able to handle this but based on the behavior we saw it seems to not
> be able to, or at least not reliably.

Okay.  Might be useful to know what the kernel actually expects to happen in 
case we send on a socket during an ongoing command.

> 
> When the second dump comes in and GET_WIPHY is issued the kernel
> stops replying to any messages which obviously causes IWD to completely
> hang up waiting for various acks (set_station, set_key, etc.).
> 

Sounds like genl isn't doing the right thing...

> Now if another dump is ongoing (and cant be safely canceled) a flag
> will be set when another dump is needed. The first dump will be
> waited for and only once completed will another start. This also
> ensures there is at most 1 pending dump when before any number of
> dumps could be queued which is pointless since only the last dump
> request matters.

NAK.  Please stop right there.  genl is already a queue.  There's no sense 
adding queuing on top of it in this particular case.  If l_genl_family_cancel 
isn't doing the right thing, then fix that instead.

> ---
>   src/wiphy.c | 103 +++++++++++++++++++++++++++-------------------------
>   1 file changed, 54 insertions(+), 49 deletions(-)
> 

Regards,
-Denis
James Prestwood May 17, 2023, 3:11 p.m. UTC | #2
Hi Denis,

On 5/16/23 5:53 PM, Denis Kenzior wrote:
> Hi James,
> 
> On 5/10/23 15:45, James Prestwood wrote:
>> Certain drivers change the regulatory domain when the client roams.
>> First XX, then the country the roam target is using. Depending on the
>> timing the second regdom event can come while the first is still
>> dumping. This was not handled gracefully/nicely in IWD and it would
>> just issue another dump (it tries to cancel but if the dump started
>> there isn't anything that can be done). AFAIK the kernel should be
>> able to handle this but based on the behavior we saw it seems to not
>> be able to, or at least not reliably.
> 
> Okay.  Might be useful to know what the kernel actually expects to 
> happen in case we send on a socket during an ongoing command.

I can try and get this to happen locally to investigate further, but so 
far its been a rare case. Maybe happening 1 time in a 24 hour period on 
20-30 clients. I'm thinking maybe a locking/race problem in 
nl80211_dump_wiphy(), but its just a guess.

> 
>>
>> When the second dump comes in and GET_WIPHY is issued the kernel
>> stops replying to any messages which obviously causes IWD to completely
>> hang up waiting for various acks (set_station, set_key, etc.).
>>
> 
> Sounds like genl isn't doing the right thing...
> 
>> Now if another dump is ongoing (and cant be safely canceled) a flag
>> will be set when another dump is needed. The first dump will be
>> waited for and only once completed will another start. This also
>> ensures there is at most 1 pending dump when before any number of
>> dumps could be queued which is pointless since only the last dump
>> request matters.
> 
> NAK.  Please stop right there.  genl is already a queue.  There's no 
> sense adding queuing on top of it in this particular case.  If 
> l_genl_family_cancel isn't doing the right thing, then fix that instead.

Well l_genl_family_cancel really cant do anything about it. Once the 
dump hits the kernel you cant cancel it (this is why we added 
l_genl_family_request_sent). So the only alternative I can think of, if 
we cant do it in IWD, would be to make l_genl queue dumps and not issue 
more dumps until the previous one finishes. This is obviously a 
significant behavioral change.

> 
>> ---
>>   src/wiphy.c | 103 +++++++++++++++++++++++++++-------------------------
>>   1 file changed, 54 insertions(+), 49 deletions(-)
>>
> 
> Regards,
> -Denis
Denis Kenzior May 18, 2023, 3:21 a.m. UTC | #3
Hi James,

>> NAK.  Please stop right there.  genl is already a queue.  There's no sense 
>> adding queuing on top of it in this particular case.  If l_genl_family_cancel 
>> isn't doing the right thing, then fix that instead.
> 
> Well l_genl_family_cancel really cant do anything about it. Once the dump hits 
> the kernel you cant cancel it (this is why we added l_genl_family_request_sent). 

This was done for a different reason.  In the case of CMD_REMAIN_ON_CHANNEL we 
must wait for the ack to obtain the cookie.  We can cancel the ROC only if we 
know the cookie.  Otherwise, even if we call l_genl_family_cancel, the ROC would 
still go ahead.

What you have here is a different and simpler case.

> So the only alternative I can think of, if we cant do it in IWD, would be to 
> make l_genl queue dumps and not issue more dumps until the previous one 
> finishes. This is obviously a significant behavioral change.

I'm lost?  That is what is supposed to happen and happens already today.  If you 
issue two dumps like:

l_genl_family_dump(..., msg1, ...);
l_genl_family_dump(..., msg2, ...);

Then msg2 will not be sent to the kernel until NLMSG_DONE flagged message is 
received.  So what exactly would be the behavioral change you are referring to?

Anyway, looking at l_genl_family_cancel, it is doing the wrong thing, so that 
should be fixed.

Regards,
-Denis
James Prestwood May 18, 2023, 2:03 p.m. UTC | #4
Hi Denis,

On 5/17/23 8:21 PM, Denis Kenzior wrote:
> Hi James,
> 
>>> NAK.  Please stop right there.  genl is already a queue.  There's no 
>>> sense adding queuing on top of it in this particular case.  If 
>>> l_genl_family_cancel isn't doing the right thing, then fix that instead.
>>
>> Well l_genl_family_cancel really cant do anything about it. Once the 
>> dump hits the kernel you cant cancel it (this is why we added 
>> l_genl_family_request_sent). 
> 
> This was done for a different reason.  In the case of 
> CMD_REMAIN_ON_CHANNEL we must wait for the ack to obtain the cookie.  We 
> can cancel the ROC only if we know the cookie.  Otherwise, even if we 
> call l_genl_family_cancel, the ROC would still go ahead.
> 
> What you have here is a different and simpler case.
> 
>> So the only alternative I can think of, if we cant do it in IWD, would 
>> be to make l_genl queue dumps and not issue more dumps until the 
>> previous one finishes. This is obviously a significant behavioral change.
> 
> I'm lost?  That is what is supposed to happen and happens already 
> today.  If you issue two dumps like:
> 
> l_genl_family_dump(..., msg1, ...);
> l_genl_family_dump(..., msg2, ...);
> 
> Then msg2 will not be sent to the kernel until NLMSG_DONE flagged 
> message is received.  So what exactly would be the behavioral change you 
> are referring to?

Ok this was not my understanding of how l_genl messaging worked. I 
thought you could have multiple in-flight messages. But looking at it 
you can't, the IO writer only sends a single message.

So I'll fix l_genl_family_cancel to not remove the message from the 
pending queue until we get NLMSG_DONE.

> 
> Anyway, looking at l_genl_family_cancel, it is doing the wrong thing, so 
> that should be fixed.
> 
> Regards,
> -Denis
diff mbox series

Patch

diff --git a/src/wiphy.c b/src/wiphy.c
index 2db2d2cd..dbffa44e 100644
--- a/src/wiphy.c
+++ b/src/wiphy.c
@@ -67,6 +67,7 @@  static int mac_randomize_bytes = 6;
 static char regdom_country[2];
 static uint32_t work_ids;
 static unsigned int wiphy_dump_id;
+static bool dump_queued;
 
 enum driver_flag {
 	DEFAULT_IF = 0x1,
@@ -123,6 +124,7 @@  struct wiphy {
 	bool work_in_callback;
 	unsigned int get_reg_id;
 	unsigned int dump_id;
+	bool dump_queued : 1;
 
 	bool support_scheduled_scan:1;
 	bool support_rekey_offload:1;
@@ -2066,18 +2068,23 @@  static void wiphy_setup_rm_enabled_capabilities(struct wiphy *wiphy)
 	 */
 }
 
+static void wiphy_dump_after_regdom(struct wiphy *wiphy);
+
 static void wiphy_dump_done(void *user_data)
 {
 	struct wiphy *wiphy = user_data;
 	const struct l_queue_entry *e;
 
-	/* This dump was canceled due to another dump */
-	if ((wiphy && !wiphy->dump_id) || (!wiphy && !wiphy_dump_id))
-		return;
-
 	if (wiphy) {
 		wiphy->dump_id = 0;
 
+		if (wiphy->dump_queued) {
+			wiphy->dump_queued = false;
+			wiphy_dump_after_regdom(wiphy);
+			return;
+		}
+
+		/* No other dumps queued, emit done */
 		WATCHLIST_NOTIFY(&wiphy->state_watches,
 				wiphy_state_watch_func_t, wiphy,
 				WIPHY_STATE_WATCH_EVENT_REGDOM_DONE);
@@ -2087,6 +2094,12 @@  static void wiphy_dump_done(void *user_data)
 
 	wiphy_dump_id = 0;
 
+	if (dump_queued) {
+		dump_queued = false;
+		wiphy_dump_after_regdom(NULL);
+		return;
+	}
+
 	for (e = l_queue_get_entries(wiphy_list); e; e = e->next) {
 		wiphy = e->data;
 
@@ -2160,38 +2173,48 @@  static void wiphy_dump_callback(struct l_genl_msg *msg,
 	}
 }
 
-static bool wiphy_cancel_last_dump(struct wiphy *wiphy)
+static void wiphy_dump_after_regdom(struct wiphy *wiphy)
 {
-	unsigned int id = 0;
+	const struct l_queue_entry *e;
+	struct l_genl_msg *msg;
+	unsigned int *id = NULL;
 
-	/*
-	 * Zero command ID to signal that wiphy_dump_done doesn't need to do
-	 * anything.
-	 */
-	if (wiphy && wiphy->dump_id) {
-		id = wiphy->dump_id;
-		wiphy->dump_id = 0;
-	} else if (!wiphy && wiphy_dump_id) {
-		id = wiphy_dump_id;
-		wiphy_dump_id = 0;
-	}
+	if (wiphy)
+		id = &wiphy->dump_id;
+	else
+		id = &wiphy_dump_id;
 
-	if (id) {
-		l_debug("Canceling pending regdom wiphy dump (%s)",
+	if (*id) {
+		/*
+		 * If the dump hasn't hit the kernel we can safely cancel it.
+		 * Otherwise set the dump_queued flag and once the current dump
+		 * has completed, another will start. This also ensures there is
+		 * only at most 1 pending dump.
+		 */
+		if (!l_genl_family_request_sent(nl80211, *id)) {
+			l_debug("Canceling pending regdom wiphy dump (%s)",
 					wiphy ? wiphy->name : "global");
 
-		l_genl_family_cancel(nl80211, id);
+			l_genl_family_cancel(nl80211, *id);
+
+			/*
+			 * If there is already dump queued (dump_queued=true)
+			 * the destroy callback will end up starting another
+			 * dump. If this happens nothing more needs to be done.
+			 */
+			if (*id)
+				return;
+		} else if (wiphy) {
+			wiphy->dump_queued = true;
+			return;
+		} else {
+			dump_queued = true;
+			return;
+		}
 	}
 
-	return id != 0;
-}
-
-static void wiphy_dump_after_regdom(struct wiphy *wiphy)
-{
-	const struct l_queue_entry *e;
-	struct l_genl_msg *msg;
-	unsigned int id;
-	bool no_start_event;
+	l_debug("Dumping wiphy after regdom change (%s)",
+			wiphy ? wiphy->name : "global");
 
 	msg = l_genl_msg_new_sized(NL80211_CMD_GET_WIPHY, 128);
 
@@ -2199,37 +2222,22 @@  static void wiphy_dump_after_regdom(struct wiphy *wiphy)
 		l_genl_msg_append_attr(msg, NL80211_ATTR_WIPHY, 4, &wiphy->id);
 
 	l_genl_msg_append_attr(msg, NL80211_ATTR_SPLIT_WIPHY_DUMP, 0, NULL);
-	id = l_genl_family_dump(nl80211, msg, wiphy_dump_callback,
+	*id = l_genl_family_dump(nl80211, msg, wiphy_dump_callback,
 						wiphy, wiphy_dump_done);
-	if (!id) {
+	if (!*id) {
 		l_error("Wiphy information dump failed");
 		l_genl_msg_unref(msg);
 		return;
 	}
 
-	/*
-	 * Another update while dumping wiphy. This next dump should supercede
-	 * the first and not result in a DONE event until this new dump is
-	 * finished. This is because the disabled frequencies are in an unknown
-	 * state and could cause incorrect behavior by any watchers.
-	 */
-	no_start_event = wiphy_cancel_last_dump(wiphy);
-
 	/* Limited dump so just emit the event for this wiphy */
 	if (wiphy) {
-		wiphy->dump_id = id;
-
-		if (no_start_event)
-			return;
-
 		WATCHLIST_NOTIFY(&wiphy->state_watches,
 				wiphy_state_watch_func_t, wiphy,
 				WIPHY_STATE_WATCH_EVENT_REGDOM_STARTED);
 		return;
 	}
 
-	wiphy_dump_id = id;
-
 	/* Otherwise for a global regdom change notify for all wiphy's */
 	for (e = l_queue_get_entries(wiphy_list); e; e = e->next) {
 		struct wiphy *w = e->data;
@@ -2237,9 +2245,6 @@  static void wiphy_dump_after_regdom(struct wiphy *wiphy)
 		if (w->self_managed)
 			continue;
 
-		if (no_start_event)
-			continue;
-
 		WATCHLIST_NOTIFY(&w->state_watches, wiphy_state_watch_func_t,
 				w, WIPHY_STATE_WATCH_EVENT_REGDOM_STARTED);
 	}