Message ID | 20231012200150.338401-11-prestwoj@gmail.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | DPP PKEX Changes | expand |
Hi James, On 10/12/23 15:01, James Prestwood wrote: > There was nothing which prevented this, but due to the behavior of > some drivers multiple offchannel requests on the same channel Isn't the point of offchannel API to serialize all such requests so this does not happen? > resulted in the second request never starting and eventually timing > out. This is because some drivers combine offchannel requests if > they are on the same channel and this ultimately results in the > netlink ACK coming after the ROC started event. This patch fixes > some logic to allow for this case. Kernel ROC APIs have no enforcement of semantics at all. Different drivers just do whatever. Have you tested that this works on brcmfmac for example? It might be better to explicitly wait for the ROC event to be ended before starting a new one. > > The motivation to support this is so modules can start offchannel > work items for short durations and wait for a response, if a frame > is received the offchannel request can be canceled/restarted for > a longer duration. > > This could also be done instead by using a long duration initially > and an extra timer to cancel, but its more convenient if offchannel > supports this natively. In addition, this driver quirk should be > supported regardless (e.g. if two IWD modules happen to issue ROC's > on the same channel). Then shouldn't offchannel serialize such requests? > > Furthermore, the offchannel module was only looking up requests by > wdev_id which could result in the wrong request being found. > Instead the request should be looked up by both wdev_id and cookie > (when possible), or the ID in the case of canceling. This part makes sense and probably belongs in a separate patch. > --- > src/offchannel.c | 55 ++++++++++++++++++++++++++++++++++++++++++------ > 1 file changed, 48 insertions(+), 7 deletions(-) > Regards, -Denis
Hi Denis, On 10/19/23 7:51 AM, Denis Kenzior wrote: > Hi James, > > On 10/12/23 15:01, James Prestwood wrote: >> There was nothing which prevented this, but due to the behavior of >> some drivers multiple offchannel requests on the same channel > > Isn't the point of offchannel API to serialize all such requests so this > does not happen? > >> resulted in the second request never starting and eventually timing >> out. This is because some drivers combine offchannel requests if >> they are on the same channel and this ultimately results in the >> netlink ACK coming after the ROC started event. This patch fixes >> some logic to allow for this case. > > Kernel ROC APIs have no enforcement of semantics at all. Different > drivers just do whatever. Have you tested that this works on brcmfmac > for example? It might be better to explicitly wait for the ROC event to > be ended before starting a new one. > >> >> The motivation to support this is so modules can start offchannel >> work items for short durations and wait for a response, if a frame >> is received the offchannel request can be canceled/restarted for >> a longer duration. >> >> This could also be done instead by using a long duration initially >> and an extra timer to cancel, but its more convenient if offchannel >> supports this natively. In addition, this driver quirk should be >> supported regardless (e.g. if two IWD modules happen to issue ROC's >> on the same channel). > > Then shouldn't offchannel serialize such requests? > >> >> Furthermore, the offchannel module was only looking up requests by >> wdev_id which could result in the wrong request being found. >> Instead the request should be looked up by both wdev_id and cookie >> (when possible), or the ID in the case of canceling. > > This part makes sense and probably belongs in a separate patch. So I'm not sure what exactly is going on here. I have yet to see this on actual hardware, but it happens in hwsim frequently. You are right that the offchannel module gates all the work items, and correctly waits for one to finish. So I was wrong that it needed "fixing" in that regard. Here are some logs with the ack coming out of order, maybe this is some UML scheduling thing, not sure: src/dpp.c:dpp_start_pkex_enrollee() PKEX start enrollee (id=test) src/wiphy.c:wiphy_radio_work_insert() Inserting work item 1 src/wiphy.c:wiphy_radio_work_next() Starting work item 1 src/offchannel.c:offchannel_work_ready() Issuing ROC src/offchannel.c:offchannel_roc_cb() cookie=1 src/netdev.c:netdev_mlme_notify() MLME notification Remain on Channel(55) src/offchannel.c:offchannel_mlme_notify() ROC notify, cookie=1 src/dpp.c:dpp_send_frame() Sending frame on frequency 2437 src/netdev.c:netdev_unicast_notify() Unicast notification Frame(59) src/dpp.c:dpp_handle_pkex_exchange_request() PKEX exchange request 02:00:00:00:02:00 src/dpp.c:dpp_send_frame() Sending frame on frequency 2437 src/netdev.c:netdev_mlme_notify() MLME notification Frame TX Status(60) src/netdev.c:netdev_unicast_notify() Unicast notification Frame(59) src/dpp.c:dpp_handle_pkex_commit_reveal_request() PKEX commit-reveal request 02:00:00:00:02:00 src/dpp.c:dpp_send_frame() Sending frame on frequency 2437 src/netdev.c:netdev_mlme_notify() MLME notification Frame TX Status(60) src/netdev.c:netdev_unicast_notify() Unicast notification Frame(59) src/dpp.c:dpp_handle_pkex_exchange_response() PKEX response 02:00:00:00:03:00 # Got a response, so the prior offchannel work (1) was canceled, and a new item inserted (2) src/wiphy.c:wiphy_radio_work_insert() Inserting work item 2 src/netdev.c:netdev_mlme_notify() MLME notification Cancel Remain on Channel(56) src/offchannel.c:offchannel_mlme_notify() ROC cancel, cookie=1 # Cancel ROC is correctly waited for before starting the next item src/wiphy.c:wiphy_radio_work_done() Work item 1 done src/wiphy.c:wiphy_radio_work_next() Starting work item 2 src/offchannel.c:offchannel_work_ready() Issuing ROC src/netdev.c:netdev_mlme_notify() MLME notification Remain on Channel(55) # Then immediately we get a Remain on Channel event src/offchannel.c:offchannel_mlme_notify() ROC notify, cookie=3 src/offchannel.c:offchannel_mlme_notify() ROC started prior to ACK, setting cookie 3 src/dpp.c:dpp_send_frame() Sending frame on frequency 2437 # And finally the ack comes in src/offchannel.c:offchannel_roc_cb() cookie=3 I need to look at the kernel a bit more to see how this can happen, but apart from me removing some of the commit description I do think the patch is required to handle this case. Thanks, James
Hi James, > > So I'm not sure what exactly is going on here. I have yet to see this on actual > hardware, but it happens in hwsim frequently. You are right that the offchannel > module gates all the work items, and correctly waits for one to finish. So I was > wrong that it needed "fixing" in that regard. Here are some logs with the ack > coming out of order, maybe this is some UML scheduling thing, not sure: > > src/dpp.c:dpp_start_pkex_enrollee() PKEX start enrollee (id=test) > src/wiphy.c:wiphy_radio_work_insert() Inserting work item 1 > src/wiphy.c:wiphy_radio_work_next() Starting work item 1 > src/offchannel.c:offchannel_work_ready() Issuing ROC > src/offchannel.c:offchannel_roc_cb() cookie=1 > src/netdev.c:netdev_mlme_notify() MLME notification Remain on Channel(55) > src/offchannel.c:offchannel_mlme_notify() ROC notify, cookie=1 > src/dpp.c:dpp_send_frame() Sending frame on frequency 2437 > src/netdev.c:netdev_unicast_notify() Unicast notification Frame(59) > src/dpp.c:dpp_handle_pkex_exchange_request() PKEX exchange request > 02:00:00:00:02:00 > src/dpp.c:dpp_send_frame() Sending frame on frequency 2437 > src/netdev.c:netdev_mlme_notify() MLME notification Frame TX Status(60) > src/netdev.c:netdev_unicast_notify() Unicast notification Frame(59) > src/dpp.c:dpp_handle_pkex_commit_reveal_request() PKEX commit-reveal request > 02:00:00:00:02:00 > src/dpp.c:dpp_send_frame() Sending frame on frequency 2437 > src/netdev.c:netdev_mlme_notify() MLME notification Frame TX Status(60) > src/netdev.c:netdev_unicast_notify() Unicast notification Frame(59) > src/dpp.c:dpp_handle_pkex_exchange_response() PKEX response 02:00:00:00:03:00 > > # Got a response, so the prior offchannel work (1) was canceled, and a new item > inserted (2) > > src/wiphy.c:wiphy_radio_work_insert() Inserting work item 2 > src/netdev.c:netdev_mlme_notify() MLME notification Cancel Remain on Channel(56) So is this iwd sending a NL80211_CMD_CANCEL_REMAIN_ON_CHANNEL? Should we also wait for the ack of this one too? > src/offchannel.c:offchannel_mlme_notify() ROC cancel, cookie=1 > > # Cancel ROC is correctly waited for before starting the next item > src/wiphy.c:wiphy_radio_work_done() Work item 1 done > src/wiphy.c:wiphy_radio_work_next() Starting work item 2 > src/offchannel.c:offchannel_work_ready() Issuing ROC > src/netdev.c:netdev_mlme_notify() MLME notification Remain on Channel(55) This seems fishy? What else is going offchannel? > > # Then immediately we get a Remain on Channel event > src/offchannel.c:offchannel_mlme_notify() ROC notify, cookie=3 > src/offchannel.c:offchannel_mlme_notify() ROC started prior to ACK, setting > cookie 3 > src/dpp.c:dpp_send_frame() Sending frame on frequency 2437 > > # And finally the ack comes in > src/offchannel.c:offchannel_roc_cb() cookie=3 Yeah, why is the cookie 3? Shouldn't it be 2? > > I need to look at the kernel a bit more to see how this can happen, but apart > from me removing some of the commit description I do think the patch is required > to handle this case. > Regards, -Denis
Hi Denis, On 10/19/23 12:55 PM, Denis Kenzior wrote: > Hi James, > >> >> So I'm not sure what exactly is going on here. I have yet to see this >> on actual hardware, but it happens in hwsim frequently. You are right >> that the offchannel module gates all the work items, and correctly >> waits for one to finish. So I was wrong that it needed "fixing" in >> that regard. Here are some logs with the ack coming out of order, >> maybe this is some UML scheduling thing, not sure: >> >> src/dpp.c:dpp_start_pkex_enrollee() PKEX start enrollee (id=test) >> src/wiphy.c:wiphy_radio_work_insert() Inserting work item 1 >> src/wiphy.c:wiphy_radio_work_next() Starting work item 1 >> src/offchannel.c:offchannel_work_ready() Issuing ROC >> src/offchannel.c:offchannel_roc_cb() cookie=1 >> src/netdev.c:netdev_mlme_notify() MLME notification Remain on Channel(55) >> src/offchannel.c:offchannel_mlme_notify() ROC notify, cookie=1 >> src/dpp.c:dpp_send_frame() Sending frame on frequency 2437 >> src/netdev.c:netdev_unicast_notify() Unicast notification Frame(59) >> src/dpp.c:dpp_handle_pkex_exchange_request() PKEX exchange request >> 02:00:00:00:02:00 >> src/dpp.c:dpp_send_frame() Sending frame on frequency 2437 >> src/netdev.c:netdev_mlme_notify() MLME notification Frame TX Status(60) >> src/netdev.c:netdev_unicast_notify() Unicast notification Frame(59) >> src/dpp.c:dpp_handle_pkex_commit_reveal_request() PKEX commit-reveal >> request 02:00:00:00:02:00 >> src/dpp.c:dpp_send_frame() Sending frame on frequency 2437 >> src/netdev.c:netdev_mlme_notify() MLME notification Frame TX Status(60) >> src/netdev.c:netdev_unicast_notify() Unicast notification Frame(59) >> src/dpp.c:dpp_handle_pkex_exchange_response() PKEX response >> 02:00:00:00:03:00 >> >> # Got a response, so the prior offchannel work (1) was canceled, and a >> new item inserted (2) >> >> src/wiphy.c:wiphy_radio_work_insert() Inserting work item 2 >> src/netdev.c:netdev_mlme_notify() MLME notification Cancel Remain on >> Channel(56) > > So is this iwd sending a NL80211_CMD_CANCEL_REMAIN_ON_CHANNEL? Should > we also wait for the ack of this one too? It wouldn't hurt, but I think we were always under the assuming that the ack would come before the event so I never bothered using a callback :) > >> src/offchannel.c:offchannel_mlme_notify() ROC cancel, cookie=1 >> >> # Cancel ROC is correctly waited for before starting the next item >> src/wiphy.c:wiphy_radio_work_done() Work item 1 done >> src/wiphy.c:wiphy_radio_work_next() Starting work item 2 >> src/offchannel.c:offchannel_work_ready() Issuing ROC >> src/netdev.c:netdev_mlme_notify() MLME notification Remain on Channel(55) > > This seems fishy? What else is going offchannel? No, this is the same event as below, just netdev printing it. > >> >> # Then immediately we get a Remain on Channel event >> src/offchannel.c:offchannel_mlme_notify() ROC notify, cookie=3 >> src/offchannel.c:offchannel_mlme_notify() ROC started prior to ACK, >> setting cookie 3 >> src/dpp.c:dpp_send_frame() Sending frame on frequency 2437 >> >> # And finally the ack comes in >> src/offchannel.c:offchannel_roc_cb() cookie=3 > > Yeah, why is the cookie 3? Shouldn't it be 2? 2 is the work item, 3 is the cookie above. > >> >> I need to look at the kernel a bit more to see how this can happen, >> but apart from me removing some of the commit description I do think >> the patch is required to handle this case. >> > > Regards, > -Denis >
Hi James, > > It wouldn't hurt, but I think we were always under the assuming that the ack > would come before the event so I never bothered using a callback :) > My memory is fuzzy now, but I think brcmfmac was very weird in this area. Which might explain why offchannel code is written the way it is. >> >>> src/offchannel.c:offchannel_mlme_notify() ROC cancel, cookie=1 >>> >>> # Cancel ROC is correctly waited for before starting the next item >>> src/wiphy.c:wiphy_radio_work_done() Work item 1 done >>> src/wiphy.c:wiphy_radio_work_next() Starting work item 2 >>> src/offchannel.c:offchannel_work_ready() Issuing ROC >>> src/netdev.c:netdev_mlme_notify() MLME notification Remain on Channel(55) >> >> This seems fishy? What else is going offchannel? > > No, this is the same event as below, just netdev printing it. Ah, ok. > >> >>> >>> # Then immediately we get a Remain on Channel event >>> src/offchannel.c:offchannel_mlme_notify() ROC notify, cookie=3 >>> src/offchannel.c:offchannel_mlme_notify() ROC started prior to ACK, setting >>> cookie 3 >>> src/dpp.c:dpp_send_frame() Sending frame on frequency 2437 >>> >>> # And finally the ack comes in >>> src/offchannel.c:offchannel_roc_cb() cookie=3 >> >> Yeah, why is the cookie 3? Shouldn't it be 2? > > 2 is the work item, 3 is the cookie above. Yeah, I get that. But shouldn't the cookie from the kernel be 2 and not 3? Or is the cookie also being incremented by CMD_FRAME? Regards, -Denis
Hi Denis, On 10/19/23 2:42 PM, Denis Kenzior wrote: > Hi James, > >> >> It wouldn't hurt, but I think we were always under the assuming that >> the ack would come before the event so I never bothered using a >> callback :) >> > > My memory is fuzzy now, but I think brcmfmac was very weird in this > area. Which might explain why offchannel code is written the way it is. I think Andrew sorted much of that out in frame-xchg, and I think there are comments about similar behavior of acks arriving late. I can dig up a brcmfmac card and play around with it. > >>> >>>> src/offchannel.c:offchannel_mlme_notify() ROC cancel, cookie=1 >>>> >>>> # Cancel ROC is correctly waited for before starting the next item >>>> src/wiphy.c:wiphy_radio_work_done() Work item 1 done >>>> src/wiphy.c:wiphy_radio_work_next() Starting work item 2 >>>> src/offchannel.c:offchannel_work_ready() Issuing ROC >>>> src/netdev.c:netdev_mlme_notify() MLME notification Remain on >>>> Channel(55) >>> >>> This seems fishy? What else is going offchannel? >> >> No, this is the same event as below, just netdev printing it. > > Ah, ok. > >> >>> >>>> >>>> # Then immediately we get a Remain on Channel event >>>> src/offchannel.c:offchannel_mlme_notify() ROC notify, cookie=3 >>>> src/offchannel.c:offchannel_mlme_notify() ROC started prior to ACK, >>>> setting cookie 3 >>>> src/dpp.c:dpp_send_frame() Sending frame on frequency 2437 >>>> >>>> # And finally the ack comes in >>>> src/offchannel.c:offchannel_roc_cb() cookie=3 >>> >>> Yeah, why is the cookie 3? Shouldn't it be 2? >> >> 2 is the work item, 3 is the cookie above. > > Yeah, I get that. But shouldn't the cookie from the kernel be 2 and not > 3? Or is the cookie also being incremented by CMD_FRAME? Oh I see, since it jumped from 1 to 3. I would need to verify, but I'd guess yes the CMD_FRAME increments the same counter. > > Regards, > -Denis
Hi Denis, On 10/19/23 2:47 PM, James Prestwood wrote: > Hi Denis, > > On 10/19/23 2:42 PM, Denis Kenzior wrote: >> Hi James, >> >>> >>> It wouldn't hurt, but I think we were always under the assuming that >>> the ack would come before the event so I never bothered using a >>> callback :) >>> >> >> My memory is fuzzy now, but I think brcmfmac was very weird in this >> area. Which might explain why offchannel code is written the way it is. > > I think Andrew sorted much of that out in frame-xchg, and I think there > are comments about similar behavior of acks arriving late. I can dig up > a brcmfmac card and play around with it. > >> >>>> >>>>> src/offchannel.c:offchannel_mlme_notify() ROC cancel, cookie=1 >>>>> >>>>> # Cancel ROC is correctly waited for before starting the next item >>>>> src/wiphy.c:wiphy_radio_work_done() Work item 1 done >>>>> src/wiphy.c:wiphy_radio_work_next() Starting work item 2 >>>>> src/offchannel.c:offchannel_work_ready() Issuing ROC >>>>> src/netdev.c:netdev_mlme_notify() MLME notification Remain on >>>>> Channel(55) >>>> >>>> This seems fishy? What else is going offchannel? >>> >>> No, this is the same event as below, just netdev printing it. >> >> Ah, ok. >> >>> >>>> >>>>> >>>>> # Then immediately we get a Remain on Channel event >>>>> src/offchannel.c:offchannel_mlme_notify() ROC notify, cookie=3 >>>>> src/offchannel.c:offchannel_mlme_notify() ROC started prior to ACK, >>>>> setting cookie 3 >>>>> src/dpp.c:dpp_send_frame() Sending frame on frequency 2437 >>>>> >>>>> # And finally the ack comes in >>>>> src/offchannel.c:offchannel_roc_cb() cookie=3 >>>> >>>> Yeah, why is the cookie 3? Shouldn't it be 2? >>> >>> 2 is the work item, 3 is the cookie above. >> >> Yeah, I get that. But shouldn't the cookie from the kernel be 2 and >> not 3? Or is the cookie also being incremented by CMD_FRAME? > > Oh I see, since it jumped from 1 to 3. I would need to verify, but I'd > guess yes the CMD_FRAME increments the same counter. Maybe this is a scheduling thing. I see in the kernel (mac80211/offchannel.c) the next offchannel work gets started from within the cancel call (ieee80211_cancel_roc). list_for_each_entry_safe(roc, tmp, &local->roc_list, list) { if (!roc->started) break; if (roc == found) found = NULL; ieee80211_roc_notify_destroy(roc); } /* that really must not happen - it was started */ WARN_ON(found); ieee80211_start_next_roc(local); There is a &local->mtx lock at the beginning so I'd think that would prevent anything from being shoved into roc_list, but maybe not? If IWD is able to start another ROC before ieee80211_start_next_roc is called that would explain it, but it seems unlikely, maybe only with UML. Thanks, James > >> >> Regards, >> -Denis
diff --git a/src/offchannel.c b/src/offchannel.c index b9cdc117..6a705092 100644 --- a/src/offchannel.c +++ b/src/offchannel.c @@ -54,15 +54,34 @@ struct offchannel_info { bool needs_cancel : 1; }; +struct match_data { + uint64_t wdev_id; + uint64_t cookie; +}; + static struct l_genl_family *nl80211; static struct l_queue *offchannel_list; -static bool match_wdev(const void *a, const void *user_data) +static bool match_info(const void *a, const void *user_data) +{ + const struct match_data *match = user_data; + const struct offchannel_info *info = a; + + if (match->wdev_id != info->wdev_id) + return false; + + if (!match->cookie) + return true; + + return match->cookie == info->roc_cookie; +} + +static bool match_id(const void *a, const void *user_data) { + const uint32_t *id = user_data; const struct offchannel_info *info = a; - const uint64_t *wdev_id = user_data; - return info->wdev_id == *wdev_id; + return *id == info->work.id; } static void offchannel_cancel_roc(struct offchannel_info *info) @@ -191,7 +210,8 @@ void offchannel_cancel(uint64_t wdev_id, uint32_t id) else if (ret == false) goto work_done; - info = l_queue_find(offchannel_list, match_wdev, &wdev_id); + + info = l_queue_find(offchannel_list, match_id, &id); if (!info) return; @@ -246,6 +266,7 @@ work_done: static void offchannel_mlme_notify(struct l_genl_msg *msg, void *user_data) { struct offchannel_info *info; + struct match_data match = {0}; uint64_t wdev_id; uint64_t cookie; uint8_t cmd; @@ -261,12 +282,32 @@ static void offchannel_mlme_notify(struct l_genl_msg *msg, void *user_data) NL80211_ATTR_UNSPEC) < 0) return; - info = l_queue_find(offchannel_list, match_wdev, &wdev_id); + match.wdev_id = wdev_id; + match.cookie = cookie; + + info = l_queue_find(offchannel_list, match_info, &match); + if (!info) { + /* Try again without cookie */ + match.cookie = 0; + info = l_queue_find(offchannel_list, match_info, &match); + } + if (!info) return; - /* ROC must have been started elsewhere, not by IWD */ - if (info->roc_cookie != cookie) + /* + * If the cookie is zero this could be from the ACK callback coming out + * of order which can happen when there is an ongoing offchannel request + * that is canceled (but no ROC_CANCEL event yet) and another is issued + * on the same frequency. Some drivers will combine these two requests + * and the ACK ends up coming after the ROC started event. + * + * If the cookie is set but does not match, this ROC request came from + * outside IWD. + */ + if (!info->roc_cookie) { + info->roc_cookie = cookie; + } else if (info->roc_cookie != cookie) return; switch (cmd) {