[2/3] mac80211: Add support to trigger sta disconnect on hardware restart

Message ID 20201215172352.5311-1-youghand@codeaurora.org (mailing list archive)
State Awaiting Upstream
Delegated to: Netdev Maintainers
Series mac80211: Trigger disconnect for STA during recovery

Checks

Context                        Check    Description
netdev/cover_letter            success  Link
netdev/fixes_present           success  Link
netdev/patch_count             success  Link
netdev/tree_selection          success  Guessed tree name to be net-next
netdev/subject_prefix          success  Link
netdev/source_inline           success  Was 0 now: 0
netdev/verify_signedoff        success  Link
netdev/module_param            success  Was 0 now: 0
netdev/build_32bit             success  Errors and warnings before: 0 this patch: 0
netdev/kdoc                    success  Errors and warnings before: 8 this patch: 8
netdev/verify_fixes            success  Link
netdev/checkpatch              warning  CHECK: Please don't use multiple blank lines
netdev/build_allmodconfig_warn success  Errors and warnings before: 0 this patch: 0
netdev/header_inline           success  Link
netdev/stable                  success  Stable not CCed

Commit Message

Youghandhar Chintala Dec. 15, 2020, 5:23 p.m. UTC
Currently in case of target hardware restart, we just reconfig and
re-enable the security keys and enable the network queues to start
data traffic back from where it was interrupted.

Many ath10k wifi chipsets have sequence numbers for the data
packets assigned by firmware and the mac sequence number will
restart from zero after target hardware restart leading to mismatch
in the sequence number expected by the remote peer vs the sequence
number of the frame sent by the target firmware.

This mismatch in sequence number will cause out-of-order packets
on the remote peer and all the frames sent by the device are dropped
until we reach the sequence number which was sent before we restarted
the target hardware.

In order to fix this, we trigger a sta disconnect in case of target hw
restart, for the targets which expose the corresponding wiphy flag.
After this there will be a fresh connection, which avoids the dropping
of frames by the remote peer.

The right fix would be to pull the entire data path into the host
which is not feasible or would need lots of complex changes and
will still be inefficient.

Tested on ath10k using WCN3990, QCA6174

Signed-off-by: Youghandhar Chintala <youghand@codeaurora.org>
---
 net/mac80211/ieee80211_i.h |  3 +++
 net/mac80211/mlme.c        |  9 +++++++++
 net/mac80211/util.c        | 22 +++++++++++++++++++---
 3 files changed, 31 insertions(+), 3 deletions(-)

Comments

Felix Fietkau Dec. 15, 2020, 5:40 p.m. UTC | #1
On 2020-12-15 18:23, Youghandhar Chintala wrote:
> Currently in case of target hardware restart, we just reconfig and
> re-enable the security keys and enable the network queues to start
> data traffic back from where it was interrupted.
> 
> Many ath10k wifi chipsets have sequence numbers for the data
> packets assigned by firmware and the mac sequence number will
> restart from zero after target hardware restart leading to mismatch
> in the sequence number expected by the remote peer vs the sequence
> number of the frame sent by the target firmware.
> 
> This mismatch in sequence number will cause out-of-order packets
> on the remote peer and all the frames sent by the device are dropped
> until we reach the sequence number which was sent before we restarted
> the target hardware
> 
> In order to fix this, we trigger a sta disconnect, for the targets
> which expose this corresponding wiphy flag, in case of target hw
> restart. After this there will be a fresh connection and thereby
> avoiding the dropping of frames by remote peer.
> 
> The right fix would be to pull the entire data path into the host
> which is not feasible or would need lots of complex changes and
> will still be inefficient.
How about simply tracking which tids have aggregation enabled and send
DELBA frames for those after the restart?
It would mean less disruption for affected stations and less ugly hacks
in the stack for unreliable hardware.

- Felix
Youghandhar Chintala Jan. 28, 2021, 8:08 a.m. UTC | #2
On 2020-12-15 23:10, Felix Fietkau wrote:
> On 2020-12-15 18:23, Youghandhar Chintala wrote:
>> Currently in case of target hardware restart, we just reconfig and
>> re-enable the security keys and enable the network queues to start
>> data traffic back from where it was interrupted.
>> 
>> Many ath10k wifi chipsets have sequence numbers for the data
>> packets assigned by firmware and the mac sequence number will
>> restart from zero after target hardware restart leading to mismatch
>> in the sequence number expected by the remote peer vs the sequence
>> number of the frame sent by the target firmware.
>> 
>> This mismatch in sequence number will cause out-of-order packets
>> on the remote peer and all the frames sent by the device are dropped
>> until we reach the sequence number which was sent before we restarted
>> the target hardware
>> 
>> In order to fix this, we trigger a sta disconnect, for the targets
>> which expose this corresponding wiphy flag, in case of target hw
>> restart. After this there will be a fresh connection and thereby
>> avoiding the dropping of frames by remote peer.
>> 
>> The right fix would be to pull the entire data path into the host
>> which is not feasible or would need lots of complex changes and
>> will still be inefficient.
> How about simply tracking which tids have aggregation enabled and send
> DELBA frames for those after the restart?
> It would mean less disruption for affected stations and less ugly hacks
> in the stack for unreliable hardware.
> 
> - Felix

Hi Felix,

We did try to send an ADDBA frame to the AP once the SSR happened. The
AP acked the frame, and a new BA session with a renewed sequence number
was established. Even so, the AP did not respond to the ping requests
carrying the new sequence numbers until one of two things happened:
1. The sequence number exceeded the sequence number that the DUT had
   used before the SSR happened.
2. The DUT disconnected and then reconnected.
The other option is to send a DELBA frame to the AP and force the AP to
re-establish the BA session from its side. We feel this can cause
interoperability issues, as some APs may not honour the DELBA frame
and will continue to use the BA session they had established earlier.
Given that renegotiating the BA session is prone to IOT issues, we feel
it would be better to go with the disconnect/reconnect solution, which
is foolproof and will work in all scenarios.

Regards,
Youghandhar
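
The behaviour described above (frames ignored until the sequence number
passes the pre-restart value) can be illustrated with a small receiver
model. This is only a sketch under stated assumptions: it takes the
12-bit 802.11 sequence number space and the usual "more than half the
space behind the window start means old" rule, and it ignores buffering
and window size entirely. The names rx_reorder, rx_accept and
frames_dropped_after_restart are hypothetical and are not mac80211 or
ath10k symbols.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SN_MODULO 4096u	/* 802.11 sequence numbers are 12 bits wide */
#define SN_HALF   2048u	/* frames this far "ahead" are treated as old */

/* Hypothetical, heavily simplified receiver state for one TID. */
struct rx_reorder {
	uint16_t win_start;	/* next sequence number the receiver expects */
};

/* Returns true if the frame is accepted, false if it is dropped as old. */
static bool rx_accept(struct rx_reorder *rx, uint16_t sn)
{
	uint16_t ahead;

	assert(sn < SN_MODULO);
	/* Distance from the window start, modulo the sequence space. */
	ahead = (uint16_t)((sn + SN_MODULO - rx->win_start) % SN_MODULO);
	if (ahead >= SN_HALF)
		return false;	/* behind the window: dropped */

	/* Simplification: the window start jumps past any accepted frame. */
	rx->win_start = (uint16_t)((sn + 1u) % SN_MODULO);
	return true;
}

/*
 * Counts how many frames the receiver drops when the sender restarts
 * its sequence numbers at 0 after having last sent last_sn.
 */
static unsigned int frames_dropped_after_restart(uint16_t last_sn)
{
	struct rx_reorder rx;
	unsigned int dropped = 0;
	uint16_t sn = 0;

	assert(last_sn < SN_MODULO);
	rx.win_start = (uint16_t)((last_sn + 1u) % SN_MODULO);

	while (!rx_accept(&rx, sn)) {
		dropped++;
		sn = (uint16_t)((sn + 1u) % SN_MODULO);
	}
	return dropped;
}
```

In this model a sender that last used sequence number 38 loses its
next 39 frames after restarting at 0, while a sender that was already
past the halfway point of the sequence space loses none, because 0 then
looks like a frame ahead of the window rather than behind it.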
Abhishek Kumar Feb. 5, 2021, 9:51 p.m. UTC | #3
Re-establishing the BA session by sending a DELBA frame depends on the
AP, and some APs may not honour the DELBA frame, so I am fine with the
disconnect/reconnect solution. The change looks good to me.

Reviewed-by: Abhishek Kumar <kuabhs@chromium.org>

Thanks
Abhishek

On Thu, Jan 28, 2021 at 12:08 AM <youghand@codeaurora.org> wrote:
>
> On 2020-12-15 23:10, Felix Fietkau wrote:
> > On 2020-12-15 18:23, Youghandhar Chintala wrote:
> >> Currently in case of target hardware restart, we just reconfig and
> >> re-enable the security keys and enable the network queues to start
> >> data traffic back from where it was interrupted.
> >>
> >> Many ath10k wifi chipsets have sequence numbers for the data
> >> packets assigned by firmware and the mac sequence number will
> >> restart from zero after target hardware restart leading to mismatch
> >> in the sequence number expected by the remote peer vs the sequence
> >> number of the frame sent by the target firmware.
> >>
> >> This mismatch in sequence number will cause out-of-order packets
> >> on the remote peer and all the frames sent by the device are dropped
> >> until we reach the sequence number which was sent before we restarted
> >> the target hardware
> >>
> >> In order to fix this, we trigger a sta disconnect, for the targets
> >> which expose this corresponding wiphy flag, in case of target hw
> >> restart. After this there will be a fresh connection and thereby
> >> avoiding the dropping of frames by remote peer.
> >>
> >> The right fix would be to pull the entire data path into the host
> >> which is not feasible or would need lots of complex changes and
> >> will still be inefficient.
> > How about simply tracking which tids have aggregation enabled and send
> > DELBA frames for those after the restart?
> > It would mean less disruption for affected stations and less ugly hacks
> > in the stack for unreliable hardware.
> >
> > - Felix
>
> Hi Felix,
>
> We did try to send an ADDBA frame to the AP once the SSR happened. The
> AP ack’ed the frame and the new BA session with renewed sequence number
> was established. But still, the AP did not respond to the ping requests
> with the new sequence number. It did not respond until one of the two
> happened.
> 1.      The sequence number was more than the sequence number that DUT had
> used before SSR happened
> 2.      DUT disconnected and then reconnected.
> The other option is to send a DELBA frame to the AP and make the AP also
> force to establish the BA session from its side. This we feel can have
> some interoperability issues as some of the AP’s may not honour the
> DELBA frame and will continue to use the earlier BA session that it had
> established. Given that re-negotiating the BA session is prone to IOT
> issues, we feel that it would be good to go with the
> Disconnect/Reconnect solution which is foolproof and will work in all
> scenarios.
>
> Regards,
> Youghandhar
Guenter Roeck Feb. 8, 2021, 3:43 p.m. UTC | #4
On Tue, Dec 15, 2020 at 10:53:52PM +0530, Youghandhar Chintala wrote:
> Currently in case of target hardware restart, we just reconfig and
> re-enable the security keys and enable the network queues to start
> data traffic back from where it was interrupted.
> 
> Many ath10k wifi chipsets have sequence numbers for the data
> packets assigned by firmware and the mac sequence number will
> restart from zero after target hardware restart leading to mismatch
> in the sequence number expected by the remote peer vs the sequence
> number of the frame sent by the target firmware.
> 
> This mismatch in sequence number will cause out-of-order packets
> on the remote peer and all the frames sent by the device are dropped
> until we reach the sequence number which was sent before we restarted
> the target hardware
> 
> In order to fix this, we trigger a sta disconnect, for the targets
> which expose this corresponding wiphy flag, in case of target hw
> restart. After this there will be a fresh connection and thereby
> avoiding the dropping of frames by remote peer.
> 
> The right fix would be to pull the entire data path into the host
> which is not feasible or would need lots of complex changes and
> will still be inefficient.
> 
> Tested on ath10k using WCN3990, QCA6174
> 
> Signed-off-by: Youghandhar Chintala <youghand@codeaurora.org>
> Reviewed-by: Abhishek Kumar <kuabhs@chromium.org>
> ---
>  net/mac80211/ieee80211_i.h |  3 +++
>  net/mac80211/mlme.c        |  9 +++++++++
>  net/mac80211/util.c        | 22 +++++++++++++++++++---
>  3 files changed, 31 insertions(+), 3 deletions(-)
> 
> diff --git a/net/mac80211/ieee80211_i.h b/net/mac80211/ieee80211_i.h
> index cde2e3f..8cbeb5f 100644
> --- a/net/mac80211/ieee80211_i.h
> +++ b/net/mac80211/ieee80211_i.h
> @@ -748,6 +748,8 @@ struct ieee80211_if_mesh {
>   *	back to wireless media and to the local net stack.
>   * @IEEE80211_SDATA_DISCONNECT_RESUME: Disconnect after resume.
>   * @IEEE80211_SDATA_IN_DRIVER: indicates interface was added to driver
> + * @IEEE80211_SDATA_DISCONNECT_HW_RESTART: Disconnect after hardware restart
> + *	recovery
>   */
>  enum ieee80211_sub_if_data_flags {
>  	IEEE80211_SDATA_ALLMULTI		= BIT(0),
> @@ -755,6 +757,7 @@ enum ieee80211_sub_if_data_flags {
>  	IEEE80211_SDATA_DONT_BRIDGE_PACKETS	= BIT(3),
>  	IEEE80211_SDATA_DISCONNECT_RESUME	= BIT(4),
>  	IEEE80211_SDATA_IN_DRIVER		= BIT(5),
> +	IEEE80211_SDATA_DISCONNECT_HW_RESTART	= BIT(6),
>  };
>  
>  /**
> diff --git a/net/mac80211/mlme.c b/net/mac80211/mlme.c
> index 6adfcb9..e4d0d16 100644
> --- a/net/mac80211/mlme.c
> +++ b/net/mac80211/mlme.c
> @@ -4769,6 +4769,15 @@ void ieee80211_sta_restart(struct ieee80211_sub_if_data *sdata)
>  					      true);
>  		sdata_unlock(sdata);
>  		return;
> +	} else if (sdata->flags & IEEE80211_SDATA_DISCONNECT_HW_RESTART) {
> +		sdata->flags &= ~IEEE80211_SDATA_DISCONNECT_HW_RESTART;
> +		mlme_dbg(sdata, "driver requested disconnect after hardware restart\n");
> +		ieee80211_sta_connection_lost(sdata,
> +					      ifmgd->associated->bssid,
> +					      WLAN_REASON_UNSPECIFIED,
> +					      true);
> +		sdata_unlock(sdata);
> +		return;
>  	}
>  	sdata_unlock(sdata);
>  }
> diff --git a/net/mac80211/util.c b/net/mac80211/util.c
> index 8c3c01a..98567a3 100644
> --- a/net/mac80211/util.c
> +++ b/net/mac80211/util.c
> @@ -2567,9 +2567,12 @@ int ieee80211_reconfig(struct ieee80211_local *local)
>  	}
>  	mutex_unlock(&local->sta_mtx);
>  
> -	/* add back keys */
> -	list_for_each_entry(sdata, &local->interfaces, list)
> -		ieee80211_reenable_keys(sdata);
> +
> +	if (!(hw->wiphy->flags & WIPHY_FLAG_STA_DISCONNECT_ON_HW_RESTART)) {
> +		/* add back keys */
> +		list_for_each_entry(sdata, &local->interfaces, list)
> +			ieee80211_reenable_keys(sdata);
> +	}
>  
>  	/* Reconfigure sched scan if it was interrupted by FW restart */
>  	mutex_lock(&local->mtx);
> @@ -2643,6 +2646,19 @@ int ieee80211_reconfig(struct ieee80211_local *local)
>  					IEEE80211_QUEUE_STOP_REASON_SUSPEND,
>  					false);
>  
> +	if ((hw->wiphy->flags & WIPHY_FLAG_STA_DISCONNECT_ON_HW_RESTART) &&
> +	    !reconfig_due_to_wowlan) {
> +		list_for_each_entry(sdata, &local->interfaces, list) {
> +			if (!ieee80211_sdata_running(sdata))
> +				continue;
> +			if (sdata->vif.type == NL80211_IFTYPE_STATION) {
> +				sdata->flags |=
> +					IEEE80211_SDATA_DISCONNECT_HW_RESTART;
> +				ieee80211_sta_restart(sdata);

If CONFIG_PM=n:

ERROR: "ieee80211_sta_restart" [net/mac80211/mac80211.ko] undefined!

Guenter

> +			}
> +		}
> +	}
> +
>  	/*
>  	 * If this is for hw restart things are still running.
>  	 * We may want to change that later, however.
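
The link error above suggests that ieee80211_sta_restart() is only
built when CONFIG_PM is set, while the new caller in ieee80211_reconfig()
is built unconditionally. That is an inference from the error message,
not something stated in this thread. A minimal sketch of the general
pattern and one way out, using hypothetical names (sdata_model,
model_quiesce, model_sta_restart, model_reconfig) rather than real
mac80211 symbols:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-interface state. */
struct sdata_model {
	bool disconnect_requested;
	bool connected;
};

#ifdef CONFIG_PM
/* Helpers that are only meaningful for suspend/resume stay in here. */
static void model_quiesce(struct sdata_model *sdata)
{
	assert(sdata != NULL);
	sdata->connected = false;
}
#endif

/*
 * Defined outside the CONFIG_PM block, so the hardware-restart path
 * below still links when CONFIG_PM is not set.
 */
static void model_sta_restart(struct sdata_model *sdata)
{
	assert(sdata != NULL);
	if (sdata->disconnect_requested) {
		sdata->disconnect_requested = false;
		sdata->connected = false;
	}
}

/* Unconditional caller, analogous to the reconfig path in the patch. */
static void model_reconfig(struct sdata_model *sdata)
{
	assert(sdata != NULL);
	sdata->disconnect_requested = true;
	model_sta_restart(sdata);
}
```

The alternative would be to guard the call site with the same #ifdef,
but that would silently disable the disconnect on CONFIG_PM=n builds.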
Johannes Berg Feb. 12, 2021, 8:37 a.m. UTC | #5
On Fri, 2021-02-05 at 13:51 -0800, Abhishek Kumar wrote:
> Since using DELBA frame to APs to re-establish BA session has a
> dependency on APs and also some APs may not honour the DELBA frame.


That's completely out of spec ... Can you say which AP this was?

You could also try sending a BAR that updates the SN.

johannes
Johannes Berg Feb. 12, 2021, 8:42 a.m. UTC | #6
On Tue, 2020-12-15 at 22:53 +0530, Youghandhar Chintala wrote:
> The right fix would be to pull the entire data path into the host

> +++ b/net/mac80211/ieee80211_i.h
> @@ -748,6 +748,8 @@ struct ieee80211_if_mesh {
>   *	back to wireless media and to the local net stack.
>   * @IEEE80211_SDATA_DISCONNECT_RESUME: Disconnect after resume.
>   * @IEEE80211_SDATA_IN_DRIVER: indicates interface was added to driver
> + * @IEEE80211_SDATA_DISCONNECT_HW_RESTART: Disconnect after hardware restart
> + *	recovery

How did you model this on IEEE80211_SDATA_DISCONNECT_RESUME, but then
didn't check how that's actually used?

Please change it so that the two models are the same. You really don't
need the wiphy flag.

johannes
Johannes Berg Feb. 12, 2021, 8:44 a.m. UTC | #7
On Fri, 2021-02-12 at 09:42 +0100, Johannes Berg wrote:
> On Tue, 2020-12-15 at 22:53 +0530, Youghandhar Chintala wrote:
> > The right fix would be to pull the entire data path into the host
> > +++ b/net/mac80211/ieee80211_i.h
> > @@ -748,6 +748,8 @@ struct ieee80211_if_mesh {
> >   *	back to wireless media and to the local net stack.
> >   * @IEEE80211_SDATA_DISCONNECT_RESUME: Disconnect after resume.
> >   * @IEEE80211_SDATA_IN_DRIVER: indicates interface was added to driver
> > + * @IEEE80211_SDATA_DISCONNECT_HW_RESTART: Disconnect after hardware restart
> > + *	recovery
> 
> How did you model this on IEEE80211_SDATA_DISCONNECT_RESUME, but then
> didn't check how that's actually used?
> 
> Please change it so that the two models are the same. You really don't
> need the wiphy flag.

In fact, you could even simply
generalize IEEE80211_SDATA_DISCONNECT_RESUME
and ieee80211_resume_disconnect() to _reconfig_ instead of _resume_, and
call it from the driver just before requesting HW restart.

johannes
Youghandhar Chintala Sept. 24, 2021, 7:37 a.m. UTC | #8
Hi Johannes and Felix,

We ran the DELBA experiment after SSR. The DUT packet sequence number
and TX PN reset to 0 as expected, but the AP (Netgear R8000) does not
honor the TX PN from the DUT.
When we ran the same DELBA experiment with a Linux Android device as
the SAP and the DUT as the STA, we did not see any issue: ping resumed
after SSR without a disconnect.

The logs collected during my test are below for reference.

192.168.0.15(AtherosC_12:af:af)  ===> DUT IP and MAC
192.168.0.55(Netgear_d2:93:3d)   ===> AP IP and MAC

No.  Time       Source        Destination   Protocol  Channel  Sequence number  Protected flag     Block Ack SSC  CCMP Ext. IV    Action code  TID  Info
474  22.186433  192.168.0.15  192.168.0.55  ICMP      44       37               Data is protected  -              0x000000000026  -            0    Echo (ping) request  id=0x0d00, seq=256/1, ttl=64 (reply in 480)
480  22.188371  192.168.0.55  192.168.0.15  ICMP      44       5                Data is protected  -              0x000000000011  -            6    Echo (ping) reply    id=0x0d00, seq=256/1, ttl=64 (request in 474)
483  22.246335  192.168.0.15  192.168.0.55  ICMP      44       38               Data is protected  -              0x000000000027  -            0    Echo (ping) request  id=0x1258, seq=11/2816, ttl=64 (reply in 489)
489  22.248127  192.168.0.55  192.168.0.15  ICMP      44       13               Data is protected  -              0x000000000012  -            0    Echo (ping) reply    id=0x1258, seq=11/2816, ttl=64 (request in 483)


The pings above (with TID 0) were captured before SSR. As soon as the
DUT recovers after SSR, it sends DELBAs to the AP.

No.  Time       Source             Destination       Protocol  Channel  Sequence number  Protected flag         Block Ack SSC  CCMP Ext. IV  Action code       TID  Info
546  26.129127  AtherosC_12:af:af  Netgear_d2:93:3d  802.11    44       4                Data is not protected  -              -             Delete Block Ack  0x0  Action, SN=4, FN=0, Flags=........C
548  26.129977  AtherosC_12:af:af  Netgear_d2:93:3d  802.11    44       5                Data is not protected  -              -             Delete Block Ack  0x6  Action, SN=5, FN=0, Flags=........C


After SSR, we started ping traffic with TID 7 and TID 0. The ping
succeeds for TID 7 and fails for TID 0.
For TID 0, the TX PN of the ping requests is reset to 0, but the AP does
not appear to reset its PN, hence the ping failure for TID 0.
The TID 7 ping succeeds because we started it after SSR.


No.  Time       Source        Destination   Protocol  Channel  Sequence number  Protected flag     Block Ack SSC  CCMP Ext. IV    Action code  TID  Info
557  26.355256  192.168.0.15  192.168.0.55  ICMP      44       0                Data is protected  -              0x000000000001  -            0    Echo (ping) request  id=0x1258, seq=15/3840, ttl=64 (no response found!)
571  27.376895  192.168.0.15  192.168.0.55  ICMP      44       1                Data is protected  -              0x000000000002  -            0    Echo (ping) request  id=0x1258, seq=16/4096, ttl=64 (no response found!)
588  28.400946  192.168.0.15  192.168.0.55  ICMP      44       2                Data is protected  -              0x000000000003  -            0    Echo (ping) request  id=0x1258, seq=17/4352, ttl=64 (no response found!)
600  29.424881  192.168.0.15  192.168.0.55  ICMP      44       3                Data is protected  -              0x000000000004  -            0    Echo (ping) request  id=0x1258, seq=18/4608, ttl=64 (no response found!)


The ping packets below are with TID 7:

No.  Time       Source        Destination   Protocol  Channel  Sequence number  Protected flag     Block Ack SSC  CCMP Ext. IV    Action code  TID  Info
622  30.898249  192.168.0.15  192.168.0.55  ICMP      44       0                Data is protected  -              0x000000000006  -            7    Echo (ping) request  id=0x1276, seq=1/256, ttl=64 (reply in 626)
626  30.900015  192.168.0.55  192.168.0.15  ICMP      44       0                Data is protected  -              0x000000000013  -            7    Echo (ping) reply    id=0x1276, seq=1/256, ttl=64 (request in 622)
644  31.897456  192.168.0.15  192.168.0.55  ICMP      44       1                Data is protected  -              0x000000000008  -            7    Echo (ping) request  id=0x1276, seq=2/512, ttl=64 (reply in 648)
648  31.899266  192.168.0.55  192.168.0.15  ICMP      44       1                Data is protected  -              0x000000000014  -            7    Echo (ping) reply    id=0x1276, seq=2/512, ttl=64 (request in 644)

Regards,
Youghandhar


On 2021-02-12 14:07, Johannes Berg wrote:
> On Fri, 2021-02-05 at 13:51 -0800, Abhishek Kumar wrote:
>> Since using DELBA frame to APs to re-establish BA session has a
>> dependency on APs and also some APs may not honor the DELBA frame.
> 
> 
> That's completely out of spec ... Can you say which AP this was?
> 
> You could also try sending a BAR that updates the SN.
> 
> johannes

Regards,
Youghandhar
Johannes Berg Sept. 24, 2021, 7:39 a.m. UTC | #9
On Fri, 2021-09-24 at 13:07 +0530, Youghandhar Chintala wrote:
> Hi Johannes and felix,
> 
> We have tested with DELBA experiment during post SSR, DUT packet seq 
> number and tx pn is resetting to 0 as expected but AP(Netgear R8000) is 
> not honoring the tx pn from DUT.
> Whereas when we tested with DELBA experiment by making Linux android 
> device as SAP and DUT as STA with which we don’t see any issue. Ping got 
> resumed post SSR without disconnect.

Hm. That's a lot of data, and not a lot of explanation :)

I don't understand how DelBA and PN are related?

johannes
Youghandhar Chintala Sept. 24, 2021, 9:13 a.m. UTC | #10
Hi Johannes,

We initially thought that sending the DELBA would solve the problem,
but the actual problem is with the TX PN in secure mode.
The sequence number and TX PN are not reset to zero because of the
DELBA; they are reset because of the HW restart.
Since the FW/HW decides the TX PN, all of these parameters are reset
when it goes through SSR.
The other peer, say an AP, does not know anything about the SSR on the
peer device. It expects the next TX PN to be the current PN + 1.
Since the TX PN starts from zero after SSR, the PN check at the AP
fails and it silently drops all the packets.

Regards,
Youghandhar
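
The PN check described above can be sketched as a replay filter on the
receiving side. This is a simplified model, not AP or mac80211 code: it
assumes the receiver keeps one replay counter per TID and accepts a
frame only if its PN is strictly greater than the last accepted PN for
that TID, which would be consistent with the capture earlier in the
thread where TID 7 recovers but TID 0 does not. The names rx_key_model
and rx_pn_ok are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_TIDS 16

/* Hypothetical receiver-side key state: one replay counter per TID. */
struct rx_key_model {
	uint64_t last_pn[NUM_TIDS];
};

/*
 * Returns true if the frame passes the replay check and updates the
 * counter; returns false if the frame would be silently dropped.
 */
static bool rx_pn_ok(struct rx_key_model *key, unsigned int tid, uint64_t pn)
{
	assert(key != NULL && tid < NUM_TIDS);

	if (pn <= key->last_pn[tid])
		return false;	/* PN not strictly increasing: replay */

	key->last_pn[tid] = pn;
	return true;
}
```

With the values from the capture, a receiver that last accepted PN 0x27
on TID 0 rejects PNs 1 to 4 on TID 0 after the restart, yet accepts PN
0x6 on TID 7, because the TID 7 counter in this model never saw a
pre-restart PN.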

On 2021-09-24 13:09, Johannes Berg wrote:
> On Fri, 2021-09-24 at 13:07 +0530, Youghandhar Chintala wrote:
>> Hi Johannes and felix,
>> 
>> We have tested with DELBA experiment during post SSR, DUT packet seq
>> number and tx pn is resetting to 0 as expected but AP(Netgear R8000) 
>> is
>> not honoring the tx pn from DUT.
>> Whereas when we tested with DELBA experiment by making Linux android
>> device as SAP and DUT as STA with which we don’t see any issue. Ping 
>> got
>> resumed post SSR without disconnect.
> 
> Hm. That's a lot of data, and not a lot of explanation :)
> 
> I don't understand how DelBA and PN are related?
> 
> johannes

Regards,
Youghandhar
Johannes Berg Sept. 24, 2021, 9:20 a.m. UTC | #11
Hi,


> We thought sending the delba would solve the problem as earlier thought 
> but the actual problem is with TX PN in a secure mode.
> It is not because of delba that the Seq number and TX PN are reset to 
> zero.
> It’s because of the HW restart, these parameters are reset to zero.
> Since FW/HW is the one which decides the TX PN, when it goes through 
> SSR, all these parameters are reset.

Right, we solved this problem too - in a sense the driver reads the
database (not just TX PN btw, also RX replay counters) when the firmware
crashes, and sends it back after the restart. mac80211 has some hooks
for that.

johannes
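
A rough model of that save/restore approach is sketched below. It is
only an illustration: it does not use the mac80211 hooks mentioned
above, and every name in it (fw_key_state, host_key_backup,
backup_on_crash, fw_restart, restore_after_restart, fw_next_tx_pn) is
hypothetical. It assumes the counters are still readable at crash time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_TIDS 16

/* Hypothetical per-key counters owned by the firmware. */
struct fw_key_state {
	uint64_t tx_pn;			/* last PN used on transmit */
	uint64_t rx_pn[NUM_TIDS];	/* RX replay counters */
};

/* Hypothetical host-side snapshot taken when a crash is detected. */
struct host_key_backup {
	struct fw_key_state saved;
	bool valid;
};

static void backup_on_crash(struct host_key_backup *b,
			    const struct fw_key_state *fw)
{
	assert(b != NULL && fw != NULL);
	b->saved = *fw;
	b->valid = true;
}

/* A restart wipes all counters, as described in the thread. */
static void fw_restart(struct fw_key_state *fw)
{
	assert(fw != NULL);
	memset(fw, 0, sizeof(*fw));
}

/* Returns 0 on success, -1 if no usable snapshot exists. */
static int restore_after_restart(const struct host_key_backup *b,
				 struct fw_key_state *fw)
{
	assert(b != NULL && fw != NULL);
	if (!b->valid)
		return -1;
	*fw = b->saved;
	return 0;
}

/* The next transmitted frame uses the counter value plus one. */
static uint64_t fw_next_tx_pn(struct fw_key_state *fw)
{
	assert(fw != NULL);
	return ++fw->tx_pn;
}
```

If the snapshot cannot be taken, restore_after_restart() fails in this
model and the counters stay at zero, which is the situation the
disconnect in this patch is meant to handle.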
Jouni Malinen Oct. 5, 2021, 8:20 p.m. UTC | #12
On Fri, Sep 24, 2021 at 11:20:50AM +0200, Johannes Berg wrote:
> > We thought sending the delba would solve the problem as earlier thought 
> > but the actual problem is with TX PN in a secure mode.
> > It is not because of delba that the Seq number and TX PN are reset to 
> > zero.
> > It’s because of the HW restart, these parameters are reset to zero.
> > Since FW/HW is the one which decides the TX PN, when it goes through 
> > SSR, all these parameters are reset.
> 
> Right, we solved this problem too - in a sense the driver reads the
> database (not just TX PN btw, also RX replay counters) when the firmware
> crashes, and sending it back after the restart. mac80211 has some hooks
> for that.

This might be doable for some cases where the firmware is the component
assigning the PN values on TX and the firmware still being in a state
where the counter used for this could be fetched after a crash or
detected misbehavior. However, this does not sound like a very reliable
mechanism for cases where the firmware state for this cannot be trusted
or for the cases where the TX PN is actually assigned by the hardware
(which would get cleared on that restart and the value might be
unreadable before that restart). Trying to pull for this information
periodically before the issue is detected does not sound like a very
robust design either, since that would both waste resources and have a
race condition with the lower layers having transmitted additional
frames.

Obviously it would be nice to be able to restore this type of state in
all cases accurately, but that may not really be a viable approach for
all designs and it would seem to make sense to provide an alternative
approach to minimize the user visible impact from the rare cases of
having to restart some low level components during an association.

Patch

diff --git a/net/mac80211/ieee80211_i.h b/net/mac80211/ieee80211_i.h
index cde2e3f..8cbeb5f 100644
--- a/net/mac80211/ieee80211_i.h
+++ b/net/mac80211/ieee80211_i.h
@@ -748,6 +748,8 @@  struct ieee80211_if_mesh {
  *	back to wireless media and to the local net stack.
  * @IEEE80211_SDATA_DISCONNECT_RESUME: Disconnect after resume.
  * @IEEE80211_SDATA_IN_DRIVER: indicates interface was added to driver
+ * @IEEE80211_SDATA_DISCONNECT_HW_RESTART: Disconnect after hardware restart
+ *	recovery
  */
 enum ieee80211_sub_if_data_flags {
 	IEEE80211_SDATA_ALLMULTI		= BIT(0),
@@ -755,6 +757,7 @@  enum ieee80211_sub_if_data_flags {
 	IEEE80211_SDATA_DONT_BRIDGE_PACKETS	= BIT(3),
 	IEEE80211_SDATA_DISCONNECT_RESUME	= BIT(4),
 	IEEE80211_SDATA_IN_DRIVER		= BIT(5),
+	IEEE80211_SDATA_DISCONNECT_HW_RESTART	= BIT(6),
 };
 
 /**
diff --git a/net/mac80211/mlme.c b/net/mac80211/mlme.c
index 6adfcb9..e4d0d16 100644
--- a/net/mac80211/mlme.c
+++ b/net/mac80211/mlme.c
@@ -4769,6 +4769,15 @@  void ieee80211_sta_restart(struct ieee80211_sub_if_data *sdata)
 					      true);
 		sdata_unlock(sdata);
 		return;
+	} else if (sdata->flags & IEEE80211_SDATA_DISCONNECT_HW_RESTART) {
+		sdata->flags &= ~IEEE80211_SDATA_DISCONNECT_HW_RESTART;
+		mlme_dbg(sdata, "driver requested disconnect after hardware restart\n");
+		ieee80211_sta_connection_lost(sdata,
+					      ifmgd->associated->bssid,
+					      WLAN_REASON_UNSPECIFIED,
+					      true);
+		sdata_unlock(sdata);
+		return;
 	}
 	sdata_unlock(sdata);
 }
diff --git a/net/mac80211/util.c b/net/mac80211/util.c
index 8c3c01a..98567a3 100644
--- a/net/mac80211/util.c
+++ b/net/mac80211/util.c
@@ -2567,9 +2567,12 @@  int ieee80211_reconfig(struct ieee80211_local *local)
 	}
 	mutex_unlock(&local->sta_mtx);
 
-	/* add back keys */
-	list_for_each_entry(sdata, &local->interfaces, list)
-		ieee80211_reenable_keys(sdata);
+
+	if (!(hw->wiphy->flags & WIPHY_FLAG_STA_DISCONNECT_ON_HW_RESTART)) {
+		/* add back keys */
+		list_for_each_entry(sdata, &local->interfaces, list)
+			ieee80211_reenable_keys(sdata);
+	}
 
 	/* Reconfigure sched scan if it was interrupted by FW restart */
 	mutex_lock(&local->mtx);
@@ -2643,6 +2646,19 @@  int ieee80211_reconfig(struct ieee80211_local *local)
 					IEEE80211_QUEUE_STOP_REASON_SUSPEND,
 					false);
 
+	if ((hw->wiphy->flags & WIPHY_FLAG_STA_DISCONNECT_ON_HW_RESTART) &&
+	    !reconfig_due_to_wowlan) {
+		list_for_each_entry(sdata, &local->interfaces, list) {
+			if (!ieee80211_sdata_running(sdata))
+				continue;
+			if (sdata->vif.type == NL80211_IFTYPE_STATION) {
+				sdata->flags |=
+					IEEE80211_SDATA_DISCONNECT_HW_RESTART;
+				ieee80211_sta_restart(sdata);
+			}
+		}
+	}
+
 	/*
 	 * If this is for hw restart things are still running.
 	 * We may want to change that later, however.