diff mbox

mac80211: Fix wlan freezes under load at rekey

Message ID 20180324102921.9814-1-alexander.wetzel@web.de (mailing list archive)
State Changes Requested
Delegated to: Johannes Berg
Headers show

Commit Message

Alexander Wetzel March 24, 2018, 10:29 a.m. UTC
Rekeying a pairwise key with encryption offload and only keyid 0 has two
potential races which can freeze the wlan conection till rekeyed again:

 1) For incomming packets:
    If the local STA installs the key prior to the remote STA we still
    have the old key active in the hardware for a short time after
    mac80211 switched to the new key.
    The card can still hand over packets decoded with the old key to
    mac80211, bumping the new PN (IV) value to an incorrect high number and
    tricking the local replay detection to drop all packets really sent
    with the new key.

 2) For outgoing packets:
    If mac80211 is providing the PN (IV) and hands over the cleartext
    packets for encryption to the hardware immediately prior to a key
    change the driver/card may process the queued packets after
    switching to the new key.
    This will immediatelly bump the PN (IV) value on the remote STA to
    an incorrect high number, also freezing the connection.

Both issues can be prevented by deleting the key from the hardware prior
to switching to the new key in mac80211, falling back to software
encryption/decryption till the switch to the new key is completed.

Signed-off-by: Alexander Wetzel <alexander.wetzel@web.de>
---
 net/mac80211/key.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

Comments

Ben Greear March 24, 2018, 3:29 p.m. UTC | #1
On 03/24/2018 03:29 AM, Alexander Wetzel wrote:
> Rekeying a pairwise key with encryption offload and only keyid 0 has two
> potential races which can freeze the wlan conection till rekeyed again:
>
>  1) For incomming packets:
>     If the local STA installs the key prior to the remote STA we still
>     have the old key active in the hardware for a short time after
>     mac80211 switched to the new key.
>     The card can still hand over packets decoded with the old key to
>     mac80211, bumping the new PN (IV) value to an incorrect high number and
>     tricking the local replay detection to drop all packets really sent
>     with the new key.
>
>  2) For outgoing packets:
>     If mac80211 is providing the PN (IV) and hands over the cleartext
>     packets for encryption to the hardware immediately prior to a key
>     change the driver/card may process the queued packets after
>     switching to the new key.
>     This will immediatelly bump the PN (IV) value on the remote STA to
>     an incorrect high number, also freezing the connection.
>
> Both issues can be prevented by deleting the key from the hardware prior
> to switching to the new key in mac80211, falling back to software
> encryption/decryption till the switch to the new key is completed.

What will happen to drivers like ath10k that cannot do software encrypt/decrypt?

ath10k can support multiple key-ids as far as I can tell,
so maybe it would just never hit this code?

Thanks,
Ben

>
> Signed-off-by: Alexander Wetzel <alexander.wetzel@web.de>
> ---
>  net/mac80211/key.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/net/mac80211/key.c b/net/mac80211/key.c
> index aee05ec3f7ea..266ea0b507e7 100644
> --- a/net/mac80211/key.c
> +++ b/net/mac80211/key.c
> @@ -332,10 +332,15 @@ static void ieee80211_key_replace(struct ieee80211_sub_if_data *sdata,
>
>  	WARN_ON(new && old && new->conf.keyidx != old->conf.keyidx);
>
> -	if (old)
> +	if (old) {
>  		idx = old->conf.keyidx;
> -	else
> +		/* Make sure the card can't encrypt/decrypt packets with
> +		 * the old key prior to switching to new key in mac80211.
> +		 */
> +		ieee80211_key_disable_hw_accel(old);
> +	} else {
>  		idx = new->conf.keyidx;
> +	}
>
>  	if (sta) {
>  		if (pairwise) {
>
Alexander Wetzel March 25, 2018, 7:45 p.m. UTC | #2
> What will happen to drivers like ath10k that cannot do software
encrypt/decrypt?
>
> ath10k can support multiple key-ids as far as I can tell,
> so maybe it would just never hit this code?

Still learning how that all fits together, but I'm sure any card using
mac80211 will also use ieee80211_key_replace, including ath10k.

We are in a race with the remote station there is no chance that we can
switch over exactly at the same time. If we can't fall pack to software
encryption we'll just have to drop some more packets.

I'm pretty sure mac80211 will just encrypt a frame in software and
send it to ath10 for processing once we have removed the key from the hw
in the same way as for any other card.
My expectation here would be, that the driver detects and drops the
pre-encrypted frames it no longer has a hw key for.

Unfortunately this is just an assumption, since I haven't found the code
handling this case in ath10k. And even if true this could well cause
some undesired warning messages.

I guess we should therefore make sure we do not send out any packets in
the critical time window.

Now stopping and flushing the queues seems to be bad idea which could
cause a real performance impact for on a busy AP with many stations and
rekeys enabled...
Luckily it looks like we can instead just set KEY_FLAG_TAINTED for the
old key to make sure we stop sending packets till the rekey is done.

That should cause ieee80211_tx_h_select_key to drop all packets without
a new per-packet check and also should cover potential undesired side
effects, isn't it?


Regards,

Alexander
Ben Greear March 25, 2018, 9:59 p.m. UTC | #3
On 03/25/2018 12:45 PM, Alexander Wetzel wrote:
>
>> What will happen to drivers like ath10k that cannot do software
> encrypt/decrypt?
>>
>> ath10k can support multiple key-ids as far as I can tell,
>> so maybe it would just never hit this code?
>
> Still learning how that all fits together, but I'm sure any card using
> mac80211 will also use ieee80211_key_replace, including ath10k.
>
> We are in a race with the remote station there is no chance that we can
> switch over exactly at the same time. If we can't fall pack to software
> encryption we'll just have to drop some more packets.
>
> I'm pretty sure mac80211 will just encrypt a frame in software and
> send it to ath10 for processing once we have removed the key from the hw
> in the same way as for any other card.

I don't think ath10k can handle sending already-encrypted data packets,
but possibly it works with newer upstream firmware/driver.

Either way, as long as it does not fundamentally break something (like
a non-recoverable data stall), then maybe your patch is fine anyway
and ath10k may just drop a few extra frames.

> My expectation here would be, that the driver detects and drops the
> pre-encrypted frames it no longer has a hw key for.
>
> Unfortunately this is just an assumption, since I haven't found the code
> handling this case in ath10k. And even if true this could well cause
> some undesired warning messages.
>
> I guess we should therefore make sure we do not send out any packets in
> the critical time window.
>
> Now stopping and flushing the queues seems to be bad idea which could
> cause a real performance impact for on a busy AP with many stations and
> rekeys enabled...
> Luckily it looks like we can instead just set KEY_FLAG_TAINTED for the
> old key to make sure we stop sending packets till the rekey is done.
>
> That should cause ieee80211_tx_h_select_key to drop all packets without
> a new per-packet check and also should cover potential undesired side
> effects, isn't it?

I get lost in the weeds when trying to understand all of this, and some
previous attempts of mine to fix some of this evidently wasn't correct
enough to accept upstream:

https://www.spinics.net/lists/hostap/msg03677.html

So I really don't know enough to properly review
your patch.  Just be aware that ath10k is weird about sw-crypt, maybe make
sure your patch is tested on it to make sure it doesn't out-right break something.

Thanks,
Ben
Sebastian Gottschall March 26, 2018, 7:43 a.m. UTC | #4
>
> So I really don't know enough to properly review
> your patch.  Just be aware that ath10k is weird about sw-crypt, maybe 
> make
> sure your patch is tested on it to make sure it doesn't out-right 
> break something.
i will test it today in sta and ap mode. lets see whats the result after 
some hours

Sebastian
Sebastian Gottschall March 26, 2018, 12:22 p.m. UTC | #5
so far i see no regressions with 9984 with that patch

except that 9984 has a rekeying problem at all. with wds ap -> wds sta 
mode rekeying will fail and it will reauthenticate at each interval. (it 
disconnects and reconnects)
but this is a long term issue qca never fixed for years. 988x doesnt 
suffer from that issue

Am 25.03.2018 um 23:59 schrieb Ben Greear:
>
>
> On 03/25/2018 12:45 PM, Alexander Wetzel wrote:
>>
>>> What will happen to drivers like ath10k that cannot do software
>> encrypt/decrypt?
>>>
>>> ath10k can support multiple key-ids as far as I can tell,
>>> so maybe it would just never hit this code?
>>
>> Still learning how that all fits together, but I'm sure any card using
>> mac80211 will also use ieee80211_key_replace, including ath10k.
>>
>> We are in a race with the remote station there is no chance that we can
>> switch over exactly at the same time. If we can't fall pack to software
>> encryption we'll just have to drop some more packets.
>>
>> I'm pretty sure mac80211 will just encrypt a frame in software and
>> send it to ath10 for processing once we have removed the key from the hw
>> in the same way as for any other card.
>
> I don't think ath10k can handle sending already-encrypted data packets,
> but possibly it works with newer upstream firmware/driver.
>
> Either way, as long as it does not fundamentally break something (like
> a non-recoverable data stall), then maybe your patch is fine anyway
> and ath10k may just drop a few extra frames.
>
>> My expectation here would be, that the driver detects and drops the
>> pre-encrypted frames it no longer has a hw key for.
>>
>> Unfortunately this is just an assumption, since I haven't found the code
>> handling this case in ath10k. And even if true this could well cause
>> some undesired warning messages.
>>
>> I guess we should therefore make sure we do not send out any packets in
>> the critical time window.
>>
>> Now stopping and flushing the queues seems to be bad idea which could
>> cause a real performance impact for on a busy AP with many stations and
>> rekeys enabled...
>> Luckily it looks like we can instead just set KEY_FLAG_TAINTED for the
>> old key to make sure we stop sending packets till the rekey is done.
>>
>> That should cause ieee80211_tx_h_select_key to drop all packets without
>> a new per-packet check and also should cover potential undesired side
>> effects, isn't it?
>
> I get lost in the weeds when trying to understand all of this, and some
> previous attempts of mine to fix some of this evidently wasn't correct
> enough to accept upstream:
>
> https://www.spinics.net/lists/hostap/msg03677.html
>
> So I really don't know enough to properly review
> your patch.  Just be aware that ath10k is weird about sw-crypt, maybe 
> make
> sure your patch is tested on it to make sure it doesn't out-right 
> break something.
>
> Thanks,
> Ben
>
>
Alexander Wetzel March 26, 2018, 8:24 p.m. UTC | #6
> so far i see no regressions with 9984 with that patch
> 
> except that 9984 has a rekeying problem at all. with wds ap -> wds sta
> mode rekeying will fail and it will reauthenticate at each interval. (it
> disconnects and reconnects)
> but this is a long term issue qca never fixed for years. 988x doesnt
> suffer from that issue

Thanks for testing, sounds promising.

If anyone is interested how it looks in my test environment I've
uploaded two sample captures to
https://www.awhome.eu/index.php/s/abxgp9pfi2ssCNy, showing how the
unpatched and patched mac80211 are reacting to the rekey. The WPA
Password is Induction and the AP rekeys all 30s.

The AP is running lede 17.01.4, so it's way off from the current
kernel/mac80211.
The client is a HTC 10 phone running Lineageos. (The phone also has a
WLAN card which has problems when rekeying.)

There are quite many interesting things visible here, not the least one
that ath9k leaks unencrypted frames for both patched and unpatched
mac80211 which at least for my patched variant probably allow to
calculate the TK key and encrypt all frames.

I'm now experimenting now with KEY_FLAG_TAINTED, but it's not as
straight forward as I expected.
Johannes Berg March 27, 2018, 1:03 p.m. UTC | #7
On Mon, 2018-03-26 at 22:24 +0200, Alexander Wetzel wrote:

> There are quite many interesting things visible here, not the least one
> that ath9k leaks unencrypted frames for both patched and unpatched
> mac80211 which at least for my patched variant probably allow to
> calculate the TK key and encrypt all frames.

That's odd - any thoughts on how that might be happening?


johannes
Johannes Berg March 27, 2018, 1:13 p.m. UTC | #8
On Sat, 2018-03-24 at 11:29 +0100, Alexander Wetzel wrote:
> Rekeying a pairwise key with encryption offload and only keyid 0 has two
> potential races which can freeze the wlan conection till rekeyed again:
> 
>  1) For incomming packets:
>     If the local STA installs the key prior to the remote STA we still
>     have the old key active in the hardware for a short time after
>     mac80211 switched to the new key.
>     The card can still hand over packets decoded with the old key to
>     mac80211, bumping the new PN (IV) value to an incorrect high number and
>     tricking the local replay detection to drop all packets really sent
>     with the new key.
> 
>  2) For outgoing packets:
>     If mac80211 is providing the PN (IV) and hands over the cleartext
>     packets for encryption to the hardware immediately prior to a key
>     change the driver/card may process the queued packets after
>     switching to the new key.
>     This will immediatelly bump the PN (IV) value on the remote STA to
>     an incorrect high number, also freezing the connection.

Correct for both, yes, this is a known issue.

> Both issues can be prevented by deleting the key from the hardware prior
> to switching to the new key in mac80211, falling back to software
> encryption/decryption till the switch to the new key is completed.

This, however, is not true in general. It only works if the queues are
flushed quickly enough between removing the key and setting a new one.
Otherwise, the same is still possible, e.g. on TX:

 * many packets are in the (HW) queue
 * enqueue packet with PN=0x10000
 * turn off HW crypto
   [here I can actually see how you might now leak packets as
    unencrypted that are already in the queue]
 * install a new key
 * turn on HW crypto for the new key
 * process packet with PN=0x10000

Fundamentally, nothing stops this from happening, as we (still) don't
have any synchronization between transmission and key configuration.

Also, in this case it seems pretty obvious how you can leak packets
unencrypted, since the HW now no longer has a key. This might not even
be fixable if the NIC cannot reject packets that are supposed to be
encrypted to a key that's no longer valid in the HW.

I don't really see how the unencrypted leak happens with the current
code though, unless the driver somehow first invalidates the key and
then installs a new one, and there's a race with this?

Ultimately, I don't think this patch is good enough. We clearly have a
problem here with leaking unencrypted frames, if the device can't reject
them (and I can't really fault it for that), so in order to really fix
it we'd have to completely flush all software and hardware queues, and
then start again with a new key - and for that we don't even need to
disable HW crypto.


(FWIW, iwlwifi mostly avoids this problem on TX - at least for keys
other than GCMP-256 - by embedding the key material into the frame
itself, so we simply don't have such a race condition there)

johannes
Alexander Wetzel April 8, 2018, 8:31 p.m. UTC | #9
> Fundamentally, nothing stops this from happening, as we (still) don't
> have any synchronization between transmission and key configuration.
I'm currently working on that and nearly have a rfc version ready to
share. Unfortunately I still continue to find issues to iron out.
May be some more days to sort the latest and hopefully the last issue.

The current version is already fixing the issue with my ath9k AP but it
looks like it's now racing with eapol #4 and needs more tweaks.

> Also, in this case it seems pretty obvious how you can leak packets
> unencrypted, since the HW now no longer has a key. This might not even
> be fixable if the NIC cannot reject packets that are supposed to be
> encrypted to a key that's no longer valid in the HW.
>
Exactly. I'm planing to avoid that issue by just dropping (and flushing)
all packets while mac80211 replaces the keys. Queuing them in mac80211
should also be possible, but I abandoned that for now  - after figuring
out that the PS code currently using those queues allows only an AP (or
a mesh) to queue. Still looks doable, but too invasive for now.

> I don't really see how the unencrypted leak happens with the current
> code though, unless the driver somehow first invalidates the key and
> then installs a new one, and there's a race with this?

Well, current mac80211 code always handling a key replace in that order:
- set the new key in mac80211
- remove the key from hw
- delete the old key
- enable hw accel for new key

Problem here is, that when we remove the key from the hw we first drop
the key from the card and after clear KEY_FLAG_UPLOADED_TO_HARDWARE.
So yes, we have a short window where mac80211 incorrectly assumes the hw
is encrypting packets when it's not.

That is trivial to fix, we just have to remove the flag prior to calling
drv_set_key in ieee80211_key_disable_hw_accel.

This will of course hand over software encrypted packets to the driver
again, but this also happens when we enable hw encryption and therefore
should be pretty well tested with all drivers.

> 
> Ultimately, I don't think this patch is good enough. We clearly have a
> problem here with leaking unencrypted frames, if the device can't reject
> them (and I can't really fault it for that), so in order to really fix
> it we'd have to completely flush all software and hardware queues, and
> then start again with a new key - and for that we don't even need to
> disable HW crypto.

I agree, but we still have to disable the hw encryption in the final
solution, haven't we?
If not the remote STA may still send us some frames with the old IV and
key and our RX will decode and hand them over to mac80211. And mac80211
will bump the IV for the new key to the value the old had.

Or is there a way mac80211 can see which key was used to decode the
frame? I did not see anything, but did not dug deeper.

Alexander
Johannes Berg April 9, 2018, 7:25 a.m. UTC | #10
On Sun, 2018-04-08 at 22:31 +0200, Alexander Wetzel wrote:

> Exactly. I'm planing to avoid that issue by just dropping (and flushing)
> all packets while mac80211 replaces the keys. 

That would again need driver support though, and I'm not sure all
drivers implement the flush method (correctly) today.

> Queuing them in mac80211
> should also be possible, but I abandoned that for now  - after figuring
> out that the PS code currently using those queues allows only an AP (or
> a mesh) to queue. Still looks doable, but too invasive for now.

Yeah, the whole queueing is messy for sure.

> > I don't really see how the unencrypted leak happens with the current
> > code though, unless the driver somehow first invalidates the key and
> > then installs a new one, and there's a race with this?
> 
> Well, current mac80211 code always handling a key replace in that order:
> - set the new key in mac80211
> - remove the key from hw
> - delete the old key
> - enable hw accel for new key
> 
> Problem here is, that when we remove the key from the hw we first drop
> the key from the card and after clear KEY_FLAG_UPLOADED_TO_HARDWARE.
> So yes, we have a short window where mac80211 incorrectly assumes the hw
> is encrypting packets when it's not.
> 
> That is trivial to fix, we just have to remove the flag prior to calling
> drv_set_key in ieee80211_key_disable_hw_accel.

Oh, ok. That's indeed not such a big deal then. This code was just never
really meant to work properly while operating on this key - typically it
either gets invoked when the connection is already no longer authorized,
or when swapping GTK where we only RX anyway, or on AP side when
swapping GTK the second time, but then that key has (usually) long been
unused.

> This will of course hand over software encrypted packets to the driver
> again, but this also happens when we enable hw encryption and therefore
> should be pretty well tested with all drivers.

No, it doesn't generally happen with all drivers, because we usually
install keys before we have a functioning datapath up. In the case of
ath10k, it's even not possible to send those frames, as has been pointed
out in this thread before.

> > Ultimately, I don't think this patch is good enough. We clearly have a
> > problem here with leaking unencrypted frames, if the device can't reject
> > them (and I can't really fault it for that), so in order to really fix
> > it we'd have to completely flush all software and hardware queues, and
> > then start again with a new key - and for that we don't even need to
> > disable HW crypto.
> 
> I agree, but we still have to disable the hw encryption in the final
> solution, haven't we?

I don't know. Honestly, I'm not even sure this problem is worth solving
right now. AFAICT it's a pretty special and fringe configuration to
enable PTK rekeying to start with, and the latest 802.11 allows using
non-zero key ID for PTKs, so we could implement that on both AP/client
sides instead, and thus really solve the problem without any races.

> If not the remote STA may still send us some frames with the old IV and
> key and our RX will decode and hand them over to mac80211. And mac80211
> will bump the IV for the new key to the value the old had.

Yes, this does happen.

> Or is there a way mac80211 can see which key was used to decode the
> frame? I did not see anything, but did not dug deeper.

Not right now. I guess you could add one, but it may be difficult to
even know. I think iwlwifi might have a "key color" bit it reports up
and that you could use (just swap it every time you overwrite the key),
but I guess not all hardware would have that.

johannes
Johannes Berg May 15, 2018, 3:50 p.m. UTC | #11
On Tue, 2018-05-15 at 12:22 +0200, Alexander Wetzel wrote:
> 
> Both issues can be prevented by first replacing the key in the HW and
> makeing sure no aggregation sessions are running during the rekey.

I don't think you can do this - just tear down all aggregation sessions
- there are APs out there that will not re-establish them if you tear
them down, or only attempt a given number of times, etc. This will cause
interoperability problems.

OTOH, arguably we have worse interoperability problems today, and anyone
who configures PTK rekeying is deluded that it'll work properly, so
maybe that's not _that_ bad. Hmm.

johannes
Alexander Wetzel May 15, 2018, 10:41 p.m. UTC | #12
Hello,

> On Tue, 2018-05-15 at 12:22 +0200, Alexander Wetzel wrote:
>>
>> Both issues can be prevented by first replacing the key in the HW and
>> makeing sure no aggregation sessions are running during the rekey.
> 
> I don't think you can do this - just tear down all aggregation sessions
> - there are APs out there that will not re-establish them if you tear
> them down, or only attempt a given number of times, etc. This will cause
> interoperability problems.

I'm on very thin ice here, but my impression was that this should work
without too many problems for all (most?) systems:
- An aggregation session is only started when needed
- ADDBA can't be expected to succeed
- It's normal to tear down an aggregation session once your queue is empty.

The only unusual thing here is, that the originator can get a DELBA from
the recipient while transmitting data and not after some inactivity
timeout. But reading IEEE802.11-2016 chapter 11.5.4 seems to indicate
that you have to expect and handle DELBA frames any time.

So far I've found only one device which is handling a PSK rekey
correctly (Windows Surface Pro 3 running Win 10) and that one was
working fine with my patched AP for three rekeys while downloading at
full speed. The fourth rekey failed and caused an re-associated, but
according to the OTA capture the AP did not respond to at least 5 EAPOL
#2 frames and we therefore never got to the code stopping the
aggregation for rekey.

That said I think I can get the code working without stopping RX
aggregation and a spoofed "idle" tear down of the TX aggregation.

Problem here is the reorder buffer can have already decoded packets
queued from both the old and the new key. And once the session is
complete will releases those when we are on the new key, poisoning our PN.
First plan would be to mark any running RX aggregation queue as tainted
and once the aggregation is complete discard all packets in it.

> 
> OTOH, arguably we have worse interoperability problems today, and anyone
> who configures PTK rekeying is deluded that it'll work properly, so
> maybe that's not _that_ bad. Hmm.

Assuming the other STA is not totally broken this should only degrade
the speed, but keep the connection operable.
If you prefer to not stop the RX aggregation I'll try my hand on that
next. (I assume stopping TX is fine?)

The tests I've run so far are showing that we have at least two group of
"broken" devices out in the wild:
1) The first group is handling rekeys pretty much like mac80211. Some
are better on TX like my HTC 10 (seems to be fullmac) but are failing to
separate RX frames properly based on the key used to decode it.

2) The second group is even worse implemented, but in a nice twist are
seeming to work quite fine for the users. Those are simply encoding
eapol #4 with the new key, preventing any rekey to ever succeed and
triggering a re-associate.

Statistically my data is less than insufficient, but I suspect that
there are quite some APs in the wild running rekeys but the combination
of hour long rekey intervals, the fact hat you must have traffic during
the rekey and that at least some common network cards handle eapol #4
wrong keeps the heat down. And of course this issue is next to
impossible to track down if you are not some kind of expert.

Nevertheless I can you find many "magic" solutions to fix linux wlan
issues by switching over to software encryption and disabling 802.11n,
which are exactly the actions which drastically reduce the chance to
freeze a wlan during a PSK rekey. (I'm sure many of those are other
issues, but I'm equally sure a sizeable fraction is not.)

One of the "better" reports is here:
https://bugzilla.kernel.org/show_bug.cgi?id=42877

Alexander
Johannes Berg May 16, 2018, 6:56 a.m. UTC | #13
On Wed, 2018-05-16 at 00:41 +0200, Alexander Wetzel wrote:
> 
> I'm on very thin ice here,

I think we all are, since we're talking about interoperability :-)

> but my impression was that this should work
> without too many problems for all (most?) systems:
> - An aggregation session is only started when needed
> - ADDBA can't be expected to succeed
> - It's normal to tear down an aggregation session once your queue is empty.
> 
> The only unusual thing here is, that the originator can get a DELBA from
> the recipient while transmitting data and not after some inactivity
> timeout. But reading IEEE802.11-2016 chapter 11.5.4 seems to indicate
> that you have to expect and handle DELBA frames any time.

There's not necessarily any inactivity timeout. Many STAs (AP STAs
included) will set that to 0 if the aggregation session doesn't cost
them an appreciable amount of resources.

You're reading the spec correctly, but the spec also omits entirely when
you would (want to) start a session, and in my experience the logic in
many STAs will not easily reopen sessions that were torn down (multiple
times) by a recipient DelBA, since they don't know that it won't be torn
down immediately again.

I think the issue is more pronounced with *rejecting* the AddBA, rather
than sending a recipient DelBA, but I'd still be cautious about it.

> So far I've found only one device which is handling a PSK rekey
> correctly (Windows Surface Pro 3 running Win 10) and that one was
> working fine with my patched AP for three rekeys while downloading at
> full speed. The fourth rekey failed and caused an re-associated, but
> according to the OTA capture the AP did not respond to at least 5 EAPOL
> #2 frames and we therefore never got to the code stopping the
> aggregation for rekey.

:-)

That's kinda my point though - you will not really find devices that
work correctly ;-)

> Problem here is the reorder buffer can have already decoded packets
> queued from both the old and the new key. And once the session is
> complete will releases those when we are on the new key, poisoning our PN.
> First plan would be to mark any running RX aggregation queue as tainted
> and once the aggregation is complete discard all packets in it.

Yes, good point, the reorder buffer definitely is a concern here. But
it's also completely managed in software in this case, so I guess we
could tag the key that was used into the frames somehow?

OTOH, if we allow that then we also open ourselves up to replay attacks
during this scenario since we can no longer check the frames properly,
so we'd better drop them.

> Assuming the other STA is not totally broken this should only degrade
> the speed, but keep the connection operable.

Yes.

> If you prefer to not stop the RX aggregation I'll try my hand on that
> next. (I assume stopping TX is fine?)

Stopping TX is fine, that's a local problem. We should make sure we
reopen the sessions but that's at least something we control.

Stopping RX - well, maybe I'm thinking now that it's not so bad. Given
that it's an infrequent DelBA rather than AddBA rejection, it should be
more or less OK.

> The tests I've run so far are showing that we have at least two group of
> "broken" devices out in the wild:
> 1) The first group is handling rekeys pretty much like mac80211. Some
> are better on TX like my HTC 10 (seems to be fullmac) but are failing to
> separate RX frames properly based on the key used to decode it.

No surprise here. It's a very common trade-off - do the PN checks in
software so that you don't waste precious device memory on all the PNs
for all the TIDs, but for speed/CPU reasons do the encryption in
hardware. Basically any time you have this, you run into the problem.
Even if you have full-MAC that may still happen since often this is
implemented in two different "CPUs" (like e.g. Broadcom) or different
hardware blocks perhaps.

> 2) The second group is even worse implemented, but in a nice twist are
> seeming to work quite fine for the users. Those are simply encoding
> eapol #4 with the new key, preventing any rekey to ever succeed and
> triggering a re-associate.

:-)

> Statistically my data is less than insufficient, but I suspect that
> there are quite some APs in the wild running rekeys but the combination
> of hour long rekey intervals, the fact hat you must have traffic during
> the rekey and that at least some common network cards handle eapol #4
> wrong keeps the heat down. And of course this issue is next to
> impossible to track down if you are not some kind of expert.

Indeed.

> Nevertheless I can you find many "magic" solutions to fix linux wlan
> issues by switching over to software encryption and disabling 802.11n,
> which are exactly the actions which drastically reduce the chance to
> freeze a wlan during a PSK rekey. (I'm sure many of those are other
> issues, but I'm equally sure a sizeable fraction is not.)
> 
> One of the "better" reports is here:
> https://bugzilla.kernel.org/show_bug.cgi?id=42877

Right.

I think the part that I misinterpreted in a way is that I thought these
issues mostly happened to people who were explicitly configuring their
(Open|*)Wrt routers to do PTK rekeying. I hadn't really seen any vendors
enabling it in their stock configuration. But perhaps I'm mistaken here,
or perhaps people are actually running into PN wraps after 2**48
packets, which requires a rekey?

Anyway, next step is I think for me to take a closer look at this patch
- I'm starting to think that the aggregation issue isn't so bad.

Thanks for all your work on this!

johannes
diff mbox

Patch

diff --git a/net/mac80211/key.c b/net/mac80211/key.c
index aee05ec3f7ea..266ea0b507e7 100644
--- a/net/mac80211/key.c
+++ b/net/mac80211/key.c
@@ -332,10 +332,15 @@  static void ieee80211_key_replace(struct ieee80211_sub_if_data *sdata,
 
 	WARN_ON(new && old && new->conf.keyidx != old->conf.keyidx);
 
-	if (old)
+	if (old) {
 		idx = old->conf.keyidx;
-	else
+		/* Make sure the card can't encrypt/decrypt packets with
+		 * the old key prior to switching to new key in mac80211.
+		 */
+		ieee80211_key_disable_hw_accel(old);
+	} else {
 		idx = new->conf.keyidx;
+	}
 
 	if (sta) {
 		if (pairwise) {