[v2] mac80211: Fix wlan freezes under load at rekey

Message ID	20180515102202.2021-1-alexander.wetzel@web.de (mailing list archive)
State	Changes Requested
Delegated to:	Johannes Berg
Headers	show Return-Path: <linux-wireless-owner@kernel.org> sender: postmaster@awhome.eu) by player696.ha.ovh.net (Postfix) with ESMTPSA id B36083C00C9; Tue, 15 May 2018 12:22:51 +0200 (CEST) From: Alexander Wetzel <alexander.wetzel@web.de> To: johannes@sipsolutions.net Cc: linux-wireless@vger.kernel.org, greearb@candelatech.com, s.gottschall@dd-wrt.com, Alexander Wetzel <alexander.wetzel@web.de> Subject: [PATCH v2] mac80211: Fix wlan freezes under load at rekey Date: Tue, 15 May 2018 12:22:02 +0200 Message-Id: <20180515102202.2021-1-alexander.wetzel@web.de> In-Reply-To: <1523258757.3076.5.camel@sipsolutions.net> References: <1523258757.3076.5.camel@sipsolutions.net> Sender: linux-wireless-owner@vger.kernel.org Precedence: bulk

Alexander Wetzel May 15, 2018, 10:22 a.m. UTC

Rekeying a pairwise key with encryption offload and only keyid 0 has two
potential races which can freeze the wlan conection till rekeyed again:

 1) For incomming packets:
    If the local STA installs the key prior to the remote STA we still
    have the old key active in the hardware for a short time after
    mac80211 switched to the new key.
    The card can still hand over packets decoded with the old key to
    mac80211, bumping the new PN (IV) value to an incorrect high number and
    tricking the local replay detection to drop all packets really sent
    with the new key.

 2) For outgoing packets:
    If mac80211 is providing the PN (IV) and hands over the cleartext
    packets for encryption to the hardware immediately prior to a key
    change the driver/card may process the queued packets after
    switching to the new key.
    This will immediatelly bump the PN (IV) value on the remote STA to
    an incorrect high number, also freezing the connection.

Both issues can be prevented by first replacing the key in the HW and
makeing sure no aggregation sessions are running during the rekey.

Signed-off-by: Alexander Wetzel <alexander.wetzel@web.de>
---
 net/mac80211/key.c | 54 ++++++++++++++++++++++++++++++++++------------
 1 file changed, 40 insertions(+), 14 deletions(-)

Johannes Berg June 15, 2018, 11:33 a.m. UTC | #1

On Tue, 2018-05-15 at 12:22 +0200, Alexander Wetzel wrote:
> Rekeying a pairwise key with encryption offload and only keyid 0 has two
> potential races which can freeze the wlan conection till rekeyed again:
> 
>  1) For incomming packets:
>     If the local STA installs the key prior to the remote STA we still
>     have the old key active in the hardware for a short time after
>     mac80211 switched to the new key.
>     The card can still hand over packets decoded with the old key to
>     mac80211, bumping the new PN (IV) value to an incorrect high number and
>     tricking the local replay detection to drop all packets really sent
>     with the new key.
> 
>  2) For outgoing packets:
>     If mac80211 is providing the PN (IV) and hands over the cleartext
>     packets for encryption to the hardware immediately prior to a key
>     change the driver/card may process the queued packets after
>     switching to the new key.
>     This will immediatelly bump the PN (IV) value on the remote STA to
>     an incorrect high number, also freezing the connection.
> 
> Both issues can be prevented by first replacing the key in the HW and
> makeing sure no aggregation sessions are running during the rekey.

Getting back to this, am I understanding correctly that in the latter
(outgoing) case this would cause 

Also, I think you should probably describe better why the aggregation
session stuff is needed. I'm already thinking there times about it again
...

> +               ieee80211_sta_tear_down_BA_sessions(
> +                   sta, AGG_STOP_LOCAL_REQUEST);

minor indentation issue here

> +       ieee80211_flush_queues(key->local, key->sdata, false);
> +       ieee80211_key_disable_hw_accel(key);

I'm not sure all drivers implement drv_flush() [correctly], what happens
if they don't? I guess some packets end up being transmitted in clear
text or a dummy key, unless the hardware/firmware knows about this and
drops them?

Perhaps that means we should make this whole thing opt-in, and leave it
up to driver authors to first validate that they handle the flushing
correctly?

Regarding the code: the whole dance you do with ieee80211_key_link() and
ieee80211_key_replace() seems to be a little pointless because you still
add the key to debugfs and then free it, on errors that is?

johannes

Alexander Wetzel June 18, 2018, 9:03 p.m. UTC | #2

> On Tue, 2018-05-15 at 12:22 +0200, Alexander Wetzel wrote:
>> Rekeying a pairwise key with encryption offload and only keyid 0 has two
>> potential races which can freeze the wlan conection till rekeyed again:
>>
>>  1) For incomming packets:
>>     If the local STA installs the key prior to the remote STA we still
>>     have the old key active in the hardware for a short time after
>>     mac80211 switched to the new key.
>>     The card can still hand over packets decoded with the old key to
>>     mac80211, bumping the new PN (IV) value to an incorrect high number and
>>     tricking the local replay detection to drop all packets really sent
>>     with the new key.
>>
>>  2) For outgoing packets:
>>     If mac80211 is providing the PN (IV) and hands over the cleartext
>>     packets for encryption to the hardware immediately prior to a key
>>     change the driver/card may process the queued packets after
>>     switching to the new key.
>>     This will immediatelly bump the PN (IV) value on the remote STA to
>>     an incorrect high number, also freezing the connection.
>>
>> Both issues can be prevented by first replacing the key in the HW and
>> makeing sure no aggregation sessions are running during the rekey.
> 
> Getting back to this, am I understanding correctly that in the latter
> (outgoing) case this would cause 

I don't get the point here...Shall I flesh out the description of the
exiting races in the code a bit more?

> 
> Also, I think you should probably describe better why the aggregation
> session stuff is needed. I'm already thinking there times about it again
> ...
I'll rework the description and will go into more details.
Here a first quick draft what I plan amend to the commit message. But
this got kind of long:
...
Both issues can be prevented by first replacing the key in the HW and
making sure no aggregation sessions are running during the rekey.

The "core" of the fix is to change the old unicast key install sequence
 - atomic switch over to the new key in mac80211
 - remove the old key in the HW and mac80211
 - add new key in the hw
to
 - mark the old key as tainted to drop TX packets trying to use it
 - remove the old key in the HW and mac80211
 - add new key to the HW
 - atomic switch over to the new key in mac80211

Since the new key is not marked as tainted the last step will also
enable TX again.

With the new procedure the HW will be unable to decode packets still
encrypted with the old key prior to switching to the new key in
mac80211. Locking down TX during the rekey makes sure that there are no
outgoing packets mac80211 prepared for the old key the hw can encrypt
with the new key.
We can now decrypt and get incoming packets while mac80211 is still on
the old key, but the worst (and most likely) case is now, that the
replay detection will drop those packets till mac80211 also switch over.

Instead of checking PN from old key packets against the new key (and
bump the PN for the new key, killing our connection) we are now checking
the PN from new key packets against the old key (and drop a few packets
during rekey only).

When using TX aggregation we get an additional race which can poison the
remote station:
When a TX aggregation session is running during the key replacement it's
possible that one of the frames send out prior of the TX lock down (old
key) got lost and will be repeated after the TX lock down is lifted,
using the old skb prepared for the old key with the new installed key in
the hardware. This would bump the PN counter of our peer to an incorrect
value and breaking the RX for the remote station.
Stopping all running aggregation sessions and declining any new ones
till we have switched over to the new key side steps that problem.

/end commit draft

It would be possible to keep the aggregation sessions running, but that
seems to need some quite invasive changes and checks.

> 
>> +               ieee80211_sta_tear_down_BA_sessions(
>> +                   sta, AGG_STOP_LOCAL_REQUEST);
> 
> minor indentation issue here
I'll go for a 81 char long single line here.

> 
>> +       ieee80211_flush_queues(key->local, key->sdata, false);
>> +       ieee80211_key_disable_hw_accel(key);
> 
> I'm not sure all drivers implement drv_flush() [correctly], what happens
> if they don't? I guess some packets end up being transmitted in clear
> text or a dummy key, unless the hardware/firmware knows about this and
> drops them?
Flushing the queues is not needed for to the logic and only a workaround
for drivers like ath9k which can still send out packets for a key which
was already deleted. When the driver guarantees that after the key
deletion function returns to mac80211 that there will be no more packets
send out prepared for the old key it can be removed.
The driver could simply wait for whatever the may queue time may be or
detect and drop these packets. (In fact it can even send out the packet
again, it just must make sure to still use the old and official deleted
key. May be handy for drivers like iwlwifi.)

> 
> Perhaps that means we should make this whole thing opt-in, and leave it
> up to driver authors to first validate that they handle the flushing
> correctly?

I tried to minimize the changes, but if we can consider an API change I
would ask the drivers to provide a new function to switch directly from
one key to another, without the need to delete it first. And to
guarantee there, that after the function returns no packets with
prepared for the old key can be sent out. Including retransmissions.

That way drivers like ath9k should be able to directly program the new
key and drivers for cards where this is not saving time can simply call
delete and add internally.

As fallback I would still want to use the current code with a flush,
well knowing that not all drivers implement it and only remove it once
all drivers have switched to the new API for rekeys.

> 
> 
> Regarding the code: the whole dance you do with ieee80211_key_link() and
> ieee80211_key_replace() seems to be a little pointless because you still
> add the key to debugfs and then free it, on errors that is?

I hope the more verbose description above explains that part.
The intend of the dance is to finalize the switch to the new key in the
hardware prior to doing it in mac80211. It's probably possible to bypass
more of the code, but I did not touch the logic besides what I needed,
understand or verified to be harmless.

Alexander

Johannes Berg June 18, 2018, 9:27 p.m. UTC | #3

Hi,

> > >  2) For outgoing packets:
> > >     If mac80211 is providing the PN (IV) and hands over the cleartext
> > >     packets for encryption to the hardware immediately prior to a key
> > >     change the driver/card may process the queued packets after
> > >     switching to the new key.
> > >     This will immediatelly bump the PN (IV) value on the remote STA to
> > >     an incorrect high number, also freezing the connection.
> > > 
> > > Both issues can be prevented by first replacing the key in the HW and
> > > makeing sure no aggregation sessions are running during the rekey.
> > 
> > Getting back to this, am I understanding correctly that in the latter
> > (outgoing) case this would cause 
> 
> I don't get the point here...Shall I flesh out the description of the
> exiting races in the code a bit more?

You couldn't, I didn't finish this sentence for some reason ... or wrote
and then deleted it by accident?

I meant to say:

Am I understanding correctly that in the latter (outgoing) case this
might cause unencrypted packets to be transmitted?

I talked about that more below.

> > Also, I think you should probably describe better why the aggregation
> > session stuff is needed. I'm already thinking there times about it again
> > ...
> 
> I'll rework the description and will go into more details.
> Here a first quick draft what I plan amend to the commit message. But
> this got kind of long:
> ...
> Both issues can be prevented by first replacing the key in the HW and
> making sure no aggregation sessions are running during the rekey.
> 
> The "core" of the fix is to change the old unicast key install sequence
>  - atomic switch over to the new key in mac80211
>  - remove the old key in the HW and mac80211
>  - add new key in the hw
> to
>  - mark the old key as tainted to drop TX packets trying to use it
>  - remove the old key in the HW and mac80211
>  - add new key to the HW
>  - atomic switch over to the new key in mac80211
> 
> Since the new key is not marked as tainted the last step will also
> enable TX again.
> 
> With the new procedure the HW will be unable to decode packets still
> encrypted with the old key prior to switching to the new key in
> mac80211. Locking down TX during the rekey makes sure that there are no
> outgoing packets mac80211 prepared for the old key the hw can encrypt
> with the new key.

> We can now decrypt and get incoming packets while mac80211 is still on
> the old key, but the worst (and most likely) case is now, that the
> replay detection will drop those packets till mac80211 also switch over.

This actually is somewhat problematic, at least for TKIP it could cause
countermeasures. Should we exclude TKIP here somehow?

I don't think we can disable countermeasures because then we could be
attacked in this time window without starting them, and we have a really
hard time distinguishing attacks and "fake" replays.

> Instead of checking PN from old key packets against the new key (and
> bump the PN for the new key, killing our connection) we are now checking
> the PN from new key packets against the old key (and drop a few packets
> during rekey only).
> 
> When using TX aggregation we get an additional race which can poison the
> remote station:
> When a TX aggregation session is running during the key replacement it's
> possible that one of the frames send out prior of the TX lock down (old
> key) got lost and will be repeated after the TX lock down is lifted,
> using the old skb prepared for the old key with the new installed key in
> the hardware. This would bump the PN counter of our peer to an incorrect
> value and breaking the RX for the remote station.

Right, this is what I had forgotten about already.

> > > +       ieee80211_flush_queues(key->local, key->sdata, false);
> > > +       ieee80211_key_disable_hw_accel(key);
> > 
> > I'm not sure all drivers implement drv_flush() [correctly], what happens
> > if they don't? I guess some packets end up being transmitted in clear
> > text or a dummy key, unless the hardware/firmware knows about this and
> > drops them?
> 
> Flushing the queues is not needed for to the logic and only a workaround
> for drivers like ath9k which can still send out packets for a key which
> was already deleted. When the driver guarantees that after the key
> deletion function returns to mac80211 that there will be no more packets
> send out prepared for the old key it can be removed.

What will happen for other kinds of packets though?

For iwlwifi, for example, the key material is added into the key. It
thus doesn't have this race in the first place, but it will definitely
send out packets after 

Btw - perhaps that means we should avoid the complicated mechanisms like
TX aggregation shutdown for keys that the driver marks as being safe?
Clearly for iwlwifi (at least CCMP and before, not with the longer keys
in GCMP-256) the race can't possibly happen.

In other drivers though, I worry that removing the key will *not* mean
that there are no more packets transmitted with it. If you have a key
cache in hardware then it might just be that marking the cache entry as
invalid means the encryption will be skipped when encountering a packet
to be transmitted with the key from a given (in the packet) key cache
index, and then the frame goes out unprotected?

> I tried to minimize the changes, but if we can consider an API change I
> would ask the drivers to provide a new function to switch directly from
> one key to another, without the need to delete it first. And to
> guarantee there, that after the function returns no packets with
> prepared for the old key can be sent out. Including retransmissions.

This would be pretty tricky for most drivers though.

> That way drivers like ath9k should be able to directly program the new
> key and drivers for cards where this is not saving time can simply call
> delete and add internally.

I don't think that's really all that much savings, time-wise? But it
might address the "unprotected TX" issue I worry about above.

> > Regarding the code: the whole dance you do with ieee80211_key_link() and
> > ieee80211_key_replace() seems to be a little pointless because you still
> > add the key to debugfs and then free it, on errors that is?
> 
> I hope the more verbose description above explains that part.
> The intend of the dance is to finalize the switch to the new key in the
> hardware prior to doing it in mac80211. It's probably possible to bypass
> more of the code, but I did not touch the logic besides what I needed,
> understand or verified to be harmless.

I'm just saying that

err = add_to_hardware();

add_to_debugfs();

if (err)
	remove_from_debugfs();

is pointless :-)

johannes

Alexander Wetzel June 19, 2018, 8:12 p.m. UTC | #4

> Am I understanding correctly that in the latter (outgoing) case this
> might cause unencrypted packets to be transmitted?

Yes, if we have a driver handling the keys similar to ath9k which is not
implementing drv_flush (correctly) this is a possibility.

This ties somewhat into the discussion what to handle in mac80211 and
what delegate to the drivers. mac80211 itself should drop any packets
trying to use the old outgoing key till the new key is rolled out
correctly with the patch.


>> We can now decrypt and get incoming packets while mac80211 is still on
>> the old key, but the worst (and most likely) case is now, that the
>> replay detection will drop those packets till mac80211 also switch over.
> 
> This actually is somewhat problematic, at least for TKIP it could cause
> countermeasures. Should we exclude TKIP here somehow?
> 
> I don't think we can disable countermeasures because then we could be
> attacked in this time window without starting them, and we have a really
> hard time distinguishing attacks and "fake" replays.
> 
Yes, that is a definitely a possibility. But excluding TKIP seems to be
counterproductive for my understanding: TKIP is due to the weaker cipher
more likely to have unicast rekeying enabled.
So far I was hesitant to add new per packet check for that, but I guess
we have to... I'll add that in the next version, so we can discuss that
with some code.


>> Flushing the queues is not needed for to the logic and only a workaround
>> for drivers like ath9k which can still send out packets for a key which
>> was already deleted. When the driver guarantees that after the key
>> deletion function returns to mac80211 that there will be no more packets
>> send out prepared for the old key it can be removed.
> 
> What will happen for other kinds of packets though?
> 
> For iwlwifi, for example, the key material is added into the key. It
> thus doesn't have this race in the first place, but it will definitely
> send out packets after 

That case is fine. A non-broken remote STA will not be able to decode
the packet if it has already completed the switch to the new key and not
hand over the packet to mac80211 or the equivalent.

> 
> Btw - perhaps that means we should avoid the complicated mechanisms like
> TX aggregation shutdown for keys that the driver marks as being safe?
> Clearly for iwlwifi (at least CCMP and before, not with the longer keys
> in GCMP-256) the race can't possibly happen.

Sounds like that should work and I think I'll just try it out. We'll
lose the side benefit that shutting down TX aggregation will reduce the
risk that a unpatched remote sta freezes the connection, though.
And since I started out to first patch ath9k to drop packets for an
outgoing key: That looks to become either be ugly (delayed work to
complete the key clean up after a given time) or need some API change.

Remember we'll still have to shut down RX aggregation or drop all frames
of a session running during the rekey. We are not able to tell which key
was used to encrypt the frames and which have "old" PNs we can't allow
mac80211 to process after switching to the new key. So I'm not sure that
keeping TX up and running is worth it.

That said I have another working patch which is delaying mac80211 key
deletion from my first misguided attempts to at this issue.
I could repurpose that and keep RX aggregation also up and running,
allowing us to first check if the packet would have been accepted by the
old key and only switch to the new PN counter once it's not. But that
seems to be kind of invasive and overkill.

> 
> In other drivers though, I worry that removing the key will *not* mean
> that there are no more packets transmitted with it. If you have a key
> cache in hardware then it might just be that marking the cache entry as
> invalid means the encryption will be skipped when encountering a packet
> to be transmitted with the key from a given (in the packet) key cache
> index, and then the frame goes out unprotected?
> 
Yes, that is part of the reason I had to add the call to drv_flush for
ath9k. I've experimented with an additional "retire_key" command on top
of the existing to add and delete keys but came to the conclusion that
"replace_key" would make more sense for ath9k at least.

But in order of my preference I see this options:
1) removing a key in HW must also grantee that any packets queued for
this key has either been send or will be dropped after the call returns
to mac80211. (Simply sleeping the max queue and retransmit time would be
an ugly but simple way to implement that)

2) we add a new function/call like "replace_key" for this special case.
But in that case we have to sort out how to handle drivers not
implementing it. Thy only secure way would be to disconnect if someone
tries to rekey...

3) Is what I implemented. We try what we can with the existing API and
any driver not working with that has to be considered buggy.

>> I tried to minimize the changes, but if we can consider an API change I
>> would ask the drivers to provide a new function to switch directly from
>> one key to another, without the need to delete it first. And to
>> guarantee there, that after the function returns no packets with
>> prepared for the old key can be sent out. Including retransmissions.
> 
> This would be pretty tricky for most drivers though.

I do not see any real alternative. Sleeping some reasonable time should
be acceptable here, to allow the hw to flush out queued packets. (100ms
should probably acceptable here, especially compared to complete
connection loss..
Only other idea I have is to lock down TX for all but EAPOL packets
sooner, e.g. during the key handshake. But that's more or less only a
variant of the sleep above.

> 
>> That way drivers like ath9k should be able to directly program the new
>> key and drivers for cards where this is not saving time can simply call
>> delete and add internally.
> 
> I don't think that's really all that much savings, time-wise? But it
> might address the "unprotected TX" issue I worry about above.
> 
Fully agree.

> 
> I'm just saying that
> 
> err = add_to_hardware();
> 
> add_to_debugfs();
> 
> if (err)
> 	remove_from_debugfs();
> 
> is pointless :-)

Agree, that is backwards compatibility a step too far:-)
I'll bypass that in the next patch version.
Planning to work on v3 on the weekend.

Alexander

Johannes Berg June 29, 2018, 10:12 a.m. UTC | #5

Hi,

On Tue, 2018-06-19 at 22:12 +0200, Alexander Wetzel wrote:
> > Am I understanding correctly that in the latter (outgoing) case this
> > might cause unencrypted packets to be transmitted?
> 
> Yes, if we have a driver handling the keys similar to ath9k which is not
> implementing drv_flush (correctly) this is a possibility.

Sadly, I think many drivers fall into this category, so I'm not sure we
should "fix the bug" for them at the expense of a potential security
risk? So maybe we need to have an opt-in flag for drivers.

> This ties somewhat into the discussion what to handle in mac80211 and
> what delegate to the drivers. mac80211 itself should drop any packets
> trying to use the old outgoing key till the new key is rolled out
> correctly with the patch.

Yeah, but it can't do it for packets already queued to the
driver/hardware, and it can't know how the hardware behaves wrt.
encryption: old-iwlwifi style (key material in descriptor) vs. ath9k
style (key material in on-chip table).

> > > We can now decrypt and get incoming packets while mac80211 is still on
> > > the old key, but the worst (and most likely) case is now, that the
> > > replay detection will drop those packets till mac80211 also switch over.
> > 
> > This actually is somewhat problematic, at least for TKIP it could cause
> > countermeasures. Should we exclude TKIP here somehow?
> > 
> > I don't think we can disable countermeasures because then we could be
> > attacked in this time window without starting them, and we have a really
> > hard time distinguishing attacks and "fake" replays.
> > 
> 
> Yes, that is a definitely a possibility. But excluding TKIP seems to be
> counterproductive for my understanding: TKIP is due to the weaker cipher
> more likely to have unicast rekeying enabled.
> So far I was hesitant to add new per packet check for that, but I guess
> we have to... I'll add that in the next version, so we can discuss that
> with some code.

Ok.

> > Btw - perhaps that means we should avoid the complicated mechanisms like
> > TX aggregation shutdown for keys that the driver marks as being safe?
> > Clearly for iwlwifi (at least CCMP and before, not with the longer keys
> > in GCMP-256) the race can't possibly happen.
> 
> Sounds like that should work and I think I'll just try it out. We'll
> lose the side benefit that shutting down TX aggregation will reduce the
> risk that a unpatched remote sta freezes the connection, though.

Fair point.

> And since I started out to first patch ath9k to drop packets for an
> outgoing key: That looks to become either be ugly (delayed work to
> complete the key clean up after a given time) or need some API change.

Ok.

> Remember we'll still have to shut down RX aggregation or drop all frames
> of a session running during the rekey. We are not able to tell which key
> was used to encrypt the frames and which have "old" PNs we can't allow
> mac80211 to process after switching to the new key. So I'm not sure that
> keeping TX up and running is worth it.

Yeah, good point, and TX is less interesting because we control when/how
it's set up.

> That said I have another working patch which is delaying mac80211 key
> deletion from my first misguided attempts to at this issue.
> I could repurpose that and keep RX aggregation also up and running,
> allowing us to first check if the packet would have been accepted by the
> old key and only switch to the new PN counter once it's not. But that
> seems to be kind of invasive and overkill.

Yeah, let's not go there for now. Also if HW crypto is involved it
usually won't work unless you (a) get the packet and (b) re-encrypt the
packet with the wrong key, and then decrypt it with the correct key ...

> Yes, that is part of the reason I had to add the call to drv_flush for
> ath9k. I've experimented with an additional "retire_key" command on top
> of the existing to add and delete keys but came to the conclusion that
> "replace_key" would make more sense for ath9k at least.

Ok.

> But in order of my preference I see this options:
> 1) removing a key in HW must also grantee that any packets queued for
> this key has either been send or will be dropped after the call returns
> to mac80211. (Simply sleeping the max queue and retransmit time would be
> an ugly but simple way to implement that)

This works but we have to consider a lot of drivers. I guess we're back
to some form of "opt-in" scheme then?

> 2) we add a new function/call like "replace_key" for this special case.
> But in that case we have to sort out how to handle drivers not
> implementing it. Thy only secure way would be to disconnect if someone
> tries to rekey...

Maybe that's not so bad.

> 3) Is what I implemented. We try what we can with the existing API and
> any driver not working with that has to be considered buggy.

I don't think we can really do this though. We break - in a potentially
security-relevant way - the older drivers. We can't just say that's
driver bugs, IMHO.

> > > I tried to minimize the changes, but if we can consider an API change I
> > > would ask the drivers to provide a new function to switch directly from
> > > one key to another, without the need to delete it first. And to
> > > guarantee there, that after the function returns no packets with
> > > prepared for the old key can be sent out. Including retransmissions.
> > 
> > This would be pretty tricky for most drivers though.
> 
> I do not see any real alternative. Sleeping some reasonable time should
> be acceptable here, to allow the hw to flush out queued packets. (100ms
> should probably acceptable here, especially compared to complete
> connection loss..

You have no guarantee that even 100ms is enough to get all packets out.
They could be sitting on a BK queue and waiting for all BE/VI traffic in
the vicinity ... basically forever.

> Only other idea I have is to lock down TX for all but EAPOL packets
> sooner, e.g. during the key handshake. But that's more or less only a
> variant of the sleep above.

Yeah, doesn't really fully address this issue anyway, since EAPOL are on
VO and others might get way more delayed on HW queues due to contention.

I think our best bet is some form of opt-in scheme. Perhaps (2), which
drivers can implement also to get like the "best" behaviour. They could
also implement there the flushing if they can, for example. Others would
get a disconnect, but they probably effectively do already?

Obviously if SW crypto is used none of this is a concern, so that's
another factor to take into account in this decision logic of whether to
disconnect or not?

IOW - I'd rather get bugs that we now force a disconnect (if anyone even
notices), rather than potentially having a bug that causes unencrypted
packets to be sent.

johannes

Alexander Wetzel June 29, 2018, 9:14 p.m. UTC | #6

Hello,

>>> Am I understanding correctly that in the latter (outgoing) case this
>>> might cause unencrypted packets to be transmitted?
>>
>> Yes, if we have a driver handling the keys similar to ath9k which is not
>> implementing drv_flush (correctly) this is a possibility.
> 
> Sadly, I think many drivers fall into this category, so I'm not sure we
> should "fix the bug" for them at the expense of a potential security
> risk? So maybe we need to have an opt-in flag for drivers.

Makes sense.

> 
>> This ties somewhat into the discussion what to handle in mac80211 and
>> what delegate to the drivers. mac80211 itself should drop any packets
>> trying to use the old outgoing key till the new key is rolled out
>> correctly with the patch.
> 
> Yeah, but it can't do it for packets already queued to the
> driver/hardware, and it can't know how the hardware behaves wrt.
> encryption: old-iwlwifi style (key material in descriptor) vs. ath9k
> style (key material in on-chip table).
> 
>>>> We can now decrypt and get incoming packets while mac80211 is still on
>>>> the old key, but the worst (and most likely) case is now, that the
>>>> replay detection will drop those packets till mac80211 also switch over.
>>>
>>> This actually is somewhat problematic, at least for TKIP it could cause
>>> countermeasures. Should we exclude TKIP here somehow?
>>>
>>> I don't think we can disable countermeasures because then we could be
>>> attacked in this time window without starting them, and we have a really
>>> hard time distinguishing attacks and "fake" replays.
>>>
>>
>> Yes, that is a definitely a possibility. But excluding TKIP seems to be
>> counterproductive for my understanding: TKIP is due to the weaker cipher
>> more likely to have unicast rekeying enabled.
>> So far I was hesitant to add new per packet check for that, but I guess
>> we have to... I'll add that in the next version, so we can discuss that
>> with some code.
> 
> Ok.

I was wrong here, this is not an issue. Tkip is simply dropping frames
when the IV is too small and never verifies the MIC. And since only MIC
errors can trigger counter measures we are fine as it is...

Nevertheless I also gave TKIP a quick test. While it's really slow
(2.6MB/s instead of 6.5MB/s due to no aggregation) it worked fine with
the patch and my sole client downloading. (Again ath9k as AP and iwlwifi
for STA)

Quick reference to the code:
ieee80211_rx_h_decrypt will return "RX_DROP_UNUSABLE" when IV is too
small and we bypass ieee80211_rx_h_michael_mic_verify in
ieee80211_rx_handlers_result due to the packet drop.

> 
>>> Btw - perhaps that means we should avoid the complicated mechanisms like
>>> TX aggregation shutdown for keys that the driver marks as being safe?
>>> Clearly for iwlwifi (at least CCMP and before, not with the longer keys
>>> in GCMP-256) the race can't possibly happen.
>>
>> Sounds like that should work and I think I'll just try it out. We'll
>> lose the side benefit that shutting down TX aggregation will reduce the
>> risk that a unpatched remote sta freezes the connection, though.
> 
> Fair point.
> 
>> And since I started out to first patch ath9k to drop packets for an
>> outgoing key: That looks to become either be ugly (delayed work to
>> complete the key clean up after a given time) or need some API change.
> 
> Ok.
> 
>> Remember we'll still have to shut down RX aggregation or drop all frames
>> of a session running during the rekey. We are not able to tell which key
>> was used to encrypt the frames and which have "old" PNs we can't allow
>> mac80211 to process after switching to the new key. So I'm not sure that
>> keeping TX up and running is worth it.
> 
> Yeah, good point, and TX is less interesting because we control when/how
> it's set up.
> 

I'll keep the current aggregation tear down in the patch for now, then.
The idea with testing keeping TX aggregation up and running would
require an AP not stopping TX aggregation.


>> That said I have another working patch which is delaying mac80211 key
>> deletion from my first misguided attempts to at this issue.
>> I could repurpose that and keep RX aggregation also up and running,
>> allowing us to first check if the packet would have been accepted by the
>> old key and only switch to the new PN counter once it's not. But that
>> seems to be kind of invasive and overkill.
> 
> Yeah, let's not go there for now. Also if HW crypto is involved it
> usually won't work unless you (a) get the packet and (b) re-encrypt the
> packet with the wrong key, and then decrypt it with the correct key ...

I don't get which case this is... But let's not go into that till we
have to:-)

> 
>> Yes, that is part of the reason I had to add the call to drv_flush for
>> ath9k. I've experimented with an additional "retire_key" command on top
>> of the existing to add and delete keys but came to the conclusion that
>> "replace_key" would make more sense for ath9k at least.
> 
> Ok.
> 
>> But in order of my preference I see this options:
>> 1) removing a key in HW must also grantee that any packets queued for
>> this key has either been send or will be dropped after the call returns
>> to mac80211. (Simply sleeping the max queue and retransmit time would be
>> an ugly but simple way to implement that)
> 
> This works but we have to consider a lot of drivers. I guess we're back
> to some form of "opt-in" scheme then?
> 
>> 2) we add a new function/call like "replace_key" for this special case.
>> But in that case we have to sort out how to handle drivers not
>> implementing it. Thy only secure way would be to disconnect if someone
>> tries to rekey...
> 
> Maybe that's not so bad.

Agree. After all this is what most of my windows drivers were doing
accidentally.


>> 3) Is what I implemented. We try what we can with the existing API and
>> any driver not working with that has to be considered buggy.
> 
> I don't think we can really do this though. We break - in a potentially
> security-relevant way - the older drivers. We can't just say that's
> driver bugs, IMHO.

We would not break it, only not fix it for all drivers. The current code
is already leaking cleartext packets for at least ath9k and most likely
many others when the PTK is rekeyed. The patch would improve that, but
due to more working rekeys it could leak more packets in specific
scenarios, I assume...


>>>> I tried to minimize the changes, but if we can consider an API change I
>>>> would ask the drivers to provide a new function to switch directly from
>>>> one key to another, without the need to delete it first. And to
>>>> guarantee there, that after the function returns no packets with
>>>> prepared for the old key can be sent out. Including retransmissions.
>>>
>>> This would be pretty tricky for most drivers though.
>>
>> I do not see any real alternative. Sleeping some reasonable time should
>> be acceptable here, to allow the hw to flush out queued packets. (100ms
>> should probably acceptable here, especially compared to complete
>> connection loss..
> 
> You have no guarantee that even 100ms is enough to get all packets out.
> They could be sitting on a BK queue and waiting for all BE/VI traffic in
> the vicinity ... basically forever.
> 
>> Only other idea I have is to lock down TX for all but EAPOL packets
>> sooner, e.g. during the key handshake. But that's more or less only a
>> variant of the sleep above.
> 
> Yeah, doesn't really fully address this issue anyway, since EAPOL are on
> VO and others might get way more delayed on HW queues due to contention.
> 
> I think our best bet is some form of opt-in scheme. Perhaps (2), which
> drivers can implement also to get like the "best" behaviour. They could
> also implement there the flushing if they can, for example. Others would
> get a disconnect, but they probably effectively do already?
> 

I'll give (2) a shot, then.

> Obviously if SW crypto is used none of this is a concern, so that's
> another factor to take into account in this decision logic of whether to
> disconnect or not?

I did not check the SW crypto code but I'm pretty sure that indeed works
with the current code.
I assume the packets will only be decrypted after an RX aggregation is
completed. That would filter out all packets send with the old key,
since they simply can't be decrypted any more.

Shall we bypass stopping aggregation sessions if we are on SW crypto?
We'll again lose the benefit that we prevent a broken remote STA to try
a TX Agg. (I tend to still stop them, but do not have a real opinion
here...)

> 
> IOW - I'd rather get bugs that we now force a disconnect (if anyone even
> notices), rather than potentially having a bug that causes unencrypted
> packets to be sent.

Any suggestions how to trick hostap/wpa_supplicant into dropping the
connection? For me it looks like we can just report an error on key
install and expect the wpa_supplicant/hostapd to handle that.

Will try that for the next version of the patch, with the other
discussed improvements: If driver is not signalling "PTK0 rekey support"
we'll simply not accept key installs when we already have a old PTK key
for the connection.

Alexander

Johannes Berg July 3, 2018, 9:51 a.m. UTC | #7

On Fri, 2018-06-29 at 23:14 +0200, Alexander Wetzel wrote:

> I was wrong here, this is not an issue. Tkip is simply dropping frames
> when the IV is too small and never verifies the MIC. And since only MIC
> errors can trigger counter measures we are fine as it is...

Err, yes, of course. My bad.

> > > 3) Is what I implemented. We try what we can with the existing API and
> > > any driver not working with that has to be considered buggy.
> > 
> > I don't think we can really do this though. We break - in a potentially
> > security-relevant way - the older drivers. We can't just say that's
> > driver bugs, IMHO.
> 
> We would not break it, only not fix it for all drivers. The current code
> is already leaking cleartext packets for at least ath9k and most likely
> many others when the PTK is rekeyed. The patch would improve that, but
> due to more working rekeys it could leak more packets in specific
> scenarios, I assume...

Yeah, ok, fair point. I don't really know.

> > Obviously if SW crypto is used none of this is a concern, so that's
> > another factor to take into account in this decision logic of whether to
> > disconnect or not?
> 
> I did not check the SW crypto code but I'm pretty sure that indeed works
> with the current code.

Me too.

> I assume the packets will only be decrypted after an RX aggregation is
> completed. That would filter out all packets send with the old key,
> since they simply can't be decrypted any more.

Oh, good point, but that's true - reordering happens before software
decryption.

> Shall we bypass stopping aggregation sessions if we are on SW crypto?
> We'll again lose the benefit that we prevent a broken remote STA to try
> a TX Agg. (I tend to still stop them, but do not have a real opinion
> here...)

I don't really know.

> > IOW - I'd rather get bugs that we now force a disconnect (if anyone even
> > notices), rather than potentially having a bug that causes unencrypted
> > packets to be sent.
> 
> Any suggestions how to trick hostap/wpa_supplicant into dropping the
> connection? For me it looks like we can just report an error on key
> install and expect the wpa_supplicant/hostapd to handle that.

I think easier would be to just disconnect ourselves? At least if we're
in managed mode...

> Will try that for the next version of the patch, with the other
> discussed improvements: If driver is not signalling "PTK0 rekey support"
> we'll simply not accept key installs when we already have a old PTK key
> for the connection.

Since I see you have a new patch - how did that work out? :)

johannes

Alexander Wetzel July 3, 2018, 7:54 p.m. UTC | #8

Hello,


> 
>> I was wrong here, this is not an issue. Tkip is simply dropping frames
>> when the IV is too small and never verifies the MIC. And since only MIC
>> errors can trigger counter measures we are fine as it is...
> 
> Err, yes, of course. My bad.
> 
>>>> 3) Is what I implemented. We try what we can with the existing API and
>>>> any driver not working with that has to be considered buggy.
>>>
>>> I don't think we can really do this though. We break - in a potentially
>>> security-relevant way - the older drivers. We can't just say that's
>>> driver bugs, IMHO.
>>
>> We would not break it, only not fix it for all drivers. The current code
>> is already leaking cleartext packets for at least ath9k and most likely
>> many others when the PTK is rekeyed. The patch would improve that, but
>> due to more working rekeys it could leak more packets in specific
>> scenarios, I assume...
> 
> Yeah, ok, fair point. I don't really know.
> 
>>> Obviously if SW crypto is used none of this is a concern, so that's
>>> another factor to take into account in this decision logic of whether to
>>> disconnect or not?
>>
>> I did not check the SW crypto code but I'm pretty sure that indeed works
>> with the current code.
> 
> Me too.
> 
>> I assume the packets will only be decrypted after an RX aggregation is
>> completed. That would filter out all packets send with the old key,
>> since they simply can't be decrypted any more.
> 
> Oh, good point, but that's true - reordering happens before software
> decryption.
> 
>> Shall we bypass stopping aggregation sessions if we are on SW crypto?
>> We'll again lose the benefit that we prevent a broken remote STA to try
>> a TX Agg. (I tend to still stop them, but do not have a real opinion
>> here...)
> 
> I don't really know.
> 
>>> IOW - I'd rather get bugs that we now force a disconnect (if anyone even
>>> notices), rather than potentially having a bug that causes unencrypted
>>> packets to be sent.
>>
>> Any suggestions how to trick hostap/wpa_supplicant into dropping the
>> connection? For me it looks like we can just report an error on key
>> install and expect the wpa_supplicant/hostapd to handle that.
> 
> I think easier would be to just disconnect ourselves? At least if we're
> in managed mode...
>
I still have much to learn about 802.11, but so far I did not see way to
 directly disconnect a STA. (Maybe spoofing a "signal lost" event or
something like that, but I fear complications by losing the sync with
the remote STA.) Is there any call/signal you have in mind I could test?

hostapd or wpa_supplicant are "ordering" mac80211 to install a new key
and are implementing the state machine and are in a good position to
handle the fallout... at least theoretically.


>> Will try that for the next version of the patch, with the other
>> discussed improvements: If driver is not signalling "PTK0 rekey support"
>> we'll simply not accept key installs when we already have a old PTK key
>> for the connection.
> 
> Since I see you have a new patch - how did that work out? :)

So far I've only tested it with iwlwifi client (dvm, so no AP) on my
normal "desktop".

When implementing replace_key it seems to work fine. (No OTA captures
done, just connection tests with the AP running still using the previous
patch.)

Using a unpatched iwlwifi does not reconnect automatically as expected.

Instead I get a pop up asking for the PSK. Entering it reconnects
normally. Cancel the prompt disconnect till a manual reconnect.
I suspect NetworkManager is handling the rekey like the initial key
install and then assumes the PSK is wrong. Hardly surprising but also
highly visible to the users.
But then only to those using the now broken rekey...


Using wpa_supplicant directly reconnects after ~15s.
It also assumes the key is wrong and seems to rate limit the connection
attempts. Here a log with wpa_supplicat running in the console and dmesg
-wT output on top of that:

wlp3s0: WPA: Failed to set PTK to the driver (alg=3 keylen=16
bssid=12:6f:3f:0e:33:3c)
[Tue Jul  3 21:13:17 2018] wlp3s0: Driver is not supporting save PTK key
replacement. Insecure rekey attempt for STA 12:6f:3f:0e:33:3c denied.
[Tue Jul  3 21:13:17 2018] wlp3s0: deauthenticating from
12:6f:3f:0e:33:3c by local choice (Reason: 1=UNSPECIFIED)
wlp3s0: CTRL-EVENT-DISCONNECTED bssid=12:6f:3f:0e:33:3c reason=1
locally_generated=1
wlp3s0: WPA: 4-Way Handshake failed - pre-shared key may be incorrect
wlp3s0: CTRL-EVENT-SSID-TEMP-DISABLED id=0 ssid="mordor-g"
auth_failures=1 duration=10 reason=WRONG_KEY
wlp3s0: CTRL-EVENT-REGDOM-CHANGE init=CORE type=WORLD
[Tue Jul  3 21:13:17 2018] wlp3s0: delba from 12:6f:3f:0e:33:3c
(initiator) tid 0 reason code 37
[Tue Jul  3 21:13:17 2018] wlp3s0: Rx BA session stop requested for
12:6f:3f:0e:33:3c tid 0 initiator reason: 0
[Tue Jul  3 21:13:17 2018] wlp3s0: moving STA 12:6f:3f:0e:33:3c to state 3
[Tue Jul  3 21:13:17 2018] wlp3s0: moving STA 12:6f:3f:0e:33:3c to state 2
[Tue Jul  3 21:13:17 2018] wlp3s0: moving STA 12:6f:3f:0e:33:3c to state 1
[Tue Jul  3 21:13:17 2018] wlp3s0: Removed STA 12:6f:3f:0e:33:3c
[Tue Jul  3 21:13:17 2018] wlp3s0: Destroyed STA 12:6f:3f:0e:33:3c


wlp3s0: CTRL-EVENT-SSID-REENABLED id=0 ssid="mordor-g"
wlp3s0: SME: Trying to authenticate with 12:6f:3f:0e:33:3c
(SSID='mordor-g' freq=2432 MHz)
wlp3s0: Trying to associate with 12:6f:3f:0e:33:3c (SSID='mordor-g'
freq=2432 MHz)
[Tue Jul  3 21:13:31 2018] wlp3s0: authenticate with 12:6f:3f:0e:33:3c
[Tue Jul  3 21:13:31 2018] wlp3s0: Allocated STA 12:6f:3f:0e:33:3c
[Tue Jul  3 21:13:31 2018] wlp3s0: Inserted STA 12:6f:3f:0e:33:3c
[Tue Jul  3 21:13:31 2018] wlp3s0: send auth to 12:6f:3f:0e:33:3c (try 1/3)
[Tue Jul  3 21:13:31 2018] wlp3s0: authenticated
[Tue Jul  3 21:13:31 2018] wlp3s0: moving STA 12:6f:3f:0e:33:3c to state 2
[Tue Jul  3 21:13:31 2018] wlp3s0: associate with 12:6f:3f:0e:33:3c (try
1/3)
wlp3s0: Associated with 12:6f:3f:0e:33:3c
wlp3s0: CTRL-EVENT-SUBNET-STATUS-UPDATE status=0
wlp3s0: CTRL-EVENT-REGDOM-CHANGE init=COUNTRY_IE type=COUNTRY alpha2=DE
[Tue Jul  3 21:13:31 2018] wlp3s0: RX AssocResp from 12:6f:3f:0e:33:3c
(capab=0x431 status=0 aid=1)
[Tue Jul  3 21:13:31 2018] wlp3s0: moving STA 12:6f:3f:0e:33:3c to state 3
[Tue Jul  3 21:13:31 2018] wlp3s0: WMM AC=0 acm=0 aifs=2 cWmin=3 cWmax=7
txop=47 uapsd=0, downgraded=0
[Tue Jul  3 21:13:31 2018] wlp3s0: WMM AC=1 acm=0 aifs=2 cWmin=7
cWmax=15 txop=94 uapsd=0, downgraded=0
[Tue Jul  3 21:13:31 2018] wlp3s0: WMM AC=2 acm=0 aifs=3 cWmin=15
cWmax=1023 txop=0 uapsd=0, downgraded=0
[Tue Jul  3 21:13:31 2018] wlp3s0: WMM AC=3 acm=0 aifs=7 cWmin=15
cWmax=1023 txop=0 uapsd=0, downgraded=0
[Tue Jul  3 21:13:31 2018] wlp3s0: associated
wlp3s0: WPA: Key negotiation completed with 12:6f:3f:0e:33:3c [PTK=CCMP
GTK=CCMP]
wlp3s0: CTRL-EVENT-CONNECTED - Connection to 12:6f:3f:0e:33:3c completed
[id=0 id_str=]
[Tue Jul  3 21:13:31 2018] wlp3s0: moving STA 12:6f:3f:0e:33:3c to state 4
[Tue Jul  3 21:13:31 2018] wlp3s0: AddBA Req buf_size=64 for
12:6f:3f:0e:33:3c
[Tue Jul  3 21:13:31 2018] wlp3s0: Rx A-MPDU request on
12:6f:3f:0e:33:3c tid 0 result 0

A test with the code on a AP is still pending. (I'll probably try that
on the weekend.)

Alexander

Johannes Berg July 4, 2018, 12:06 a.m. UTC | #9

On Tue, 2018-07-03 at 21:54 +0200, Alexander Wetzel wrote:

> > I think easier would be to just disconnect ourselves? At least if we're
> > in managed mode...
> > 
> 
> I still have much to learn about 802.11, but so far I did not see way to
>  directly disconnect a STA. (Maybe spoofing a "signal lost" event or
> something like that, but I fear complications by losing the sync with
> the remote STA.) Is there any call/signal you have in mind I could test?

ieee80211_set_disassoc(), this can also send a frame out to indicate to
the AP that we're disconnecting instead.

> hostapd or wpa_supplicant are "ordering" mac80211 to install a new key
> and are implementing the state machine and are in a good position to
> handle the fallout... at least theoretically.

Ideally it would even know beforehand that we don't want to handle the
PTK rekeying, and then could reconnect instead of going through the
handshake.

> Instead I get a pop up asking for the PSK. Entering it reconnects
> normally. Cancel the prompt disconnect till a manual reconnect.
> I suspect NetworkManager is handling the rekey like the initial key
> install and then assumes the PSK is wrong. Hardly surprising but also
> highly visible to the users.

That's pretty awkward.

> But then only to those using the now broken rekey...

Yeah, but you don't necessarily control that - i.e. client device and AP
might have different owners :-)

> Using wpa_supplicant directly reconnects after ~15s.
> It also assumes the key is wrong and seems to rate limit the connection
> attempts. Here a log with wpa_supplicat running in the console and dmesg
> -wT output on top of that:

So I think we're probably better off accepting the set_key but not
actually using it, and instead disconnecting... even if that's awkward
and should come with a big comment :-)

johannes

Alexander Wetzel July 8, 2018, 8:10 a.m. UTC | #10

>>> I think easier would be to just disconnect ourselves? At least if we're
>>> in managed mode...
>>>
>>
>> I still have much to learn about 802.11, but so far I did not see way to
>>  directly disconnect a STA. (Maybe spoofing a "signal lost" event or
>> something like that, but I fear complications by losing the sync with
>> the remote STA.) Is there any call/signal you have in mind I could test?
> 
> ieee80211_set_disassoc(), this can also send a frame out to indicate to
> the AP that we're disconnecting instead.
>

I'll try that, but will probably take another week. My main main work
station got severe file system corruption, forcing me to reinstall it
from scratch. I suspect it was something in the wireless testing kernel
4.18-rc1 (944ae08d4e71) related to either btrfs or the ssd disk but
since I needed the system just started over and avoid to run 4.18 if I
do not have a full backup...

>> hostapd or wpa_supplicant are "ordering" mac80211 to install a new key
>> and are implementing the state machine and are in a good position to
>> handle the fallout... at least theoretically.
> 
> Ideally it would even know beforehand that we don't want to handle the
> PTK rekeying, and then could reconnect instead of going through the
> handshake.
> 

Don't see how we could do that in the kernel, all the relevant
information is handled in the state machine. I guess an API extension
telling hostap/supplicant if we can handle rekeys or not would tbe he
only way to avoid that.

>> Instead I get a pop up asking for the PSK. Entering it reconnects
>> normally. Cancel the prompt disconnect till a manual reconnect.
>> I suspect NetworkManager is handling the rekey like the initial key
>> install and then assumes the PSK is wrong. Hardly surprising but also
>> highly visible to the users.
> 
> That's pretty awkward.
> 
>> But then only to those using the now broken rekey...
> 
> Yeah, but you don't necessarily control that - i.e. client device and AP
> might have different owners :-)
> 
>> Using wpa_supplicant directly reconnects after ~15s.
>> It also assumes the key is wrong and seems to rate limit the connection
>> attempts. Here a log with wpa_supplicat running in the console and dmesg
>> -wT output on top of that:
> 
> So I think we're probably better off accepting the set_key but not
> actually using it, and instead disconnecting... even if that's awkward
> and should come with a big comment :-)

Instead of returning an error I'll change the code to accept the rekey
but do nothing with it. (Basically delete the new key and keep the old
active).
And of course calling ieee80211_set_disassoc() prior to return "success".

Let's see how the supplicant will react on a disassoc while doing a rekey...

Alexander

Johannes Berg July 9, 2018, 7:13 a.m. UTC | #11

Hi,

> I'll try that, but will probably take another week. My main main work
> station got severe file system corruption, forcing me to reinstall it
> from scratch. I suspect it was something in the wireless testing kernel
> 4.18-rc1 (944ae08d4e71) related to either btrfs or the ssd disk but
> since I needed the system just started over and avoid to run 4.18 if I
> do not have a full backup...

Ouch. FWIW, it's possible to run inside a VM with PCI(e) devices
outside, at least on some machines.

> > > hostapd or wpa_supplicant are "ordering" mac80211 to install a new key
> > > and are implementing the state machine and are in a good position to
> > > handle the fallout... at least theoretically.
> > 
> > Ideally it would even know beforehand that we don't want to handle the
> > PTK rekeying, and then could reconnect instead of going through the
> > handshake.
> > 
> 
> Don't see how we could do that in the kernel, all the relevant
> information is handled in the state machine. I guess an API extension
> telling hostap/supplicant if we can handle rekeys or not would tbe he
> only way to avoid that.

Right. Not really much point for now I guess.

> > So I think we're probably better off accepting the set_key but not
> > actually using it, and instead disconnecting... even if that's awkward
> > and should come with a big comment :-)
> 
> Instead of returning an error I'll change the code to accept the rekey
> but do nothing with it. (Basically delete the new key and keep the old
> active).
> And of course calling ieee80211_set_disassoc() prior to return "success".

Right. Did you handle/consider modes other than BSS/P2P client btw?

> Let's see how the supplicant will react on a disassoc while doing a rekey...

Shouldn't matter to it, I'd think.

johannes

Denis Kenzior July 10, 2018, 9:05 p.m. UTC | #12

Hi Alexander,

>>> hostapd or wpa_supplicant are "ordering" mac80211 to install a new key
>>> and are implementing the state machine and are in a good position to
>>> handle the fallout... at least theoretically.
>>
>> Ideally it would even know beforehand that we don't want to handle the
>> PTK rekeying, and then could reconnect instead of going through the
>> handshake.
>>
> 
> Don't see how we could do that in the kernel, all the relevant
> information is handled in the state machine. I guess an API extension
> telling hostap/supplicant if we can handle rekeys or not would tbe he
> only way to avoid that.
> 

Can the kernel / driver provide some sort of hint to user space that PTK 
rekey isn’t supported?  We could then have user space deauthenticate 
with a big warning about what/why this is happening and try to 
re-connect to the last used BSS.

>> So I think we're probably better off accepting the set_key but not
>> actually using it, and instead disconnecting... even if that's awkward
>> and should come with a big comment :-)
> 
> Instead of returning an error I'll change the code to accept the rekey
> but do nothing with it. (Basically delete the new key and keep the old
> active).
> And of course calling ieee80211_set_disassoc() prior to return "success".
> 
> Let's see how the supplicant will react on a disassoc while doing a rekey...
> 

This sounds pretty awful actually.  Now that wpa_s is not the only game 
in town, can we stop resorting to these tactics?

Regards,
-Denis

Alexander Wetzel July 11, 2018, 4:59 p.m. UTC | #13

Hi,


>> I'll try that, but will probably take another week. My main main work
>> station got severe file system corruption, forcing me to reinstall it
>> from scratch. I suspect it was something in the wireless testing kernel
>> 4.18-rc1 (944ae08d4e71) related to either btrfs or the ssd disk but
>> since I needed the system just started over and avoid to run 4.18 if I
>> do not have a full backup...
> 
> Ouch. FWIW, it's possible to run inside a VM with PCI(e) devices
> outside, at least on some machines.
> 
>>>> hostapd or wpa_supplicant are "ordering" mac80211 to install a new key
>>>> and are implementing the state machine and are in a good position to
>>>> handle the fallout... at least theoretically.
>>>
>>> Ideally it would even know beforehand that we don't want to handle the
>>> PTK rekeying, and then could reconnect instead of going through the
>>> handshake.
>>>
>>
>> Don't see how we could do that in the kernel, all the relevant
>> information is handled in the state machine. I guess an API extension
>> telling hostap/supplicant if we can handle rekeys or not would tbe he
>> only way to avoid that.
> 
> Right. Not really much point for now I guess.
> 
>>> So I think we're probably better off accepting the set_key but not
>>> actually using it, and instead disconnecting... even if that's awkward
>>> and should come with a big comment :-)
>>
>> Instead of returning an error I'll change the code to accept the rekey
>> but do nothing with it. (Basically delete the new key and keep the old
>> active).
>> And of course calling ieee80211_set_disassoc() prior to return "success".
> 
> Right. Did you handle/consider modes other than BSS/P2P client btw?

I still have huge gaps in my wlan knowledge, I only drilled into the
rekey issue so far and never used anything besides AP/STA...

That said we only should have mesh and IBSS left to consider, right?
(I think I remember some new mode allowing STAs to diretly talk to each
other while beeing associated to an AP, but can't find it again and
don't see how this could work with PTK keys without somehow making a
mesh out of it.)

So far I have avoided non-generic changes. The current patch should work
with all modes, shouldn't it?. My initial impression of
ieee80211_set_disassoc() was much the same, but you seem to imply it's
not usable in some modes? That would again be awkward...

As for IBSS: I have no idea how a mesh would handle rekeys. If it can
but won't work with ieee80211_set_disassoc() it will be quite some time
till I've can propose a new patch.
I normally only have a few hours per week for the forseeable future, and
some weeks not even those...

On the plus side i got my hands on a AP using ath10k and can look at
that from yet another angle. I'll devinitelly continue here, but I
suspect I'll be slow with patches for a while...

> 
>> Let's see how the supplicant will react on a disassoc while doing a rekey...
> 
> Shouldn't matter to it, I'd think.

Alexander

Alexander Wetzel July 11, 2018, 5:08 p.m. UTC | #14

Hi Denis,

> Hi Alexander,
> 
>>>> hostapd or wpa_supplicant are "ordering" mac80211 to install a new key
>>>> and are implementing the state machine and are in a good position to
>>>> handle the fallout... at least theoretically.
>>>
>>> Ideally it would even know beforehand that we don't want to handle the
>>> PTK rekeying, and then could reconnect instead of going through the
>>> handshake.
>>>
>>
>> Don't see how we could do that in the kernel, all the relevant
>> information is handled in the state machine. I guess an API extension
>> telling hostap/supplicant if we can handle rekeys or not would tbe he
>> only way to avoid that.
>>
> 
> Can the kernel / driver provide some sort of hint to user space that PTK
> rekey isn’t supported?  We could then have user space deauthenticate
> with a big warning about what/why this is happening and try to
> re-connect to the last used BSS.
> 

Sure. In fact the latest patch is already doing that by returning an
error when set_key is called for PTK and it's not an initial call.
Tests with wpa_supplicant shows that this is is then handled like the
initial key set is failing. Networkmanager prompts for the password and
wpa_supplicant running without seems to blacklist a reconnect for 15s.

I kind of liked that solution, but with existing implematations out this
is indeed awkward to find a "correct" solution.

The main problem for me currently is to find a correct and still
acceptable solution. This turned from "let's fix this nasty wlan
connection freezes" to a projet spanning the complete wlan stack: From
hardware up to and including the userspace...

It's fun to learn how that interacts (if not very fast), I'm stuggling
finding the best way forward here. Whatever we do has undesired
consequences.
Maybe I'm missing something, but here the high level options we have in
my opinion:

1) Keep it as it is and solve that in a indefinite future when we and
the world implement the ieee802.11 2012 addition, to use key 0+1 for PTK
and 2+3 for GTK
- rekey has a extrem high probability of freezing connections and
leaking a few clear text packets for years (decades?) to come
+ The issue is fixed at the core

2) Make it worse, like some (most) Windows systems/cards seem to handle
it by encrypting EAPOL #4 with the NEW key, breaking the handshake and
forcing a reconnect.
- break something more to fix a problem sounds like a insane approach
+ This seems to be quite common and therefore well "tested" (based on my
very limited data on that)

3) Fix what we can in mac80211 but keep the API stable
- Without driver actions still many drivers will be "undefined" and even
if they are not freezing leak packets
+ This will reduce the problems to a fraction of what is is today with
only a mac80211 update

4) Redesign the mac80211 rekey handling and interaction with drivers to
only rekey if it is save and decline when not.
+ We only have to touch the kernel
- any supplicant (whatever runs the wpa state machine) may get errors
where the programmes did not expect them, leading to unexpected side
effects.

5) The full-stack solution: Update the API for the userpace
+ We do not have to "trick" the wpa state machine to disconnect, the
programmers of it have to code it.
- Well, it must be suppurted from the wpa state machine. If not we still
have to handle the rekey somehow or we accept freezes/cleartext leaks...

The last two solutions will also need some "fallback" when a secure
rekey is not possible and/or the user is runing an old state machine not
knowing about the new way...


>>> So I think we're probably better off accepting the set_key but not
>>> actually using it, and instead disconnecting... even if that's awkward
>>> and should come with a big comment :-)
>>
>> Instead of returning an error I'll change the code to accept the rekey
>> but do nothing with it. (Basically delete the new key and keep the old
>> active).
>> And of course calling ieee80211_set_disassoc() prior to return "success".
>>
>> Let's see how the supplicant will react on a disassoc while doing a
>> rekey...
>>
> 
> This sounds pretty awful actually.  Now that wpa_s is not the only game
> in town, can we stop resorting to these tactics?

Nothing of it is wpa_supplicat/hostap specifiy. Only my current "test"
environment is using it, simply due to the fact that I tracked the issue
down in that environment.
Everything besides ath9k as an AP running hostapd and a iwldvm card
running wpa_supplicant is mostly untested. And even there I have some
areas marked for follow up after we find a solution acceptable for the
kernel...

Do you have any other software you think I should add to my prelimitary
tests? If possible I'll happy to extend the test of the patches with those.

Alexander

Denis Kenzior July 11, 2018, 7:43 p.m. UTC | #15

Hi Alexander,

On 07/11/2018 12:08 PM, Alexander Wetzel wrote:
> Hi Denis,
> 
>> Hi Alexander,
>>
>>>>> hostapd or wpa_supplicant are "ordering" mac80211 to install a new key
>>>>> and are implementing the state machine and are in a good position to
>>>>> handle the fallout... at least theoretically.
>>>>
>>>> Ideally it would even know beforehand that we don't want to handle the
>>>> PTK rekeying, and then could reconnect instead of going through the
>>>> handshake.
>>>>
>>>
>>> Don't see how we could do that in the kernel, all the relevant
>>> information is handled in the state machine. I guess an API extension
>>> telling hostap/supplicant if we can handle rekeys or not would tbe he
>>> only way to avoid that.
>>>
>>
>> Can the kernel / driver provide some sort of hint to user space that PTK
>> rekey isn’t supported?  We could then have user space deauthenticate
>> with a big warning about what/why this is happening and try to
>> re-connect to the last used BSS.
>>
> 
> Sure. In fact the latest patch is already doing that by returning an
> error when set_key is called for PTK and it's not an initial call.
> Tests with wpa_supplicant shows that this is is then handled like the
> initial key set is failing. Networkmanager prompts for the password and
> wpa_supplicant running without seems to blacklist a reconnect for 15s.

Ideally we shouldn't even get this far.  We really need some kind of 
capability bit on the phy telling userspace whether PTK rekey is 
supported or not.  Then userspace can take proper action based on this 
information.  E.g. if PTK rekey isn't safe, then we can simply issue a 
CMD_DISCONNECT and re-connect to the last BSS.  The kernel doesn't need 
to play any 'tricks'.

The fact that current userspace implementations are broken is 
regrettable and needs to be fixed.

> 
> I kind of liked that solution, but with existing implematations out this
> is indeed awkward to find a "correct" solution.
> 
> The main problem for me currently is to find a correct and still
> acceptable solution. This turned from "let's fix this nasty wlan
> connection freezes" to a projet spanning the complete wlan stack: From
> hardware up to and including the userspace...

Right.  The problem is that this PTK rekey likely 'works' (for some 
definition thereof) in a vast majority of cases, e.g. the link isn't 
broken, so the user doesn't notice.  So, if the kernel starts to 
unilaterally issue disconnects, you will have a lot of grumbling users.

Just to clarify, I'm not arguing against this necessarily.  I can see 
why issuing a disconnect is a good idea for many reasons (e.g. security, 
etc.)  But, I would expect a lot of user backlash if this is done, and 
given that this has been an issue for many years, I wonder if its the 
right way of handling this?

> 
> It's fun to learn how that interacts (if not very fast), I'm stuggling
> finding the best way forward here. Whatever we do has undesired
> consequences.
> Maybe I'm missing something, but here the high level options we have in
> my opinion:
> 
> 1) Keep it as it is and solve that in a indefinite future when we and
> the world implement the ieee802.11 2012 addition, to use key 0+1 for PTK
> and 2+3 for GTK
> - rekey has a extrem high probability of freezing connections and
> leaking a few clear text packets for years (decades?) to come
> + The issue is fixed at the core

It would seem to me that 0+1 rekey is a separate issue that needs to be 
supported in both kernel and userspace anyway.

> 
> 2) Make it worse, like some (most) Windows systems/cards seem to handle
> it by encrypting EAPOL #4 with the NEW key, breaking the handshake and
> forcing a reconnect.
> - break something more to fix a problem sounds like a insane approach
> + This seems to be quite common and therefore well "tested" (based on my
> very limited data on that)

This seems awful.  And then if you're unlucky someone will come in and 
tell you that the kernel has to maintain this 'legacy' behavior forever. 
  So things can't ever be fixed.

Plus, as I already mentioned above, some users 'think' that PTK rekey 
already works just fine.

> 
> 3) Fix what we can in mac80211 but keep the API stable
> - Without driver actions still many drivers will be "undefined" and even
> if they are not freezing leak packets
> + This will reduce the problems to a fraction of what is is today with
> only a mac80211 update
> 
> 4) Redesign the mac80211 rekey handling and interaction with drivers to
> only rekey if it is save and decline when not.
> + We only have to touch the kernel
> - any supplicant (whatever runs the wpa state machine) may get errors
> where the programmes did not expect them, leading to unexpected side
> effects.
> 
> 5) The full-stack solution: Update the API for the userpace
> + We do not have to "trick" the wpa state machine to disconnect, the
> programmers of it have to code it.
> - Well, it must be suppurted from the wpa state machine. If not we still
> have to handle the rekey somehow or we accept freezes/cleartext leaks...
> 
> The last two solutions will also need some "fallback" when a secure
> rekey is not possible and/or the user is runing an old state machine not
> knowing about the new way...
> 

My vote is for something along the lines of 5.  I realize there are no 
good solutions here, but this really should be fixed by a combination of 
userspace and kernel working properly together.  If this isn't possible, 
then perhaps the whole rekey should be done in the kernel and userspace 
should be given no chance to make a bad decision.

> 
>>>> So I think we're probably better off accepting the set_key but not
>>>> actually using it, and instead disconnecting... even if that's awkward
>>>> and should come with a big comment :-)
>>>
>>> Instead of returning an error I'll change the code to accept the rekey
>>> but do nothing with it. (Basically delete the new key and keep the old
>>> active).
>>> And of course calling ieee80211_set_disassoc() prior to return "success".
>>>
>>> Let's see how the supplicant will react on a disassoc while doing a
>>> rekey...
>>>
>>
>> This sounds pretty awful actually.  Now that wpa_s is not the only game
>> in town, can we stop resorting to these tactics?
> 
> Nothing of it is wpa_supplicat/hostap specifiy. Only my current "test"
> environment is using it, simply due to the fact that I tracked the issue
> down in that environment.
> Everything besides ath9k as an AP running hostapd and a iwldvm card
> running wpa_supplicant is mostly untested. And even there I have some
> areas marked for follow up after we find a solution acceptable for the
> kernel...
> 
> Do you have any other software you think I should add to my prelimitary
> tests? If possible I'll happy to extend the test of the patches with those.
> 

You can try:
https://git.kernel.org/pub/scm/network/wireless/iwd.git/

We would be happy to implement whatever 'proper' userspace behavior / 
kernel api that this discussion leads to.

Regards,
-Denis

Alexander Wetzel July 15, 2018, 9:10 a.m. UTC | #16

Hi,


>>> So I think we're probably better off accepting the set_key but not
>>> actually using it, and instead disconnecting... even if that's awkward
>>> and should come with a big comment :-)
>>
>> Instead of returning an error I'll change the code to accept the rekey
>> but do nothing with it. (Basically delete the new key and keep the old
>> active).
>> And of course calling ieee80211_set_disassoc() prior to return "success".
> 
> Right. Did you handle/consider modes other than BSS/P2P client btw?

I've tested that on a client only and it's already not working. Problem
is, that with ieee80211_set_disassoc() we just dump the association in
the kernel and are not informing the userspace, so the state machine is
stuck in "connected" but the STA is unable to communicate.

I also tried ieee80211_mgd_deauth():
While better this is basically the same behaviour as returning an error
on key replace. By setting the error code to inactivity at least
wpa_supplicant was actually starting to reconnect, unfortunatelly
set_key then failes since we purged the assosication in the kernel. (Or
that's my best interpretation from the logs). Networkmanager then again
prompted for the password...

I started experimenting with just signals to the userspace, but then
paused... Trying to control the state machine with spoofed errors trying
to trigger an "desireable" action can't be the way forward, can it?

Even if we get that working changes or different implementations in
userspace may well break it.

I now think we only have two reasonable ways forward:

1) The user friendly one: If the userspace does not know about the new
API just print a error message and do the insecure key replace. With
potential clear text packet leaks and connection freezes.

2) The secure way: Report an error on key install and accept that users
will encounter new problems when they are using rekeys with the old API
with "old" userspace software.

Since we have this issue in the kernel at least as long as we have
mac80211 I would vote for 1) here. Fix mac80211 and add a new way for
the userspace to to securely replace the keys and detect when this is
not possible. Then get the userspace software updated to act
accordingly, finally preventing the clear text packet leaks.

After some time we can then decide to reject insecure rekeys, burning
only those who use kernels much newer than the userspace.

Does this sound like an reasonable approch or should I go back figuring
out how to trick the userspace to reconnect without user
notification/intervention?

Alexander

[v2] mac80211: Fix wlan freezes under load at rekey

Commit Message

Comments

Patch