diff mbox

AR9462 problems connecting again..

Message ID 20150223224305.GA30228@w1.fi
State Not Applicable
Headers show

Commit Message

Jouni Malinen Feb. 23, 2015, 10:43 p.m. UTC
On Mon, Feb 23, 2015 at 02:22:32PM -0800, Adrian Chadd wrote:
> On 23 February 2015 at 13:53, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > So the theory that the driver starts at too high a transmit rate, and
> > then does not handle failures well, might be true. Of course, "not
> > handle failures well" is something of an understatement.

> Hm, can we just hack mac80211/ath9k to set /all/ EAPOL frames to the
> lowest negotiated basic rate and test? That way we don't have to worry
> about things resetting fixed rates or whatnot.

This did not get exactly supportive response when this was proposed last
time (Sep 2013). Anyway, for a quick test, this can be done with the
following one-liner:



Even I think that this goes a bit too far especially on 2.4 GHz band,
but I would actually consider limiting EAPOL frames to using non-HT/VHT
(e.g., 6 Mbps OFDM at least for EAPOL-Key frames) to avoid some
interoperability issues. I would say that the current minstrel_ht
behavior is somewhat excessive for EAPOL-Key frames in the other
direction. Using MCS 14 with fallback to something like MCS 5 for the
second Data frame after an association can certainly fail.

Number of Linux drivers do already limit EAPOL frame TX rate, so this
specific item is mainly applicable only to driver that use minstrel from
mac80211 (e.g., ath9k). Though, that IEEE80211_TX_CTL_USE_MINRATE would
likely affect most mac80211 drivers.

Comments

Linus Torvalds Feb. 23, 2015, 11 p.m. UTC | #1
On Mon, Feb 23, 2015 at 2:43 PM, Jouni Malinen <j@w1.fi> wrote:
>
> This did not get exactly supportive response when this was proposed last
> time (Sep 2013). Anyway, for a quick test, this can be done with the
> following one-liner:

fwiw, that one-liner seems to work fine for me.

Which I guess is not a huge surprise.

Side note: I've done the "turn off wifi and turn it back on" several
times to test that patch, and it has worked every time. BUT I also see
this odd behavior where the logs show that it tries to authenticate
twice: the first time it does that "send auth to 20:9f .." thing three
times (looks like 100ms apart), and nothing happens so it does
"authentication with 20:9f .. timed out". Then it waits three seconds
and tries again, and now it succeeds on the first try.

The only downside of that seems to be that it takes an extra 3s to
connect to the network - but it does now seem to *reliably* connect -
so it's not a big problem, but I wonder why it should be that
repeatable. Is there some difference between the first and the second
time it tries to authenticate?

Anyway, even if people don't like that particular patch, it does seem
like *something* like that should be done.

                         Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jouni Malinen Feb. 23, 2015, 11:13 p.m. UTC | #2
On Mon, Feb 23, 2015 at 03:00:39PM -0800, Linus Torvalds wrote:
> On Mon, Feb 23, 2015 at 2:43 PM, Jouni Malinen <j@w1.fi> wrote:
> >
> > This did not get exactly supportive response when this was proposed last
> > time (Sep 2013). Anyway, for a quick test, this can be done with the
> > following one-liner:
> 
> fwiw, that one-liner seems to work fine for me.

Good to know. That seems to confirm that the issue is most likely in
some of the higher HT rates not working with that AP for some reason at
least for EAPOL frames immediately following the association. This
should really have worked since I'm pretty sure minstrel will add a
lower MCS index as a fallback rate, but maybe there are some interop
issues or just close enough to being below the required signal strength
(though, I'd find that somewhat unlikely unless there is something
messed up with the antennas etc. taken into account your description on
how close the devices are and what's between them).

> Side note: I've done the "turn off wifi and turn it back on" several
> times to test that patch, and it has worked every time. BUT I also see
> this odd behavior where the logs show that it tries to authenticate
> twice: the first time it does that "send auth to 20:9f .." thing three
> times (looks like 100ms apart), and nothing happens so it does
> "authentication with 20:9f .. timed out". Then it waits three seconds
> and tries again, and now it succeeds on the first try.
> 
> The only downside of that seems to be that it takes an extra 3s to
> connect to the network - but it does now seem to *reliably* connect -
> so it's not a big problem, but I wonder why it should be that
> repeatable. Is there some difference between the first and the second
> time it tries to authenticate?

There have been some issues in this area in the past, but it is a bit
difficult to say what could have caused it here without a wireless
capture from the operating channel of the AP. It is possible that those
Authentication frames were not actually transmitted due to some issue or
they could have even been sent out on incorrect channel, etc. That
shouldn't really happen that frequently (i.e., others should be seeing
and reporting this as well..), but it's possible something gets messed
up in channel configuration.

> Anyway, even if people don't like that particular patch, it does seem
> like *something* like that should be done.

Agreed. Just need to figure out a suitable compromise somewhere in the
middle of the minimum and close-to-maximum rates. As far as ath9k is
concerned, it would actually support four different rates for retry
series where minstrel uses only three, so it would be easy even for that
driver on its own to add a lower rate at the end if we cannot find
consensus on something more generic in mac80211. That said, I'd rather
see minstrel getting more conservative for EAPOL frames.
Sujith Manoharan Feb. 24, 2015, 12:29 a.m. UTC | #3
Jouni Malinen wrote:
> Even I think that this goes a bit too far especially on 2.4 GHz band,
> but I would actually consider limiting EAPOL frames to using non-HT/VHT
> (e.g., 6 Mbps OFDM at least for EAPOL-Key frames) to avoid some
> interoperability issues. I would say that the current minstrel_ht
> behavior is somewhat excessive for EAPOL-Key frames in the other
> direction. Using MCS 14 with fallback to something like MCS 5 for the
> second Data frame after an association can certainly fail.
> 
> Number of Linux drivers do already limit EAPOL frame TX rate, so this
> specific item is mainly applicable only to driver that use minstrel from
> mac80211 (e.g., ath9k). Though, that IEEE80211_TX_CTL_USE_MINRATE would
> likely affect most mac80211 drivers.

We had this in the ath9k RC, where EAPOL frames were sent out at min-rate
on the VO queue. I think instead of re-introducing it in the driver,
having it in minstrel is cleaner.

Sujith
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andrew McGregor Feb. 24, 2015, 2:29 a.m. UTC | #4
(sorry to send HTML the first time... oops)

Over the weekend I found a bug in minstrel-ht that might well be
implicated here.

The last retransmit rate is meant to be a 'get the packet there
reliably' rate; minstrel-ht doesn't do that right, and can pick a
fairly flaky rate instead.

Can't generate a proper patch right now, so this diff might not apply
cleanly, but the fix is simply to change 75 to 99 in the two places
below:

@@ -437,7 +448,7 @@ minstrel_ht_set_best_prob_rate(struct mi
     (max_tp_group != MINSTREL_CCK_GROUP))
  return;

- if (mrs->prob_ewma > MINSTREL_FRAC(75, 100)) {
+ if (mrs->prob_ewma > MINSTREL_FRAC(99, 100)) {
  cur_tp_avg = minstrel_ht_get_tp_avg(mi, cur_group, cur_idx);
  if (cur_tp_avg > tmp_tp_avg)
  mi->max_prob_rate = index;
@@ -526,7 +537,7 @@ minstrel_ht_prob_rate_reduce_streams(str
  * Rules for rate selection:
  *  - max_prob_rate must use only one stream, as a tradeoff between delivery
  *    probability and throughput during strong fluctuations
- *  - as long as the max prob rate has a probability of more than 75%, pick
+ *  - as long as the max prob rate has a probability of more than 99%, pick
  *    higher throughput rates, even if the probablity is a bit lower
  */
 static void

On Tue, Feb 24, 2015 at 11:29 AM, Sujith Manoharan <sujith@msujith.org> wrote:

> Jouni Malinen wrote:
>
> > Even I think that this goes a bit too far especially on 2.4 GHz band,
>
> > but I would actually consider limiting EAPOL frames to using non-HT/VHT
>
> > (e.g., 6 Mbps OFDM at least for EAPOL-Key frames) to avoid some
>
> > interoperability issues. I would say that the current minstrel_ht
>
> > behavior is somewhat excessive for EAPOL-Key frames in the other
>
> > direction. Using MCS 14 with fallback to something like MCS 5 for the
>
> > second Data frame after an association can certainly fail.
>
> >
>
> > Number of Linux drivers do already limit EAPOL frame TX rate, so this
>
> > specific item is mainly applicable only to driver that use minstrel from
>
> > mac80211 (e.g., ath9k). Though, that IEEE80211_TX_CTL_USE_MINRATE would
>
> > likely affect most mac80211 drivers.
>
>
> We had this in the ath9k RC, where EAPOL frames were sent out at min-rate
>
> on the VO queue. I think instead of re-introducing it in the driver,
>
> having it in minstrel is cleaner.
>
>
> Sujith
>
> --
>
> To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
>
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jouni Malinen Feb. 24, 2015, 10:26 a.m. UTC | #5
On Tue, Feb 24, 2015 at 01:29:27PM +1100, Andrew McGregor wrote:
> Over the weekend I found a bug in minstrel-ht that might well be
> implicated here.
> 
> The last retransmit rate is meant to be a 'get the packet there
> reliably' rate; minstrel-ht doesn't do that right, and can pick a
> fairly flaky rate instead.
> 
> Can't generate a proper patch right now, so this diff might not apply
> cleanly, but the fix is simply to change 75 to 99 in the two places
> below:

While this may indeed be helpful, I don't think it is sufficient for
this EAPOL frame related issue. What I would like to see is minstrel_ht
using a basic rate (something non-HT) at the end of the retry series for
EAPOL frames.

The current behavior looks very suspicious to me. The early EAPOL frames
after association are being used to probe for higher rates. This results
in the total number of retry attempts actually getting smaller than any
other frame, i.e., minstrel_ht seems to be using significantly _less_
robust choices for the EAPOL frames than following "normal" data frames!
This should really be the other way around..

As an example, I'm seeing this on 5 GHz band (with the 75 to 99 change
in place, but behavior was more or less identical without it):
- the first EAPOL frame (msg 2/4) getting one attempt at MCS 3, 2
  attempts at MCS 0, 2 attempts at MCS 0 (yes, identical to the previous
  one) with total maximum of 5 attempts
- the second EAPOL frame (msg 4/4) getting one attempt at MCS 9, 2
  attempts at MCS 0, 2 attempts at MCS 0 with total maximum of 5
  attempts
- another data frame after this: 5 attempts at MCS 9, 5 attempts at MCS
  3, 5 attempts at MCS 3 with total maximum of 15 attempts(!!)

This cannot be the best approach here.. For the
IEEE80211_TX_CTRL_PORT_CTRL_PROTO cases, there are identified issues
where failing to deliver the frame results is significant issues either
in getting connected in the first place or getting disconnected if
rekeying fails. 

I'm not sure how this would be implemented cleanly in minstrel_ht or
whether that is even the best place (i.e., rate.c could do this
instead), but I'd like that third attempt for control port cases to be
dropped to use a (lowish) basic rate and non-MCS at that since there may
be some interop issues with HT MCS early during association.
Alternatively with drivers like ath9k that support 4 rate values, it
would also be fine to add this basic rate attempt (or well, I'd have
multiple, say 4, such attempts) as an additional 4th entry which does
not currently seem to get used with minstrel at all.

The "(lowish) basic rate" here could be defined as 6 Mbps OFDM for 5 GHz
band and either that or maybe even 2 Mbps or 5.5 Mbps on 2.4 GHz (if
included by the AP in basic rate set).
Dave Taht Feb. 24, 2015, 4:58 p.m. UTC | #6
On Tue, Feb 24, 2015 at 2:26 AM, Jouni Malinen <j@w1.fi> wrote:
> On Tue, Feb 24, 2015 at 01:29:27PM +1100, Andrew McGregor wrote:
>> Over the weekend I found a bug in minstrel-ht that might well be
>> implicated here.
>>
>> The last retransmit rate is meant to be a 'get the packet there
>> reliably' rate; minstrel-ht doesn't do that right, and can pick a
>> fairly flaky rate instead.
>>
>> Can't generate a proper patch right now, so this diff might not apply
>> cleanly, but the fix is simply to change 75 to 99 in the two places
>> below:
>
> While this may indeed be helpful, I don't think it is sufficient for
> this EAPOL frame related issue. What I would like to see is minstrel_ht
> using a basic rate (something non-HT) at the end of the retry series for
> EAPOL frames.
>
> The current behavior looks very suspicious to me. The early EAPOL frames
> after association are being used to probe for higher rates. This results
> in the total number of retry attempts actually getting smaller than any
> other frame, i.e., minstrel_ht seems to be using significantly _less_
> robust choices for the EAPOL frames than following "normal" data frames!
> This should really be the other way around..
>
> As an example, I'm seeing this on 5 GHz band (with the 75 to 99 change
> in place, but behavior was more or less identical without it):
> - the first EAPOL frame (msg 2/4) getting one attempt at MCS 3, 2
>   attempts at MCS 0, 2 attempts at MCS 0 (yes, identical to the previous
>   one) with total maximum of 5 attempts
> - the second EAPOL frame (msg 4/4) getting one attempt at MCS 9, 2
>   attempts at MCS 0, 2 attempts at MCS 0 with total maximum of 5
>   attempts
> - another data frame after this: 5 attempts at MCS 9, 5 attempts at MCS
>   3, 5 attempts at MCS 3 with total maximum of 15 attempts(!!)

I would in general prefer that the excessive retries in the present
driver layers in wifi be
dramatically reduced, the packet dropped and the problem punted to
higher layers.

> This cannot be the best approach here..

Falling back faster to the lowest possible rate with minimum retries,
and then giving up sooner would be better. 15 attempts? jeeze....

> For the
> IEEE80211_TX_CTRL_PORT_CTRL_PROTO cases, there are identified issues
> where failing to deliver the frame results is significant issues either
> in getting connected in the first place or getting disconnected if
> rekeying fails.
>
> I'm not sure how this would be implemented cleanly in minstrel_ht or
> whether that is even the best place (i.e., rate.c could do this
> instead), but I'd like that third attempt for control port cases to be
> dropped to use a (lowish) basic rate and non-MCS at that since there may
> be some interop issues with HT MCS early during association.
> Alternatively with drivers like ath9k that support 4 rate values, it
> would also be fine to add this basic rate attempt (or well, I'd have
> multiple, say 4, such attempts) as an additional 4th entry which does
> not currently seem to get used with minstrel at all.
>
> The "(lowish) basic rate" here could be defined as 6 Mbps OFDM for 5 GHz
> band and either that or maybe even 2 Mbps or 5.5 Mbps on 2.4 GHz (if
> included by the AP in basic rate set).
>
> --
> Jouni Malinen                                            PGP id EFC895FA
> _______________________________________________
> ath9k-devel mailing list
> ath9k-devel@lists.ath9k.org
> https://lists.ath9k.org/mailman/listinfo/ath9k-devel
Thomas Huehn Feb. 24, 2015, 5:54 p.m. UTC | #7
Hi Jouni,

Currently Minstrel_HT just skips EAPOL packets for its rate sampling on non-mrr chips by testing: (info->control.flags & IEEE80211_TX_CTRL_PORT_CTRL_PROTO)
On mrr hardware it uses them for probing. 
But the general MRR-chain should look like this for ath5k and ath9k chips that support 4 mrr chains:

mrr[0]:= max_tp_rate[0]
mrr[1]:= max_tp_rate[1]
mrr[2]:= max_prob_rate
mrr[3]:= basic_rate

So for Minstrel Sampling Packets as well as for data packets, the 4th mrr stage should use the slowest rate in case all other 3 mrr stages failed with their retry attempts.

I do see two possible options for control frames like EAPOL to be send out in a more robust fashion:
 - exclude those frames from AMPDU aggragates 
 - change their mrr setup to be more conservative

What do you think about this ?

Greetings Thomas


> Am 24.02.2015 um 11:26 schrieb Jouni Malinen <j@w1.fi>:
> 
> On Tue, Feb 24, 2015 at 01:29:27PM +1100, Andrew McGregor wrote:
>> Over the weekend I found a bug in minstrel-ht that might well be
>> implicated here.
>> 
>> The last retransmit rate is meant to be a 'get the packet there
>> reliably' rate; minstrel-ht doesn't do that right, and can pick a
>> fairly flaky rate instead.
>> 
>> Can't generate a proper patch right now, so this diff might not apply
>> cleanly, but the fix is simply to change 75 to 99 in the two places
>> below:
> 
> While this may indeed be helpful, I don't think it is sufficient for
> this EAPOL frame related issue. What I would like to see is minstrel_ht
> using a basic rate (something non-HT) at the end of the retry series for
> EAPOL frames.
> 
> The current behavior looks very suspicious to me. The early EAPOL frames
> after association are being used to probe for higher rates. This results
> in the total number of retry attempts actually getting smaller than any
> other frame, i.e., minstrel_ht seems to be using significantly _less_
> robust choices for the EAPOL frames than following "normal" data frames!
> This should really be the other way around..
> 
> As an example, I'm seeing this on 5 GHz band (with the 75 to 99 change
> in place, but behavior was more or less identical without it):
> - the first EAPOL frame (msg 2/4) getting one attempt at MCS 3, 2
>  attempts at MCS 0, 2 attempts at MCS 0 (yes, identical to the previous
>  one) with total maximum of 5 attempts
> - the second EAPOL frame (msg 4/4) getting one attempt at MCS 9, 2
>  attempts at MCS 0, 2 attempts at MCS 0 with total maximum of 5
>  attempts
> - another data frame after this: 5 attempts at MCS 9, 5 attempts at MCS
>  3, 5 attempts at MCS 3 with total maximum of 15 attempts(!!)
> 
> This cannot be the best approach here.. For the
> IEEE80211_TX_CTRL_PORT_CTRL_PROTO cases, there are identified issues
> where failing to deliver the frame results is significant issues either
> in getting connected in the first place or getting disconnected if
> rekeying fails. 
> 
> I'm not sure how this would be implemented cleanly in minstrel_ht or
> whether that is even the best place (i.e., rate.c could do this
> instead), but I'd like that third attempt for control port cases to be
> dropped to use a (lowish) basic rate and non-MCS at that since there may
> be some interop issues with HT MCS early during association.
> Alternatively with drivers like ath9k that support 4 rate values, it
> would also be fine to add this basic rate attempt (or well, I'd have
> multiple, say 4, such attempts) as an additional 4th entry which does
> not currently seem to get used with minstrel at all.
> 
> The "(lowish) basic rate" here could be defined as 6 Mbps OFDM for 5 GHz
> band and either that or maybe even 2 Mbps or 5.5 Mbps on 2.4 GHz (if
> included by the AP in basic rate set).
> 
> -- 
> Jouni Malinen                                            PGP id EFC895FA
> _______________________________________________
> ath9k-devel mailing list
> ath9k-devel@lists.ath9k.org
> https://lists.ath9k.org/mailman/listinfo/ath9k-devel

--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jouni Malinen Feb. 24, 2015, 6:14 p.m. UTC | #8
On Tue, Feb 24, 2015 at 06:54:47PM +0100, Thomas Hühn wrote:
> Currently Minstrel_HT just skips EAPOL packets for its rate sampling on non-mrr chips by testing: (info->control.flags & IEEE80211_TX_CTRL_PORT_CTRL_PROTO)

Yeah, I noticed that when going through the implementation, but it was
indeed only for cases other than ath9k-like drivers.

> On mrr hardware it uses them for probing. 
> But the general MRR-chain should look like this for ath5k and ath9k chips that support 4 mrr chains:
> 
> mrr[0]:= max_tp_rate[0]
> mrr[1]:= max_tp_rate[1]
> mrr[2]:= max_prob_rate
> mrr[3]:= basic_rate

Where is that mrr[3] part implemented? I did not find it when reviewing
the design (hw->max_rates >= 3 is used, but not >= 4) and this does not
match my experiments either when printing out all four values from
ath9k. In every single case I observed, the last entry was unused (idx =
-1) and only MCS values were used (i.e., not even a single case of basic
rate visible; basic rates being 6, 12, 24 Mbps OFDM in this specific
case with the AP that I used in the tests).

> So for Minstrel Sampling Packets as well as for data packets, the 4th mrr stage should use the slowest rate in case all other 3 mrr stages failed with their retry attempts.
> 
> I do see two possible options for control frames like EAPOL to be send out in a more robust fashion:
>  - exclude those frames from AMPDU aggragates 

ath9k does that for IEEE80211_TX_CTL_RATE_CTRL_PROBE which seemed to
get set for the initial EAPOL frames. I guess this could be done more
generically for all EAPOL frames.

>  - change their mrr setup to be more conservative

That mrr[3]:= basic_rate is the part I was really asking for as far as
EAPOL frames are concerned.
Thomas Huehn Feb. 24, 2015, 10:38 p.m. UTC | #9
Hi Jouni,

> Where is that mrr[3] part implemented? I did not find it when reviewing
> the design (hw->max_rates >= 3 is used, but not >= 4) and this does not
> match my experiments either when printing out all four values from
> ath9k. In every single case I observed, the last entry was unused (idx =
> -1) and only MCS values were used (i.e., not even a single case of basic
> rate visible; basic rates being 6, 12, 24 Mbps OFDM in this specific
> case with the AP that I used in the tests).

Minstrel_HT does only set mrr[0..2] and does not touch the fourth mrr[3], assumed chips with 4 mrr stages.
In function minstrel_ht_update_rates() the rate table struct ieee80211_sta_rates is set for the first 3 out of 4 rates and the fouth rate (mrr[3]) is „-1“.
In case of ath9k, the driver in xmit.c allocates in function ath_tx_fill_desc() a  struct ath_tx_info info by  memset(&info, 0, sizeof(info)) .. so all info->rates[xy].Rate == 0.
Than function ath_buf_set_rate() sets all info->rates[xy].Rate to those values specified in the rate table (struct ieee8021_sta_rate) IF (!rates[i].count || (rates[i].idx < 0)) …
… so info->rates[3].Rate is untouched and still  „0“.

I just added a printk() to xmit.c in function ath_tx_fill_desc() just before ath9k_hw_set_txdesc() is called.
From the log I get, e.g.:

[  169.543554] mrr setup: mrr[0]:133 mrr[1]:132 mrr[2]:134 mrr[3]:0 

I have not check yet if info->rates[3].Rate == 0 is the lowest possible rate… but I would guess so.


Greetings Thomas

--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Adrian Chadd Feb. 24, 2015, 10:50 p.m. UTC | #10
Hi,

I thought about doing this for rate probing with FreeBSD's sample rate
algorithm, but after actually having to use the damned thing in noisy
environments I realised that it just wasn't worth the effort to
optimise rate control selection whilst doing EAPOL frames.

If we did more useful software retransmission at lower rates then
sure, one could just be much more conservative about the rates and
retry a few times at a lower rate. However, since that isn't the case
and we only get three or so attempts at EAPOL exchange - once a second
- before an AP decides we're no good, I figured I'd just skip anything
other than the lowest basic rates for the frame exchange and make it
associate quicker.

I had the same problem with DHCP - initial DHCP leases were failing
because the rate control code started high/medium and didn't degrade
quickly enough before the next round of DHCP attempts. DHCP backs off
even more aggressively, so failing that initial DHCP was quite
expensive.



-adrian
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Felix Fietkau Feb. 25, 2015, 5 a.m. UTC | #11
On 2015-02-25 07:14, Jouni Malinen wrote:
> On Tue, Feb 24, 2015 at 06:54:47PM +0100, Thomas Hühn wrote:
>> Currently Minstrel_HT just skips EAPOL packets for its rate sampling on non-mrr chips by testing: (info->control.flags & IEEE80211_TX_CTRL_PORT_CTRL_PROTO)
> 
> Yeah, I noticed that when going through the implementation, but it was
> indeed only for cases other than ath9k-like drivers.
> 
>> On mrr hardware it uses them for probing. 
>> But the general MRR-chain should look like this for ath5k and ath9k chips that support 4 mrr chains:
>> 
>> mrr[0]:= max_tp_rate[0]
>> mrr[1]:= max_tp_rate[1]
>> mrr[2]:= max_prob_rate
>> mrr[3]:= basic_rate
> 
> Where is that mrr[3] part implemented? I did not find it when reviewing
> the design (hw->max_rates >= 3 is used, but not >= 4) and this does not
> match my experiments either when printing out all four values from
> ath9k. In every single case I observed, the last entry was unused (idx =
> -1) and only MCS values were used (i.e., not even a single case of basic
> rate visible; basic rates being 6, 12, 24 Mbps OFDM in this specific
> case with the AP that I used in the tests).
Minstrel_ht does *NOT* use mrr[3], nor should it. For normal data
packets, a little packet loss under tough conditions is good, otherwise
we risk lots of wasted airtime and bufferbloat.

>> So for Minstrel Sampling Packets as well as for data packets, the 4th mrr stage should use the slowest rate in case all other 3 mrr stages failed with their retry attempts.
>> 
>> I do see two possible options for control frames like EAPOL to be send out in a more robust fashion:
>>  - exclude those frames from AMPDU aggragates 
> 
> ath9k does that for IEEE80211_TX_CTL_RATE_CTRL_PROBE which seemed to
> get set for the initial EAPOL frames. I guess this could be done more
> generically for all EAPOL frames.
I agree.

>>  - change their mrr setup to be more conservative
> 
> That mrr[3]:= basic_rate is the part I was really asking for as far as
> EAPOL frames are concerned.
I don't think we need that. If we just exclude EAPOL from both probing
and aggregation, it should be safe. While it's connecting, that leaves
in low rates in the retry chain anyway.

If it still fails often enough to be noticeable under normal conditions,
there must be something seriously wrong outside of rate control, and we
should not paper over it with a crude band-aid workaround.

- Felix
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jouni Malinen Feb. 25, 2015, 2:47 p.m. UTC | #12
On Wed, Feb 25, 2015 at 06:00:08PM +1300, Felix Fietkau wrote:
> Minstrel_ht does *NOT* use mrr[3], nor should it. For normal data
> packets, a little packet loss under tough conditions is good, otherwise
> we risk lots of wasted airtime and bufferbloat.

I agree for normal data packets, but EAPOL frames are not really normal
data packet even though they happen to be transmitted as Data frames.
EAPOL frames are rare enough to not wast much airtime (and recovery from
an issue would anyway use way more airtime) and bufferbloat is
irrelevant for EAPOL frames. For EAPOL frames, little packet loss is not
good. Especially for EAPOL-Key msg 4/4, the only recovery option in many
cases is to reassociate with the AP and start from scratch.

> > That mrr[3]:= basic_rate is the part I was really asking for as far as
> > EAPOL frames are concerned.
> I don't think we need that. If we just exclude EAPOL from both probing
> and aggregation, it should be safe. While it's connecting, that leaves
> in low rates in the retry chain anyway.

Not low enough IMHO. EAPOL is a special case and needs to be addressed
as such. It is special for at least two reasons: being very early in the
association (well, the very _first_ Data frames using rate control) and
being critical for maintaining the connection (AP will disconnect if it
does not reply response).

What happens now is way too optimistic:
- one try at MCS 2 followed by four tries at MCS 0 for EAPOL-Key msg 2/4
- one try at MCS 9 (or so) followed by four tries at MCS 0 for EAPOL-Key
  msg 4/4 (this being the most critical frame in the connection sequence
  due to not having a good recovery mechanism)

Dropping probing from these would allow one more attempt at the first
rate and I guess it would also drop the first rate to somewhat lower.

I'm fine with using these MCS rates as the first option, but I do think
that we have to add one more rate to the end (or change the 3rd rate if
that is easier for implementation) to be non-MCS and I think one of the
basic rates (say, 6 Mbps on 5 GHz and maybe 2 or 5.5 Mbps on 2.4 GHz)
with number of tries (say, 4).

There have been way too many cases reported where "strange issues" with
4-way exchange (those EAPOL-Key frames) result in connection failing.
While not all of these can be explained with the TX rate, I'm pretty
sure large portion of these issues are indeed caused by too optimistic
TX rate selection. Such policies may be acceptable for other Data
frames, but not for EAPOL.

> If it still fails often enough to be noticeable under normal conditions,
> there must be something seriously wrong outside of rate control, and we
> should not paper over it with a crude band-aid workaround.

There may be something else wrong (say, some kind of interference), but
there is no way we can assume normal users to be able to fix such
issues. If we make EAPOL frames go through more robustly, the connection
can be established in more cases and this can result in relatively
functional network connection and rate control can handle the less
critical data frames through whatever means to get optimal throughput
from the network. As such, I do think we do need to "paper over" this
for EAPOL frames.
Jouni Malinen Feb. 25, 2015, 2:53 p.m. UTC | #13
On Tue, Feb 24, 2015 at 11:38:02PM +0100, Thomas Hühn wrote:
> Minstrel_HT does only set mrr[0..2] and does not touch the fourth mrr[3], assumed chips with 4 mrr stages.
> In function minstrel_ht_update_rates() the rate table struct ieee80211_sta_rates is set for the first 3 out of 4 rates and the fouth rate (mrr[3]) is „-1“.

Agreed.

> In case of ath9k, the driver in xmit.c allocates in function ath_tx_fill_desc() a  struct ath_tx_info info by  memset(&info, 0, sizeof(info)) .. so all info->rates[xy].Rate == 0.
> Than function ath_buf_set_rate() sets all info->rates[xy].Rate to those values specified in the rate table (struct ieee8021_sta_rate) IF (!rates[i].count || (rates[i].idx < 0)) …
> … so info->rates[3].Rate is untouched and still  „0“.

Sure, it can be 0 and so is the number of tries at that rate..

> I just added a printk() to xmit.c in function ath_tx_fill_desc() just before ath9k_hw_set_txdesc() is called.
> From the log I get, e.g.:
> 
> [  169.543554] mrr setup: mrr[0]:133 mrr[1]:132 mrr[2]:134 mrr[3]:0 
> 
> I have not check yet if info->rates[3].Rate == 0 is the lowest possible rate… but I would guess so.

It is not and even if it were, it does not matter since this 4th item is
used for _0_ tries. I've verified the exact behavior with a sniffer for
a case where the target device does not ACK frames. ath9k ends up
sending at exactly the three different rates indicated in the first
three values and nothing else. With RC probing (which happens to occur
for the initial EAPOL frames, this results in only one attempt at
MCS(>0) and two + two attempts at MCS0. No non-MCS rates are tried. As
pointed out previously, this is likely fine for normal data frames, but
not for EAPOL.
Linus Torvalds Feb. 25, 2015, 6:14 p.m. UTC | #14
On Wed, Feb 25, 2015 at 6:47 AM, Jouni Malinen <j@w1.fi> wrote:
>
> There may be something else wrong (say, some kind of interference), but
> there is no way we can assume normal users to be able to fix such
> issues. If we make EAPOL frames go through more robustly, the connection
> can be established in more cases and this can result in relatively
> functional network connection and rate control can handle the less
> critical data frames through whatever means to get optimal throughput
> from the network. As such, I do think we do need to "paper over" this
> for EAPOL frames.

While I realize that people may disagree about the exact details of
how to fix this in the long run, may I suggest that in the meantime we
at least get the two workaround patches applied?

I'm talking about the two from Jouni - the "don't encrypt EAPOL
frames" one, and the one-liner that makes all EAPOL frames go at the
lowest data rate.

Even if "lowest data rate" is ridiculously low, and even if that might
disturb other things going on on the same channel at the same time,
those authentication packets shouldn't be so common as to be a
problem.  No?

Jouni has a few packet dumps for me, and he's stumped as to what
exactly is going on, but those two patches (well, the one-liner "low
data rate EAPOL" in particular, it seems) do seem to make my
connections go through reliably.

And it seems that other drivers already are working around the EAPOL
issue in similar ways, judging by the comments about iwlwifi.

Last time I had connection issues with this laptop, nothing ended up
happening in the end, and I had people pipe up saying they had had
similar problems. I'd hate for the same "nothing" to happen this time
just because people aren't 100% sure what the final right thing is to
do. So I'd really like people to apply the simple workarounds for now
because clearly something is badly wrong, and *if* there is some
better resolution later, that's fine.

I'll happily test patches. It seems to be pretty repeatable for me,
even if that "pretty repeatable" seems to be very much about the
laptop being in one very particular place (it's right next to another
AP, there's random other electronics around, since it's on my messy
desk etc). So I wouldn't be at all surprised by horribly interference.
And the AP is supposed to be ceiling- or wall-mounted, but because I'm
just testing things out it's just sitting on a table in the next room,
so for all I know it's in the *exact* worst position for the antennas
etc etc.

So I'm sure I can improve reception of my laptop, but that's not the
point. The point is that bad wireless networks aren't so unusual, and
right now things clearly don't work as well as they could.

Does anybody hate Jouni's two patches *so* much that they can
articulate *why* it would be wrong to apply them as interim patches?
And if so, do you have better patches for me to try? Because if not..

                                 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Peter Stuge Feb. 25, 2015, 6:25 p.m. UTC | #15
Linus Torvalds wrote:
> Last time I had connection issues with this laptop, nothing ended up
> happening in the end, and I had people pipe up saying they had had
> similar problems. I'd hate for the same "nothing" to happen this time
> just because people aren't 100% sure what the final right thing is to
> do. So I'd really like people to apply the simple workarounds for now
> because clearly something is badly wrong, and *if* there is some
> better resolution later, that's fine.
..
> So I'm sure I can improve reception of my laptop, but that's not the
> point. The point is that bad wireless networks aren't so unusual, and
> right now things clearly don't work as well as they could.

Those two patches (the one-liner in particular) should have gone
in already 5-10 years ago. This has been an embarrassing problem
for many years.

I'm glad that people jumped this time around.


//Peter
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Adrian Chadd Feb. 25, 2015, 8:22 p.m. UTC | #16
On 25 February 2015 at 10:14, Linus Torvalds
<torvalds@linux-foundation.org> wrote:


> While I realize that people may disagree about the exact details of
> how to fix this in the long run, may I suggest that in the meantime we
> at least get the two workaround patches applied?
>
> I'm talking about the two from Jouni - the "don't encrypt EAPOL
> frames" one, and the one-liner that makes all EAPOL frames go at the
> lowest data rate.
>
> Even if "lowest data rate" is ridiculously low, and even if that might
> disturb other things going on on the same channel at the same time,
> those authentication packets shouldn't be so common as to be a
> problem.  No?
>
> Jouni has a few packet dumps for me, and he's stumped as to what
> exactly is going on, but those two patches (well, the one-liner "low
> data rate EAPOL" in particular, it seems) do seem to make my
> connections go through reliably.
>
> And it seems that other drivers already are working around the EAPOL
> issue in similar ways, judging by the comments about iwlwifi.

[snip]

> So I'm sure I can improve reception of my laptop, but that's not the
> point. The point is that bad wireless networks aren't so unusual, and
> right now things clearly don't work as well as they could.
>
> Does anybody hate Jouni's two patches *so* much that they can
> articulate *why* it would be wrong to apply them as interim patches?
> And if so, do you have better patches for me to try? Because if not..

I agree with you. I think you should just have EAPOL frames go out at
the lowest rate for now and then worry about experimenting with more
interesting ways to make EAPOL / DHCP frames cheaper. It fixes a lot
of problems in noisy areas. That hack was hiding around in various
commercial drivers I've seen, and it's been in FreeBSD for a while.

Same with DHCP traffic too - it's the second set of data frames that
the rate control code sees, and it's the primary reason I dropped the
initial sample rate down in FreeBSD so the DHCP exchange would have a
better chance of succeeding after association.



-adrian
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thomas Huehn Feb. 25, 2015, 8:52 p.m. UTC | #17
Hi Jouni,

>> In case of ath9k, the driver in xmit.c allocates in function ath_tx_fill_desc() a  struct ath_tx_info info by  memset(&info, 0, sizeof(info)) .. so all info->rates[xy].Rate == 0.
>> Than function ath_buf_set_rate() sets all info->rates[xy].Rate to those values specified in the rate table (struct ieee8021_sta_rate) IF (!rates[i].count || (rates[i].idx < 0)) …
>> … so info->rates[3].Rate is untouched and still  „0“.
> 
> Sure, it can be 0 and so is the number of tries at that rate..

You are absolutely right … as the tries are 0 as well, the 4th mrr[3] is not used. I missed that obvious point.


>> I just added a printk() to xmit.c in function ath_tx_fill_desc() just before ath9k_hw_set_txdesc() is called.
>> From the log I get, e.g.:
>> 
>> [  169.543554] mrr setup: mrr[0]:133 mrr[1]:132 mrr[2]:134 mrr[3]:0 
>> 
>> I have not check yet if info->rates[3].Rate == 0 is the lowest possible rate… but I would guess so.
> 
> It is not and even if it were, it does not matter since this 4th item is
> used for _0_ tries. I've verified the exact behavior with a sniffer for
> a case where the target device does not ACK frames. ath9k ends up
> sending at exactly the three different rates indicated in the first
> three values and nothing else. With RC probing (which happens to occur
> for the initial EAPOL frames, this results in only one attempt at
> MCS(>0) and two + two attempts at MCS0. No non-MCS rates are tried. As
> pointed out previously, this is likely fine for normal data frames, but
> not for EAPOL.

I do agree in changing the way to assign a (mrr) rate set for  EAPOL frames in the most robust fashion.

Greetings Thomas


> 
> -- 
> Jouni Malinen                                            PGP id EFC895FA
> _______________________________________________
> ath9k-devel mailing list
> ath9k-devel@lists.ath9k.org
> https://lists.ath9k.org/mailman/listinfo/ath9k-devel

--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andrew McGregor Feb. 26, 2015, 5:02 a.m. UTC | #18
On Thu, Feb 26, 2015 at 7:22 AM, Adrian Chadd <adrian@freebsd.org> wrote:
> On 25 February 2015 at 10:14, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>
>
>> While I realize that people may disagree about the exact details of
>> how to fix this in the long run, may I suggest that in the meantime we
>> at least get the two workaround patches applied?
>>
>> I'm talking about the two from Jouni - the "don't encrypt EAPOL
>> frames" one, and the one-liner that makes all EAPOL frames go at the
>> lowest data rate.
>>
>> Even if "lowest data rate" is ridiculously low, and even if that might
>> disturb other things going on on the same channel at the same time,
>> those authentication packets shouldn't be so common as to be a
>> problem.  No?
>>
>> Jouni has a few packet dumps for me, and he's stumped as to what
>> exactly is going on, but those two patches (well, the one-liner "low
>> data rate EAPOL" in particular, it seems) do seem to make my
>> connections go through reliably.
>>
>> And it seems that other drivers already are working around the EAPOL
>> issue in similar ways, judging by the comments about iwlwifi.
>
> [snip]
>
>> So I'm sure I can improve reception of my laptop, but that's not the
>> point. The point is that bad wireless networks aren't so unusual, and
>> right now things clearly don't work as well as they could.
>>
>> Does anybody hate Jouni's two patches *so* much that they can
>> articulate *why* it would be wrong to apply them as interim patches?
>> And if so, do you have better patches for me to try? Because if not..
>
> I agree with you. I think you should just have EAPOL frames go out at
> the lowest rate for now and then worry about experimenting with more
> interesting ways to make EAPOL / DHCP frames cheaper. It fixes a lot
> of problems in noisy areas. That hack was hiding around in various
> commercial drivers I've seen, and it's been in FreeBSD for a while.
>
> Same with DHCP traffic too - it's the second set of data frames that
> the rate control code sees, and it's the primary reason I dropped the
> initial sample rate down in FreeBSD so the DHCP exchange would have a
> better chance of succeeding after association.
>
>
>
> -adrian

+1

Really, who cares about efficiency here; these are rare control packets.
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Linus Torvalds Feb. 26, 2015, 5:55 a.m. UTC | #19
On Wed, Feb 25, 2015 at 10:14 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I'm talking about the two from Jouni - the "don't encrypt EAPOL
> frames" one, and the one-liner that makes all EAPOL frames go at the
> lowest data rate.

So I just found out and confirmed that this is not Atheros-specific in
any way - it looks like it's simply the UniFi AP that does not like
high-data-rate authentification frames at all.

Because it looks like the brcmsmac driver has *exactly* the same
behavior with this AP (in an Apple Macbook air). I assume brcmsmac
uses the net/80211/tx.c logic too.

And Jouni's one-liner fixes that one too, although as usual, maybe
there is some testing noise, and I screwed something up. This time I
only did the one-liner, so that's the critical one.

It's interesting to note how nothing else has been unhappy with that
network (admittedly it's been mainly android devices and a HP printer
that I've tested), so it looks like everybody else does low-rate
authentication packets anyway.

So this actually looks like a Ubiquiti UniFi AP bug to me, but it also
looks like presumably everybody else does low-rate initial packets,
and our kernel 802.11 layer should just follow suit. The whole
robustness principle and "be conservative in what you send, and
liberal in what you accept" etc.

                        Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Arend van Spriel Feb. 26, 2015, 10:01 a.m. UTC | #20
On 02/26/15 06:55, Linus Torvalds wrote:
> Because it looks like the brcmsmac driver has*exactly*  the same
> behavior with this AP (in an Apple Macbook air). I assume brcmsmac
> uses the net/80211/tx.c logic too.

That makes sense as brcmsmac is a mac80211 driver that uses the 
minstrel-ht rate control algorithm. iwlwifi is also a mac80211 driver 
but it uses a different rate control algorithm so if you have that lying 
around it would be interesting to test that (if you did not already do 
that).

Regards,
Arend
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jouni Malinen Feb. 26, 2015, 10:20 a.m. UTC | #21
On Wed, Feb 25, 2015 at 10:14:45AM -0800, Linus Torvalds wrote:
> While I realize that people may disagree about the exact details of
> how to fix this in the long run, may I suggest that in the meantime we
> at least get the two workaround patches applied?

> Does anybody hate Jouni's two patches *so* much that they can
> articulate *why* it would be wrong to apply them as interim patches?
> And if so, do you have better patches for me to try? Because if not..

Of all people, I do actually have some hatred on the one-liner to force
minimum rate for all EAPOL TX attempts. That is punishing the vast
majority of cases where the AP is perfectly fine with higher MCS rates
being used (and MCS 0 being sufficient fallback option) for EAPOL. Being
able to use higher TX rates as the initial attempt is a nice feature and
even though this may be limited to number of upstream Linux drivers
today, that part of the feature is an improvement, IMHO. This can even
be more robust in some environments especially when going through long
EAP exchange with certain types of interference.

That said, it is clear to me that we need to modify the current behavior
to be more conservative with some of the EAPOL frame retries. I'm not
familiar enough with the current minstrel implementation to prepare a
clean patch doing the approach I think that should be used here (replace
the last rate with a low basic rate and add couple of tries with it). I
have reason to believe that Felix might be able to look at this in
couple of weeks, though.

In other words, I think I'll send out a more formal version of the
patches so that they can be applied now if desired while keeping in mind
that we may want to replace the minrate-for-EAPOL with something else
later. As far as the EAPOL-Key msg 4/4 retries without encryption patch
is concerned, I think I have enough concern on the couple of corner
cases that I know it does not address (and may break), so it may take
some more time before I'm ready to suggest it (or a bit modified version
of it) to be applied, though.
Peter Stuge Feb. 26, 2015, 4:04 p.m. UTC | #22
Jouni Malinen wrote:
> punishing the vast majority of cases where the AP is perfectly fine
> with higher MCS rates

First make it work (everywhere), then make it fast (where possible).


//Peter
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Adrian Chadd Feb. 26, 2015, 7:03 p.m. UTC | #23
On 26 February 2015 at 02:20, Jouni Malinen <j@w1.fi> wrote:
> On Wed, Feb 25, 2015 at 10:14:45AM -0800, Linus Torvalds wrote:
>> While I realize that people may disagree about the exact details of
>> how to fix this in the long run, may I suggest that in the meantime we
>> at least get the two workaround patches applied?
>
>> Does anybody hate Jouni's two patches *so* much that they can
>> articulate *why* it would be wrong to apply them as interim patches?
>> And if so, do you have better patches for me to try? Because if not..
>
> Of all people, I do actually have some hatred on the one-liner to force
> minimum rate for all EAPOL TX attempts. That is punishing the vast
> majority of cases where the AP is perfectly fine with higher MCS rates
> being used (and MCS 0 being sufficient fallback option) for EAPOL. Being
> able to use higher TX rates as the initial attempt is a nice feature and
> even though this may be limited to number of upstream Linux drivers
> today, that part of the feature is an improvement, IMHO. This can even
> be more robust in some environments especially when going through long
> EAP exchange with certain types of interference.

I remember running the math - well, the "airtime" math, and realised
it was almost always cheaper to do the single, non-aggregated EAPOL
frame exchange at the lowest rate than to try higher rates and fall
back to the lowest rates.




-adrian
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/mac80211/tx.c b/net/mac80211/tx.c
index c314c59..b45038f 100644
--- a/net/mac80211/tx.c
+++ b/net/mac80211/tx.c
@@ -566,6 +566,7 @@  ieee80211_tx_h_check_control_port_protocol(struct ieee80211_tx_data *tx)
 		if (tx->sdata->control_port_no_encrypt)
 			info->flags |= IEEE80211_TX_INTFL_DONT_ENCRYPT;
 		info->control.flags |= IEEE80211_TX_CTRL_PORT_CTRL_PROTO;
+		info->flags |= IEEE80211_TX_CTL_USE_MINRATE;
 	}
 
 	return TX_CONTINUE;