mbox series

[00/10] Ben's grab bag of mac80211 patches

Message ID 20191108194210.23618-1-greearb@candelatech.com (mailing list archive)
Headers show
Series Ben's grab bag of mac80211 patches | expand

Message

Ben Greear Nov. 8, 2019, 7:42 p.m. UTC
From: Ben Greear <greearb@candelatech.com>

Here are some patches from my tree, freshly applied and compile tested on top of
upstream 5.4.0-rc6.  One incorporates a fix suggested back in 2013 :)

There should not be much dependencies between these, so just
skip any you don't want.

Ben Greear (10):
  mac80211: Add comment about tx on monitor devs.
  mac80211: Change warn-on to warn-on-once
  mac80211: Don't spam kernel with sdata-in-driver failures.
  mac80211: Don't spam so loud about warned-sdata-in-driver.
  mac80211: Improve connection-loss probing.
  mac80211: Make max-auth-tries configurable as module option
  mac80211: Revert some of e8e4f5, fixes setting single rate in ath10k.
  mac80211: Support decrypting action frames for reinsertion into the
    driver.
  mac80211: Use warn-on-once for queue mis-configuration.
  mlme: Don't unlink bss on assoc timeout and similar.

 include/net/mac80211.h     | 10 +++++
 net/mac80211/cfg.c         |  6 ++-
 net/mac80211/debug.h       |  3 ++
 net/mac80211/driver-ops.c  |  9 ++++
 net/mac80211/driver-ops.h  | 36 ++++++++++++++--
 net/mac80211/ieee80211_i.h |  3 +-
 net/mac80211/iface.c       |  7 ++++
 net/mac80211/mlme.c        | 84 +++++++++++++++++---------------------
 net/mac80211/rx.c          |  2 +-
 net/mac80211/tx.c          |  6 +++
 net/mac80211/util.c        |  5 ++-
 11 files changed, 115 insertions(+), 56 deletions(-)

Comments

Johannes Berg Nov. 22, 2019, 11:49 a.m. UTC | #1
Hi Ben,

Well, for me to consider any of these, you really should come with
commit logs and reasons that actually make sense outside of your
environment :)

I committed a note that replaces patch 1.

I don't like 2-4, that just adds state and one should never hit it. If
you're hitting these you should investigate the root cause, not fix the
symptoms.

5 seems like it might be reasonable, but it's hard to read and
understand, maybe revisit that?

I tend to think the previous module options along the lines of patch 6
were a mistake, rather than add more ...

7 is totally not understandable, but might be legitimate? Unlikely, but
hard to say.

8 I don't like at all. How about you do it in the driver somehow?

9 is like 2-4 really, I guess maybe this one I could get behind if it
came with a commit log that actually explains why one is likely to hit
this multiple times or something?

10 we did to fix other behaviour, so ...

johannes
Ben Greear Nov. 22, 2019, 4:37 p.m. UTC | #2
On 11/22/2019 03:49 AM, Johannes Berg wrote:
> Hi Ben,
>
> Well, for me to consider any of these, you really should come with
> commit logs and reasons that actually make sense outside of your
> environment :)
>
> I committed a note that replaces patch 1.

Well, 1 out of 10 ain't that bad!

> I don't like 2-4, that just adds state and one should never hit it. If
> you're hitting these you should investigate the root cause, not fix the
> symptoms.
>
> 5 seems like it might be reasonable, but it's hard to read and
> understand, maybe revisit that?

This is the patch you previously said you liked but it would not apply upstream.
Now it applies.  That whole code mess is hard to understand, but I have been running
a similar patch for a while and it has worked well.

Instead of trying to understand the patch, try applying it and then read the resulting
code.  It is a lot simpler to understand that way I think.

You can also sniff air with current code w/out this patch and watch the crappy
retry behaviour where the retries are clumped in a few ms of time instead of being
spread out.

>
> I tend to think the previous module options along the lines of patch 6
> were a mistake, rather than add more ...
>
> 7 is totally not understandable, but might be legitimate? Unlikely, but
> hard to say.

Ath10k will always use legacy rates for ctrl frames and such even if
you otherwise restrict the rateset.  So, it is legit to set a single rate even if
that would leave no legacy rates configured as far as mac80211 is concerned.

Your patch broke the ability to set a single rate in ath10k, and my change will
allow it to work again.

>
> 8 I don't like at all. How about you do it in the driver somehow?

I had low hopes for this one anyway.  mac80211 has the software decrypt logic, not the
driver, so it seemed reasonable to have the mac80211 do a callback.  This patch is likely
only useful for drivers that do block-ack in the firmware and support software decrypt,
which may only be my modified ath10k-ct driver/firmware.

>
> 9 is like 2-4 really, I guess maybe this one I could get behind if it
> came with a commit log that actually explains why one is likely to hit
> this multiple times or something?

Basically, it is almost never useful to use WARN_ON instead of WARN_ON_ONCE.  If you ever
do hit the bug, often the logs are full of WARN_ON spam and you cannot even find the real problem
until you change it to WARN_ON_ONCE and reproduce the problem.

I hit this problem due to some coding bug while poking at ath10k, and you get one splat
per tx frame, so you can image the spam.


> 10 we did to fix other behaviour, so ...

This one is especially useful for roaming several virtual stations, but maybe that is only useful test
case.  I included it more as an RFC, but it has worked well enough for us in case you see some worth in
it (and obviously it should be changed to not use // comments in case the functionality is actually
deemed useful).

Thanks,
Ben

>
> johannes
>
Johannes Berg Nov. 22, 2019, 5:53 p.m. UTC | #3
On Fri, 2019-11-22 at 08:37 -0800, Ben Greear wrote:

> Well, 1 out of 10 ain't that bad!

Heh.

> > 5 seems like it might be reasonable, but it's hard to read and
> > understand, maybe revisit that?
> 
> This is the patch you previously said you liked but it would not apply upstream.
> Now it applies.  That whole code mess is hard to understand, but I have been running
> a similar patch for a while and it has worked well.
> 
> Instead of trying to understand the patch, try applying it and then read the resulting
> code.  It is a lot simpler to understand that way I think.

OK :)

> You can also sniff air with current code w/out this patch and watch the crappy
> retry behaviour where the retries are clumped in a few ms of time instead of being
> spread out.

Right. I'll have to review this carefully. I think the Intel firmware
will soon spread out the hardware retries a bit too (vs. here the
software retries).

> Ath10k will always use legacy rates for ctrl frames and such even if

*control* frames have a very tightly controlled rate, e.g. ACK must have
the same or next lower mandatory rate, or basic rate, or something. I
don't have it in my head right now.

Do you mean *management* frames?

> you otherwise restrict the rateset.  So, it is legit to set a single rate even if
> that would leave no legacy rates configured as far as mac80211 is concerned.
> 
> Your patch broke the ability to set a single rate in ath10k, and my change will
> allow it to work again.

Either way, it's kind of an ath10k bug in a sense. You don't actually
want to "set this single rate", you want to "set this single rate for
data frames" or so ... which is a different API.

If you look at ath9k or any other driver with software rate control, it
would not be able to pick a legacy rate for mgmt frames at all with your
patch and the warning, and then I'm pretty sure you'll hit a WARN_ON()
in rate.c.

So no, I will certainly not apply this.

Maybe you want to add a different API that affects only data frames (or
a variant of this API or something), instead, but this is not it. We
can't leave the software drivers without any rates to pick from for
management frames.

> > 8 I don't like at all. How about you do it in the driver somehow?
> 
> I had low hopes for this one anyway.  mac80211 has the software decrypt logic, not the
> driver, so it seemed reasonable to have the mac80211 do a callback.  This patch is likely
> only useful for drivers that do block-ack in the firmware and support software decrypt,
> which may only be my modified ath10k-ct driver/firmware.

So far that's the only one, as far as I can tell, yeah ...

> > 9 is like 2-4 really, I guess maybe this one I could get behind if it
> > came with a commit log that actually explains why one is likely to hit
> > this multiple times or something?
> 
> Basically, it is almost never useful to use WARN_ON instead of WARN_ON_ONCE.  If you ever
> do hit the bug, often the logs are full of WARN_ON spam and you cannot even find the real problem
> until you change it to WARN_ON_ONCE and reproduce the problem.
> 
> I hit this problem due to some coding bug while poking at ath10k, and you get one splat
> per tx frame, so you can image the spam.

Yeah, but does it matter? It should really just happen during
development, right? Having the _ONCE makes this cost more space and
code, and I'm not sure I see that much point in that.

> > 10 we did to fix other behaviour, so ...
> 
> This one is especially useful for roaming several virtual stations, but maybe that is only useful test
> case.  I included it more as an RFC, but it has worked well enough for us in case you see some worth in
> it (and obviously it should be changed to not use // comments in case the functionality is actually
> deemed useful).

I guess we can go back and forth on this ... but in most cases you don't
know that the AP didn't go away, so IMHO the current behaviour is
better. It's just in your special case you know it's more likely some
local issues vs. the AP having disappeared.

johannes
Ben Greear Nov. 22, 2019, 6:22 p.m. UTC | #4
On 11/22/19 9:53 AM, Johannes Berg wrote:
> On Fri, 2019-11-22 at 08:37 -0800, Ben Greear wrote:
> 
>> Well, 1 out of 10 ain't that bad!
> 
> Heh.
> 
>>> 5 seems like it might be reasonable, but it's hard to read and
>>> understand, maybe revisit that?
>>
>> This is the patch you previously said you liked but it would not apply upstream.
>> Now it applies.  That whole code mess is hard to understand, but I have been running
>> a similar patch for a while and it has worked well.
>>
>> Instead of trying to understand the patch, try applying it and then read the resulting
>> code.  It is a lot simpler to understand that way I think.
> 
> OK :)
> 
>> You can also sniff air with current code w/out this patch and watch the crappy
>> retry behaviour where the retries are clumped in a few ms of time instead of being
>> spread out.
> 
> Right. I'll have to review this carefully. I think the Intel firmware
> will soon spread out the hardware retries a bit too (vs. here the
> software retries).
> 
>> Ath10k will always use legacy rates for ctrl frames and such even if
> 
> *control* frames have a very tightly controlled rate, e.g. ACK must have
> the same or next lower mandatory rate, or basic rate, or something. I
> don't have it in my head right now.
> 
> Do you mean *management* frames?

Yes, I mean all sorts of frames that should use legacy rates.  When you 'set the rate'
in ath10k, you are really just setting the data-packet's rate, and upstream driver/firmware
has other limitations (you can set single rate, or mcs-8 or mcs-9, but you cannot tell it to use
mcs-5 and mcs-6 only, for instance).

There are other ways to adjust management frame rates, but it is not through normal
mac80211 rate-setting logic.

>> you otherwise restrict the rateset.  So, it is legit to set a single rate even if
>> that would leave no legacy rates configured as far as mac80211 is concerned.
>>
>> Your patch broke the ability to set a single rate in ath10k, and my change will
>> allow it to work again.
> 
> Either way, it's kind of an ath10k bug in a sense. You don't actually
> want to "set this single rate", you want to "set this single rate for
> data frames" or so ... which is a different API. >
> If you look at ath9k or any other driver with software rate control, it
> would not be able to pick a legacy rate for mgmt frames at all with your
> patch and the warning, and then I'm pretty sure you'll hit a WARN_ON()
> in rate.c.
> 
> So no, I will certainly not apply this.
> 
> Maybe you want to add a different API that affects only data frames (or
> a variant of this API or something), instead, but this is not it. We
> can't leave the software drivers without any rates to pick from for
> management frames.

Maybe we can add a driver flag that indicates we can relax this check
and have ath10k set that flag?  Adding a whole top to bottom API to change
how rates are set sounds like a pain, and would probably be specific to
ath10k only in the end anyway?

(Remember the discussion about setting advertised vs used rates?  This would
  be even more complex and less useful than that I think).

> 
>>> 8 I don't like at all. How about you do it in the driver somehow?
>>
>> I had low hopes for this one anyway.  mac80211 has the software decrypt logic, not the
>> driver, so it seemed reasonable to have the mac80211 do a callback.  This patch is likely
>> only useful for drivers that do block-ack in the firmware and support software decrypt,
>> which may only be my modified ath10k-ct driver/firmware.
> 
> So far that's the only one, as far as I can tell, yeah ...
> 
>>> 9 is like 2-4 really, I guess maybe this one I could get behind if it
>>> came with a commit log that actually explains why one is likely to hit
>>> this multiple times or something?
>>
>> Basically, it is almost never useful to use WARN_ON instead of WARN_ON_ONCE.  If you ever
>> do hit the bug, often the logs are full of WARN_ON spam and you cannot even find the real problem
>> until you change it to WARN_ON_ONCE and reproduce the problem.
>>
>> I hit this problem due to some coding bug while poking at ath10k, and you get one splat
>> per tx frame, so you can image the spam.
> 
> Yeah, but does it matter? It should really just happen during
> development, right? Having the _ONCE makes this cost more space and
> code, and I'm not sure I see that much point in that.

Sure, for this one.  The sdata-in-driver warnings happen often-ish in ath10k.  No one can figure
out how to fix it, or really cares to look I think.

But, really, not worth arguing over.  The main patch in this series that I would like to see upstream
is the one that tweaks retry logic.  Others are fairly easy to maintain outside the tree and are
of more limited use to the general user.

> 
>>> 10 we did to fix other behaviour, so ...
>>
>> This one is especially useful for roaming several virtual stations, but maybe that is only useful test
>> case.  I included it more as an RFC, but it has worked well enough for us in case you see some worth in
>> it (and obviously it should be changed to not use // comments in case the functionality is actually
>> deemed useful).
> 
> I guess we can go back and forth on this ... but in most cases you don't
> know that the AP didn't go away, so IMHO the current behaviour is
> better. It's just in your special case you know it's more likely some
> local issues vs. the AP having disappeared.

What I saw is that supplicant would try to do a roam, and kernel would reject it because
it had just deleted the object.  Maybe it is better to have the kernel be more lenient and
let supplicant take care of the policy decisions?

Thanks,
Ben

> 
> johannes
>