[v1] wifi: mac80211: Initialize EWMA fail avg to 1

Message ID	20230417100258.22965-1-quic_karm@quicinc.com (mailing list archive)
State	Changes Requested
Delegated to:	Johannes Berg
Headers	show Return-Path: <linux-wireless-owner@vger.kernel.org> From: Karthik M <quic_karm@quicinc.com> To: <johannes@sipsolutions.net> CC: <linux-wireless@vger.kernel.org>, Karthik M <quic_karm@quicinc.com>, Tamizh Chelvam Raja <quic_tamizhr@quicinc.com> Subject: [PATCH v1] wifi: mac80211: Initialize EWMA fail avg to 1 Date: Mon, 17 Apr 2023 15:32:58 +0530 Message-ID: <20230417100258.22965-1-quic_karm@quicinc.com> MIME-Version: 1.0 Content-Type: text/plain Precedence: bulk
Series	[v1] wifi: mac80211: Initialize EWMA fail avg to 1 \| expand [v1] wifi: mac80211: Initialize EWMA fail avg to 1

Karthik M April 17, 2023, 10:02 a.m. UTC

If the average value in mesh metrics calculation
has been rounded to 0 (success), this resets it to
the smallest nonzero value (similarly to the initialization)
to avoid a case where a single failure would result in
an average value that goes beyond the value
of 95 (Link Failure Threshold).

Signed-off-by: Tamizh Chelvam Raja <quic_tamizhr@quicinc.com>
Signed-off-by: Karthik M <quic_karm@quicinc.com>
---
Changes since v1:
 - Altered the comment to mention "mesh" and thershold value.
 - Checkpatch done
---
 net/mac80211/mesh_hwmp.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

Johannes Berg April 18, 2023, 1:04 p.m. UTC | #1

On Mon, 2023-04-17 at 15:32 +0530, Karthik M wrote:
> If the average value in mesh metrics calculation
> has been rounded to 0 (success), this resets it to
> the smallest nonzero value (similarly to the initialization)
> to avoid a case where a single failure would result in
> an average value that goes beyond the value
> of 95 (Link Failure Threshold).
> 
> Signed-off-by: Tamizh Chelvam Raja <quic_tamizhr@quicinc.com>
> Signed-off-by: Karthik M <quic_karm@quicinc.com>
> ---
> Changes since v1:
>  - Altered the comment to mention "mesh" and thershold value.
>  - Checkpatch done

This is v2 now, how did an explicit "v1" happen? :)

Anyway ...

> +++ b/net/mac80211/mesh_hwmp.c
> @@ -298,10 +298,23 @@ void ieee80211s_update_metric(struct ieee80211_local *local,
>  {
>  	struct ieee80211_tx_info *txinfo = st->info;
>  	int failed;
> +	u32 fail_avg;
>  	struct rate_info rinfo;
>  
>  	failed = !(txinfo->flags & IEEE80211_TX_STAT_ACK);
>  
> +	fail_avg = ewma_mesh_fail_avg_read(&sta->mesh->fail_avg);
> +	if (!fail_avg) {
> +		/* If the average value in mesh metrics calculation
> +		 * has been rounded to 0 (success), this resets it to
> +		 * the smallest nonzero value (similarly to the initialization)
> +		 * to avoid a case where a single failure would result in
> +		 * an average value that goes beyond the value
> +		 * of 95 (Link Failure Threshold)
> +		 */
> +		ewma_mesh_fail_avg_add(&sta->mesh->fail_avg, 1);
> +	}
> +
>  	/* moving average, scaled to 100.
>  	 * feed failure as 100 and success as 0
>  	 */

This still seems really non-intuitive to me.

It seems to me this is down to the special "0 means init" behaviour,
that you don't want, because your values actually fluctate between 0 and
100, and you can actually legitimately reach 0 with a lot of successes.

But to me it's really non-intuitive, if not counter-intuitive, to say
"oh yeah my values are 0 to 100 inclusive, but I can't ever deal with
reaching 0 because then I jump to 100 immediately". That doesn't make
much sense to me?

I mean, I guess I can see where this patch makes some sense like from a
code point of view, this is sort of the minimal code change you could
make to make the existing code work, but ... I'd argue you're optimising
to the wrong metric here, "minimal code changes to fix the bug" should
normally not be your metric, it should be "code changes that make this
clearer and avoid the problem", or something like that?

Anyway I guess that's all a long-winded way of saying that I don't
really agree with this change.

To me, basically, I see two ways to solve this:

1) we have DECLARE_EWMA_ZERO_VALID() or something like that which
   *doesn't* treat 0 as an uninitialized value, and either has a
   separate "not initialized yet" bit (but that's iffy storage wise),
   or simply has another argument to _init() for the initial value or
   so.

2) you don't just don't use 0 and 100 but say 1 and 100, that results in
   basically the same behaviour, but avoids the special 0.

johannes

Benjamin Beichler April 20, 2023, 9:30 a.m. UTC | #2

Hi,

> 
> This still seems really non-intuitive to me.
> 
> It seems to me this is down to the special "0 means init" behaviour,
> that you don't want, because your values actually fluctate between 0 and
> 100, and you can actually legitimately reach 0 with a lot of successes.
> 
> But to me it's really non-intuitive, if not counter-intuitive, to say
> "oh yeah my values are 0 to 100 inclusive, but I can't ever deal with
> reaching 0 because then I jump to 100 immediately". That doesn't make
> much sense to me?
> 
> I mean, I guess I can see where this patch makes some sense like from a
> code point of view, this is sort of the minimal code change you could
> make to make the existing code work, but ... I'd argue you're optimising
> to the wrong metric here, "minimal code changes to fix the bug" should
> normally not be your metric, it should be "code changes that make this
> clearer and avoid the problem", or something like that?
> 
> Anyway I guess that's all a long-winded way of saying that I don't
> really agree with this change.
> 
> To me, basically, I see two ways to solve this:
> 
> 1) we have DECLARE_EWMA_ZERO_VALID() or something like that which
>     *doesn't* treat 0 as an uninitialized value, and either has a
>     separate "not initialized yet" bit (but that's iffy storage wise),
>     or simply has another argument to _init() for the initial value or
>     so.
> 
> 2) you don't just don't use 0 and 100 but say 1 and 100, that results in
>     basically the same behaviour, but avoids the special 0.
> 
> johannes

I also ran into that problem in the past, and reviewing it again with a 
college, I think, this is a real bug in the EWMA implementation. I try 
to provide a proper patch in the next days, but actually the EWMA 
handles the internal value zero, always like in the initialization, 
which is wrong, e.g., for positive/negative averaged values.

A quick research shows, this bug is since the first implementation of 
the ewma in the code ...

kind regards

Benjamin

Johannes Berg April 20, 2023, 10:27 a.m. UTC | #3

On Thu, 2023-04-20 at 11:30 +0200, Benjamin Beichler wrote:
> > To me, basically, I see two ways to solve this:
> > 
> > 1) we have DECLARE_EWMA_ZERO_VALID() or something like that which
> >     *doesn't* treat 0 as an uninitialized value, and either has a
> >     separate "not initialized yet" bit (but that's iffy storage wise),
> >     or simply has another argument to _init() for the initial value or
> >     so.
> > 
> > 2) you don't just don't use 0 and 100 but say 1 and 100, that results in
> >     basically the same behaviour, but avoids the special 0.
> > 
> > johannes
> 
> I also ran into that problem in the past, and reviewing it again with a 
> college, I think, this is a real bug in the EWMA implementation. I try 
> to provide a proper patch in the next days, but actually the EWMA 
> handles the internal value zero, always like in the initialization, 
> which is wrong, e.g., for positive/negative averaged values.

Yes, it's always wrong as long as you feed it something zero, or values
with different sign.

For a lot of use cases, however, that doesn't matter. Originally, it was
used e.g. for signal strength averaging, average packet lengths, etc.
where it really doesn't matter since you can never use 0 or values that
have different sign.

> A quick research shows, this bug is since the first implementation of 
> the ewma in the code ...
> 

Yeah, I'm aware of that, I was around for it ;-)

But see above, I'm not sure I'd even call it a bug, at least not
originally with the users that we had intended.

Hence I don't know if it's really good to fix this in general - for many
of these cases zero can still be treated specially (and like I mentioned
in my previous email, we can even here avoid 0), and then we don't spend
an extra byte (or likely 4) to hold a "first time" flag.

Dunno. Maybe it's not worth thinking about the extra memory space vs.
the extra maintenance cost. But maybe at least on 64-bit we could steal
a bit from the unsigned long? Not sure what all the users are...

johannes

Benjamin Beichler April 20, 2023, 11:12 a.m. UTC | #4

>>> To me, basically, I see two ways to solve this:
>>>
>>> 1) we have DECLARE_EWMA_ZERO_VALID() or something like that which
>>>      *doesn't* treat 0 as an uninitialized value, and either has a
>>>      separate "not initialized yet" bit (but that's iffy storage wise),
>>>      or simply has another argument to _init() for the initial value or
>>>      so.
>>>
>>> 2) you don't just don't use 0 and 100 but say 1 and 100, that results in
>>>      basically the same behaviour, but avoids the special 0.
>>>
>>> johannes
>> I also ran into that problem in the past, and reviewing it again with a
>> college, I think, this is a real bug in the EWMA implementation. I try
>> to provide a proper patch in the next days, but actually the EWMA
>> handles the internal value zero, always like in the initialization,
>> which is wrong, e.g., for positive/negative averaged values.
> Yes, it's always wrong as long as you feed it something zero, or values
> with different sign.
>
> For a lot of use cases, however, that doesn't matter. Originally, it was
> used e.g. for signal strength averaging, average packet lengths, etc.
> where it really doesn't matter since you can never use 0 or values that
> have different sign.
>
>> A quick research shows, this bug is since the first implementation of
>> the ewma in the code ...
>>
> Yeah, I'm aware of that, I was around for it ;-)
>
> But see above, I'm not sure I'd even call it a bug, at least not
> originally with the users that we had intended.
>
> Hence I don't know if it's really good to fix this in general - for many
> of these cases zero can still be treated specially (and like I mentioned
> in my previous email, we can even here avoid 0), and then we don't spend
> an extra byte (or likely 4) to hold a "first time" flag.
>
> Dunno. Maybe it's not worth thinking about the extra memory space vs.
> the extra maintenance cost. But maybe at least on 64-bit we could steal
> a bit from the unsigned long? Not sure what all the users are...
I thought of introducing a separate function to initialize the 
"average", which could be optimized away, when unused. I had a look at 
the usage, and it looks like 10-15 places, which should work with 
initializing or simply weight the new value always, instead of the 
special case.

For me the problem is, that the current implementation is unintuitive or 
at least badly documented.
And I would even claim the same argument, that for most users, the 
behavior for initialization also does not matter, therefore I would use 
the mathematically more natural implementation :-)

Felix Fietkau April 20, 2023, 12:22 p.m. UTC | #5

On 20.04.23 12:27, Johannes Berg wrote:
> On Thu, 2023-04-20 at 11:30 +0200, Benjamin Beichler wrote:
>> > To me, basically, I see two ways to solve this:
>> > 
>> > 1) we have DECLARE_EWMA_ZERO_VALID() or something like that which
>> >     *doesn't* treat 0 as an uninitialized value, and either has a
>> >     separate "not initialized yet" bit (but that's iffy storage wise),
>> >     or simply has another argument to _init() for the initial value or
>> >     so.
>> > 
>> > 2) you don't just don't use 0 and 100 but say 1 and 100, that results in
>> >     basically the same behaviour, but avoids the special 0.
>> > 
>> > johannes
>> 
>> I also ran into that problem in the past, and reviewing it again with a 
>> college, I think, this is a real bug in the EWMA implementation. I try 
>> to provide a proper patch in the next days, but actually the EWMA 
>> handles the internal value zero, always like in the initialization, 
>> which is wrong, e.g., for positive/negative averaged values.
> 
> Yes, it's always wrong as long as you feed it something zero, or values
> with different sign.
> 
> For a lot of use cases, however, that doesn't matter. Originally, it was
> used e.g. for signal strength averaging, average packet lengths, etc.
> where it really doesn't matter since you can never use 0 or values that
> have different sign.
> 
>> A quick research shows, this bug is since the first implementation of 
>> the ewma in the code ...
>> 
> 
> Yeah, I'm aware of that, I was around for it ;-)
> 
> But see above, I'm not sure I'd even call it a bug, at least not
> originally with the users that we had intended.
> 
> Hence I don't know if it's really good to fix this in general - for many
> of these cases zero can still be treated specially (and like I mentioned
> in my previous email, we can even here avoid 0), and then we don't spend
> an extra byte (or likely 4) to hold a "first time" flag.
> 
> Dunno. Maybe it's not worth thinking about the extra memory space vs.
> the extra maintenance cost. But maybe at least on 64-bit we could steal
> a bit from the unsigned long? Not sure what all the users are..
We don't actually need a full bit. We can just add 1 to the internal 
value for initialized values. How about this (completely untested):
https://nbd.name/p/69b00c5b

- Felix

Benjamin Beichler April 20, 2023, 1 p.m. UTC | #6

Am 20.04.2023 um 14:22 schrieb Felix Fietkau:
> On 20.04.23 12:27, Johannes Berg wrote:
>> On Thu, 2023-04-20 at 11:30 +0200, Benjamin Beichler wrote:
>>> > To me, basically, I see two ways to solve this:
>>> >
>>> > 1) we have DECLARE_EWMA_ZERO_VALID() or something like that which
>>> >     *doesn't* treat 0 as an uninitialized value, and either has a
>>> >     separate "not initialized yet" bit (but that's iffy storage 
>>> wise),
>>> >     or simply has another argument to _init() for the initial 
>>> value or
>>> >     so.
>>> >
>>> > 2) you don't just don't use 0 and 100 but say 1 and 100, that 
>>> results in
>>> >     basically the same behaviour, but avoids the special 0.
>>> >
>>> > johannes
>>>
>>> I also ran into that problem in the past, and reviewing it again with a
>>> college, I think, this is a real bug in the EWMA implementation. I try
>>> to provide a proper patch in the next days, but actually the EWMA
>>> handles the internal value zero, always like in the initialization,
>>> which is wrong, e.g., for positive/negative averaged values.
>>
>> Yes, it's always wrong as long as you feed it something zero, or values
>> with different sign.
>>
>> For a lot of use cases, however, that doesn't matter. Originally, it was
>> used e.g. for signal strength averaging, average packet lengths, etc.
>> where it really doesn't matter since you can never use 0 or values that
>> have different sign.
>>
>>> A quick research shows, this bug is since the first implementation of
>>> the ewma in the code ...
>>>
>>
>> Yeah, I'm aware of that, I was around for it ;-)
>>
>> But see above, I'm not sure I'd even call it a bug, at least not
>> originally with the users that we had intended.
>>
>> Hence I don't know if it's really good to fix this in general - for many
>> of these cases zero can still be treated specially (and like I mentioned
>> in my previous email, we can even here avoid 0), and then we don't spend
>> an extra byte (or likely 4) to hold a "first time" flag.
>>
>> Dunno. Maybe it's not worth thinking about the extra memory space vs.
>> the extra maintenance cost. But maybe at least on 64-bit we could steal
>> a bit from the unsigned long? Not sure what all the users are..
> We don't actually need a full bit. We can just add 1 to the internal
> value for initialized values. How about this (completely untested):
> https://nbd.name/p/69b00c5b
>
Nice hack, but even a bit more dirty than before :-D

I think being more explicit, does not hurt here.

Maybe we could rename it ewma_*_add_or_init and do a separate function 
with ewma_*_add. Since it is inlined, there will be no significant text 
size increase (except a driver uses both functions).


> - Felix
>

Felix Fietkau April 20, 2023, 1:15 p.m. UTC | #7

> On 20. Apr 2023, at 15:00, Benjamin Beichler <Benjamin.Beichler@uni-rostock.de> wrote:
> 
> Am 20.04.2023 um 14:22 schrieb Felix Fietkau:
>>> On 20.04.23 12:27, Johannes Berg wrote:
>>> On Thu, 2023-04-20 at 11:30 +0200, Benjamin Beichler wrote:
>>>> > To me, basically, I see two ways to solve this:
>>>> >
>>>> > 1) we have DECLARE_EWMA_ZERO_VALID() or something like that which
>>>> >     *doesn't* treat 0 as an uninitialized value, and either has a
>>>> >     separate "not initialized yet" bit (but that's iffy storage wise),
>>>> >     or simply has another argument to _init() for the initial value or
>>>> >     so.
>>>> >
>>>> > 2) you don't just don't use 0 and 100 but say 1 and 100, that results in
>>>> >     basically the same behaviour, but avoids the special 0.
>>>> >
>>>> > johannes
>>>> 
>>>> I also ran into that problem in the past, and reviewing it again with a
>>>> college, I think, this is a real bug in the EWMA implementation. I try
>>>> to provide a proper patch in the next days, but actually the EWMA
>>>> handles the internal value zero, always like in the initialization,
>>>> which is wrong, e.g., for positive/negative averaged values.
>>> 
>>> Yes, it's always wrong as long as you feed it something zero, or values
>>> with different sign.
>>> 
>>> For a lot of use cases, however, that doesn't matter. Originally, it was
>>> used e.g. for signal strength averaging, average packet lengths, etc.
>>> where it really doesn't matter since you can never use 0 or values that
>>> have different sign.
>>> 
>>>> A quick research shows, this bug is since the first implementation of
>>>> the ewma in the code ...
>>>> 
>>> 
>>> Yeah, I'm aware of that, I was around for it ;-)
>>> 
>>> But see above, I'm not sure I'd even call it a bug, at least not
>>> originally with the users that we had intended.
>>> 
>>> Hence I don't know if it's really good to fix this in general - for many
>>> of these cases zero can still be treated specially (and like I mentioned
>>> in my previous email, we can even here avoid 0), and then we don't spend
>>> an extra byte (or likely 4) to hold a "first time" flag.
>>> 
>>> Dunno. Maybe it's not worth thinking about the extra memory space vs.
>>> the extra maintenance cost. But maybe at least on 64-bit we could steal
>>> a bit from the unsigned long? Not sure what all the users are..
>> We don't actually need a full bit. We can just add 1 to the internal
>> value for initialized values. How about this (completely untested):
>> https://nbd.name/p/69b00c5b
>> 
> Nice hack, but even a bit more dirty than before :-D

I disagree. With my “hack” at least the implementation will do the right thing without the API user having to worry about 0 as a value. It could use some comments to clarify the details, but I do think it makes things easier.

- Felix

Johannes Berg April 21, 2023, 9:35 a.m. UTC | #8

On Thu, 2023-04-20 at 15:15 +0200, Felix Fietkau wrote:
> 
> > On 20. Apr 2023, at 15:00, Benjamin Beichler <Benjamin.Beichler@uni-rostock.de> wrote:
> > 
> > Am 20.04.2023 um 14:22 schrieb Felix Fietkau:
> > > > On 20.04.23 12:27, Johannes Berg wrote:
> > > > On Thu, 2023-04-20 at 11:30 +0200, Benjamin Beichler wrote:
> > > > > > To me, basically, I see two ways to solve this:
> > > > > > 
> > > > > > 1) we have DECLARE_EWMA_ZERO_VALID() or something like that which
> > > > > >     *doesn't* treat 0 as an uninitialized value, and either has a
> > > > > >     separate "not initialized yet" bit (but that's iffy storage wise),
> > > > > >     or simply has another argument to _init() for the initial value or
> > > > > >     so.
> > > > > > 
> > > > > > 2) you don't just don't use 0 and 100 but say 1 and 100, that results in
> > > > > >     basically the same behaviour, but avoids the special 0.
> > > > > > 
> > > > > > johannes
> > > > > 
> > > > > I also ran into that problem in the past, and reviewing it again with a
> > > > > college, I think, this is a real bug in the EWMA implementation. I try
> > > > > to provide a proper patch in the next days, but actually the EWMA
> > > > > handles the internal value zero, always like in the initialization,
> > > > > which is wrong, e.g., for positive/negative averaged values.
> > > > 
> > > > Yes, it's always wrong as long as you feed it something zero, or values
> > > > with different sign.
> > > > 
> > > > For a lot of use cases, however, that doesn't matter. Originally, it was
> > > > used e.g. for signal strength averaging, average packet lengths, etc.
> > > > where it really doesn't matter since you can never use 0 or values that
> > > > have different sign.
> > > > 
> > > > > A quick research shows, this bug is since the first implementation of
> > > > > the ewma in the code ...
> > > > > 
> > > > 
> > > > Yeah, I'm aware of that, I was around for it ;-)
> > > > 
> > > > But see above, I'm not sure I'd even call it a bug, at least not
> > > > originally with the users that we had intended.
> > > > 
> > > > Hence I don't know if it's really good to fix this in general - for many
> > > > of these cases zero can still be treated specially (and like I mentioned
> > > > in my previous email, we can even here avoid 0), and then we don't spend
> > > > an extra byte (or likely 4) to hold a "first time" flag.
> > > > 
> > > > Dunno. Maybe it's not worth thinking about the extra memory space vs.
> > > > the extra maintenance cost. But maybe at least on 64-bit we could steal
> > > > a bit from the unsigned long? Not sure what all the users are..
> > > We don't actually need a full bit. We can just add 1 to the internal
> > > value for initialized values. How about this (completely untested):
> > > https://nbd.name/p/69b00c5b
> > > 
> > Nice hack, but even a bit more dirty than before :-D
> 
> I disagree. With my “hack” at least the implementation will do the
> right thing without the API user having to worry about 0 as a value.
> It could use some comments to clarify the details, but I do think it
> makes things easier.
> 

To me, the first question is if there are potentially any users that are
_relying_ on the current behaviour. This seems unlikely though, looking
at the ~30 users, most sound like signal/rssi, packet sizes, etc.

So let's say with the bug found here that prompted this patch, chances
are that there aren't any users that really want 0 to be special. I also
can't even really think of a reason for wanting that.


So then let's say we want to fix the existing code. I can think of these
possible ways:

 * splitting off a bit for initialized from the unsigned long
   (which at least for 64-bit should be OK since presumably most code
    using this will run on 32-bit systems too)
 * adding another value for it, e.g. making it u32 and adding a bool for
   "first value"
 * biasing the value, like Felix proposes, could be by 1 or -1 for
   example

All of these have a memory cost, of course, though the first two are
data and the second code, so for things like stations the code exists
only once and the data multiple times. On 64-bit we can probably make
the first two not have a data memory cost though.

As for biasing the value, couldn't that lead to a similar problem? It's
clearly less likely that the end of the range is reached rather than
zero, but still?

johannes

Felix Fietkau April 21, 2023, 9:53 a.m. UTC | #9

On 21.04.23 11:35, Johannes Berg wrote:
> To me, the first question is if there are potentially any users that are
> _relying_ on the current behaviour. This seems unlikely though, looking
> at the ~30 users, most sound like signal/rssi, packet sizes, etc.
> 
> So let's say with the bug found here that prompted this patch, chances
> are that there aren't any users that really want 0 to be special. I also
> can't even really think of a reason for wanting that.
> 
> 
> So then let's say we want to fix the existing code. I can think of these
> possible ways:
> 
>   * splitting off a bit for initialized from the unsigned long
>     (which at least for 64-bit should be OK since presumably most code
>      using this will run on 32-bit systems too)
>   * adding another value for it, e.g. making it u32 and adding a bool for
>     "first value"
>   * biasing the value, like Felix proposes, could be by 1 or -1 for
>     example
> 
> All of these have a memory cost, of course, though the first two are
> data and the second code, so for things like stations the code exists
> only once and the data multiple times. On 64-bit we can probably make
> the first two not have a data memory cost though.
> 
> As for biasing the value, couldn't that lead to a similar problem? It's
> clearly less likely that the end of the range is reached rather than
> zero, but still?
I don't see how it can reduce the range in any way, since the bias is 
added to the fractional part. A range reduction would seem to imply 
having an average value that's bigger than the maximum allowed shifted 
input (top bits cut off), and I don't think that's possible.

- Felix

Benjamin Beichler April 21, 2023, 10:23 a.m. UTC | #10

Am 20.04.2023 um 15:15 schrieb Felix Fietkau:
>> On 20. Apr 2023, at 15:00, Benjamin Beichler<Benjamin.Beichler@uni-rostock.de>  wrote:
>>
>> Am 20.04.2023 um 14:22 schrieb Felix Fietkau:
>>>> On 20.04.23 12:27, Johannes Berg wrote:
>>>> On Thu, 2023-04-20 at 11:30 +0200, Benjamin Beichler wrote:
>>>>>> To me, basically, I see two ways to solve this:
>>>>>>
>>>>>> 1) we have DECLARE_EWMA_ZERO_VALID() or something like that which
>>>>>>      *doesn't* treat 0 as an uninitialized value, and either has a
>>>>>>      separate "not initialized yet" bit (but that's iffy storage wise),
>>>>>>      or simply has another argument to _init() for the initial value or
>>>>>>      so.
>>>>>>
>>>>>> 2) you don't just don't use 0 and 100 but say 1 and 100, that results in
>>>>>>      basically the same behaviour, but avoids the special 0.
>>>>>>
>>>>>> johannes
>>>>> I also ran into that problem in the past, and reviewing it again with a
>>>>> college, I think, this is a real bug in the EWMA implementation. I try
>>>>> to provide a proper patch in the next days, but actually the EWMA
>>>>> handles the internal value zero, always like in the initialization,
>>>>> which is wrong, e.g., for positive/negative averaged values.
>>>> Yes, it's always wrong as long as you feed it something zero, or values
>>>> with different sign.
>>>>
>>>> For a lot of use cases, however, that doesn't matter. Originally, it was
>>>> used e.g. for signal strength averaging, average packet lengths, etc.
>>>> where it really doesn't matter since you can never use 0 or values that
>>>> have different sign.
>>>>
>>>>> A quick research shows, this bug is since the first implementation of
>>>>> the ewma in the code ...
>>>>>
>>>> Yeah, I'm aware of that, I was around for it ;-)
>>>>
>>>> But see above, I'm not sure I'd even call it a bug, at least not
>>>> originally with the users that we had intended.
>>>>
>>>> Hence I don't know if it's really good to fix this in general - for many
>>>> of these cases zero can still be treated specially (and like I mentioned
>>>> in my previous email, we can even here avoid 0), and then we don't spend
>>>> an extra byte (or likely 4) to hold a "first time" flag.
>>>>
>>>> Dunno. Maybe it's not worth thinking about the extra memory space vs.
>>>> the extra maintenance cost. But maybe at least on 64-bit we could steal
>>>> a bit from the unsigned long? Not sure what all the users are..
>>> We don't actually need a full bit. We can just add 1 to the internal
>>> value for initialized values. How about this (completely untested):
>>> https://nbd.name/p/69b00c5b
>>>
>> Nice hack, but even a bit more dirty than before :-D
> I disagree. With my “hack” at least the implementation will do the right thing without the API user having to worry about 0 as a value. It could use some comments to clarify the details, but I do think it makes things easier.

I would like to cite Johannes here:

> I mean, I guess I can see where this patch makes some sense like from a
> code point of view, this is sort of the minimal code change you could
> make to make the existing code work, but ... I'd argue you're optimising
> to the wrong metric here, "minimal code changes to fix the bug" should
> normally not be your metric, it should be "code changes that make this
> clearer and avoid the problem", or something like that?
I agree that you solve the problem, but it is not clearer. And as a user I expect to have a classical EWMA, which I initialize with a proper value and then use it with normal inputs (where I include zero and negative numbers).
 From my understanding, you shifted the problem only to one digit before zero, which may even not solve it for the mesh case here (but maybe I did not catch it right).

I may also argue, that you introduce 2 additions into the hot path to avoid a separate init function, which is mostly called once. And it even introduces non-intuitive behavior in corner cases.
However, even the mostly unused branch on the value of internal for initialization is suboptimal for the hot path and would be an indicator for a separate init function.

 From my dig into the usages, most do not care for initialization or do it explicitly with bogus add calls, which can be simply replaced.


Benjamin

Benjamin Beichler April 21, 2023, 10:34 a.m. UTC | #11

Am 21.04.2023 um 11:53 schrieb Felix Fietkau:
> ________________________________
> Achtung! Externe E-Mail: Klicken Sie erst dann auf Links und Anhänge, 
> nachdem Sie die Vertrauenswürdigkeit der Absenderadresse geprüft haben.
> ________________________________
>
> On 21.04.23 11:35, Johannes Berg wrote:
>> To me, the first question is if there are potentially any users that are
>> _relying_ on the current behaviour. This seems unlikely though, looking
>> at the ~30 users, most sound like signal/rssi, packet sizes, etc.
>>
>> So let's say with the bug found here that prompted this patch, chances
>> are that there aren't any users that really want 0 to be special. I also
>> can't even really think of a reason for wanting that.
>>
>>
>> So then let's say we want to fix the existing code. I can think of these
>> possible ways:
>>
>>   * splitting off a bit for initialized from the unsigned long
>>     (which at least for 64-bit should be OK since presumably most code
>>      using this will run on 32-bit systems too)
>>   * adding another value for it, e.g. making it u32 and adding a bool 
>> for
>>     "first value"
>>   * biasing the value, like Felix proposes, could be by 1 or -1 for
>>     example
You forgot the possibility to introduce a separate init function, which 
boils down to a shift with an assignment statement for code, and no 
further data memory cost. Even simply extending the current init 
function (which simply always set 0) would be enough.
>>
>> All of these have a memory cost, of course, though the first two are
>> data and the second code, so for things like stations the code exists
>> only once and the data multiple times. On 64-bit we can probably make
>> the first two not have a data memory cost though.
>>
>> As for biasing the value, couldn't that lead to a similar problem? It's
>> clearly less likely that the end of the range is reached rather than
>> zero, but still?
> I don't see how it can reduce the range in any way, since the bias is
> added to the fractional part. A range reduction would seem to imply
> having an average value that's bigger than the maximum allowed shifted
> input (top bits cut off), and I don't think that's possible.
>
It does not reduce the range, but it does not matter whether your 
internal state is 0 or 2^(-precision), the non-intuitive behavior stays 
the same.


> - Felix
>

Johannes Berg April 21, 2023, 11:13 a.m. UTC | #12

On Fri, 2023-04-21 at 12:34 +0200, Benjamin Beichler wrote:
> > > 
> > > So then let's say we want to fix the existing code. I can think of these
> > > possible ways:
> > > 
> > >   * splitting off a bit for initialized from the unsigned long
> > >     (which at least for 64-bit should be OK since presumably most code
> > >      using this will run on 32-bit systems too)
> > >   * adding another value for it, e.g. making it u32 and adding a bool 
> > > for
> > >     "first value"
> > >   * biasing the value, like Felix proposes, could be by 1 or -1 for
> > >     example
> You forgot the possibility to introduce a separate init function, which 
> boils down to a shift with an assignment statement for code, and no 
> further data memory cost. Even simply extending the current init 
> function (which simply always set 0) would be enough.

Sort of. Yeah I should've mentioned it, but that means you actually have
to know the first value, and track "first time usage" separately in the
user code.

Or you init to something useful at the first value, e.g. saying for
signal strength "let's assume -45dBm average if we don't know". That
doesn't seem very practical?

The behaviour of "first value inserted will init" seems sensible.

> > > As for biasing the value, couldn't that lead to a similar problem? It's
> > > clearly less likely that the end of the range is reached rather than
> > > zero, but still?
> > I don't see how it can reduce the range in any way, since the bias is
> > added to the fractional part. A range reduction would seem to imply
> > having an average value that's bigger than the maximum allowed shifted
> > input (top bits cut off), and I don't think that's possible.
> > 
> It does not reduce the range, but it does not matter whether your 
> internal state is 0 or 2^(-precision), the non-intuitive behavior stays 
> the same.

OK I was sort of handwaving ... :-)

To have a problem, basically the +1 has to overflow the value, so that
we think that the next time around we should init, rather than add.

That means the existing average has to be 0xffff'ffff (let's take 32
bits, its easier to type). Clearly that can't happen on the first time
since then the precision bits are all 0.

But I think Felix is right (thought not sure about the reasoning) and
that cannot happen, because the add calculation does a ">>weight_rcp"
shift at the end, so there are always some top bits that are non-zero,
and _weight_rcp has to be a power of two. Now, 1 is a power of two, but
that'd be really stupid, and nobody is using it ... So I think if we
prohibit 1 for that, we're fine?

Btw, Felix, shouldn't your patch have said "bool init = !internal"?

johannes

Benjamin Beichler April 21, 2023, 11:51 a.m. UTC | #13

Am 21.04.2023 um 13:13 schrieb Johannes Berg:
> ________________________________
>   Achtung! Externe E-Mail: Klicken Sie erst dann auf Links und Anhänge, nachdem Sie die Vertrauenswürdigkeit der Absenderadresse geprüft haben.
> ________________________________
>
> On Fri, 2023-04-21 at 12:34 +0200, Benjamin Beichler wrote:
>>>> So then let's say we want to fix the existing code. I can think of these
>>>> possible ways:
>>>>
>>>>    * splitting off a bit for initialized from the unsigned long
>>>>      (which at least for 64-bit should be OK since presumably most code
>>>>       using this will run on 32-bit systems too)
>>>>    * adding another value for it, e.g. making it u32 and adding a bool
>>>> for
>>>>      "first value"
>>>>    * biasing the value, like Felix proposes, could be by 1 or -1 for
>>>>      example
>> You forgot the possibility to introduce a separate init function, which
>> boils down to a shift with an assignment statement for code, and no
>> further data memory cost. Even simply extending the current init
>> function (which simply always set 0) would be enough.
> Sort of. Yeah I should've mentioned it, but that means you actually have
> to know the first value, and track "first time usage" separately in the
> user code.
>
> Or you init to something useful at the first value, e.g. saying for
> signal strength "let's assume -45dBm average if we don't know". That
> doesn't seem very practical?
I think (at least for the usages I dig into), either those values are 
already defined and applied with a bogus *_add.
>
> The behaviour of "first value inserted will init" seems sensible.

Okay, my suggestion, as this is a c-macro version, which is generated 
for each user: why not offer both? If it is unused, the compiler will 
sort that out. Then we simply keep the zero as special value for init 
(and document it) and otherwise for the more mathematically defined 
usage. But my feeling is, that most users do not care about the first 
values that much and wait to let the ewma settle to some value.

I try to do a RFC patch, with what I mean.


Benjamin

Felix Fietkau April 21, 2023, noon UTC | #14

On 21.04.23 13:13, Johannes Berg wrote:
> On Fri, 2023-04-21 at 12:34 +0200, Benjamin Beichler wrote:
>> > > 
>> > > So then let's say we want to fix the existing code. I can think of these
>> > > possible ways:
>> > > 
>> > >   * splitting off a bit for initialized from the unsigned long
>> > >     (which at least for 64-bit should be OK since presumably most code
>> > >      using this will run on 32-bit systems too)
>> > >   * adding another value for it, e.g. making it u32 and adding a bool 
>> > > for
>> > >     "first value"
>> > >   * biasing the value, like Felix proposes, could be by 1 or -1 for
>> > >     example
>> You forgot the possibility to introduce a separate init function, which 
>> boils down to a shift with an assignment statement for code, and no 
>> further data memory cost. Even simply extending the current init 
>> function (which simply always set 0) would be enough.
> 
> Sort of. Yeah I should've mentioned it, but that means you actually have
> to know the first value, and track "first time usage" separately in the
> user code.
> 
> Or you init to something useful at the first value, e.g. saying for
> signal strength "let's assume -45dBm average if we don't know". That
> doesn't seem very practical?
> 
> The behaviour of "first value inserted will init" seems sensible.
> 
>> > > As for biasing the value, couldn't that lead to a similar problem? It's
>> > > clearly less likely that the end of the range is reached rather than
>> > > zero, but still?
>> > I don't see how it can reduce the range in any way, since the bias is
>> > added to the fractional part. A range reduction would seem to imply
>> > having an average value that's bigger than the maximum allowed shifted
>> > input (top bits cut off), and I don't think that's possible.
>> > 
>> It does not reduce the range, but it does not matter whether your 
>> internal state is 0 or 2^(-precision), the non-intuitive behavior stays 
>> the same.
> 
> OK I was sort of handwaving ... :-)
> 
> To have a problem, basically the +1 has to overflow the value, so that
> we think that the next time around we should init, rather than add.
> 
> That means the existing average has to be 0xffff'ffff (let's take 32
> bits, its easier to type). Clearly that can't happen on the first time
> since then the precision bits are all 0.
> 
> But I think Felix is right (thought not sure about the reasoning) and
> that cannot happen, because the add calculation does a ">>weight_rcp"
> shift at the end, so there are always some top bits that are non-zero,
> and _weight_rcp has to be a power of two. Now, 1 is a power of two, but
> that'd be really stupid, and nobody is using it ... So I think if we
> prohibit 1 for that, we're fine?
> 
> Btw, Felix, shouldn't your patch have said "bool init = !internal"?

No, 'internal' is the correct value. The name is confusing though - I 
intended it to be short for 'initalized', but it should probably be 
renamed to 'valid' or something like that.

- Felix

[v1] wifi: mac80211: Initialize EWMA fail avg to 1

Commit Message

Comments

Patch