Message ID | 20250131155346.1313580-1-kniv@yandex-team.ru (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | ftrace: Avoid potential division by zero in function_stat_show() | expand |
On Fri, 31 Jan 2025 18:53:46 +0300 Nikolay Kuratov <kniv@yandex-team.ru> wrote: > --- a/kernel/trace/ftrace.c > +++ b/kernel/trace/ftrace.c > @@ -570,12 +570,12 @@ static int function_stat_show(struct seq_file *m, void *v) > stddev = rec->counter * rec->time_squared - > rec->time * rec->time; > > + stddev = div64_ul(stddev, rec->counter * (rec->counter - 1)); > /* > * Divide only 1000 for ns^2 -> us^2 conversion. > * trace_print_graph_duration will divide 1000 again. > */ > - stddev = div64_ul(stddev, > - rec->counter * (rec->counter - 1) * 1000); > + stddev = div64_ul(stddev, 1000); > } > Why make it more complex than it needs to be? We can simply have: if (rec->counter <= 1) { stddev = 0; } else { unsigned long counter = rec->counter * (rec->counter - 1) * 1000; stddev = rec->counter * rec->time_squared - rec->time * rec->time; /* Check for overflow */ stddev = counter > rec->counter ? div64_ul(stddev, counter) : 0; } And be done with it. I'll make the change and give you a "Reported-by". Thanks, -- Steve
Thank you for the review! `counter > rec->counter` check does not protect us from overflows, so this could mislead especially with the comment included. I think we should either eliminate overflows or only protect from zero division. To prevent overflows it is wise to split the division into three and forget about it. But in 32-bit case overflows will still occur. As for zero division protection, maybe this variant even more concise? #ifdef CONFIG_FUNCTION_GRAPH_TRACER seq_puts(m, " "); stddev = 0; unsigned long denom = rec->counter * (rec->counter - 1) * 1000; if (denom) { stddev = rec->counter * rec->time_squared - rec->time * rec->time; stddev = div64_ul(stddev, denom); } trace_seq_init(&s); trace_print_graph_duration(rec->time, &s); -- My last attempt to get this email into lore with git-send-email, since no luck with mutt. Sorry if you see this message twice or thrice in your inbox. It won't happen again.
On Tue, 4 Feb 2025 10:43:02 +0300 Nikolay Kuratov <kniv@yandex-team.ru> wrote: > Thank you for the review! > > `counter > rec->counter` check does not protect us from overflows, > so this could mislead especially with the comment included. Ah you're right, as there's a big multiplication there. > > I think we should either eliminate overflows or only protect from zero > division. To prevent overflows it is wise to split the division into > three and forget about it. But in 32-bit case overflows will still occur. > > As for zero division protection, maybe this variant even more concise? > > #ifdef CONFIG_FUNCTION_GRAPH_TRACER > seq_puts(m, " "); > > stddev = 0; > unsigned long denom = rec->counter * (rec->counter - 1) * 1000; Actually, if the denom overflows, then this is bogus anyway. We should stop calculating if it overflows. That would be if: rec->counter * (rec->counter - 1) * 1000 >= 2^32 or >= 2^64 depending on the architecture. So we can calculate what value rec->counter has to be for that to happen. Using the quadratic equation (haven't used this since college!), we can figure that out. x = rec->counter x * (x - 1) * 1000 = (2^32 - 1) // use minus 1 just to be sure x * (x - 1) = (2^32 - 1) / 1000 x^2 - x = (2^32 - 1) / 1000 x^2 - x - (2^32 - 1) / 1000 = 0 x = (-b +/- sqrt(b^2 - 4ac)) / 2a a = 1 b = -1 c = -(2^32 - 1) / 1000 = -4294967.295 x = (-1 +/- sqrt((-1)^2 - 4 * -4294967.295)) / 2 x = 2071.930 for 32bit For 64bit we have c = -(2^64 - 1) / 1000 = -18446744073709551.615 That makes x = 135818790.812 We should stop calculating stddev when rec->counter > 2071 on 32 bit and > 135818790 on 64 bit! Feel free to check my math. The below patch should fix this. -- Steve > if (denom) { > stddev = rec->counter * rec->time_squared - > rec->time * rec->time; > stddev = div64_ul(stddev, denom); > } > > trace_seq_init(&s); > trace_print_graph_duration(rec->time, &s); > diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index 2e113f8b13a2..ae90443f32af 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -419,10 +419,42 @@ struct ftrace_profile { unsigned long counter; #ifdef CONFIG_FUNCTION_GRAPH_TRACER unsigned long long time; + unsigned long long stddev_time; unsigned long long time_squared; #endif }; +/* + * The calculation for stddev will overflow when the counter + * algorithm overflows the long size: + * + * rec->counter * (rec->counter - 1) * 1000 >= 2^BITS_PER_LONG + * + * Using the quadratic equation: x = (-b +/- sqrt(b^2 - 4ac)) / 2a + * we can figure out what the max rec->counter is before it + * overflows. + * + * x = rec->counter + * x * (x - 1) * 1000 = 2^BITS_PER_LONG - 1 // -1 for rounding + * x * (x - 1) = (2^BITS_PER_LONG - 1) / 1000 + * x^2 - x = (2^BITS_PER_LONG - 1) / 1000 + * x^2 - x - (2^BITS_PER_LONG - 1) / 1000 = 0 + * + * a = 1 + * b = -1 + * c = -(2^BITS_PER_LONG - 1) / 1000 + * + * x = (1 +/- sqrt(1 - 4 * -(2^BITS_PER_LONG - 1) / 1000)) / 2 + * + * For 32bit that is: x = 2072.930 (or 2072) + * For 64bit that is: x = 135818791.812 (or 135818791) + */ +#ifdef CONFIG_64BIT +# define MAX_STDDEV_COUNTER 135818791 +#else +# define MAX_STDDEV_COUNTER 2072 +#endif + struct ftrace_profile_page { struct ftrace_profile_page *next; unsigned long index; @@ -566,12 +598,16 @@ static int function_stat_show(struct seq_file *m, void *v) if (rec->counter <= 1) stddev = 0; else { + unsigned long counter; + + counter = rec->counter < MAX_STDDEV_COUNTER ? rec->counter : + MAX_STDDEV_COUNTER - 1; /* * Apply Welford's method: * s^2 = 1 / (n * (n-1)) * (n * \Sum (x_i)^2 - (\Sum x_i)^2) */ - stddev = rec->counter * rec->time_squared - - rec->time * rec->time; + stddev = counter * rec->time_squared - + rec->stddev_time * rec->stddev_time; /* * Divide only 1000 for ns^2 -> us^2 conversion. @@ -894,7 +930,19 @@ static void profile_graph_return(struct ftrace_graph_ret *trace, rec = ftrace_find_profiled_func(stat, trace->func); if (rec) { rec->time += calltime; - rec->time_squared += calltime * calltime; + /* + * This is used to calculate stddev, but the calculation + * uses Apply Welford's method that will multiply + * rec->counter * (rec->counter - 1) and also divide from + * that. The calculation will be bogus if that value overflows + * the long size it is stored in. Stop the calculation + * when the counter hits the point it will overflow. + * The stddev shouldn't change much after that anyway. + */ + if (rec->counter < MAX_STDDEV_COUNTER) { + rec->stddev_time = rec->time; + rec->time_squared += calltime * calltime; + } } out:
On Tue, 4 Feb 2025 17:20:45 -0500 Steven Rostedt <rostedt@goodmis.org> wrote: > x = rec->counter > > x * (x - 1) * 1000 = (2^32 - 1) // use minus 1 just to be sure > x * (x - 1) = (2^32 - 1) / 1000 > x^2 - x = (2^32 - 1) / 1000 > x^2 - x - (2^32 - 1) / 1000 = 0 > > x = (-b +/- sqrt(b^2 - 4ac)) / 2a > > a = 1 > b = -1 > c = -(2^32 - 1) / 1000 = -4294967.295 > > x = (-1 +/- sqrt((-1)^2 - 4 * -4294967.295)) / 2 And I did make a mistake, as b = -1, and the above should start with -(-1) or 1 :-p > > x = 2071.930 for 32bit This should be: 2072.930 > > For 64bit we have > > c = -(2^64 - 1) / 1000 = -18446744073709551.615 > > That makes > > x = 135818790.812 And this needs to be: 135818791.812 > +/* > + * The calculation for stddev will overflow when the counter > + * algorithm overflows the long size: > + * > + * rec->counter * (rec->counter - 1) * 1000 >= 2^BITS_PER_LONG > + * > + * Using the quadratic equation: x = (-b +/- sqrt(b^2 - 4ac)) / 2a > + * we can figure out what the max rec->counter is before it > + * overflows. > + * > + * x = rec->counter > + * x * (x - 1) * 1000 = 2^BITS_PER_LONG - 1 // -1 for rounding > + * x * (x - 1) = (2^BITS_PER_LONG - 1) / 1000 > + * x^2 - x = (2^BITS_PER_LONG - 1) / 1000 > + * x^2 - x - (2^BITS_PER_LONG - 1) / 1000 = 0 > + * > + * a = 1 > + * b = -1 > + * c = -(2^BITS_PER_LONG - 1) / 1000 > + * > + * x = (1 +/- sqrt(1 - 4 * -(2^BITS_PER_LONG - 1) / 1000)) / 2 > + * > + * For 32bit that is: x = 2072.930 (or 2072) > + * For 64bit that is: x = 135818791.812 (or 135818791) > + */ But in the patch, I redid the numbers and did not copy from the above, and here I did it correctly. -- Steve
Your approach should prevent both zero division and overflow in the
denominator. We also should consider that the numerator
> stddev = counter * rec->time_squared - rec->stddev_time * rec->stddev_time;
may overflow too. Suppose 1000 ns per call case. On each function execution
time_squared would get additional 10^6. So for 32-bit and
MAX_STDDEV_COUNTER = 2072 we would have
counter * rec->time_squared = 2071 * (2071 * 10^6) = 4289041000000,
a 42-bit number.
It looks like overflow prevention here is not viable with macro constants,
since function execution time is dynamic in nature.
My suggestion is to rewrite formula such as we perform division earlier
and avoid overflow checks as much as possible. Note that doing divisions
earlier may result in precision loss, but I think that is acceptable since
rec->time is nanoseconds precision. Also this allows us to not care about
counter overflow at all, because expression rec->time * rec->time will always
overflow earlier than both counter and rec->time_square_sum. Renamed
time_squared into time_square_sum just to clarify.
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 728ecda6e8d4..91e7185dd5de 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -419,7 +419,8 @@ struct ftrace_profile {
unsigned long counter;
#ifdef CONFIG_FUNCTION_GRAPH_TRACER
unsigned long long time;
- unsigned long long time_squared;
+ unsigned long long time_square_sum;
+ unsigned long long saved_stddev;
#endif
};
@@ -560,22 +561,24 @@ static int function_stat_show(struct seq_file *m, void *v)
seq_puts(m, " ");
/* Sample standard deviation (s^2) */
- if (rec->counter <= 1)
+ if (rec->counter <= 1 || !rec->time)
stddev = 0;
+ else if (div64_ul(rec->time * rec->time, rec->time) != rec->time)
+ stddev = rec->saved_stddev;
else {
/*
- * Apply Welford's method:
- * s^2 = 1 / (n * (n-1)) * (n * \Sum (x_i)^2 - (\Sum x_i)^2)
+ * A formula for calculating the variance is:
+ * s^2 = (\Sum (x_i)^2 - (\Sum x_i)^2 / n) / (n - 1)
*/
- stddev = rec->counter * rec->time_squared -
- rec->time * rec->time;
-
+ stddev = rec->time_square_sum -
+ div64_ul(rec->time * rec->time, rec->counter);
+ stddev = div64_ul(stddev, rec->counter - 1);
/*
* Divide only 1000 for ns^2 -> us^2 conversion.
* trace_print_graph_duration will divide 1000 again.
*/
- stddev = div64_ul(stddev,
- rec->counter * (rec->counter - 1) * 1000);
+ stddev = div64_ul(stddev, 1000);
+ rec->saved_stddev = stddev;
}
trace_seq_init(&s);
@@ -888,7 +891,7 @@ static void profile_graph_return(struct ftrace_graph_ret *trace,
rec = ftrace_find_profiled_func(stat, trace->func);
if (rec) {
rec->time += calltime;
- rec->time_squared += calltime * calltime;
+ rec->time_square_sum += calltime * calltime;
}
}
Oops, this still does not fix an overflow because of
> rec->time_square_sum += calltime * calltime;
We can't eliminate overflows easily without both using u64 and converting
ns to us prior to any squaring. I will send an appropriate patch later
if needed. Maybe we're a way too paranoid about overflows and really want
only not to divide by zero?
On Wed, 5 Feb 2025 14:21:54 +0300 Nikolay Kuratov <kniv@yandex-team.ru> wrote: > Oops, this still does not fix an overflow because of > > > rec->time_square_sum += calltime * calltime; > > We can't eliminate overflows easily without both using u64 and converting > ns to us prior to any squaring. I will send an appropriate patch later > if needed. Maybe we're a way too paranoid about overflows and really want > only not to divide by zero? Well, the divide by zero will cause a kernel crash (very bad!) whereas the overflow just causes corrupted data (sorta bad). But yeah, I've noticed a lot of bogus data in the stddev and pretty much ignore it. Perhaps it would have been better not to add it to begin with :-p The divide by zero scenario definitely needs to be fixed, and I may take my last patch to fix it as it also stops some of the overflows. But we can add another patch to stop the other overflows with some runtime checks too. -- Steve
I would suggest just fixing zerodiv for now because IMO the patch fixing overflows properly on stddev path may or may not use macroconstants at all. Also ns^2 overflow will always happen first. I don't know in advance how this patch will look at the end and even not sure it's viable. Steps to consider: 0) Infer some assumptions from the rest of tracing code like `nanoseconds are stored in unsigned long, so on 32-bit machine function should not execute more than ~ 2 s anyway` 1) unsigned long -> u64 conversion 2) ns^2 -> us^2 if possible, inevitable precision loss (how much?) 3) Compute stddev using some numerical trick. Now comment about Welford's method is a bit misleading because its a plain variance formula. But maybe it's a hint that we should convert to that method and problems are gone? ;) I'll send the patch fixing zero division and simplifying code a bit as a follow-up. It's up to you to choose what to apply. I'm Ok with any patch fixing zerodiv and will look into overflow problem. Again, thank you very much for the review and your time.
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index 728ecda6e8d4..e1c05c4c29c2 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -570,12 +570,12 @@ static int function_stat_show(struct seq_file *m, void *v) stddev = rec->counter * rec->time_squared - rec->time * rec->time; + stddev = div64_ul(stddev, rec->counter * (rec->counter - 1)); /* * Divide only 1000 for ns^2 -> us^2 conversion. * trace_print_graph_duration will divide 1000 again. */ - stddev = div64_ul(stddev, - rec->counter * (rec->counter - 1) * 1000); + stddev = div64_ul(stddev, 1000); } trace_seq_init(&s);
Cases rec->counter == {0, 1} are checked already. While x * (x - 1) * 1000 = 0 have many solutions greater than 1 for both modulo 2^32 and 2^64, that is not the case for x * (x - 1) = 0, so split division into two. It is not scary in practice because mod 2^64 solutions are huge and minimal mod 2^32 solution is 30-bit number. Cc: stable@vger.kernel.org Fixes: e31f7939c1c27 ("ftrace: Avoid potential division by zero in function profiler") Signed-off-by: Nikolay Kuratov <kniv@yandex-team.ru> --- kernel/trace/ftrace.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)