ftrace: Avoid potential division by zero in function_stat_show()

Message ID 20250131155346.1313580-1-kniv@yandex-team.ru (mailing list archive)
State Superseded
Series: ftrace: Avoid potential division by zero in function_stat_show()

Commit Message

Nikolay Kuratov Jan. 31, 2025, 3:53 p.m. UTC
The cases rec->counter == {0, 1} are already checked.

While x * (x - 1) * 1000 = 0 has many solutions greater than 1 both
modulo 2^32 and modulo 2^64, that is not the case for x * (x - 1) = 0,
so split the division into two.

This is not scary in practice because the mod 2^64 solutions are huge
and the minimal mod 2^32 solution is a 30-bit number.
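
For illustration, a quick user-space check of the wraparound claim (plain
C, not kernel code): x = 2^29 is the minimal value greater than 1 for
which the old combined 32-bit denominator wraps to exactly zero, while
the split denominator does not.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* minimal 32-bit solution of x * (x - 1) * 1000 == 0, a 30-bit number */
	uint32_t x = UINT32_C(1) << 29;
	uint32_t combined = x * (x - 1) * UINT32_C(1000);	/* old denominator */
	uint32_t split = x * (x - 1);				/* after the split */

	printf("x * (x - 1) * 1000 mod 2^32 = %" PRIu32 "\n", combined);	/* 0 */
	printf("x * (x - 1)        mod 2^32 = %" PRIu32 "\n", split);		/* non-zero */

	return 0;
}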

Cc: stable@vger.kernel.org
Fixes: e31f7939c1c27 ("ftrace: Avoid potential division by zero in function profiler")
Signed-off-by: Nikolay Kuratov <kniv@yandex-team.ru>
---
 kernel/trace/ftrace.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Steven Rostedt Feb. 3, 2025, 3:06 p.m. UTC | #1
On Fri, 31 Jan 2025 18:53:46 +0300
Nikolay Kuratov <kniv@yandex-team.ru> wrote:

> --- a/kernel/trace/ftrace.c
> +++ b/kernel/trace/ftrace.c
> @@ -570,12 +570,12 @@ static int function_stat_show(struct seq_file *m, void *v)
>  		stddev = rec->counter * rec->time_squared -
>  			 rec->time * rec->time;
>  
> +		stddev = div64_ul(stddev, rec->counter * (rec->counter - 1));
>  		/*
>  		 * Divide only 1000 for ns^2 -> us^2 conversion.
>  		 * trace_print_graph_duration will divide 1000 again.
>  		 */
> -		stddev = div64_ul(stddev,
> -				  rec->counter * (rec->counter - 1) * 1000);
> +		stddev = div64_ul(stddev, 1000);
>  	}
>  

Why make it more complex than it needs to be? We can simply have:

	if (rec->counter <= 1) {
		stddev = 0;
	} else {
		unsigned long counter = rec->counter * (rec->counter - 1) * 1000;

		stddev = rec->counter * rec->time_squared -
			 rec->time * rec->time;

		/* Check for overflow */
		stddev = counter > rec->counter ? div64_ul(stddev, counter) : 0;
	}

And be done with it.

I'll make the change and give you a "Reported-by".

Thanks,

-- Steve
Nikolay Kuratov Feb. 4, 2025, 7:43 a.m. UTC | #2
Thank you for the review!

The `counter > rec->counter` check does not protect us from overflows,
so it could mislead, especially with the comment included.

I think we should either eliminate overflows or only protect against zero
division. To prevent overflows it would be wise to split the division into
three and forget about it, but in the 32-bit case overflows will still occur.

As for zero division protection, maybe this variant is even more concise?

#ifdef CONFIG_FUNCTION_GRAPH_TRACER
	seq_puts(m, "    ");

	stddev = 0;
	unsigned long denom = rec->counter * (rec->counter - 1) * 1000;
	if (denom) {
		stddev = rec->counter * rec->time_squared -
			 rec->time * rec->time;
		stddev = div64_ul(stddev, denom);
	}

	trace_seq_init(&s);
	trace_print_graph_duration(rec->time, &s);

--
This is my last attempt to get this email into lore with git-send-email,
since I had no luck with mutt. Sorry if you see this message twice or
thrice in your inbox. It won't happen again.
Steven Rostedt Feb. 4, 2025, 10:20 p.m. UTC | #3
On Tue,  4 Feb 2025 10:43:02 +0300
Nikolay Kuratov <kniv@yandex-team.ru> wrote:

> Thank you for the review!
> 
> The `counter > rec->counter` check does not protect us from overflows,
> so it could mislead, especially with the comment included.

Ah you're right, as there's a big multiplication there.

> 
> I think we should either eliminate overflows or only protect against zero
> division. To prevent overflows it would be wise to split the division into
> three and forget about it, but in the 32-bit case overflows will still occur.
> 
> As for zero division protection, maybe this variant is even more concise?
> 
> #ifdef CONFIG_FUNCTION_GRAPH_TRACER
> 	seq_puts(m, "    ");
> 
> 	stddev = 0;
> 	unsigned long denom = rec->counter * (rec->counter - 1) * 1000;

Actually, if the denom overflows, then this is bogus anyway. We should stop
calculating if it overflows. That would be if:

  rec->counter * (rec->counter - 1) * 1000 >= 2^32 or >= 2^64

depending on the architecture. So we can calculate what value rec->counter
has to be for that to happen. Using the quadratic equation (haven't used
this since college!), we can figure that out.

  x = rec->counter

  x * (x - 1) * 1000 = (2^32 - 1) // use minus 1 just to be sure
  x * (x - 1) = (2^32 - 1) / 1000
  x^2 - x = (2^32 - 1) / 1000
  x^2 - x - (2^32 - 1) / 1000 = 0

 x = (-b +/- sqrt(b^2 - 4ac)) / 2a

 a = 1
 b = -1
 c = -(2^32 - 1) / 1000 = -4294967.295

 x = (-1 +/- sqrt((-1)^2 - 4 * -4294967.295)) / 2

 x = 2071.930 for 32bit

For 64bit we have

 c = -(2^64 - 1) / 1000 = -18446744073709551.615

That makes

  x = 135818790.812

We should stop calculating stddev when rec->counter > 2071 on 32 bit
and > 135818790 on 64 bit!

Feel free to check my math.
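
For instance, a quick user-space sketch along these lines (assuming the
unsigned __int128 extension of gcc/clang; not kernel code) can double-check
the thresholds by searching for the largest counter whose denominator still
fits in 32 and 64 bits:

#include <stdint.h>
#include <stdio.h>

/* largest x such that x * (x - 1) * 1000 still fits within 'limit' */
static uint64_t max_counter(unsigned __int128 limit)
{
	uint64_t lo = 2, hi = (uint64_t)1 << 40, ans = 2;

	while (lo <= hi) {
		uint64_t mid = lo + (hi - lo) / 2;
		unsigned __int128 v = (unsigned __int128)mid * (mid - 1) * 1000;

		if (v <= limit) {
			ans = mid;
			lo = mid + 1;
		} else {
			hi = mid - 1;
		}
	}
	return ans;
}

int main(void)
{
	printf("max safe counter, 32-bit: %llu\n",
	       (unsigned long long)max_counter(UINT32_MAX));
	printf("max safe counter, 64-bit: %llu\n",
	       (unsigned long long)max_counter(UINT64_MAX));
	return 0;
}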

The below patch should fix this.

-- Steve


> 	if (denom) {
> 		stddev = rec->counter * rec->time_squared -
> 			 rec->time * rec->time;
> 		stddev = div64_ul(stddev, denom);
> 	}
> 
> 	trace_seq_init(&s);
> 	trace_print_graph_duration(rec->time, &s);
> 

diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 2e113f8b13a2..ae90443f32af 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -419,10 +419,42 @@ struct ftrace_profile {
 	unsigned long			counter;
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
 	unsigned long long		time;
+	unsigned long long		stddev_time;
 	unsigned long long		time_squared;
 #endif
 };
 
+/*
+ * The calculation for stddev will overflow when the counter
+ * algorithm overflows the long size:
+ *
+ * rec->counter * (rec->counter - 1) * 1000 >= 2^BITS_PER_LONG
+ *
+ * Using the quadratic equation: x = (-b +/- sqrt(b^2 - 4ac)) / 2a
+ * we can figure out what the max rec->counter is before it
+ * overflows.
+ *
+ * x = rec->counter
+ * x * (x - 1) * 1000 = 2^BITS_PER_LONG - 1 // -1 for rounding
+ * x * (x - 1) = (2^BITS_PER_LONG - 1) / 1000
+ * x^2 - x = (2^BITS_PER_LONG - 1) / 1000
+ * x^2 - x - (2^BITS_PER_LONG - 1) / 1000 = 0
+ *
+ * a = 1
+ * b = -1
+ * c = -(2^BITS_PER_LONG - 1) / 1000
+ *
+ * x = (1 +/- sqrt(1 - 4 * -(2^BITS_PER_LONG - 1) / 1000)) / 2
+ *
+ * For 32bit that is: x = 2072.930 (or 2072)
+ * For 64bit that is: x = 135818791.812 (or 135818791)
+ */
+#ifdef CONFIG_64BIT
+# define MAX_STDDEV_COUNTER	135818791
+#else
+# define MAX_STDDEV_COUNTER	2072
+#endif
+
 struct ftrace_profile_page {
 	struct ftrace_profile_page	*next;
 	unsigned long			index;
@@ -566,12 +598,16 @@ static int function_stat_show(struct seq_file *m, void *v)
 	if (rec->counter <= 1)
 		stddev = 0;
 	else {
+		unsigned long counter;
+
+		counter = rec->counter < MAX_STDDEV_COUNTER ? rec->counter :
+			MAX_STDDEV_COUNTER - 1;
 		/*
 		 * Apply Welford's method:
 		 * s^2 = 1 / (n * (n-1)) * (n * \Sum (x_i)^2 - (\Sum x_i)^2)
 		 */
-		stddev = rec->counter * rec->time_squared -
-			 rec->time * rec->time;
+		stddev = counter * rec->time_squared -
+			 rec->stddev_time * rec->stddev_time;
 
 		/*
 		 * Divide only 1000 for ns^2 -> us^2 conversion.
@@ -894,7 +930,19 @@ static void profile_graph_return(struct ftrace_graph_ret *trace,
 	rec = ftrace_find_profiled_func(stat, trace->func);
 	if (rec) {
 		rec->time += calltime;
-		rec->time_squared += calltime * calltime;
+		/*
+		 * This is used to calculate stddev, but the calculation
+		 * uses Apply Welford's method that will multiply
+		 * rec->counter * (rec->counter - 1) and also divide from
+		 * that. The calculation will be bogus if that value overflows
+		 * the long size it is stored in. Stop the calculation
+		 * when the counter hits the point it will overflow.
+		 * The stddev shouldn't change much after that anyway.
+		 */
+		if (rec->counter < MAX_STDDEV_COUNTER) {
+			rec->stddev_time = rec->time;
+			rec->time_squared += calltime * calltime;
+		}
 	}
 
  out:
Steven Rostedt Feb. 5, 2025, 12:48 a.m. UTC | #4
On Tue, 4 Feb 2025 17:20:45 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> x = rec->counter
> 
>   x * (x - 1) * 1000 = (2^32 - 1) // use minus 1 just to be sure
>   x * (x - 1) = (2^32 - 1) / 1000
>   x^2 - x = (2^32 - 1) / 1000
>   x^2 - x - (2^32 - 1) / 1000 = 0
> 
>  x = (-b +/- sqrt(b^2 - 4ac)) / 2a
> 
>  a = 1
>  b = -1
>  c = -(2^32 - 1) / 1000 = -4294967.295
> 
>  x = (-1 +/- sqrt((-1)^2 - 4 * -4294967.295)) / 2

And I did make a mistake, as b = -1, and the above should start with -(-1)
or 1 :-p

> 
>  x = 2071.930 for 32bit

This should be: 2072.930

> 
> For 64bit we have
> 
>  c = -(2^64 - 1) / 1000 = -18446744073709551.615
> 
> That makes
> 
>   x = 135818790.812

And this needs to be: 135818791.812

> +/*
> + * The calculation for stddev will overflow when the counter
> + * algorithm overflows the long size:
> + *
> + * rec->counter * (rec->counter - 1) * 1000 >= 2^BITS_PER_LONG
> + *
> + * Using the quadratic equation: x = (-b +/- sqrt(b^2 - 4ac)) / 2a
> + * we can figure out what the max rec->counter is before it
> + * overflows.
> + *
> + * x = rec->counter
> + * x * (x - 1) * 1000 = 2^BITS_PER_LONG - 1 // -1 for rounding
> + * x * (x - 1) = (2^BITS_PER_LONG - 1) / 1000
> + * x^2 - x = (2^BITS_PER_LONG - 1) / 1000
> + * x^2 - x - (2^BITS_PER_LONG - 1) / 1000 = 0
> + *
> + * a = 1
> + * b = -1
> + * c = -(2^BITS_PER_LONG - 1) / 1000
> + *
> + * x = (1 +/- sqrt(1 - 4 * -(2^BITS_PER_LONG - 1) / 1000)) / 2
> + *
> + * For 32bit that is: x = 2072.930 (or 2072)
> + * For 64bit that is: x = 135818791.812 (or 135818791)
> + */


But in the patch, I redid the numbers and did not copy from the above, and
here I did it correctly.

-- Steve
Nikolay Kuratov Feb. 5, 2025, 10:39 a.m. UTC | #5
Your approach should prevent both zero division and overflow in the
denominator. We should also consider that the numerator

> stddev = counter * rec->time_squared - rec->stddev_time * rec->stddev_time;

may overflow too. Consider the case of 1000 ns per call. On each function
execution time_squared would gain an additional 10^6, so for 32-bit and
MAX_STDDEV_COUNTER = 2072 we would have
counter * rec->time_squared = 2071 * (2071 * 10^6) = 4289041000000,
a 42-bit number.

It looks like overflow prevention here is not viable with macro constants,
since function execution time is dynamic in nature.
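
A quick user-space check of the arithmetic above (1000 ns per call,
counter capped at 2071; plain C, not kernel code):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t per_call_ns = 1000;
	uint64_t counter = 2071;			/* MAX_STDDEV_COUNTER - 1 */
	uint64_t time_squared = counter * per_call_ns * per_call_ns;
	uint64_t numerator = counter * time_squared;	/* 4289041000000 */
	int bits = 0;

	for (uint64_t v = numerator; v; v >>= 1)
		bits++;

	printf("numerator = %llu (%d bits, does not fit in a 32-bit long)\n",
	       (unsigned long long)numerator, bits);
	return 0;
}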

My suggestion is to rewrite the formula so that we perform the division
earlier and avoid overflow checks as much as possible. Note that doing
divisions earlier may result in precision loss, but I think that is
acceptable since rec->time has nanosecond precision. Also, this allows us
not to care about counter overflow at all, because the expression
rec->time * rec->time will always overflow earlier than both counter and
rec->time_square_sum. I renamed time_squared to time_square_sum just to
clarify.

diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 728ecda6e8d4..91e7185dd5de 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -419,7 +419,8 @@ struct ftrace_profile {
 	unsigned long			counter;
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
 	unsigned long long		time;
-	unsigned long long		time_squared;
+	unsigned long long		time_square_sum;
+	unsigned long long		saved_stddev;
 #endif
 };
 
@@ -560,22 +561,24 @@ static int function_stat_show(struct seq_file *m, void *v)
 	seq_puts(m, "    ");
 
 	/* Sample standard deviation (s^2) */
-	if (rec->counter <= 1)
+	if (rec->counter <= 1 || !rec->time)
 		stddev = 0;
+	else if (div64_ul(rec->time * rec->time, rec->time) != rec->time)
+		stddev = rec->saved_stddev;
 	else {
 		/*
-		 * Apply Welford's method:
-		 * s^2 = 1 / (n * (n-1)) * (n * \Sum (x_i)^2 - (\Sum x_i)^2)
+		 * A formula for calculating the variance is:
+		 * s^2 = (\Sum (x_i)^2 - (\Sum x_i)^2 / n) / (n - 1)
 		 */
-		stddev = rec->counter * rec->time_squared -
-			 rec->time * rec->time;
-
+		stddev = rec->time_square_sum -
+			 div64_ul(rec->time * rec->time, rec->counter);
+		stddev = div64_ul(stddev, rec->counter - 1);
 		/*
 		 * Divide only 1000 for ns^2 -> us^2 conversion.
 		 * trace_print_graph_duration will divide 1000 again.
 		 */
-		stddev = div64_ul(stddev,
-				  rec->counter * (rec->counter - 1) * 1000);
+		stddev = div64_ul(stddev, 1000);
+		rec->saved_stddev = stddev;
 	}
 
 	trace_seq_init(&s);
@@ -888,7 +891,7 @@ static void profile_graph_return(struct ftrace_graph_ret *trace,
 	rec = ftrace_find_profiled_func(stat, trace->func);
 	if (rec) {
 		rec->time += calltime;
-		rec->time_squared += calltime * calltime;
+		rec->time_square_sum += calltime * calltime;
 	}
 }
Nikolay Kuratov Feb. 5, 2025, 11:21 a.m. UTC | #6
Oops, this still does not fix an overflow because of

> rec->time_square_sum += calltime * calltime;

We can't eliminate overflows easily without both using u64 and converting
ns to us prior to any squaring. I will send an appropriate patch later
if needed. Maybe we're way too paranoid about overflows and really only
want not to divide by zero?
Steven Rostedt Feb. 5, 2025, 2:50 p.m. UTC | #7
On Wed,  5 Feb 2025 14:21:54 +0300
Nikolay Kuratov <kniv@yandex-team.ru> wrote:

> Oops, this still does not fix an overflow because of
> 
> > rec->time_square_sum += calltime * calltime;  
> 
> We can't eliminate overflows easily without both using u64 and converting
> ns to us prior to any squaring. I will send an appropriate patch later
> if needed. Maybe we're way too paranoid about overflows and really only
> want not to divide by zero?

Well, the divide by zero will cause a kernel crash (very bad!) whereas the
overflow just causes corrupted data (sorta bad).

But yeah, I've noticed a lot of bogus data in the stddev and pretty much
ignore it. Perhaps it would have been better not to add it to begin with :-p

The divide by zero scenario definitely needs to be fixed, and I may take my
last patch to fix it as it also stops some of the overflows.

But we can add another patch to stop the other overflows with some runtime
checks too.

-- Steve
Nikolay Kuratov Feb. 6, 2025, 8:57 a.m. UTC | #8
I would suggest just fixing the zerodiv for now, because IMO the patch
fixing overflows properly on the stddev path may or may not use macro
constants at all. Also, the ns^2 overflow will always happen first. I don't
know in advance how that patch will look in the end, and I'm not even sure
it's viable. Steps to consider:
0) Infer some assumptions from the rest of the tracing code, like `nanoseconds
are stored in unsigned long, so on a 32-bit machine a function should not
execute for more than ~2 s anyway`
1) unsigned long -> u64 conversion
2) ns^2 -> us^2 if possible, with inevitable precision loss (how much?)
3) Compute stddev using some numerical trick. Right now the comment about
Welford's method is a bit misleading because it's a plain variance formula.
But maybe it's a hint that we should convert to that method and the problems
are gone? ;) A rough sketch of that follows below.
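
A rough user-space sketch of what a Welford-style online update could look
like (the struct and field names are illustrative, not the actual
ftrace_profile layout, and it uses floating point, which a kernel version
could not):

#include <stdint.h>
#include <stdio.h>

struct profile_sketch {
	uint64_t counter;	/* number of samples */
	double mean;		/* running mean of call time, in ns */
	double m2;		/* running sum of squared deviations from the mean */
};

static void profile_add_sample(struct profile_sketch *p, uint64_t calltime_ns)
{
	double delta, delta2;

	p->counter++;
	delta = (double)calltime_ns - p->mean;
	p->mean += delta / (double)p->counter;
	delta2 = (double)calltime_ns - p->mean;
	p->m2 += delta * delta2;
}

static double profile_variance_us2(const struct profile_sketch *p)
{
	if (p->counter <= 1)
		return 0.0;
	/* sample variance in ns^2, converted to us^2 */
	return p->m2 / (double)(p->counter - 1) / 1e6;
}

int main(void)
{
	struct profile_sketch p = { 0 };
	uint64_t samples[] = { 900, 1100, 1000, 1200, 800 };

	for (unsigned int i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
		profile_add_sample(&p, samples[i]);

	/* prints 0.025 us^2 (25000 ns^2 sample variance for the data above) */
	printf("sample variance: %.3f us^2\n", profile_variance_us2(&p));
	return 0;
}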

I'll send a patch fixing the zero division and simplifying the code a bit
as a follow-up. It's up to you to choose what to apply. I'm OK with any
patch fixing the zerodiv, and I will look into the overflow problem.

Again, thank you very much for the review and your time.

Patch

diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 728ecda6e8d4..e1c05c4c29c2 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -570,12 +570,12 @@  static int function_stat_show(struct seq_file *m, void *v)
 		stddev = rec->counter * rec->time_squared -
 			 rec->time * rec->time;
 
+		stddev = div64_ul(stddev, rec->counter * (rec->counter - 1));
 		/*
 		 * Divide only 1000 for ns^2 -> us^2 conversion.
 		 * trace_print_graph_duration will divide 1000 again.
 		 */
-		stddev = div64_ul(stddev,
-				  rec->counter * (rec->counter - 1) * 1000);
+		stddev = div64_ul(stddev, 1000);
 	}
 
 	trace_seq_init(&s);