diff mbox series

[v4,6/8] tracing/histogram: Optimize division by a power of 2

Message ID 20211025200852.3002369-7-kaleshsingh@google.com (mailing list archive)
State New
Headers show
Series tracing: Extend histogram triggers expression parsing | expand

Commit Message

Kalesh Singh Oct. 25, 2021, 8:08 p.m. UTC
The division is a slow operation. If the divisor is a power of 2, use a
shift instead.

Results were obtained using Android's version of perf (simpleperf[1]) as
described below:

1. hist_field_div() is modified to call 2 test functions:
   test_hist_field_div_[not]_optimized(); passing them the
   same args. Use noinline and volatile to ensure these are
   not optimized out by the compiler.
2. Create a hist event trigger that uses division:
      events/kmem/rss_stat$ echo 'hist:keys=common_pid:x=size/<divisor>'
         >> trigger
      events/kmem/rss_stat$ echo 'hist:keys=common_pid:vals=$x'
         >> trigger
3. Run Android's lmkd_test[2] to generate rss_stat events, and
   record CPU samples with Android's simpleperf:
      simpleperf record -a --exclude-perf --post-unwind=yes -m 16384 -g
         -f 2000 -o perf.data

== Results ==

Divisor is a power of 2 (divisor == 32):

   test_hist_field_div_not_optimized  | 8,717,091 cpu-cycles
   test_hist_field_div_optimized      | 1,643,137 cpu-cycles

If the divisor is a power of 2, the optimized version is ~5.3x faster.

Divisor is not a power of 2 (divisor == 33):

   test_hist_field_div_not_optimized  | 4,444,324 cpu-cycles
   test_hist_field_div_optimized      | 5,497,958 cpu-cycles

If the divisor is not a power of 2, as expected, the optimized version is
slightly slower (~24% slower).

[1] https://android.googlesource.com/platform/system/extras/+/master/simpleperf/doc/README.md
[2] https://cs.android.com/android/platform/superproject/+/master:system/memory/lmkd/tests/lmkd_test.cpp

Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>

---
 kernel/trace/trace_events_hist.c | 4 ++++
 1 file changed, 4 insertions(+)

Comments

Steven Rostedt Oct. 26, 2021, 7:14 p.m. UTC | #1
On Mon, 25 Oct 2021 13:08:38 -0700
Kalesh Singh <kaleshsingh@google.com> wrote:

> == Results ==
> 
> Divisor is a power of 2 (divisor == 32):
> 
>    test_hist_field_div_not_optimized  | 8,717,091 cpu-cycles
>    test_hist_field_div_optimized      | 1,643,137 cpu-cycles
> 
> If the divisor is a power of 2, the optimized version is ~5.3x faster.
> 
> Divisor is not a power of 2 (divisor == 33):
> 
>    test_hist_field_div_not_optimized  | 4,444,324 cpu-cycles
>    test_hist_field_div_optimized      | 5,497,958 cpu-cycles

To optimize this even more, if the divisor is constant, we could make a
separate function to not do the branch, and just shift or divide.

And even if it is not a power of 2, for constants, we could implement a
multiplication and shift, and guarantee an accuracy up to a defined max.


If div is a constant, then we can calculate the mult and shift, and max
dividend. Let's use 20 for shift.

	// This works best for small divisors
	if (div > max_div) {
		// only do a real division
		return;
	}
	shift = 20;
	mult = ((1 << shift) + div - 1) / div;
	delta = mult * div - (1 << shift);
	if (!delta) {
		/* div is a power of 2 */
		max = -1;
		return;
	}
	max = (1 << shift) / delta;

We would of course need to use 64 bit operations (maybe only do this for 64
bit machines). And perhaps even use bigger shift values to get a bigger max.

Then we could do:

	if (val1 < max)
		return (val1 * mult) >> shift;

-- Steve
Kalesh Singh Oct. 26, 2021, 11:39 p.m. UTC | #2
On Tue, Oct 26, 2021 at 12:14 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Mon, 25 Oct 2021 13:08:38 -0700
> Kalesh Singh <kaleshsingh@google.com> wrote:
>
> > == Results ==
> >
> > Divisor is a power of 2 (divisor == 32):
> >
> >    test_hist_field_div_not_optimized  | 8,717,091 cpu-cycles
> >    test_hist_field_div_optimized      | 1,643,137 cpu-cycles
> >
> > If the divisor is a power of 2, the optimized version is ~5.3x faster.
> >
> > Divisor is not a power of 2 (divisor == 33):
> >
> >    test_hist_field_div_not_optimized  | 4,444,324 cpu-cycles
> >    test_hist_field_div_optimized      | 5,497,958 cpu-cycles
>
> To optimize this even more, if the divisor is constant, we could make a
> separate function to not do the branch, and just shift or divide.

Ack. I can update to use separate functions for the constant divisors.

>
> And even if it is not a power of 2, for constants, we could implement a
> multiplication and shift, and guarantee an accuracy up to a defined max.
>
>
> If div is a constant, then we can calculate the mult and shift, and max
> dividend. Let's use 20 for shift.
>
>         // This works best for small divisors
>         if (div > max_div) {
>                 // only do a real division
>                 return;
>         }
>         shift = 20;
>         mult = ((1 << shift) + div - 1) / div;
>         delta = mult * div - (1 << shift);
>         if (!delta) {
>                 /* div is a power of 2 */
>                 max = -1;
>                 return;
>         }
>         max = (1 << shift) / delta;

I'm still trying to digest the above algorithm. But doesn't this add 2
extra divisions? What am I missing here?

Thanks,
Kalesh

>
> We would of course need to use 64 bit operations (maybe only do this for 64
> bit machines). And perhaps even use bigger shift values to get a bigger max.
>
> Then we could do:
>
>         if (val1 < max)
>                 return (val1 * mult) >> shift;
>
> -- Steve
Steven Rostedt Oct. 27, 2021, 12:18 a.m. UTC | #3
On Tue, 26 Oct 2021 16:39:13 -0700
Kalesh Singh <kaleshsingh@google.com> wrote:

> >         // This works best for small divisors
> >         if (div > max_div) {
> >                 // only do a real division
> >                 return;
> >         }
> >         shift = 20;
> >         mult = ((1 << shift) + div - 1) / div;
> >         delta = mult * div - (1 << shift);
> >         if (!delta) {
> >                 /* div is a power of 2 */
> >                 max = -1;
> >                 return;
> >         }
> >         max = (1 << shift) / delta;  
> 
> I'm still trying to digest the above algorithm. 

mult = (2^20 + div - 1) / div;

The "div - 1" is to round up.

Basically, it's doing:  X / div  = X * (2^20 / div) / 2^20

If div is constant, the 2^20 / div is constant, and the "2^20" is the
same as a shift.

So multiplier is 2^20 / div, and the shift is 20.

But because there's rounding errors it is only accurate up to the
difference of:

  delta = mult * div / 2^20

That is if mult is a power of two, then there would be no rounding
errors, and the delta is zero, making the max infinite:

  max = 2^20 / delta as delta goes to zero.

> But doesn't this add 2 extra divisions? What am I missing here?

The above is only done at parsing not during the trace, where we care
about.

> > 
> >
> > We would of course need to use 64 bit operations (maybe only do this for 64
> > bit machines). And perhaps even use bigger shift values to get a bigger max.
> >
> > Then we could do:
> >
> >         if (val1 < max)
> >                 return (val1 * mult) >> shift;

This is done at the time of recording.

Actually, it would be:

	if (val1 < max)
		return (val1 * mult) >> shift;
	else
		return val1 / div;

-- Steve
Kalesh Singh Oct. 27, 2021, 1:09 a.m. UTC | #4
On Tue, Oct 26, 2021 at 5:18 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Tue, 26 Oct 2021 16:39:13 -0700
> Kalesh Singh <kaleshsingh@google.com> wrote:
>
> > >         // This works best for small divisors
> > >         if (div > max_div) {
> > >                 // only do a real division
> > >                 return;
> > >         }
> > >         shift = 20;
> > >         mult = ((1 << shift) + div - 1) / div;
> > >         delta = mult * div - (1 << shift);
> > >         if (!delta) {
> > >                 /* div is a power of 2 */
> > >                 max = -1;
> > >                 return;
> > >         }
> > >         max = (1 << shift) / delta;
> >
> > I'm still trying to digest the above algorithm.
>
> mult = (2^20 + div - 1) / div;
>
> The "div - 1" is to round up.
>
> Basically, it's doing:  X / div  = X * (2^20 / div) / 2^20
>
> If div is constant, the 2^20 / div is constant, and the "2^20" is the
> same as a shift.
>
> So multiplier is 2^20 / div, and the shift is 20.
>
> But because there's rounding errors it is only accurate up to the
> difference of:
>
>   delta = mult * div / 2^20
>
> That is if mult is a power of two, then there would be no rounding
> errors, and the delta is zero, making the max infinite:
>
>   max = 2^20 / delta as delta goes to zero.
>
> > But doesn't this add 2 extra divisions? What am I missing here?
>
> The above is only done at parsing not during the trace, where we care
> about.

Hi Steve,

Thanks for the explanation, this cleared it up for me.

- Kalesh

>
> > >
> > >
> > > We would of course need to use 64 bit operations (maybe only do this for 64
> > > bit machines). And perhaps even use bigger shift values to get a bigger max.
> > >
> > > Then we could do:
> > >
> > >         if (val1 < max)
> > >                 return (val1 * mult) >> shift;
>
> This is done at the time of recording.
>
> Actually, it would be:
>
>         if (val1 < max)
>                 return (val1 * mult) >> shift;
>         else
>                 return val1 / div;
>
> -- Steve
Steven Rostedt Oct. 27, 2021, 1:15 a.m. UTC | #5
On Tue, 26 Oct 2021 18:09:22 -0700
Kalesh Singh <kaleshsingh@google.com> wrote:

> >   delta = mult * div / 2^20
> >
> > That is if mult is a power of two, then there would be no rounding
> > errors, and the delta is zero, making the max infinite:

That should have been (as shown in the algorithm)

  delta = mult * div - 2 ^ 20

As mult is 2^20 / div; and the above should end up zero if there's no
rounding issues, as it would be:

 delta = (2^20 / div) * div - 2^20

-- Steve
Kalesh Singh Oct. 27, 2021, 1:31 a.m. UTC | #6
On Tue, Oct 26, 2021 at 6:15 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Tue, 26 Oct 2021 18:09:22 -0700
> Kalesh Singh <kaleshsingh@google.com> wrote:
>
> > >   delta = mult * div / 2^20
> > >
> > > That is if mult is a power of two, then there would be no rounding
> > > errors, and the delta is zero, making the max infinite:
>
> That should have been (as shown in the algorithm)
>
>   delta = mult * div - 2 ^ 20
>
> As mult is 2^20 / div; and the above should end up zero if there's no
> rounding issues, as it would be:
>
>  delta = (2^20 / div) * div - 2^20

Good catch. We're checking if we get back the exact value.

And IIUC max_div is an arbitrary value we decide on that's <= 2^shift?
Is there a rule of thumb for choosing this?

Thanks,
Kalesh
>
> -- Steve
Steven Rostedt Oct. 27, 2021, 2:21 a.m. UTC | #7
On Tue, 26 Oct 2021 18:31:21 -0700
Kalesh Singh <kaleshsingh@google.com> wrote:

> And IIUC max_div is an arbitrary value we decide on that's <= 2^shift?
> Is there a rule of thumb for choosing this?

The way I came up with the max was to figure out at what point is it no
longer guaranteed to be accurate. That is, what number can make the
mult/shift no longer match the division.

If we have some number div that is not a power of two. At some point:

	(X * mult) >> shift != X / div

Now I simply picked

  max = 1 << shift / (mult * div - (1 << shift))

Because that will always be within the precision of the actual number.

But I believe we can make max bigger, but because that deals with
truncation, it's not simple math.

That is, the above X / div is truncated and not the real number.

I'm sure there's an algorithm somewhere that can give as the real max.

-- Steve
Steven Rostedt Oct. 27, 2021, 3:15 a.m. UTC | #8
On Tue, 26 Oct 2021 22:21:23 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> I'm sure there's an algorithm somewhere that can give as the real max.

You got me playing with this more ;-)

OK, I added the rounding in the wrong place. I found that we can make
the max_div to be the same as the shift! The bigger the shift, the
bigger the max!

	mult = (1 << shift) / div;
	max_div = (1 << shift)

But the rounding needs to be with the mult / shift:

	return (val * mult + ((1 << shift) - 1)) >> shift;


When val goes pass 1 << shift, then the error will be off by more than
one.

-- Steve
Kalesh Singh Oct. 27, 2021, 4:04 a.m. UTC | #9
On Tue, Oct 26, 2021 at 8:16 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Tue, 26 Oct 2021 22:21:23 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
>
> > I'm sure there's an algorithm somewhere that can give as the real max.
>
> You got me playing with this more ;-)
>
> OK, I added the rounding in the wrong place. I found that we can make
> the max_div to be the same as the shift! The bigger the shift, the
> bigger the max!

Nice! :)
>
>         mult = (1 << shift) / div;
>         max_div = (1 << shift)
>
> But the rounding needs to be with the mult / shift:
>
>         return (val * mult + ((1 << shift) - 1)) >> shift;
>
>
> When val goes pass 1 << shift, then the error will be off by more than
> one.
Did you mean, val should be such that when we do the (val * mult) we
only get rounding errors less than (1 << shift)?

I think we also need to flip the delta now since we round down initially:

    delta =  (1 << shift) - (mult * div)

Thanks,
Kalesh
>
> -- Steve
Steven Rostedt Oct. 27, 2021, 2:06 p.m. UTC | #10
On Tue, 26 Oct 2021 21:04:29 -0700
Kalesh Singh <kaleshsingh@google.com> wrote:

> On Tue, Oct 26, 2021 at 8:16 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > On Tue, 26 Oct 2021 22:21:23 -0400
> > Steven Rostedt <rostedt@goodmis.org> wrote:
> >  
> > > I'm sure there's an algorithm somewhere that can give as the real max.  
> >
> > You got me playing with this more ;-)
> >
> > OK, I added the rounding in the wrong place. I found that we can make
> > the max_div to be the same as the shift! The bigger the shift, the
> > bigger the max!  
> 
> Nice! :)
> >
> >         mult = (1 << shift) / div;
> >         max_div = (1 << shift)
> >
> > But the rounding needs to be with the mult / shift:
> >
> >         return (val * mult + ((1 << shift) - 1)) >> shift;
> >
> >
> > When val goes pass 1 << shift, then the error will be off by more than
> > one.  
> Did you mean, val should be such that when we do the (val * mult) we
> only get rounding errors less than (1 << shift)?

We get rounding errors when val is greater than (1 << shift) because then
it exposes the bits that are not shifted out.

> 
> I think we also need to flip the delta now since we round down initially:
> 
>     delta =  (1 << shift) - (mult * div)
> 

Actually, we don't need the delta at all. Just what I showed above.

Pick some arbitrary shift (let's say 20 as that seems to be commonly used,
and works for 32 bit as well) and then we figure out the multiplier.

	mult = (1 << shift) / div;


No delta needed. Our max is going to be 1 << shift, and then all we need is:

	if (val < (1 << shift))
		return (val * mult + ((1 << shift) - 1)) >> shift;
	else
		return val / div;

All we need to save to do the operation is the shift, the constant div and
the calculated constant mult.

-- Steve
diff mbox series

Patch

diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
index db28bcf976f4..364cb3091789 100644
--- a/kernel/trace/trace_events_hist.c
+++ b/kernel/trace/trace_events_hist.c
@@ -304,6 +304,10 @@  static u64 hist_field_div(struct hist_field *hist_field,
 	if (!val2)
 		return -1;
 
+	/* Use shift if the divisor is a power of 2 */
+	if (!(val2 & (val2 - 1)))
+		return val1 >> __ffs64(val2);
+
 	return div64_u64(val1, val2);
 }