
[03/15] print_integer: new and improved way of printing integers

Message ID 20200420205743.19964-3-adobriyan@gmail.com (mailing list archive)
State New, archived
Series [01/15] sched: make nr_running() return "unsigned int"

Commit Message

Alexey Dobriyan April 20, 2020, 8:57 p.m. UTC
The time-honored way to print integers, via vsnprintf() or an
equivalent, carries the unavoidable slowdown of parsing the format
string. This can't be fixed in C without introducing an external
preprocessor.

seq_put_decimal_ull() partially saves the day, but there are still a
lot of branches inside and it still copies more than necessary.

The _print_integer_*() family of functions is meant to make printing
integers as fast as possible by deleting format string parsing and
doing as little work as possible.

It is based on the following observations:

1) memcpy() is done in the forward direction
	(it can be done backwards, but nobody does that),

2) digits can be extracted in a very simple loop which costs only
	1 multiplication and 1 shift per digit (division by a constant
	is not a real division)
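
To see what "division by a constant is not a real division" means in
observation 2), this is roughly the strength reduction compilers
already perform for a u32 division by 10 (a user-space sketch, not
code from this series; the magic constant is ceil(2^35 / 10)):

	#include <stdint.h>

	/*
	 * x / 10 becomes one 32x32->64 multiply and one shift;
	 * 0xcccccccd == ceil(2^35 / 10), exact for every uint32_t.
	 */
	static uint32_t div10(uint32_t x)
	{
		return (uint32_t)(((uint64_t)x * 0xcccccccd) >> 35);
	}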

All the above asks for the following signature, semantics and pattern of
printing out beloved /proc files:

	/* seq_printf(seq, "%u %llu\n", A, B); */

	char buf[10 + 1 + 20 + 1];
	char *p = buf + sizeof(buf);

	*--p = '\n';
	p = _print_integer_u64(p, B);
	*--p = ' ';
	p = _print_integer_u32(p, A);

	seq_write(seq, p, buf + sizeof(buf) - p);

1) a stack buffer capable of holding the biggest possible string is
	allocated.

2) "p" is a pointer to the start of the string. Initially it points
	past the end of the buffer, WHICH IS NOT NUL-TERMINATED!

3) _print_integer_*() actually prints an integer from right to left
	and returns the new start of the string.

			     <--------|
				123
				^
				|
				+-- p

4) 1 character is printed with

	*--p = 'x';

	It generates very efficient code as multiple writes can be
	merged.

5) a fixed string is printed with

	p = memcpy(p - 3, "foo", 3);

	Compilers know what memcpy() does and write-combine it.
	4/8-byte writes become 1 instruction and are very efficient.

6) Once everything is printed, the result is written to the seq_file
	buffer. It does only one overflow check and one copy.
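
Putting 1)-6) together, a minimal user-space harness (a sketch for
experimentation, not part of the patch) looks like this:

	#include <stdio.h>
	#include <stdint.h>

	/* Same loop as the patch's _print_integer_u32(), in user space. */
	static char *print_integer_u32(char *p, uint32_t x)
	{
		do {
			*--p = '0' + (x % 10);
		} while (x /= 10);
		return p;
	}

	int main(void)
	{
		char buf[10 + 1 + 10 + 1];	/* "%u %u\n", no NUL */
		char *p = buf + sizeof(buf);

		*--p = '\n';
		p = print_integer_u32(p, 456);
		*--p = ' ';
		p = print_integer_u32(p, 123);

		/* Not NUL-terminated: write by length, as seq_write() does. */
		fwrite(p, 1, buf + sizeof(buf) - p, stdout);	/* "123 456\n" */
		return 0;
	}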

This generates very efficient code (and small!).

In regular seq_printf() calls, the first argument and the format string
are constantly reloaded. The format string will most likely be
addressed with [rip+...], which is quite verbose.

seq_put_decimal_ull() will do branches (and even more branches
with the "width" argument).

	TODO
	benchmark with mainline because nouveau is broken for me -(
	vsnprintf() changes make the code slower

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---
 MAINTAINERS         |  6 ++++++
 lib/Makefile        |  2 +-
 lib/print-integer.c | 40 ++++++++++++++++++++++++++++++++++++++++
 lib/print-integer.h | 20 ++++++++++++++++++++
 4 files changed, 67 insertions(+), 1 deletion(-)
 create mode 100644 lib/print-integer.c
 create mode 100644 lib/print-integer.h

Comments

Andy Shevchenko April 20, 2020, 9:19 p.m. UTC | #1
On Mon, Apr 20, 2020 at 11:57:31PM +0300, Alexey Dobriyan wrote:
> [...]
> 	TODO
> 	benchmark with mainline because nouveau is broken for me -(
> 	vsnprintf() changes make the code slower

Exactly the main point of this exercise. I don't believe the algos in
vsprintf.c are so dumb as to use a division per digit (yes, division by
a constant which is not a power of two is a heavy operation).


> +noinline
> +char *_print_integer_u32(char *p, u32 x)
> +{
> +	do {
> +		*--p = '0' + (x % 10);
> +	} while (x /= 10);
> +	return p;
> +}

> +noinline
> +char *_print_integer_u64(char *p, u64 x)
> +{
> +	while (x >= 100 * 1000 * 1000) {
> +		u32 r;
> +
> +		x = div_u64_rem(x, 100 * 1000 * 1000, &r);
> +		p = memset(p - 8, '0', 8);
> +		(void)_print_integer_u32(p + 8, r);
> +	}
> +	return _print_integer_u32(p, x);
> +}
Andy Shevchenko April 20, 2020, 9:27 p.m. UTC | #2
On Tue, Apr 21, 2020 at 12:19:11AM +0300, Andy Shevchenko wrote:
> On Mon, Apr 20, 2020 at 11:57:31PM +0300, Alexey Dobriyan wrote:
> > [...]
> 
> > 	TODO
> > 	benchmark with mainline because nouveau is broken for me -(
> > 	vsnprintf() changes make the code slower
> 
> Exactly the main point of this exercise. I don't believe the algos in
> vsprintf.c are so dumb as to use a division per digit (yes, division by
> a constant which is not a power of two is a heavy operation).
> 

And second point here, why not to use existing algos from vsprintf.c?
Steven Rostedt April 21, 2020, 1:54 a.m. UTC | #3
On Tue, 21 Apr 2020 00:27:23 +0300
Andy Shevchenko <andriy.shevchenko@linux.intel.com> wrote:

> >   
> > > 	TODO
> > > 	benchmark with mainline because nouveau is broken for me -(
> > > 	vsnprintf() changes make the code slower  
> > 
> > Exactly the main point of this exercise. I don't believe the algos in
> > vsprintf.c are so dumb as to use a division per digit (yes, division by
> > a constant which is not a power of two is a heavy operation).
> >   
> 
> And second point here, why not to use existing algos from vsprintf.c?

Exactly. The code in _print_integer_u32() doesn't look as fast as the
code in vsprintf() that happens to use lookup tables and converts
without any loops.

Hint, loops are bad, they cause the CPU to slow down.

Anyway, this patch series would require a pretty good improvement, as
the code replacing the sprintf() usages is pretty ugly compared to a
simple sprintf() call.

Randomly picking patch 6:

 static int loadavg_proc_show(struct seq_file *m, void *v)
 {
 	unsigned long avnrun[3];
 
 	get_avenrun(avnrun, FIXED_1/200, 0);
 
	seq_printf(m, "%lu.%02lu %lu.%02lu %lu.%02lu %u/%d %d\n",
		LOAD_INT(avnrun[0]), LOAD_FRAC(avnrun[0]),
		LOAD_INT(avnrun[1]), LOAD_FRAC(avnrun[1]),
		LOAD_INT(avnrun[2]), LOAD_FRAC(avnrun[2]),
		nr_running(), nr_threads,
		idr_get_cursor(&task_active_pid_ns(current)->idr) - 1);
 	return 0;
 }

  *vs* 

 static int loadavg_proc_show(struct seq_file *m, void *v)
 {
 	unsigned long avnrun[3];
	char buf[3 * (LEN_UL + 1 + 2 + 1) + 10 + 1 + 10 + 1 + 10 + 1];
	char *p = buf + sizeof(buf);
	int i;

	*--p = '\n';
	p = _print_integer_u32(p, idr_get_cursor(&task_active_pid_ns(current)->idr) - 1);
	*--p = ' ';
	p = _print_integer_u32(p, nr_threads);
	*--p = '/';
	p = _print_integer_u32(p, nr_running());

 	get_avenrun(avnrun, FIXED_1/200, 0);
	for (i = 2; i >= 0; i--) {
		*--p = ' ';
		--p;		/* overwritten */
		*--p = '0';	/* conditionally overwritten */
		(void)_print_integer_u32(p + 2, LOAD_FRAC(avnrun[i]));
		*--p = '.';
		p = _print_integer_ul(p, LOAD_INT(avnrun[i]));
	}
 
	seq_write(m, p, buf + sizeof(buf) - p);
 	return 0;
 }


I much rather keep the first version.

-- Steve
Alexey Dobriyan April 21, 2020, 4:31 p.m. UTC | #4
On Tue, Apr 21, 2020 at 12:19:11AM +0300, Andy Shevchenko wrote:
> On Mon, Apr 20, 2020 at 11:57:31PM +0300, Alexey Dobriyan wrote:
> > [...]
> 
> > 	TODO
> > 	benchmark with mainline because nouveau is broken for me -(
> > 	vsnprintf() changes make the code slower
> 
> Exactly the main point of this exercise. I don't believe the algos in
> vsprintf.c are so dumb as to use a division per digit (yes, division by
> a constant which is not a power of two is a heavy operation).

It is not about division.

It is about fucktons of branches in vsprintf().
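
To illustrate where those branches come from, a simplified user-space
sketch (emphatically not the kernel's actual vsnprintf(), which also
handles flags, width, precision, length modifiers and a dozen more
conversions):

#include <stdarg.h>

/* Hypothetical mini-formatter: at least one branch per character. */
static char *emit_uint(char *buf, unsigned int v)
{
	char tmp[10], *t = tmp;

	do {
		*t++ = '0' + (v % 10);
	} while (v /= 10);
	while (t != tmp)
		*buf++ = *--t;
	return buf;
}

static char *mini_vsprintf(char *buf, const char *fmt, va_list args)
{
	while (*fmt) {
		if (*fmt != '%') {	/* branch per literal character */
			*buf++ = *fmt++;
			continue;
		}
		fmt++;
		switch (*fmt) {		/* branch per conversion specifier */
		case 'u':
			buf = emit_uint(buf, va_arg(args, unsigned int));
			fmt++;
			break;
		case '%':
			*buf++ = '%';
			fmt++;
			break;
		default:		/* unknown or trailing '%': skip */
			if (*fmt)
				fmt++;
			break;
		}
	}
	return buf;
}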

Alexey Dobriyan April 21, 2020, 4:49 p.m. UTC | #5
On Mon, Apr 20, 2020 at 09:54:17PM -0400, Steven Rostedt wrote:
> On Tue, 21 Apr 2020 00:27:23 +0300
> Andy Shevchenko <andriy.shevchenko@linux.intel.com> wrote:
> 
> > >   
> > > > 	TODO
> > > > 	benchmark with mainline because nouveau is broken for me -(
> > > > 	vsnprintf() changes make the code slower  
> > > 
> > > Exactly the main point of this exercise. I don't believe the algos in
> > > vsprintf.c are so dumb as to use a division per digit (yes, division by
> > > a constant which is not a power of two is a heavy operation).
> > >   
> > 
> > And second point here, why not to use existing algos from vsprintf.c?
> 
> Exactly. The code in _print_integer_u32() doesn't look as fast as the
> code in vsprintf() that happens to use lookup tables and converts
> without any loops.
> 
> Hint, loops are bad, they cause the CPU to slow down.

Oh, come on! Loops make code fit into icache and μop decode cache.

> Anyway, this patch series would require a pretty good improvement, as
> the code replacing the sprintf() usages is pretty ugly compared to a
> simple sprintf() call.

No! Fast code must look ugly. Or in other words, if you try to optimise
integer printing to death, you'll probably end up with something like
_print_integer().

When the very first patch changed /proc/stat to seq_put_decimal_ull()
the speed up was 66% (or 33%). That's how slow printing was back then.
It can be made slightly faster even now.

> Randomly picking patch 6:
> [...]
> 
> I much rather keep the first version.

I did the benchmarks (without stack protector, though): everything except
/proc/cpuinfo and /proc/meminfo became faster. This requires investigation
and I can drop the vsprintf() changes until then.

Now, given that the /proc/uptime format is cast in stone, the code may look
a bit ugly and unusual, but it won't require maintenance.
Matthew Wilcox April 21, 2020, 4:59 p.m. UTC | #6
On Mon, Apr 20, 2020 at 11:57:31PM +0300, Alexey Dobriyan wrote:
> 1) memcpy() is done in the forward direction
> 	(it can be done backwards, but nobody does that),

If you're determined to do this, then use memmove(), which actually
guarantees to work with overlapping ranges.  Don't rely on the
non-guaranteed behaviour of current implementations of memcpy().  Did
you really check the two dozen assembly implementations of memcpy() in
the kernel?

> 2) digits can be extracted in a very simple loop which costs only
> 	1 multiplication and 1 shift per digit (division by a constant
> 	is not a real division)

> +noinline
> +char *_print_integer_u32(char *p, u32 x)
> +{
> +	do {
> +		*--p = '0' + (x % 10);
> +	} while (x /= 10);
> +	return p;
> +}

Why not do two digits at a time like put_dec() does?
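
For reference, the two-digits-at-a-time idea looks roughly like this
(a sketch in the spirit of the decpair table in lib/vsprintf.c;
hypothetical user-space code, not the kernel's actual put_dec()):

#include <stdint.h>

/*
 * One division by 100 emits two digits from a 200-byte lookup table,
 * halving the number of multiply/shift steps of the per-digit loop.
 */
static const char decpairs[201] =
	"0001020304050607080910111213141516171819"
	"2021222324252627282930313233343536373839"
	"4041424344454647484950515253545556575859"
	"6061626364656667686970717273747576777879"
	"8081828384858687888990919293949596979899";

static char *print_integer_u32_pairs(char *p, uint32_t x)
{
	while (x >= 100) {
		uint32_t r = x % 100;

		x /= 100;
		*--p = decpairs[2 * r + 1];	/* units */
		*--p = decpairs[2 * r];		/* tens */
	}
	if (x >= 10) {
		*--p = decpairs[2 * x + 1];
		*--p = decpairs[2 * x];
	} else {
		*--p = '0' + x;
	}
	return p;
}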
Steven Rostedt April 21, 2020, 9:08 p.m. UTC | #7
On Tue, 21 Apr 2020 19:49:24 +0300
Alexey Dobriyan <adobriyan@gmail.com> wrote:
> > Exactly. The code in _print_integer_u32() doesn't look as fast as the
> > code in vsprintf() that happens to use lookup tables and converts
> > without any loops.
> > 
> > Hint, loops are bad, they cause the CPU to slow down.  
> 
> Oh, come on! Loops make code fit into icache and μop decode cache.

Depends on the architecture.

> 
> > Anyway, this patch series would require a pretty good improvement, as
> > the code replacing the sprintf() usages is pretty ugly compared to a
> > simple sprintf() call.  
> 
> No! Fast code must look ugly. Or in other words, if you try to optimise
> integer printing to death, you'll probably end up with something like
> _print_integer().

As I stated, it will require a "pretty good improvement". There's always a
trade-off. If the improvement is noticeable in real-life cases, then ugly
code is worth it. If we are making ugly code for the benefit of something
that never shows outside of the noise, then the net cost (less maintainable
code) is not worth it.

> 
> When the very first patch changed /proc/stat to seq_put_decimal_ull()
> the speed up was 66% (or 33%). That's how slow printing was back then.
> It can be made slightly faster even now.

I'd like to see the tests that you ran (to reproduce them myself).

The first patch was turning a 64-bit number into a 32-bit one, thus
cutting the work in half.

> 
> > Randomly picking patch 6:
> > [...]
> > 
> > I much rather keep the first version.  
> 
> I did the benchmarks (without stack protector, though): everything except
> /proc/cpuinfo and /proc/meminfo became faster. This requires investigation
> and I can drop the vsprintf() changes until then.
> 
> Now, given that the /proc/uptime format is cast in stone, the code may look
> a bit ugly and unusual, but it won't require maintenance.

Please share what you ran as your benchmarks. If the performance is as good
as you claim, then these changes would be worth looking into.


So I applied your entire series, added the following patch:

diff --git a/include/linux/spinlock_api_smp.h b/include/linux/spinlock_api_smp.h
index 19a9be9d97ee..17c582d77ab7 100644
--- a/include/linux/spinlock_api_smp.h
+++ b/include/linux/spinlock_api_smp.h
@@ -152,9 +152,16 @@ static inline void __raw_spin_unlock(raw_spinlock_t *lock)
 	preempt_enable();
 }
 
+extern u64 sched_clock(void);
 static inline void __raw_spin_unlock_irqrestore(raw_spinlock_t *lock,
 					    unsigned long flags)
 {
+	char buf[32];
+	u64 start, stop;
+	start = sched_clock();
+	sprintf(buf, "%lld", (long long)lock);
+	stop = sched_clock();
+	trace_printk("time: %llu '%s'\n", stop - start, buf);
 	spin_release(&lock->dep_map, _RET_IP_);
 	do_raw_spin_unlock(lock);
 	local_irq_restore(flags);


Then after boot up, I did the following:

 # trace-cmd stop
 # trace-cmd extract

Which captured the traces:

          <idle>-0     [003]     5.405208: bprint:               _raw_spin_unlock_irqrestore: time: 799 '-110308271193088'
          <idle>-0     [003]     5.405210: bprint:               _raw_spin_unlock_irqrestore: time: 273 '-110308271193088'
          <idle>-0     [003]     5.412235: bprint:               _raw_spin_unlock_irqrestore: time: 1138 '-110308271193088'
          <idle>-0     [003]     5.412236: bprint:               _raw_spin_unlock_irqrestore: time: 213 '-110308271193088'
          <idle>-0     [003]     5.414241: bprint:               _raw_spin_unlock_irqrestore: time: 1094 '-110308271193088'
          <idle>-0     [003]     5.414243: bprint:               _raw_spin_unlock_irqrestore: time: 182 '-110308271193088'
          <idle>-0     [003]     5.418241: bprint:               _raw_spin_unlock_irqrestore: time: 1113 '-110308271193088'


Where "time: X", X is the delta in nanoseconds around the sprintf() call.

Then I ran the attached perl program on the output and got the following
results:

Before your patches:

 # trace-cmd report | ./report.pl
full_total = 52844823
  average:   255.902176229032
  std:       439.269729814847

And with your patches:

 # trace-cmd report | ./report.pl
full_total = 84725476
  average:   407.873274762306
  std:       555.755670463724

As the standard deviation is bigger than the average, it appears to be all
in the noise.

Then I decided to see if it affects "ps -eux"

Original:

# perf stat -r 100 ps -eux > /dev/null

 Performance counter stats for 'ps -eux' (100 runs):

              8.92 msec task-clock                #    0.937 CPUs utilized            ( +-  0.90% )
                 5      context-switches          #    0.545 K/sec                    ( +-  1.24% )
                 0      cpu-migrations            #    0.000 K/sec                  
               259      page-faults               #    0.029 M/sec                    ( +-  0.07% )
        32,973,751      cycles                    #    3.698 GHz                      ( +-  0.09% )
        17,254,307      stalled-cycles-frontend   #   52.33% frontend cycles idle     ( +-  0.17% )
        38,707,960      instructions              #    1.17  insn per cycle         
                                                  #    0.45  stalled cycles per insn  ( +-  0.01% )
         8,153,197      branches                  #  914.274 M/sec                    ( +-  0.01% )
           114,992      branch-misses             #    1.41% of all branches          ( +-  0.12% )

         0.0095170 +- 0.0000829 seconds time elapsed  ( +-  0.87% )

With your patches:

# perf stat -r 100 ps -eux > /dev/null

 Performance counter stats for 'ps -eux' (100 runs):

              8.86 msec task-clock                #    0.918 CPUs utilized            ( +-  1.06% )
                 5      context-switches          #    0.527 K/sec                    ( +-  1.22% )
                 0      cpu-migrations            #    0.001 K/sec                    ( +-100.00% )
               259      page-faults               #    0.029 M/sec                    ( +-  0.08% )
        32,699,168      cycles                    #    3.692 GHz                      ( +-  0.12% )
        16,995,861      stalled-cycles-frontend   #   51.98% frontend cycles idle     ( +-  0.21% )
        38,114,396      instructions              #    1.17  insn per cycle         
                                                  #    0.45  stalled cycles per insn  ( +-  0.03% )
         7,985,526      branches                  #  901.625 M/sec                    ( +-  0.03% )
           112,852      branch-misses             #    1.41% of all branches          ( +-  0.17% )

          0.009652 +- 0.000276 seconds time elapsed  ( +-  2.86% )

Not really much difference.

Then what about just catting /proc/$$/stat, doing 1000 runs!

Original:

# perf stat -r 1000 cat /proc/$$/stat > /dev/null

 Performance counter stats for 'cat /proc/1622/stat' (1000 runs):

              0.34 msec task-clock                #    0.680 CPUs utilized            ( +-  0.21% )
                 0      context-switches          #    0.071 K/sec                    ( +- 20.18% )
                 0      cpu-migrations            #    0.000 K/sec                  
                65      page-faults               #    0.192 M/sec                    ( +-  0.07% )
           993,486      cycles                    #    2.934 GHz                      ( +-  0.07% )
           577,903      stalled-cycles-frontend   #   58.17% frontend cycles idle     ( +-  0.09% )
           936,489      instructions              #    0.94  insn per cycle         
                                                  #    0.62  stalled cycles per insn  ( +-  0.07% )
           202,912      branches                  #  599.213 M/sec                    ( +-  0.07% )
             6,976      branch-misses             #    3.44% of all branches          ( +-  0.11% )

        0.00049797 +- 0.00000111 seconds time elapsed  ( +-  0.22% )


With your patches:

# perf stat -r 1000 cat /proc/$$/stat > /dev/null

 Performance counter stats for 'cat /proc/1624/stat' (1000 runs):

              0.34 msec task-clock                #    0.664 CPUs utilized            ( +-  0.23% )
                 0      context-switches          #    0.018 K/sec                    ( +- 40.72% )
                 0      cpu-migrations            #    0.000 K/sec                  
                65      page-faults               #    0.190 M/sec                    ( +-  0.07% )
           988,430      cycles                    #    2.892 GHz                      ( +-  0.07% )
           574,841      stalled-cycles-frontend   #   58.16% frontend cycles idle     ( +-  0.09% )
           933,852      instructions              #    0.94  insn per cycle         
                                                  #    0.62  stalled cycles per insn  ( +-  0.07% )
           202,096      branches                  #  591.297 M/sec                    ( +-  0.07% )
             6,836      branch-misses             #    3.38% of all branches          ( +-  0.09% )

        0.00051476 +- 0.00000557 seconds time elapsed  ( +-  1.08% )


They are pretty much identical.

What's the purpose of all these changes again? There was no cover letter.

-- Steve
Sergey Senozhatsky April 23, 2020, 3:42 a.m. UTC | #8
On (20/04/21 19:49), Alexey Dobriyan wrote:
> 
> I did the benchmarks (without stack protector, though): everything except
> /proc/cpuinfo and /proc/meminfo became faster.
> 

Why does the performance of /proc/cpuinfo and /proc/meminfo matter?

	-ss

Patch

diff --git a/MAINTAINERS b/MAINTAINERS
index b816a453b10e..8322125bb929 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8470,6 +8470,12 @@  L:	linux-crypto@vger.kernel.org
 S:	Maintained
 F:	drivers/crypto/inside-secure/
 
+INTEGER PRINTING PRESS
+M:	Alexey Dobriyan <adobriyan@gmail.com>
+L:	linux-kernel@vger.kernel.org
+F:	lib/print-integer.[ch]
+S:	Maintained
+
 INTEGRITY MEASUREMENT ARCHITECTURE (IMA)
 M:	Mimi Zohar <zohar@linux.ibm.com>
 M:	Dmitry Kasatkin <dmitry.kasatkin@gmail.com>
diff --git a/lib/Makefile b/lib/Makefile
index 685aee60de1d..a2f011fa6739 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -25,7 +25,7 @@  KASAN_SANITIZE_string.o := n
 CFLAGS_string.o := $(call cc-option, -fno-stack-protector)
 endif
 
-lib-y := ctype.o string.o vsprintf.o cmdline.o \
+lib-y := ctype.o string.o print-integer.o vsprintf.o cmdline.o \
 	 rbtree.o radix-tree.o timerqueue.o xarray.o \
 	 idr.o extable.o sha1.o irq_regs.o argv_split.o \
 	 flex_proportions.o ratelimit.o show_mem.o \
diff --git a/lib/print-integer.c b/lib/print-integer.c
new file mode 100644
index 000000000000..563aaca19b8c
--- /dev/null
+++ b/lib/print-integer.c
@@ -0,0 +1,40 @@ 
+#include <linux/compiler.h>
+#include <linux/math64.h>
+#include <linux/string.h>
+#include <linux/types.h>
+
+#include "print-integer.h"
+
+noinline
+char *_print_integer_u32(char *p, u32 x)
+{
+	do {
+		*--p = '0' + (x % 10);
+	} while (x /= 10);
+	return p;
+}
+
+noinline
+char *_print_integer_s32(char *p, s32 x)
+{
+	if (x < 0) {
+		p = _print_integer_u32(p, -x);
+		p = _print_integer_u32(p, -(u32)x);	/* avoids UB for S32_MIN */
+		return p;
+	} else {
+		return _print_integer_u32(p, x);
+	}
+}
+
+noinline
+char *_print_integer_u64(char *p, u64 x)
+{
+	while (x >= 100 * 1000 * 1000) {
+		u32 r;
+
+		x = div_u64_rem(x, 100 * 1000 * 1000, &r);
+		p = memset(p - 8, '0', 8);
+		(void)_print_integer_u32(p + 8, r);
+	}
+	return _print_integer_u32(p, x);
+}
diff --git a/lib/print-integer.h b/lib/print-integer.h
new file mode 100644
index 000000000000..a6f8e1757a6f
--- /dev/null
+++ b/lib/print-integer.h
@@ -0,0 +1,20 @@ 
+#pragma once
+char *_print_integer_u32(char *, u32);
+char *_print_integer_u64(char *, u64);
+char *_print_integer_s32(char *, s32);
+
+static inline char *_print_integer_ul(char *p, unsigned long x)
+{
+#ifdef CONFIG_64BIT
+	return _print_integer_u64(p, x);
+#else
+	return _print_integer_u32(p, x);
+#endif
+}
+
+enum {
+	LEN_U32 = 10,
+	LEN_S32 = 1 + LEN_U32,
+	LEN_UL = sizeof(long) * 5 / 2,
+	LEN_U64 = 20,
+};
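
(A note on LEN_UL above: one byte carries at most log10(256) ~= 2.41
decimal digits, so 5/2 digits per byte is a safe over-approximation
that lands exactly on the interesting sizes: 4 * 5 / 2 = 10 digits for
a 32-bit long, whose maximum 4294967295 has 10 digits, and 8 * 5 / 2 =
20 digits for a 64-bit long, whose maximum 18446744073709551615 has 20
digits.)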