Message ID | 20200420205743.19964-3-adobriyan@gmail.com (mailing list archive)
---|---
State | New, archived
Series | [01/15] sched: make nr_running() return "unsigned int"
On Mon, Apr 20, 2020 at 11:57:31PM +0300, Alexey Dobriyan wrote:
> Time honored way to print integers via vsnprintf() or equivalent has
> unavoidable slowdown of parsing format string. This can't be fixed in C,
> without introducing external preprocessor.
>
> seq_put_decimal_ull() partially saves the day, but there are a lot of
> branches inside and overcopying still.
>
> _print_integer_*() family of functions is meant to make printing
> integers as fast as possible by deleting format string parsing and doing
> as little work as possible.
>
> It is based on the following observations:
>
> 1) memcpy is done in forward direction
>    it can be done backwards but nobody does that,
>
> 2) digits can be extracted in a very simple loop which costs only
>    1 multiplication and shift (division by constant is not division)
>
> All the above asks for the following signature, semantics and pattern of
> printing out beloved /proc files:
>
> 	/* seq_printf(seq, "%u %llu\n", A, B); */
>
> 	char buf[10 + 1 + 20 + 1];
> 	char *p = buf + sizeof(buf);
>
> 	*--p = '\n';
> 	p = _print_integer_u64(p, B);
> 	*--p = ' ';
> 	p = _print_integer_u32(p, A);
>
> 	seq_write(seq, p, buf + sizeof(buf) - p);
>
> 1) stack buffer capable of holding the biggest string is allocated.
>
> 2) "p" is pointer to start of the string. Initially it points past
>    the end of the buffer WHICH IS NOT NUL-TERMINATED!
>
> 3) _print_integer_*() actually prints an integer from right to left
>    and returns new start of the string.
>
> 	<--------|
> 	      123
> 	      ^
> 	      |
> 	      +-- p
>
> 4) 1 character is printed with
>
> 	*--p = 'x';
>
>    It generates very efficient code as multiple writes can be
>    merged.
>
> 5) fixed string is printed with
>
> 	p = memcpy(p - 3, "foo", 3);
>
>    Compilers know what memcpy() does and write-combine it.
>    4/8-byte writes become 1 instruction and are very efficient.
>
> 6) Once everything is printed, the result is written to seq_file buffer.
>    It does only one overflow check and 1 copy.
>
> This generates very efficient code (and small!).
>
> In regular seq_printf() calls, first argument and format string are
> constantly reloaded. Format string will most likely be addressed with
> [rip+...] which is quite verbose.
>
> seq_put_decimal_ull() will do branches (and even more branches
> with "width" argument)
>
> TODO
> 	benchmark with mainline because nouveau is broken for me -(
> 	vsnprintf() changes make the code slower

Exactly main point of this exercise. I don't believe that algos in vsprintf.c
are too dumb to use division per digit (yes, division by constant which is not
power of two is a heavy operation).

> +noinline
> +char *_print_integer_u32(char *p, u32 x)
> +{
> +	do {
> +		*--p = '0' + (x % 10);
> +	} while (x /= 10);
> +	return p;
> +}

> +noinline
> +char *_print_integer_u64(char *p, u64 x)
> +{
> +	while (x >= 100 * 1000 * 1000) {
> +		u32 r;
> +
> +		x = div_u64_rem(x, 100 * 1000 * 1000, &r);
> +		p = memset(p - 8, '0', 8);
> +		(void)_print_integer_u32(p + 8, r);
> +	}
> +	return _print_integer_u32(p, x);
> +}
On Tue, Apr 21, 2020 at 12:19:11AM +0300, Andy Shevchenko wrote:
> On Mon, Apr 20, 2020 at 11:57:31PM +0300, Alexey Dobriyan wrote:

[...]

> > TODO
> > 	benchmark with mainline because nouveau is broken for me -(
> > 	vsnprintf() changes make the code slower
>
> Exactly main point of this exercise. I don't believe that algos in vsprintf.c
> are too dumb to use division per digit (yes, division by constant which is not
> power of two is a heavy operation).

And second point here, why not to use existing algos from vsprintf.c?
On Tue, 21 Apr 2020 00:27:23 +0300
Andy Shevchenko <andriy.shevchenko@linux.intel.com> wrote:

> > > TODO
> > > 	benchmark with mainline because nouveau is broken for me -(
> > > 	vsnprintf() changes make the code slower
> >
> > Exactly main point of this exercise. I don't believe that algos in vsprintf.c
> > are too dumb to use division per digit (yes, division by constant which is not
> > power of two is a heavy operation).
>
> And second point here, why not to use existing algos from vsprintf.c?

Exactly. The code in _print_integer_u32() doesn't look as fast as the
code in vsprintf() that happens to use lookup tables and converts
without any loops.

Hint, loops are bad, they cause the CPU to slow down.

Anyway, this patch series would require a pretty good improvement, as
the code replacing the sprintf() usages is pretty ugly compared to a
simple sprintf() call.

Randomly picking patch 6:

static int loadavg_proc_show(struct seq_file *m, void *v)
{
	unsigned long avnrun[3];

	get_avenrun(avnrun, FIXED_1/200, 0);

	seq_printf(m, "%lu.%02lu %lu.%02lu %lu.%02lu %u/%d %d\n",
		LOAD_INT(avnrun[0]), LOAD_FRAC(avnrun[0]),
		LOAD_INT(avnrun[1]), LOAD_FRAC(avnrun[1]),
		LOAD_INT(avnrun[2]), LOAD_FRAC(avnrun[2]),
		nr_running(), nr_threads,
		idr_get_cursor(&task_active_pid_ns(current)->idr) - 1);
	return 0;
}

*vs*

static int loadavg_proc_show(struct seq_file *m, void *v)
{
	unsigned long avnrun[3];
	char buf[3 * (LEN_UL + 1 + 2 + 1) + 10 + 1 + 10 + 1 + 10 + 1];
	char *p = buf + sizeof(buf);
	int i;

	*--p = '\n';
	p = _print_integer_u32(p, idr_get_cursor(&task_active_pid_ns(current)->idr) - 1);
	*--p = ' ';
	p = _print_integer_u32(p, nr_threads);
	*--p = '/';
	p = _print_integer_u32(p, nr_running());

	get_avenrun(avnrun, FIXED_1/200, 0);
	for (i = 2; i >= 0; i--) {
		*--p = ' ';
		--p;		/* overwritten */
		*--p = '0';	/* conditionally overwritten */
		(void)_print_integer_u32(p + 2, LOAD_FRAC(avnrun[i]));
		*--p = '.';
		p = _print_integer_ul(p, LOAD_INT(avnrun[i]));
	}

	seq_write(m, p, buf + sizeof(buf) - p);
	return 0;
}

I much rather keep the first version.

-- Steve
On Tue, Apr 21, 2020 at 12:19:11AM +0300, Andy Shevchenko wrote:
> On Mon, Apr 20, 2020 at 11:57:31PM +0300, Alexey Dobriyan wrote:

[...]

> > TODO
> > 	benchmark with mainline because nouveau is broken for me -(
> > 	vsnprintf() changes make the code slower
>
> Exactly main point of this exercise. I don't believe that algos in vsprintf.c
> are too dumb to use division per digit (yes, division by constant which is not
> power of two is a heavy operation).

It is not about division. It is about fucktons of branches in vsprintf().
On Mon, Apr 20, 2020 at 09:54:17PM -0400, Steven Rostedt wrote:
> On Tue, 21 Apr 2020 00:27:23 +0300
> Andy Shevchenko <andriy.shevchenko@linux.intel.com> wrote:
>
> > > > TODO
> > > > 	benchmark with mainline because nouveau is broken for me -(
> > > > 	vsnprintf() changes make the code slower
> > >
> > > Exactly main point of this exercise. I don't believe that algos in vsprintf.c
> > > are too dumb to use division per digit (yes, division by constant which is not
> > > power of two is a heavy operation).
> >
> > And second point here, why not to use existing algos from vsprintf.c?
>
> Exactly. The code in _print_integer_u32() doesn't look as fast as the
> code in vsprintf() that happens to use lookup tables and converts
> without any loops.
>
> Hint, loops are bad, they cause the CPU to slow down.

Oh, come on! Loops make code fit into icache and μop decode cache.

> Anyway, this patch series would require a pretty good improvement, as
> the code replacing the sprintf() usages is pretty ugly compared to a
> simple sprintf() call.

No! Fast code must look ugly. Or in other words if you try to optimise
integer printing to death you'll probably end with something like
_print_integer().

When the very first patch changed /proc/stat to seq_put_decimal_ull()
the speed up was 66% (or 33%). That's how slow printing was back then.
It can be made slightly faster even now.

> Randomly picking patch 6:

[...]

> I much rather keep the first version.

I did the benchmarks (without stack protector though), everything except
/proc/cpuinfo and /proc/meminfo became faster. This requires investigation
and I can drop vsprintf() changes until then.

Now given that the /proc/uptime format is cast in stone, the code may look
a bit ugly and unusual but it won't require maintenance.
On Mon, Apr 20, 2020 at 11:57:31PM +0300, Alexey Dobriyan wrote:
> 1) memcpy is done in forward direction
>    it can be done backwards but nobody does that,

If you're determined to do this, then use memmove() which actually
guarantees to work with overlapping ranges. Don't rely on non-guaranteed
behaviour of current implementations of memcpy(). Did you really check
the two dozen assembly implementations of memcpy() in the kernel?

> 2) digits can be extracted in a very simple loop which costs only
>    1 multiplication and shift (division by constant is not division)

> +noinline
> +char *_print_integer_u32(char *p, u32 x)
> +{
> +	do {
> +		*--p = '0' + (x % 10);
> +	} while (x /= 10);
> +	return p;
> +}

Why not do two digits at a time like put_dec() does?
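For context, the put_dec() being referred to lives in lib/vsprintf.c. Below
is a minimal sketch of the general two-digits-per-iteration idea, written
for this discussion only; it is hypothetical illustration code (the table
and the helper name are made up here), not the actual kernel implementation:

/*
 * Sketch: emit two digits per iteration from a "00".."99" lookup table,
 * still writing right to left like _print_integer_u32() above.  The
 * caller must provide room for up to 10 digits before "p".
 */
static const char two_digits[] =
	"0001020304050607080910111213141516171819"
	"2021222324252627282930313233343536373839"
	"4041424344454647484950515253545556575859"
	"6061626364656667686970717273747576777879"
	"8081828384858687888990919293949596979899";

static char *print_u32_by_two(char *p, unsigned int x)
{
	while (x >= 100) {
		unsigned int r = x % 100;

		x /= 100;
		*--p = two_digits[2 * r + 1];	/* units digit */
		*--p = two_digits[2 * r];	/* tens digit */
	}
	*--p = '0' + (x % 10);
	if (x >= 10)
		*--p = '0' + (x / 10);
	return p;
}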
On Tue, 21 Apr 2020 19:49:24 +0300
Alexey Dobriyan <adobriyan@gmail.com> wrote:

> > Exactly. The code in _print_integer_u32() doesn't look as fast as the
> > code in vsprintf() that happens to use lookup tables and converts
> > without any loops.
> >
> > Hint, loops are bad, they cause the CPU to slow down.
>
> Oh, come on! Loops make code fit into icache and μop decode cache.

Depends on the architecture.

> > Anyway, this patch series would require a pretty good improvement, as
> > the code replacing the sprintf() usages is pretty ugly compared to a
> > simple sprintf() call.
>
> No! Fast code must look ugly. Or in other words if you try to optimise
> integer printing to death you'll probably end with something like
> _print_integer().

As I stated, it will require a "pretty good improvement". There's always a
trade off. If the improvement is noticeable for real life cases, then ugly
code is worth it. If we are making ugly code for a benefit of something that
never shows outside of noise, then the net cost (less maintainable code) is
not worth it.

> When the very first patch changed /proc/stat to seq_put_decimal_ull()
> the speed up was 66% (or 33%). That's how slow printing was back then.
> It can be made slightly faster even now.

I'd like to see the tests that you ran (to reproduce them myself). The
first patch was making a 64bit number into a 32bit number, thus shortening
the work by half.

> > Randomly picking patch 6:

[...]

> > I much rather keep the first version.
>
> I did the benchmarks (without stack protector though), everything except
> /proc/cpuinfo and /proc/meminfo became faster. This requires investigation
> and I can drop vsprintf() changes until then.
>
> Now given that the /proc/uptime format is cast in stone, the code may look
> a bit ugly and unusual but it won't require maintenance.

Please share what you did as your benchmarks. If this is as good of a
performance as you claim, then these changes would be worth looking into.
So I applied your entire series, added the following patch:

diff --git a/include/linux/spinlock_api_smp.h b/include/linux/spinlock_api_smp.h
index 19a9be9d97ee..17c582d77ab7 100644
--- a/include/linux/spinlock_api_smp.h
+++ b/include/linux/spinlock_api_smp.h
@@ -152,9 +152,16 @@ static inline void __raw_spin_unlock(raw_spinlock_t *lock)
 	preempt_enable();
 }
 
+extern u64 sched_clock(void);
 static inline void __raw_spin_unlock_irqrestore(raw_spinlock_t *lock,
 					    unsigned long flags)
 {
+	char buf[32];
+	u64 start, stop;
+	start = sched_clock();
+	sprintf(buf,"%lld", (unsigned long)lock);
+	stop = sched_clock();
+	trace_printk("time: %lld '%s'\n", stop - start, buf);
 	spin_release(&lock->dep_map, _RET_IP_);
 	do_raw_spin_unlock(lock);
 	local_irq_restore(flags);

Then after boot up, I did the following:

 # trace-cmd stop
 # trace-cmd extract

Which captured the traces:

   <idle>-0   [003]   5.405208: bprint: _raw_spin_unlock_irqrestore: time: 799 '-110308271193088'
   <idle>-0   [003]   5.405210: bprint: _raw_spin_unlock_irqrestore: time: 273 '-110308271193088'
   <idle>-0   [003]   5.412235: bprint: _raw_spin_unlock_irqrestore: time: 1138 '-110308271193088'
   <idle>-0   [003]   5.412236: bprint: _raw_spin_unlock_irqrestore: time: 213 '-110308271193088'
   <idle>-0   [003]   5.414241: bprint: _raw_spin_unlock_irqrestore: time: 1094 '-110308271193088'
   <idle>-0   [003]   5.414243: bprint: _raw_spin_unlock_irqrestore: time: 182 '-110308271193088'
   <idle>-0   [003]   5.418241: bprint: _raw_spin_unlock_irqrestore: time: 1113 '-110308271193088'

Where "time: X", X is the delta in nanoseconds around the sprintf() call.

Then I ran the attached perl program on the output, and got the following
results:

Before your patches:

 # trace-cmd report | ./report.pl
 full_total = 52844823
 average: 255.902176229032
 std: 439.269729814847

And with your patches:

 # trace-cmd report | ./report.pl
 full_total = 84725476
 average: 407.873274762306
 std: 555.755670463724

As the standard deviation is bigger than the average, it appears to be all
in the noise.

Then I decided to see if it affects "ps -eux".

Original:

 # perf stat -r 100 ps -eux > /dev/null

 Performance counter stats for 'ps -eux' (100 runs):

          8.92 msec task-clock              #  0.937 CPUs utilized           ( +-  0.90% )
             5      context-switches        #  0.545 K/sec                   ( +-  1.24% )
             0      cpu-migrations          #  0.000 K/sec
           259      page-faults             #  0.029 M/sec                   ( +-  0.07% )
    32,973,751      cycles                  #  3.698 GHz                     ( +-  0.09% )
    17,254,307      stalled-cycles-frontend # 52.33% frontend cycles idle    ( +-  0.17% )
    38,707,960      instructions            #  1.17  insn per cycle
                                            #  0.45  stalled cycles per insn ( +-  0.01% )
     8,153,197      branches                # 914.274 M/sec                  ( +-  0.01% )
       114,992      branch-misses           #  1.41% of all branches         ( +-  0.12% )

     0.0095170 +- 0.0000829 seconds time elapsed  ( +- 0.87% )

With your patches:

 # perf stat -r 100 ps -eux > /dev/null

 Performance counter stats for 'ps -eux' (100 runs):

          8.86 msec task-clock              #  0.918 CPUs utilized           ( +-  1.06% )
             5      context-switches        #  0.527 K/sec                   ( +-  1.22% )
             0      cpu-migrations          #  0.001 K/sec                   ( +-100.00% )
           259      page-faults             #  0.029 M/sec                   ( +-  0.08% )
    32,699,168      cycles                  #  3.692 GHz                     ( +-  0.12% )
    16,995,861      stalled-cycles-frontend # 51.98% frontend cycles idle    ( +-  0.21% )
    38,114,396      instructions            #  1.17  insn per cycle
                                            #  0.45  stalled cycles per insn ( +-  0.03% )
     7,985,526      branches                # 901.625 M/sec                  ( +-  0.03% )
       112,852      branch-misses           #  1.41% of all branches         ( +-  0.17% )

      0.009652 +- 0.000276 seconds time elapsed  ( +- 2.86% )

Not really much difference.

Then what about just catting /proc/$$/stat, and doing 1000 runs!
Original:

 # perf stat -r 1000 cat /proc/$$/stat > /dev/null

 Performance counter stats for 'cat /proc/1622/stat' (1000 runs):

          0.34 msec task-clock              #  0.680 CPUs utilized           ( +-  0.21% )
             0      context-switches        #  0.071 K/sec                   ( +- 20.18% )
             0      cpu-migrations          #  0.000 K/sec
            65      page-faults             #  0.192 M/sec                   ( +-  0.07% )
       993,486      cycles                  #  2.934 GHz                     ( +-  0.07% )
       577,903      stalled-cycles-frontend # 58.17% frontend cycles idle    ( +-  0.09% )
       936,489      instructions            #  0.94  insn per cycle
                                            #  0.62  stalled cycles per insn ( +-  0.07% )
       202,912      branches                # 599.213 M/sec                  ( +-  0.07% )
         6,976      branch-misses           #  3.44% of all branches         ( +-  0.11% )

    0.00049797 +- 0.00000111 seconds time elapsed  ( +- 0.22% )

With your patches:

 # perf stat -r 1000 cat /proc/$$/stat > /dev/null

 Performance counter stats for 'cat /proc/1624/stat' (1000 runs):

          0.34 msec task-clock              #  0.664 CPUs utilized           ( +-  0.23% )
             0      context-switches        #  0.018 K/sec                   ( +- 40.72% )
             0      cpu-migrations          #  0.000 K/sec
            65      page-faults             #  0.190 M/sec                   ( +-  0.07% )
       988,430      cycles                  #  2.892 GHz                     ( +-  0.07% )
       574,841      stalled-cycles-frontend # 58.16% frontend cycles idle    ( +-  0.09% )
       933,852      instructions            #  0.94  insn per cycle
                                            #  0.62  stalled cycles per insn ( +-  0.07% )
       202,096      branches                # 591.297 M/sec                  ( +-  0.07% )
         6,836      branch-misses           #  3.38% of all branches         ( +-  0.09% )

    0.00051476 +- 0.00000557 seconds time elapsed  ( +- 1.08% )

They are pretty much identical.

What's the purpose of all these changes again? There was no cover letter.

-- Steve
On (20/04/21 19:49), Alexey Dobriyan wrote:
>
> I did the benchmarks (without stack protector though), everything except
> /proc/cpuinfo and /proc/meminfo became faster.
>

Why does /proc/cpuinfo and /proc/meminfo performance matter?

	-ss
diff --git a/MAINTAINERS b/MAINTAINERS
index b816a453b10e..8322125bb929 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8470,6 +8470,12 @@ L:	linux-crypto@vger.kernel.org
 S:	Maintained
 F:	drivers/crypto/inside-secure/
 
+INTEGER PRINTING PRESS
+M:	Alexey Dobriyan <adobriyan@gmail.com>
+L:	linux-kernel@vger.kernel.org
+F:	lib/print-integer.[ch]
+S:	Maintained
+
 INTEGRITY MEASUREMENT ARCHITECTURE (IMA)
 M:	Mimi Zohar <zohar@linux.ibm.com>
 M:	Dmitry Kasatkin <dmitry.kasatkin@gmail.com>
diff --git a/lib/Makefile b/lib/Makefile
index 685aee60de1d..a2f011fa6739 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -25,7 +25,7 @@ KASAN_SANITIZE_string.o := n
 CFLAGS_string.o := $(call cc-option, -fno-stack-protector)
 endif
 
-lib-y := ctype.o string.o vsprintf.o cmdline.o \
+lib-y := ctype.o string.o print-integer.o vsprintf.o cmdline.o \
	 rbtree.o radix-tree.o timerqueue.o xarray.o \
	 idr.o extable.o sha1.o irq_regs.o argv_split.o \
	 flex_proportions.o ratelimit.o show_mem.o \
diff --git a/lib/print-integer.c b/lib/print-integer.c
new file mode 100644
index 000000000000..563aaca19b8c
--- /dev/null
+++ b/lib/print-integer.c
@@ -0,0 +1,40 @@
+#include <linux/compiler.h>
+#include <linux/math64.h>
+#include <linux/string.h>
+#include <linux/types.h>
+
+#include "print-integer.h"
+
+noinline
+char *_print_integer_u32(char *p, u32 x)
+{
+	do {
+		*--p = '0' + (x % 10);
+	} while (x /= 10);
+	return p;
+}
+
+noinline
+char *_print_integer_s32(char *p, s32 x)
+{
+	if (x < 0) {
+		p = _print_integer_u32(p, -x);
+		*--p = '-';
+		return p;
+	} else {
+		return _print_integer_u32(p, x);
+	}
+}
+
+noinline
+char *_print_integer_u64(char *p, u64 x)
+{
+	while (x >= 100 * 1000 * 1000) {
+		u32 r;
+
+		x = div_u64_rem(x, 100 * 1000 * 1000, &r);
+		p = memset(p - 8, '0', 8);
+		(void)_print_integer_u32(p + 8, r);
+	}
+	return _print_integer_u32(p, x);
+}
diff --git a/lib/print-integer.h b/lib/print-integer.h
new file mode 100644
index 000000000000..a6f8e1757a6f
--- /dev/null
+++ b/lib/print-integer.h
@@ -0,0 +1,20 @@
+#pragma once
+char *_print_integer_u32(char *, u32);
+char *_print_integer_u64(char *, u64);
+char *_print_integer_s32(char *, s32);
+
+static inline char *_print_integer_ul(char *p, unsigned long x)
+{
+#ifdef CONFIG_64BIT
+	return _print_integer_u64(p, x);
+#else
+	return _print_integer_u32(p, x);
+#endif
+}
+
+enum {
+	LEN_U32 = 10,
+	LEN_S32 = 1 + LEN_U32,
+	LEN_UL  = sizeof(long) * 5 / 2,
+	LEN_U64 = 20,
+};
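As an illustration of how the helpers and LEN_* constants added above are
meant to be combined, here is a hypothetical caller — example_show() and its
arguments are made up for this sketch and are not part of the series — which
follows the pattern described in the changelog below:

#include <linux/seq_file.h>
#include "print-integer.h"

/* seq_printf(m, "%u %d\n", pid, nice) equivalent, built right to left */
static int example_show(struct seq_file *m, u32 pid, s32 nice)
{
	/* worst case: 10 digits + ' ' + sign and 10 digits + '\n' */
	char buf[LEN_U32 + 1 + LEN_S32 + 1];
	char *p = buf + sizeof(buf);

	*--p = '\n';
	p = _print_integer_s32(p, nice);
	*--p = ' ';
	p = _print_integer_u32(p, pid);

	seq_write(m, p, buf + sizeof(buf) - p);
	return 0;
}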
The time-honored way to print integers via vsnprintf() or equivalent has
the unavoidable slowdown of parsing the format string. This can't be fixed
in C without introducing an external preprocessor.

seq_put_decimal_ull() partially saves the day, but there are a lot of
branches inside and still overcopying.

The _print_integer_*() family of functions is meant to make printing
integers as fast as possible by deleting format string parsing and doing
as little work as possible.

It is based on the following observations:

1) memcpy is done in forward direction
   it can be done backwards but nobody does that,

2) digits can be extracted in a very simple loop which costs only
   1 multiplication and shift (division by a constant is not a division)

All the above asks for the following signature, semantics and pattern of
printing out beloved /proc files:

	/* seq_printf(seq, "%u %llu\n", A, B); */

	char buf[10 + 1 + 20 + 1];
	char *p = buf + sizeof(buf);

	*--p = '\n';
	p = _print_integer_u64(p, B);
	*--p = ' ';
	p = _print_integer_u32(p, A);

	seq_write(seq, p, buf + sizeof(buf) - p);

1) A stack buffer capable of holding the biggest possible string is
   allocated.

2) "p" is a pointer to the start of the string. Initially it points past
   the end of the buffer, WHICH IS NOT NUL-TERMINATED!

3) _print_integer_*() actually prints an integer from right to left
   and returns the new start of the string.

	<--------|
	      123
	      ^
	      |
	      +-- p

4) 1 character is printed with

	*--p = 'x';

   It generates very efficient code as multiple writes can be merged.

5) A fixed string is printed with

	p = memcpy(p - 3, "foo", 3);

   Compilers know what memcpy() does and write-combine it.
   4/8-byte writes become 1 instruction and are very efficient.

6) Once everything is printed, the result is written to the seq_file
   buffer. It does only one overflow check and 1 copy.

This generates very efficient (and small!) code.

In regular seq_printf() calls, the first argument and the format string are
constantly reloaded. The format string will most likely be addressed with
[rip+...], which is quite verbose.

seq_put_decimal_ull() will do branches (and even more branches with the
"width" argument).

TODO
	benchmark with mainline because nouveau is broken for me -(
	vsnprintf() changes make the code slower

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---
 MAINTAINERS         |  6 ++++++
 lib/Makefile        |  2 +-
 lib/print-integer.c | 40 ++++++++++++++++++++++++++++++++++++++++
 lib/print-integer.h | 20 ++++++++++++++++++++
 4 files changed, 67 insertions(+), 1 deletion(-)
 create mode 100644 lib/print-integer.c
 create mode 100644 lib/print-integer.h
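For readers who want to poke at the conversion outside the kernel, here is a
self-contained user-space sketch of the same right-to-left pattern; the
helpers are re-expressed in plain C (no div_u64_rem(), no noinline) and
cross-checked against snprintf(). All names and the test loop are
illustrative, not part of the series:

/* Stand-alone sketch of the right-to-left pattern; compile with cc -O2. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static char *print_u32(char *p, uint32_t x)
{
	do {
		*--p = '0' + (x % 10);
	} while (x /= 10);
	return p;
}

static char *print_u64(char *p, uint64_t x)
{
	while (x >= 100 * 1000 * 1000) {
		uint32_t r = x % (100 * 1000 * 1000);

		x /= 100 * 1000 * 1000;
		p = memset(p - 8, '0', 8);	/* pre-fill leading zeroes */
		(void)print_u32(p + 8, r);
	}
	return print_u32(p, x);
}

int main(void)
{
	uint64_t tests[] = { 0, 7, 99999999, 100000000, 18446744073709551615ULL };
	char buf[10 + 1 + 20 + 1], ref[64];

	for (size_t i = 0; i < sizeof(tests) / sizeof(tests[0]); i++) {
		char *p = buf + sizeof(buf);

		/* seq_printf(seq, "%u %llu\n", A, B) equivalent */
		*--p = '\n';
		p = print_u64(p, tests[i]);
		*--p = ' ';
		p = print_u32(p, (uint32_t)i);

		snprintf(ref, sizeof(ref), "%u %llu\n",
			 (unsigned)i, (unsigned long long)tests[i]);
		printf("%s", memcmp(p, ref, strlen(ref)) ? "MISMATCH: " : "ok: ");
		fwrite(p, 1, buf + sizeof(buf) - p, stdout);
	}
	return 0;
}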