Message ID: 20210130191719.7085-8-yury.norov@gmail.com (mailing list archive)
State: New, archived
Series: lib/find_bit: fast path for small bitmaps
On Sat, Jan 30, 2021 at 11:17:18AM -0800, Yury Norov wrote:
> Similarly to the bitmap functions, find_next_*_bit() users will benefit
> if we handle the case of bitmaps that fit into a single word. In the
> very best case, the compiler may replace a function call with a
> single ffs or ffz instruction.

It would be nice to have examples of how this reduces the actual code size
(based on existing code in the kernel, especially in widely used frameworks /
subsystems, like PCI).
From: Andy Shevchenko
> Sent: 01 February 2021 13:49
>
> On Sat, Jan 30, 2021 at 11:17:18AM -0800, Yury Norov wrote:
> > Similarly to the bitmap functions, find_next_*_bit() users will benefit
> > if we handle the case of bitmaps that fit into a single word. In the
> > very best case, the compiler may replace a function call with a
> > single ffs or ffz instruction.
>
> It would be nice to have examples of how this reduces the actual code size
> (based on existing code in the kernel, especially in widely used frameworks /
> subsystems, like PCI).

I bet it makes the kernel bigger but very slightly faster.
But the fact that the wrappers end up in the i-cache may
mean that inlining actually makes it slower for some calling
sequences.

If a bitmap fits in a single word (as a compile-time constant),
then you should (probably) be using different functions if
you care about performance.

	David
On Mon, Feb 01, 2021 at 04:02:30PM +0000, David Laight wrote:
> From: Andy Shevchenko
> > Sent: 01 February 2021 13:49
> > On Sat, Jan 30, 2021 at 11:17:18AM -0800, Yury Norov wrote:
> > > Similarly to the bitmap functions, find_next_*_bit() users will benefit
> > > if we handle the case of bitmaps that fit into a single word. In the
> > > very best case, the compiler may replace a function call with a
> > > single ffs or ffz instruction.
> >
> > It would be nice to have examples of how this reduces the actual code size
> > (based on existing code in the kernel, especially in widely used
> > frameworks / subsystems, like PCI).
>
> I bet it makes the kernel bigger but very slightly faster.
> But the fact that the wrappers end up in the i-cache may
> mean that inlining actually makes it slower for some calling
> sequences.
>
> If a bitmap fits in a single word (as a compile-time constant),
> then you should (probably) be using different functions if
> you care about performance.

Isn't this patch series exactly about that?
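[Editor's note: for illustration, a dedicated single-word helper along the
lines David suggests might look like the sketch below. find_first_bit_word()
is a hypothetical name, not an existing kernel API; GENMASK() and __ffs()
are the usual kernel macros, and size is assumed to be 1..BITS_PER_LONG.]

        /* Hypothetical helper for bitmaps known to fit into one word */
        static inline unsigned long find_first_bit_word(unsigned long word,
                                                        unsigned long size)
        {
                /* Drop the bits at and above 'size', then take the lowest set bit */
                unsigned long val = word & GENMASK(size - 1, 0);

                return val ? __ffs(val) : size;
        }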
[ CC Alexey Klimov and Rasmus Villemoes, as they may also be interested ]

On Mon, Feb 1, 2021 at 8:22 AM Andy Shevchenko
<andriy.shevchenko@linux.intel.com> wrote:
> On Mon, Feb 01, 2021 at 04:02:30PM +0000, David Laight wrote:
> > From: Andy Shevchenko
> > > Sent: 01 February 2021 13:49
> > > On Sat, Jan 30, 2021 at 11:17:18AM -0800, Yury Norov wrote:
> > > > Similarly to the bitmap functions, find_next_*_bit() users will benefit
> > > > if we handle the case of bitmaps that fit into a single word. In the
> > > > very best case, the compiler may replace a function call with a
> > > > single ffs or ffz instruction.
> > >
> > > It would be nice to have examples of how this reduces the actual code
> > > size (based on existing code in the kernel, especially in widely used
> > > frameworks / subsystems, like PCI).
> >
> > I bet it makes the kernel bigger but very slightly faster.
> > But the fact that the wrappers end up in the i-cache may
> > mean that inlining actually makes it slower for some calling
> > sequences.
> >
> > If a bitmap fits in a single word (as a compile-time constant),
> > then you should (probably) be using different functions if
> > you care about performance.
>
> Isn't this patch series exactly about that?

Probably David meant something like find_first_bit_fast_path(). I'm not
sure that approach would be better than pure inlining.

Addressing the questions above, here is my arm64 build for qemu.

Before:

Idx Name               Size      VMA               LMA               File off  Algn
  0 .head.text         00010000  ffff800010000000  ffff800010000000  00010000  2**16
                       CONTENTS, ALLOC, LOAD, READONLY, CODE
  1 .text              011603a8  ffff800010010000  ffff800010010000  00020000  2**16
                       CONTENTS, ALLOC, LOAD, READONLY, CODE
  2 .got.plt           00000018  ffff8000111703a8  ffff8000111703a8  011803a8  2**3
                       CONTENTS, ALLOC, LOAD, DATA
  3 .rodata            007a29b2  ffff800011180000  ffff800011180000  01190000  2**12
                       CONTENTS, ALLOC, LOAD, DATA
  4 .pci_fixup         000025f0  ffff8000119229c0  ffff8000119229c0  019329c0  2**4
                       CONTENTS, ALLOC, LOAD, READONLY, DATA
  5 __ksymtab          000100d4  ffff800011924fb0  ffff800011924fb0  01934fb0  2**2
                       CONTENTS, ALLOC, LOAD, READONLY, DATA
  6 __ksymtab_gpl      000147cc  ffff800011935084  ffff800011935084  01945084  2**2
                       CONTENTS, ALLOC, LOAD, READONLY, DATA
  7 __ksymtab_strings  0003b0be  ffff800011949850  ffff800011949850  01959850  2**0
                       CONTENTS, ALLOC, LOAD, READONLY, DATA
  8 __param            00003e58  ffff800011984910  ffff800011984910  01994910  2**3
                       CONTENTS, ALLOC, LOAD, READONLY, DATA
  9 __modver           00000678  ffff800011988768  ffff800011988768  01998768  2**3
                       CONTENTS, ALLOC, LOAD, DATA
 10 __ex_table         00002270  ffff800011988de0  ffff800011988de0  01998de0  2**3
                       CONTENTS, ALLOC, LOAD, READONLY, DATA
 11 .notes             0000003c  ffff80001198b050  ffff80001198b050  0199b050  2**2
                       CONTENTS, ALLOC, LOAD, READONLY, DATA
 12 .init.text         0007b09c  ffff8000119a0000  ffff8000119a0000  019a0000  2**2
                       CONTENTS, ALLOC, LOAD, READONLY, CODE
 13 .exit.text         00004120  ffff800011a1b09c  ffff800011a1b09c  01a1b09c  2**2
                       CONTENTS, ALLOC, LOAD, READONLY, CODE
...
After:

Idx Name               Size      VMA               LMA               File off  Algn
  0 .head.text         00010000  ffff800010000000  ffff800010000000  00010000  2**16
                       CONTENTS, ALLOC, LOAD, READONLY, CODE
  1 .text              011613a8  ffff800010010000  ffff800010010000  00020000  2**16
                       CONTENTS, ALLOC, LOAD, READONLY, CODE
  2 .got.plt           00000018  ffff8000111713a8  ffff8000111713a8  011813a8  2**3
                       CONTENTS, ALLOC, LOAD, DATA
  3 .rodata            007a26b2  ffff800011180000  ffff800011180000  01190000  2**12
                       CONTENTS, ALLOC, LOAD, DATA
  4 .pci_fixup         000025f0  ffff8000119226c0  ffff8000119226c0  019326c0  2**4
                       CONTENTS, ALLOC, LOAD, READONLY, DATA
  5 __ksymtab          000100bc  ffff800011924cb0  ffff800011924cb0  01934cb0  2**2
                       CONTENTS, ALLOC, LOAD, READONLY, DATA
  6 __ksymtab_gpl      000147cc  ffff800011934d6c  ffff800011934d6c  01944d6c  2**2
                       CONTENTS, ALLOC, LOAD, READONLY, DATA
  7 __ksymtab_strings  0003b09b  ffff800011949538  ffff800011949538  01959538  2**0
                       CONTENTS, ALLOC, LOAD, READONLY, DATA
  8 __param            00003e58  ffff8000119845d8  ffff8000119845d8  019945d8  2**3
                       CONTENTS, ALLOC, LOAD, READONLY, DATA
  9 __modver           00000678  ffff800011988430  ffff800011988430  01998430  2**3
                       CONTENTS, ALLOC, LOAD, DATA
 10 __ex_table         00002270  ffff800011988aa8  ffff800011988aa8  01998aa8  2**3
                       CONTENTS, ALLOC, LOAD, READONLY, DATA
 11 .notes             0000003c  ffff80001198ad18  ffff80001198ad18  0199ad18  2**2
                       CONTENTS, ALLOC, LOAD, READONLY, DATA
 12 .init.text         0007b1a8  ffff8000119a0000  ffff8000119a0000  019a0000  2**2
                       CONTENTS, ALLOC, LOAD, READONLY, CODE
 13 .exit.text         00004120  ffff800011a1b1a8  ffff800011a1b1a8  01a1b1a8  2**2
                       CONTENTS, ALLOC, LOAD, READONLY, CODE

The size of .text is increased by 0x1000 (4096) bytes, or ~0.02%.
The size of .rodata is decreased by 0x300 (768) bytes, or ~0.01%.
The size of __ksymtab is decreased by 24 bytes, or ~0.04%.

So the net cost of this fast path is ~3.3K of text.

Regarding code generation, this is a quite typical find_bit() user:

unsigned int cpumask_next(int n, const struct cpumask *srcp)
{
        /* -1 is a legal arg here. */
        if (n != -1)
                cpumask_check(n);

        return find_next_bit(cpumask_bits(srcp), nr_cpumask_bits, n + 1);
}
EXPORT_SYMBOL(cpumask_next);

Before:

0000000000000000 <cpumask_next>:
   0:   a9bf7bfd        stp     x29, x30, [sp, #-16]!
   4:   11000402        add     w2, w0, #0x1
   8:   aa0103e0        mov     x0, x1
   c:   d2800401        mov     x1, #0x40                  // #64
  10:   910003fd        mov     x29, sp
  14:   93407c42        sxtw    x2, w2
  18:   94000000        bl      0 <find_next_bit>
  1c:   a8c17bfd        ldp     x29, x30, [sp], #16
  20:   d65f03c0        ret
  24:   d503201f        nop

After:

0000000000000140 <cpumask_next>:
 140:   11000400        add     w0, w0, #0x1
 144:   93407c00        sxtw    x0, w0
 148:   f100fc1f        cmp     x0, #0x3f
 14c:   54000168        b.hi    178 <cpumask_next+0x38>    // b.pmore
 150:   f9400023        ldr     x3, [x1]
 154:   92800001        mov     x1, #0xffffffffffffffff    // #-1
 158:   9ac02020        lsl     x0, x1, x0
 15c:   52800802        mov     w2, #0x40                  // #64
 160:   8a030001        and     x1, x0, x3
 164:   dac00020        rbit    x0, x1
 168:   f100003f        cmp     x1, #0x0
 16c:   dac01000        clz     x0, x0
 170:   1a800040        csel    w0, w2, w0, eq             // eq = none
 174:   d65f03c0        ret
 178:   52800800        mov     w0, #0x40                  // #64
 17c:   d65f03c0        ret

The find_next_bit() call is removed at the cost of 6 extra inline
instructions. (And I suspect we can improve GENMASK() for better code
generation.) find_next_bit() itself is 41 instructions, so I believe
this fast path will bring value for many users.
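[Editor's note: to make the transformation concrete, here is a minimal
user-space model of the single-word fast path. BITS_PER_LONG, GENMASK()
and __ffs() are simplified stand-ins for the kernel macros; a sketch of
the logic, not the kernel code.]

        #include <stdio.h>

        #define BITS_PER_LONG (8 * sizeof(unsigned long))
        #define GENMASK(h, l) \
                (((~0UL) << (l)) & (~0UL >> (BITS_PER_LONG - 1 - (h))))

        /* Modeled with a GCC builtin; the kernel __ffs() is arch-specific */
        static inline unsigned long __ffs(unsigned long word)
        {
                return (unsigned long)__builtin_ctzl(word);
        }

        /* find_next_bit() reduced to the single-word case */
        static inline unsigned long find_next_bit_small(unsigned long word,
                                                        unsigned long size,
                                                        unsigned long offset)
        {
                unsigned long val;

                if (offset >= size)
                        return size;

                /* Keep only the bits in [offset, size - 1] */
                val = word & GENMASK(size - 1, offset);
                return val ? __ffs(val) : size;
        }

        int main(void)
        {
                unsigned long map = 0xa4;       /* bits 2, 5 and 7 set */

                printf("%lu\n", find_next_bit_small(map, 8, 3)); /* prints 5 */
                printf("%lu\n", find_next_bit_small(map, 8, 6)); /* prints 7 */
                return 0;
        }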
On the other hand, if someone is limited by I-cache and wants to avoid
generating the fast paths, I think it's worth making SMALL_CONST()
configurable, which would also disable the fast paths in lib/bitmap.c.
We may rely on the -Os flag, or enable the fast path explicitly in the
config:

diff --git a/include/asm-generic/bitsperlong.h b/include/asm-generic/bitsperlong.h
index 0eeb77544f1d..209e531074c1 100644
--- a/include/asm-generic/bitsperlong.h
+++ b/include/asm-generic/bitsperlong.h
@@ -23,6 +23,10 @@
 #define BITS_PER_LONG_LONG 64
 #endif
 
+#ifdef CONFIG_FAST_PATH
 #define SMALL_CONST(n) (__builtin_constant_p(n) && (unsigned long)(n) < BITS_PER_LONG)
+#else
+#define SMALL_CONST(n) (0)
+#endif
 
 #endif /* __ASM_GENERIC_BITS_PER_LONG */
diff --git a/lib/Kconfig b/lib/Kconfig
index a38cc61256f1..c9629b3ebce8 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -39,6 +39,14 @@ config PACKING
 
 	  When in doubt, say N.
 
+config FAST_PATH
+	bool "Enable fast path code generation"
+	default y
+	help
+	  This option enables fast path optimization with the cost
+	  of increasing the text section.
+
 config BITREVERSE
 	tristate

> With Best Regards,
> Andy Shevchenko
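[Editor's note: as a usage illustration (not from the series), SMALL_CONST()
dispatches entirely at compile time because it tests __builtin_constant_p().
The user-space demonstration below models that dispatch; it behaves as
described when built with optimization (e.g. -O2) so the static inline
function is folded at each call site.]

        #include <stdio.h>

        #define BITS_PER_LONG (8 * sizeof(unsigned long))
        #define SMALL_CONST(n) \
                (__builtin_constant_p(n) && (unsigned long)(n) < BITS_PER_LONG)

        /* Stand-in for find_next_bit(): reports which branch was selected */
        static inline const char *which_path(unsigned long size)
        {
                if (SMALL_CONST(size - 1))
                        return "inline fast path";      /* GENMASK()/__ffs() branch */

                return "out-of-line _find_next_bit()";
        }

        int main(void)
        {
                unsigned long n;

                if (scanf("%lu", &n) != 1)
                        n = 16;

                printf("%s\n", which_path(16)); /* constant size: fast path */
                printf("%s\n", which_path(n));  /* runtime size: slow path */
                return 0;
        }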
diff --git a/include/asm-generic/bitops/find.h b/include/asm-generic/bitops/find.h
index 7ad70dab8e93..8bd7a33a889d 100644
--- a/include/asm-generic/bitops/find.h
+++ b/include/asm-generic/bitops/find.h
@@ -20,6 +20,16 @@ static inline
 unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
 			    unsigned long offset)
 {
+	if (SMALL_CONST(size - 1)) {
+		unsigned long val;
+
+		if (unlikely(offset >= size))
+			return size;
+
+		val = *addr & GENMASK(size - 1, offset);
+		return val ? __ffs(val) : size;
+	}
+
 	return _find_next_bit(addr, NULL, size, offset, 0UL, 0);
 }
 #endif
@@ -40,6 +50,16 @@ unsigned long find_next_and_bit(const unsigned long *addr1,
 		const unsigned long *addr2, unsigned long size,
 		unsigned long offset)
 {
+	if (SMALL_CONST(size - 1)) {
+		unsigned long val;
+
+		if (unlikely(offset >= size))
+			return size;
+
+		val = *addr1 & *addr2 & GENMASK(size - 1, offset);
+		return val ? __ffs(val) : size;
+	}
+
 	return _find_next_bit(addr1, addr2, size, offset, 0UL, 0);
 }
 #endif
@@ -58,6 +78,16 @@ static inline
 unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
 				 unsigned long offset)
 {
+	if (SMALL_CONST(size - 1)) {
+		unsigned long val;
+
+		if (unlikely(offset >= size))
+			return size;
+
+		val = *addr | ~GENMASK(size - 1, offset);
+		return val == ~0UL ? size : ffz(val);
+	}
+
 	return _find_next_bit(addr, NULL, size, offset, ~0UL, 0);
 }
 #endif
diff --git a/include/asm-generic/bitops/le.h b/include/asm-generic/bitops/le.h
index 21305f6cea0b..18ebcf639d7f 100644
--- a/include/asm-generic/bitops/le.h
+++ b/include/asm-generic/bitops/le.h
@@ -5,6 +5,7 @@
 #include <asm-generic/bitops/find.h>
 #include <asm/types.h>
 #include <asm/byteorder.h>
+#include <linux/swab.h>
 
 #if defined(__LITTLE_ENDIAN)
 
@@ -37,6 +38,16 @@ static inline unsigned long find_next_zero_bit_le(const void *addr,
 		unsigned long size, unsigned long offset)
 {
+	if (SMALL_CONST(size)) {
+		unsigned long val = *(const unsigned long *)addr;
+
+		if (unlikely(offset >= size))
+			return size;
+
+		val = swab(val) | ~GENMASK(size - 1, offset);
+		return val == ~0UL ? size : ffz(val);
+	}
+
 	return _find_next_bit(addr, NULL, size, offset, ~0UL, 1);
 }
 #endif
@@ -46,6 +57,16 @@ static inline unsigned long find_next_bit_le(const void *addr,
 		unsigned long size, unsigned long offset)
 {
+	if (SMALL_CONST(size)) {
+		unsigned long val = *(const unsigned long *)addr;
+
+		if (unlikely(offset >= size))
+			return size;
+
+		val = swab(val) & GENMASK(size - 1, offset);
+		return val ? __ffs(val) : size;
+	}
+
 	return _find_next_bit(addr, NULL, size, offset, 0UL, 1);
 }
 #endif
diff --git a/tools/include/asm-generic/bitops/find.h b/tools/include/asm-generic/bitops/find.h
index 9fe62d10b084..eff868bd22f8 100644
--- a/tools/include/asm-generic/bitops/find.h
+++ b/tools/include/asm-generic/bitops/find.h
@@ -20,6 +20,16 @@ static inline
 unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
 			    unsigned long offset)
 {
+	if (SMALL_CONST(size - 1)) {
+		unsigned long val;
+
+		if (unlikely(offset >= size))
+			return size;
+
+		val = *addr & GENMASK(size - 1, offset);
+		return val ? __ffs(val) : size;
+	}
+
 	return _find_next_bit(addr, NULL, size, offset, 0UL, 0);
 }
 #endif
@@ -40,6 +50,16 @@ unsigned long find_next_and_bit(const unsigned long *addr1,
 		const unsigned long *addr2, unsigned long size,
 		unsigned long offset)
 {
+	if (SMALL_CONST(size - 1)) {
+		unsigned long val;
+
+		if (unlikely(offset >= size))
+			return size;
+
+		val = *addr1 & *addr2 & GENMASK(size - 1, offset);
+		return val ? __ffs(val) : size;
+	}
+
 	return _find_next_bit(addr1, addr2, size, offset, 0UL, 0);
 }
 #endif
@@ -58,6 +78,16 @@ static inline
 unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
 				 unsigned long offset)
 {
+	if (SMALL_CONST(size - 1)) {
+		unsigned long val;
+
+		if (unlikely(offset >= size))
+			return size;
+
+		val = *addr | ~GENMASK(size - 1, offset);
+		return val == ~0UL ? size : ffz(val);
+	}
+
 	return _find_next_bit(addr, NULL, size, offset, ~0UL, 0);
 }
 #endif
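[Editor's note: for reference, the find_next_zero_bit() branch above can be
modeled in user space as follows. ffz() is approximated here with a GCC
builtin on the complement; a sketch under those assumptions, not the kernel
implementation.]

        #include <assert.h>

        #define BITS_PER_LONG (8 * sizeof(unsigned long))
        #define GENMASK(h, l) \
                (((~0UL) << (l)) & (~0UL >> (BITS_PER_LONG - 1 - (h))))

        /* ffz(): index of the first zero bit, modeled as ctz of the complement */
        static inline unsigned long ffz(unsigned long val)
        {
                return (unsigned long)__builtin_ctzl(~val);
        }

        static inline unsigned long find_next_zero_bit_small(unsigned long word,
                                                             unsigned long size,
                                                             unsigned long offset)
        {
                unsigned long val;

                if (offset >= size)
                        return size;

                /* Force bits outside [offset, size - 1] to 1 so they are skipped */
                val = word | ~GENMASK(size - 1, offset);
                return val == ~0UL ? size : ffz(val);
        }

        int main(void)
        {
                assert(find_next_zero_bit_small(0xffUL, 16, 0) == 8);
                assert(find_next_zero_bit_small(0xffUL, 8, 0) == 8); /* no zero in range */
                return 0;
        }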
Similarly to the bitmap functions, find_next_*_bit() users will benefit
if we handle the case of bitmaps that fit into a single word. In the
very best case, the compiler may replace a function call with a single
ffs or ffz instruction.

Signed-off-by: Yury Norov <yury.norov@gmail.com>
---
 include/asm-generic/bitops/find.h       | 30 +++++++++++++++++++++++++
 include/asm-generic/bitops/le.h         | 21 +++++++++++++++++
 tools/include/asm-generic/bitops/find.h | 30 +++++++++++++++++++++++++
 3 files changed, 81 insertions(+)