
[7/8] lib: add fast path for find_next_*_bit()

Message ID: 20210130191719.7085-8-yury.norov@gmail.com (mailing list archive)
State: New, archived
Series: lib/find_bit: fast path for small bitmaps

Commit Message

Yury Norov Jan. 30, 2021, 7:17 p.m. UTC
Similarly to bitmap functions, find_next_*_bit() users will benefit
if we'll handle a case of bitmaps that fit into a single word. In the
very best case, the compiler may replace a function call with a
single ffs or ffz instruction.
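
For illustration, here is a minimal sketch of the kind of caller this
helps (a hypothetical example, not taken from the kernel tree):

#include <linux/bitmap.h>
#include <linux/bitops.h>

static DECLARE_BITMAP(pending, BITS_PER_LONG);

static unsigned long first_pending(void)
{
	/*
	 * The size is a compile-time constant <= BITS_PER_LONG, so with
	 * the inline fast path this call can compile down to a load of
	 * pending[0] plus a single __ffs() (e.g. rbit+clz on arm64)
	 * instead of a call into lib/find_bit.c.
	 */
	return find_next_bit(pending, BITS_PER_LONG, 0);
}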

Signed-off-by: Yury Norov <yury.norov@gmail.com>
---
 include/asm-generic/bitops/find.h       | 30 +++++++++++++++++++++++++
 include/asm-generic/bitops/le.h         | 21 +++++++++++++++++
 tools/include/asm-generic/bitops/find.h | 30 +++++++++++++++++++++++++
 3 files changed, 81 insertions(+)

Comments

Andy Shevchenko Feb. 1, 2021, 1:49 p.m. UTC | #1
On Sat, Jan 30, 2021 at 11:17:18AM -0800, Yury Norov wrote:
> Similarly to bitmap functions, find_next_*_bit() users will benefit
> if we'll handle a case of bitmaps that fit into a single word. In the
> very best case, the compiler may replace a function call with a
> single ffs or ffz instruction.

Would be nice to have the examples how it reduces the actual code size (based
on the existing code in kernel, especially in widely used frameworks /
subsystems, like PCI).
David Laight Feb. 1, 2021, 4:02 p.m. UTC | #2
From: Andy Shevchenko
> Sent: 01 February 2021 13:49
> 
> On Sat, Jan 30, 2021 at 11:17:18AM -0800, Yury Norov wrote:
> > Similarly to bitmap functions, find_next_*_bit() users will benefit
> > if we'll handle a case of bitmaps that fit into a single word. In the
> > very best case, the compiler may replace a function call with a
> > single ffs or ffz instruction.
> 
> Would be nice to have the examples how it reduces the actual code size (based
> on the existing code in kernel, especially in widely used frameworks /
> subsystems, like PCI).

I bet it makes the kernel bigger but very slightly faster.
But the fact that the wrappers end up in the i-cache may
mean that inlining actually makes it slower for some calling
sequences.

If a bitmap fits in a single word (as a compile-time constant)
then you should (probably) be using different functions if
you care about performance.
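
As a hypothetical sketch of that idea (not an existing kernel helper),
a dedicated single-word variant might look like this:

#include <linux/bits.h>
#include <linux/bitops.h>

static inline unsigned long find_next_bit_word(unsigned long word,
					       unsigned long size,
					       unsigned long offset)
{
	/* Assumes size and offset are at most BITS_PER_LONG. */
	if (offset >= size)
		return size;

	word &= GENMASK(size - 1, offset);
	return word ? __ffs(word) : size;
}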

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Andy Shevchenko Feb. 1, 2021, 4:22 p.m. UTC | #3
On Mon, Feb 01, 2021 at 04:02:30PM +0000, David Laight wrote:
> From: Andy Shevchenko
> > Sent: 01 February 2021 13:49
> > On Sat, Jan 30, 2021 at 11:17:18AM -0800, Yury Norov wrote:
> > > Similarly to bitmap functions, find_next_*_bit() users will benefit
> > > if we'll handle a case of bitmaps that fit into a single word. In the
> > > very best case, the compiler may replace a function call with a
> > > single ffs or ffz instruction.
> > 
> > Would be nice to have the examples how it reduces the actual code size (based
> > on the existing code in kernel, especially in widely used frameworks /
> > subsystems, like PCI).
> 
> I bet it makes the kernel bigger but very slightly faster.
> But the fact that the wrappers end up in the i-cache may
> mean that inlining actually makes it slower for some calling
> sequences.

> If a bitmap fits in a single word (as a compile-time constant)
> then you should (probably) be using different functions if
> you care about performance.

Isn't this patch series exactly about it?
Yury Norov Feb. 2, 2021, 7:02 a.m. UTC | #4
[ CC Alexey Klimov and Rasmus Villemoes, as they may also be interested ]

On Mon, Feb 1, 2021 at 8:22 AM Andy Shevchenko
<andriy.shevchenko@linux.intel.com> wrote:

> On Mon, Feb 01, 2021 at 04:02:30PM +0000, David Laight wrote:
> > From: Andy Shevchenko
> > > Sent: 01 February 2021 13:49
> > > On Sat, Jan 30, 2021 at 11:17:18AM -0800, Yury Norov wrote:
> > > > Similarly to bitmap functions, find_next_*_bit() users will benefit
> > > > if we'll handle a case of bitmaps that fit into a single word. In the
> > > > very best case, the compiler may replace a function call with a
> > > > single ffs or ffz instruction.
> > >
> > > Would be nice to have the examples how it reduces the actual code size (based
> > > on the existing code in kernel, especially in widely used frameworks /
> > > subsystems, like PCI).
> >
> > I bet it makes the kernel bigger but very slightly faster.
> > But the fact that the wrappers end up in the i-cache may
> > mean that inlining actually makes it slower for some calling
> > sequences.
>
> > If a bitmap fits in a single word (as a compile-time constant)
> > then you should (probably) be using different functions if
> > you care about performance.
>
> Isn't this patch series exactly about it?

Probably, David meant something like find_first_bit_fast_path().
Not sure that this approach would be better than pure inlining.

Addressing the questions above.

My arm64 build for qemu:
Before:
Idx Name          Size      VMA               LMA               File off  Algn
  0 .head.text    00010000  ffff800010000000  ffff800010000000  00010000  2**16
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
  1 .text         011603a8  ffff800010010000  ffff800010010000  00020000  2**16
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
  2 .got.plt      00000018  ffff8000111703a8  ffff8000111703a8  011803a8  2**3
                  CONTENTS, ALLOC, LOAD, DATA
  3 .rodata       007a29b2  ffff800011180000  ffff800011180000  01190000  2**12
                  CONTENTS, ALLOC, LOAD, DATA
  4 .pci_fixup    000025f0  ffff8000119229c0  ffff8000119229c0  019329c0  2**4
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  5 __ksymtab     000100d4  ffff800011924fb0  ffff800011924fb0  01934fb0  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  6 __ksymtab_gpl 000147cc  ffff800011935084  ffff800011935084  01945084  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  7 __ksymtab_strings 0003b0be  ffff800011949850  ffff800011949850  01959850  2**0
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  8 __param       00003e58  ffff800011984910  ffff800011984910  01994910  2**3
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  9 __modver      00000678  ffff800011988768  ffff800011988768  01998768  2**3
                  CONTENTS, ALLOC, LOAD, DATA
 10 __ex_table    00002270  ffff800011988de0  ffff800011988de0  01998de0  2**3
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
 11 .notes        0000003c  ffff80001198b050  ffff80001198b050  0199b050  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
 12 .init.text    0007b09c  ffff8000119a0000  ffff8000119a0000  019a0000  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
 13 .exit.text    00004120  ffff800011a1b09c  ffff800011a1b09c  01a1b09c  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
...

After:
Idx Name          Size      VMA               LMA               File off  Algn
  0 .head.text    00010000  ffff800010000000  ffff800010000000  00010000  2**16
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
  1 .text         011613a8  ffff800010010000  ffff800010010000  00020000  2**16
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
  2 .got.plt      00000018  ffff8000111713a8  ffff8000111713a8  011813a8  2**3
                  CONTENTS, ALLOC, LOAD, DATA
  3 .rodata       007a26b2  ffff800011180000  ffff800011180000  01190000  2**12
                  CONTENTS, ALLOC, LOAD, DATA
  4 .pci_fixup    000025f0  ffff8000119226c0  ffff8000119226c0  019326c0  2**4
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  5 __ksymtab     000100bc  ffff800011924cb0  ffff800011924cb0  01934cb0  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  6 __ksymtab_gpl 000147cc  ffff800011934d6c  ffff800011934d6c  01944d6c  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  7 __ksymtab_strings 0003b09b  ffff800011949538  ffff800011949538  01959538  2**0
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  8 __param       00003e58  ffff8000119845d8  ffff8000119845d8  019945d8  2**3
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  9 __modver      00000678  ffff800011988430  ffff800011988430  01998430  2**3
                  CONTENTS, ALLOC, LOAD, DATA
 10 __ex_table    00002270  ffff800011988aa8  ffff800011988aa8  01998aa8  2**3
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
 11 .notes        0000003c  ffff80001198ad18  ffff80001198ad18  0199ad18  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
 12 .init.text    0007b1a8  ffff8000119a0000  ffff8000119a0000  019a0000  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
 13 .exit.text    00004120  ffff800011a1b1a8  ffff800011a1b1a8  01a1b1a8  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, CODE

The size of .text is increased by 0x1000 bytes, or 0.02%
The size of .rodata is decreased by 0x300 bytes, or 0.06%
The size of __ksymtab is decreased by 24 bytes, or 0.018%

So the net cost of this fast path is ~3.3K.

Regarding code generation, here is a fairly typical find_bit() user:

unsigned int cpumask_next(int n, const struct cpumask *srcp)
{
        /* -1 is a legal arg here. */
        if (n != -1)
                cpumask_check(n);
        return find_next_bit(cpumask_bits(srcp), nr_cpumask_bits, n + 1);
}
EXPORT_SYMBOL(cpumask_next);

Before:
0000000000000000 <cpumask_next>:
   0:   a9bf7bfd        stp     x29, x30, [sp, #-16]!
   4:   11000402        add     w2, w0, #0x1
   8:   aa0103e0        mov     x0, x1
   c:   d2800401        mov     x1, #0x40                       // #64
  10:   910003fd        mov     x29, sp
  14:   93407c42        sxtw    x2, w2
  18:   94000000        bl      0 <find_next_bit>
  1c:   a8c17bfd        ldp     x29, x30, [sp], #16
  20:   d65f03c0        ret
  24:   d503201f        nop

After:
0000000000000140 <cpumask_next>:
 140:   11000400        add     w0, w0, #0x1
 144:   93407c00        sxtw    x0, w0
 148:   f100fc1f        cmp     x0, #0x3f
 14c:   54000168        b.hi    178 <cpumask_next+0x38>  // b.pmore
 150:   f9400023        ldr     x3, [x1]
 154:   92800001        mov     x1, #0xffffffffffffffff         // #-1
 158:   9ac02020        lsl     x0, x1, x0
 15c:   52800802        mov     w2, #0x40                       // #64
 160:   8a030001        and     x1, x0, x3
 164:   dac00020        rbit    x0, x1
 168:   f100003f        cmp     x1, #0x0
 16c:   dac01000        clz     x0, x0
 170:   1a800040        csel    w0, w2, w0, eq  // eq = none
 174:   d65f03c0        ret
 178:   52800800        mov     w0, #0x40                       // #64
 17c:   d65f03c0        ret

The find_next_bit() call is removed at the cost of 6 extra inlined instructions.
(And I suspect GENMASK() can be improved for better code generation.)
find_next_bit() itself is 41 instructions, so I believe this fast path
would bring value for many users.
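
For reference, here is a sketch of how such a mask can be built; the
kernel's GENMASK() in include/linux/bits.h is equivalent in effect,
though not written exactly this way:

/* Assumes 0 <= l <= h < BITS_PER_LONG. */
static inline unsigned long genmask(unsigned int h, unsigned int l)
{
	return (~0UL << l) & (~0UL >> (BITS_PER_LONG - 1 - h));
}

In the fast path, h is the compile-time constant size - 1. For
cpumask_next() above, size == 64 == BITS_PER_LONG, so the right-hand
term is ~0UL and the whole mask folds to ~0UL << offset, i.e. the
single lsl instruction in the "After" listing.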

On the other hand, if someone is constrained on I-cache and wants to avoid
generating fast paths, I think it's worth making SMALL_CONST() configurable,
which would also disable the fast paths in lib/bitmap.c. We could rely on the
-Os flag, or enable the fast path explicitly in the config:

diff --git a/include/asm-generic/bitsperlong.h b/include/asm-generic/bitsperlong.h
index 0eeb77544f1d..209e531074c1 100644
--- a/include/asm-generic/bitsperlong.h
+++ b/include/asm-generic/bitsperlong.h
@@ -23,6 +23,10 @@
 #define BITS_PER_LONG_LONG 64
 #endif

+#ifdef CONFIG_FAST_PATH
 #define SMALL_CONST(n) (__builtin_constant_p(n) && (unsigned long)(n) < BITS_PER_LONG)
+#else
+#define SMALL_CONST(n) (0)
+#endif

 #endif /* __ASM_GENERIC_BITS_PER_LONG */
diff --git a/lib/Kconfig b/lib/Kconfig
index a38cc61256f1..c9629b3ebce8 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -39,6 +39,14 @@ config PACKING

          When in doubt, say N.

+config FAST_PATH
+       bool "Enable fast path code generation"
+       default y
+       help
+         This option enables fast path optimization with the cost
+         of increasing the text section.
+
 config BITREVERSE
        tristate

> With Best Regards,
> Andy Shevchenko

Patch

diff --git a/include/asm-generic/bitops/find.h b/include/asm-generic/bitops/find.h
index 7ad70dab8e93..8bd7a33a889d 100644
--- a/include/asm-generic/bitops/find.h
+++ b/include/asm-generic/bitops/find.h
@@ -20,6 +20,16 @@  static inline
 unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
 			    unsigned long offset)
 {
+	if (SMALL_CONST(size - 1)) {
+		unsigned long val;
+
+		if (unlikely(offset >= size))
+			return size;
+
+		val = *addr & GENMASK(size - 1, offset);
+		return val ? __ffs(val) : size;
+	}
+
 	return _find_next_bit(addr, NULL, size, offset, 0UL, 0);
 }
 #endif
@@ -40,6 +50,16 @@  unsigned long find_next_and_bit(const unsigned long *addr1,
 		const unsigned long *addr2, unsigned long size,
 		unsigned long offset)
 {
+	if (SMALL_CONST(size - 1)) {
+		unsigned long val;
+
+		if (unlikely(offset >= size))
+			return size;
+
+		val = *addr1 & *addr2 & GENMASK(size - 1, offset);
+		return val ? __ffs(val) : size;
+	}
+
 	return _find_next_bit(addr1, addr2, size, offset, 0UL, 0);
 }
 #endif
@@ -58,6 +78,16 @@  static inline
 unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
 				 unsigned long offset)
 {
+	if (SMALL_CONST(size - 1)) {
+		unsigned long val;
+
+		if (unlikely(offset >= size))
+			return size;
+
+		val = *addr | ~GENMASK(size - 1, offset);
+		return val == ~0UL ? size : ffz(val);
+	}
+
 	return _find_next_bit(addr, NULL, size, offset, ~0UL, 0);
 }
 #endif
diff --git a/include/asm-generic/bitops/le.h b/include/asm-generic/bitops/le.h
index 21305f6cea0b..18ebcf639d7f 100644
--- a/include/asm-generic/bitops/le.h
+++ b/include/asm-generic/bitops/le.h
@@ -5,6 +5,7 @@ 
 #include <asm-generic/bitops/find.h>
 #include <asm/types.h>
 #include <asm/byteorder.h>
+#include <linux/swab.h>
 
 #if defined(__LITTLE_ENDIAN)
 
@@ -37,6 +38,16 @@  static inline
 unsigned long find_next_zero_bit_le(const void *addr, unsigned
 		long size, unsigned long offset)
 {
+	if (SMALL_CONST(size)) {
+		unsigned long val = *(const unsigned long *)addr;
+
+		if (unlikely(offset >= size))
+			return size;
+
+		val = swab(val) | ~GENMASK(size - 1, offset);
+		return val == ~0UL ? size : ffz(val);
+	}
+
 	return _find_next_bit(addr, NULL, size, offset, ~0UL, 1);
 }
 #endif
@@ -46,6 +57,16 @@  static inline
 unsigned long find_next_bit_le(const void *addr, unsigned
 		long size, unsigned long offset)
 {
+	if (SMALL_CONST(size)) {
+		unsigned long val = *(const unsigned long *)addr;
+
+		if (unlikely(offset >= size))
+			return size;
+
+		val = swab(val) & GENMASK(size - 1, offset);
+		return val ? __ffs(val) : size;
+	}
+
 	return _find_next_bit(addr, NULL, size, offset, 0UL, 1);
 }
 #endif
diff --git a/tools/include/asm-generic/bitops/find.h b/tools/include/asm-generic/bitops/find.h
index 9fe62d10b084..eff868bd22f8 100644
--- a/tools/include/asm-generic/bitops/find.h
+++ b/tools/include/asm-generic/bitops/find.h
@@ -20,6 +20,16 @@  static inline
 unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
 			    unsigned long offset)
 {
+	if (SMALL_CONST(size - 1)) {
+		unsigned long val;
+
+		if (unlikely(offset >= size))
+			return size;
+
+		val = *addr & GENMASK(size - 1, offset);
+		return val ? __ffs(val) : size;
+	}
+
 	return _find_next_bit(addr, NULL, size, offset, 0UL, 0);
 }
 #endif
@@ -40,6 +50,16 @@  unsigned long find_next_and_bit(const unsigned long *addr1,
 		const unsigned long *addr2, unsigned long size,
 		unsigned long offset)
 {
+	if (SMALL_CONST(size - 1)) {
+		unsigned long val;
+
+		if (unlikely(offset >= size))
+			return size;
+
+		val = *addr1 & *addr2 & GENMASK(size - 1, offset);
+		return val ? __ffs(val) : size;
+	}
+
 	return _find_next_bit(addr1, addr2, size, offset, 0UL, 0);
 }
 #endif
@@ -58,6 +78,16 @@  static inline
 unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
 				 unsigned long offset)
 {
+	if (SMALL_CONST(size - 1)) {
+		unsigned long val;
+
+		if (unlikely(offset >= size))
+			return size;
+
+		val = *addr | ~GENMASK(size - 1, offset);
+		return val == ~0UL ? size : ffz(val);
+	}
+
 	return _find_next_bit(addr, NULL, size, offset, ~0UL, 0);
 }
 #endif