Message ID | 20160428164856.10120.qmail@ns.horizon.com (mailing list archive)
---|---
State | New, archived |
On Thu, Apr 28, 2016 at 6:48 PM, George Spelvin <linux@horizon.com> wrote:
> Another few comments:
>
> 1. Would ARCH_HAS_FAST_FFS involve fewer changes than CPU_NO_EFFICIENT_FFS?

No, as you want to _disable_ ARCH_HAS_FAST_FFS / _enable_
CPU_NO_EFFICIENT_FFS as soon as you're enabling support for a
CPU that doesn't support it.

Logical OR is easier in both the Kconfig and C preprocessor languages
than logical NAND.

E.g. in Kconfig, a CPU core not supporting it can just select
CPU_NO_EFFICIENT_FFS.

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds
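[The select pattern Geert describes could be sketched in Kconfig roughly as follows. CPU_NO_EFFICIENT_FFS is the symbol under discussion; the selecting core entry and help text are illustrative, not taken from the actual patch:]

```
config CPU_NO_EFFICIENT_FFS
	def_bool n

config M68000
	bool
	select CPU_NO_EFFICIENT_FFS
	help
	  A core without a fast bit-search instruction opts in to
	  the generic __ffs fallback by selecting
	  CPU_NO_EFFICIENT_FFS; the logical OR across all enabled
	  cores happens for free.
```

[With the inverted ARCH_HAS_FAST_FFS symbol, every core lacking the instruction would instead have to suppress a `select` made elsewhere, which Kconfig cannot express directly.]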
On Thu, Apr 28, 2016 at 07:51:06PM +0200, Geert Uytterhoeven wrote:
> On Thu, Apr 28, 2016 at 6:48 PM, George Spelvin <linux@horizon.com> wrote:
> > Another few comments:
> >
> > 1. Would ARCH_HAS_FAST_FFS involve fewer changes than CPU_NO_EFFICIENT_FFS?
>
> No, as you want to _disable_ ARCH_HAS_FAST_FFS / _enable_
> CPU_NO_EFFICIENT_FFS as soon as you're enabling support for a
> CPU that doesn't support it.
>
> Logical OR is easier in both the Kconfig and C preprocessor languages
> than logical NAND.
>
> E.g. in Kconfig, a CPU core not supporting it can just select
> CPU_NO_EFFICIENT_FFS.

How does a CPU lack an efficient ffs/ctz anyway? There are all sorts
of ways to implement it without a native insn, some of which are
almost or just as fast as the native insn on cpus that have the
latter. On anything with a fast multiply, the de Bruijn sequence
approach is near-optimal, and otherwise one of the binary-search type
approaches (possibly branchless) can be used. If the compiler doesn't
generate an appropriate one for __builtin_ctz, that's arguably a
compiler bug.

Rich
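[The binary-search approach Rich mentions can be sketched as below. This is a generic C illustration, not code from the patch; the function name is made up. Each step tests whether the low half of the remaining window is all zero and conditionally shifts; the conditions are simple enough that a compiler can often lower them to conditional moves:]

```c
#include <stdint.h>

/*
 * Binary-search count-trailing-zeros: narrow the window by halves.
 * Like __ffs, the result is undefined for x == 0.
 */
static inline unsigned bsearch_ctz32(uint32_t x)
{
	unsigned n = 0;

	if (!(x & 0x0000FFFF)) { n += 16; x >>= 16; }
	if (!(x & 0x000000FF)) { n += 8;  x >>= 8;  }
	if (!(x & 0x0000000F)) { n += 4;  x >>= 4;  }
	if (!(x & 0x00000003)) { n += 2;  x >>= 2;  }
	return n + !(x & 1);	/* final halving needs no shift */
}
```

[Five mask-tests and at most four shifts, independent of which bit is set; the cost is constant rather than proportional to the bit position.]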
On Thu, Apr 28, 2016 at 7:58 PM, Rich Felker <dalias@libc.org> wrote:
> On Thu, Apr 28, 2016 at 07:51:06PM +0200, Geert Uytterhoeven wrote:
>> On Thu, Apr 28, 2016 at 6:48 PM, George Spelvin <linux@horizon.com> wrote:
>> > Another few comments:
>> >
>> > 1. Would ARCH_HAS_FAST_FFS involve fewer changes than CPU_NO_EFFICIENT_FFS?
>>
>> No, as you want to _disable_ ARCH_HAS_FAST_FFS / _enable_
>> CPU_NO_EFFICIENT_FFS as soon as you're enabling support for a
>> CPU that doesn't support it.
>>
>> Logical OR is easier in both the Kconfig and C preprocessor languages
>> than logical NAND.
>>
>> E.g. in Kconfig, a CPU core not supporting it can just select
>> CPU_NO_EFFICIENT_FFS.
>
> How does a CPU lack an efficient ffs/ctz anyway? There are all sorts
> of ways to implement it without a native insn, some of which are
> almost or just as fast as the native insn on cpus that have the
> latter. On anything with a fast multiply, the de Bruijn sequence
> approach is near-optimal, and otherwise one of the binary-search type
> approaches (possibly branchless) can be used. If the compiler doesn't
> generate an appropriate one for __builtin_ctz, that's arguably a
> compiler bug.

m68k-linux-gcc 4.6.3 generates:

	jsr __ctzsi2

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds
> How does a CPU lack an efficient ffs/ctz anyway? There are all sorts
> of ways to implement it without a native insn, some of which are
> almost or just as fast as the native insn on cpus that have the
> latter. On anything with a fast multiply, the de Bruijn sequence
> approach is near-optimal, and otherwise one of the binary-search type
> approaches (possibly branchless) can be used. If the compiler doesn't
> generate an appropriate one for __builtin_ctz, that's arguably a
> compiler bug.

What's wanted here is something faster than any of those.

Yes, there's a simple constant-time branch-free implementation:

static inline unsigned __attribute__((const)) hweight32(uint32_t x)
{
	x -= (x >> 1) & 0x55555555;
	x = ((x >> 2) & 0x33333333) + (x & 0x33333333);
	x += x >> 4;
	x &= 0x0f0f0f0f;
	x += x >> 8;
	x += x >> 16;
	return x & 63;
}

static inline unsigned __attribute__((const)) __ffs32(uint32_t x)
{
	return hweight32(~x & (x - 1));
}

but if you work it through, that's about 19 instructions; a few more
on platforms without 32-bit immediates. The shift itself makes an even
20, and there are a lot of sequential dependencies (I count a 17-op
chain including the shift) limiting execution time.

The de Bruijn hack reduces the length but adds a memory access for the
table lookup. (http://supertech.csail.mit.edu/papers/debruijn.pdf)

In the GCD code, the number to normalize is basically random, so the
normalization loop shifts an average of 1 bit: one bit half the time,
a second bit 1/4 of the time, etc. (The posted code in the FAST_FFS
case omits one guaranteed shift at the end of the loop because the
normalization code is constant-time.)

So "fast __ffs" basically means faster than *one* iteration of
"while (!(x & 1)) x >>= 1;". In this case "fast" means cheaper than
*one* unpredictable branch, which is a very small handful of
instructions.
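[The de Bruijn hack referenced above can be sketched as follows. This is an illustration of the technique from the linked paper, not code from the patch; the function and table names are made up. `x & -x` isolates the lowest set bit, multiplying by the de Bruijn constant 0x077CB531 places a unique 5-bit pattern in the top bits for each possible bit position, and the table maps that pattern back to the position:]

```c
#include <stdint.h>

/* Lookup table for the 32-bit de Bruijn sequence 0x077CB531. */
static const unsigned char debruijn_ctz_table[32] = {
	 0,  1, 28,  2, 29, 14, 24,  3, 30, 22, 20, 15, 25, 17,  4,  8,
	31, 27, 13, 23, 21, 19, 16,  7, 26, 12, 18,  6, 11,  5, 10,  9
};

/* Count trailing zeros; undefined for x == 0, like __ffs. */
static inline unsigned debruijn_ctz32(uint32_t x)
{
	return debruijn_ctz_table[((x & -x) * 0x077CB531u) >> 27];
}
```

[One and, one negate, one multiply, one shift, and one load: shorter than the popcount chain above, but the table lookup is exactly the memory access George is pointing at.]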
> __ffs on the available architectures:
> Alpha: sometimes (CONFIG_ALPHA_EV6, CONFIG_ALPHA_EV67)
> ARC: sometimes (!CONFIG_ISA_ARCOMPACT)
> ARM: sometimes (V5+)
> ARM64: NO, could be written using RBIT and CLZ
> AVR: yes
> Blackfin: NO, could be written using hweight()
> C6x: yes
> CRIS: NO
> FR-V: yes
> H8300: NO
> Hexagon: yes
> IA64: yes
> M32R: NO
> M68k: sometimes
> MetaG: NO
> Microblaze: NO
> MIPS: sometimes
> MN10300: yes
> OpenRISC: NO
> PA-RISC: NO? Interesting code, but I think it's a net loss.
> PowerPC: yes
> S390: sometimes (CONFIG_HAVE_MARCH_Z9_109_FEATURES)
> Score: NO
> SH: NO
> SPARC: NO

SPARC: sparc64: YES, sparc32: NO

Patch needs to be updated to reflect this.

	Sam
diff --git a/arch/alpha/include/asm/bitops.h b/arch/alpha/include/asm/bitops.h
index 4bdfbd44..c9c307a8 100644
--- a/arch/alpha/include/asm/bitops.h
+++ b/arch/alpha/include/asm/bitops.h
@@ -333,6 +333,7 @@ static inline unsigned long ffz(unsigned long word)
 static inline unsigned long __ffs(unsigned long word)
 {
 #if defined(CONFIG_ALPHA_EV6) && defined(CONFIG_ALPHA_EV67)
+#define ARCH_HAS_FAST_FFS 1
 	/* Whee.  EV67 can calculate it directly.  */
 	return __kernel_cttz(word);
 #else