Message ID | 20160406005239.GA25081@flamenco (mailing list archive) |
---|---|
State | New, archived |
On 06/04/2016 02:52, Emilio G. Cota wrote:
> +static inline uint32_t tb_hash_func5(uint64_t a0, uint64_t b0, uint32_t e, int seed)

I would keep just this version and unconditionally zero-extend to
64 bits. The compiler is able to detect that the high 32 bits are zero,
drop the more expensive multiplications and constant-fold everything.

For example, if you write

    unsigned tb_hash_func(uint32_t phys_pc, uint32_t pc, int flags)
    {
        return tb_hash_func5(phys_pc, pc, flags, 1);
    }

and check the optimized code with -fdump-tree-optimized, you'll see that
the rotated v1, the rotated v3 and the 20 merge into a single constant,
1733907856.

Thanks,

Paolo

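For anyone who wants to reproduce the constant folding described above, here is a self-contained probe (a sketch, not part of the posted patch: the file name, the main() and the compile command are illustrative only). It copies h32_finish and tb_hash_func5 from the patch, supplies the standard xxHash32 primes and rotate, and adds the 32-bit wrapper suggested above; compiling with -O2 -fdump-tree-optimized and reading the generated .optimized dump shows the folded constants.

    /* tb-hash-probe.c -- standalone sketch for inspecting the folded constants.
     * Build with e.g.:  gcc -O2 -fdump-tree-optimized -c tb-hash-probe.c
     * then read the generated tb-hash-probe.c.*.optimized dump.
     */
    #include <stdint.h>
    #include <stdio.h>

    /* Public xxHash32 primes and rotate (the same values QEMU's copy uses). */
    #define PRIME32_1 2654435761U
    #define PRIME32_2 2246822519U
    #define PRIME32_3 3266489917U
    #define PRIME32_4  668265263U
    #define PRIME32_5  374761393U
    #define XXH_rotl32(x, r) (((x) << (r)) | ((x) >> (32 - (r))))

    static inline uint32_t h32_finish(uint32_t h32)
    {
        h32 ^= h32 >> 15;
        h32 *= PRIME32_2;
        h32 ^= h32 >> 13;
        h32 *= PRIME32_3;
        h32 ^= h32 >> 16;
        return h32;
    }

    /* Copied from the patch under discussion. */
    static inline uint32_t tb_hash_func5(uint64_t a0, uint64_t b0, uint32_t e, int seed)
    {
        uint32_t v1 = seed + PRIME32_1 + PRIME32_2;
        uint32_t v2 = seed + PRIME32_2;
        uint32_t v3 = seed + 0;
        uint32_t v4 = seed - PRIME32_1;
        uint32_t a = a0 >> 31 >> 1;
        uint32_t b = a0;
        uint32_t c = b0 >> 31 >> 1;
        uint32_t d = b0;
        uint32_t h32;

        v1 += a * PRIME32_2;
        v1 = XXH_rotl32(v1, 13);
        v1 *= PRIME32_1;

        v2 += b * PRIME32_2;
        v2 = XXH_rotl32(v2, 13);
        v2 *= PRIME32_1;

        v3 += c * PRIME32_2;
        v3 = XXH_rotl32(v3, 13);
        v3 *= PRIME32_1;

        v4 += d * PRIME32_2;
        v4 = XXH_rotl32(v4, 13);
        v4 *= PRIME32_1;

        h32 = XXH_rotl32(v1, 1) + XXH_rotl32(v2, 7) + XXH_rotl32(v3, 12) +
              XXH_rotl32(v4, 18);
        h32 += 20;

        h32 += e * PRIME32_3;
        h32 = XXH_rotl32(h32, 17) * PRIME32_4;

        return h32_finish(h32);
    }

    /* The 32-bit wrapper suggested above: the zero-extension lets the compiler
     * see the high halves are zero and fold the rotated v1/v3 into constants. */
    unsigned tb_hash_func(uint32_t phys_pc, uint32_t pc, int flags)
    {
        return tb_hash_func5(phys_pc, pc, flags, 1);
    }

    int main(void)
    {
        /* Arbitrary example inputs, only so the object file has a user. */
        printf("%u\n", tb_hash_func(0x40001000u, 0x40001000u, 0));
        return 0;
    }
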
On Wed, Apr 06, 2016 at 13:52:21 +0200, Paolo Bonzini wrote:
> On 06/04/2016 02:52, Emilio G. Cota wrote:
> > +static inline uint32_t tb_hash_func5(uint64_t a0, uint64_t b0, uint32_t e, int seed)
>
> I would keep just this version and unconditionally zero-extend to
> 64-bits. The compiler is able to detect the high 32 bits are zero, drop
> the more expensive multiplications and constant fold everything.
>
> For example if you write
>
>     unsigned tb_hash_func(uint32_t phys_pc, uint32_t pc, int flags)
>     {
>         return tb_hash_func5(phys_pc, pc, flags, 1);
>     }
>
> and check the optimized code with -fdump-tree-optimized you'll see that
> the rotated v1, the rotated v3 and the 20 merge into a single constant
> 1733907856.

I like this idea, because the ugliness of the sizeof checks is significant.

However, the quality of the resulting hash is not as good when always using
func5. For instance, when we'd otherwise use func3, two fifths of every input
contain exactly the same bits: all 0's. This inevitably leads to more
collisions. Performance (for the debian arm bootup test) gets up to 15% worse.

Emilio

On 06/04/2016 19:44, Emilio G. Cota wrote:
> I like this idea, because the ugliness of the sizeof checks is significant.
> However, the quality of the resulting hash is not as good when always using func5.
> For instance, when we'd otherwise use func3, two fifths of every input contain
> exactly the same bits: all 0's. This inevitably leads to more collisions.

It doesn't necessarily lead to more collisions. For a completely stupid hash
function "a+b+c+d+e", for example, adding zeros doesn't add more collisions.

What if you rearrange the five words so that the "all 0" parts of the input
are all at the beginning, or all at the end? Perhaps the problem is that the
odd words at the end are hashed less effectively.

Perhaps better is to always use a three-word xxhash, but pick the 64-bit
version if any of phys_pc and pc are 64-bits. The unrolling would be very
effective, and the performance penalty not too important (64-bit on 32-bit
is very slow anyway).

If you can fix the problems with the collisions, it also means that you get
good performance running 32-bit guests on qemu-system-x86_64 or
qemu-system-aarch64. So, if you can do it, it's a very desirable property.

Paolo

On 04/06/2016 11:23 AM, Paolo Bonzini wrote:
> Perhaps better is to always use a three-word xxhash, but pick the 64-bit
> version if any of phys_pc and pc are 64-bits. The unrolling would be
> very effective, and the performance penalty not too important (64-bit on
> 32-bit is very slow anyway).

True.

It would also be interesting to put an average bucket chain length into
the "info jit" dump, so that it's easy to see how the hashing is performing
for a given guest.

r~

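A rough sketch of the statistic suggested above (not QEMU code: the bucket array, entry type and names here are hypothetical, and only show the shape of an average-chain-length report that an "info jit" line could print):

    #include <stddef.h>
    #include <stdio.h>

    /* Hypothetical chained hash table: tb_buckets[] heads, linked via ->next. */
    struct tb_entry {
        struct tb_entry *next;
        /* ... payload ... */
    };

    #define TB_HASH_SIZE 32768            /* assumed table size, for illustration */
    static struct tb_entry *tb_buckets[TB_HASH_SIZE];

    /* Walk every bucket and report the average chain length over non-empty
     * buckets, which is what a typical lookup actually pays for. */
    static void tb_dump_chain_stats(FILE *out)
    {
        size_t used = 0, entries = 0, longest = 0;

        for (size_t i = 0; i < TB_HASH_SIZE; i++) {
            size_t len = 0;

            for (struct tb_entry *e = tb_buckets[i]; e != NULL; e = e->next) {
                len++;
            }
            if (len) {
                used++;
                entries += len;
                if (len > longest) {
                    longest = len;
                }
            }
        }
        fprintf(out, "TB hash: %zu entries, %zu/%d buckets used, "
                "avg chain %.2f, longest %zu\n",
                entries, used, TB_HASH_SIZE,
                used ? (double)entries / used : 0.0, longest);
    }
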
On Wed, Apr 06, 2016 at 20:23:42 +0200, Paolo Bonzini wrote:
> On 06/04/2016 19:44, Emilio G. Cota wrote:
> > I like this idea, because the ugliness of the sizeof checks is significant.
> > However, the quality of the resulting hash is not as good when always using func5.
> > For instance, when we'd otherwise use func3, two fifths of every input contain
> > exactly the same bits: all 0's. This inevitably leads to more collisions.

I take this back. I don't know anymore what I measured earlier today--it's
been a long day and I was juggling quite a few things.

I essentially see the same chain lengths (within 0.2%) for either function,
i.e. func3 or func5 with the padded 0's, when running arm-softmmu. So this
is good news :>

> Perhaps better is to always use a three-word xxhash, but pick the 64-bit
> version if any of phys_pc and pc are 64-bits. The unrolling would be
> very effective, and the performance penalty not too important (64-bit on
> 32-bit is very slow anyway).

By "the 64-bit version" you mean what I called func5? That is:

    if (sizeof(phys_pc) == sizeof(uint64_t) || sizeof(pc) == sizeof(uint64_t))
        return tb_hash_func5();
    return tb_hash_func3();

or do you mean xxhash64 (which I did not include in my patchset)?

My tests with xxhash64 suggest that the quality of the results does not
improve over xxhash32, and the computation takes longer (it's more
instructions); not much, but measurable.

So we should probably just go with func5 always, as you suggested initially.
If so, I'm ready to send a v2.

Thanks,

Emilio

On 07/04/2016 02:37, Emilio G. Cota wrote:
> I take this back. I don't know anymore what I measured earlier today--it's
> been a long day and I was juggling quite a few things.
>
> I essentially see the same chain lengths (within 0.2%) for either function,
> i.e. func3 or func5 with the padded 0's, when running arm-softmmu. So this
> is good news :>

It's also much more reasonable for a good hash function. :)

>> Perhaps better is to always use a three-word xxhash, but pick the 64-bit
>> version if any of phys_pc and pc are 64-bits. The unrolling would be
>> very effective, and the performance penalty not too important (64-bit on
>> 32-bit is very slow anyway).
>
> By "the 64-bit version" you mean what I called func5? That is:
>
>     if (sizeof(phys_pc) == sizeof(uint64_t) || sizeof(pc) == sizeof(uint64_t))
>         return tb_hash_func5();
>     return tb_hash_func3();
>
> or do you mean xxhash64 (which I did not include in my patchset)?

I meant xxhash64, but using func5 unconditionally is by far my preferred
choice.

Paolo

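For reference, a minimal sketch of what the unconditional variant converged on here might look like in the header, assuming the tb_hash_func5 and the types from the patch below (this is not the posted code):

    static inline
    uint32_t tb_hash_func(tb_page_addr_t phys_pc, target_ulong pc, int flags)
    {
        /* Always zero-extend to 64 bits; on 32-bit targets the compiler can
         * see the high halves are zero and fold the extra work away. */
        return tb_hash_func5(phys_pc, pc, flags, 1);
    }
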
diff --git a/include/exec/tb-hash.h b/include/exec/tb-hash.h
index 6b97a7c..349a856 100644
--- a/include/exec/tb-hash.h
+++ b/include/exec/tb-hash.h
@@ -45,19 +45,124 @@ static inline unsigned int tb_jmp_cache_hash_func(target_ulong pc)
             | (tmp & TB_JMP_ADDR_MASK));
 }
 
-static inline
-uint32_t tb_hash_func(tb_page_addr_t phys_pc, target_ulong pc, int flags)
+static inline uint32_t h32_finish(uint32_t h32)
 {
-    struct {
-        tb_page_addr_t phys_pc;
-        target_ulong pc;
-        int flags;
-    } QEMU_PACKED k;
-
-    k.phys_pc = phys_pc;
-    k.pc = pc;
-    k.flags = flags;
-    return qemu_xxh32((uint32_t *)&k, sizeof(k) / sizeof(uint32_t), 1);
+    h32 ^= h32 >> 15;
+    h32 *= PRIME32_2;
+    h32 ^= h32 >> 13;
+    h32 *= PRIME32_3;
+    h32 ^= h32 >> 16;
+
+    return h32;
+}
+
+static inline uint32_t tb_hash_func3(uint32_t a, uint32_t b, uint32_t c, int seed)
+{
+    uint32_t h32 = seed + PRIME32_5;
+
+    h32 += 12;
+
+    h32 += a * PRIME32_3;
+    h32 = XXH_rotl32(h32, 17) * PRIME32_4;
+
+    h32 += b * PRIME32_3;
+    h32 = XXH_rotl32(h32, 17) * PRIME32_4;
+
+    h32 += c * PRIME32_3;
+    h32 = XXH_rotl32(h32, 17) * PRIME32_4;
+
+    return h32_finish(h32);
+}
+
+static inline uint32_t tb_hash_func4(uint64_t a0, uint32_t c, uint32_t d, int seed)
+{
+    uint32_t v1 = seed + PRIME32_1 + PRIME32_2;
+    uint32_t v2 = seed + PRIME32_2;
+    uint32_t v3 = seed + 0;
+    uint32_t v4 = seed - PRIME32_1;
+    uint32_t a = a0 >> 31 >> 1;
+    uint32_t b = a0;
+    uint32_t h32;
+
+    v1 += a * PRIME32_2;
+    v1 = XXH_rotl32(v1, 13);
+    v1 *= PRIME32_1;
+
+    v2 += b * PRIME32_2;
+    v2 = XXH_rotl32(v2, 13);
+    v2 *= PRIME32_1;
+
+    v3 += c * PRIME32_2;
+    v3 = XXH_rotl32(v3, 13);
+    v3 *= PRIME32_1;
+
+    v4 += d * PRIME32_2;
+    v4 = XXH_rotl32(v4, 13);
+    v4 *= PRIME32_1;
+
+    h32 = XXH_rotl32(v1, 1) + XXH_rotl32(v2, 7) + XXH_rotl32(v3, 12) +
+          XXH_rotl32(v4, 18);
+    h32 += 16;
+
+    return h32_finish(h32);
+}
+
+static inline uint32_t tb_hash_func5(uint64_t a0, uint64_t b0, uint32_t e, int seed)
+{
+    uint32_t v1 = seed + PRIME32_1 + PRIME32_2;
+    uint32_t v2 = seed + PRIME32_2;
+    uint32_t v3 = seed + 0;
+    uint32_t v4 = seed - PRIME32_1;
+    uint32_t a = a0 >> 31 >> 1;
+    uint32_t b = a0;
+    uint32_t c = b0 >> 31 >> 1;
+    uint32_t d = b0;
+    uint32_t h32;
+
+    v1 += a * PRIME32_2;
+    v1 = XXH_rotl32(v1, 13);
+    v1 *= PRIME32_1;
+
+    v2 += b * PRIME32_2;
+    v2 = XXH_rotl32(v2, 13);
+    v2 *= PRIME32_1;
+
+    v3 += c * PRIME32_2;
+    v3 = XXH_rotl32(v3, 13);
+    v3 *= PRIME32_1;
+
+    v4 += d * PRIME32_2;
+    v4 = XXH_rotl32(v4, 13);
+    v4 *= PRIME32_1;
+
+    h32 = XXH_rotl32(v1, 1) + XXH_rotl32(v2, 7) + XXH_rotl32(v3, 12) +
+          XXH_rotl32(v4, 18);
+    h32 += 20;
+
+    h32 += e * PRIME32_3;
+    h32 = XXH_rotl32(h32, 17) * PRIME32_4;
+
+    return h32_finish(h32);
+}
+
+static __attribute__((noinline))
+unsigned tb_hash_func(tb_page_addr_t phys_pc, target_ulong pc, int flags)
+{
+#if TARGET_LONG_BITS == 64
+
+    if (sizeof(phys_pc) == sizeof(pc)) {
+        return tb_hash_func5(phys_pc, pc, flags, 1);
+    }
+    return tb_hash_func4(pc, phys_pc, flags, 1);
+
+#else /* 32-bit target */
+
+    if (sizeof(phys_pc) > sizeof(pc)) {
+        return tb_hash_func4(phys_pc, pc, flags, 1);
+    }
+    return tb_hash_func3(pc, phys_pc, flags, 1);
+
+#endif /* TARGET_LONG_BITS */
 }
 
 #endif
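As a purely illustrative usage note (not part of the patch; the macro name and size below are assumptions): the 32-bit hash returned by tb_hash_func would typically be masked down to an index into a power-of-two-sized TB hash table, e.g.:

    #include <stdint.h>

    /* Hypothetical names/sizes, only for illustration. */
    #define TB_PHYS_HASH_BITS 15
    #define TB_PHYS_HASH_SIZE (1u << TB_PHYS_HASH_BITS)

    static inline uint32_t tb_phys_hash_bucket(uint32_t h32)
    {
        return h32 & (TB_PHYS_HASH_SIZE - 1);   /* keep the low TB_PHYS_HASH_BITS bits */
    }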