Message ID | 20160406005239.GA25081@flamenco (mailing list archive) |
---|---|
State | New, archived |
On 06/04/2016 02:52, Emilio G. Cota wrote:
> +static inline uint32_t tb_hash_func5(uint64_t a0, uint64_t b0, uint32_t e, int seed)

I would keep just this version and unconditionally zero-extend to
64 bits. The compiler is able to detect that the high 32 bits are zero,
drop the more expensive multiplications and constant-fold everything.

For example, if you write

    unsigned tb_hash_func(uint32_t phys_pc, uint32_t pc, int flags)
    {
        return tb_hash_func5(phys_pc, pc, flags, 1);
    }

and check the optimized code with -fdump-tree-optimized, you'll see that
the rotated v1, the rotated v3 and the 20 merge into a single constant,
1733907856.

Thanks,

Paolo

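For anyone who wants to reproduce the constant folding described above, here is a self-contained probe (a sketch, not part of the posted patch: the file name, the main() and the compile command are illustrative only). It copies h32_finish and tb_hash_func5 from the patch, supplies the standard xxHash32 primes and rotate, and adds the 32-bit wrapper suggested above; compiling with -O2 -fdump-tree-optimized and reading the generated .optimized dump shows the folded constants.

    /* tb-hash-probe.c -- standalone sketch for inspecting the folded constants.
     * Build with e.g.:  gcc -O2 -fdump-tree-optimized -c tb-hash-probe.c
     * then read the generated tb-hash-probe.c.*.optimized dump.
     */
    #include <stdint.h>
    #include <stdio.h>

    /* Public xxHash32 primes and rotate (the same values QEMU's copy uses). */
    #define PRIME32_1 2654435761U
    #define PRIME32_2 2246822519U
    #define PRIME32_3 3266489917U
    #define PRIME32_4  668265263U
    #define PRIME32_5  374761393U
    #define XXH_rotl32(x, r) (((x) << (r)) | ((x) >> (32 - (r))))

    static inline uint32_t h32_finish(uint32_t h32)
    {
        h32 ^= h32 >> 15;
        h32 *= PRIME32_2;
        h32 ^= h32 >> 13;
        h32 *= PRIME32_3;
        h32 ^= h32 >> 16;
        return h32;
    }

    /* Copied from the patch under discussion. */
    static inline uint32_t tb_hash_func5(uint64_t a0, uint64_t b0, uint32_t e, int seed)
    {
        uint32_t v1 = seed + PRIME32_1 + PRIME32_2;
        uint32_t v2 = seed + PRIME32_2;
        uint32_t v3 = seed + 0;
        uint32_t v4 = seed - PRIME32_1;
        uint32_t a = a0 >> 31 >> 1;
        uint32_t b = a0;
        uint32_t c = b0 >> 31 >> 1;
        uint32_t d = b0;
        uint32_t h32;

        v1 += a * PRIME32_2;
        v1 = XXH_rotl32(v1, 13);
        v1 *= PRIME32_1;

        v2 += b * PRIME32_2;
        v2 = XXH_rotl32(v2, 13);
        v2 *= PRIME32_1;

        v3 += c * PRIME32_2;
        v3 = XXH_rotl32(v3, 13);
        v3 *= PRIME32_1;

        v4 += d * PRIME32_2;
        v4 = XXH_rotl32(v4, 13);
        v4 *= PRIME32_1;

        h32 = XXH_rotl32(v1, 1) + XXH_rotl32(v2, 7) + XXH_rotl32(v3, 12) +
              XXH_rotl32(v4, 18);
        h32 += 20;

        h32 += e * PRIME32_3;
        h32 = XXH_rotl32(h32, 17) * PRIME32_4;

        return h32_finish(h32);
    }

    /* The 32-bit wrapper suggested above: the zero-extension lets the compiler
     * see the high halves are zero and fold the rotated v1/v3 into constants. */
    unsigned tb_hash_func(uint32_t phys_pc, uint32_t pc, int flags)
    {
        return tb_hash_func5(phys_pc, pc, flags, 1);
    }

    int main(void)
    {
        /* Arbitrary example inputs, only so the object file has a user. */
        printf("%u\n", tb_hash_func(0x40001000u, 0x40001000u, 0));
        return 0;
    }
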
On Wed, Apr 06, 2016 at 13:52:21 +0200, Paolo Bonzini wrote:
> On 06/04/2016 02:52, Emilio G. Cota wrote:
> > +static inline uint32_t tb_hash_func5(uint64_t a0, uint64_t b0, uint32_t e, int seed)
>
> I would keep just this version and unconditionally zero-extend to
> 64-bits. The compiler is able to detect the high 32 bits are zero, drop
> the more expensive multiplications and constant fold everything.
>
> For example if you write
>
>     unsigned tb_hash_func(uint32_t phys_pc, uint32_t pc, int flags)
>     {
>         return tb_hash_func5(phys_pc, pc, flags, 1);
>     }
>
> and check the optimized code with -fdump-tree-optimized you'll see that
> the rotated v1, the rotated v3 and the 20 merge into a single constant
> 1733907856.

I like this idea, because the ugliness of the sizeof checks is significant.

However, the quality of the resulting hash is not as good when always using
func5. For instance, when we'd otherwise use func3, two fifths of every input
contain exactly the same bits: all 0's. This inevitably leads to more
collisions. Performance (for the debian arm bootup test) gets up to 15% worse.

Emilio

On 06/04/2016 19:44, Emilio G. Cota wrote:
> I like this idea, because the ugliness of the sizeof checks is significant.
> However, the quality of the resulting hash is not as good when always using func5.
> For instance, when we'd otherwise use func3, two fifths of every input contain
> exactly the same bits: all 0's. This inevitably leads to more collisions.

It doesn't necessarily lead to more collisions. For a completely stupid hash
function "a+b+c+d+e", for example, adding zeros doesn't add more collisions.

What if you rearrange the five words so that the "all 0" parts of the input
are all at the beginning, or all at the end? Perhaps the problem is that the
odd words at the end are hashed less effectively.

Perhaps better is to always use a three-word xxhash, but pick the 64-bit
version if any of phys_pc and pc are 64-bits. The unrolling would be very
effective, and the performance penalty not too important (64-bit on 32-bit
is very slow anyway).

If you can fix the problems with the collisions, it also means that you get
good performance running 32-bit guests on qemu-system-x86_64 or
qemu-system-aarch64. So, if you can do it, it's a very desirable property.

Paolo

On 04/06/2016 11:23 AM, Paolo Bonzini wrote:
> Perhaps better is to always use a three-word xxhash, but pick the 64-bit
> version if any of phys_pc and pc are 64-bits. The unrolling would be
> very effective, and the performance penalty not too important (64-bit on
> 32-bit is very slow anyway).

True.

It would also be interesting to put an average bucket chain length into
the "info jit" dump, so that it's easy to see how the hashing is performing
for a given guest.

r~

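A rough sketch of the statistic suggested above (not QEMU code: the bucket array, entry type and names here are hypothetical, and only show the shape of an average-chain-length report that an "info jit" line could print):

    #include <stddef.h>
    #include <stdio.h>

    /* Hypothetical chained hash table: tb_buckets[] heads, linked via ->next. */
    struct tb_entry {
        struct tb_entry *next;
        /* ... payload ... */
    };

    #define TB_HASH_SIZE 32768            /* assumed table size, for illustration */
    static struct tb_entry *tb_buckets[TB_HASH_SIZE];

    /* Walk every bucket and report the average chain length over non-empty
     * buckets, which is what a typical lookup actually pays for. */
    static void tb_dump_chain_stats(FILE *out)
    {
        size_t used = 0, entries = 0, longest = 0;

        for (size_t i = 0; i < TB_HASH_SIZE; i++) {
            size_t len = 0;

            for (struct tb_entry *e = tb_buckets[i]; e != NULL; e = e->next) {
                len++;
            }
            if (len) {
                used++;
                entries += len;
                if (len > longest) {
                    longest = len;
                }
            }
        }
        fprintf(out, "TB hash: %zu entries, %zu/%d buckets used, "
                "avg chain %.2f, longest %zu\n",
                entries, used, TB_HASH_SIZE,
                used ? (double)entries / used : 0.0, longest);
    }
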
On Wed, Apr 06, 2016 at 20:23:42 +0200, Paolo Bonzini wrote:
> On 06/04/2016 19:44, Emilio G. Cota wrote:
> > I like this idea, because the ugliness of the sizeof checks is significant.
> > However, the quality of the resulting hash is not as good when always using func5.
> > For instance, when we'd otherwise use func3, two fifths of every input contain
> > exactly the same bits: all 0's. This inevitably leads to more collisions.

I take this back. I don't know anymore what I measured earlier today--it's
been a long day and I was juggling quite a few things.

I essentially see the same chain lengths (within 0.2%) for either function,
i.e. func3 or func5 with the padded 0's, when running arm-softmmu. So this
is good news :>

> Perhaps better is to always use a three-word xxhash, but pick the 64-bit
> version if any of phys_pc and pc are 64-bits. The unrolling would be
> very effective, and the performance penalty not too important (64-bit on
> 32-bit is very slow anyway).

By "the 64-bit version" you mean what I called func5? That is:

    if (sizeof(phys_pc) == sizeof(uint64_t) || sizeof(pc) == sizeof(uint64_t))
        return tb_hash_func5();
    return tb_hash_func3();

or do you mean xxhash64 (which I did not include in my patchset)?

My tests with xxhash64 suggest that the quality of the results does not
improve over xxhash32, and the computation takes longer (it's more
instructions); not much, but measurable.

So we should probably just go with func5 always, as you suggested initially.
If so, I'm ready to send a v2.

Thanks,

Emilio

On 07/04/2016 02:37, Emilio G. Cota wrote:
> I take this back. I don't know anymore what I measured earlier today--it's
> been a long day and I was juggling quite a few things.
>
> I essentially see the same chain lengths (within 0.2%) for either function,
> i.e. func3 or func5 with the padded 0's, when running arm-softmmu. So this
> is good news :>

It's also much more reasonable for a good hash function. :)

>> Perhaps better is to always use a three-word xxhash, but pick the 64-bit
>> version if any of phys_pc and pc are 64-bits. The unrolling would be
>> very effective, and the performance penalty not too important (64-bit on
>> 32-bit is very slow anyway).
>
> By "the 64-bit version" you mean what I called func5? That is:
>
>     if (sizeof(phys_pc) == sizeof(uint64_t) || sizeof(pc) == sizeof(uint64_t))
>         return tb_hash_func5();
>     return tb_hash_func3();
>
> or do you mean xxhash64 (which I did not include in my patchset)?

I meant xxhash64, but using func5 unconditionally is by far my preferred
choice.

Paolo

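For reference, a minimal sketch of what the unconditional variant converged on here might look like in the header, assuming the tb_hash_func5 and the types from the patch below (this is not the posted code):

    static inline
    uint32_t tb_hash_func(tb_page_addr_t phys_pc, target_ulong pc, int flags)
    {
        /* Always zero-extend to 64 bits; on 32-bit targets the compiler can
         * see the high halves are zero and fold the extra work away. */
        return tb_hash_func5(phys_pc, pc, flags, 1);
    }
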
diff --git a/include/exec/tb-hash.h b/include/exec/tb-hash.h
index 6b97a7c..349a856 100644
--- a/include/exec/tb-hash.h
+++ b/include/exec/tb-hash.h
@@ -45,19 +45,124 @@ static inline unsigned int tb_jmp_cache_hash_func(target_ulong pc)
             | (tmp & TB_JMP_ADDR_MASK));
 }
 
-static inline
-uint32_t tb_hash_func(tb_page_addr_t phys_pc, target_ulong pc, int flags)
+static inline uint32_t h32_finish(uint32_t h32)
 {
-    struct {
-        tb_page_addr_t phys_pc;
-        target_ulong pc;
-        int flags;
-    } QEMU_PACKED k;
-
-    k.phys_pc = phys_pc;
-    k.pc = pc;
-    k.flags = flags;
-    return qemu_xxh32((uint32_t *)&k, sizeof(k) / sizeof(uint32_t), 1);
+    h32 ^= h32 >> 15;
+    h32 *= PRIME32_2;
+    h32 ^= h32 >> 13;
+    h32 *= PRIME32_3;
+    h32 ^= h32 >> 16;
+
+    return h32;
+}
+
+static inline uint32_t tb_hash_func3(uint32_t a, uint32_t b, uint32_t c, int seed)
+{
+    uint32_t h32 = seed + PRIME32_5;
+
+    h32 += 12;
+
+    h32 += a * PRIME32_3;
+    h32 = XXH_rotl32(h32, 17) * PRIME32_4;
+
+    h32 += b * PRIME32_3;
+    h32 = XXH_rotl32(h32, 17) * PRIME32_4;
+
+    h32 += c * PRIME32_3;
+    h32 = XXH_rotl32(h32, 17) * PRIME32_4;
+
+    return h32_finish(h32);
+}
+
+static inline uint32_t tb_hash_func4(uint64_t a0, uint32_t c, uint32_t d, int seed)
+{
+    uint32_t v1 = seed + PRIME32_1 + PRIME32_2;
+    uint32_t v2 = seed + PRIME32_2;
+    uint32_t v3 = seed + 0;
+    uint32_t v4 = seed - PRIME32_1;
+    uint32_t a = a0 >> 31 >> 1;
+    uint32_t b = a0;
+    uint32_t h32;
+
+    v1 += a * PRIME32_2;
+    v1 = XXH_rotl32(v1, 13);
+    v1 *= PRIME32_1;
+
+    v2 += b * PRIME32_2;
+    v2 = XXH_rotl32(v2, 13);
+    v2 *= PRIME32_1;
+
+    v3 += c * PRIME32_2;
+    v3 = XXH_rotl32(v3, 13);
+    v3 *= PRIME32_1;
+
+    v4 += d * PRIME32_2;
+    v4 = XXH_rotl32(v4, 13);
+    v4 *= PRIME32_1;
+
+    h32 = XXH_rotl32(v1, 1) + XXH_rotl32(v2, 7) + XXH_rotl32(v3, 12) +
+          XXH_rotl32(v4, 18);
+    h32 += 16;
+
+    return h32_finish(h32);
+}
+
+static inline uint32_t tb_hash_func5(uint64_t a0, uint64_t b0, uint32_t e, int seed)
+{
+    uint32_t v1 = seed + PRIME32_1 + PRIME32_2;
+    uint32_t v2 = seed + PRIME32_2;
+    uint32_t v3 = seed + 0;
+    uint32_t v4 = seed - PRIME32_1;
+    uint32_t a = a0 >> 31 >> 1;
+    uint32_t b = a0;
+    uint32_t c = b0 >> 31 >> 1;
+    uint32_t d = b0;
+    uint32_t h32;
+
+    v1 += a * PRIME32_2;
+    v1 = XXH_rotl32(v1, 13);
+    v1 *= PRIME32_1;
+
+    v2 += b * PRIME32_2;
+    v2 = XXH_rotl32(v2, 13);
+    v2 *= PRIME32_1;
+
+    v3 += c * PRIME32_2;
+    v3 = XXH_rotl32(v3, 13);
+    v3 *= PRIME32_1;
+
+    v4 += d * PRIME32_2;
+    v4 = XXH_rotl32(v4, 13);
+    v4 *= PRIME32_1;
+
+    h32 = XXH_rotl32(v1, 1) + XXH_rotl32(v2, 7) + XXH_rotl32(v3, 12) +
+          XXH_rotl32(v4, 18);
+    h32 += 20;
+
+    h32 += e * PRIME32_3;
+    h32 = XXH_rotl32(h32, 17) * PRIME32_4;
+
+    return h32_finish(h32);
+}
+
+static __attribute__((noinline))
+unsigned tb_hash_func(tb_page_addr_t phys_pc, target_ulong pc, int flags)
+{
+#if TARGET_LONG_BITS == 64
+
+    if (sizeof(phys_pc) == sizeof(pc)) {
+        return tb_hash_func5(phys_pc, pc, flags, 1);
+    }
+    return tb_hash_func4(pc, phys_pc, flags, 1);
+
+#else /* 32-bit target */
+
+    if (sizeof(phys_pc) > sizeof(pc)) {
+        return tb_hash_func4(phys_pc, pc, flags, 1);
+    }
+    return tb_hash_func3(pc, phys_pc, flags, 1);
+
+#endif /* TARGET_LONG_BITS */
 }
 
 #endif
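As a purely illustrative usage note (not part of the patch; the macro name and size below are assumptions): the 32-bit hash returned by tb_hash_func would typically be masked down to an index into a power-of-two-sized TB hash table, e.g.:

    #include <stdint.h>

    /* Hypothetical names/sizes, only for illustration. */
    #define TB_PHYS_HASH_BITS 15
    #define TB_PHYS_HASH_SIZE (1u << TB_PHYS_HASH_BITS)

    static inline uint32_t tb_phys_hash_bucket(uint32_t h32)
    {
        return h32 & (TB_PHYS_HASH_SIZE - 1);   /* keep the low TB_PHYS_HASH_BITS bits */
    }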