diff mbox series

fix: platform accordance while calculating murmur3

Message ID pull.1636.git.git.1704376606625.gitgitgadget@gmail.com (mailing list archive)
State New, archived
Headers show
Series fix: platform accordance while calculating murmur3 | expand

Commit Message

Chen Xuewei Jan. 4, 2024, 1:56 p.m. UTC
From: Chen Xuewei <316403398@qq.com>

It is known that whether the highest bit is extended when char cast to
uint32, depends on CPU architecture, which will lead different hash
value. This is a fix to accord all architecture behaviour.

Signed-off-by: Chen Xuewei <316403398@qq.com>
---
    fix: platform accordance while calculating murmur3
    
    
    Short Description
    =================
    
    fix: platform accordance while calculating murmur3
    
    It is known that whether the highest bit is extended when char cast to
    uint32, depends on CPU architecture, which will lead different hash
    value. This is a fix to accord all architecture behaviour.
    
    
    Problem backgroud:
    ==================
    
    when using git log --max-count=1 <commit> -- <path> in an mixed cpu
    cluster environment both arm and x86 in a cluster as a service, where
    the <path> character is chinese or some other character that the highest
    bit of char is 1. all machines share the same repo disk. It happened
    that sometimes you can get the searched file among commit, sometimes you
    cannot.
    
    
    Conditions
    ==========
    
     1. file path include chinese characters or other characters that the
        highest bit is 1.
     2. mixed cpu architecture as a git cluster service
    
    
    Reason
    ======
    
    when you have over 2 machines (both arm and x86 are included at least
    one) as a git server cluster. once you open the commit-graph's
    bloom_filter feature. The bloom filter stores the file path as hash
    values using the murmur3 function. suppose the arm take it this time,
    then the char's highest bit is not extended. for example, on arm,
    char(11100110) to uint32(00000000 00000000 00000000 11100110) on x86,
    char(11100110) to uint32(11111111 11111111 11111111 11100110) then
    according to the murmur3 function that git currently use, the calculated
    hash value will be different. If the value was calculated through the
    same cpu architure machine, then it is ok. however, sometimes the hash
    value is calculated through a different cpu architure machine, then you
    cannot get the searched file. for example, bloom_filter's hash set is
    calculated through arm, and query through x86. So the hash value is
    incorrect, then missed the searched file.
    
    
    Solution
    ========
    
    No matter what the highest 24 bits will be when char cast to uint32, the
    murmur3 function only cares about the char part , which is only the
    lowest 8 bits, so we can use & 0xFF(11111111) to the casted uint32 value
    to choose only the lowest 8 bits.
    
    
    Others
    ======
    
    after fixed the bug, the historical bloom_filter data stored in
    commit-graph need to be updated. because the path's hash value is
    already calculated through a bad way. so we need to update it. this need
    to be done in repository

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-1636%2Fcdegree%2Fmaster-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-1636/cdegree/master-v1
Pull-Request: https://github.com/git/git/pull/1636

 bloom.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)


base-commit: a26002b62827b89a19b1084bd75d9371d565d03c

Comments

Taylor Blau Jan. 4, 2024, 2:53 p.m. UTC | #1
Hi Chen,

On Thu, Jan 04, 2024 at 01:56:46PM +0000, Chen Xuewei via GitGitGadget wrote:
> From: Chen Xuewei <316403398@qq.com>
>
> It is known that whether the highest bit is extended when char cast to
> uint32, depends on CPU architecture, which will lead different hash
> value. This is a fix to accord all architecture behaviour.

Thanks for your patch. A similar fix is being pursued in [1], part of
which includes [2], which I believe is functionally equivalent to your
patch here.

>     Others
>     ======
>
>     after fixed the bug, the historical bloom_filter data stored in
>     commit-graph need to be updated. because the path's hash value is
>     already calculated through a bad way. so we need to update it. this need
>     to be done in repository

We would not want to impose that burden on all users upon upgrading to
the latest Git version. In [1] we are perusing an approach where:

  - The Bloom data is stored with a version identifier, meaning that we
    can still use the existing/non-murmur3 Bloom filters after
    upgrading.

  - When the user decides to upgrade from v1 -> v2 Bloom filters, we
    reuse the existing Bloom filter data when possible, namely when all
    paths within a tree have no non-ASCII characters.

If you have thoughts on the approach in [1], they would be most welcome.

Thanks,
Taylor

[1]: https://lore.kernel.org/git/cover.1697653929.git.me@ttaylorr.com/
[2]: https://lore.kernel.org/git/f6ab427ead86bc82284b2c721f3c177947ece3c9.1697653929.git.me@ttaylorr.com/
Junio C Hamano Jan. 4, 2024, 6:12 p.m. UTC | #2
"Chen Xuewei via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Chen Xuewei <316403398@qq.com>
>
> It is known that whether the highest bit is extended when char cast to
> uint32, depends on CPU architecture, which will lead different hash
> value. This is a fix to accord all architecture behaviour.
>
> Signed-off-by: Chen Xuewei <316403398@qq.com>
> ---

Jonathan and Taylor, isn't this what you two were working together
on?  How would we want to proceed?

Chen, using the right implementation of the hash function to be used
after the next rebuild of the Bloom data has so far been treated as
only one part of the solution (the others include "how to deal with
the existing data?  do we have a way to tell our binary to safely
ignore the Bloom data using a wrong hash?" and "how to make sure old
binaries do not get confused by the Bloom data using the right/new
hash?").

Jonathan and Taylor's (stalled) effort is here

    https://lore.kernel.org/git/cover.1697653929.git.me@ttaylorr.com


Thanks.
Taylor Blau Jan. 4, 2024, 6:27 p.m. UTC | #3
On Thu, Jan 04, 2024 at 10:12:42AM -0800, Junio C Hamano wrote:
> Jonathan and Taylor, isn't this what you two were working together
> on?  How would we want to proceed?

They are indeed similar. I think that Jonathan and my series would
supersede this effort.

But I would appreciate if Chen took a look at the approach in that
series to make sure that we're all on the same page and that Jonathan
and I aren't missing anything.

Thanks,
Taylor
diff mbox series

Patch

diff --git a/bloom.c b/bloom.c
index 1474aa19fa5..bc40edac795 100644
--- a/bloom.c
+++ b/bloom.c
@@ -116,11 +116,11 @@  uint32_t murmur3_seeded(uint32_t seed, const char *data, size_t len)
 
 	uint32_t k;
 	for (i = 0; i < len4; i++) {
-		uint32_t byte1 = (uint32_t)data[4*i];
-		uint32_t byte2 = ((uint32_t)data[4*i + 1]) << 8;
-		uint32_t byte3 = ((uint32_t)data[4*i + 2]) << 16;
-		uint32_t byte4 = ((uint32_t)data[4*i + 3]) << 24;
-		k = byte1 | byte2 | byte3 | byte4;
+		uint32_t byte1 = ((uint32_t)data[4*i]) & 0xFF;
+		uint32_t byte2 = ((uint32_t)data[4*i + 1]) & 0xFF;
+		uint32_t byte3 = ((uint32_t)data[4*i + 2]) & 0xFF;
+		uint32_t byte4 = ((uint32_t)data[4*i + 3]) & 0xFF;
+		k = byte1 | (byte2 << 8) | (byte3 << 16) | (byte4 << 24);
 		k *= c1;
 		k = rotate_left(k, r1);
 		k *= c2;