Message ID | pull.1636.git.git.1704376606625.gitgitgadget@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | fix: platform accordance while calculating murmur3 | expand |
Hi Chen, On Thu, Jan 04, 2024 at 01:56:46PM +0000, Chen Xuewei via GitGitGadget wrote: > From: Chen Xuewei <316403398@qq.com> > > It is known that whether the highest bit is extended when char cast to > uint32, depends on CPU architecture, which will lead different hash > value. This is a fix to accord all architecture behaviour. Thanks for your patch. A similar fix is being pursued in [1], part of which includes [2], which I believe is functionally equivalent to your patch here. > Others > ====== > > after fixed the bug, the historical bloom_filter data stored in > commit-graph need to be updated. because the path's hash value is > already calculated through a bad way. so we need to update it. this need > to be done in repository We would not want to impose that burden on all users upon upgrading to the latest Git version. In [1] we are perusing an approach where: - The Bloom data is stored with a version identifier, meaning that we can still use the existing/non-murmur3 Bloom filters after upgrading. - When the user decides to upgrade from v1 -> v2 Bloom filters, we reuse the existing Bloom filter data when possible, namely when all paths within a tree have no non-ASCII characters. If you have thoughts on the approach in [1], they would be most welcome. Thanks, Taylor [1]: https://lore.kernel.org/git/cover.1697653929.git.me@ttaylorr.com/ [2]: https://lore.kernel.org/git/f6ab427ead86bc82284b2c721f3c177947ece3c9.1697653929.git.me@ttaylorr.com/
"Chen Xuewei via GitGitGadget" <gitgitgadget@gmail.com> writes: > From: Chen Xuewei <316403398@qq.com> > > It is known that whether the highest bit is extended when char cast to > uint32, depends on CPU architecture, which will lead different hash > value. This is a fix to accord all architecture behaviour. > > Signed-off-by: Chen Xuewei <316403398@qq.com> > --- Jonathan and Taylor, isn't this what you two were working together on? How would we want to proceed? Chen, using the right implementation of the hash function to be used after the next rebuild of the Bloom data has so far been treated as only one part of the solution (the others include "how to deal with the existing data? do we have a way to tell our binary to safely ignore the Bloom data using a wrong hash?" and "how to make sure old binaries do not get confused by the Bloom data using the right/new hash?"). Jonathan and Taylor's (stalled) effort is here https://lore.kernel.org/git/cover.1697653929.git.me@ttaylorr.com Thanks.
On Thu, Jan 04, 2024 at 10:12:42AM -0800, Junio C Hamano wrote: > Jonathan and Taylor, isn't this what you two were working together > on? How would we want to proceed? They are indeed similar. I think that Jonathan and my series would supersede this effort. But I would appreciate if Chen took a look at the approach in that series to make sure that we're all on the same page and that Jonathan and I aren't missing anything. Thanks, Taylor
diff --git a/bloom.c b/bloom.c index 1474aa19fa5..bc40edac795 100644 --- a/bloom.c +++ b/bloom.c @@ -116,11 +116,11 @@ uint32_t murmur3_seeded(uint32_t seed, const char *data, size_t len) uint32_t k; for (i = 0; i < len4; i++) { - uint32_t byte1 = (uint32_t)data[4*i]; - uint32_t byte2 = ((uint32_t)data[4*i + 1]) << 8; - uint32_t byte3 = ((uint32_t)data[4*i + 2]) << 16; - uint32_t byte4 = ((uint32_t)data[4*i + 3]) << 24; - k = byte1 | byte2 | byte3 | byte4; + uint32_t byte1 = ((uint32_t)data[4*i]) & 0xFF; + uint32_t byte2 = ((uint32_t)data[4*i + 1]) & 0xFF; + uint32_t byte3 = ((uint32_t)data[4*i + 2]) & 0xFF; + uint32_t byte4 = ((uint32_t)data[4*i + 3]) & 0xFF; + k = byte1 | (byte2 << 8) | (byte3 << 16) | (byte4 << 24); k *= c1; k = rotate_left(k, r1); k *= c2;