Message ID | pull.1823.git.1730775907.gitgitgadget@gmail.com (mailing list archive) |
---|---|
Headers | show |
Series | pack-objects: Create an alternative name hash algorithm (recreated) | expand |
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: > This series introduces a new name-hash algorithm, but does not replace the > existing one. There are cases, such as packing a single snapshot of a > repository, where the existing algorithm outperforms the new one. I came up with a hash function that both uses information from a lot more of the path (not the full name, though) and preserves the sortable property (diff at the end of this email). It also contains fixes to the existing algorithm: not wasting the most significant bits of the hash if files in the repo mostly end in a lowercase alphabetic character, and the cast from a possibly-signed-possibly-unsigned char to a uint32_t. The results look quite good. In summary, the pack sizes are comparable to Stolee's results in the case of fluentui, and better than Stolee's results in the case of git. Here's one run on the fluentui repo (git clone https:// github.com/microsoft/fluentui; cd fluentui; git checkout a637a06df05360ce5ff21420803f64608226a875^ following the instructions in [1]: (before my change) Test this tree --------------------------------------------------------------------- 5313.2: thin pack 0.03(0.01+0.01) 5313.3: thin pack size 1.1K 5313.4: thin pack with --full-name-hash 0.03(0.00+0.02) 5313.5: thin pack size with --full-name-hash 3.0K 5313.6: big pack 1.60(2.87+0.32) 5313.7: big pack size 57.9M 5313.8: big pack with --full-name-hash 1.41(1.94+0.37) 5313.9: big pack size with --full-name-hash 57.8M 5313.10: shallow fetch pack 1.69(2.70+0.22) 5313.11: shallow pack size 33.0M 5313.12: shallow pack with --full-name-hash 1.49(1.84+0.34) 5313.13: shallow pack size with --full-name-hash 33.6M 5313.14: repack 75.10(537.66+5.47) 5313.15: repack size 454.2M 5313.16: repack with --full-name-hash 18.10(92.50+5.14) 5313.17: repack size with --full-name-hash 174.8M (after my change) Test this tree --------------------------------------------------------------------- 5313.2: thin pack 0.03(0.01+0.02) 5313.3: thin pack size 1.1K 5313.4: thin pack with --full-name-hash 0.03(0.01+0.02) 5313.5: thin pack size with --full-name-hash 1.1K 5313.6: big pack 1.62(2.94+0.28) 5313.7: big pack size 57.9M 5313.8: big pack with --full-name-hash 1.35(2.07+0.37) 5313.9: big pack size with --full-name-hash 57.6M 5313.10: shallow fetch pack 1.63(2.52+0.29) 5313.11: shallow pack size 33.0M 5313.12: shallow pack with --full-name-hash 1.50(2.10+0.23) 5313.13: shallow pack size with --full-name-hash 33.1M 5313.14: repack 74.86(531.39+5.49) 5313.15: repack size 454.7M 5313.16: repack with --full-name-hash 19.71(111.39+5.12) 5313.17: repack size with --full-name-hash 165.6M The tests were run by: GENERATE_COMPILATION_DATABASE=yes make CC=clang && (cd t/perf && env GIT_PERF_LARGE_REPO=~/tmp/fluentui ./run -- p5313*.sh) The similarity in sizes looked suspicious, so I replaced the contents of the hash function with "return 0;" and indeed the sizes significantly increased, so hopefully there is nothing wrong with my setup. The git repo was called out in [1] as demonstrating "some of the issues with this approach", but here are the results, run by: GENERATE_COMPILATION_DATABASE=yes make CC=clang && (cd t/perf && ./run -- p5313*.sh) Test this tree -------------------------------------------------------------------- 5313.2: thin pack 0.03(0.00+0.02) 5313.3: thin pack size 2.9K 5313.4: thin pack with --full-name-hash 0.03(0.00+0.02) 5313.5: thin pack size with --full-name-hash 2.9K 5313.6: big pack 1.69(2.80+0.28) 5313.7: big pack size 18.7M 5313.8: big pack with --full-name-hash 1.68(2.82+0.31) 5313.9: big pack size with --full-name-hash 18.8M 5313.10: shallow fetch pack 0.96(1.47+0.16) 5313.11: shallow pack size 12.1M 5313.12: shallow pack with --full-name-hash 1.01(1.51+0.14) 5313.13: shallow pack size with --full-name-hash 12.1M 5313.14: repack 17.05(69.99+4.33) 5313.15: repack size 116.5M 5313.16: repack with --full-name-hash 15.74(67.03+4.18) 5313.17: repack size with --full-name-hash 116.1M [1] https://lore.kernel.org/git/c14ef6879e451401381ebbdb8f30d33c8f56c25b.1730775908.git.gitgitgadget@gmail.com/ > | Repo | Standard Repack | With --full-name-hash | > |----------|-----------------|-----------------------| > | fluentui | 438 MB | 168 MB | > | Repo B | 6,255 MB | 829 MB | > | Repo C | 37,737 MB | 7,125 MB | > | Repo D | 130,049 MB | 6,190 MB | > | Repo E | 100,957 MB | 22,979 MB | > | Repo F | 8,308 MB | 746 MB | > | Repo G | 4,329 MB | 3,643 MB | If the results are similar for some of the above repos (I do not have access to them), maybe it's worth considering using my hash function (or a variation of it). I'll also take a look at the rest of the patch set. --- diff --git a/pack-objects.h b/pack-objects.h index 88360aa3e8..c4f35eafa0 100644 --- a/pack-objects.h +++ b/pack-objects.h @@ -209,23 +209,24 @@ static inline uint32_t pack_name_hash(const char *name) static inline uint32_t pack_full_name_hash(const char *name) { - const uint32_t bigp = 1234572167U; - uint32_t c, hash = bigp; + uint32_t hash = 0, base = 0; + uint8_t c; if (!name) return 0; - /* - * Do the simplest thing that will resemble pseudo-randomness: add - * random multiples of a large prime number with a binary shift. - * The goal is not to be cryptographic, but to be generally - * uniformly distributed. - */ - while ((c = *name++) != 0) { - hash += c * bigp; - hash = (hash >> 5) | (hash << 27); + while ((c = (uint8_t) *name++) != 0) { + if (isspace(c)) + continue; + if (c == '/') { + base = (base >> 6) ^ hash; + hash = 0; + } else { + uint8_t nybble_swapped = (c >> 4) + ((c & 15) << 4); + hash = (hash >> 2) + (nybble_swapped << 24); + } } - return hash; + return (base >> 6) ^ hash; }
Jonathan Tan <jonathantanmy@google.com> writes: > + while ((c = (uint8_t) *name++) != 0) { > + if (isspace(c)) > + continue; > + if (c == '/') { > + base = (base >> 6) ^ hash; > + hash = 0; > + } else { > + uint8_t nybble_swapped = (c >> 4) + ((c & 15) << 4); > + hash = (hash >> 2) + (nybble_swapped << 24); > + } > } > + return (base >> 6) ^ hash; > } Nice. The diff relative to the --full-name-hash version is a bit hard to grok, but compared to the current hash function, there are two and a half changes that matter: (0) it is more careful with bytes with the MSB set (i.e. non-ASCII pathnames). (1) it hashes each path component separetely and rotates the whole thing only at a directory boundary. I'd imagine that this would make a big difference for languages that force overly long filenames at each level. (2) it gives more weight to lower bits by swapping nybbles of each byte. I wonder if we do even better if we reverse all 8 bits instead of swapping nybbles (if we were to do so, it might be more efficient to shift in from the right instead of left end of the base and hash accumulators in the loop and then swap the whole resulting word at the end). Thanks for a fun read.
Junio C Hamano <gitster@pobox.com> writes: > ... (if we were to do so, it might be more efficient to > shift in from the right instead of left end of the base and hash > accumulators in the loop and then swap the whole resulting word at > the end). That is garbage. We could do the "shift in from the right and then reverse the result" for the hash accumulator, but not the "base" one. Sorry for the noise.
On 11/21/24 10:01 PM, Junio C Hamano wrote: > Jonathan Tan <jonathantanmy@google.com> writes: > >> + while ((c = (uint8_t) *name++) != 0) { >> + if (isspace(c)) >> + continue; >> + if (c == '/') { >> + base = (base >> 6) ^ hash; >> + hash = 0; >> + } else { >> + uint8_t nybble_swapped = (c >> 4) + ((c & 15) << 4); >> + hash = (hash >> 2) + (nybble_swapped << 24); >> + } >> } >> + return (base >> 6) ^ hash; >> } > > Nice. The diff relative to the --full-name-hash version is a bit > hard to grok, but compared to the current hash function, there are > two and a half changes that matter: > > (0) it is more careful with bytes with the MSB set (i.e. non-ASCII > pathnames). > > (1) it hashes each path component separetely and rotates the whole > thing only at a directory boundary. I'd imagine that this > would make a big difference for languages that force overly > long filenames at each level. I was confused by the "rotates the whole thing only at a directory boundary" statement. I think one way to say what you mean is Each path component is hashed similarly to the standard name-hash, and parent path component hashes are contributed via XOR after a down-shift of 6 bits per level. So we are getting something like [ name-hash for level 0 ] ......[ name-hash for level 1 ](truncated by 6) ............[name-hash for level 2](truncated by 12) ..................[...for level 3 ](truncated by 18) ........................[ level 4 ](truncated by 24) ..............................[ 5 ](truncated by 30) and at each layer we get the "last 16 bytes matter" issue, though it is balanced quite well. Also, the name-hash in each layer is adjusted for nybble swaps. (I don't think my explanation is _better_ but just that it matches my personal mental model slightly better.) > (2) it gives more weight to lower bits by swapping nybbles of each > byte. > > I wonder if we do even better if we reverse all 8 bits instead of > swapping nybbles (if we were to do so, it might be more efficient to > shift in from the right instead of left end of the base and hash > accumulators in the loop and then swap the whole resulting word at > the end). I will give this a try in my private repos as well as with the name-hash collision perf test from patch 7. Thanks, -Stolee
Junio C Hamano <gitster@pobox.com> writes: > I wonder if we do even better if we reverse all 8 bits instead of > swapping nybbles (if we were to do so, it might be more efficient to > shift in from the right instead of left end of the base and hash > accumulators in the loop and then swap the whole resulting word at > the end). > > Thanks for a fun read. Ah, yes, reversing is better than swapping nybbles (the least significant 2 bits have more entropy than the next-least significant 2 bits). When writing this, I didn't think of shifting in from the right (if I had thought of that, I would have indeed reversed the bits instead).
Derrick Stolee <stolee@gmail.com> writes: >> (1) it hashes each path component separetely and rotates the whole >> thing only at a directory boundary. I'd imagine that this >> would make a big difference for languages that force overly >> long filenames at each level. > > I was confused by the "rotates the whole thing only at a directory > boundary" statement. Yeah, I guess it was confusing. What I meant was that the entire result is shifted down with new material from left to right, but unlike the original, the outer thing (i.e. what is given to the caller as the result) is shifted only at the directory boundary, so we are not as aggressive to lose early bits by shifting them down to the right, as we are not shifting as fast as before. > I think one way to say what you mean is > > Each path component is hashed similarly to the standard name-hash, > and parent path component hashes are contributed via XOR after a > down-shift of 6 bits per level. Yes.