Message ID | 68b4127580e2d475bec0d7cd0f6a9ae5e626b3c9.1734715194.git.gitgitgadget@gmail.com (mailing list archive) |
---|---|
State | Accepted |
Commit | dca924b4508e3147e82be5bba10cfc8aefa7ecf0 |
Headers | show |
Series | pack-objects: Create an alternative name hash algorithm (recreated) | expand |
On Fri, Dec 20, 2024 at 05:19:47PM +0000, Jonathan Tan via GitGitGadget wrote: > The first change is to be more careful about paths using non-ASCII > characters. With these characters in mind, reverse the bits in the byte > as the least-significant bits have the highest entropy and we want to > maximize their influence. This is done with some bit manipulation that > swaps the two halves, then the quarters within those halves, and then > the bits within those quarters. Makes sense, and seems quite reasonable. > The second change is to perform hash composition operations at every > level of the path. This is done by storing a 'base' hash value that > contains the hash of the parent directory. When reaching a directory > boundary, we XOR the current level's name-hash value with a downshift of > the previous level's hash. This perturbation intends to create low-bit > distinctions for paths with the same final 16 bytes but distinct parent > directory structures. Very clever, I love this idea. Thanks, Taylor
diff --git a/pack-objects.h b/pack-objects.h index b9898a4e64b..681c1116486 100644 --- a/pack-objects.h +++ b/pack-objects.h @@ -207,6 +207,34 @@ static inline uint32_t pack_name_hash(const char *name) return hash; } +static inline uint32_t pack_name_hash_v2(const unsigned char *name) +{ + uint32_t hash = 0, base = 0, c; + + if (!name) + return 0; + + while ((c = *name++)) { + if (isspace(c)) + continue; + if (c == '/') { + base = (base >> 6) ^ hash; + hash = 0; + } else { + /* + * 'c' is only a single byte. Reverse it and move + * it to the top of the hash, moving the rest to + * less-significant bits. + */ + c = (c & 0xF0) >> 4 | (c & 0x0F) << 4; + c = (c & 0xCC) >> 2 | (c & 0x33) << 2; + c = (c & 0xAA) >> 1 | (c & 0x55) << 1; + hash = (hash >> 2) + (c << 24); + } + } + return (base >> 6) ^ hash; +} + static inline enum object_type oe_type(const struct object_entry *e) { return e->type_valid ? e->type_ : OBJ_BAD;