Message ID | 454b070d5bb0f64e11cab993b126ef5d37a3615b.1733181682.git.gitgitgadget@gmail.com (mailing list archive)
---|---
State | Superseded
Series | pack-objects: Create an alternative name hash algorithm (recreated)
"Jonathan Tan via GitGitGadget" <gitgitgadget@gmail.com> writes: [snip] > diff --git a/pack-objects.h b/pack-objects.h > index b9898a4e64b..15be8368d21 100644 > --- a/pack-objects.h > +++ b/pack-objects.h > @@ -207,6 +207,34 @@ static inline uint32_t pack_name_hash(const char *name) > return hash; > } > > +static inline uint32_t pack_name_hash_v2(const char *name) > +{ > + uint32_t hash = 0, base = 0, c; > + > + if (!name) > + return 0; > + > + while ((c = *name++)) { > + if (isspace(c)) > + continue; > + if (c == '/') { > + base = (base >> 6) ^ hash; Just two questions about the implementation for my own knowledge. 1. Why use '6' here? I couldn't understand the reasoning for the specific value. 2. Generally in hashing algorithms the XOR is used to ensure that the output distribution is uniform which reduces collisions. Here, as you noted, we're more finding values for sorting rather than hashing in the traditional sense. So why use an XOR? > + hash = 0; > + } else { > + /* > + * 'c' is only a single byte. Reverse it and move > + * it to the top of the hash, moving the rest to > + * less-significant bits. > + */ > + c = (c & 0xF0) >> 4 | (c & 0x0F) << 4; > + c = (c & 0xCC) >> 2 | (c & 0x33) << 2; > + c = (c & 0xAA) >> 1 | (c & 0x55) << 1; > + hash = (hash >> 2) + (c << 24); > + } > + } > + return (base >> 6) ^ hash; > +} > + > static inline enum object_type oe_type(const struct object_entry *e) > { > return e->type_valid ? e->type_ : OBJ_BAD; > -- > gitgitgadget
karthik nayak <karthik.188@gmail.com> writes:

> 2. Generally in hashing algorithms the XOR is used to ensure that the
> output distribution is uniform which reduces collisions. Here, as you
> noted, we're more finding values for sorting rather than hashing in the
> traditional sense. So why use an XOR?

I am not Jonathan, but since the mixing-of-bits is done with XOR in
the original that Linus and I wrote, I think the question applies to
our version as well. We prefer not to lose entropy from input
bytes, so we do not want to use OR or AND, which tend to paint
everything with 1 or 0 as you mix in bits from more bytes.

Anyway the question sounds like "generally when you take tests you
write with a pencil, but right now you are merely taking notes and
not taking any tests. why are you writing with a pencil?" There are
multiple occasions that a pencil is a great fit as writing equipment.
Junio C Hamano <gitster@pobox.com> writes:

> karthik nayak <karthik.188@gmail.com> writes:
>
>> 2. Generally in hashing algorithms the XOR is used to ensure that the
>> output distribution is uniform which reduces collisions. Here, as you
>> noted, we're more finding values for sorting rather than hashing in the
>> traditional sense. So why use an XOR?
>
> I am not Jonathan, but since the mixing-of-bits is done with XOR in
> the original that Linus and I wrote, I think the question applies to
> our version as well. We prefer not to lose entropy from input
> bytes, so we do not want to use OR or AND, which tend to paint
> everything with 1 or 0 as you mix in bits from more bytes.
>

Thanks for answering. My reasoning at the start was more of my
thoughts on why XOR could be used here, and your explanation makes it
clearer.

> Anyway the question sounds like "generally when you take tests you
> write with a pencil, but right now you are merely taking notes and
> not taking any tests. why are you writing with a pencil?" There are
> multiple occasions that a pencil is a great fit as writing equipment.

I'd say my question was more of "I know a pencil is used when taking
tests because of XYZ reasons. On similar lines, why is a pencil used
here?" :)
"Jonathan Tan via GitGitGadget" <gitgitgadget@gmail.com> writes: > diff --git a/pack-objects.h b/pack-objects.h > index b9898a4e64b..15be8368d21 100644 > --- a/pack-objects.h > +++ b/pack-objects.h > @@ -207,6 +207,34 @@ static inline uint32_t pack_name_hash(const char *name) > return hash; > } > > +static inline uint32_t pack_name_hash_v2(const char *name) > +{ > + uint32_t hash = 0, base = 0, c; > + > + if (!name) > + return 0; > + > + while ((c = *name++)) { > + if (isspace(c)) > + continue; > + if (c == '/') { > + base = (base >> 6) ^ hash; > + hash = 0; > + } else { > + /* > + * 'c' is only a single byte. Reverse it and move > + * it to the top of the hash, moving the rest to > + * less-significant bits. > + */ > + c = (c & 0xF0) >> 4 | (c & 0x0F) << 4; > + c = (c & 0xCC) >> 2 | (c & 0x33) << 2; > + c = (c & 0xAA) >> 1 | (c & 0x55) << 1; > + hash = (hash >> 2) + (c << 24); > + } > + } > + return (base >> 6) ^ hash; > +} This works because `c` is masked before any arithmetic is performed on it, but the cast from potentially signed char to uint32_t still makes me nervous - if char is signed, it behaves as if it was first cast to int32_t and only then to uint32_t, as you can see from running the code below: #include <stdio.h> int main() { signed char c = 128; unsigned int u = c; printf("hello %u\n", u); return 0; } I would declare `c` as uint8_t or unsigned char.
Jonathan Tan <jonathantanmy@google.com> writes:

> "Jonathan Tan via GitGitGadget" <gitgitgadget@gmail.com> writes:
>> diff --git a/pack-objects.h b/pack-objects.h
>> index b9898a4e64b..15be8368d21 100644
>> --- a/pack-objects.h
>> +++ b/pack-objects.h
>> @@ -207,6 +207,34 @@ static inline uint32_t pack_name_hash(const char *name)
>> 	return hash;
>> }
>>
>> +static inline uint32_t pack_name_hash_v2(const char *name)
>> +{
>> +	uint32_t hash = 0, base = 0, c;
>> + ...
>> +	while ((c = *name++)) {
>> + ...
>> +		/*
>> +		 * 'c' is only a single byte. Reverse it and move
>> +		 * it to the top of the hash, moving the rest to
>> +		 * less-significant bits.
>> +		 */
>> +		c = (c & 0xF0) >> 4 | (c & 0x0F) << 4;
>> + ...
> This works because `c` is masked before any arithmetic is performed on
> it, but the cast from potentially signed char to uint32_t still makes
> me nervous - if char is signed, it behaves as if it was first cast to
> int32_t and only then to uint32_t, ...
> I would declare `c` as uint8_t or unsigned char.

I think you meant the type of "name", and your worry is that *name
may pick up a negative integer from there when the platform char is
signed? Here we are dealing with a pathname that often has UTF-8
encoded non-ASCII letters with the 8th bit set, and I agree with you
that being explicitly unsigned would certainly help reduce the
cognitive load.

Thanks.
diff --git a/pack-objects.h b/pack-objects.h
index b9898a4e64b..15be8368d21 100644
--- a/pack-objects.h
+++ b/pack-objects.h
@@ -207,6 +207,34 @@ static inline uint32_t pack_name_hash(const char *name)
 	return hash;
 }
 
+static inline uint32_t pack_name_hash_v2(const char *name)
+{
+	uint32_t hash = 0, base = 0, c;
+
+	if (!name)
+		return 0;
+
+	while ((c = *name++)) {
+		if (isspace(c))
+			continue;
+		if (c == '/') {
+			base = (base >> 6) ^ hash;
+			hash = 0;
+		} else {
+			/*
+			 * 'c' is only a single byte. Reverse it and move
+			 * it to the top of the hash, moving the rest to
+			 * less-significant bits.
+			 */
+			c = (c & 0xF0) >> 4 | (c & 0x0F) << 4;
+			c = (c & 0xCC) >> 2 | (c & 0x33) << 2;
+			c = (c & 0xAA) >> 1 | (c & 0x55) << 1;
+			hash = (hash >> 2) + (c << 24);
+		}
+	}
+	return (base >> 6) ^ hash;
+}
+
 static inline enum object_type oe_type(const struct object_entry *e)
 {
 	return e->type_valid ? e->type_ : OBJ_BAD;