Message ID | xmqqa5vou9ar.fsf@gitster.g (mailing list archive) |
---|---|
State | New, archived |
Series | rerere: match the hash algorithm with its length |
On 2023-07-21 at 23:36:12, Junio C Hamano wrote:
> The "conflict ID" used by "git rerere" to identify past conflicts we
> saw has been a SHA-1 hash of the normalized text taken from the
> conflicted region.  0d7c419a (rerere: convert to use the_hash_algo,
> 2018-10-15) updated the rerere machinery to use more general "hash"
> instead of hardcoded SHA-1 by using the_hash_algo, GIT_MAX_RAWSZ and
> their friends, but the code that read from the MERGE_RR records were
> left unconverted to still use get_sha1_hex(), possibly breaking the
> operation in SHA-256 repositories.

I agree consistency here is a good idea.  However, I should point out
the definition of `get_sha1_hex`:

  int get_sha1_hex(const char *hex, unsigned char *sha1)
  {
  	return get_hash_hex_algop(hex, sha1, the_hash_algo);
  }

Thus `get_sha1_hex` uses `the_hash_algo`, and therefore your change is
equivalent to what was there before, I believe.

That's because during the SHA-256 code work, we could either send a
bunch of patches to fix all of the instances of `get_sha1_hex` or we
could just patch that function to use the default hash algorithm, and I,
for better or for worse, made the decision to avoid the churn.

I still firmly agree that your change is better, because it is easier to
read and less confusing, and that is a major improvement in itself.
However, I would suggest that the commit message be updated to reflect
that if possible.

"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> I agree consistency here is a good idea.  However, I should point out
> the definition of `get_sha1_hex`:
>
>   int get_sha1_hex(const char *hex, unsigned char *sha1)
>   {
>   	return get_hash_hex_algop(hex, sha1, the_hash_algo);
>   }

Yeah, I think I lifted the inlining from there, and you are
absolutely right.  I think the main source of the confusion is that
get_sha1_hex(), while it was a perfectly good name before the
"struct object_id" world, has now become a misnomer.

I'd retract the patch you reviewed, but now I wonder if the
following is a good idea.

------- >8 ------------- >8 ------------- >8 -------
Subject: [PATCH] hex: retire get_sha1_hex()

The naming convention around get_sha1_hex() and its friends is
awkward these days (post "struct object_id").  There are three
public functions:

 * get_sha1_hex()       - use the implied the_hash_algo, fill uchar *
 * get_oid_hex()        - use the implied the_hash_algo, fill oid *
 * get_oid_hex_algop()  - use the passed algop, fill oid *

Between the latter two, the "_algop" suffix signals whether the
the_hash_algo is used as the implied algorithm or the caller should
pass an algorithm explicitly.  That is very much understandable and
is a good convention.

Between the former two, however, the "SHA1" vs "OID" in the names
differentiate in what type of variable the result is stored.

We could argue that it makes sense to use "SHA1" to mean "flat byte
buffer" to honor the historical practice in the days before "struct
object_id" was invented, but when we introduce and name the natural
fourth friend to the mix that takes an algop and fills a flat byte
buffer, it would get an awkward name: get_sha1_hex_algop().  Do we
use the passed in algo, or are we limited to SHA-1 ;-)?

Correct the misnomer and use "hash" as "flat byte buffer that stores
binary (as opposed to hexadecimal) representation of the hash".

The four (2x2) friends now become:

 * get_hash_hex()       - use the implied the_hash_algo, fill uchar *
 * get_oid_hex()        - use the implied the_hash_algo, fill oid *
 * get_hash_hex_algop() - use the passed algop, fill uchar *
 * get_oid_hex_algop()  - use the passed algop, fill oid *

As there are only two remaining calls to get_sha1_hex() in the
codebase right now, the blast radius is fairly small.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 hex.c      |  6 +++---
 hex.h      | 13 +++++++++----
 packfile.c |  2 +-
 rerere.c   |  2 +-
 4 files changed, 14 insertions(+), 9 deletions(-)

diff --git c/hex.c w/hex.c
index 7bb440e794..9ec4e674ad 100644
--- c/hex.c
+++ w/hex.c
@@ -49,8 +49,8 @@ int hex_to_bytes(unsigned char *binary, const char *hex, size_t len)
 	return 0;
 }
 
-static int get_hash_hex_algop(const char *hex, unsigned char *hash,
-			      const struct git_hash_algo *algop)
+int get_hash_hex_algop(const char *hex, unsigned char *hash,
+		       const struct git_hash_algo *algop)
 {
 	int i;
 	for (i = 0; i < algop->rawsz; i++) {
@@ -63,7 +63,7 @@ static int get_hash_hex_algop(const char *hex, unsigned char *hash,
 	return 0;
 }
 
-int get_sha1_hex(const char *hex, unsigned char *sha1)
+int get_hash_hex(const char *hex, unsigned char *sha1)
 {
 	return get_hash_hex_algop(hex, sha1, the_hash_algo);
 }
diff --git c/hex.h w/hex.h
index 7df4b3c460..9fa9c11fd0 100644
--- c/hex.h
+++ w/hex.h
@@ -20,14 +20,19 @@ static inline int hex2chr(const char *s)
 }
 
 /*
- * Try to read a SHA1 in hexadecimal format from the 40 characters
- * starting at hex.  Write the 20-byte result to sha1 in binary form.
+ * Try to read a hash (specified by the_hash_algo) in hexadecimal format from
+ * the 40 (or whatever length the hash algorithm uses) characters
+ * starting at hex.  Write the 20-byte (or the length of the hash) result to
+ * hash in binary form.
  * Return 0 on success.  Reading stops if a NUL is encountered in the
  * input, so it is safe to pass this function an arbitrary
  * null-terminated string.
  */
-int get_sha1_hex(const char *hex, unsigned char *sha1);
-int get_oid_hex(const char *hex, struct object_id *sha1);
+int get_hash_hex(const char *hex, unsigned char *hash);
+int get_oid_hex(const char *hex, struct object_id *oid);
+
+/* Like get_hash_hex, but for an arbitrary hash algorithm. */
+int get_hash_hex_algop(const char *hex, unsigned char *, const struct git_hash_algo *);
 
 /* Like get_oid_hex, but for an arbitrary hash algorithm. */
 int get_oid_hex_algop(const char *hex, struct object_id *oid, const struct git_hash_algo *algop);
diff --git c/packfile.c w/packfile.c
index 030b7ec7a8..3076fc8d6f 100644
--- c/packfile.c
+++ w/packfile.c
@@ -751,7 +751,7 @@ struct packed_git *add_packed_git(const char *path, size_t path_len, int local)
 	p->pack_local = local;
 	p->mtime = st.st_mtime;
 	if (path_len < the_hash_algo->hexsz ||
-	    get_sha1_hex(path + path_len - the_hash_algo->hexsz, p->hash))
+	    get_hash_hex(path + path_len - the_hash_algo->hexsz, p->hash))
 		hashclr(p->hash);
 	return p;
 }
diff --git c/rerere.c w/rerere.c
index 7070f75014..725c1b6a95 100644
--- c/rerere.c
+++ w/rerere.c
@@ -204,7 +204,7 @@ static void read_rr(struct repository *r, struct string_list *rr)
 		const unsigned hexsz = the_hash_algo->hexsz;
 
 		/* There has to be the hash, tab, path and then NUL */
-		if (buf.len < hexsz + 2 || get_sha1_hex(buf.buf, hash))
+		if (buf.len < hexsz + 2 || get_hash_hex(buf.buf, hash))
 			die(_("corrupt MERGE_RR"));
 
 		if (buf.buf[hexsz] != '.') {
On 2023-07-23 at 16:24:39, Junio C Hamano wrote:
> "brian m. carlson" <sandals@crustytoothpaste.net> writes:
>
> > I agree consistency here is a good idea.  However, I should point out
> > the definition of `get_sha1_hex`:
> >
> >   int get_sha1_hex(const char *hex, unsigned char *sha1)
> >   {
> >   	return get_hash_hex_algop(hex, sha1, the_hash_algo);
> >   }
>
> Yeah, I think I lifted the inlining from there, and you are
> absolutely right.  I think the main source of the confusion is that
> get_sha1_hex(), while it was a perfectly good name before the
> "struct object_id" world, has now become a misnomer.
>
> I'd retract the patch you reviewed, but now I wonder if the
> following is a good idea.

Yeah, I think that's a great idea, especially since now there are only a
handful of those calls left.
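
Editorial illustration: the 2x2 naming scheme proposed in the "hex: retire get_sha1_hex()" patch can be sketched in standalone C. This is only a rough model under stated assumptions, not Git's implementation: every `toy_` name, the two structs, and the `-1` error value are hypothetical stand-ins for `struct git_hash_algo`, `struct object_id`, `the_hash_algo` and the real functions. It shows that the variant with the implied algorithm is a thin wrapper over the `_algop` variant, and that the number of characters consumed comes from the algorithm, not from the function name.

```c
#include <stddef.h>

/* Hypothetical stand-ins for Git's types; NOT Git's real definitions. */
#define TOY_MAX_RAWSZ 32

struct toy_hash_algo {
	const char *name;
	size_t rawsz;	/* length of the hash in bytes */
	size_t hexsz;	/* length of the hash in hex characters */
};

struct toy_object_id {
	unsigned char hash[TOY_MAX_RAWSZ];
};

static const struct toy_hash_algo toy_sha1 = { "sha1", 20, 40 };
static const struct toy_hash_algo toy_sha256 = { "sha256", 32, 64 };

/* Plays the role of the_hash_algo: the repository's implied algorithm. */
static const struct toy_hash_algo *toy_the_hash_algo = &toy_sha1;

static int toy_hexval(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;	/* also rejects NUL, so parsing stops at end of string */
}

/* Explicit algorithm, flat byte buffer. Returns 0 on success, -1 on error. */
int toy_get_hash_hex_algop(const char *hex, unsigned char *hash,
			   const struct toy_hash_algo *algop)
{
	size_t i;

	for (i = 0; i < algop->rawsz; i++) {
		int hi = toy_hexval(hex[0]);
		int lo;

		if (hi < 0)
			return -1;	/* never reads past a NUL */
		lo = toy_hexval(hex[1]);
		if (lo < 0)
			return -1;
		hash[i] = (unsigned char)((hi << 4) | lo);
		hex += 2;
	}
	return 0;
}

/* Implied algorithm, flat byte buffer. */
int toy_get_hash_hex(const char *hex, unsigned char *hash)
{
	return toy_get_hash_hex_algop(hex, hash, toy_the_hash_algo);
}

/* Explicit algorithm, object ID. */
int toy_get_oid_hex_algop(const char *hex, struct toy_object_id *oid,
			  const struct toy_hash_algo *algop)
{
	return toy_get_hash_hex_algop(hex, oid->hash, algop);
}

/* Implied algorithm, object ID. */
int toy_get_oid_hex(const char *hex, struct toy_object_id *oid)
{
	return toy_get_oid_hex_algop(hex, oid, toy_the_hash_algo);
}
```

In this model, a 40-character string parses with the implied (SHA-1 sized) algorithm but is rejected when `&toy_sha256` is passed explicitly, because parsing reaches the terminating NUL before 32 bytes have been filled.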
diff --git a/rerere.c b/rerere.c
index 7070f75014..f06172253b 100644
--- a/rerere.c
+++ b/rerere.c
@@ -203,8 +203,13 @@ static void read_rr(struct repository *r, struct string_list *rr)
 		int variant;
 		const unsigned hexsz = the_hash_algo->hexsz;
 
-		/* There has to be the hash, tab, path and then NUL */
-		if (buf.len < hexsz + 2 || get_sha1_hex(buf.buf, hash))
+		/*
+		 * There has to be the "conflict ID", tab, path and then NUL.
+		 * "conflict ID" would be a hash, possibly suffixed by "." and
+		 * a small integer (variant number).
+		 */
+		if (buf.len < hexsz + 2 ||
+		    get_hash_hex_algop(buf.buf, hash, the_hash_algo))
 			die(_("corrupt MERGE_RR"));
 
 		if (buf.buf[hexsz] != '.') {
The "conflict ID" used by "git rerere" to identify past conflicts we
saw has been a SHA-1 hash of the normalized text taken from the
conflicted region.  0d7c419a (rerere: convert to use the_hash_algo,
2018-10-15) updated the rerere machinery to use more general "hash"
instead of hardcoded SHA-1 by using the_hash_algo, GIT_MAX_RAWSZ and
their friends, but the code that read from the MERGE_RR records were
left unconverted to still use get_sha1_hex(), possibly breaking the
operation in SHA-256 repositories.

We enumerate the subdirectories of $GIT_DIR/rr-cache/ and use the
ones whose name passes parse_oid_hex() in full as conflict IDs, so
they are always of correct length relative to the choice of the hash
the repository makes, and they are written to the MERGE_RR file.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---

 * The "conflict ID" uses SHA-1 not because we needed a secure hash.
   We only needed something that is reasonably long with fewer
   collisions (the "rerere" machinery tolerates collisions).  We
   just had SHA-1 readily available to us and that was the only
   reason we used it.

   As these "conflict ID" are not security sensitive, we could leave
   them as SHA-1 even in SHA-256 repositories and reverting 0d7c419a
   might be a good first step if we want to go in that direction,
   but let's be consistent.

 rerere.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)
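
Editorial illustration: the MERGE_RR record layout named in the patch's code comment (conflict ID, tab, path, NUL, where the conflict ID is a hash optionally followed by "." and a small variant number) can be sketched as a standalone parser. This is a simplified model, not the code in rerere.c: `toy_rr_record` and `toy_parse_merge_rr()` are hypothetical names, the hash length is passed in as a plain `hexsz` argument instead of coming from `the_hash_algo`, and rejecting an empty path or a signed variant number is an assumption made here for strictness.

```c
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result type; rerere.c keeps this information differently. */
struct toy_rr_record {
	int variant;		/* 0 when no ".N" suffix is present */
	const char *path;	/* points into the record that was parsed */
};

/*
 * Parse "<hex hash>[.<variant>]\t<path>" from a NUL-terminated record.
 * hexsz is the hex length of the repository's hash (40 for SHA-1,
 * 64 for SHA-256).  Returns 0 on success, -1 for a corrupt record.
 */
int toy_parse_merge_rr(const char *rec, size_t hexsz,
		       struct toy_rr_record *out)
{
	size_t len = strlen(rec);
	size_t i;
	const char *p;
	long variant = 0;

	/* There has to be room for the hash, a tab and a non-empty path. */
	if (len < hexsz + 2)
		return -1;

	/* The hash part must be exactly hexsz hexadecimal digits. */
	for (i = 0; i < hexsz; i++)
		if (!isxdigit((unsigned char)rec[i]))
			return -1;

	p = rec + hexsz;
	if (*p == '.') {
		char *end;

		/* Assumption: the variant is a plain unsigned decimal number. */
		if (!isdigit((unsigned char)p[1]))
			return -1;
		errno = 0;
		variant = strtol(p + 1, &end, 10);
		if (errno || variant > INT_MAX)
			return -1;
		p = end;
	}

	if (*p != '\t')
		return -1;
	p++;
	if (!*p)
		return -1;	/* assumption: an empty path is treated as corrupt */

	out->variant = (int)variant;
	out->path = p;
	return 0;
}
```

The point of the sketch is the one made in the thread: the position at which the "." or the tab is expected depends on `hexsz`, so a record written with a 40-character ID fails to parse when the reader expects 64 characters, and vice versa.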