diff mbox series

rerere: match the hash algorithm with its length

Message ID xmqqa5vou9ar.fsf@gitster.g (mailing list archive)
State New, archived
Series rerere: match the hash algorithm with its length

Commit Message

Junio C Hamano July 21, 2023, 11:36 p.m. UTC
The "conflict ID" used by "git rerere" to identify past conflicts we
saw has been a SHA-1 hash of the normalized text taken from the
conflicted region.  0d7c419a (rerere: convert to use the_hash_algo,
2018-10-15) updated the rerere machinery to use more general "hash"
instead of hardcoded SHA-1 by using the_hash_algo, GIT_MAX_RAWSZ and
their friends, but the code that reads the MERGE_RR records was
left unconverted and still uses get_sha1_hex(), possibly breaking the
operation in SHA-256 repositories.

We enumerate the subdirectories of $GIT_DIR/rr-cache/ and use the
ones whose name passes parse_oid_hex() in full as conflict IDs,
so they are always of correct length relative to the choice of the
hash the repository makes, and they are written to the MERGE_RR
file.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---

 * The "conflict ID" uses SHA-1 not because we needed a secure hash.
   We only needed something that is reasonably long with fewer
   collisions (the "rerere" machinery tolerates collisions).  We
   just had SHA-1 readily available to us and that was the only
   reason we used it.  As these "conflict IDs" are not security
   sensitive, we could leave them as SHA-1 even in SHA-256
   repositories and reverting 0d7c419a might be a good first step if
   we want to go in that direction, but let's be consistent.

 rerere.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)
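
As an illustration of the "passes in full" condition described in the commit message, the following is a minimal sketch, not git's implementation. `toy_is_conflict_id_dirname()` is a hypothetical helper, and its `hexsz` parameter stands in for the hex length of the repository's hash (40 for SHA-1, 64 for SHA-256).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not git's API): return 1 if name consists of
 * exactly hexsz hexadecimal digits and nothing else, 0 otherwise.
 */
int toy_is_conflict_id_dirname(const char *name, size_t hexsz)
{
	size_t i;

	assert(name != NULL);
	/* A name that is too short or has trailing characters is rejected. */
	if (strlen(name) != hexsz)
		return 0;
	for (i = 0; i < hexsz; i++)
		if (!isxdigit((unsigned char)name[i]))
			return 0;
	return 1;
}
```

With `hexsz` set to 64, a 40-character directory name is rejected, so under this model only names of the length the repository's hash implies would be treated as conflict IDs.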

Comments

brian m. carlson July 23, 2023, 3:03 p.m. UTC | #1
On 2023-07-21 at 23:36:12, Junio C Hamano wrote:
> The "conflict ID" used by "git rerere" to identify past conflicts we
> saw has been a SHA-1 hash of the normalized text taken from the
> conflicted region.  0d7c419a (rerere: convert to use the_hash_algo,
> 2018-10-15) updated the rerere machinery to use more general "hash"
> instead of hardcoded SHA-1 by using the_hash_algo, GIT_MAX_RAWSZ and
> their friends, but the code that read from the MERGE_RR records were
> left unconverted to still use get_sha1_hex(), possibly breaking the
> operation in SHA-256 repositories.

I agree consistency here is a good idea.  However, I should point out
the definition of `get_sha1_hex`:

int get_sha1_hex(const char *hex, unsigned char *sha1)
{
	return get_hash_hex_algop(hex, sha1, the_hash_algo);
}

Thus `get_sha1_hex` uses `the_hash_algo`, and therefore your change is
equivalent to what was there before, I believe.  That's because during
the SHA-256 code work, we could either send a bunch of patches to fix
all of the instances of `get_sha1_hex` or we could just patch that
function to use the default hash algorithm, and I, for better or for
worse, made the decision to avoid the churn.

I still firmly agree that your change is better, because it is easier to
read and less confusing, and that is a major improvement in itself.
However, I would suggest that the commit message be updated to reflect
that if possible.
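
To make the point concrete, here is a minimal, self-contained sketch of a parser whose accepted length depends on the active algorithm. It is a toy model, not git's code: `struct toy_hash_algo`, `toy_sha1`, `toy_sha256`, `toy_the_hash_algo`, and the `toy_` functions are hypothetical stand-ins. Only the shape of the wrapper mirrors the definition quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for git's struct git_hash_algo: sizes only. */
struct toy_hash_algo {
	size_t rawsz; /* bytes in the binary hash */
	size_t hexsz; /* characters in the hex form, 2 * rawsz */
};

#define TOY_MAX_RAWSZ 32

const struct toy_hash_algo toy_sha1 = { 20, 40 };
const struct toy_hash_algo toy_sha256 = { 32, 64 };

/* Toy analogue of the_hash_algo: the repository's active algorithm. */
const struct toy_hash_algo *toy_the_hash_algo = &toy_sha1;

/* Value of one hex digit, or -1 if c is not a hex digit (NUL included). */
int toy_hexval(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/* Parse 2 * algop->rawsz hex digits into hash; 0 on success, -1 otherwise. */
int toy_get_hash_hex_algop(const char *hex, unsigned char *hash,
			   const struct toy_hash_algo *algop)
{
	size_t i;

	assert(algop->rawsz <= TOY_MAX_RAWSZ);
	for (i = 0; i < algop->rawsz; i++) {
		int hi = toy_hexval(hex[2 * i]);
		int lo;

		/* Return at the first bad digit or NUL, before reading further. */
		if (hi < 0)
			return -1;
		lo = toy_hexval(hex[2 * i + 1]);
		if (lo < 0)
			return -1;
		hash[i] = (unsigned char)((hi << 4) | lo);
	}
	return 0;
}

/* The name says "sha1", but the length comes from the active algorithm. */
int toy_get_sha1_hex(const char *hex, unsigned char *sha1)
{
	return toy_get_hash_hex_algop(hex, sha1, toy_the_hash_algo);
}
```

In this model, pointing `toy_the_hash_algo` at `toy_sha256` makes `toy_get_sha1_hex()` require 64 hex digits and reject a 40-digit string, even though neither the function's name nor its signature changes.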
Junio C Hamano July 23, 2023, 4:24 p.m. UTC | #2
"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> I agree consistency here is a good idea.  However, I should point out
> the definition of `get_sha1_hex`:
>
> int get_sha1_hex(const char *hex, unsigned char *sha1)
> {
> 	return get_hash_hex_algop(hex, sha1, the_hash_algo);
> }

Yeah, I think I lifted the inlining from there, and you are
absolutely right.  I think the main source of the confusion is that
get_sha1_hex(), while it was a perfectly good name before the
"struct object_id" world, has now become a misnomer.

I'd retract the patch you reviewed, but now I wonder if the
following is a good idea.

------- >8 ------------- >8 ------------- >8 -------
Subject: [PATCH] hex: retire get_sha1_hex()

The naming convention around get_sha1_hex() and its friends is
awkward these days (post "struct object_id").  There are three public
functions:

 * get_sha1_hex()       - use the implied the_hash_algo, fill uchar *
 * get_oid_hex()        - use the implied the_hash_algo, fill oid *
 * get_oid_hex_algop()  - use the passed algop, fill oid *

Between the latter two, the "_algop" suffix signals whether the
the_hash_algo is used as the implied algorithm or the caller should
pass an algorithm explicitly.  That is very much understandable and
is a good convention.

Between the former two, however, the "SHA1" vs "OID" in the names
distinguishes the type of variable in which the result is stored.

We could argue that it makes sense to use "SHA1" to mean "flat byte
buffer" to honor the historical practice in the days before "struct
object_id" was invented, but when we introduce and name the natural
fourth friend to the mix that takes an algop and fills a flat byte
buffer, it would get an awkward name: get_sha1_hex_algop().

Do we use the passed-in algo, or are we limited to SHA-1 ;-)?

Correct the misnomer and use "hash" to mean "flat byte buffer that stores
binary (as opposed to hexadecimal) representation of the hash".  The
four (2x2) friends now become:

 * get_hash_hex()       - use the implied the_hash_algo, fill uchar *
 * get_oid_hex()        - use the implied the_hash_algo, fill oid *
 * get_hash_hex_algop() - use the passed algop, fill uchar *
 * get_oid_hex_algop()  - use the passed algop, fill oid *

As there are only two remaining calls to get_sha1_hex() in the
codebase right now, the blast radius is fairly small.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 hex.c      |  6 +++---
 hex.h      | 13 +++++++++----
 packfile.c |  2 +-
 rerere.c   |  2 +-
 4 files changed, 14 insertions(+), 9 deletions(-)

diff --git c/hex.c w/hex.c
index 7bb440e794..9ec4e674ad 100644
--- c/hex.c
+++ w/hex.c
@@ -49,8 +49,8 @@ int hex_to_bytes(unsigned char *binary, const char *hex, size_t len)
 	return 0;
 }
 
-static int get_hash_hex_algop(const char *hex, unsigned char *hash,
-			      const struct git_hash_algo *algop)
+int get_hash_hex_algop(const char *hex, unsigned char *hash,
+		       const struct git_hash_algo *algop)
 {
 	int i;
 	for (i = 0; i < algop->rawsz; i++) {
@@ -63,7 +63,7 @@ static int get_hash_hex_algop(const char *hex, unsigned char *hash,
 	return 0;
 }
 
-int get_sha1_hex(const char *hex, unsigned char *sha1)
+int get_hash_hex(const char *hex, unsigned char *sha1)
 {
 	return get_hash_hex_algop(hex, sha1, the_hash_algo);
 }
diff --git c/hex.h w/hex.h
index 7df4b3c460..9fa9c11fd0 100644
--- c/hex.h
+++ w/hex.h
@@ -20,14 +20,19 @@ static inline int hex2chr(const char *s)
 }
 
 /*
- * Try to read a SHA1 in hexadecimal format from the 40 characters
- * starting at hex.  Write the 20-byte result to sha1 in binary form.
+ * Try to read a hash (specified by the_hash_algo) in hexadecimal format from
+ * the 40 (or whatever length the hash algorithm uses) characters
+ * starting at hex.  Write the 20-byte (or the length of the hash) result to
+ * hash in binary form.
  * Return 0 on success.  Reading stops if a NUL is encountered in the
  * input, so it is safe to pass this function an arbitrary
  * null-terminated string.
  */
-int get_sha1_hex(const char *hex, unsigned char *sha1);
-int get_oid_hex(const char *hex, struct object_id *sha1);
+int get_hash_hex(const char *hex, unsigned char *hash);
+int get_oid_hex(const char *hex, struct object_id *oid);
+
+/* Like get_hash_hex, but for an arbitrary hash algorithm. */
+int get_hash_hex_algop(const char *hex, unsigned char *, const struct git_hash_algo *);
 
 /* Like get_oid_hex, but for an arbitrary hash algorithm. */
 int get_oid_hex_algop(const char *hex, struct object_id *oid, const struct git_hash_algo *algop);
diff --git c/packfile.c w/packfile.c
index 030b7ec7a8..3076fc8d6f 100644
--- c/packfile.c
+++ w/packfile.c
@@ -751,7 +751,7 @@ struct packed_git *add_packed_git(const char *path, size_t path_len, int local)
 	p->pack_local = local;
 	p->mtime = st.st_mtime;
 	if (path_len < the_hash_algo->hexsz ||
-	    get_sha1_hex(path + path_len - the_hash_algo->hexsz, p->hash))
+	    get_hash_hex(path + path_len - the_hash_algo->hexsz, p->hash))
 		hashclr(p->hash);
 	return p;
 }
diff --git c/rerere.c w/rerere.c
index 7070f75014..725c1b6a95 100644
--- c/rerere.c
+++ w/rerere.c
@@ -204,7 +204,7 @@ static void read_rr(struct repository *r, struct string_list *rr)
 		const unsigned hexsz = the_hash_algo->hexsz;
 
 		/* There has to be the hash, tab, path and then NUL */
-		if (buf.len < hexsz + 2 || get_sha1_hex(buf.buf, hash))
+		if (buf.len < hexsz + 2 || get_hash_hex(buf.buf, hash))
 			die(_("corrupt MERGE_RR"));
 
 		if (buf.buf[hexsz] != '.') {
brian m. carlson July 24, 2023, 9:22 p.m. UTC | #3
On 2023-07-23 at 16:24:39, Junio C Hamano wrote:
> "brian m. carlson" <sandals@crustytoothpaste.net> writes:
> 
> > I agree consistency here is a good idea.  However, I should point out
> > the definition of `get_sha1_hex`:
> >
> > int get_sha1_hex(const char *hex, unsigned char *sha1)
> > {
> > 	return get_hash_hex_algop(hex, sha1, the_hash_algo);
> > }
> 
> Yeah, I think I lifted the inlining from there, and you are
> absolutely right.  I think the main source of the confusion is that
> get_sha1_hex(), while it was a perfectly good name before the
> "struct object_id" world, has now become a misnomer.
> 
> I'd retract the patch you reviewed, but now I wonder if the
> following is a good idea.

Yeah, I think that's a great idea, especially since now there are only a
handful of those calls left.

Patch

diff --git a/rerere.c b/rerere.c
index 7070f75014..f06172253b 100644
--- a/rerere.c
+++ b/rerere.c
@@ -203,8 +203,13 @@  static void read_rr(struct repository *r, struct string_list *rr)
 		int variant;
 		const unsigned hexsz = the_hash_algo->hexsz;
 
-		/* There has to be the hash, tab, path and then NUL */
-		if (buf.len < hexsz + 2 || get_sha1_hex(buf.buf, hash))
+		/*
+		 * There has to be the "conflict ID", tab, path and then NUL.
+		 * "conflict ID" would be a hash, possibly suffixed by "." and
+		 * a small integer (variant number).
+		 */
+		if (buf.len < hexsz + 2 ||
+		    get_hash_hex_algop(buf.buf, hash, the_hash_algo))
 			die(_("corrupt MERGE_RR"));
 
 		if (buf.buf[hexsz] != '.') {
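
The record layout described in the new comment can be sketched as a standalone parser. This is an illustration under assumptions, not the code in rerere.c: `struct toy_rr_record` and `toy_parse_rr_record()` are hypothetical names, `hexsz` stands in for `the_hash_algo->hexsz`, and an error return replaces `die()`. The sketch also rejects an empty path, which is an assumption of this example rather than something the patch states.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical parsed form of one MERGE_RR record. */
struct toy_rr_record {
	int variant;      /* 0 when the conflict ID has no ".N" suffix */
	const char *path; /* points into the caller's buffer */
};

/*
 * Parse "<hex hash>[.<variant>]\t<path>" where the hash is exactly hexsz
 * hex digits.  Return 0 on success, -1 if the record is malformed.
 */
int toy_parse_rr_record(const char *buf, size_t hexsz,
			struct toy_rr_record *out)
{
	const char *p;
	size_t i;

	assert(buf != NULL && out != NULL);
	/* There has to be room for the hash, a tab and at least one path byte. */
	if (strlen(buf) < hexsz + 2)
		return -1;
	for (i = 0; i < hexsz; i++)
		if (!isxdigit((unsigned char)buf[i]))
			return -1;

	p = buf + hexsz;
	out->variant = 0;
	if (*p == '.') {
		char *end;
		long v;

		/* Require a digit so strtol() cannot accept a sign or spaces. */
		if (!isdigit((unsigned char)p[1]))
			return -1;
		errno = 0;
		v = strtol(p + 1, &end, 10);
		if (errno || v > INT_MAX)
			return -1;
		out->variant = (int)v;
		p = end;
	}
	if (*p != '\t' || p[1] == '\0')
		return -1;
	out->path = p + 1;
	return 0;
}
```

Because the length check and the digit loop both use `hexsz`, a record written with a 40-digit conflict ID fails to parse when `hexsz` is 64, which is the consistency between hash choice and ID length that the thread is concerned with.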