diff mbox series

[RFC,2/2] hash API: add and use a hash_short_id_by_algo() function

Message ID RFC-patch-2.2-051f0612ab9-20220519T113538Z-avarab@gmail.com (mailing list archive)
State New
Headers show
Series Utility functions for duplicated pack(write) code | expand

Commit Message

Ævar Arnfjörð Bjarmason May 19, 2022, 11:42 a.m. UTC
Add and use a hash_short_id_by_algo() function. As noted in the
comment being modified here (added in [1]) the intention wasn't to
have these end up in on-disk formats, but since [2], [3] and [3]
that's been the case, and there's an outstanding patch to add another
format that uses these[5].

So let's expose this functionality as a documented utility function,
instead of copy/pasting this code in various places.

Replacing the die() in the existing functions with a BUG() might be
overzelous, it's correct for the case of
e.g. write_commit_graph_file() and write_midx_header(), but we also
use this for parsing on-disk files, e.g. in parse_commit_graph().

We could add a "gently" version of this, but for now I think that
worrying about the distinction would be worrying too much. If we ever
end up parsing such files that'll almost certainly be a bug in our own
writing code, so the distinction would be rather academic, even though
such files could theoretically occur without a bug of ours.

1. f50e766b7b3 (Add structure representing hash algorithm, 2017-11-12)
2. 665d70ad033 (commit-graph: use the "hash version" byte, 2020-08-17)
3. d96075428a9 (multi-pack-index: use hash version byte, 2020-08-17)
4. 8ef50d9958f (pack-write.c: prepare to write 'pack-*.rev' files, 2021-01-25)
5. https://lore.kernel.org/git/1d775f9850f00b0c3d1e9133669a6365c8d7bbba.1652915424.git.me@ttaylorr.com/

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 commit-graph.c | 18 +++---------------
 hash.h         | 26 ++++++++++++++++++++++++--
 midx.c         | 18 +++---------------
 pack-write.c   | 12 +-----------
 4 files changed, 31 insertions(+), 43 deletions(-)

Comments

Junio C Hamano May 19, 2022, 3:50 p.m. UTC | #1
Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:

> Add and use a hash_short_id_by_algo() function. As noted in the
> comment being modified here (added in [1]) the intention wasn't to
> have these end up in on-disk formats, but since [2], [3] and [3]

double [3]?

> that's been the case, and there's an outstanding patch to add another
> format that uses these[5].
>
> So let's expose this functionality as a documented utility function,
> instead of copy/pasting this code in various places.
>
> Replacing the die() in the existing functions with a BUG() might be
> overzelous, it's correct for the case of
> e.g. write_commit_graph_file() and write_midx_header(), but we also
> use this for parsing on-disk files, e.g. in parse_commit_graph().

If we know the offending data can come from outside, not from
literals in the code, then there is no "might be", such a use of
BUG() is simply wrong.

> We could add a "gently" version of this, but for now I think that
> worrying about the distinction would be worrying too much. If we ever
> end up parsing such files that'll almost certainly be a bug in our own
> writing code, so the distinction would be rather academic, even though

It would be a file written by an ancient buggy version of our code,
or a buggy third-party reimplementation of Git.  It could be that a
new version of Git is using a yet-to-be-invented algorithm this
version of Git does not know about.

The distinction matters in that "The version of Git I downloaded and
built last week out of the latest release tag said BUG" should mean
only one thing: that version that reports BUG() is the culprit, not
some random other thing we do not even know where it came from that
left a corrupt data on disk.
Ævar Arnfjörð Bjarmason May 19, 2022, 7:07 p.m. UTC | #2
On Thu, May 19 2022, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:
>
>> Add and use a hash_short_id_by_algo() function. As noted in the
>> comment being modified here (added in [1]) the intention wasn't to
>> have these end up in on-disk formats, but since [2], [3] and [3]
>
> double [3]?

*nod*, sorry.

>> that's been the case, and there's an outstanding patch to add another
>> format that uses these[5].
>>
>> So let's expose this functionality as a documented utility function,
>> instead of copy/pasting this code in various places.
>>
>> Replacing the die() in the existing functions with a BUG() might be
>> overzelous, it's correct for the case of
>> e.g. write_commit_graph_file() and write_midx_header(), but we also
>> use this for parsing on-disk files, e.g. in parse_commit_graph().
>
> If we know the offending data can come from outside, not from
> literals in the code, then there is no "might be", such a use of
> BUG() is simply wrong.

Fair enough, we could change it to have 1/2 of this (the writing) use
BUG(), and die() for the other part, or just leave it as die() for both.

>> We could add a "gently" version of this, but for now I think that
>> worrying about the distinction would be worrying too much. If we ever
>> end up parsing such files that'll almost certainly be a bug in our own
>> writing code, so the distinction would be rather academic, even though
>
> It would be a file written by an ancient buggy version of our code,
> or a buggy third-party reimplementation of Git.  It could be that a
> new version of Git is using a yet-to-be-invented algorithm this
> version of Git does not know about.
>
> The distinction matters in that "The version of Git I downloaded and
> built last week out of the latest release tag said BUG" should mean
> only one thing: that version that reports BUG() is the culprit, not
> some random other thing we do not even know where it came from that
> left a corrupt data on disk.

Makes sense.
diff mbox series

Patch

diff --git a/commit-graph.c b/commit-graph.c
index 06107beedcb..157de4dd717 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -193,18 +193,6 @@  char *get_commit_graph_chain_filename(struct object_directory *odb)
 	return xstrfmt("%s/info/commit-graphs/commit-graph-chain", odb->path);
 }
 
-static uint8_t oid_version(void)
-{
-	switch (hash_algo_by_ptr(the_hash_algo)) {
-	case GIT_HASH_SHA1:
-		return 1;
-	case GIT_HASH_SHA256:
-		return 2;
-	default:
-		die(_("invalid hash version"));
-	}
-}
-
 static struct commit_graph *alloc_commit_graph(void)
 {
 	struct commit_graph *g = xcalloc(1, sizeof(*g));
@@ -365,9 +353,9 @@  struct commit_graph *parse_commit_graph(struct repository *r,
 	}
 
 	hash_version = *(unsigned char*)(data + 5);
-	if (hash_version != oid_version()) {
+	if (hash_version != hash_short_id_by_algo()) {
 		error(_("commit-graph hash version %X does not match version %X"),
-		      hash_version, oid_version());
+		      hash_version, hash_short_id_by_algo());
 		return NULL;
 	}
 
@@ -1924,7 +1912,7 @@  static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	hashwrite_be32(f, GRAPH_SIGNATURE);
 
 	hashwrite_u8(f, GRAPH_VERSION);
-	hashwrite_u8(f, oid_version());
+	hashwrite_u8(f, hash_short_id_by_algo());
 	hashwrite_u8(f, get_num_chunks(cf));
 	hashwrite_u8(f, ctx->num_commit_graphs_after - 1);
 
diff --git a/hash.h b/hash.h
index 5d40368f18a..31293401809 100644
--- a/hash.h
+++ b/hash.h
@@ -80,8 +80,7 @@  static inline void git_SHA256_Clone(git_SHA256_CTX *dst, const git_SHA256_CTX *s
 
 /*
  * Note that these constants are suitable for indexing the hash_algos array and
- * comparing against each other, but are otherwise arbitrary, so they should not
- * be exposed to the user or serialized to disk.  To know whether a
+ * comparing against each other, but are otherwise arbitrary. To know whether a
  * git_hash_algo struct points to some usable hash function, test the format_id
  * field for being non-zero.  Use the name field for user-visible situations and
  * the format_id field for fixed-length fields on disk.
@@ -337,4 +336,27 @@  static inline void oid_set_algo(struct object_id *oid, const struct git_hash_alg
 const char *empty_tree_oid_hex(void);
 const char *empty_blob_oid_hex(void);
 
+/**
+ * Convert GIT_HASH_SHA1 to 1, GIT_HASH_SHA256 to 2 etc.
+ *
+ * It's preferable to use GIT_{SHA1,SHA256}_FORMAT_ID instead for file
+ * formats. The original intention was not to make these short
+ * constants part of any file format.
+ * 
+ * But since that ship has sailed for various on-disk formats this
+ * utility function allows us to do that consistently in one place.
+ */
+static inline int hash_short_id_by_algo(void)
+{
+	int hash_algo = hash_algo_by_ptr(the_hash_algo);
+
+	switch (hash_algo) {
+	case GIT_HASH_SHA1:
+	case GIT_HASH_SHA256:
+		return hash_algo;
+	default:
+		BUG("invalid hash version");
+	}
+}
+
 #endif
diff --git a/midx.c b/midx.c
index 3db0e47735f..2e42afa5f00 100644
--- a/midx.c
+++ b/midx.c
@@ -41,18 +41,6 @@ 
 
 #define PACK_EXPIRED UINT_MAX
 
-static uint8_t oid_version(void)
-{
-	switch (hash_algo_by_ptr(the_hash_algo)) {
-	case GIT_HASH_SHA1:
-		return 1;
-	case GIT_HASH_SHA256:
-		return 2;
-	default:
-		die(_("invalid hash version"));
-	}
-}
-
 const unsigned char *get_midx_checksum(struct multi_pack_index *m)
 {
 	return m->data + m->data_len - the_hash_algo->rawsz;
@@ -134,9 +122,9 @@  struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
 		      m->version);
 
 	hash_version = m->data[MIDX_BYTE_HASH_VERSION];
-	if (hash_version != oid_version()) {
+	if (hash_version != hash_short_id_by_algo()) {
 		error(_("multi-pack-index hash version %u does not match version %u"),
-		      hash_version, oid_version());
+		      hash_version, hash_short_id_by_algo());
 		goto cleanup_fail;
 	}
 	m->hash_len = the_hash_algo->rawsz;
@@ -420,7 +408,7 @@  static size_t write_midx_header(struct hashfile *f,
 {
 	hashwrite_be32(f, MIDX_SIGNATURE);
 	hashwrite_u8(f, MIDX_VERSION);
-	hashwrite_u8(f, oid_version());
+	hashwrite_u8(f, hash_short_id_by_algo());
 	hashwrite_u8(f, num_chunks);
 	hashwrite_u8(f, 0); /* unused */
 	hashwrite_be32(f, num_packs);
diff --git a/pack-write.c b/pack-write.c
index 51812cb1299..c1ce8f6df8f 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -181,17 +181,7 @@  static int pack_order_cmp(const void *va, const void *vb, void *ctx)
 
 static void write_rev_header(struct hashfile *f)
 {
-	uint32_t oid_version;
-	switch (hash_algo_by_ptr(the_hash_algo)) {
-	case GIT_HASH_SHA1:
-		oid_version = 1;
-		break;
-	case GIT_HASH_SHA256:
-		oid_version = 2;
-		break;
-	default:
-		die("write_rev_header: unknown hash version");
-	}
+	uint32_t oid_version = hash_short_id_by_algo();
 
 	hashwrite_be32(f, RIDX_SIGNATURE);
 	hashwrite_be32(f, RIDX_VERSION);