[RFC,02/22] hex: add functions to parse hex object IDs in any algorithm

Message ID 20200113124729.3684846-3-sandals@crustytoothpaste.net
State New, archived
Series SHA-256 stage 4 implementation, part 1/3

Commit Message

brian m. carlson Jan. 13, 2020, 12:47 p.m. UTC
There are some places where we need to parse a hex object ID in any
algorithm without knowing beforehand which algorithm is in use. An
example is when parsing fast-import marks.

Add get_oid_hex_any to parse an object ID and return the algorithm it
belongs to, and additionally add parse_oid_hex_any, which is the
equivalent change for parse_oid_hex. If the object ID is not parseable,
we return GIT_HASH_UNKNOWN.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 cache.h | 10 ++++++++++
 hex.c   | 22 ++++++++++++++++++++++
 2 files changed, 32 insertions(+)
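
For readers following along without the Git tree, the length-based detection described in the commit message can be sketched roughly as below. This is a standalone illustration, not the patch itself: the `sketch_` names, the local algorithm table, and the hex decoder are all assumptions made so the example compiles on its own. Only the longest-first loop and the use of 0 as the "unknown" value are taken from the patch.

```c
#include <stddef.h>

/* Hypothetical stand-ins for Git's constants; 0 means "unknown". */
enum {
	SKETCH_HASH_UNKNOWN = 0,
	SKETCH_HASH_SHA1 = 1,
	SKETCH_HASH_SHA256 = 2,
	SKETCH_HASH_NALGOS = 3
};
#define SKETCH_MAX_RAWSZ 32

struct sketch_algo {
	const char *name;
	size_t rawsz;	/* hash length in bytes */
	size_t hexsz;	/* hash length in hex digits */
};

/* Ordered from shortest to longest, as the patch's NOTE requires. */
static const struct sketch_algo sketch_algos[SKETCH_HASH_NALGOS] = {
	{ NULL, 0, 0 },		/* slot 0: the "unknown" sentinel */
	{ "sha1", 20, 40 },
	{ "sha256", 32, 64 },
};

static int sketch_hexval(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;	/* also rejects the terminating NUL */
}

/* Returns 0 if hex starts with algo->hexsz hex digits, -1 otherwise. */
static int sketch_get_hash_hex(const char *hex, unsigned char *hash,
			       const struct sketch_algo *algo)
{
	size_t i;
	for (i = 0; i < algo->rawsz; i++) {
		int hi, lo;
		hi = sketch_hexval(hex[2 * i]);
		if (hi < 0)
			return -1;	/* stop before reading past a NUL */
		lo = sketch_hexval(hex[2 * i + 1]);
		if (lo < 0)
			return -1;
		hash[i] = (unsigned char)((hi << 4) | lo);
	}
	return 0;
}

/*
 * Try the longest algorithm first, so a 64-digit ID is not mistaken for a
 * 40-digit one. hash must have room for SKETCH_MAX_RAWSZ bytes, and may be
 * partially overwritten by a failed attempt.
 */
static int sketch_get_oid_hex_any(const char *hex, unsigned char *hash)
{
	int i;
	for (i = SKETCH_HASH_NALGOS - 1; i > 0; i--) {
		if (!sketch_get_hash_hex(hex, hash, &sketch_algos[i]))
			return i;
	}
	return SKETCH_HASH_UNKNOWN;
}

/* Like the above, but also reports where the object ID ended. */
static int sketch_parse_oid_hex_any(const char *hex, unsigned char *hash,
				    const char **end)
{
	int ret = sketch_get_oid_hex_any(hex, hash);
	if (ret)
		*end = hex + sketch_algos[ret].hexsz;
	return ret;
}
```

Under these assumptions, a caller that passes 40 hex digits followed by other text gets the SHA-1 index back with `*end` pointing at the first character after the ID, and input with no valid object ID yields 0 and leaves `*end` untouched.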

Comments

Junio C Hamano Jan. 15, 2020, 9:40 p.m. UTC | #1
"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> +/*
> + * NOTE: This function relies on hash algorithms being in order from shortest
> + * length to longest length.
> + */
> +int get_oid_hex_any(const char *hex, struct object_id *oid)
> +{
> +	int i;
> +	for (i = GIT_HASH_NALGOS - 1; i > 0; i--) {
> +		if (!get_hash_hex_algop(hex, oid->hash, &hash_algos[i]))
> +			return i;
> +	}
> +	return GIT_HASH_UNKNOWN;
> +}

Two rather obvious questions are

 - what if we have more than one algo that produces hashes of the
   same length?

 - it feels that GIT_HASH_UNKNOWN being 0 wastes the first/zeroth
   element in the hash_algos[] array.

In the future, I would imagine that we would want to be able to say
"here I have a dozen hexdigits that is an abbreviated SHA2 hash",
and we would use some syntax (e.g. "sha2:123456123456") for that.
Would this function be at the layer that would be extended later to
support such a syntax, or would we have a layer higher than this to
do so?



>  int get_oid_hex(const char *hex, struct object_id *oid)
>  {
>  	return get_oid_hex_algop(hex, oid, the_hash_algo);
> @@ -87,6 +101,14 @@ int parse_oid_hex_algop(const char *hex, struct object_id *oid,
>  	return ret;
>  }
>  
> +int parse_oid_hex_any(const char *hex, struct object_id *oid, const char **end)
> +{
> +	int ret = get_oid_hex_any(hex, oid);
> +	if (ret)
> +		*end = hex + hash_algos[ret].hexsz;
> +	return ret;
> +}
> +
>  int parse_oid_hex(const char *hex, struct object_id *oid, const char **end)
>  {
>  	return parse_oid_hex_algop(hex, oid, end, the_hash_algo);

brian m. carlson Jan. 16, 2020, 12:22 a.m. UTC | #2
On 2020-01-15 at 21:40:54, Junio C Hamano wrote:
> "brian m. carlson" <sandals@crustytoothpaste.net> writes:
> 
> > +/*
> > + * NOTE: This function relies on hash algorithms being in order from shortest
> > + * length to longest length.
> > + */
> > +int get_oid_hex_any(const char *hex, struct object_id *oid)
> > +{
> > +	int i;
> > +	for (i = GIT_HASH_NALGOS - 1; i > 0; i--) {
> > +		if (!get_hash_hex_algop(hex, oid->hash, &hash_algos[i]))
> > +			return i;
> > +	}
> > +	return GIT_HASH_UNKNOWN;
> > +}
> 
> Two rather obvious questions are
> 
>  - what if we have more than one algo that produces hashes of the
>    same length?

Then we have a problem that we'll have to deal with when it arises.  There are a
handful of functions that essentially document all these locations and
we'll have to decide how we fix them.
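
To make the concern concrete, here is a small sketch of what detection by length alone does when two algorithms share a hex length. Everything in it is hypothetical: the `amb_` names and the third table entry, "hypothetical-256", are invented for illustration and do not correspond to any algorithm in Git.

```c
#include <ctype.h>
#include <stddef.h>

struct amb_algo {
	const char *name;
	size_t hexsz;	/* hash length in hex digits */
};

/* Index 0 is the "unknown" slot; the last two entries share a length. */
static const struct amb_algo amb_algos[] = {
	{ NULL, 0 },
	{ "sha1", 40 },
	{ "sha256", 64 },
	{ "hypothetical-256", 64 },
};
#define AMB_NALGOS (sizeof(amb_algos) / sizeof(amb_algos[0]))

/* Count the leading hex digits of s. */
static size_t amb_hex_prefix_len(const char *s)
{
	size_t n = 0;
	while (isxdigit((unsigned char)s[n]))
		n++;
	return n;
}

/*
 * Longest-first detection by length only. With two 64-digit algorithms,
 * the higher index always wins and index 2 can never be returned.
 */
static int amb_detect(const char *hex)
{
	size_t have = amb_hex_prefix_len(hex);
	size_t i;
	for (i = AMB_NALGOS - 1; i > 0; i--) {
		if (have >= amb_algos[i].hexsz)
			return (int)i;
	}
	return 0;
}
```

In this sketch any 64-digit input is attributed to the last table entry, so a caller would need out-of-band information (a capability advertisement, a file header, an explicit prefix) to pick the right one.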

Notably, the dumb HTTP protocol doesn't provide any capability
advertisement, so it's not possible to ask the remote server which
algorithm it's using or negotiate a different one.  Bundles and
get-tar-commit-id are the other problem cases.

Granted, I did very much try to limit these cases as much as possible,
and most of our more modern code doesn't have this problem, but in some
cases it's just unavoidable.  I feel like with only three uses, doing
this won't be mortgaging our future too much.

>  - it feels that GIT_HASH_UNKNOWN being 0 wastes the first/zeroth
>    element in the hash_algos[] array.

I actually think it's really useful to have it this way, because then
it's easy to check for a valid hash algorithm by a comparison against 0.
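
A minimal sketch of that convention, using invented `demo_` names rather than Git's real identifiers: because the "unknown" value is 0, a lookup result can be tested for validity with a plain truth test, at the cost of an unused slot at index 0.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical constants; the only property that matters is UNKNOWN == 0. */
enum {
	DEMO_HASH_UNKNOWN = 0,
	DEMO_HASH_SHA1,
	DEMO_HASH_SHA256,
	DEMO_HASH_NALGOS
};

/* Slot 0 is deliberately left empty to match the sentinel. */
static const char *const demo_names[DEMO_HASH_NALGOS] = {
	NULL, "sha1", "sha256",
};

/* Returns the algorithm index, or DEMO_HASH_UNKNOWN (0) if not found. */
static int demo_algo_by_name(const char *name)
{
	int i;
	if (!name)
		return DEMO_HASH_UNKNOWN;
	for (i = 1; i < DEMO_HASH_NALGOS; i++) {
		if (!strcmp(name, demo_names[i]))
			return i;
	}
	return DEMO_HASH_UNKNOWN;
}

/* The validity check is a comparison against 0. */
static const char *demo_describe(const char *name)
{
	int algo = demo_algo_by_name(name);
	if (!algo)
		return "unknown";
	return demo_names[algo];
}
```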

> In the future, I would imagine that we would want to be able to say
> "here I have a dozen hexdigits that is an abbreviated SHA2 hash",
> and we would use some syntax (e.g. "sha2:123456123456") for that.
> Would this function be at the layer that would be extended later to
> support such a syntax, or would we have a layer higher than this to
> do so?

That's going to be at a different layer.  We'll have the ^{sha1} and
^{sha256} disambiguators that can be used with the normal revision
parsing syntax, and we'll handle the ambiguity there if one isn't
provided.  As for output, we'll only produce output in one algorithm at
a time so ambiguity isn't a problem there.

Patch

diff --git a/cache.h b/cache.h
index 493d57febe..6c094c3210 100644
--- a/cache.h
+++ b/cache.h
@@ -1522,6 +1522,16 @@  int parse_oid_hex(const char *hex, struct object_id *oid, const char **end);
 int parse_oid_hex_algop(const char *hex, struct object_id *oid, const char **end,
 			const struct git_hash_algo *algo);
 
+
+/*
+ * These functions work like get_oid_hex and parse_oid_hex, but they will parse
+ * a hex value for any algorithm. The algorithm is detected based on the length
+ * and the algorithm in use is returned. If this is not a hex object ID in any
+ * algorithm, returns GIT_HASH_UNKNOWN.
+ */
+int get_oid_hex_any(const char *hex, struct object_id *oid);
+int parse_oid_hex_any(const char *hex, struct object_id *oid, const char **end);
+
 /*
  * This reads short-hand syntax that not only evaluates to a commit
  * object name, but also can act as if the end user spelled the name
diff --git a/hex.c b/hex.c
index 10e24dc2e4..da51e64929 100644
--- a/hex.c
+++ b/hex.c
@@ -72,6 +72,20 @@  int get_oid_hex_algop(const char *hex, struct object_id *oid,
 	return get_hash_hex_algop(hex, oid->hash, algop);
 }
 
+/*
+ * NOTE: This function relies on hash algorithms being in order from shortest
+ * length to longest length.
+ */
+int get_oid_hex_any(const char *hex, struct object_id *oid)
+{
+	int i;
+	for (i = GIT_HASH_NALGOS - 1; i > 0; i--) {
+		if (!get_hash_hex_algop(hex, oid->hash, &hash_algos[i]))
+			return i;
+	}
+	return GIT_HASH_UNKNOWN;
+}
+
 int get_oid_hex(const char *hex, struct object_id *oid)
 {
 	return get_oid_hex_algop(hex, oid, the_hash_algo);
@@ -87,6 +101,14 @@  int parse_oid_hex_algop(const char *hex, struct object_id *oid,
 	return ret;
 }
 
+int parse_oid_hex_any(const char *hex, struct object_id *oid, const char **end)
+{
+	int ret = get_oid_hex_any(hex, oid);
+	if (ret)
+		*end = hex + hash_algos[ret].hexsz;
+	return ret;
+}
+
 int parse_oid_hex(const char *hex, struct object_id *oid, const char **end)
 {
 	return parse_oid_hex_algop(hex, oid, end, the_hash_algo);