diff mbox series

[v2,02/30] oid-array: teach oid-array to handle multiple kinds of oids

Message ID 20231002024034.2611-2-ebiederm@gmail.com (mailing list archive)
State New, archived
Headers show
Series initial support for multiple hash functions | expand

Commit Message

Eric W. Biederman Oct. 2, 2023, 2:40 a.m. UTC
From: "Eric W. Biederman" <ebiederm@xmission.com>

While looking at how to handle input of both SHA-1 and SHA-256 oids in
get_oid_with_context, I realized that the oid_array in
repo_for_each_abbrev might have more than one kind of oid stored in it
simultaneously.

Update to oid_array_append to ensure that oids added to an oid array
always have an algorithm set.

Update void_hashcmp to first verify two oids use the same hash algorithm
before comparing them to each other.

With that oid-array should be safe to use with different kinds of
oids simultaneously.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 oid-array.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

Comments

Linus Arver Feb. 13, 2024, 8:16 a.m. UTC | #1
"Eric W. Biederman" <ebiederm@gmail.com> writes:

> From: "Eric W. Biederman" <ebiederm@xmission.com>
>
> While looking at how to handle input of both SHA-1 and SHA-256 oids in
> get_oid_with_context, I realized that the oid_array in
> repo_for_each_abbrev might have more than one kind of oid stored in it
> simultaneously.
>
> Update to oid_array_append to ensure that oids added to an oid array

s/Update to/Update

> always have an algorithm set.
>
> Update void_hashcmp to first verify two oids use the same hash algorithm
> before comparing them to each other.
>
> With that oid-array should be safe to use with different kinds of

s/oid-array/oid_array

> oids simultaneously.
>
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  oid-array.c | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/oid-array.c b/oid-array.c
> index 8e4717746c31..1f36651754ed 100644
> --- a/oid-array.c
> +++ b/oid-array.c
> @@ -6,12 +6,20 @@ void oid_array_append(struct oid_array *array, const struct object_id *oid)
>  {
>  	ALLOC_GROW(array->oid, array->nr + 1, array->alloc);
>  	oidcpy(&array->oid[array->nr++], oid);
> +	if (!oid->algo)
> +		oid_set_algo(&array->oid[array->nr - 1], the_hash_algo);

How come we can't set oid->algo _before_ we call oidcpy()? It seems odd
that we do the copy first and then modify what we just copied after the
fact, instead of making sure that the thing we want to copy is correct
before doing the copy.

But also, if we are going to make the oid object "correct" before
invoking oidcpy(), we might as well do it when the oid is first
created/used (in the caller(s) of this function). I don't demand that
you find/demonstrate where all these places are in this series (maybe
that's a hairy problem to tackle?), but it seems cleaner in principle to
fix the creation of oid objects instead of having to make oid users
clean up their act like this after using them.

>  	array->sorted = 0;
>  }
>  
> -static int void_hashcmp(const void *a, const void *b)
> +static int void_hashcmp(const void *va, const void *vb)
>  {
> -	return oidcmp(a, b);
> +	const struct object_id *a = va, *b = vb;
> +	int ret;
> +	if (a->algo == b->algo)
> +		ret = oidcmp(a, b);

This makes sense (per the commit message description) ...

> +	else
> +		ret = a->algo > b->algo ? 1 : -1;

... but this seems to go against it? I thought you wanted to only ever
compare hashes if they were of the same algo? It would be good to add a
comment explaining why this is OK (we are no longer doing a byte-by-byte
comparison of these oids any more here like we do for oidcmp() above
which boils down to calling memcmp()).

> +	return ret;

Also, in terms of style I think the "early return for errors" style
would be simpler to read. I.e.

    if (a->algo > b->algo)
        return 1;

    if (a->algo < b->algo)
        return -1;

    return oidcmd(a, b);

>  }
>  
>  void oid_array_sort(struct oid_array *array)
> -- 
> 2.41.0
Kristoffer Haugsbakk Feb. 13, 2024, 8:31 a.m. UTC | #2
On Mon, Oct 2, 2023, at 04:40, Eric W. Biederman wrote:
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Most of your patches have this sign-off line with your name quoted.
Eric W. Biederman Feb. 15, 2024, 6:22 a.m. UTC | #3
Linus Arver <linusa@google.com> writes:

> "Eric W. Biederman" <ebiederm@gmail.com> writes:
>
>> From: "Eric W. Biederman" <ebiederm@xmission.com>
>>
>> While looking at how to handle input of both SHA-1 and SHA-256 oids in
>> get_oid_with_context, I realized that the oid_array in
>> repo_for_each_abbrev might have more than one kind of oid stored in it
>> simultaneously.
>>
>> Update to oid_array_append to ensure that oids added to an oid array
>
> s/Update to/Update
>
>> always have an algorithm set.
>>
>> Update void_hashcmp to first verify two oids use the same hash algorithm
>> before comparing them to each other.
>>
>> With that oid-array should be safe to use with different kinds of
>
> s/oid-array/oid_array
>
>> oids simultaneously.
>>
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> ---
>>  oid-array.c | 12 ++++++++++--
>>  1 file changed, 10 insertions(+), 2 deletions(-)
>>
>> diff --git a/oid-array.c b/oid-array.c
>> index 8e4717746c31..1f36651754ed 100644
>> --- a/oid-array.c
>> +++ b/oid-array.c
>> @@ -6,12 +6,20 @@ void oid_array_append(struct oid_array *array, const struct object_id *oid)
>>  {
>>  	ALLOC_GROW(array->oid, array->nr + 1, array->alloc);
>>  	oidcpy(&array->oid[array->nr++], oid);
>> +	if (!oid->algo)
>> +		oid_set_algo(&array->oid[array->nr - 1], the_hash_algo);
>
> How come we can't set oid->algo _before_ we call oidcpy()? It seems odd
> that we do the copy first and then modify what we just copied after the
> fact, instead of making sure that the thing we want to copy is correct
> before doing the copy.
>
> But also, if we are going to make the oid object "correct" before
> invoking oidcpy(), we might as well do it when the oid is first
> created/used (in the caller(s) of this function). I don't demand that
> you find/demonstrate where all these places are in this series (maybe
> that's a hairy problem to tackle?), but it seems cleaner in principle to
> fix the creation of oid objects instead of having to make oid users
> clean up their act like this after using them.

There is a hairy problem here.

I believe for reasons of simplicity when the algo field was added to
struct object_id it was allowed to be zero for users that don't
particularly care about the hash algorithm, and are happy to use the git
default hash algorithm.

Me experience working on this set of change set showed that there
are oids without their algo set in all kinds of places in the tree.

I could not think of any sure way to go through the entire tree
and find those users, so I just made certain that oid array handled
that case.

I need algo to be set properly in the oids in the oid array so I
could extend oid_array to hold multiple kinds of oids at the same
time.  To allow multiple kinds of oids at the same time void_hashcmp
needs a simple and reliable way to tell what the algorithm is of
any given oid.

>
>>  	array->sorted = 0;
>>  }
>>  
>> -static int void_hashcmp(const void *a, const void *b)
>> +static int void_hashcmp(const void *va, const void *vb)
>>  {
>> -	return oidcmp(a, b);
>> +	const struct object_id *a = va, *b = vb;
>> +	int ret;
>> +	if (a->algo == b->algo)
>> +		ret = oidcmp(a, b);
>
> This makes sense (per the commit message description) ...
>
>> +	else
>> +		ret = a->algo > b->algo ? 1 : -1;
>
> ... but this seems to go against it? I thought you wanted to only ever
> compare hashes if they were of the same algo? It would be good to add a
> comment explaining why this is OK (we are no longer doing a byte-by-byte
> comparison of these oids any more here like we do for oidcmp() above
> which boils down to calling memcmp()).

So the goal of this change is for oid_array to be able to hold hashes
from multiple algorithms at the same time.

A key part of oid_array is oid_array_sort that allows functions such
as oid_array_lookup and oid_array_for_each_unique.

To that end there needs to be a total ordering of oids.

The function oidcmp is only defined when two oids are of the same
algorithm, it does not even test to detect the case of comparing
mismatched algorithms.

Therefore to get a total ordering of oids.  I must use oidcmp
when the algorithm is the same (the common case) or simply order
the oids by algorithm when the algorithms are different.



All of this is relevant to get_oid_with_context as get_oid_with_context
and it's helper functions contain the logic that determines what
we do when a hex string that is ambiguous is specified.

In the ambiguous case all of the possible candidates are placed in
an oid_array, sorted and then displayed.


With a repository that can knows both the sha1 and the sha256 oid
of it's objects it is possible for a short oid to match both
some sha1 oids and some sha256 oids.

>> +	return ret;
>
> Also, in terms of style I think the "early return for errors" style
> would be simpler to read. I.e.
>
>     if (a->algo > b->algo)
>         return 1;
>
>     if (a->algo < b->algo)
>         return -1;
>
>     return oidcmd(a, b);
>

I can see doing:
	if (a->algo == b->algo)
        	return oidcmp(a,b);

	if (a->algo > b->algo)
        	return 1;
        else
        	return -1;

Or even:
	if (a->algo == b->algo)
        	return oidcmp(a,b);

	return a->algo - b->algo;

Although I suspect using subtraction is a bit too clever.

Comparing for less than, and greater than, and then assuming
the values are equal hides what is important before calling
oidcmp which is that the algo values are equal.


>>  }
>>  
>>  void oid_array_sort(struct oid_array *array)
>> -- 
>> 2.41.0

Eric
Eric W. Biederman Feb. 15, 2024, 6:24 a.m. UTC | #4
"Kristoffer Haugsbakk" <code@khaugsbakk.name> writes:

> On Mon, Oct 2, 2023, at 04:40, Eric W. Biederman wrote:
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>
> Most of your patches have this sign-off line with your name quoted.

At least for email syntax from which Signed-off-by syntax descends
having a period after my middle initial requires the name to be in
quotes.

Eric
Patrick Steinhardt Feb. 15, 2024, 11:21 a.m. UTC | #5
On Sun, Oct 01, 2023 at 09:40:06PM -0500, Eric W. Biederman wrote:
> From: "Eric W. Biederman" <ebiederm@xmission.com>
> 
> While looking at how to handle input of both SHA-1 and SHA-256 oids in
> get_oid_with_context, I realized that the oid_array in
> repo_for_each_abbrev might have more than one kind of oid stored in it
> simultaneously.
> 
> Update to oid_array_append to ensure that oids added to an oid array
> always have an algorithm set.
> 
> Update void_hashcmp to first verify two oids use the same hash algorithm
> before comparing them to each other.
> 
> With that oid-array should be safe to use with different kinds of
> oids simultaneously.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  oid-array.c | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/oid-array.c b/oid-array.c
> index 8e4717746c31..1f36651754ed 100644
> --- a/oid-array.c
> +++ b/oid-array.c
> @@ -6,12 +6,20 @@ void oid_array_append(struct oid_array *array, const struct object_id *oid)
>  {
>  	ALLOC_GROW(array->oid, array->nr + 1, array->alloc);
>  	oidcpy(&array->oid[array->nr++], oid);
> +	if (!oid->algo)
> +		oid_set_algo(&array->oid[array->nr - 1], the_hash_algo);

I feel like it's a design wart that `oid_array_append()` now started to
depend on repository discovery, adding an external dependency to it that
may cause very confusing behaviour. Are there for example ever cases
where we populate such an OID array before we have discovered the repo?
Can it happen that we use OID arrays in the context of a submodule that
has a different object ID than the main repository?

>  	array->sorted = 0;
>  }
>  
> -static int void_hashcmp(const void *a, const void *b)
> +static int void_hashcmp(const void *va, const void *vb)
>  {
> -	return oidcmp(a, b);
> +	const struct object_id *a = va, *b = vb;
> +	int ret;
> +	if (a->algo == b->algo)
> +		ret = oidcmp(a, b);
> +	else
> +		ret = a->algo > b->algo ? 1 : -1;

Okay, so we basically end up sorting first by the algorithm, and then we
sort all object IDs of a specific algorithm relative to each other.

Patrick

> +	return ret;
>  }
>  
>  void oid_array_sort(struct oid_array *array)
> -- 
> 2.41.0
>
Linus Arver Feb. 16, 2024, 12:16 a.m. UTC | #6
"Eric W. Biederman" <ebiederm@gmail.com> writes:

> Linus Arver <linusa@google.com> writes:
>
>> "Eric W. Biederman" <ebiederm@gmail.com> writes:
>>
>>> From: "Eric W. Biederman" <ebiederm@xmission.com>
>>>
>>> While looking at how to handle input of both SHA-1 and SHA-256 oids in
>>> get_oid_with_context, I realized that the oid_array in
>>> repo_for_each_abbrev might have more than one kind of oid stored in it
>>> simultaneously.
>>>
>>> Update to oid_array_append to ensure that oids added to an oid array
>>
>> s/Update to/Update
>>
>>> always have an algorithm set.
>>>
>>> Update void_hashcmp to first verify two oids use the same hash algorithm
>>> before comparing them to each other.
>>>
>>> With that oid-array should be safe to use with different kinds of
>>
>> s/oid-array/oid_array
>>
>>> oids simultaneously.
>>>
>>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>>> ---
>>>  oid-array.c | 12 ++++++++++--
>>>  1 file changed, 10 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/oid-array.c b/oid-array.c
>>> index 8e4717746c31..1f36651754ed 100644
>>> --- a/oid-array.c
>>> +++ b/oid-array.c
>>> @@ -6,12 +6,20 @@ void oid_array_append(struct oid_array *array, const struct object_id *oid)
>>>  {
>>>  	ALLOC_GROW(array->oid, array->nr + 1, array->alloc);
>>>  	oidcpy(&array->oid[array->nr++], oid);
>>> +	if (!oid->algo)
>>> +		oid_set_algo(&array->oid[array->nr - 1], the_hash_algo);
>>
>> How come we can't set oid->algo _before_ we call oidcpy()? It seems odd
>> that we do the copy first and then modify what we just copied after the
>> fact, instead of making sure that the thing we want to copy is correct
>> before doing the copy.
>>
>> But also, if we are going to make the oid object "correct" before
>> invoking oidcpy(), we might as well do it when the oid is first
>> created/used (in the caller(s) of this function). I don't demand that
>> you find/demonstrate where all these places are in this series (maybe
>> that's a hairy problem to tackle?), but it seems cleaner in principle to
>> fix the creation of oid objects instead of having to make oid users
>> clean up their act like this after using them.
>
> There is a hairy problem here.
>
> I believe for reasons of simplicity when the algo field was added to
> struct object_id it was allowed to be zero for users that don't
> particularly care about the hash algorithm, and are happy to use the git
> default hash algorithm.
>
> Me experience working on this set of change set showed that there
> are oids without their algo set in all kinds of places in the tree.

Ah, I see. Thanks for the clarification.

> I could not think of any sure way to go through the entire tree
> and find those users, so I just made certain that oid array handled
> that case.
>
> I need algo to be set properly in the oids in the oid array so I
> could extend oid_array to hold multiple kinds of oids at the same
> time.  To allow multiple kinds of oids at the same time void_hashcmp
> needs a simple and reliable way to tell what the algorithm is of
> any given oid.

Makes sense.

>>
>>>  	array->sorted = 0;
>>>  }
>>>  
>>> -static int void_hashcmp(const void *a, const void *b)
>>> +static int void_hashcmp(const void *va, const void *vb)
>>>  {
>>> -	return oidcmp(a, b);
>>> +	const struct object_id *a = va, *b = vb;
>>> +	int ret;
>>> +	if (a->algo == b->algo)
>>> +		ret = oidcmp(a, b);
>>
>> This makes sense (per the commit message description) ...
>>
>>> +	else
>>> +		ret = a->algo > b->algo ? 1 : -1;
>>
>> ... but this seems to go against it? I thought you wanted to only ever
>> compare hashes if they were of the same algo? It would be good to add a
>> comment explaining why this is OK (we are no longer doing a byte-by-byte
>> comparison of these oids any more here like we do for oidcmp() above
>> which boils down to calling memcmp()).
>
> So the goal of this change is for oid_array to be able to hold hashes
> from multiple algorithms at the same time.
>
> A key part of oid_array is oid_array_sort that allows functions such
> as oid_array_lookup and oid_array_for_each_unique.
>
> To that end there needs to be a total ordering of oids.
>
> The function oidcmp is only defined when two oids are of the same
> algorithm, it does not even test to detect the case of comparing
> mismatched algorithms.
>
> Therefore to get a total ordering of oids.  I must use oidcmp
> when the algorithm is the same (the common case) or simply order
> the oids by algorithm when the algorithms are different.
>
>
>
> All of this is relevant to get_oid_with_context as get_oid_with_context
> and it's helper functions contain the logic that determines what
> we do when a hex string that is ambiguous is specified.
>
> In the ambiguous case all of the possible candidates are placed in
> an oid_array, sorted and then displayed.
>
>
> With a repository that can knows both the sha1 and the sha256 oid
> of it's objects it is possible for a short oid to match both
> some sha1 oids and some sha256 oids.

Thanks for the additional clarification. I think a lot of this could
have been added as comments or perhaps in the commit message. The "short
id can match both sha1 or sha256" is a very real scenario we need to
consider in the sha1+sha256 world, indeed.

>>> +	return ret;
>>
>> Also, in terms of style I think the "early return for errors" style
>> would be simpler to read. I.e.
>>
>>     if (a->algo > b->algo)
>>         return 1;
>>
>>     if (a->algo < b->algo)
>>         return -1;
>>
>>     return oidcmd(a, b);
>>
>
> I can see doing:
> 	if (a->algo == b->algo)
>         	return oidcmp(a,b);
>
> 	if (a->algo > b->algo)
>         	return 1;
>         else
>         	return -1;
>
> Or even:
> 	if (a->algo == b->algo)
>         	return oidcmp(a,b);
>
> 	return a->algo - b->algo;
>
> Although I suspect using subtraction is a bit too clever.

Agreed.

> Comparing for less than, and greater than, and then assuming
> the values are equal hides what is important before calling
> oidcmp which is that the algo values are equal.

I would still prefer the "early return for errors" style even in this
case. This is because I much prefer to have the question "how can things
go wrong?" answered first, and dealt with, such that as I read
top-to-bottom I am left with less things I have to consider to
understand the "happy path". WRT emphasizing the "algos equal each
other" concern, a simple comment like

     /* Only compare equal algorithms. */
     return oidcmp(a, b);

seems sufficient.

But, of course it is possible (perhaps even likely) that my preferred
style is in the minority. Up to you. Thanks.
Eric W. Biederman Feb. 16, 2024, 4:48 a.m. UTC | #7
Linus Arver <linusa@google.com> writes:

>
> I would still prefer the "early return for errors" style even in this
> case. This is because I much prefer to have the question "how can things
> go wrong?" answered first, and dealt with, such that as I read
> top-to-bottom I am left with less things I have to consider to
> understand the "happy path". WRT emphasizing the "algos equal each
> other" concern, a simple comment like

void_hashcmp is a function that reports how two elements are ordered.
There is no error handling.

There are in fact two cases that need to be handled with oid_array being
able to contain more then one kind of hash at a time.

The two entries are ordered by oidcmp.
The two entries are ordered by hash algorithm.

The order that is maintained is first everything is ordered by hash
algorithm, then for entries of identical hash algorithm they are ordered
by oidcmp.

So I don't think the concept of early return for errors can ever apply
to any version of void_hashcmp.


Eric
Linus Arver Feb. 17, 2024, 1:59 a.m. UTC | #8
"Eric W. Biederman" <ebiederm@gmail.com> writes:

> Linus Arver <linusa@google.com> writes:
>
>>
>> I would still prefer the "early return for errors" style even in this
>> case. This is because I much prefer to have the question "how can things
>> go wrong?" answered first, and dealt with, such that as I read
>> top-to-bottom I am left with less things I have to consider to
>> understand the "happy path". WRT emphasizing the "algos equal each
>> other" concern, a simple comment like
>
> void_hashcmp is a function that reports how two elements are ordered.
> There is no error handling.
>
> There are in fact two cases that need to be handled with oid_array being
> able to contain more then one kind of hash at a time.
>
> The two entries are ordered by oidcmp.
> The two entries are ordered by hash algorithm.
>
> The order that is maintained is first everything is ordered by hash
> algorithm, then for entries of identical hash algorithm they are ordered
> by oidcmp.
>
> So I don't think the concept of early return for errors can ever apply
> to any version of void_hashcmp.

Ah, yes. It appears that I simply read the code wrong. Thank you for the
clarification, and I agree with your analysis and preferred style.

Sorry for the noise.
diff mbox series

Patch

diff --git a/oid-array.c b/oid-array.c
index 8e4717746c31..1f36651754ed 100644
--- a/oid-array.c
+++ b/oid-array.c
@@ -6,12 +6,20 @@  void oid_array_append(struct oid_array *array, const struct object_id *oid)
 {
 	ALLOC_GROW(array->oid, array->nr + 1, array->alloc);
 	oidcpy(&array->oid[array->nr++], oid);
+	if (!oid->algo)
+		oid_set_algo(&array->oid[array->nr - 1], the_hash_algo);
 	array->sorted = 0;
 }
 
-static int void_hashcmp(const void *a, const void *b)
+static int void_hashcmp(const void *va, const void *vb)
 {
-	return oidcmp(a, b);
+	const struct object_id *a = va, *b = vb;
+	int ret;
+	if (a->algo == b->algo)
+		ret = oidcmp(a, b);
+	else
+		ret = a->algo > b->algo ? 1 : -1;
+	return ret;
 }
 
 void oid_array_sort(struct oid_array *array)