
[03/15] cache: add an algo member to struct object_id

Message ID 20210410152140.3525040-4-sandals@crustytoothpaste.net (mailing list archive)
State Superseded
Series SHA-256 / SHA-1 interop, part 1

Commit Message

brian m. carlson April 10, 2021, 3:21 p.m. UTC
Now that we're working with multiple hash algorithms in the same repo,
it's best if we label each object ID with its algorithm so we can
determine how to format a given object ID. Add a member called algo to
struct object_id.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 hash.h | 1 +
 1 file changed, 1 insertion(+)

Comments

Ævar Arnfjörð Bjarmason April 11, 2021, 11:55 a.m. UTC | #1
On Sat, Apr 10 2021, brian m. carlson wrote:

> Now that we're working with multiple hash algorithms in the same repo,
> it's best if we label each object ID with its algorithm so we can
> determine how to format a given object ID. Add a member called algo to
> struct object_id.
>
> Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
> ---
>  hash.h | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/hash.h b/hash.h
> index 3fb0c3d400..dafdcb3335 100644
> --- a/hash.h
> +++ b/hash.h
> @@ -181,6 +181,7 @@ static inline int hash_algo_by_ptr(const struct git_hash_algo *p)
>  
>  struct object_id {
>  	unsigned char hash[GIT_MAX_RAWSZ];
> +	int algo;

A curiosity, since I'm far less familiar with the multi-hash support
than you are:

So struct object_id is GIT_MAX_RAWSZ, not two types of structs for
GIT_SHA1_RAWSZ and GIT_SHA256_RAWSZ. That pre-dates this series because
we'd like to not deal with two types of objects everywhere for SHA-1 and
SHA-256. Makes sense.

Before this series we'd memcmp them up to their actual length, but the
last GIT_MAX_RAWSZ - GIT_SHA1_RAWSZ bytes would be uninitialized.

Now we pad them out, so the last 96 bits of every SHA-1 OID are 0000...
Couldn't we also tell which hash an object is by memcmp-ing those last N
bits and seeing if they're all zeroed?

Feels a bit hackish, and we'd need to reconsider that method if we'd
ever support other same-length hashes.
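
Something like the following sketch is what I mean (the name is made up,
not git's actual API, and it leans on hash.h's GIT_* constants):

    static int oid_algo_by_padding(const struct object_id *oid)
    {
            /* Sketch only: with SHA-1 OIDs zero-padded out to GIT_MAX_RAWSZ,
             * the unused tail could in principle identify the algorithm. */
            static const unsigned char zero[GIT_MAX_RAWSZ - GIT_SHA1_RAWSZ];

            if (!memcmp(oid->hash + GIT_SHA1_RAWSZ, zero, sizeof(zero)))
                    return GIT_HASH_SHA1;   /* trailing 96 bits are all zero */
            return GIT_HASH_SHA256;
    }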

But OTOH, having these objects all padded out in memory to the same
length, while still having to carry around a "what hash algo is it"
member, yields the arguably weird hack of a per-hash NULL_OID, which has
never been an actual object of any hash type, but just a pseudo-object.

As another aside, I had some local patches (just for playing around) to
implement SHA-256/160, i.e. SHA-256 truncated to SHA-1 length, which
doesn't officially exist. We'd store things as full-length SHA-256
internally, but anything that would format them (including plumbing
output) would emit the truncated version(s).

The idea was to support Git/SHA-256 in combination with legacy systems
that would otherwise all need DB column changes to handle
different-length hashes.

I abandoned it as insane silliness after playing with it for about a
day, but it did reveal that much of the hash code can now assume
internal length == formatting length, which is why I'm 3 paragraphs into
this digression, i.e. maybe some of the code structure also makes having
a NULL_OID always be 256 bits when we want to format it as 160/256
painful...
brian m. carlson April 11, 2021, 9:37 p.m. UTC | #2
On 2021-04-11 at 11:55:57, Ævar Arnfjörð Bjarmason wrote:
> 
> On Sat, Apr 10 2021, brian m. carlson wrote:
> 
> > Now that we're working with multiple hash algorithms in the same repo,
> > it's best if we label each object ID with its algorithm so we can
> > determine how to format a given object ID. Add a member called algo to
> > struct object_id.
> >
> > Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
> > ---
> >  hash.h | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/hash.h b/hash.h
> > index 3fb0c3d400..dafdcb3335 100644
> > --- a/hash.h
> > +++ b/hash.h
> > @@ -181,6 +181,7 @@ static inline int hash_algo_by_ptr(const struct git_hash_algo *p)
> >  
> >  struct object_id {
> >  	unsigned char hash[GIT_MAX_RAWSZ];
> > +	int algo;
> 
> A curiosity, since I'm far less familiar with the multi-hash support
> than you are:
> 
> So struct object_id is GIT_MAX_RAWSZ, not two types of structs for
> GIT_SHA1_RAWSZ and GIT_SHA256_RAWSZ. That pre-dates this series because
> we'd like to not deal with two types of objects everywhere for SHA-1 and
> SHA-256. Makes sense.
> 
> Before this series we'd memcmp them up to their actual length, but the
> last GIT_MAX_RAWSZ-GIT_SHA1_RAWSZ would be uninitialized
> 
> Now we pad them out, so the last 96 bits of every SHA1 are 0000...;
> Couldn't we also tell which hash an object is by memcmp-ing those last N
> bits and see if they're all zero'd?

That makes a lot of assumptions about the security of the hash
algorithm that I don't want to make here.  If anyone can ever find a
SHA-256 hash with trailing 96 bits zero, then they can confuse it with a
SHA-1 hash.  That means that our security level goes from 128 bits to 96
bits.  It's also a nonstandard construction.
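
Back-of-envelope (treating SHA-256 as an ideal hash):

    \Pr[\text{low 96 bits} = 0] = 2^{-96}
    \quad\Rightarrow\quad
    \text{expected work} \approx 2^{96} \text{ evaluations} \ll 2^{128}

which is why I say the effective level drops from 128 bits to 96 bits.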

More importantly, it results in the null OID being treated as a SHA-1
OID.  Because we do print the null OID in some cases, we're going to
break a lot of output formats if we print all the rest of the OIDs with
64 characters and then the null OID with 40.  That's to say nothing of
the problems in binary formats.

The reason we pad these objects is that our hashmaps break if we
don't.  I don't remember all the gory details, but it was obvious to me
that if they weren't consistently equal, things were going to be broken.
That's the only reason, not theoretical purity.
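
As a minimal illustration of the kind of breakage I mean (a sketch under
the assumption that equality is a fixed-width comparison over the whole
buffer, not the actual implementation):

    static inline int oideq_fixed_width(const struct object_id *a,
                                        const struct object_id *b)
    {
            /* Compares all GIT_MAX_RAWSZ bytes, not the_hash_algo->rawsz, so
             * two logically equal SHA-1 OIDs with uninitialized garbage in
             * the unused tail would compare unequal, and any hashmap keyed
             * on OIDs would miss entries unless we zero-pad. */
            return !memcmp(a->hash, b->hash, GIT_MAX_RAWSZ);
    }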

> Feels a bit hackish, and we'd need to reconsider that method if we'd
> ever support other same-length hashes.

My hope is that we don't need to do this, but we do have SHA-3 to serve
as a backup for SHA-2.  If quantum computers don't progress
substantially, SHA-3-256 is definitely a viable candidate for
replacement if anything ever happens to SHA-256.

> But OTOH, having these objects all padded out in memory to the same
> length, while still having to carry around a "what hash algo is it"
> member, yields the arguably weird hack of a per-hash NULL_OID, which has
> never been an actual object of any hash type, but just a pseudo-object.

Unfortunately, as I mentioned above, we need to have two null OIDs to
handle printing things out.  It's inconvenient, I agree.

> I abandoned it as insane silliness after playing with it for about a
> day, but it did reveal that much of the hash code can now assume
> internal length == formatting length, which is why I'm 3 paragraphs into
> this digression, i.e. maybe some of the code structure also makes having
> a NULL_OID always be 256 bits when we want to format it as 160/256
> painful...

We'll always format based on the algorithm in the OID.  That's the
simplest way to make things work because unfortunately we may end up
with both types of OIDs in the same code paths (as we're converting one
to the other) and otherwise our printing functions need a lot of special
handling and even more variants than they have today.
Derrick Stolee April 13, 2021, 12:12 p.m. UTC | #3
On 4/10/2021 11:21 AM, brian m. carlson wrote:
> Now that we're working with multiple hash algorithms in the same repo,
> it's best if we label each object ID with its algorithm so we can
> determine how to format a given object ID. Add a member called algo to
> struct object_id.
> 
> Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
> ---
>  hash.h | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/hash.h b/hash.h
> index 3fb0c3d400..dafdcb3335 100644
> --- a/hash.h
> +++ b/hash.h
> @@ -181,6 +181,7 @@ static inline int hash_algo_by_ptr(const struct git_hash_algo *p)
>  
>  struct object_id {
>  	unsigned char hash[GIT_MAX_RAWSZ];
> +	int algo;
>  };

What are the performance implications of adding this single bit
(that actually costs us 4 to 8 bytes, based on alignment)? Later
in the series you add longer hash comparisons, too. These seem
like they will affect performance for existing SHA-1 repos, and
it would be nice to know how much we are paying for this support.

I assume that we already checked what happened when GIT_MAX_RAWSZ
increased, but that seemed worth the cost so we could have SHA-256
at all. I find the justification for this interoperability mode to
be less significant, and potentially adding too much of a tax onto
both SHA-1 repos that will never upgrade, and SHA-256 repos that
upgrade all at once (or start as SHA-256).

Of course, if there truly is no serious performance implication to
this change, then I support following the transition plan and
allowing us to be flexible on timelines for interoperability. It
just seems like we need to investigate what this will cost.

Thanks,
-Stolee
brian m. carlson April 14, 2021, 1:08 a.m. UTC | #4
On 2021-04-13 at 12:12:21, Derrick Stolee wrote:
> On 4/10/2021 11:21 AM, brian m. carlson wrote:
> > Now that we're working with multiple hash algorithms in the same repo,
> > it's best if we label each object ID with its algorithm so we can
> > determine how to format a given object ID. Add a member called algo to
> > struct object_id.
> > 
> > Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
> > ---
> >  hash.h | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/hash.h b/hash.h
> > index 3fb0c3d400..dafdcb3335 100644
> > --- a/hash.h
> > +++ b/hash.h
> > @@ -181,6 +181,7 @@ static inline int hash_algo_by_ptr(const struct git_hash_algo *p)
> >  
> >  struct object_id {
> >  	unsigned char hash[GIT_MAX_RAWSZ];
> > +	int algo;
> >  };
> 
> What are the performance implications of adding this single bit
> (that actually costs us 4 to 8 bytes, based on alignment)? Later
> in the series you add longer hash comparisons, too. These seem
> like they will affect performance for existing SHA-1 repos, and
> it would be nice to know how much we are paying for this support.

I will do some performance numbers on these patches, but it will likely
be the weekend before I can get to it.  I think this will add 4 bytes on
most platforms, since int is typically 32 bits, and the alignment
requirement would be for the most strictly aligned member, which is the
int, so a 4-byte alignment.  I don't think the alignment requirements
are especially onerous here.
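
For concreteness, a standalone toy along these lines (my assumption
about typical LP64 platforms; I haven't measured every target):

    #include <stdio.h>

    #define GIT_MAX_RAWSZ 32  /* mirrors hash.h for this standalone sketch */

    struct oid_old { unsigned char hash[GIT_MAX_RAWSZ]; };
    struct oid_new { unsigned char hash[GIT_MAX_RAWSZ]; int algo; };

    int main(void)
    {
            /* Typically prints 32 and 36: the int adds 4 bytes and raises the
             * struct's alignment to 4, with no extra tail padding because 32
             * is already a multiple of 4. */
            printf("old: size %zu, align %zu\n",
                   sizeof(struct oid_old), _Alignof(struct oid_old));
            printf("new: size %zu, align %zu\n",
                   sizeof(struct oid_new), _Alignof(struct oid_new));
            return 0;
    }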

> I assume that we already checked what happened when GIT_MAX_RAWSZ
> increased, but that seemed worth the cost so we could have SHA-256
> at all. I find the justification for this interoperability mode to
> be less significant, and potentially adding too much of a tax onto
> both SHA-1 repos that will never upgrade, and SHA-256 repos that
> upgrade all at once (or start as SHA-256).

The entire goal of the interoperability is to let people seamlessly and
transparently move from SHA-1 to SHA-256.  Currently, the only way
people can move a SHA-1 repository to a SHA-256 repository is with
fast-import and fast-export, which loses all digital signatures and tags
to blobs.  This also requires a flag day.

SHA-1 can now be attacked for USD 45,000.  That means it is within the
budget of a dedicated professional and virtually all medium or large
corporations, including even most municipal governments, to create a
SHA-1 collision.  Unfortunately, the way we deal with this is to die, so
as soon as this happens, the repository fails closed.  While an attacker
cannot make use of the collisions to spread malicious objects, because
of the way Git works, they can effectively DoS a repository, which is in
itself a security issue.  Fixing this requires major surgery.

We need the interoperability code to let people transition their
repositories away from SHA-1, even if it has some performance impact,
because without that most SHA-1 repositories will never transition.
That's what's outlined in the transition plan, and why that approach was
proposed, even though it would be nicer to avoid having to implement it
at all.

I will endeavor to make the performance impact as small as possible, of
course, and ideally there will be none.  I am sensitive to the fact that
people do run absurdly large workloads on Git, as we both know, and I do
want to support that.
Ævar Arnfjörð Bjarmason April 15, 2021, 8:47 a.m. UTC | #5
On Wed, Apr 14 2021, brian m. carlson wrote:

> On 2021-04-13 at 12:12:21, Derrick Stolee wrote:
>> On 4/10/2021 11:21 AM, brian m. carlson wrote:
>> > Now that we're working with multiple hash algorithms in the same repo,
>> > it's best if we label each object ID with its algorithm so we can
>> > determine how to format a given object ID. Add a member called algo to
>> > struct object_id.
>> > 
>> > Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
>> > ---
>> >  hash.h | 1 +
>> >  1 file changed, 1 insertion(+)
>> > 
>> > diff --git a/hash.h b/hash.h
>> > index 3fb0c3d400..dafdcb3335 100644
>> > --- a/hash.h
>> > +++ b/hash.h
>> > @@ -181,6 +181,7 @@ static inline int hash_algo_by_ptr(const struct git_hash_algo *p)
>> >  
>> >  struct object_id {
>> >  	unsigned char hash[GIT_MAX_RAWSZ];
>> > +	int algo;
>> >  };
>> 
>> What are the performance implications of adding this single bit
>> (that actually costs us 4 to 8 bytes, based on alignment)? Later
>> in the series you add longer hash comparisons, too. These seem
>> like they will affect performance for existing SHA-1 repos, and
>> it would be nice to know how much we are paying for this support.
>
> I will do some performance numbers on these patches, but it will likely
> be the weekend before I can get to it.  I think this will add 4 bytes on
> most platforms, since int is typically 32 bits, and the alignment
> requirement would be for the most strictly aligned member, which is the
> int, so a 4-byte alignment.  I don't think the alignment requirements
> are especially onerous here.

I think if you're doing such a perf test, one comparing SHA-1 mode with
a SHA-1-length OID vs. SHA-256 (the current behavior) would be
interesting as well.

It seems like good SHA-1s to test that with are 5a0cc8aca79 and
13eeedb5d17. Running:

    GIT_PERF_REPEAT_COUNT=10 \
    GIT_SKIP_TESTS="p0001.[3-9] p1450.2" \
    GIT_TEST_OPTS= GIT_PERF_MAKE_OPTS='-j8 CFLAGS=-O3' \
    ./run 5a0cc8aca79 13eeedb5d17 -- p0001-rev-list.sh p1450-fsck.sh

(I added a fsck --connectivity-only test)

Gives us:

    Test                               5a0cc8aca79         13eeedb5d17            
    ------------------------------------------------------------------------------
    0001.1: rev-list --all             2.46(2.22+0.14)     2.48(2.25+0.14) +0.8%  
    0001.2: rev-list --all --objects   10.79(10.22+0.14)   10.92(10.24+0.20) +1.2%
    1450.1: fsck --connectivity-only   16.61(15.42+0.34)   16.94(15.60+0.32) +2.0%

So at least on my box none of those are outside of the confidence
intervals. This was against my copy of git.git. Perhaps it matters more
under memory pressure.

>> I assume that we already checked what happened when GIT_MAX_RAWSZ
>> increased, but that seemed worth the cost so we could have SHA-256
>> at all. I find the justification for this interoperability mode to
>> be less significant, and potentially adding too much of a tax onto
>> both SHA-1 repos that will never upgrade, and SHA-256 repos that
>> upgrade all at once (or start as SHA-256).
>
> The entire goal of the interoperability is to let people seamlessly and
> transparently move from SHA-1 to SHA-256.  Currently, the only way
> people can move a SHA-1 repository to a SHA-256 repository is with
> fast-import and fast-export, which loses all digital signatures and tags
> to blobs.  This also requires a flag day.
>
> SHA-1 can now be attacked for USD 45,000.  That means it is within the
> budget of a dedicated professional and virtually all medium or large
> corporations, including even most municipal governments, to create a
> SHA-1 collision.

Is that for vanilla SHA-1, or SHA-1DC?

> Unfortunately, the way we deal with this is to die, so
> as soon as this happens, the repository fails closed.  While an attacker
> cannot make use of the collisions to spread malicious objects, because
> of the way Git works, they can effectively DoS a repository, which is in
> itself a security issue.  Fixing this requires major surgery.

Can you elaborate on this? I believe that we die on any known collision
via the SHA1-DC code, and even if we didn't have that we'd detect the
collision (see [1] for the code) and die while the object is in the
temporary quarantine.

I believe such a request is cheaper to serve than one that doesn't
upload colliding objects: we die earlier (less CPU etc.), and don't add
objects to the store.

So what's the DoS vector?

> We need the interoperability code to let people transition their
> repositories away from SHA-1, even if it has some performance impact,
> because without that most SHA-1 repositories will never transition.
> That's what's outlined in the transition plan, and why that approach was
> proposed, even though it would be nicer to avoid having to implement it
> at all.

There's no question that we need working interop.

The question at least in my mind is why that interop is happening by
annotating every object held in memory with whether they're SHA-1 or
SHA-256, as opposed to having some translation layer earlier in the
chain.

Not all our file or in-memory structures are like that, e.g. the
commit graph has a header saying "this is a bunch of SHA-1/256", and the
objects that follow are stored at that actual hash size, not the max
size we know about.

My understanding of the transition plan was that we'd e.g. have a
SHA-1<->SHA-256 mapping of objects, which we'd say use to push/pull.

But that by the time I ran say a "git commit" none of that machinery
needed to care that I was interoping with a SHA-1 repo on the other end.
It would just happily have all SHA-256 objects, create new ones, and
only by the time I needed to push them would something kick in to
re-hash them.

I *think* the answer is just "it's easier on the C-level" and "the
wastage doesn't seem to matter much", which is fair enough.

*Goes and digs in the ML archive*:

    https://lore.kernel.org/git/1399147942-165308-1-git-send-email-sandals@crustytoothpaste.net/#t
    https://lore.kernel.org/git/55016A3A.6010100@alum.mit.edu/

To answer (some) of that myself:

Digging up some of the initial discussion, that seems to be the case;
at that point there was a suggestion of:

    struct object_id {
        unsigned char hash_type;
        union {
            unsigned char sha1[GIT_SHA1_RAWSZ];
            unsigned char sha256[GIT_SHA256_RAWSZ];
        } hash;
    };

To which you replied:
    
    What I think might be more beneficial is to make the hash function
    specific to a particular repository, and therefore you could maintain
    something like this:
    
      struct object_id {
          unsigned char hash[GIT_MAX_RAWSZ];
      };

It wouldn't matter for the memory use to make it a union, but my reading
of the above is that the reason for the current object_id not-a-union
structure might not be valid now that there's a "hash_type" member, no?

> I will endeavor to make the performance impact as small as possible, of
> course, and ideally there will be none.  I am sensitive to the fact that
> people do run absurdly large workloads on Git, as we both know, and I do
> want to support that.

All of the above being said, I do wonder whether, for those who worry
about hash size inflating their object store, a more sensible end-goal
wouldn't be to store abbreviated hashes.

As long as you'd re-hash/inflate the size in the case of collisions
(which would be a more expensive check at the "fetch" boundary) you
could do so safely, and the result would be less memory consumption.

But maybe it's just a non-issue :)

1. https://lore.kernel.org/git/20181113201910.11518-1-avarab@gmail.com/
brian m. carlson April 15, 2021, 11:51 p.m. UTC | #6
On 2021-04-15 at 08:47:00, Ævar Arnfjörð Bjarmason wrote:
> 
> On Wed, Apr 14 2021, brian m. carlson wrote:
> > I will do some performance numbers on these patches, but it will likely
> > be the weekend before I can get to it.  I think this will add 4 bytes on
> > most platforms, since int is typically 32 bits, and the alignment
> > requirement would be for the most strictly aligned member, which is the
> > int, so a 4-byte alignment.  I don't think the alignment requirements
> > are especially onerous here.
> 
> I think if you're doing such a perf test, one comparing SHA-1 mode with
> a SHA-1-length OID vs. SHA-256 (the current behavior) would be
> interesting as well.
> 
> It seems like good SHA-1s to test that with are 5a0cc8aca79 and
> 13eeedb5d17. Running:
> 
>     GIT_PERF_REPEAT_COUNT=10 \
>     GIT_SKIP_TESTS="p0001.[3-9] p1450.2" \
>     GIT_TEST_OPTS= GIT_PERF_MAKE_OPTS='-j8 CFLAGS=-O3' \
>     ./run 5a0cc8aca79 13eeedb5d17 -- p0001-rev-list.sh p1450-fsck.sh
> 
> (I added a fsck --connectivity-only test)
> 
> Gives us:
> 
>     Test                               5a0cc8aca79         13eeedb5d17
>     ------------------------------------------------------------------------------
>     0001.1: rev-list --all             2.46(2.22+0.14)     2.48(2.25+0.14) +0.8%
>     0001.2: rev-list --all --objects   10.79(10.22+0.14)   10.92(10.24+0.20) +1.2%
>     1450.1: fsck --connectivity-only   16.61(15.42+0.34)   16.94(15.60+0.32) +2.0%
> 
> So at least on my box none of those are outside of the confidence
> intervals. This was against my copy of git.git. Perhaps it matters more
> under memory pressure.

I do plan to take a deeper look at this over the weekend and do some
performance numbers, and I think these are great examples to use.  I
just have a limited amount of time most weeknights because, among other
things, I am taking French a couple of nights a week.

I talked with Stolee today about this approach and the desire for
performance, so I think we're on the same page about trying to make this
as fast as possible.  I plan to try a couple alternative solutions which
may work as well (or at least, I will make notes this time about why
they didn't work) and be less impactful.

> >> I assume that we already checked what happened when GIT_MAX_RAWSZ
> >> increased, but that seemed worth the cost so we could have SHA-256
> >> at all. I find the justification for this interoperability mode to
> >> be less significant, and potentially adding too much of a tax onto
> >> both SHA-1 repos that will never upgrade, and SHA-256 repos that
> >> upgrade all at once (or start as SHA-256).
> >
> > The entire goal of the interoperability is to let people seamlessly and
> > transparently move from SHA-1 to SHA-256.  Currently, the only way
> > people can move a SHA-1 repository to a SHA-256 repository is with
> > fast-import and fast-export, which loses all digital signatures and tags
> > to blobs.  This also requires a flag day.
> >
> > SHA-1 can now be attacked for USD 45,000.  That means it is within the
> > budget of a dedicated professional and virtually all medium or large
> > corporations, including even most municipal governments, to create a
> > SHA-1 collision.
> 
> Is that for vanilla SHA-1, or SHA-1DC?

Well, for SHA-1 in general.  SHA-1DC doesn't do anything except detect
collisions.  People can still create collisions, but we detect them.

> > Unfortunately, the way we deal with this is to die, so
> > as soon as this happens, the repository fails closed.  While an attacker
> > cannot make use of the collisions to spread malicious objects, because
> > of the way Git works, they can effectively DoS a repository, which is in
> > itself a security issue.  Fixing this requires major surgery.
> 
> Can you elaborate on this? I believe that we die on any known collision
> via the SHA1-DC code, and even if we didn't have that we'd detect the
> collision (see [1] for the code) and die while the object is in the
> temporary quarantine.
> 
> I believe such a request is cheaper to serve than one that doesn't
> upload colliding objects, we die earlier (less CPU etc.), and don't add
> objects to the store.
> 
> So what's the DoS vector?

That assumes that the server is using SHA-1DC and that the object can't
be uploaded any way except a push where its hash is checked.  Those are
true for many, but not all, hosting providers.  For example, anyone
using Git in a FIPS-validated environment can only use FIPS-validated
crypto, and I'm not aware of any SHA-1DC implementations that are.
Also, some implementations like Dulwich don't use SHA-1DC.

Once someone can get that object to a standard Git which does use
SHA-1DC, then the repository is pretty much hosed.  I can probably make
this better by just dropping the non-SHA-1DC mode here and in libgit2,
to at least disincentivize poor choices in the most common
implementations.

> > We need the interoperability code to let people transition their
> > repositories away from SHA-1, even if it has some performance impact,
> > because without that most SHA-1 repositories will never transition.
> > That's what's outlined in the transition plan, and why that approach was
> > proposed, even though it would be nicer to avoid having to implement it
> > at all.
> 
> There's no question that we need working interop.
> 
> The question at least in my mind is why that interop is happening by
> annotating every object held in memory with whether they're SHA-1 or
> SHA-256, as opposed to having some translation layer earlier in the
> chain.

This is a good question.  Let me provide an example.

When we speak the remote protocol with a remote system, we'll parse out
several object IDs of the appropriate algorithm.  We will then pass
those around to various places in our transport code.  It makes it a lot
easier if we can just run every object ID through an inline mapping
function which immediately does nothing if the object ID is already of
the appropriate type, rather than adding additional code to keep track
of the current algorithm being used in the transport code.
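
Roughly what I have in mind, as a sketch (the helper name
map_oid_to_algo and its exact signature are hypothetical, not the real
interop API):

    static inline void normalize_oid(struct repository *r, struct object_id *oid)
    {
            /* Common case is a cheap inline no-op: the OID is already in the
             * repository's algorithm, so callers never track the transport's
             * algorithm themselves. */
            if (oid->algo == hash_algo_by_ptr(r->hash_algo))
                    return;
            /* hypothetical translation layer */
            map_oid_to_algo(r, oid, r->hash_algo);
    }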

It also makes it a lot easier when we let people store data in SHA-256
and then print things in SHA-1 for compatibility if we can add an
oid_to_hex_output and just map every object automatically on output,
regardless of where it came from, without needing to keep track of what
algorithm it's in.  For example, think about situations where the user
may have passed in a SHA-1 object ID and we reuse that value instead of
reading a SHA-256 object from the store.

So it's not completely impossible to avoid a hash algorithm member, but
it does significantly complicate our code not to do it.

> Not all our file or in-memory structures are like that, e.g. the
> commit graph has a header saying "this is a bunch of SHA-1/256", and the
> objects that follow are stored at that actual hash size, not the max
> size we know about.
> 
> My understanding of the transition plan was that we'd e.g. have a
> SHA-1<->SHA-256 mapping of objects, which we'd say use to push/pull.

Correct.  That code exists and mostly works.  There are still a lot of
failing tests, but I have a pack index v3 that stores that data and a
loose object store.

> But that by the time I ran say a "git commit" none of that machinery
> needed to care that I was interoping with a SHA-1 repo on the other end.
> It would just happily have all SHA-256 objects, create new ones, and
> only by the time I needed to push them would something kick in to
> re-hash them.
> 
> I *think* the answer is just "it's easier on the C-level" and "the
> wastage doesn't seem to matter much", which is fair enough.

I think that's accurate.

> *Goes and digs in the ML archive*:
> 
>     https://lore.kernel.org/git/1399147942-165308-1-git-send-email-sandals@crustytoothpaste.net/#t
>     https://lore.kernel.org/git/55016A3A.6010100@alum.mit.edu/
> 
> To answer (some) of that myself:
> 
> Digging up some of the initial discussion, that seems to be the case;
> at that point there was a suggestion of:
> 
>     struct object_id {
>         unsigned char hash_type;
>         union {
>             unsigned char sha1[GIT_SHA1_RAWSZ];
>             unsigned char sha256[GIT_SHA256_RAWSZ];
>         } hash;
>     };
> 
> To which you replied:
> 
>     What I think might be more beneficial is to make the hash function
>     specific to a particular repository, and therefore you could maintain
>     something like this:
> 
>       struct object_id {
>           unsigned char hash[GIT_MAX_RAWSZ];
>       };
> 
> It wouldn't matter for the memory use to make it a union, but my reading
> of the above is that the reason for the current object_id not-a-union
> structure might not be valid now that there's a "hash_type" member, no?

Probably at that time we didn't have the formal transition plan and
didn't anticipate interoperability as a concern.  I do think that using
a single hash member instead of a union makes things easier because we
generally don't want to look up two different members in cases like
printing a hex OID.  We ultimately just want to print the right number
of bytes from that data, and the union just makes things trickier with
the compiler when we do that.
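
As a sketch of what I mean (simplified; the real oid_to_hex() differs
in detail):

    static const char *oid_to_hex_sketch(const struct object_id *oid)
    {
            /* With a flat hash[] plus an algo index, formatting just looks up
             * the length in the algorithm table and emits that many bytes; a
             * union buys nothing here and only complicates which member to
             * read. */
            static char buf[2 * GIT_MAX_RAWSZ + 1];
            const struct git_hash_algo *algop = &hash_algos[oid->algo];
            size_t i;

            for (i = 0; i < algop->rawsz; i++)
                    snprintf(buf + 2 * i, 3, "%02x", oid->hash[i]);
            return buf;
    }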

Patch

diff --git a/hash.h b/hash.h
index 3fb0c3d400..dafdcb3335 100644
--- a/hash.h
+++ b/hash.h
@@ -181,6 +181,7 @@ static inline int hash_algo_by_ptr(const struct git_hash_algo *p)
 
 struct object_id {
 	unsigned char hash[GIT_MAX_RAWSZ];
+	int algo;
 };
 
 #define the_hash_algo the_repository->hash_algo